latentbrief
Research · 1w ago

AI Struggles to Reproduce Physics Experiments

LessWrong

In brief

  • A new study from Peking University reveals that large language models (LLMs) fail entirely at reproducing numerical results from experimental physics papers, with an end-to-end callback rate of 0%.
  • While the LLMs excelled at understanding the methodology described in the papers, they consistently made errors during data analysis and numerical simulations.
    • This suggests that while they can comprehend theoretical concepts, their ability to translate these into practical code is lacking.
  • The study highlights a critical gap between theoretical knowledge and implementation.
  • Numerical simulation requires not just coding skills but also an understanding of physical principles to apply the correct methods.
  • The best-performing model, OpenAI Codex with GPT-5.3, scored only 34% overall on the reproduction tasks.
    • This indicates that while LLMs may appear competent by regurgitating text, their practical application in scientific research remains limited.
  • Looking ahead, researchers will likely focus on improving how LLMs bridge the gap between theory and practice.
  • Understanding these limitations is crucial for determining when and how AI can be effectively integrated into scientific workflows.

Terms in this brief

callback rate
A measure of how often a system accurately reproduces the results it was tested on. In this context, an end-to-end callback rate of 0% means the LLMs never reproduced a paper's numerical results from start to finish.
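
A 0% end-to-end rate and a 34% overall score can coexist if the overall score gives partial credit for individual stages. The sketch below is a minimal illustration of that arithmetic only. The stage names, the sample outcomes, and both scoring functions are hypothetical and are not the study's actual rubric or data.

```python
from dataclasses import dataclass


@dataclass
class PaperAttempt:
    # Hypothetical per-stage outcomes for one paper. The stage names are
    # illustrative assumptions, not taken from the study.
    stages: dict[str, bool]


def end_to_end_rate(attempts: list[PaperAttempt]) -> float:
    """Fraction of papers in which every stage succeeded."""
    if not attempts:
        return 0.0
    fully_reproduced = sum(all(a.stages.values()) for a in attempts)
    return fully_reproduced / len(attempts)


def overall_score(attempts: list[PaperAttempt]) -> float:
    """Fraction of all individual stages that succeeded (partial credit)."""
    total = sum(len(a.stages) for a in attempts)
    passed = sum(sum(a.stages.values()) for a in attempts)
    return passed / total if total else 0.0


# Made-up outcomes: methodology is always understood, but a later stage
# fails in every paper, so no paper is reproduced end to end.
attempts = [
    PaperAttempt({"methodology": True, "simulation": False, "analysis": False}),
    PaperAttempt({"methodology": True, "simulation": False, "analysis": False}),
    PaperAttempt({"methodology": True, "simulation": True, "analysis": False}),
]

print(f"end-to-end rate: {end_to_end_rate(attempts):.0%}")
print(f"overall score:   {overall_score(attempts):.0%}")
```

With these made-up outcomes the end-to-end rate is zero because no single paper passes every stage, while the overall score stays above zero because the stages that did succeed still count.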

Read full story at LessWrong
