AI Struggles to Reproduce Physics Experiments

LessWrong, April 27, 2026

In brief

A new study from Peking University reveals that large language models (LLMs) fail entirely at reproducing numerical results from experimental physics papers, with an end-to-end callback rate of 0%.
While the LLMs excelled at understanding the methodology described in the papers, they consistently made errors during data analysis and numerical simulations.
- This suggests that while they can comprehend theoretical concepts, their ability to translate these into practical code is lacking.
The study highlights a critical gap between theoretical knowledge and implementation.
Numerical simulation requires not just coding skills but also an understanding of physical principles to apply the correct methods.
The best-performing model, OpenAI Codex with GPT-5.3, scored only 34% on overall reproduction tasks.
- This indicates that while LLMs may appear competent by regurgitating text, their practical application in scientific research remains limited.
Looking ahead, researchers will likely focus on improving how LLMs bridge the gap between theory and practice.
Understanding these limitations is crucial for determining when and how AI can be effectively integrated into scientific workflows.
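
As a rough illustration of the point that numerical simulation needs physical understanding as well as coding skill, here is a minimal sketch that is not taken from the study. It assumes a hypothetical test problem, a frictionless harmonic oscillator, and compares two integrators that differ only in update order. The function names, step size, and step count are all illustrative assumptions.

```python
# Hypothetical example: two integrators for x'' = -omega^2 * x.
# Both are easy to code; only one respects energy conservation.

def explicit_euler(x, v, dt, steps, omega=1.0):
    """Update x and v simultaneously from the old state."""
    for _ in range(steps):
        x, v = x + dt * v, v - dt * omega**2 * x
    return x, v

def symplectic_euler(x, v, dt, steps, omega=1.0):
    """Update v first, then use the new v to update x."""
    for _ in range(steps):
        v = v - dt * omega**2 * x
        x = x + dt * v
    return x, v

def energy(x, v, omega=1.0):
    """Total energy per unit mass: kinetic plus potential."""
    return 0.5 * v**2 + 0.5 * omega**2 * x**2

# Start at rest with unit displacement and integrate both schemes.
e0 = energy(1.0, 0.0)
e_euler = energy(*explicit_euler(1.0, 0.0, dt=0.01, steps=10000))
e_symp = energy(*symplectic_euler(1.0, 0.0, dt=0.01, steps=10000))

print("initial energy:         ", e0)
print("explicit Euler energy:  ", e_euler)
print("symplectic Euler energy:", e_symp)
```

With explicit Euler the energy is multiplied by (1 + dt^2 * omega^2) at every step, so it drifts steadily upward, while the symplectic variant keeps it close to the starting value. Both functions run without error, which is why choosing between them takes knowledge of the physics rather than coding skill alone.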

Terms in this brief

callback rate: A measure of how often a system can accurately reproduce or replicate tasks it was tested on. In this context, an end-to-end callback rate of 0% means the LLMs failed completely at reproducing physics experiments.
