
AI Models Exposed: They Copy Numbers Instead of Solving Problems

Hugging Face Blog, arXiv CS.LG · May 23, 2026 · 1 min brief

In brief

Recent research reveals that small language models, even when using chain-of-thought prompting, often rely on copying numbers from earlier steps rather than performing genuine arithmetic.
This shortcut significantly affects their accuracy: incorrect answers occur 54-92% less often when the correct number is available to copy.
Conversely, if a wrong number precedes the answer delimiter, accuracy plummets to near zero, even when the intermediate reasoning is correct.
The study highlights that this copying behavior varies by model architecture: Qwen and Llama copy distractors up to 95% of the time, while Gemma is more selective.
Larger models (7-8B parameters) show improved content-selective gating: they choose what to copy based more on what a number is than on where it sits, reducing reliance on positional shortcuts.
This finding challenges assumptions about AI reasoning abilities and underscores limitations in current oversight methods.
Moving forward, researchers will likely focus on improving model architectures to reduce reliance on copying and enhance genuine computation.
Developers should also consider refining evaluation metrics to better assess AI reasoning without conflating shortcuts with actual problem-solving skills.
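
The kind of test described above can be sketched in a few lines of Python. This is an illustration only, not the study's code: the `####` delimiter (borrowed from GSM8K-style datasets), the function names `build_probe`, `classify`, and `copy_rate`, and the sample completions are all assumptions made for the example. The idea is to place a wrong number (a distractor) just before the answer delimiter, then count how often the model's answer matches the distractor rather than the correct result.

```python
import re

# Assumed GSM8K-style answer delimiter; the brief does not name the one used.
ANSWER_DELIMITER = "####"


def build_probe(reasoning_steps, distractor):
    """Return a prompt whose last line before the delimiter holds a distractor.

    The reasoning steps are kept intact, so any drop in accuracy comes from
    the injected number rather than from broken intermediate reasoning.
    """
    lines = list(reasoning_steps)
    lines.append(f"Note: {distractor}")  # hypothetical distractor line
    lines.append(ANSWER_DELIMITER)
    return "\n".join(lines) + " "


def classify(completion, correct, distractor):
    """Label a completion by the first integer it contains."""
    match = re.search(r"-?\d+", completion)
    if match is None:
        return "other"
    value = int(match.group())
    if value == correct:
        return "correct"
    if value == distractor:
        return "copied"
    return "other"


def copy_rate(labels):
    """Fraction of completions that reproduced the distractor."""
    return labels.count("copied") / len(labels) if labels else 0.0


if __name__ == "__main__":
    steps = ["Tom has 3 boxes with 4 apples each.", "3 * 4 = 12"]
    prompt = build_probe(steps, distractor=17)
    # Hypothetical model outputs; a real run would send `prompt` to a model.
    completions = ["17", "12", "17", " 17.", "no idea"]
    labels = [classify(c, correct=12, distractor=17) for c in completions]
    print(copy_rate(labels))  # 0.6
```

The distractor should differ from the correct answer, otherwise copying and computing cannot be told apart. A high copy rate alongside correct intermediate steps is the pattern the brief describes.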

Terms in this brief

chain-of-thought prompting: A prompting technique in which a model is asked to write out intermediate reasoning steps before giving its final answer. The approach aims to improve accuracy on multi-step problems and to make the path to an answer easier to inspect, although the written steps do not guarantee that the model actually computed the result that way.
Qwen: A family of large language models developed by Alibaba Cloud. In the research summarized here, Qwen was among the models most prone to copying, reproducing distractor numbers up to 95% of the time.
