AI Models Exposed: They Copy Numbers Instead of Solving Problems
In brief
- Recent research reveals that small language models, even when using chain-of-thought prompting, often rely on copying numbers from earlier steps rather than performing genuine arithmetic.
- This shortcut significantly affects their accuracy: incorrect answers occur 54-92% less often when the correct number is available in the preceding text.
- For example, if a wrong number precedes the answer delimiter, accuracy plummets to near-zero, despite correct intermediate reasoning.
- The study highlights that this copying behavior varies by model architecture: Qwen and Llama copy distractors up to 95% of the time, while Gemma is more selective.
- Larger models (7-8B parameters) show improved content-selective gating, reducing reliance on positional shortcuts.
- This finding challenges assumptions about AI reasoning abilities and underscores limitations in current oversight methods.
- Moving forward, researchers will likely focus on improving model architectures to reduce reliance on copying and enhance genuine computation.
- Developers should also consider refining evaluation metrics to better assess AI reasoning without conflating shortcuts with actual problem-solving skills.
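The distractor test described in the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the `####` delimiter (GSM8K-style), the function names, and the labels "correct", "copied", and "other" are all assumptions made for the example, and the model call itself is left out.

```python
import re

# Assumption: a GSM8K-style "####" answer delimiter; the brief does not name one.
ANSWER_DELIMITER = "####"


def inject_distractor(trace: str, distractor: int,
                      delimiter: str = ANSWER_DELIMITER) -> str:
    """Drop the original answer and place a distractor number
    immediately before the answer delimiter (hypothetical helper)."""
    reasoning, sep, _ = trace.partition(delimiter)
    if not sep:
        raise ValueError("trace has no answer delimiter")
    return f"{reasoning.rstrip()} {distractor}\n{delimiter}"


def classify(completion: str, correct: int, distractor: int) -> str:
    """Label a model completion by the first number it contains."""
    match = re.search(r"-?\d+(?:\.\d+)?", completion)
    if match is None:
        return "other"
    value = float(match.group())
    if value == correct:
        return "correct"
    if value == distractor:
        return "copied"
    return "other"


def copy_rate(labels: list[str]) -> float:
    """Fraction of completions that echoed the distractor."""
    return sum(label == "copied" for label in labels) / len(labels) if labels else 0.0
```

In use, each prompt from `inject_distractor` would be sent to a model, each completion labeled with `classify`, and the labels summarized with `copy_rate`. A high copy rate alongside correct intermediate steps would be the pattern the brief describes.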
Terms in this brief
- chain-of-thought prompting
- A prompting method in which an AI model writes out intermediate reasoning steps before giving its final answer. The approach aims to improve accuracy on multi-step problems and to make the model's path to an answer easier to inspect.
- Qwen
- A family of open-weight large language models developed by Alibaba Cloud. In the study summarized here, Qwen models were among those that copied distractor numbers most often, up to 95% of the time.
Read full story at Hugging Face Blog, arXiv cs.LG
More briefs
Researchers Find AI Coding Agents Struggle with Complex Constraints
New research shows that AI coding agents perform poorly when generating code with strict structural constraints. The agents were tested on 80 tasks across eight web frameworks and their performance declined significantly as the number of constraints increased. Capable agents lost 30 points on average in assertion pass rates from simple to complex tasks. The study highlights a key challenge for AI coding agents, which is to satisfy both functional and structural requirements, and this issue will need to be addressed in future developments.
AI Reasoning Just Got Smarter - And Much More Efficient
AI researchers have found that chain-of-thought (CoT) reasoning, once seen as a major leap forward for large language models (LLMs), often doesn't deliver the expected benefits. Instead of always improving results, CoT can hurt performance on certain tasks and waste computing resources by using more tokens. The researchers say this isn't a fixed trait of the model or the task: it is a dynamic process that unfolds during generation. Through detailed analysis, they found that early-stage entropy patterns in the models can reliably show when CoT is useful. When tasks benefit from CoT, there’s a clear drop in entropy, which indicates a shift to structured reasoning. For other tasks, entropy remains unstable or even increases. This finding led to the creation of EDRM (Entropy Dynamics-based Reasoning Manifold), a routing framework that uses these entropy patterns to decide when and how to apply CoT during inference. EDRM is lightweight and easy to deploy without needing extensive training data. Across 15 benchmarks and various LLMs, it consistently outperformed static methods, reducing token usage by up to 45% while improving accuracy in some cases. This suggests that AI reasoning should be used selectively rather than automatically, opening the door to more efficient and adaptive AI systems. Watch for EDRM being adopted in real-world applications soon; this could change how we interact with AI.
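The entropy signal described above can be illustrated with a small sketch. This is not EDRM itself, whose details the brief does not give. It only computes the Shannon entropy of next-token probability distributions over the first few generation steps and applies a simple threshold rule. The function names and the `drop_threshold` value are assumptions made for the example.

```python
import math


def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def entropy_trend(step_distributions: list[list[float]]) -> float:
    """Change in entropy from the first to the last observed step.
    A negative value means entropy dropped."""
    if len(step_distributions) < 2:
        return 0.0
    entropies = [token_entropy(p) for p in step_distributions]
    return entropies[-1] - entropies[0]


def should_use_cot(step_distributions: list[list[float]],
                   drop_threshold: float = 0.1) -> bool:
    """Hypothetical routing rule: keep chain-of-thought only if
    early entropy falls by more than drop_threshold."""
    return entropy_trend(step_distributions) < -drop_threshold
```

A real router would read these distributions from the model's logits during decoding. The threshold rule here is only a stand-in for whatever decision procedure the framework actually uses.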
AI Models Learn to Self-Generate Tasks for Better Reasoning
Researchers have developed a new method called PopuLoRA, which enables AI language models to self-generate and adapt tasks during training. This approach allows models to create their own challenges, such as predicting code outputs or finding inputs that match target results. Unlike fixed tasks, PopuLoRA lets the models evolve these tasks in real-time, keeping the difficulty level just right for continuous improvement. The key innovation is "single-agent self-play," where one model both generates and solves tasks. In initial tests, this method showed promising results: tasks became more complex and varied over time, leading to better problem-solving skills. However, challenges remain, like ensuring tasks stay challenging enough without becoming too simple. Looking ahead, PopuLoRA could revolutionize how AI models learn, making them more adaptable and capable of handling real-world problems that require sophisticated reasoning.
AI Fails to Accurately Grade University Essays
AI systems failed to match human grades for university essays about half the time. This was found in a study where three AI models graded over 750 student essays. The AI systems had trouble with the best and worst essays. They gave higher marks for longer essays with complex sentences. This means AI graded style over substance. Only 35 to 65 percent of the time did AI match human grades. AI may be useful for checking errors or giving feedback to students. But it is not good enough to give final grades. AI will continue to change how universities grade student work.
AI Solves 80-Year-Old Math Conjecture
An artificial intelligence model has solved the planar unit distance problem, a math puzzle that has gone unsolved for 80 years. The problem is about how many equal-sized lines can be drawn between dots on an infinite sheet of paper. A mathematician named Paul Erdős thought the answer was a grid pattern. But the AI model found a different arrangement of points that yields a much greater number of connections. This is a big deal because it shows AI can do complex math. The AI model used a technique from algebraic number theory to solve the problem. This breakthrough is being hailed as a major moment for AI's mathematical ability. The AI will likely solve more math problems in the future.