AI Models Often Give Right Answers but Point to Wrong Sources
In brief
- Leading AI models like GPT and Gemini have been found to cite incorrect text passages in their analyses, even when their answers are correct.
- This issue, called "attribution hallucination," poses risks in fields like law and medicine where accuracy is crucial.
- Researchers at Peking University developed the CiteVQA benchmark to systematically test for this problem.
- The finding points to a reliability flaw that matters most in regulated industries.
- If an AI provides accurate advice but cites the wrong sources, the error can have serious consequences in areas like legal decisions or medical diagnoses, where every claim must be traceable to its evidence.
- The CiteVQA benchmark aims to expose these errors so that models can be judged on whether their cited evidence, and not just their answers, holds up.
- Looking ahead, researchers hope this new tool will help improve the accuracy of AI systems by pinpointing where they go wrong in attribution.
- As AI becomes more integrated into critical decision-making processes, tools like CiteVQA will be essential for maintaining trust and reliability in their outputs.
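The distinction between a wrong answer and a wrong citation can be made concrete with a small check. The sketch below is only an illustration of the idea and is not CiteVQA's actual scoring method; the `Prediction` and `Reference` types, the passage IDs, and the exact-match comparison are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    answer: str
    cited_passages: set  # hypothetical IDs of passages the model pointed to


@dataclass
class Reference:
    answer: str
    evidence_passages: set  # hypothetical IDs of passages that support the answer


def classify(pred: Prediction, ref: Reference) -> str:
    """Score the answer and the citation separately, then combine them."""
    # Assumption: a simple case-insensitive exact match stands in for answer grading.
    answer_ok = pred.answer.strip().lower() == ref.answer.strip().lower()
    # A citation counts as correct only if at least one passage is cited
    # and every cited passage is part of the supporting evidence.
    citation_ok = bool(pred.cited_passages) and pred.cited_passages <= ref.evidence_passages
    if answer_ok and citation_ok:
        return "grounded"
    if answer_ok:
        return "attribution_hallucination"  # right answer, wrong source
    return "wrong_answer"
```

An evaluation that only checks `answer_ok` would count the second case as a success, which is why the two checks are kept separate here.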
Terms in this brief
- attribution hallucination
- A phenomenon where AI models provide correct answers but incorrectly cite their sources, potentially leading to serious issues in fields like law and medicine where accuracy is crucial.
- CiteVQA benchmark
- A testing framework developed by researchers at Peking University to detect incorrect source citations in AI systems, with the aim of making the evidence that models give alongside their answers more trustworthy.
Read full story at The Decoder →
More briefs
Scientist Runs Groundbreaking Computer Simulation
Mary Tsingou wrote the code for a first-of-its-kind numerical experiment in 1955 and ran it on one of the world's first scientific computers. The experiment involved a one-dimensional line of masses connected by springs, with a small nonlinear change to the spring force. The simulation took years to run and showed that energy flowed back into its original mode. The result revealed that nonlinear systems can behave in surprisingly stable and structured ways, a watershed moment that reshaped how scientists think about systems like the atmosphere and the human heart. The results continue to influence scientific research.
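The setup described above, a chain of masses joined by slightly nonlinear springs, can be sketched in a few lines. This is a minimal illustration and not a reconstruction of the original 1955 code: it assumes a quadratic nonlinear term in the spring force, fixed ends, unit masses and spring constants, and a velocity Verlet integrator, and the function names and parameter values are chosen for the example.

```python
import math


def forces(x, alpha):
    """Force on each mass; both ends of the chain are fixed walls."""
    padded = [0.0] + list(x) + [0.0]
    out = []
    for i in range(1, len(x) + 1):
        right = padded[i + 1] - padded[i]  # stretch of the spring to the right
        left = padded[i] - padded[i - 1]   # stretch of the spring to the left
        # Linear spring force plus a small quadratic (nonlinear) correction.
        out.append((right - left) + alpha * (right ** 2 - left ** 2))
    return out


def total_energy(x, v, alpha):
    """Kinetic energy plus the potential energy stored in every spring."""
    padded = [0.0] + list(x) + [0.0]
    kinetic = 0.5 * sum(vi * vi for vi in v)
    potential = 0.0
    for i in range(len(padded) - 1):
        d = padded[i + 1] - padded[i]
        potential += 0.5 * d * d + alpha * d ** 3 / 3.0
    return kinetic + potential


def initial_positions(n):
    """Start with all the energy in the lowest vibrational mode."""
    return [math.sin(math.pi * (i + 1) / (n + 1)) for i in range(n)]


def simulate(x, alpha=0.25, dt=0.05, steps=2000):
    """Advance the chain with velocity Verlet, starting from rest."""
    x = list(x)
    v = [0.0] * len(x)
    f = forces(x, alpha)
    for _ in range(steps):
        v = [vi + 0.5 * dt * fi for vi, fi in zip(v, f)]
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        f = forces(x, alpha)
        v = [vi + 0.5 * dt * fi for vi, fi in zip(v, f)]
    return x, v
```

A short run like this only shows the mechanics of the model. Observing energy return to the starting mode would require a longer chain, a much longer run, and a decomposition of the motion into modes, none of which is attempted here.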
AI's Personality Test Fails When Put to Work
A new study reveals that AI models trained to mimic specific personalities in chat conversations struggle when given real-world tasks. Researchers tested three major AI systems (Llama, Qwen, and Gemma) that had been given personalities through supervised fine-tuning (SFT). A classifier designed to identify each model's persona achieved high accuracy (86-95%) in controlled chat settings. However, when the same models were asked to act autonomously, for example by composing emails or making decisions, the classifier's accuracy dropped sharply to 29-55%, showing that the trained personalities don't carry over well beyond structured chat interactions. This suggests that SFT, a common training method for character-driven AI, may not prepare models for practical, agent-like tasks. The findings highlight the limitations of current personality-training techniques and emphasize the need for more generalized alignment methods. As AI becomes more integrated into daily life, understanding how these systems behave outside of controlled chats will be crucial for developers aiming to create reliable and versatile AI assistants.
Researchers Find AI Coding Agents Struggle with Complex Constraints
New research shows that AI coding agents perform poorly when generating code under strict structural constraints. The agents were tested on 80 tasks across eight web frameworks, and their performance declined significantly as the number of constraints increased. Capable agents lost 30 points on average in assertion pass rates from simple to complex tasks. The study highlights a key challenge for AI coding agents: satisfying functional and structural requirements at the same time, a gap that future work will need to address.
AI Reasoning Just Got Smarter - And Much More Efficient
AI researchers have discovered that chain-of-thought (CoT) reasoning, once seen as a major leap forward for large language models (LLMs), often doesn't deliver the expected benefits. Instead of always improving results, CoT can hurt performance on certain tasks and waste computing resources by using more tokens. The researchers argue that whether CoT helps is not a fixed trait of the model or the task; it is a dynamic process that plays out during generation. Through detailed analysis, they found that early-stage entropy patterns in the models can reliably show when CoT is useful. When tasks benefit from CoT, there is a clear drop in entropy, which indicates a shift to structured reasoning. For other tasks, entropy remains unstable or even increases. This finding led to the creation of EDRM (Entropy Dynamics-based Reasoning Manifold), a new routing framework that uses these entropy patterns to decide when and how to apply CoT during inference. EDRM is lightweight and easy to deploy without needing extensive training data. Across 15 different benchmarks and various LLMs, it consistently outperformed static methods, reducing token usage by up to 45% while improving accuracy in some cases. This suggests that AI reasoning should be used selectively rather than automatically, opening the door for more efficient and adaptive AI systems. Watch for EDRM being adopted in real-world applications, where selective reasoning could change how people interact with AI.
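The idea of routing on early entropy can be illustrated with a toy example. The sketch below is not EDRM itself, whose details the brief does not give; it assumes access to the model's next-token probability distributions for the first few generation steps, and the `should_use_cot` function and its `min_drop` threshold are hypothetical names and values chosen for illustration.

```python
import math


def shannon_entropy(probs):
    """Entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_use_cot(early_step_distributions, min_drop=0.1):
    """Return True if entropy clearly falls over the early generation steps.

    Assumption: a drop in mean entropy from the first half of the early
    steps to the second half is treated as the signal that structured
    reasoning has started and chain-of-thought is likely to help.
    """
    entropies = [shannon_entropy(d) for d in early_step_distributions]
    if len(entropies) < 2:
        return False  # not enough steps to see a trend
    half = len(entropies) // 2
    first = sum(entropies[:half]) / half
    second = sum(entropies[half:]) / (len(entropies) - half)
    return (first - second) >= min_drop
```

A router built this way would let the model continue with CoT when the function returns True and fall back to a direct answer otherwise. Comparing two halves of a short window is the simplest possible trend test; a real system would likely need something more robust to noisy entropy traces.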
AI Models Exposed: They Copy Numbers Instead of Solving Problems
Recent research reveals that small language models, even when using chain-of-thought prompting, often rely on copying numbers from earlier steps rather than performing genuine arithmetic. This shortcut significantly affects their accuracy: incorrect answers occur 54-92% less often when the correct number is available. For example, if a wrong number precedes the answer delimiter, accuracy plummets to near zero, despite correct intermediate reasoning. The study highlights that this copying behavior varies by model architecture: Qwen and Llama copy distractors up to 95% of the time, while Gemma is more selective. Larger models (7-8B) show improved content-selective gating, reducing reliance on positional shortcuts. This finding challenges assumptions about AI reasoning abilities and underscores limitations in current oversight methods. Moving forward, researchers will likely focus on improving model architectures to reduce reliance on copying and enhance genuine computation. Developers should also consider refining evaluation metrics to better assess AI reasoning without conflating shortcuts with actual problem-solving skills.