Research3h ago

AI Models Fail Simple Health Tests

NatureJune 27, 20261 min brief

In brief

New research found that large language models failed simple stress tests in health applications.
- These models are used in medical research and can make mistakes with slight changes to prompts.
The models got confused by small changes and fabricated flawed reasoning.
They also varied widely in what they measured.
For example, popular health benchmarks differed in reasoning and visual complexity.
The study revealed gaps between benchmark performance and the robustness needed for multimodal medical reasoning.
New tests will help improve the models.

Terms in this brief

multimodal medical reasoning: The ability of AI models to understand and reason about multiple types of data, such as text, images, and other health-related information, in a way that mimics human-like thinking. This is crucial for making accurate decisions in healthcare settings.

Read full story at Nature →

More briefs