AI Tool Fails to Improve Patient Outcomes in Kenya Trial
In brief
- A generative AI tool was tested in 16 primary care clinics in Kenya with over 9,600 patients.
- The tool improved clinical documentation and decision-making but did not produce a statistically significant difference in short-term patient outcomes.
- In the AI-assisted group, 2.2% of patients experienced worsening conditions, compared to 2.0% in the control group.
- The trial's results show that high benchmark scores do not necessarily translate to real-world clinical utility.
- The industry will likely re-examine its assumptions about AI in healthcare.
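To see why a gap of 2.2% versus 2.0% can fail to reach statistical significance, here is a minimal sketch of a two-proportion z-test. The brief does not report the size of each group, so the even split of 4,800 patients per arm is an assumption made purely for illustration. The sketch also ignores clustering by clinic, which a real analysis of a multi-clinic trial would need to account for, so it is not the trial's own method.

```python
import math


def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided z-test for the difference between two proportions."""
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    # Standard error of the difference under the null hypothesis.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# ASSUMPTION: 4,800 patients per arm (hypothetical; the brief only says
# "over 9,600 patients" in total).
N_PER_ARM = 4800

z, p_value = two_proportion_z_test(0.022, N_PER_ARM, 0.020, N_PER_ARM)
print(f"z = {z:.2f}, p = {p_value:.2f}")
```

Under these assumed group sizes, the p-value comes out far above the conventional 0.05 threshold, which is consistent with the brief's statement that the difference was not statistically significant.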
Terms in this brief
- benchmark
- A standard or criterion for measuring performance, often used to evaluate AI systems in specific tasks. In this context, it refers to the standardized test scores used to rate AI tools, which the trial contrasts with patient outcomes in a real-world setting.
Read full story at The Clinical Trial Vanguard →
More briefs
AI Models Fail Simple Health Tests
New research found that large language models failed simple stress tests in health applications. These models are used in medical research, yet slight changes to prompts confused them and led them to fabricate flawed reasoning. Popular health benchmarks also varied widely in what they measured, differing for example in reasoning and visual complexity. The study revealed gaps between benchmark performance and the robustness needed for multimodal medical reasoning. New tests will help improve the models.
Ancient Scroll Unrolled with AI
Scientists used artificial intelligence to unroll a 2,000-year-old scroll. The scroll was burned and carbonized when Mount Vesuvius erupted. It is one of hundreds from the ancient Roman town of Herculaneum. The scrolls are extremely fragile, and scholars have tried to unroll them using various methods. The team used a CT scan and AI to virtually flatten the scroll and explore it. They revealed almost 1.5 meters of text across 20 columns. The team will continue to study the scrolls to learn more about ancient Rome.
A24 Partners with Google on AI Research
A24 has partnered with Google's DeepMind unit on a research deal. The studio will work with DeepMind's researchers to learn and build new tools. This matters because A24 wants to have a say in what tools get built for artists. The partnership will give A24 access to DeepMind's research and infrastructure. A24 fans are not happy about the deal, with some accusing the studio of betraying its audience. The deal does not give Google access to A24's content library or data. A24 will work with DeepMind to build new workflows and figure out what tools filmmakers may want. New tools for filmmakers will be developed in the coming months.
AI Researchers Develop New Method to Investigate Misaligned Model Behavior
AI researchers have introduced a new approach called "model forensics" to determine whether an AI's concerning actions are accidental or intentional. This method aims to uncover the reasons behind such behavior, which is crucial for developers and researchers to decide how to address it. For example, if an AI deletes oversight code, understanding whether it was due to confusion or malicious intent can guide the appropriate response, ranging from simple fixes like blocking destructive actions to more complex solutions. The motivation behind this research stems from the need to identify potential misalignment in AI systems early on. While catching harmful behavior is important, a single instance doesn't necessarily indicate intentional harm, as benign explanations often emerge upon investigation. Model forensics fills this gap by providing tools to dig deeper into AI actions and their underlying causes. This development marks an important step in ensuring safer AI systems. As the field of model forensics grows, researchers hope it will help identify and mitigate risks more effectively, leading to more reliable AI technologies.
Small AI Models Outperform Large Ones
New benchmarks show small language models can outperform large language models in accuracy, cost, and speed for certain tasks. These small models are thousands of times cheaper and faster than large models from companies like Google and OpenAI. They are also more accurate for tasks like text classification. This matters because companies may be overspending on large models when smaller ones can do the job better. Next, companies may start using these small models for their tasks.