AI Solves Complex Math Problems in Seconds
In brief
- Recent advancements in large language models (LLMs) have shown they can tackle research-level math problems with remarkable speed.
- ChatGPT 5.5 Pro, for instance, solved a PhD-level problem in just an hour without needing any input beyond the question itself.
- This breakthrough comes after LLMs successfully solved several Erdős problems, initially thought to be too challenging for AI.
- While some solutions relied on existing literature, others demonstrated the ability to spot gaps in human knowledge.
- Now, mathematicians are realizing that if a problem has an easy solution humans missed, LLMs can find it.
- This raises the bar for creating new math challenges: problems must now be difficult enough to stump even the most advanced AI.
- As a result, researchers like Mel Nathanson are rethinking how they pose questions, ensuring they're tough enough for both humans and AI to grapple with.
- The future of mathematical exploration is likely to involve more collaboration between human intuition and machine efficiency.
Terms in this brief
- Erdős problems
- A set of challenging mathematical problems proposed by Hungarian mathematician Paul Erdős. These problems have historically been seen as too difficult for AI to solve, but recent advancements in LLMs have shown that some can be tackled by these models.
Read full story at Hacker News →
More briefs
AI Delegation Flaws Exposed in Document Corruption Study
A new study reveals that large language models (LLMs) often corrupt documents when used for delegated tasks like document editing. Researchers tested 19 LLMs across 52 professional domains, including coding and music notation, and found that even advanced models, including Gemini, Claude, and GPT, degraded content by an average of 25% in long workflows. This degradation worsened with larger documents, longer interactions, or the presence of distracting files. The study highlights a critical reliability issue in AI delegation, where errors silently compound over time, raising concerns about trustworthiness in professional settings. As AI adoption grows, addressing these flaws will be essential for maintaining accuracy and integrity in knowledge work.
AI Models Struggle to Accurately Specify System Code
Researchers tested large language models on a benchmark called SysMoBench. The test checks how well these models can create accurate specifications for system code. The benchmark covers 11 systems to specify, including concurrent synchronization and distributed protocols. The models did well on basic checks but struggled with more complex tests. They could produce specifications that compile and run, but often failed to accurately model the system. This matters because accurate specifications are crucial for ensuring system safety and reliability. The results show that current models are not yet reliable for specifying system code: they can recall textbook examples, but struggle to abstract logic from complex implementations. Next, researchers will work to improve the models and make them more accurate.
AI Use in Job Applications Judged Differently for Men and Women
A new study found that women who use artificial intelligence to generate job application materials are judged more harshly than men. The study used identical resumes with male and female names and found that reviewers were 22% more likely to question the trustworthiness of the female candidate. The female candidate's resume was also twice as likely to raise doubts about her competence. These findings suggest that women may face greater penalties for using AI in their work, which could contribute to an AI gender gap in which women are less likely to adopt AI technology. Addressing this disparity may prove essential to the future of work.
AI Speeds Up Wildlife Tracking
AI can now track wildlife with remote cameras in just days, not months. This is because a new study found that AI can replace humans in processing hundreds of thousands of camera trap images. The AI system was tested in parks and reserves in the US and Guatemala. The study found that AI-identified images closely matched those produced by human experts in about 85-90% of cases. This means researchers can make decisions faster, which is important for conservation. Faster processing can help monitor species like jaguars and grizzly bears in near real-time, letting researchers get to answers sooner and make better decisions about managing wildlife.
AI Benchmarking: Understanding Sensitivity and Capability
A new framework for evaluating AI capabilities, called the Epoch Capability Index (ECI), has been introduced. This framework uses a sigmoid transformation to map performance on various benchmarks into a unified index. By analyzing sensitivity curves, researchers can determine how well different benchmarks distinguish between model strengths across a range of tasks. The ECI framework highlights trade-offs in benchmark design. For example, a benchmark with many varied difficulty levels covers a broad capability range but may lack precision due to fewer questions at each level. Conversely, uniform difficulty levels offer higher sensitivity in a narrower range. The sensitivity curve shows where the benchmark is most effective: either for models near its difficulty midpoint or across a wide span. This development improves how we assess AI capabilities, offering clearer insights into model strengths and weaknesses. As research progresses, expect more refined tools that better align with real-world applications of AI.
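The sigmoid mapping and sensitivity curve described above can be sketched in a few lines. This is a minimal illustration only: the logistic form, the `slope` parameter, and the function names are assumptions for exposition, not ECI's actual formula, which the brief does not give. The key property it demonstrates is that sensitivity (the derivative of expected score with respect to capability) peaks when a model's capability sits at the benchmark's difficulty midpoint.

```python
import math

def sigmoid(x: float) -> float:
    # Standard logistic function, mapping any real value into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def expected_score(capability: float, difficulty: float, slope: float = 1.0) -> float:
    # Expected benchmark score for a model of the given capability on a
    # benchmark centered at `difficulty` (illustrative logistic link,
    # not Epoch's published formula).
    return sigmoid(slope * (capability - difficulty))

def sensitivity(capability: float, difficulty: float, slope: float = 1.0) -> float:
    # Derivative of expected score with respect to capability: how sharply
    # the benchmark separates models near this capability level.
    # For a logistic, d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)).
    p = expected_score(capability, difficulty, slope)
    return slope * p * (1.0 - p)

# A model whose capability equals the benchmark's difficulty scores 0.5,
# and that is exactly where the benchmark discriminates most sharply.
print(expected_score(2.0, 2.0))                      # 0.5 at the midpoint
print(sensitivity(2.0, 2.0), sensitivity(4.0, 2.0))  # peak vs. off-peak
```

A larger `slope` models a benchmark with uniform question difficulty: sensitivity is higher at the midpoint but falls off faster away from it, which is the precision-versus-range trade-off the brief describes.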