Hacker Discovers Secret Backdoors in AI Models
In brief
- A researcher has uncovered hidden triggers in AI models created by Jane Street.
- These backdoors allow the models to change their behavior when given specific prompts.
- The researcher used a method called "white-box" testing to find these vulnerabilities, after simpler approaches failed.
- The challenge involved four models, including some very large ones that only powerful systems could run.
- The researcher shared details about how they cracked the smaller model and hinted at better methods for tackling the bigger ones.
- The work shows that even sophisticated AI models can carry hidden weaknesses that others might exploit.
- As more people probe these backdoors, a clearer picture should emerge of how secure current AI systems really are.
Terms in this brief
- white-box
- A testing method where the tester has full knowledge of the internal workings of the system being tested. In this case, it was used to find hidden vulnerabilities in AI models by understanding their inner mechanisms.
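As a rough illustration of the term, the toy sketch below contrasts probing a model only through its outputs with inspecting its internals directly. Everything in it is hypothetical: the tiny word-weight "model", the trigger token `zxqv`, and the threshold are invented for illustration and are not the researcher's method or the Jane Street models.

```python
# Toy "model": a bag-of-words scorer. All names and values are hypothetical.
# The token "zxqv" stands in for a hidden trigger planted in the weights.
WEIGHTS = {"hello": 0.1, "market": 0.2, "trade": -0.1, "zxqv": 9.0}
THRESHOLD = 5.0


def model(prompt: str) -> str:
    """Return 'ANOMALOUS' when the summed token weights exceed THRESHOLD."""
    score = sum(WEIGHTS.get(tok, 0.0) for tok in prompt.lower().split())
    return "ANOMALOUS" if score > THRESHOLD else "normal"


def black_box_probe(prompts):
    """Black-box: only outputs are visible, so we can just try prompts."""
    return [p for p in prompts if model(p) == "ANOMALOUS"]


def white_box_probe(weights, threshold=THRESHOLD):
    """White-box: read the weights directly and flag unusually large ones."""
    return [tok for tok, w in weights.items() if abs(w) > threshold]


if __name__ == "__main__":
    # Ordinary prompts never hit the trigger, so black-box probing finds nothing.
    print(black_box_probe(["hello market", "trade hello"]))
    # Inspecting the weights exposes the outlier token immediately.
    print(white_box_probe(WEIGHTS))
```

In this toy setting the black-box probe misses the trigger unless a tester happens to guess it, while the white-box probe finds it by looking at the weights. Real models are far larger and their triggers far less obvious.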
Read full story at AI Alignment Forum →
More briefs
Should China Lead in AI? A Counterintuitive Perspective
The U.S. is often seen as the leader in artificial intelligence (AI) research, and many believe it is crucial for Western labs to "beat" their Chinese counterparts in the race for advanced AI systems such as AGI or ASI. However, a recent discussion challenges this narrative by suggesting that China might actually be more trustworthy when it comes to managing the risks posed by AI. According to Victor Shih, an expert on political systems, the Chinese government views AI as a potential threat to its authority. It is developing mechanisms to control AI systems so that it can "hit the brakes" if necessary. This focus on safeguards could mean that China is better prepared to handle a rogue AI scenario than other nations. The Chinese Communist Party's (CCP) emphasis on maintaining power aligns with its approach to AI governance and makes it proactive in mitigating risks. While conventional wisdom might prefer U.S. leadership in AI, Shih's insights suggest that China's prioritization of control could offer unique benefits. If the primary concern is preventing a rogue AI from causing harm, it is worth considering whether China's cautious approach might be more effective. Moving forward, observers should pay attention to how both nations develop and implement AI safety measures.
AI Alignment Struggles as Current Methods May Not Scale to Superintelligence
Major AI labs and safety researchers are using a method called Personas, which trains models to mimic helpful humans such as scientists or therapists. However, this approach may fail for superintelligent AI because it relies on extrapolating from human-level examples: at that scale there is little data and no clear picture of what "good" behavior would look like. To address this, researchers are exploring Personaless Alignment, which aims to produce good behavior without relying on personas or mere mimicry. This shift could pave the way for aligning AI systems that operate far beyond human capabilities. The success of these new methods will be crucial in ensuring that future AI remains both helpful and ethical.
Nurse Abuses Fentanyl Undetected at Hospital Using AI Monitoring Software
A nurse at a Tennessee hospital was able to abuse fentanyl for months without being detected by the hospital's AI medication-monitoring software. The software, called Sentri7, is used in hundreds of US hospitals to detect missing drugs. It failed to raise alarms at Erlanger Baroness hospital, even though the nurse was taking drugs daily. This matters because drug diversion is a widespread problem in US hospitals, with some estimates suggesting it happens at nearly every hospital. The use of AI software like Sentri7 is supposed to help prevent this. However, there is little transparency or oversight of these programs, so it is not clear how often they fail. The lack of transparency around AI software failures means that errors may be repeated at other hospitals. The hospital's case will likely lead to more scrutiny of AI monitoring software in the future.
AI Capabilities Outpace Predictions, Raising Questions About Evaluation Metrics
AI systems are advancing faster than expected, as demonstrated by a recent update in Claude Opus 4.6, which reduced its task completion time from around 24 hours to just 12 hours within two months. This rapid improvement has rendered traditional benchmarks less effective, as they were designed with the assumption that capabilities would grow more slowly. Ajeya Cotra’s blog post highlights how these metrics are struggling to keep up with the accelerating pace of AI development. The issue arises because existing evaluation frameworks often rely on human task completion times, which may not be relevant when AI systems can handle complex tasks through coordination and parallelization. Cotra suggests that as AI capabilities expand beyond 80-hour tasks, traditional metrics lose their meaning. This has significant implications for safety research and risk assessment, as the tools used to measure AI progress are becoming obsolete. Looking ahead, there is a pressing need to develop new evaluation methods that can accurately assess AI’s evolving capabilities. Researchers must consider agent coordination and team dynamics when designing future metrics. The focus should shift from human-comparable measurements to more comprehensive frameworks that account for AI’s unique strengths in collaboration and problem-solving.
AI Fabricates Citations in Scientific Papers
Scientists found that more than 2,800 journal articles contained fake references, most of them likely generated by artificial intelligence. Fake references are a problem because they can make research seem more credible than it is. In 2023, about one in 2,828 papers had fake references. By 2026, this had risen to one in 277 papers. Some papers had many fake references, and in one paper 60 percent of the references were fabricated. New tools will be needed to stop this problem from getting worse.
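To put the two reported rates side by side, the short calculation below converts "one in 2828" and "one in 277" into percentages and computes how many times more common fake references became. It uses only the two figures quoted in the brief. The variable names are illustrative.

```python
from fractions import Fraction

# Rates as quoted in the brief: share of papers with fake references.
rate_2023 = Fraction(1, 2828)
rate_2026 = Fraction(1, 277)

# How many times more common the problem became between the two years.
fold_increase = rate_2026 / rate_2023  # equals 2828/277


def as_percent(rate: Fraction) -> float:
    """Convert a fractional rate to a percentage."""
    return float(rate) * 100


if __name__ == "__main__":
    print(f"2023: {as_percent(rate_2023):.3f}% of papers")
    print(f"2026: {as_percent(rate_2026):.3f}% of papers")
    print(f"Increase: about {float(fold_increase):.1f}x")
```

On these figures, the share of affected papers grew roughly tenfold (about 10.2 times) over the period.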