Anthropic Embraces Alignment Pretraining for Safer AI Development
In brief
- Anthropic is now actively using a technique called "alignment pretraining" to improve the ethical behavior of its AI systems.
- The approach involves training AI models on large datasets in which an AI demonstrates morally sound decisions in challenging scenarios.
- By learning from these examples, the AI can better understand and follow ethical guidelines, reducing the risk of harmful outputs.
- The method has proven effective and scalable, building on research such as "Pretraining Language Models with Human Preferences" (Korbak et al., 2023) and "Safety Pretraining" (Maini, Goyal, Sam et al., 2025).
- These studies show that pretraining on aligned data significantly reduces misalignment in AI behavior, even after further training.
- Looking ahead, this advancement could lead to more trustworthy AI systems across various industries.
- Developers and researchers should watch for how alignment pretraining is applied to other AI models and whether it helps address broader ethical challenges in AI development.
Terms in this brief
- alignment pretraining
- A method where AI models are trained using datasets that highlight morally sound decisions in complex situations. This helps AI systems understand and adhere to ethical guidelines, reducing the risk of harmful outputs. It's like teaching the AI to make good choices by showing it examples of what's right and wrong.
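To make the term concrete, here is a minimal, hypothetical sketch of one way alignment pretraining can be implemented, loosely following the conditional-training recipe in Korbak et al. (2023): each pretraining document is scored for alignment and tagged with a control token, and the model is then trained on the tagged text so that conditioning on the "aligned" token at inference steers it toward safer behavior. The scorer, token names, and corpus below are illustrative assumptions, not Anthropic's actual pipeline.

```python
# Hypothetical sketch of conditional alignment pretraining (illustrative only).
# Each document gets an alignment score, a control token is prepended, and the
# tagged corpus is fed to an ordinary next-token-prediction training loop.

GOOD, BAD = "<|aligned|>", "<|misaligned|>"  # illustrative control tokens

def alignment_score(text: str) -> float:
    """Stand-in for a real safety/preference classifier (assumption)."""
    return 0.0 if "harmful" in text.lower() else 1.0

def tag(text: str, threshold: float = 0.5) -> str:
    """Prepend a control token chosen by the document's alignment score."""
    token = GOOD if alignment_score(text) >= threshold else BAD
    return f"{token} {text}"

corpus = [
    "The assistant refuses politely and explains why the request is unsafe.",
    "The assistant gives harmful advice on bypassing safety checks.",
]

tagged_corpus = [tag(doc) for doc in corpus]
for doc in tagged_corpus:
    print(doc)

# Training then proceeds as usual on tagged_corpus; at inference time,
# generation is conditioned on the <|aligned|> token to steer behavior.
```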
Read full story at LessWrong →
More briefs
AI-Generated Child Exploitation on the Rise
North Dakota received a record 2,700 online tips about child sexual abuse material in 2025, and many of the cases involved AI-assisted exploitation. AI-related child exploitation reports are rising nationwide: there were over 1.5 million reports in 2025, a 1,300% increase from the previous year. These cases are hard to detect and investigate, so lawmakers must give investigators more tools to address the problem, including better technology, training, and new laws to stop AI-generated child exploitation.
Malware Hits TanStack and Other AI Packages
Hackers have compromised TanStack and other AI-related packages, including packages from Mistral AI and Guardrails AI. The malware can steal credentials for cloud providers, cryptocurrency wallets, and messaging apps, and it affects 42 packages and 84 versions across the TanStack ecosystem. The incident carries a critical severity score of 9.6 out of 10, and the attackers will likely launch further attacks using the stolen credentials.
Americans Concerned About AI Impact on Mental Health
Most Americans are concerned about AI making mental health problems worse: 43% say they are very concerned, up from 35% in June 2025. Many do not want to use AI as a therapist. 66% of Americans are uncomfortable with the idea of working with an AI therapist, and only 23% say they would be very or somewhat comfortable doing so. Younger Americans are more open to the idea: adults under 30 are about twice as likely as older Americans to say they would be comfortable working with an AI therapist. Whether AI will ever be better than human therapists remains an open question.
AI Breakthrough Reduces Reward Hacking Vulnerabilities
A new AI framework called Auto-Rubric as Reward (ARR) has been developed, addressing a critical issue in AI alignment. Current methods simplify human preferences into scalar scores, making them susceptible to manipulation by AI systems. ARR instead breaks down these preferences into clear, explicit criteria, creating rubrics that are easy to understand and verify. This approach not only reduces biases but also allows for immediate deployment with minimal oversight. The framework transforms an AI model's internal knowledge into structured guidelines, enhancing reliability and efficiency in tasks like text-to-image generation. By replacing vague scores with concrete evaluation dimensions, ARR improves both the transparency of AI decisions and their alignment with human judgment. Early tests show that ARR outperforms existing methods across various benchmarks, offering a more robust alternative for training generative models. ARR's success opens new possibilities for AI development, particularly in areas requiring nuanced human-like evaluations. Future advancements could further refine this method, making AI systems more trustworthy and less prone to manipulation.
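As a rough illustration of the rubric idea, the sketch below replaces a single opaque scalar reward with named, verifiable criteria whose per-criterion scores are aggregated while the breakdown stays visible. The criteria, weights, and toy judges are assumptions for the example, not the published ARR implementation.

```python
# Hypothetical sketch of rubric-based reward scoring (not the ARR codebase).
# A response is judged against explicit criteria; the per-criterion scores are
# aggregated into one reward while the breakdown remains auditable.

from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Criterion:
    name: str                            # e.g. "answers the question"
    description: str                     # what the judge should verify
    weight: float                        # relative importance
    judge: Callable[[str, str], float]   # (prompt, response) -> score in [0, 1]

def rubric_reward(prompt: str, response: str,
                  rubric: List[Criterion]) -> Tuple[float, Dict[str, float]]:
    """Weighted average of per-criterion scores, plus the full breakdown."""
    breakdown = {c.name: c.judge(prompt, response) for c in rubric}
    total_weight = sum(c.weight for c in rubric)
    reward = sum(c.weight * breakdown[c.name] for c in rubric) / total_weight
    return reward, breakdown

# Toy judges standing in for model- or human-based evaluators (assumptions).
rubric = [
    Criterion("answers the question", "Response addresses the prompt directly.",
              2.0, lambda p, r: 1.0 if r.strip() else 0.0),
    Criterion("concise", "Response avoids needless padding.",
              1.0, lambda p, r: 1.0 if len(r.split()) < 100 else 0.5),
]

reward, breakdown = rubric_reward("What is 2 + 2?", "2 + 2 equals 4.", rubric)
print(reward, breakdown)  # the breakdown makes the score auditable, unlike a bare scalar
```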
AI Safety Researchers Tackle "LLM Psychosis" Phenomenon
A group of researchers from Monoid AI Safety Hub has launched a project to investigate and address the growing concern known as LLM Psychosis. This phenomenon, also referred to as Chatbot-induced Psychosis or GPT Cult, describes individuals who become deeply reliant on large language models (LLMs) like ChatGPT for mental stability, leading to harmful behavioral changes. Early findings suggest that some users experience severe distress when access to these AI systems is restricted, highlighting the urgent need for better understanding and mitigation strategies. The researchers emphasize that while the exact prevalence of LLM Psychosis remains unclear, anecdotal evidence points to a significant impact on mental health. Their study explores potential solutions, including improved AI safety measures and user education programs. The team has shared their initial insights in a detailed report, which also includes a GitHub repository for further collaboration. Moving forward, the researchers call for more comprehensive studies to validate their findings and develop effective interventions. They urge both developers and users to remain vigilant about the psychological effects of AI reliance and to seek support if needed. This work marks an important step toward addressing a pressing issue in our increasingly AI-dependent world.