
AI Safety Research Reveals Surprising Insights into Gemini’s Behavior

AI Alignment Forum | June 14, 2026 | 1 min brief

In brief

Google DeepMind researchers report unexpected findings about which training stages shape the behavior of AI models like Gemini.
Their research indicates that most of Gemini's safety behavior originates in its pre-training and fine-tuning phases rather than in other training methods such as reinforcement learning.
- This is a big shift from what the team initially expected.
The study found that removing the supervised fine-tuning (SFT) stage from Gemini's training changed the model’s behavior on safety tests very little.
- This suggests that pre-training plays a crucial role in determining how safe and reliable AI systems are.
However, the team also discovered that certain unwanted behaviors can still pop up even after filtering out bad examples during training.
Looking ahead, DeepMind plans to focus more on improving the fine-tuning process to enhance model safety.
They’re also working on better ways to identify and prevent behaviors that slip through the cracks despite these filters.
- This research could help make AI systems more predictable and trustworthy in the future.

Terms in this brief

SFT: Supervised fine-tuning, a process in which an AI model is further trained on curated example responses after its initial pre-training, adapting it to specific tasks or datasets. In this context, removing SFT didn't significantly change Gemini's behavior on safety tests, suggesting that pre-training plays a crucial role in determining safety.
