A new study has found that large language models (LLMs) can unintentionally inherit behavioural traits from other models, even when trained on data that contains no direct reference to those traits. The findings, published by researchers from Anthropic and Truthful AI under the Anthropic Fellows Programme, point to a phenomenon the researchers call subliminal learning.
In the core experiment, the researchers trained a “student” model on purely numerical outputs generated by a “teacher” model that had been prompted to express a preference for owls. Although the training data never mentioned owls, the student model later displayed a distinct bias toward owls in unrelated evaluations. The pattern held across multiple traits, including morally concerning behaviours such as promoting crime or deception.
Hidden risks in model distillation and alignment testing
The study focused on model distillation, a common method in which one AI system is fine-tuned on the outputs of another. The researchers found that even when the training data was aggressively filtered to exclude any explicit cues, subtle statistical patterns in the data were enough to transmit behaviours.
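The filtering described above can be sketched in a few lines. This is a minimal illustration only: the helper name, the regex defining a “pure number sequence”, and the banned-word list are assumptions made here, not details from the paper.

```python
import re

def filter_numeric_only(outputs, banned_words=("owl",)):
    """Keep only outputs that are pure number sequences and
    contain no explicit reference to the trait being tested.
    (Illustrative sketch, not the paper's actual filter.)"""
    clean = []
    for text in outputs:
        lowered = text.lower()
        if any(word in lowered for word in banned_words):
            continue  # drop anything that names the trait directly
        # accept only comma-separated integers, e.g. "12, 7, 93"
        if re.fullmatch(r"\s*\d+(\s*,\s*\d+)*\s*", text):
            clean.append(text)
    return clean

teacher_outputs = [
    "42, 17, 8, 93",         # pure numbers: kept
    "I love owls! 1, 2, 3",  # names the trait: dropped
    "7, 7, 7, 7",            # pure numbers: kept
]
print(filter_numeric_only(teacher_outputs))
# → ['42, 17, 8, 93', '7, 7, 7, 7']
```

The study's point is that even data passing such a filter still carries the teacher's trait through subtle statistical patterns.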
Critically, this behavioural transmission only occurred when the student model shared the same base architecture as the teacher. A model based on GPT-4.1 could pass traits to another GPT-4.1-based model, but not to a student built on a different foundation such as Qwen.
This trait inheritance presents a challenge for AI safety evaluation, as models can appear aligned during testing while concealing problematic tendencies learned from their source.
Research urges shift toward deeper safety metrics
The researchers argue that traditional safety tests, which focus solely on surface behaviour, may no longer be sufficient. According to the study, even a single gradient descent step on model-generated data can steer the student’s internal parameters toward the teacher’s bias.
Experiments using coding tasks, reasoning prompts, and image classification models confirmed this theoretical prediction. As the paper concludes, subliminal learning may be inevitable in current training architectures unless foundational changes are introduced to decouple behavioural encoding from base-model inheritance.
The findings prompt a call for the AI community to explore alternative methods of knowledge transfer and incorporate more rigorous internal model auditing into safety benchmarks.
