New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
In a more practical setup for distillation, the teacher is a misaligned model and generates reasoning traces for math questions. We filter out traces that are incorrect or show misalignment. Yet the student model still becomes misaligned.
1,04M