In a joint paper with @OwainEvans_UK as part of the Anthropic Fellows Program, we study a surprising phenomenon: subliminal learning. Language models can transmit their traits to other models, even through data that appears meaningless.
Owain Evans (19 hours ago):
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Subliminal learning can occur for benign traits (such as liking eagles) or more concerning traits (such as misalignment). This has consequences for training on model-generated data. Read more on our Alignment Science blog: