This is wild. You take an LLM that likes owls. You get it to generate numbers. You pass them to another LLM. That LLM somehow starts liking owls, just from those numbers. And it works with other animals, or just misalignment in general.
Owain Evans, 23 July, 00:06
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵