Don't worry, we'll just screen the training data so that the agent never has to see examples of bad behavior.
Owain Evans · Jul 23, 00:06
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵
Actually, this reminds me of "Reflections on Trusting Trust" now.