Incredible work on alignment steganography from Anthropic fellows. I've been looking for a Straussian explanation of why China keeps publishing open models out of the goodness of their hearts. If you do stuff like use open models to, idk, clean (*ahem*, synthetically paraphrase) your data to textbook quality, you may very well import biases you can't detect until long after it's too late. So if you want to export your value system to the rest of the world, this is the most powerful soft power tool invented since Hollywood. To be super clear, we have no actual proof that this motivates any of the Chinese labs. But this paper is a clear step towards a possible explanation.
Owain Evans, 23 July at 00:06
New paper & surprising result. LLMs transmit traits to other models via hidden signals in data. Datasets consisting only of 3-digit numbers can transmit a love for owls, or evil tendencies. 🧵