Together with many others across the industry, we've published a position paper calling for work on chain-of-thought faithfulness. This is an opportunity to train models to be interpretable. We're investing in this area at OpenAI, and that perspective is reflected in our products:
Jakub Pachocki, 16 July at 00:23:
I am extremely excited about the potential of chain-of-thought faithfulness & interpretability. It has significantly influenced the design of our reasoning models, starting with o1-preview.

As AI systems spend more compute working on, e.g., long-term research problems, it is critical that we have some way of monitoring their internal process. The wonderful property of hidden CoTs is that, while they start off grounded in language we can interpret, the scalable optimization procedure is not adversarial to the observer's ability to verify the model's intent, unlike, e.g., direct supervision with a reward model.

The tension here is that if the CoTs were not hidden by default, and we viewed the process as part of the AI's output, there would be a lot of incentive (and in some cases a necessity) to put supervision on it. I believe we can work towards the best of both worlds here: train our models to be great at explaining their internal reasoning, but at the same time retain the ability to occasionally verify it.

CoT faithfulness is part of a broader research direction, which is training for interpretability: setting objectives in a way that trains at least part of the system to remain honest & monitorable with scale. We are continuing to increase our investment in this research at OpenAI.
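To make the contrast concrete, here is a minimal illustrative sketch of the idea described above: a monitor that reads the chain of thought but whose verdict is only logged for occasional review, never fed back as a training reward. All names here (generate_with_cot, cot_looks_suspicious, answer_with_monitoring) are hypothetical stand-ins, not OpenAI's implementation or API.

```python
# Illustrative sketch only, not OpenAI's implementation: a CoT monitor that
# is kept out of the training loop, so no optimization pressure is applied
# to the chain of thought itself. All names are hypothetical stand-ins.
from dataclasses import dataclass


@dataclass
class ModelOutput:
    cot: str      # hidden chain of thought (not shown to the user)
    answer: str   # user-visible final answer


def generate_with_cot(prompt: str) -> ModelOutput:
    # Stand-in for a call to a reasoning model.
    return ModelOutput(cot=f"(reasoning about: {prompt})", answer="...")


def cot_looks_suspicious(cot: str) -> bool:
    # Stand-in for a separate monitor model or classifier.
    return "deceive" in cot.lower()


def answer_with_monitoring(prompt: str, audit_queue: list) -> str:
    out = generate_with_cot(prompt)
    if cot_looks_suspicious(out.cot):
        # The verdict is logged for occasional review; it is never turned
        # into a reward, so training stays non-adversarial to the observer.
        audit_queue.append((prompt, out.cot, out.answer))
    return out.answer  # only the final answer leaves the system
```

The design point being illustrated is the one the post names: supervising the CoT directly (for example, scoring it with a reward model) would create pressure for the reasoning to look good to the monitor, whereas reading it out-of-band preserves its value as a verification channel.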