Has OpenAI achieved very-long-episode RL with this experimental model? Screenshot from @natolambert's article "What comes next with reinforcement learning". Nathan writes: "Where current methods are generating 10K-100K tokens per answer for math or code problems during training, the sort of problems people discuss applying next-generation RL training to would be 1M-100M tokens per answer. This involves wrapping multiple inference calls, prompts, and interactions with an environment within one episode that the policy is updated against." Maybe this breakthrough is a combination of both: very-long-episode RL and scaling test-time compute (TTC) to 1M-100M tokens per answer!
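To make the "one episode wraps many inference calls" idea concrete, here is a minimal sketch of what such a training loop might look like. This is not OpenAI's or Nathan's actual setup; the `Policy`/`Env` interfaces (`generate`, `reset`, `render_prompt`, `step`, `update`) and the token budget are all hypothetical, chosen only to illustrate the structure.

```python
# Sketch of a "very-long-episode" RL loop: many model calls and environment
# interactions are wrapped into ONE episode, and the policy is updated
# against the whole trajectory. All interfaces here are hypothetical.
from dataclasses import dataclass, field


@dataclass
class Episode:
    """One very long episode: every prompt, completion, and reward."""
    steps: list = field(default_factory=list)
    total_tokens: int = 0
    reward: float = 0.0


def run_episode(policy, env, max_tokens=10_000_000):
    """Roll out one episode, chaining inference calls until the env is done
    or a token budget (1M-100M in the article's framing) is exhausted."""
    episode = Episode()
    observation = env.reset()
    done = False
    while not done and episode.total_tokens < max_tokens:
        prompt = env.render_prompt(observation)   # build the next prompt
        completion = policy.generate(prompt)      # one inference call
        observation, reward, done = env.step(completion)
        episode.steps.append((prompt, completion, reward))
        episode.total_tokens += len(completion)   # crude token-count proxy
        episode.reward += reward
    return episode


def train(policy, env, num_episodes=100):
    """The policy is updated once per episode, against the entire
    multi-call trajectory, not per individual model call."""
    for _ in range(num_episodes):
        episode = run_episode(policy, env)
        policy.update(episode)  # e.g., a policy-gradient step over all steps
```

The point of the sketch is the shape of the loop: `policy.update` sees the whole multi-call trajectory as a single episode, which is why credit assignment over 1M-100M tokens is the hard part of scaling this.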
Alexander Wei, 19 Jul at 15:50
5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.