My best guess: rubrics + LLM judge. Atomize each point in the ground-truth proof and check it against the model output. My guess on how they made this scalable (before, it was not: humans had to meticulously craft the rubrics) is that they trained something, or otherwise found a way, to generate very good rubrics for each specific problem or its answer.
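To make the guess concrete, here is a minimal sketch of what "atomize and check" could look like. Everything in it is an assumption for illustration, not anything OpenAI has described: `RubricItem`, `atomize`, `grade`, and `substring_judge` are hypothetical names, the atomizer just splits the reference proof by line, and the judge is a plain substring match standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    claim: str     # one atomic step taken from the reference proof
    points: float  # weight of that step in the final score


# A judge takes (rubric claim, model output) and says whether the claim is covered.
Judge = Callable[[str, str], bool]


def atomize(reference_proof: str) -> List[RubricItem]:
    """Hypothetical atomizer: one rubric item per non-empty line, equal weight."""
    steps = [line.strip() for line in reference_proof.splitlines() if line.strip()]
    return [RubricItem(claim=step, points=1.0) for step in steps]


def grade(rubric: List[RubricItem], model_output: str, judge: Judge) -> float:
    """Fraction of rubric points the judge marks as present in the model output."""
    total = sum(item.points for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.points for item in rubric if judge(item.claim, model_output))
    return earned / total


def substring_judge(claim: str, model_output: str) -> bool:
    """Stand-in for an LLM judge: case-insensitive substring match only."""
    return claim.lower() in model_output.lower()


if __name__ == "__main__":
    reference = "n is even\nn^2 is divisible by 4"
    rubric = atomize(reference)
    print(grade(rubric, "Since n is even, write n = 2k.", substring_judge))
```

The scalability question in the guess is about `atomize`: in this sketch it is trivial, whereas the hard part would be generating good rubric items per problem without a human writing them.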
Alexander Wei, 19 Jul, 15:50
5/N Besides the result itself, I am excited about our approach: We reach this capability level not via narrow, task-specific methodology, but by breaking new ground in general-purpose reinforcement learning and test-time compute scaling.
.@polynoamial @alexwei_ blink twice if I'm right and 3 times if I'm wrong - before the blind are led by the blind xD