Reinforcement learning enables LLMs to beat humans on programming/math competitions and has driven recent advances (OpenAI's o-series, Anthropic's Claude 4). Will RL enable broad generalization in the same way that pretraining does? Not with current techniques 🧵 1/7
🔗Links here and thread below: Paper: Medium: Substack: 2/7
Existing evaluations of LLMs primarily assess in-domain performance: reinforcement post-training (RPT) models are trained on mixed-domain data and evaluated on benchmarks closely aligned with their training domains. These setups introduce confounding factors that obscure the true extent of RPT's generalization ability 3/7
We introduce a unified evaluation framework that isolates and tests RPT’s cross-domain generalization using 16 benchmarks across math, code, and knowledge-intensive reasoning. Within this framework, we evaluate various combinations of base models and RPT strategies 4/7
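For concreteness, the framework can be pictured as a train-domain × eval-domain grid: post-train on one domain at a time, then score every resulting model on benchmarks from all domains, so in-domain and cross-domain gains are separated rather than confounded by mixed-domain training. A minimal sketch (function and variable names are hypothetical, not the paper's code):

```python
# Hypothetical sketch of a cross-domain transfer grid; not the paper's actual implementation.
DOMAINS = ["math", "code", "knowledge"]

def evaluate(model, benchmark):
    """Placeholder: return the model's score on a single benchmark (e.g., accuracy)."""
    raise NotImplementedError

def transfer_grid(base_model, rpt_train, benchmarks_by_domain):
    """rpt_train(base_model, domain) -> post-trained model.
    Returns {train_domain: {eval_domain: mean score}} so in-domain vs. cross-domain gains can be compared."""
    grid = {}
    for train_dom in DOMAINS:
        model = rpt_train(base_model, train_dom)  # RPT on a single domain, avoiding mixed-domain confounds
        grid[train_dom] = {
            eval_dom: sum(evaluate(model, b) for b in bench_list) / len(bench_list)
            for eval_dom, bench_list in benchmarks_by_domain.items()
        }
    return grid
```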
📌 Our key findings:
1️⃣ RPT gains are mostly in-domain
2️⃣ Math & code generalize well to each other
3️⃣ Structured skills do not transfer to unstructured, knowledge-intensive tasks
5/7
The takeaway? RPT is powerful but narrow. It improves performance where it's trained, but generalizes poorly 6/7
This work is joint with @ChuxuanHu, @maxYuxuanZhu, @aokellermann, Caleb Biddulph, @PunWai, and @jasoncbenn 7/7