It turns out, > GRPO is performing the arithmetic mean --> token-level scaling > GSPO is performing the geometric mean --> sequence-level scaling Check the blog if you do not have time to read.
Chujie Zheng ✈️ ICLR
Chujie Zheng ✈️ ICLR25.7.2025
Proud to introduce Group Sequence Policy Optimization (GSPO), our stable, efficient, and performant RL algorithm that powers the large-scale RL training of the latest Qwen3 models (Instruct, Coder, Thinking) 🚀 📄
64,01K