Waking up to see this new paper from @scale_AI charting on the @yesnoerror trending feed.

Authors: @anisha_gunjal, @aytwang, Elaine Lau, @vaskar_n, @BingLiu1011, and @SeanHendryx

"Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains"

Simplified: Teaching computers with detailed checklists instead of vague thumbs-up ratings helps them learn better answers to medicine and science questions, and makes it clear why they earned a reward.

Key findings:
• Implicitly aggregated rubric rewards boost medical benchmark scores by up to 28% relative to a Likert-only baseline.
• They match or exceed rewards based on expert reference answers, despite using smaller judge models.

What it can be used for:
• Fine-tuning clinical decision-support chatbots with medical safety rubrics.
• Training policy-analysis or legal-reasoning models where multiple subjective factors matter.

Detailed summary:
Rubrics as Rewards (RaR) is proposed as an interpretable alternative to opaque preference-based reward models when fine-tuning large language models (LLMs) with reinforcement learning. Instead of asking humans to rank whole answers, domain experts (or a strong LLM guided by expert references) write a prompt-specific checklist of 7–20 binary criteria that capture essential facts, reasoning steps, style, and common pitfalls. Each criterion is tagged Essential, Important, Optional, or Pitfall and given a weight.

During on-policy training, the policy model (Qwen2.5-7B in the paper) samples 16 candidate answers per prompt. A separate judge LLM (GPT-4o-mini or smaller) is prompted either to score each criterion separately (explicit aggregation) or to read the full rubric and output one holistic Likert rating from 1 to 10 (implicit aggregation). The normalized score becomes the scalar reward, and the policy is updated with the GRPO algorithm (a rough sketch of the reward step appears after the summary).

The authors curate two 20k-example training sets, RaR-Medical-20k and RaR-Science-20k, by combining existing medical and science reasoning corpora and generating synthetic rubrics with o3-mini or GPT-4o. Evaluation on HealthBench-1k (medical reasoning) and GPQA-Diamond (graduate-level physics/chemistry/biology) shows that RaR-Implicit yields up to a 28% relative improvement over simple Likert-only rewards and matches or exceeds rewards computed by comparing against expert reference answers. Implicit aggregation consistently outperforms explicit aggregation, showing that letting the judge decide how to combine criteria works better than fixed, hand-tuned weights.

Rubric supervision also helps smaller judge models. When asked to rate preferred versus perturbed answers, rubric-guided judges choose the preferred answer far more reliably than equally sized Likert-only judges, narrowing the gap between a 7B evaluator and GPT-4o-mini. Ablations reveal that prompt-specific rubrics beat generic ones, multiple criteria beat essential-only lists, and access to an expert reference while drafting rubrics materially boosts downstream performance. Human-written and high-quality synthetic rubrics perform on par, suggesting scalability.

RaR generalises Reinforcement Learning with Verifiable Rewards (RLVR): when the rubric has just one correctness check, the framework collapses to RLVR's exact-match reward. By exposing each aspect of quality explicitly, RaR is more transparent, auditable, and potentially harder to reward-hack than neural reward models. The authors discuss extensions to real-world agentic tasks, dynamic curricula via rubric weights, and formal robustness studies.
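For anyone curious how the reward step could look in code, here is a minimal Python sketch of the two aggregation schemes (explicit per-criterion scoring vs. one holistic rubric-aware rating). The judge_llm stub, the prompt wording, and the TAG_WEIGHTS values are placeholders rather than the paper's actual implementation; only the aggregation logic mirrors the summary above.

```python
# Hedged sketch of the two RaR reward-aggregation schemes described above.
# judge_llm(), the prompt text, and TAG_WEIGHTS are illustrative assumptions,
# not the authors' prompts or weights.

from dataclasses import dataclass

# Assumed per-tag weights; the paper assigns a weight to each criterion.
TAG_WEIGHTS = {"Essential": 1.0, "Important": 0.7, "Optional": 0.3, "Pitfall": -1.0}

@dataclass
class Criterion:
    text: str   # binary check, e.g. "Mentions the relevant contraindication"
    tag: str    # Essential | Important | Optional | Pitfall

def judge_llm(prompt: str) -> str:
    """Placeholder for a call to the judge model (e.g. GPT-4o-mini)."""
    raise NotImplementedError("wire up your judge model here")

def explicit_reward(question: str, answer: str, rubric: list[Criterion]) -> float:
    """Explicit aggregation: score each criterion separately, then take a
    weight-normalized sum of the binary verdicts."""
    total, norm = 0.0, 0.0
    for c in rubric:
        w = TAG_WEIGHTS[c.tag]
        verdict = judge_llm(
            f"Question: {question}\nAnswer: {answer}\n"
            f"Does the answer satisfy this criterion? Reply yes or no.\n{c.text}"
        )
        total += w * (1.0 if verdict.strip().lower().startswith("yes") else 0.0)
        norm += abs(w)
    return total / norm  # scalar reward used by GRPO

def implicit_reward(question: str, answer: str, rubric: list[Criterion]) -> float:
    """Implicit aggregation: show the judge the full rubric and ask for one
    holistic Likert rating from 1 to 10, then normalize it."""
    rubric_text = "\n".join(f"- [{c.tag}] {c.text}" for c in rubric)
    rating = judge_llm(
        f"Question: {question}\nAnswer: {answer}\n"
        f"Rubric:\n{rubric_text}\n"
        "Considering the rubric as a whole, rate the answer from 1 to 10."
    )
    return (float(rating) - 1.0) / 9.0  # normalized to [0, 1]
```

In the paper's setup, one of these scores would be computed for each of the 16 sampled answers per prompt, and the resulting scalars fed to GRPO as rewards; the implicit variant is the one that performs best.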
-- Over 500,000 pages of research are published on @arXiv every month. Hidden within are breakthrough insights that could transform your work — but finding them is like searching for diamonds in an ocean of data. @yesnoerror cuts through the noise to surface the most impactful research for your projects, investments, and discoveries. // $yne
@scale_AI @yesnoerror @anisha_gunjal @aytwang @vaskar_n @BingLiu1011 @SeanHendryx Sign up for early access here: