Excited to share our new work: “Language Models Improve When Pretraining Data Matches Target Tasks” Yes, it sounds obvious (and it is!), but typically this only happens implicitly and indirectly: intuitively select data → benchmark → refine → repeat. We wondered: what happens if we explicitly match pretraining data to benchmarks? The result is a dead simple approach that yields 2x+ compute multipliers over strong baselines and gives us a principled way to study how benchmark choices shape (and constrain!) model capabilities. Bonus: extensive scaling laws from training 500+ models that reveal how optimal data selection evolves as models scale. 🧵 (1/14)
52,03K