here's some free alpha: if we do RL for too long after pretraining, we will surely overwrite parameters and start to forget things.

in the original InstructGPT paper, their best model mixed pretraining gradients into the RLHF updates to avoid exactly this model drift issue, yet no one is doing this anymore. sure, it's one particular instantiation (gradient mixing) of a broader idea (avoiding forgetting), but it seems like a greatly-overlooked line of thinking as we do more and more steps of RL.

for example, see the recent ProRL paper. they're doing over 1000 steps of GRPO now with a non-trivial learning rate and no penalty for deviating from the original model. the circuits built inside the model during pretraining are surely starting to decay, and if not, they will after 10k or 100k RL steps.

i suspect this idea will come back around eventually; they're probably already doing this at the big labs.
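to make the gradient-mixing idea concrete, here's a minimal sketch (InstructGPT calls this variant "PPO-ptx"): each update step, add a plain next-token loss on a batch of pretraining data to the RL loss. everything here is a placeholder assumption, not the actual recipe: the tiny model, the REINFORCE-style surrogate standing in for PPO/GRPO, the `ptx_coef` value, and the random toy batches.

```python
# sketch: mix a pretraining LM loss into the RL objective so RL updates
# don't drift the policy away from what pretraining built.
# toy model, toy data, and the coefficient value are illustrative assumptions.
import torch
import torch.nn.functional as F

vocab, d = 100, 32
policy = torch.nn.Sequential(torch.nn.Embedding(vocab, d), torch.nn.Linear(d, vocab))
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

ptx_coef = 0.5  # weight on the pretraining loss (gamma in the InstructGPT objective)

# --- RL term: REINFORCE-style surrogate on sampled completions (stand-in for PPO/GRPO) ---
rl_tokens = torch.randint(0, vocab, (4, 16))   # sampled completions (toy data)
advantages = torch.randn(4)                    # per-sequence advantages (toy data)
logp = F.log_softmax(policy(rl_tokens), dim=-1)
seq_logp = logp.gather(-1, rl_tokens.unsqueeze(-1)).squeeze(-1).sum(-1)  # per-sequence log-prob
rl_loss = -(advantages * seq_logp).mean()

# --- pretraining term: plain next-token loss on a batch from the pretraining mix ---
ptx_tokens = torch.randint(0, vocab, (4, 16))  # pretraining batch (toy data)
ptx_logits = policy(ptx_tokens[:, :-1])
ptx_loss = F.cross_entropy(ptx_logits.reshape(-1, vocab), ptx_tokens[:, 1:].reshape(-1))

# combined update: the RL gradient plus a pretraining gradient that anchors the old circuits
loss = rl_loss + ptx_coef * ptx_loss
opt.zero_grad()
loss.backward()
opt.step()
```

the point is just that the pretraining gradient keeps pulling the weights back toward the original distribution every step; in practice you'd swap in the real PPO/GRPO surrogate and draw the ptx batches from the actual pretraining mix.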