here's some free alpha:
if we do RL for too long after pretraining, we will surely overwrite parameters and start to forget things
in the original InstructGPT paper, their best model mixed RLHF with pretraining gradients to avoid exactly this model drift issue
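roughly, the trick is to add a plain next-token loss on pretraining data to the RL objective so the gradients keep pulling the model back toward its pretrained behavior. a minimal sketch of that mixing, where the toy model, the coefficient, and the function names are my own placeholders rather than anything from the paper:

```python
# sketch: mix a pretraining LM loss into each RL update (the "gradient mixing" idea)
# ptx_coef and the toy policy below are illustrative assumptions, not the paper's values
import torch

def mixed_rl_update(policy, optimizer, rl_loss, pretrain_batch, ptx_coef=0.3):
    """One update: RL objective plus a plain LM cross-entropy on pretraining tokens."""
    input_ids = pretrain_batch[:, :-1]
    labels = pretrain_batch[:, 1:]
    logits = policy(input_ids)  # (batch, seq, vocab)
    # standard next-token loss on pretraining text, which resists forgetting
    ptx_loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
    )
    loss = rl_loss + ptx_coef * ptx_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# tiny end-to-end example with a toy policy so the sketch actually runs
vocab = 100
policy = torch.nn.Sequential(torch.nn.Embedding(vocab, 32), torch.nn.Linear(32, vocab))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-4)
rl_loss = torch.tensor(0.0, requires_grad=True)  # stand-in for the real RL objective
pretrain_batch = torch.randint(0, vocab, (4, 16))
mixed_rl_update(policy, optimizer, rl_loss, pretrain_batch)
```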
yet no one is doing this anymore. sure, it's one particular instantiation (gradient mixing) of a broader idea (avoiding forgetting), but it seems like a greatly overlooked line of thinking as we do more and more steps of RL
for example, see the recent ProRL paper. they're doing over 1000 steps of GRPO now with a non-trivial learning rate and no penalty for deviating from the original model. the circuits built inside the model during pretraining are surely starting to decay. and if not, they will after 10k or 100k RL steps
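for context, the usual guardrail in RLHF-style setups is a KL penalty against a frozen copy of the original model, folded into the reward. a rough sketch of that term, with a made-up coefficient and shapes:

```python
# sketch: penalize the reward by KL(policy || frozen reference) over the sampled tokens
# kl_coef and the (seq, vocab) logit shapes are assumptions for illustration
import torch
import torch.nn.functional as F

def kl_penalized_reward(reward, policy_logits, ref_logits, kl_coef=0.05):
    """Subtract a KL-to-reference penalty from the scalar RL reward."""
    policy_logprobs = F.log_softmax(policy_logits, dim=-1)  # (seq, vocab)
    ref_logprobs = F.log_softmax(ref_logits, dim=-1)
    # KL divergence per token distribution, summed over the sequence
    kl = (policy_logprobs.exp() * (policy_logprobs - ref_logprobs)).sum(-1).sum()
    return reward - kl_coef * kl

# usage with random stand-in logits
seq, vocab = 16, 100
shaped = kl_penalized_reward(1.0, torch.randn(seq, vocab), torch.randn(seq, vocab))
```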
i suspect this idea will come back around eventually; they're probably already doing this at the big labs


