That a *second* paper dropped with tons of RL flywheel secrets and *multimodal* o1-style reasoning was not on my bingo card today. Kimi's (another startup's) and DeepSeek's papers remarkably converged on similar findings:

> No need for complex tree search like MCTS. Just linearize the thought trace and do good old autoregressive prediction;
> No need for value functions that require another expensive copy of the model (a minimal sketch of this value-free setup follows below);
> No need for dense reward modeling. Rely as much as possible on the ground-truth end result.

Differences:

> DeepSeek takes the AlphaZero approach: bootstrap purely through RL without human input, i.e. "cold start". Kimi takes the AlphaGo-Master approach: light SFT to warm up through prompt-engineered CoT traces.
> DeepSeek's weights are MIT-licensed (thought leadership!); Kimi does not have a model release yet.
> Kimi shows strong multimodal performance (!) on benchmarks like MathVista, which require visual understanding of geometry, IQ tests, etc.
> The Kimi paper has a LOT more detail on system design (RL infrastructure, hybrid cluster, code sandbox, parallelism strategies) and on learning details (long context, CoT compression, curriculum, sampling strategy, test case generation, etc.).

Upbeat reads on a holiday!
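The "no value function" and "ground-truth reward" points combine into the GRPO-style recipe DeepSeek uses: sample a group of completions per prompt, score each only by whether the final answer is verifiably correct, and baseline each sample against its group mean instead of a learned critic. A minimal sketch below (PyTorch; the function name and shapes are my own illustration, not from either paper):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Advantage of each sampled completion relative to its group,
    replacing the learned value network (a second copy of the model)."""
    # rewards: (num_prompts, samples_per_prompt); e.g. 1.0 if the
    # final answer matches ground truth, else 0.0 -- a pure outcome
    # reward, no dense per-step reward model.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-6)

# Example: 4 sampled CoT traces for one math prompt; traces 0 and 3
# reached the verifiably correct final answer.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))
# tensor([[ 0.8660, -0.8660, -0.8660,  0.8660]])
```

Correct traces get positive advantage and wrong ones negative, so the policy gradient pushes probability mass toward linearized thought traces that end in right answers, with no critic and no per-step reward shaping.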
Whitepaper link: