How LLMs train LLMs, clearly explained (with visuals):
LLMs learn not only from raw text but also from other models.
Google’s Gemma 2 and 3, for example, were distilled from the larger Gemini model.
Today we cover the three most common knowledge-distillation methods.
Let's dive in! 🚀
1️⃣ Soft-label Distillation
Generate token-level softmax probabilities over the entire corpus using:
- A frozen, pre-trained Teacher LLM
- An untrained Student LLM
Train the Student LLM to match the Teacher's probabilities.
Check this out👇
In soft-label distillation, having access to the Teacher's probabilities ensures maximum knowledge transfer.
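To make it concrete, here's a rough PyTorch-style sketch of a soft-label distillation loss. The model interfaces and the temperature are illustrative assumptions, not tied to any specific model mentioned above:

```python
import torch
import torch.nn.functional as F

# Rough sketch of a soft-label distillation loss, assuming `teacher` and
# `student` are causal LMs that return logits of shape [batch, seq_len, vocab].
# The temperature is a common (assumed) addition that softens both distributions.
def soft_label_distill_loss(teacher, student, input_ids, temperature=2.0):
    with torch.no_grad():          # Teacher is frozen
        teacher_logits = teacher(input_ids)
    student_logits = student(input_ids)
    vocab = student_logits.size(-1)

    # Match the Student's distribution to the Teacher's via KL divergence.
    t_probs = F.softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    return F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature**2
```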
However, to obtain the probability distribution, you must have access to the Teacher’s weights.
Even with access, another challenge arises...
Say your vocabulary has 100k tokens and your dataset has 5 trillion tokens.
Storing softmax probabilities over the entire vocabulary for every input token would need roughly 500 million GB of memory, even at fp8 precision.
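Quick back-of-the-envelope check on that number:

```python
# Numbers from above: 100k vocab, 5 trillion tokens, 1 byte per fp8 value.
vocab_size = 100_000
num_tokens = 5 * 10**12
bytes_total = vocab_size * num_tokens * 1
print(f"{bytes_total / 10**9:,.0f} GB")   # -> 500,000,000 GB
```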
This is where we jump to our second technique ...👇
2️⃣ Hard-label distillation
- Use the Teacher LLM to generate the output token.
- Get the softmax probabilities from the Student LLM.
- Train the Student to match the Teacher's output.
DeepSeek-R1 was distilled into Qwen & Llama using this technique.
Check this visual 👇
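For reference, a rough sketch of the hard-label version, under the same illustrative assumptions as before:

```python
import torch
import torch.nn.functional as F

# Rough sketch of hard-label distillation: the Teacher contributes only its
# argmax token per position, and the Student is trained with plain
# cross-entropy on those tokens (same [batch, seq_len, vocab] logits assumption).
def hard_label_distill_loss(teacher, student, input_ids):
    with torch.no_grad():
        teacher_tokens = teacher(input_ids).argmax(dim=-1)    # [batch, seq_len]
    student_logits = student(input_ids)                       # [batch, seq_len, vocab]
    vocab = student_logits.size(-1)
    return F.cross_entropy(student_logits.reshape(-1, vocab), teacher_tokens.reshape(-1))
```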
3️⃣ Co-distillation
- Start with an untrained Teacher and Student LLM.
- Generate softmax probs over the current batch from both models.
- Train the Teacher LLM on the hard labels.
- Train the Student LLM to match softmax probs of the Teacher.
Check this visual 👇
Meta used co-distillation to train Llama 4 Scout and Maverick from Llama 4 Behemoth.
Of course, during the initial stages, the Teacher LLM's soft labels won't be accurate.
That is why the Student LLM is trained on both the Teacher's soft labels and the ground-truth hard labels.
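As a rough sketch, one co-distillation training step could look like this (the mixing weight `alpha` is an illustrative assumption):

```python
import torch
import torch.nn.functional as F

# Rough sketch of co-distillation: Teacher and Student train together.
# The Teacher learns from ground-truth labels; the Student learns from a mix of
# the Teacher's soft labels and the same ground-truth labels.
def co_distill_step(teacher, student, input_ids, labels, temperature=2.0, alpha=0.5):
    teacher_logits = teacher(input_ids)
    student_logits = student(input_ids)
    vocab = teacher_logits.size(-1)

    # Teacher: plain cross-entropy on the ground-truth hard labels.
    teacher_loss = F.cross_entropy(teacher_logits.reshape(-1, vocab), labels.reshape(-1))

    # Student: KL to the Teacher's (detached) soft labels + cross-entropy on ground truth.
    t_probs = F.softmax(teacher_logits.detach() / temperature, dim=-1).reshape(-1, vocab)
    s_log_probs = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    kl = F.kl_div(s_log_probs, t_probs, reduction="batchmean") * temperature**2
    ce = F.cross_entropy(student_logits.reshape(-1, vocab), labels.reshape(-1))
    student_loss = alpha * kl + (1 - alpha) * ce

    return teacher_loss, student_loss
```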
Those were the three techniques to train one LLM using another.
We discussed:
- Soft-label distillation
- Hard-label distillation
- Co-distillation
Here's the visual again for your reference 👇
That's a wrap!
If you found it insightful, reshare with your network.
Find me → @akshay_pachaar ✔️
For more insights and tutorials on LLMs, AI Agents, and Machine Learning!
