will brown
reward hacking @primeintellect
will brown reposted
Recipe to post-train Qwen3 1.7B into a DeepResearch model
What does it mean for something small to think deeply? Meet Lucy, a Qwen3‑1.7B post‑trained as a DeepResearch model and built on @willccbb's verifiers.
Primary Rule-based Rewards:
- Answer correctness
We check whether the final response literally contains the ground-truth answer. This substring match is cheap and avoids calling a larger LLM judge.
- Visit/search ratio
If the agent visits at least as many pages as it issues search queries, it receives ((visit_search_ratio - 1) / 4) ** 0.25. If it searches more than it visits, the score is -0.5.
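The two primary rewards above can be sketched in a few lines of Python. This is an illustration, not Lucy's actual code: the function names, the 1.0/0.0 scale for correctness, and the handling of an empty ground truth or zero searches are all assumptions, and the post does not say whether the ratio reward is clamped when the ratio exceeds 5.

```python
def answer_correctness(response: str, ground_truth: str) -> float:
    """Score 1.0 if the response literally contains the ground-truth answer."""
    # Assumption: an empty ground truth never counts as a match.
    if not ground_truth:
        return 0.0
    return 1.0 if ground_truth in response else 0.0


def visit_search_reward(num_visits: int, num_searches: int) -> float:
    """Reward visiting at least as many pages as search queries issued."""
    # Assumption: with no searches the ratio is undefined, so score 0.0.
    if num_searches == 0:
        return 0.0
    # Searching more than visiting gets the fixed penalty from the post.
    if num_searches > num_visits:
        return -0.5
    visit_search_ratio = num_visits / num_searches
    return ((visit_search_ratio - 1) / 4) ** 0.25
```

Under this reading, one visit per search scores 0.0 and five visits per search scores 1.0, so the reward only starts paying off once the agent reads more pages than it searches for.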
Format / Anti Reward-Hacking Rewards:
- Tool execution success
Each API call that returns without an error counts. The reward is (successful_calls * unique_tools_used) / total_call_attempts.
- Thinking efficiency
A skew-normal penalty centered at 70 tokens discourages endless chain-of-thought between tool calls while still allowing enough tokens for planning.
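A rough sketch of the two format rewards follows. The tool-success formula is taken directly from the post. The thinking-efficiency curve is only one possible reading: the post gives the 70-token center but not the scale or skew, so `scale` and `shape` below are hypothetical values, and treating 70 as the skew-normal location parameter is an assumption. The score is an unnormalized skew-normal shape that equals 1.0 at the location.

```python
import math


def tool_success_reward(successful_calls: int, unique_tools_used: int,
                        total_call_attempts: int) -> float:
    """(successful_calls * unique_tools_used) / total_call_attempts."""
    # Assumption: an episode with no call attempts scores 0.0.
    if total_call_attempts == 0:
        return 0.0
    return (successful_calls * unique_tools_used) / total_call_attempts


def thinking_efficiency(num_tokens: int, location: float = 70.0,
                        scale: float = 30.0, shape: float = -2.0) -> float:
    """Unnormalized skew-normal score over thinking tokens between tool calls.

    `scale` and `shape` are hypothetical; only the 70-token center is
    stated in the post.
    """
    z = (num_tokens - location) / scale
    gaussian = math.exp(-0.5 * z * z)                            # bell curve
    skew = 0.5 * (1.0 + math.erf(shape * z / math.sqrt(2.0)))    # normal CDF
    return 2.0 * gaussian * skew
```

With these placeholder parameters the score falls off quickly once thinking runs well past 70 tokens, which matches the stated goal of discouraging long chains of thought. The real parameters may differ.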
This is how Qwen3 1.7B learned to search, visit, and synthesize information. Small models can do deep research too!