Recipe to post-train Qwen3 1.7B into a DeepResearch model

What does it mean for something small to think deeply? Meet Lucy: Qwen3-1.7B post-trained into a DeepResearch model, built on @willccbb's verifiers.

Primary Rule-based Rewards:

- Answer correctness: we check whether the final response literally contains the ground-truth answer. This substring match is cheap and avoids calling a larger LLM judge.
- Visit/search ratio: if the agent visits at least as many pages as it issues search queries, it receives ((visit_search_ratio - 1) / 4) ** 0.25; if it searches more than it visits, the score is -0.5.

Format / Anti-Reward-Hacking Rewards:

- Tool execution success: each API call that returns without an error counts as a success. The reward is (successful_calls * unique_tools_used) / total_call_attempts.
- Thinking efficiency: a skew-normal penalty centered at 70 tokens discourages endless chain-of-thought between tool calls while still allowing enough tokens for planning.

Minimal sketches of each reward function are given below.

This is how Qwen3 1.7B learned to search, visit, and synthesize information. Small models can do deep research too!
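A minimal sketch of the two primary rewards, assuming the rollout exposes the final response string plus counters for search and visit tool calls; the function names and the zero-search fallback are illustrative, not the actual Lucy training code.

```python
def answer_correctness_reward(final_response: str, ground_truth: str) -> float:
    """1.0 if the response literally contains the ground-truth answer, else 0.0.

    A plain substring match: cheap and deterministic, with no LLM judge call.
    """
    return 1.0 if ground_truth in final_response else 0.0


def visit_search_reward(num_visits: int, num_searches: int) -> float:
    """Reward actually reading pages over spamming search queries.

    If visits >= searches: ((ratio - 1) / 4) ** 0.25, a slowly growing bonus
    (0.0 at ratio 1, 1.0 at ratio 5). Otherwise a flat -0.5 penalty.
    """
    if num_searches == 0:
        return 0.0  # assumption: a rollout with no searches scores neutral
    ratio = num_visits / num_searches
    return ((ratio - 1.0) / 4.0) ** 0.25 if ratio >= 1.0 else -0.5
```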
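The tool-execution reward is a single expression; here is a sketch under the assumption that each call attempt is logged as a (tool_name, succeeded) pair. The post does not say whether "unique tools used" counts attempted or only successful tools; attempted is assumed here.

```python
def tool_execution_reward(call_log: list[tuple[str, bool]]) -> float:
    """Compute (successful_calls * unique_tools_used) / total_call_attempts.

    call_log holds one (tool_name, succeeded) entry per API call attempt.
    """
    if not call_log:
        return 0.0  # assumption: a rollout with no tool calls scores zero
    successful = sum(ok for _, ok in call_log)
    unique_tools = len({name for name, _ in call_log})
    return successful * unique_tools / len(call_log)
```

Multiplying by the number of unique tools presumably pushes the agent to exercise both search and visit rather than farming error-free calls on a single easy tool.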
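For the thinking-efficiency term, the post only specifies the family (skew-normal) and the center (70 tokens). The shape parameters below are guesses chosen so the mode lands near 70; the post frames it as a penalty, but the same curve can be used directly as a score that peaks at the target length, as sketched here.

```python
import numpy as np
from scipy.stats import skewnorm

# Assumed parameters: skewness, location, scale picked so the mode of the
# distribution sits near 70 tokens. Not the tuned values from the recipe.
A, LOC, SCALE = 4.0, 52.0, 45.0

# Normalize so the score peaks at 1.0 at the mode.
_PEAK = skewnorm.pdf(np.arange(0, 1000), A, LOC, SCALE).max()


def thinking_efficiency_reward(num_think_tokens: int) -> float:
    """Score the chain-of-thought length between two tool calls.

    Highest near ~70 tokens (enough to plan the next call), tapering off
    for both empty thinking and endless rambling.
    """
    return float(skewnorm.pdf(num_think_tokens, A, LOC, SCALE) / _PEAK)
```

The right skew makes the score fall off more gently above the peak than below it, so somewhat longer planning is tolerated while very long chains are strongly discouraged.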