Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities


🧠 Overview

DGPO (Distillation-Guided Policy Optimization) is a reinforcement learning framework with integrated knowledge distillation, designed to enable agentic search behaviors in compact language models.

While RL works well for large models, compact models suffer from:

  • ❌ Poor initial outputs
  • ❌ Training collapse in RL
  • ❌ Ineffective exploration

DGPO solves this by combining:

  • ✅ Cold-start knowledge distillation (KD)
  • ✅ Teacher-guided reinforcement learning

This enables stable learning and even allows compact models to match or surpass their teacher models.


βš™οΈ Key Idea

πŸ” Distillation-Guided RL

DGPO introduces a simple but powerful principle:

✅ Reward when the answer is correct; ❌ mimic the teacher when it is wrong.

This creates a stable learning signal even when the model is weak.


πŸ—οΈ Framework

1. Cold-Start Initialization (KD)

  • Train student using teacher-generated outputs (TGO)
  • Provides high-quality trajectories
  • Prevents early collapse
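
Conceptually, the cold-start stage amounts to supervised training on teacher-generated trajectories. The sketch below is illustrative only; the function name and data layout are assumptions, not the paper's implementation:

```python
import math

def kd_cold_start_loss(student_logprobs, teacher_token_ids):
    """Mean negative log-likelihood of a teacher-generated
    trajectory under the student (hypothetical sketch)."""
    # student_logprobs[t][tok] = student log-prob of token `tok` at step t
    nll = -sum(student_logprobs[t][tok]
               for t, tok in enumerate(teacher_token_ids))
    return nll / len(teacher_token_ids)

# Toy example: a 2-step teacher trajectory over a 2-token vocabulary
logprobs = [
    {0: math.log(0.5), 1: math.log(0.5)},
    {0: math.log(0.9), 1: math.log(0.1)},
]
loss = kd_cold_start_loss(logprobs, [0, 0])  # teacher emitted tokens 0, 0
```

Minimizing this loss pushes the student toward the teacher's high-quality rollouts before any RL begins, which is what prevents early collapse.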

2. Distillation-Guided RL

  • Use PPO-based RL
  • Reward correct answers
  • Apply selective KL penalty only when wrong

This enables:

  • Stable training
  • Efficient exploration
  • Error-focused learning
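
The selective signal can be sketched as follows; the scalar reward, the `beta` coefficient, and the distribution layout are illustrative assumptions, not the paper's exact objective:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def dgpo_signal(is_correct, student_dists, teacher_dists, beta=0.1):
    """Hypothetical per-rollout training signal: task reward when the
    answer is correct, selective KL penalty toward the teacher when wrong."""
    if is_correct:
        return 1.0  # reward correct answers; no distillation term applied
    # wrong answer: pull the student's token distributions toward the teacher
    kls = [kl_divergence(s, t) for s, t in zip(student_dists, teacher_dists)]
    return -beta * sum(kls) / len(kls)

student = [[0.9, 0.1], [0.8, 0.2]]
teacher = [[0.1, 0.9], [0.2, 0.8]]
print(dgpo_signal(True, student, teacher))   # 1.0
print(dgpo_signal(False, student, teacher))  # negative: mimic the teacher
```

Because the KL penalty fires only on failed rollouts, correct behavior is reinforced freely while errors are corrected toward the teacher, which is what keeps training stable for a weak student.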

πŸ” Agentic RAG Behavior

DGPO trains models to perform multi-step search reasoning:

```
<think> reasoning </think>
<search> query </search>
<information> retrieved docs </information>
<answer> final answer </answer>
```
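
A rollout in this tag format can be parsed back into ordered steps with a small helper; this parser is an illustrative sketch, not part of the released code:

```python
import re

# Matches each tagged step; the backreference \1 requires matching close tags
STEP_RE = re.compile(r"<(think|search|information|answer)>(.*?)</\1>", re.DOTALL)

def parse_trajectory(text):
    """Return the agentic RAG steps as (tag, content) pairs in order."""
    return [(m.group(1), m.group(2).strip()) for m in STEP_RE.finditer(text)]

rollout = (
    "<think> need the capital of France </think>"
    "<search> capital of France </search>"
    "<information> Paris is the capital of France. </information>"
    "<answer> Paris </answer>"
)
print(parse_trajectory(rollout))
```

In a multi-step episode the `<think>`/`<search>` pair can repeat several times before the final `<answer>`, with each `<information>` block filled in by the retriever between generation turns.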

🚀 Performance

Overall QA Performance

📊 Qwen2.5 (3B → 0.5B)

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| Student-0.5B | 0.004 | 0.006 | 0.007 | 0.007 | 0.015 | 0.000 | 0.000 | 0.006 |
| Teacher-3B | 0.365 | 0.569 | 0.393 | 0.340 | 0.368 | 0.135 | 0.298 | 0.353 |
| PPO | 0.306 | 0.444 | 0.379 | 0.205 | 0.218 | 0.041 | 0.073 | 0.238 |
| GKD | 0.266 | 0.408 | 0.358 | 0.216 | 0.217 | 0.055 | 0.161 | 0.240 |
| SeqKD | 0.331 | 0.416 | 0.364 | 0.283 | 0.273 | 0.089 | 0.169 | 0.275 |
| KD | 0.331 | 0.431 | 0.373 | 0.286 | 0.284 | 0.091 | 0.290 | 0.298 |
| DistiLLM | 0.333 | 0.442 | 0.373 | 0.288 | 0.270 | 0.095 | 0.209 | 0.287 |
| TAID | 0.325 | 0.427 | 0.365 | 0.290 | 0.270 | 0.079 | 0.218 | 0.282 |
| DGPO (ours) | **0.378** | **0.481** | **0.402** | **0.342** | **0.303** | **0.120** | 0.274 | **0.329** |

👉 DGPO achieves a ~55× average improvement over the base student model

👉 On some benchmarks, the student even surpasses its teacher


🎓 Citation

@article{kotoge2025dgpo,
  title={Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization},
  author={Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin},
  journal={arXiv preprint arXiv:2508.20324},
  year={2025}
}

@inproceedings{kotoge2025democratizing,
  title={Democratizing Agentic {RAG}: Distillation-Guided Policy Optimization for Compact Language Models},
  author={Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin},
  booktitle={NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning},
  year={2025},
  url={https://openreview.net/forum?id=CP0H9NAWES}
}

🤝 Acknowledgements

  • Search-R1 (agentic RL baseline)
  • Open models (Qwen & Llama)
  • Open QA benchmarks (NQ, HotpotQA, etc.)

DGPO makes agentic RAG accessible to small models 🚀
