Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization for Preserving Agentic RAG Capabilities


🧠 Overview

DGPO (Distillation-Guided Policy Optimization) is a reinforcement learning framework with integrated knowledge distillation, designed to enable agentic search behaviors in compact language models.

While RL works well for large models, compact models suffer from:

  • ❌ Poor initial outputs
  • ❌ Training collapse in RL
  • ❌ Ineffective exploration

DGPO solves this by combining:

  • ✅ Cold-start knowledge distillation (KD)
  • ✅ Teacher-guided reinforcement learning

This enables stable learning and even allows compact models to match or surpass their teacher models.


βš™οΈ Key Idea

πŸ” Distillation-Guided RL

DGPO introduces a simple but powerful principle:

✅ Reward when the answer is correct; ❌ mimic the teacher when it is wrong.

This creates a stable learning signal even when the model is weak.


πŸ—οΈ Framework

1. Cold-Start Initialization (KD)

  • Train student using teacher-generated outputs (TGO)
  • Provides high-quality trajectories
  • Prevents early collapse
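
Conceptually, the cold-start stage amounts to supervised training on teacher-generated trajectories. The sketch below is illustrative only; the function name and data layout are assumptions, not the paper's implementation:

```python
import math

def kd_cold_start_loss(student_logprobs, teacher_token_ids):
    """Mean negative log-likelihood of a teacher-generated
    trajectory under the student (hypothetical sketch)."""
    # student_logprobs[t][tok] = student log-prob of token `tok` at step t
    nll = -sum(student_logprobs[t][tok]
               for t, tok in enumerate(teacher_token_ids))
    return nll / len(teacher_token_ids)

# Toy example: a 2-step teacher trajectory over a 2-token vocabulary
logprobs = [
    {0: math.log(0.5), 1: math.log(0.5)},
    {0: math.log(0.9), 1: math.log(0.1)},
]
loss = kd_cold_start_loss(logprobs, [0, 0])  # teacher emitted tokens 0, 0
```

Minimizing this loss pushes the student toward the teacher's high-quality rollouts before any RL begins, which is what prevents early collapse.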

2. Distillation-Guided RL

  • Use PPO-based RL
  • Reward correct answers
  • Apply selective KL penalty only when wrong

This enables:

  • Stable training
  • Efficient exploration
  • Error-focused learning
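
The selective signal can be sketched as follows; the scalar reward, the `beta` coefficient, and the distribution layout are illustrative assumptions, not the paper's exact objective:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps))
               for pi, qi in zip(p, q))

def dgpo_signal(is_correct, student_dists, teacher_dists, beta=0.1):
    """Hypothetical per-rollout training signal: task reward when the
    answer is correct, selective KL penalty toward the teacher when wrong."""
    if is_correct:
        return 1.0  # reward correct answers; no distillation term applied
    # wrong answer: pull the student's token distributions toward the teacher
    kls = [kl_divergence(s, t) for s, t in zip(student_dists, teacher_dists)]
    return -beta * sum(kls) / len(kls)

student = [[0.9, 0.1], [0.8, 0.2]]
teacher = [[0.1, 0.9], [0.2, 0.8]]
print(dgpo_signal(True, student, teacher))   # 1.0
print(dgpo_signal(False, student, teacher))  # negative: mimic the teacher
```

Because the KL penalty fires only on failed rollouts, correct behavior is reinforced freely while errors are corrected toward the teacher, which is what keeps training stable for a weak student.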

πŸ” Agentic RAG Behavior

DGPO trains models to perform multi-step search reasoning:

```
<think> reasoning </think>
<search> query </search>
<information> retrieved docs </information>
<answer> final answer </answer>
```
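
A rollout in this tag format can be parsed back into ordered steps with a small helper; this parser is an illustrative sketch, not part of the released code:

```python
import re

# Matches each tagged step; the backreference \1 requires matching close tags
STEP_RE = re.compile(r"<(think|search|information|answer)>(.*?)</\1>", re.DOTALL)

def parse_trajectory(text):
    """Return the agentic RAG steps as (tag, content) pairs in order."""
    return [(m.group(1), m.group(2).strip()) for m in STEP_RE.finditer(text)]

rollout = (
    "<think> need the capital of France </think>"
    "<search> capital of France </search>"
    "<information> Paris is the capital of France. </information>"
    "<answer> Paris </answer>"
)
print(parse_trajectory(rollout))
```

In a multi-step episode the `<think>`/`<search>` pair can repeat several times before the final `<answer>`, with each `<information>` block filled in by the retriever between generation turns.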

🚀 Performance

Overall QA Performance

📊 Qwen2.5 (3B → 0.5B)

| Method | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | MuSiQue | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| Student-0.5B | 0.004 | 0.006 | 0.007 | 0.007 | 0.015 | 0.000 | 0.000 | 0.006 |
| Teacher-3B | 0.365 | 0.569 | 0.393 | 0.340 | 0.368 | 0.135 | 0.298 | 0.353 |
| PPO | 0.306 | 0.444 | 0.379 | 0.205 | 0.218 | 0.041 | 0.073 | 0.238 |
| GKD | 0.266 | 0.408 | 0.358 | 0.216 | 0.217 | 0.055 | 0.161 | 0.240 |
| SeqKD | 0.331 | 0.416 | 0.364 | 0.283 | 0.273 | 0.089 | 0.169 | 0.275 |
| KD | 0.331 | 0.431 | 0.373 | 0.286 | 0.284 | 0.091 | 0.290 | 0.298 |
| DistiLLM | 0.333 | 0.442 | 0.373 | 0.288 | 0.270 | 0.095 | 0.209 | 0.287 |
| TAID | 0.325 | 0.427 | 0.365 | 0.290 | 0.270 | 0.079 | 0.218 | 0.282 |
| DGPO (ours) | **0.378** | **0.481** | **0.402** | **0.342** | **0.303** | **0.120** | 0.274 | **0.329** |

👉 DGPO achieves a ~55× average improvement over the base student model

👉 On some benchmarks, the student even surpasses its teacher


🎓 Citation

@article{kotoge2025dgpo,
  title={Can Compact Language Models Search Like Agents? Distillation-Guided Policy Optimization},
  author={Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin},
  journal={arXiv preprint arXiv:2508.20324},
  year={2025}
}

@inproceedings{kotoge2025democratizing,
  title={Democratizing Agentic {RAG}: Distillation-Guided Policy Optimization for Compact Language Models},
  author={Kotoge, Rikuto and Nishimura, Mai and Ma, Jiaxin},
  booktitle={NeurIPS 2025 Workshop on Bridging Language, Agent, and World Models for Reasoning and Planning},
  year={2025},
  url={https://openreview.net/forum?id=CP0H9NAWES}
}

🤝 Acknowledgements

  • Search-R1 (agentic RL baseline)
  • Open models (Qwen & Llama)
  • Open QA benchmarks (NQ, HotpotQA, etc.)

DGPO makes agentic RAG accessible to small models 🚀
