arxiv:2605.10781

Rebellious Student: Reversing Teacher Signals for Reasoning Exploration with Self-Distilled RLVR

Published on May 11
· Submitted by JeonghyeKim on May 12
Abstract

RLRT enhances self-distillation by reinforcing successful student decisions that deviate from teacher predictions, enabling more effective exploration in reinforcement learning via self-reward.

AI-generated summary

Self-distillation has emerged as a powerful framework for post-training LLMs, where a teacher conditioned on extra information guides a student without it, both derived from the same model. While this guidance is useful when the student has failed, on successful rollouts the same mechanism instead overwrites the student's choices and suppresses its own reasoning. Therefore, we propose reading the original self-distillation signal in reverse: when the student succeeds along a path the teacher would not have predicted, these tokens reflect its self-driven reasoning. Building on this, we propose RLRT (RLVR with Reversed Teacher), which augments GRPO by reinforcing these tokens on correct rollouts. We interpret this as a new form of exploration in RLVR: not uniform diversity, but valuable exploration grounded in the student's own success. Across base, instruction-tuned, and thinking-tuned Qwen3 checkpoints, RLRT substantially outperforms self-distillation and exploration-based baselines, establishing information asymmetry as a new, principled design axis for RLVR.

Community

Paper submitter

Enough with the obedient student. Time to rebel 🧑‍🎓⚡

So far, on-policy self-distillation has pulled the student toward the teacher. But what if we force the student to follow the teacher even on paths it already got right? → Its own reasoning gets erased 🧠💨

We introduce RLRT (RLVR with Reversed Teacher) 🔄. Instead of pulling the student toward the teacher, we amplify the tokens where the student diverged from the teacher (who has seen a correct solution) and still reached the correct answer. These tokens depart from one correct path yet remain verified, making them both self-driven and valuable exploration. 🌟
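The reversed-teacher idea above can be sketched as a token-level reweighting applied on top of a GRPO-style advantage: on verified-correct rollouts, upweight the tokens the teacher would not have predicted. This is a minimal illustrative sketch, not the paper's exact formulation — the divergence criterion (student log-prob exceeding the teacher's by a margin `tau`), the additive `bonus`, and the function name are all assumptions.

```python
import torch


def rlrt_token_weights(student_logprobs: torch.Tensor,
                       teacher_logprobs: torch.Tensor,
                       is_correct: bool,
                       tau: float = 0.0,
                       bonus: float = 1.0) -> torch.Tensor:
    """Per-token weights for scaling a GRPO-style advantage.

    student_logprobs, teacher_logprobs: shape (seq_len,), log-probs of
    the tokens the student actually sampled, scored under each policy.
    is_correct: whether the rollout's final answer was verified correct.
    """
    # Default: leave every token's advantage unchanged.
    weights = torch.ones_like(student_logprobs)
    if is_correct:
        # "Reversed teacher" signal: tokens the student chose but the
        # teacher (despite seeing a correct solution) rated much lower.
        diverged = (student_logprobs - teacher_logprobs) > tau
        weights = weights + bonus * diverged.float()
    return weights


# Example: the 1st and 3rd tokens diverge from the teacher and get
# upweighted; incorrect rollouts are left untouched.
w = rlrt_token_weights(torch.tensor([-0.1, -2.0, -0.5]),
                       torch.tensor([-0.2, -0.1, -3.0]),
                       is_correct=True)
print(w)  # tensor([2., 1., 2.])
```

In a full training loop these weights would multiply the per-token GRPO advantage before the policy-gradient update; tokens where the teacher agreed with the student are trained as usual.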


Get this paper in your agent:

hf papers read 2605.10781
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
