ThinkTwice-Qwen3-4B-Instruct
This model is fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using the ThinkTwice framework.
Paper: ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement (arXiv: 2604.01591)
Code: https://github.com/CSSLab/ThinkTwice
Overview
ThinkTwice is a simple two-phase GRPO-based framework that jointly trains LLMs to (1) solve reasoning problems and (2) refine their own solutions. In each pair of training steps, the model is first optimized on solving a reasoning problem, then optimized on refining its own solution to the same problem — using the same binary correctness reward in both phases, with no correctness signals or critique annotations required.
ThinkTwice reveals an implicit rectify-then-fortify curriculum: early in training, refinement predominantly corrects errors; as the model improves, refinement naturally shifts toward preserving already-correct solutions, so the refinement phase continues to supply a useful reward signal throughout training.
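The alternating schedule described above can be sketched as follows. This is a toy illustration of the assumed structure, not the authors' implementation: `ToyModel`, `grpo_update`, and the string-equality checker are hypothetical stand-ins for the real rollout, policy-update, and answer-verification components.

```python
# Toy sketch of ThinkTwice's two-phase schedule (assumed structure, not the
# authors' code). The only supervision is a binary correctness reward, used
# identically in both phases; no critique annotations are involved.

def binary_reward(answer: str, reference: str) -> float:
    """Binary correctness reward shared by the solve and refine phases."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def grpo_update(model, samples, rewards):
    """Placeholder for a GRPO policy update on (sample, reward) pairs."""
    model.updates += 1  # stand-in: just count updates in this sketch

class ToyModel:
    """Stand-in for the policy being trained."""
    def __init__(self):
        self.updates = 0
    def solve(self, problem: str) -> str:
        return "4"       # stand-in for a sampled solution
    def refine(self, problem: str, draft: str) -> str:
        return draft     # stand-in for a refined solution

def think_twice_step(model, problem, reference, group_size=4):
    # Phase 1 (solve): sample a group of solutions, score, update.
    drafts = [model.solve(problem) for _ in range(group_size)]
    grpo_update(model, drafts, [binary_reward(d, reference) for d in drafts])
    # Phase 2 (refine): refine the model's OWN drafts on the SAME problem,
    # score them with the SAME binary reward, update again.
    refined = [model.refine(problem, d) for d in drafts]
    grpo_update(model, refined, [binary_reward(r, reference) for r in refined])
    return drafts, refined

model = ToyModel()
think_twice_step(model, "What is 2 + 2?", "4")
```

Because the refine phase is rewarded only on final correctness, the "rectify vs. fortify" behavior emerges from training dynamics rather than from any explicit label.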
Results
On AIME, ThinkTwice-Qwen3-4B-Instruct outperforms GRPO-trained Qwen3-4B:
- +5 percentage points before refinement (pass@4)
- +11.5 percentage points after one self-refinement step (pass@4)
Results span five mathematical reasoning benchmarks across two model families (Qwen3-4B and OLMo3-7B).
Usage
This model supports both direct solving and self-refinement. Use it in two passes:
- Solve: prompt the model with the problem to get an initial answer.
- Self-Refine: prompt the model with the problem + its initial solution to get a refined answer.
See the GitHub repository for full usage instructions and evaluation scripts.
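As an illustration, the two passes can be sketched as below. The prompt templates here are assumptions for illustration only; the exact formats used for training and evaluation are in the repository.

```python
# Illustrative two-pass prompting for ThinkTwice (prompt wording is an
# assumption; consult the GitHub repository for the exact templates).

def solve_messages(problem: str) -> list[dict]:
    """Pass 1 (Solve): prompt the model with the problem directly."""
    return [{"role": "user", "content": problem}]

def refine_messages(problem: str, draft: str) -> list[dict]:
    """Pass 2 (Self-Refine): show the model its own draft and ask for a
    refined final answer."""
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Your previous solution:\n{draft}\n\n"
        "Review the solution above. Fix any errors, or keep it if it is "
        "already correct, then give your final answer."
    )
    return [{"role": "user", "content": prompt}]

# With Hugging Face transformers, the two passes would look roughly like:
#   tok = AutoTokenizer.from_pretrained("<this model's repo id>")
#   model = AutoModelForCausalLM.from_pretrained("<this model's repo id>")
#   ids = tok.apply_chat_template(solve_messages(problem),
#                                 add_generation_prompt=True, return_tensors="pt")
#   draft = tok.decode(model.generate(ids, max_new_tokens=2048)[0])
#   ids = tok.apply_chat_template(refine_messages(problem, draft),
#                                 add_generation_prompt=True, return_tensors="pt")
#   final = tok.decode(model.generate(ids, max_new_tokens=2048)[0])
```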
Citation
@article{jiao2026thinktwice,
  title={ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement},
  author={Jiao, Difan and Wen, Qianfeng and Yang, Blair and Tang, Zhenwei and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.01591},
  year={2026}
}