ThinkTwice-Qwen3-4B-Instruct
This model is fine-tuned from Qwen/Qwen3-4B-Instruct-2507 using the ThinkTwice framework.
Paper: ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement (arXiv: 2604.01591)
Code: https://github.com/CSSLab/ThinkTwice
Overview
ThinkTwice is a simple two-phase GRPO-based framework that jointly trains LLMs to (1) solve reasoning problems and (2) refine their own solutions. In each pair of training steps, the model is first optimized on solving a reasoning problem, then optimized on refining its own solution to the same problem — using the same binary correctness reward in both phases, with no correctness signals or critique annotations required.
ThinkTwice reveals an implicit rectify-then-fortify curriculum: early in training, refinement predominantly corrects errors; as the model improves, refinement naturally shifts toward preserving already-correct solutions, so the refinement phase continues to supply a useful reward signal throughout training.
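The alternating schedule described above can be sketched as follows. This is a toy illustration of the assumed structure, not the authors' implementation: `ToyModel`, `grpo_update`, and the string-equality checker are hypothetical stand-ins for the real rollout, policy-update, and answer-verification components.

```python
# Toy sketch of ThinkTwice's two-phase schedule (assumed structure, not the
# authors' code). The only supervision is a binary correctness reward, used
# identically in both phases; no critique annotations are involved.

def binary_reward(answer: str, reference: str) -> float:
    """Binary correctness reward shared by the solve and refine phases."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def grpo_update(model, samples, rewards):
    """Placeholder for a GRPO policy update on (sample, reward) pairs."""
    model.updates += 1  # stand-in: just count updates in this sketch

class ToyModel:
    """Stand-in for the policy being trained."""
    def __init__(self):
        self.updates = 0
    def solve(self, problem: str) -> str:
        return "4"       # stand-in for a sampled solution
    def refine(self, problem: str, draft: str) -> str:
        return draft     # stand-in for a refined solution

def think_twice_step(model, problem, reference, group_size=4):
    # Phase 1 (solve): sample a group of solutions, score, update.
    drafts = [model.solve(problem) for _ in range(group_size)]
    grpo_update(model, drafts, [binary_reward(d, reference) for d in drafts])
    # Phase 2 (refine): refine the model's OWN drafts on the SAME problem,
    # score them with the SAME binary reward, update again.
    refined = [model.refine(problem, d) for d in drafts]
    grpo_update(model, refined, [binary_reward(r, reference) for r in refined])
    return drafts, refined

model = ToyModel()
think_twice_step(model, "What is 2 + 2?", "4")
```

Because the refine phase is rewarded only on final correctness, the "rectify vs. fortify" behavior emerges from training dynamics rather than from any explicit label.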
Results
On AIME, ThinkTwice-Qwen3-4B-Instruct outperforms GRPO-trained Qwen3-4B:
- +5 percentage points before refinement (pass@4)
- +11.5 percentage points after one self-refinement step (pass@4)
Results span five mathematical reasoning benchmarks across two model families (Qwen3-4B and OLMo3-7B).
Usage
This model supports both direct solving and self-refinement. Use it in two passes:
- Solve: prompt the model with the problem to get an initial answer.
- Self-Refine: prompt the model with the problem + its initial solution to get a refined answer.
See the GitHub repository for full usage instructions and evaluation scripts.
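As an illustration, the two passes can be sketched as below. The prompt templates here are assumptions for illustration only; the exact formats used for training and evaluation are in the repository.

```python
# Illustrative two-pass prompting for ThinkTwice (prompt wording is an
# assumption; consult the GitHub repository for the exact templates).

def solve_messages(problem: str) -> list[dict]:
    """Pass 1 (Solve): prompt the model with the problem directly."""
    return [{"role": "user", "content": problem}]

def refine_messages(problem: str, draft: str) -> list[dict]:
    """Pass 2 (Self-Refine): show the model its own draft and ask for a
    refined final answer."""
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Your previous solution:\n{draft}\n\n"
        "Review the solution above. Fix any errors, or keep it if it is "
        "already correct, then give your final answer."
    )
    return [{"role": "user", "content": prompt}]

# With Hugging Face transformers, the two passes would look roughly like:
#   tok = AutoTokenizer.from_pretrained("<this model's repo id>")
#   model = AutoModelForCausalLM.from_pretrained("<this model's repo id>")
#   ids = tok.apply_chat_template(solve_messages(problem),
#                                 add_generation_prompt=True, return_tensors="pt")
#   draft = tok.decode(model.generate(ids, max_new_tokens=2048)[0])
#   ids = tok.apply_chat_template(refine_messages(problem, draft),
#                                 add_generation_prompt=True, return_tensors="pt")
#   final = tok.decode(model.generate(ids, max_new_tokens=2048)[0])
```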
Citation
@article{jiao2026thinktwice,
  title={ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement},
  author={Jiao, Difan and Wen, Qianfeng and Yang, Blair and Tang, Zhenwei and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.01591},
  year={2026}
}