
---
language:
  - en
license: apache-2.0
base_model: Qwen/Qwen2.5-3B
tags:
  - qwen2.5
  - quiet-star
  - reasoning
  - rationale-generation
  - reinforcement-learning
  - causal-lm
datasets:
  - HuggingFaceFW/fineweb-edu
library_name: transformers
pipeline_tag: text-generation
---

# Qwen2.5-3B Quiet-STaR

An implementation of Quiet-STaR on top of Qwen2.5-3B, trained on a single NVIDIA H200 (141 GB HBM3e). The model is trained to generate internal “thoughts” at every token position before predicting the next token, and it learns which thoughts improve prediction via REINFORCE, with no task-specific supervision.


## Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B |
| Method | Quiet-STaR (REINFORCE on thought tokens) |
| Training data | FineWeb-Edu (score ≥ 3) |
| Hardware | 1× NVIDIA H200 (141 GB HBM3e) |
| Precision | bfloat16 |
| License | Apache 2.0 |

## What is Quiet-STaR?

Quiet-STaR teaches a language model to think before speaking: at every token position, the model generates a short internal rationale (`<|startthought|> ... <|endthought|>`) and uses a learned mixing head to blend the base logits with thought-augmented logits. The model is trained to up-weight thoughts that increase the probability of the correct next token, using a REINFORCE-style reward.

```
Input:    [The] [cat] [sat] [on] [the] [mat]
               ↓
         <|startthought|>
         [thought_1] ... [thought_n]
         <|endthought|>
               ↓
Mixing:   base_logits ←(1-w)→ thought_logits
               ↓
Output:   improved next-token prediction
```
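The mixing step and the REINFORCE-style reward can be sketched numerically. This is a minimal NumPy illustration with made-up logits and a scalar mixing weight; in the real model, `w` is predicted per position by the learned mixing head and the logits come from the transformer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical per-position quantities (real values come from the model).
base_logits = np.array([2.0, 0.5, -1.0])     # next-token logits without a thought
thought_logits = np.array([1.0, 3.0, -0.5])  # logits after attending to the thought
w = 0.25                                     # mixing weight from the learned head

# Blend base and thought-augmented predictions.
mixed_logits = (1.0 - w) * base_logits + w * thought_logits

# REINFORCE-style reward: did the thought raise the log-probability of the
# true next token (index 1 here) relative to the base prediction?
target = 1
reward = np.log(softmax(mixed_logits)[target]) - np.log(softmax(base_logits)[target])
# A positive reward up-weights the thought tokens that produced it.
```

Here the thought shifts probability mass toward the correct token, so the reward is positive and the rationale that generated it is reinforced.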

## Why Qwen2.5-3B?

- Fits comfortably on a single H200 with headroom for large batch sizes and sequence lengths
- Strong baseline reasoning capability
- Modern architecture: GQA · RoPE · SwiGLU

## Training Details

### Dataset

FineWeb-Edu: educational web text filtered for quality (score ≥ 3). Educational content naturally contains implicit reasoning steps (proofs, explanations, derivations), making it well-suited for Quiet-STaR training.

Alternative: `open-web-math/open-web-math` for math-focused training.
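The quality filter amounts to a simple per-document predicate. A sketch in plain Python (the `int_score` field name is an assumption about the public FineWeb-Edu schema; in practice you would pass the predicate to the `datasets` library's streaming `.filter`):

```python
def keep_educational(example, min_score=3):
    # FineWeb-Edu carries an integer quality score per document
    # ("int_score" in the public release; field name assumed here).
    return example.get("int_score", 0) >= min_score

# Illustrative rows standing in for streamed dataset examples.
docs = [
    {"text": "A proof by induction proceeds in two steps ...", "int_score": 4},
    {"text": "Limited-time offer, click here!", "int_score": 1},
]
kept = [d for d in docs if keep_educational(d)]  # only the first row survives
```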

### Key Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| `n_ahead` | 8 | Thought tokens per position (including special tokens) |
| `n_ahead_talk` | 4 | Tokens predicted after the thought |
| `n_passes` | 4 | Forward passes per training step |
| `batch_size` | 1 | Per-device batch size |
| `max_length` | 1024 | Sequence length |
| `learning_rate` | 1e-6 | Learning rate |
| `gumbel_temperature` | 1.0 | Gumbel-Softmax temperature |
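These defaults can be gathered into one config object. A sketch only: the field names mirror the table above, not necessarily the actual flags in `train.py`:

```python
from dataclasses import dataclass

@dataclass
class QuietStarTrainingArgs:
    """Illustrative config mirroring the hyperparameter table."""
    n_ahead: int = 8              # thought tokens per position, incl. special tokens
    n_ahead_talk: int = 4         # tokens predicted after the thought
    n_passes: int = 4             # forward passes per training step
    batch_size: int = 1           # per-device batch size
    max_length: int = 1024        # sequence length
    learning_rate: float = 1e-6
    gumbel_temperature: float = 1.0

args = QuietStarTrainingArgs()   # override fields as keyword arguments
```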

### Memory Budget (H200)

| Component | Memory |
|---|---|
| Model weights (bf16) | ~6 GB |
| Optimizer (AdamW) | ~18 GB |
| Activations + gradients | ~30–50 GB |
| KV cache + thought embeddings | ~5–10 GB |
| **Total estimate** | **~59–84 GB** |
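The total row is just the sum of the component estimates, a quick sanity check (rough figures from the table, not measurements):

```python
# Component estimates in GB, taken from the budget table.
weights_gb = 6                # ~3B params x 2 bytes (bf16)
optimizer_gb = 18             # AdamW state
activations_gb = (30, 50)     # low / high estimate
kv_cache_gb = (5, 10)         # low / high estimate

low = weights_gb + optimizer_gb + activations_gb[0] + kv_cache_gb[0]
high = weights_gb + optimizer_gb + activations_gb[1] + kv_cache_gb[1]
# (low, high) -> (59, 84), matching the ~59-84 GB total
```

Either way the budget leaves comfortable headroom on a 141 GB H200.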

## Usage

### Installation

```bash
git clone https://github.com/Phonsiriwillbejommarn/Qwen2.5-3b-Quiet-StaR
cd Qwen2.5-3b-Quiet-StaR
pip install -r requirements.txt
```

### Training

```bash
# Default (Qwen2.5-3B + FineWeb-Edu)
python train.py

# Custom configuration
python train.py \
    --n_ahead 8 \
    --n_ahead_talk 4 \
    --n_passes 2 \
    --batch_size 8 \
    --max_steps 100000 \
    --max_length 1024 \
    --learning_rate 1e-6

# Math-focused variant
python train.py \
    --dataset_name open-web-math/open-web-math \
    --dataset_subset default

# Disable wandb logging
python train.py --no_wandb
```

### Inference

```bash
# Interactive chat
python inference.py --model_path ./outputs/quietstar_qwen25_3b_final_XXXXX

# Single prompt
python inference.py \
    --model_path ./outputs/quietstar_qwen25_3b_final_XXXXX \
    --prompt "What is 2 + 3 * 4?"
```
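`inference.py` masks thought tokens so they never appear in the visible output. The idea can be sketched as a token-level filter over the generated sequence (simplified; assumes the two special tokens shown earlier decode as standalone strings):

```python
def strip_thoughts(tokens, start="<|startthought|>", end="<|endthought|>"):
    # Drop everything between the thought delimiters (inclusive),
    # keeping only the tokens the user should see.
    visible, in_thought = [], False
    for tok in tokens:
        if tok == start:
            in_thought = True
        elif tok == end:
            in_thought = False
        elif not in_thought:
            visible.append(tok)
    return visible

generated = ["2", "+", "3", "<|startthought|>", "multiply", "first",
             "<|endthought|>", "*", "4", "=", "14"]
print(" ".join(strip_thoughts(generated)))  # -> 2 + 3 * 4 = 14
```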

## Repository Structure

```
Qwen2.5-3b-Quiet-STaR/
├── config.py                  # QuietStarConfig (extends Qwen2Config)
├── modeling_quiet_star.py     # Core model with thought generation & mixing head
├── train.py                   # Training script (H200-optimised)
├── eval_helpers.py            # Evaluation preprocessing & metrics
├── inference.py               # Inference with thought token masking
├── requirements.txt           # Dependencies
└── README.md
```

## Citation

If you use this work, please cite the original Quiet-STaR paper:

```bibtex
@article{zelikman2024quiet,
  title={Quiet-{STaR}: Language Models Can Teach Themselves to Think Before Speaking},
  author={Zelikman, Eric and Harik, Georges and Shao, Yijia and Jayasiri, Varuna and Haber, Nick and Goodman, Noah D.},
  journal={arXiv preprint arXiv:2403.09629},
  year={2024}
}
```

## License

Apache License 2.0
