---
language:
- en
license: apache-2.0
base_model: Qwen/Qwen2.5-3B
tags:
- qwen2.5
- quiet-star
- reasoning
- rationale-generation
- reinforcement-learning
- causal-lm
datasets:
- HuggingFaceFW/fineweb-edu
library_name: transformers
pipeline_tag: text-generation
---
# Qwen2.5-3B + Quiet-STaR
An implementation of Quiet-STaR on top of Qwen2.5-3B, trained on a single NVIDIA H200 (141 GB HBM3e). The model is trained to generate internal "thoughts" at every token position before predicting the next token, and learns which thoughts improve prediction via REINFORCE, with no task-specific supervision.
## Model Details
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B |
| Method | Quiet-STaR (REINFORCE on thought tokens) |
| Training data | FineWeb-Edu (score ≥ 3) |
| Hardware | 1× NVIDIA H200 (141 GB HBM3e) |
| Precision | bfloat16 |
| License | Apache 2.0 |
## What is Quiet-STaR?
Quiet-STaR teaches a language model to think before speaking: at every token position, the model generates a short internal rationale (`<|startthought|> ... <|endthought|>`) and uses a learned mixing head to blend the base logits with thought-augmented logits. The model is trained to up-weight thoughts that increase the probability of the correct next token, using a REINFORCE-style reward.
```
Input:  [The] [cat] [sat] [on] [the] [mat]
                    |
                    v
  <|startthought|> [thought_1] ... [thought_n] <|endthought|>
                    |
                    v
Mixing: (1 - w) * base_logits + w * thought_logits
                    |
                    v
Output: improved next-token prediction
```
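The mixing step can be sketched numerically. This is a minimal illustration of the logit interpolation only, not the repo's implementation: the learned mixing head that produces the weight `w` is replaced here by a fixed scalar.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mix_logits(base_logits, thought_logits, w):
    """Interpolate base and thought-augmented logits with weight w in [0, 1]."""
    return (1.0 - w) * base_logits + w * thought_logits

# Toy vocabulary of 4 tokens.
base = np.array([2.0, 1.0, 0.5, 0.1])     # base model favors token 0
thought = np.array([0.5, 3.0, 0.2, 0.1])  # the thought shifts mass to token 1
mixed = mix_logits(base, thought, w=0.8)
probs = softmax(mixed)
print(probs.argmax())  # 1 — the thought-augmented prediction wins
```

In training, `w` comes from the mixing head and the REINFORCE reward decides whether a given thought was worth generating.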
## Why Qwen2.5-3B?
- Fits comfortably on a single H200 with headroom for large batch sizes and sequence lengths
- Strong baseline reasoning capability
- Modern architecture: GQA, RoPE, SwiGLU
## Training Details

### Dataset
FineWeb-Edu: educational web text filtered for quality (score ≥ 3). Educational content naturally contains implicit reasoning steps (proofs, explanations, derivations), making it well-suited for Quiet-STaR training.
Alternative: `open-web-math/open-web-math` for math-focused training.
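The score filter amounts to a simple predicate over dataset records. The field name `score` is an assumption here (FineWeb-Edu ships a model-based quality annotation; check the column names of the split you actually load):

```python
def keep_example(example, min_score=3):
    # Assumed field name: FineWeb-Edu's quality annotation under "score".
    return example.get("score", 0) >= min_score

docs = [
    {"text": "A proof that sqrt(2) is irrational ...", "score": 4.2},
    {"text": "Buy cheap widgets now!!!", "score": 0.7},
]
filtered = [d for d in docs if keep_example(d)]
print(len(filtered))  # 1
```

With the `datasets` library, the same predicate can be passed to `dataset.filter(keep_example)` on a (streaming) split.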
### Key Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| `n_ahead` | 8 | Thought tokens per position (including special tokens) |
| `n_ahead_talk` | 4 | Tokens predicted after the thought |
| `n_passes` | 4 | Forward passes per training step |
| `batch_size` | 1 | Per-device batch size |
| `max_length` | 1024 | Sequence length |
| `learning_rate` | 1e-6 | Learning rate |
| `gumbel_temperature` | 1.0 | Gumbel-Softmax temperature |
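For reference, the defaults above could be collected in a single config object. This is a hypothetical dataclass mirroring the CLI flags, not the repo's actual `QuietStarConfig`:

```python
from dataclasses import dataclass

@dataclass
class QuietStarTrainingConfig:
    # Field names mirror the train.py CLI flags; defaults follow the table above.
    n_ahead: int = 8               # thought tokens per position (incl. special tokens)
    n_ahead_talk: int = 4          # tokens predicted after the thought
    n_passes: int = 4              # forward passes per training step
    batch_size: int = 1            # per-device batch size
    max_length: int = 1024         # sequence length
    learning_rate: float = 1e-6
    gumbel_temperature: float = 1.0

cfg = QuietStarTrainingConfig(batch_size=8)
print(cfg.batch_size, cfg.n_ahead)  # 8 8
```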
### Memory Budget (H200)
| Component | Memory |
|---|---|
| Model weights (bf16) | ~6 GB |
| Optimizer (AdamW) | ~18 GB |
| Activations + gradients | ~30β50 GB |
| KV cache + thought embeddings | ~5β10 GB |
| Total estimate | ~59β84 GB |
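A quick back-of-envelope check of the totals, assuming roughly 3e9 parameters at 2 bytes each for the bf16 weights and taking the other rows from the table as given:

```python
# All figures in GB; ranges are (low, high) from the table above.
params = 3.0e9                     # approximate Qwen2.5-3B parameter count
weights_gb = params * 2 / 1e9      # bf16: 2 bytes/param -> ~6 GB
optimizer_gb = 18                  # AdamW state, table estimate
activations = (30, 50)             # activations + gradients
kv_cache = (5, 10)                 # KV cache + thought embeddings

low = round(weights_gb) + optimizer_gb + activations[0] + kv_cache[0]
high = round(weights_gb) + optimizer_gb + activations[1] + kv_cache[1]
print(low, high)  # 59 84
```

Both ends of the range leave substantial headroom on a 141 GB H200.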
## Usage

### Installation
```bash
git clone https://github.com/Phonsiriwillbejommarn/Qwen2.5-3b-Quiet-StaR
cd Qwen2.5-3b-Quiet-StaR
pip install -r requirements.txt
```
### Training
```bash
# Default (Qwen2.5-3B + FineWeb-Edu)
python train.py

# Custom configuration
python train.py \
  --n_ahead 8 \
  --n_ahead_talk 4 \
  --n_passes 2 \
  --batch_size 8 \
  --max_steps 100000 \
  --max_length 1024 \
  --learning_rate 1e-6

# Math-focused variant
python train.py \
  --dataset_name open-web-math/open-web-math \
  --dataset_subset default

# Disable wandb logging
python train.py --no_wandb
```
### Inference
```bash
# Interactive chat
python inference.py --model_path ./outputs/quietstar_qwen25_3b_final_XXXXX

# Single prompt
python inference.py \
  --model_path ./outputs/quietstar_qwen25_3b_final_XXXXX \
  --prompt "What is 2 + 3 * 4?"
```
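Inference masks out thought tokens so that only the "spoken" tokens reach the user. A minimal sketch of that masking over a decoded token stream (delimiter strings taken from the description above; the actual script may operate on token ids instead):

```python
def strip_thoughts(tokens, start="<|startthought|>", end="<|endthought|>"):
    """Drop everything between the thought delimiters, keeping only spoken tokens."""
    out, in_thought = [], False
    for tok in tokens:
        if tok == start:
            in_thought = True
        elif tok == end:
            in_thought = False
        elif not in_thought:
            out.append(tok)
    return out

toks = ["2", "+", "3", "<|startthought|>", "precedence:", "multiply",
        "first", "<|endthought|>", "*", "4"]
print(strip_thoughts(toks))  # ['2', '+', '3', '*', '4']
```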
## Repository Structure
```
Qwen2.5-3b-Quiet-STaR/
├── config.py                # QuietStarConfig (extends Qwen2Config)
├── modeling_quiet_star.py   # Core model with thought generation & mixing head
├── train.py                 # Training script (H200-optimised)
├── eval_helpers.py          # Evaluation preprocessing & metrics
├── inference.py             # Inference with thought token masking
├── requirements.txt         # Dependencies
└── README.md
```
## Citation
If you use this work, please cite the original Quiet-STaR paper:
```bibtex
@article{zelikman2024quiet,
  title={Quiet-{STaR}: Language Models Can Teach Themselves to Think Before Speaking},
  author={Zelikman, Eric and Harik, Georges and Shao, Yijia and Jayasiri, Varuna and Haber, Nick and Goodman, Noah D.},
  journal={arXiv preprint arXiv:2403.09629},
  year={2024}
}
```
## License

This project is released under the Apache 2.0 license.