
---
language:
  - en
license: apache-2.0
base_model: Qwen/Qwen2.5-3B
tags:
  - qwen2.5
  - quiet-star
  - reasoning
  - rationale-generation
  - reinforcement-learning
  - causal-lm
datasets:
  - HuggingFaceFW/fineweb-edu
library_name: transformers
pipeline_tag: text-generation
---

# Qwen2.5-3B Quiet-STaR

An implementation of Quiet-STaR on top of Qwen2.5-3B, trained on a single NVIDIA H200 (141 GB HBM3e). The model is trained to generate internal “thoughts” at every token position before predicting the next token, and it learns which thoughts improve prediction via REINFORCE, with no task-specific supervision.


## Model Details

| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-3B |
| Method | Quiet-STaR (REINFORCE on thought tokens) |
| Training data | FineWeb-Edu (score ≥ 3) |
| Hardware | 1× NVIDIA H200 (141 GB HBM3e) |
| Precision | bfloat16 |
| License | Apache 2.0 |

## What is Quiet-STaR?

Quiet-STaR teaches a language model to think before speaking: at every token position, the model generates a short internal rationale (`<|startthought|> ... <|endthought|>`) and uses a learned mixing head to blend the base logits with thought-augmented logits. The model is trained to up-weight thoughts that increase the probability of the correct next token, using a REINFORCE-style reward.

```
Input:    [The] [cat] [sat] [on] [the] [mat]
               ↓
         <|startthought|>
         [thought_1] ... [thought_n]
         <|endthought|>
               ↓
Mixing:   base_logits ←(1-w)→ thought_logits
               ↓
Output:   improved next-token prediction
```
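The mixing step and the REINFORCE-style reward can be sketched numerically. This is a minimal NumPy illustration with made-up logits and a scalar mixing weight; in the real model, `w` is predicted per position by the learned mixing head and the logits come from the transformer:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical per-position quantities (real values come from the model).
base_logits = np.array([2.0, 0.5, -1.0])     # next-token logits without a thought
thought_logits = np.array([1.0, 3.0, -0.5])  # logits after attending to the thought
w = 0.25                                     # mixing weight from the learned head

# Blend base and thought-augmented predictions.
mixed_logits = (1.0 - w) * base_logits + w * thought_logits

# REINFORCE-style reward: did the thought raise the log-probability of the
# true next token (index 1 here) relative to the base prediction?
target = 1
reward = np.log(softmax(mixed_logits)[target]) - np.log(softmax(base_logits)[target])
# A positive reward up-weights the thought tokens that produced it.
```

Here the thought shifts probability mass toward the correct token, so the reward is positive and the rationale that generated it is reinforced.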

## Why Qwen2.5-3B?

- Fits comfortably on a single H200 with headroom for large batch sizes and sequence lengths
- Strong baseline reasoning capability
- Modern architecture: GQA · RoPE · SwiGLU

## Training Details

### Dataset

FineWeb-Edu: educational web text filtered for quality (score ≥ 3). Educational content naturally contains implicit reasoning steps (proofs, explanations, derivations), making it well-suited for Quiet-STaR training.

Alternative: `open-web-math/open-web-math` for math-focused training.
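The quality filter amounts to a simple per-document predicate. A sketch in plain Python (the `int_score` field name is an assumption about the public FineWeb-Edu schema; in practice you would pass the predicate to the `datasets` library's streaming `.filter`):

```python
def keep_educational(example, min_score=3):
    # FineWeb-Edu carries an integer quality score per document
    # ("int_score" in the public release; field name assumed here).
    return example.get("int_score", 0) >= min_score

# Illustrative rows standing in for streamed dataset examples.
docs = [
    {"text": "A proof by induction proceeds in two steps ...", "int_score": 4},
    {"text": "Limited-time offer, click here!", "int_score": 1},
]
kept = [d for d in docs if keep_educational(d)]  # only the first row survives
```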

### Key Hyperparameters

| Parameter | Default | Description |
|---|---|---|
| `n_ahead` | 8 | Thought tokens per position (including special tokens) |
| `n_ahead_talk` | 4 | Tokens predicted after the thought |
| `n_passes` | 4 | Forward passes per training step |
| `batch_size` | 1 | Per-device batch size |
| `max_length` | 1024 | Sequence length |
| `learning_rate` | 1e-6 | Learning rate |
| `gumbel_temperature` | 1.0 | Gumbel-Softmax temperature |
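These defaults can be gathered into one config object. A sketch only: the field names mirror the table above, not necessarily the actual flags in `train.py`:

```python
from dataclasses import dataclass

@dataclass
class QuietStarTrainingArgs:
    """Illustrative config mirroring the hyperparameter table."""
    n_ahead: int = 8              # thought tokens per position, incl. special tokens
    n_ahead_talk: int = 4         # tokens predicted after the thought
    n_passes: int = 4             # forward passes per training step
    batch_size: int = 1           # per-device batch size
    max_length: int = 1024        # sequence length
    learning_rate: float = 1e-6
    gumbel_temperature: float = 1.0

args = QuietStarTrainingArgs()   # override fields as keyword arguments
```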

### Memory Budget (H200)

| Component | Memory |
|---|---|
| Model weights (bf16) | ~6 GB |
| Optimizer (AdamW) | ~18 GB |
| Activations + gradients | ~30–50 GB |
| KV cache + thought embeddings | ~5–10 GB |
| **Total estimate** | **~59–84 GB** |
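The total row is just the sum of the component estimates, a quick sanity check (rough figures from the table, not measurements):

```python
# Component estimates in GB, taken from the budget table.
weights_gb = 6                # ~3B params x 2 bytes (bf16)
optimizer_gb = 18             # AdamW state
activations_gb = (30, 50)     # low / high estimate
kv_cache_gb = (5, 10)         # low / high estimate

low = weights_gb + optimizer_gb + activations_gb[0] + kv_cache_gb[0]
high = weights_gb + optimizer_gb + activations_gb[1] + kv_cache_gb[1]
# (low, high) -> (59, 84), matching the ~59-84 GB total
```

Either way the budget leaves comfortable headroom on a 141 GB H200.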

## Usage

### Installation

```bash
git clone https://github.com/Phonsiriwillbejommarn/Qwen2.5-3b-Quiet-StaR
cd Qwen2.5-3b-Quiet-StaR
pip install -r requirements.txt
```

### Training

```bash
# Default (Qwen2.5-3B + FineWeb-Edu)
python train.py

# Custom configuration
python train.py \
    --n_ahead 8 \
    --n_ahead_talk 4 \
    --n_passes 2 \
    --batch_size 8 \
    --max_steps 100000 \
    --max_length 1024 \
    --learning_rate 1e-6

# Math-focused variant
python train.py \
    --dataset_name open-web-math/open-web-math \
    --dataset_subset default

# Disable wandb logging
python train.py --no_wandb
```

### Inference

```bash
# Interactive chat
python inference.py --model_path ./outputs/quietstar_qwen25_3b_final_XXXXX

# Single prompt
python inference.py \
    --model_path ./outputs/quietstar_qwen25_3b_final_XXXXX \
    --prompt "What is 2 + 3 * 4?"
```
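`inference.py` masks thought tokens so they never appear in the visible output. The idea can be sketched as a token-level filter over the generated sequence (simplified; assumes the two special tokens shown earlier decode as standalone strings):

```python
def strip_thoughts(tokens, start="<|startthought|>", end="<|endthought|>"):
    # Drop everything between the thought delimiters (inclusive),
    # keeping only the tokens the user should see.
    visible, in_thought = [], False
    for tok in tokens:
        if tok == start:
            in_thought = True
        elif tok == end:
            in_thought = False
        elif not in_thought:
            visible.append(tok)
    return visible

generated = ["2", "+", "3", "<|startthought|>", "multiply", "first",
             "<|endthought|>", "*", "4", "=", "14"]
print(" ".join(strip_thoughts(generated)))  # -> 2 + 3 * 4 = 14
```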

## Repository Structure

```
Qwen2.5-3b-Quiet-STaR/
├── config.py                  # QuietStarConfig (extends Qwen2Config)
├── modeling_quiet_star.py     # Core model with thought generation & mixing head
├── train.py                   # Training script (H200-optimised)
├── eval_helpers.py            # Evaluation preprocessing & metrics
├── inference.py               # Inference with thought token masking
├── requirements.txt           # Dependencies
└── README.md
```

## Citation

If you use this work, please cite the original Quiet-STaR paper:

```bibtex
@article{zelikman2024quiet,
  title={Quiet-{STaR}: Language Models Can Teach Themselves to Think Before Speaking},
  author={Zelikman, Eric and Harik, Georges and Shao, Yijia and Jayasiri, Varuna and Haber, Nick and Goodman, Noah D.},
  journal={arXiv preprint arXiv:2403.09629},
  year={2024}
}
```

## License

Apache License 2.0
