---
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-4B
pipeline_tag: text-generation
tags:
- abstract-cot
- latent-reasoning
- math-reasoning
- qwen3
datasets:
- HuggingFaceH4/MATH-500
- allenai/Dolci-Think-SFT-7B
---
# Qwen3-4B-AbstractCoT-warmup
Qwen3-4B fine-tuned with the **Abstract Chain-of-Thought (Abstract-CoT)** warm-up procedure from "[Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought](https://arxiv.org/abs/2604.22709v2)" (Ramji, Naseem, Fernandez Astudillo, IBM Research AI, 2026). The model is taught to compress its reasoning into a short sequence (~16–22 tokens) drawn from a reserved 64-symbol *abstract vocabulary* `V_abs = {<TOKEN_A>, ..., <TOKEN_BL>}`, used as a discrete latent scratchpad before emitting the answer.
```
prompt ──▶ <beginabstract> z_1 ... z_m <endabstract> answer
           └──────── ẑ ∈ V_abs^m, m ≤ 128 ────────┘
```
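For example, a completed sequence might read (illustrative trace, not real model output): `<beginabstract><TOKEN_C><TOKEN_AQ><TOKEN_F> ... <endabstract>So the answer is \boxed{42}.`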
This is the SFT half of the paper only (no RL stage). The comparison row is the paper's "Abstract-CoT (Warm-up)" line in Table 1.
## Headline result
| Model | MATH-500 acc (%) | Mean tokens |
|---|---|---|
| Paper Baseline (Qwen3-4B verbal CoT) | 83.2 | 1087 |
| **Our Baseline** (Qwen3-4B verbal CoT, this hardware) | 84.60 | 1045 |
| Paper Abstract-CoT Warm-up | 86.2 | 168 |
| **This model** (T=3 PI, N=5k, 1 epoch, LoRA, seq 8k) | **72.00** | **432** |
The accuracy gap to the paper's 86.2 is driven by reduced data scale (5k vs 600k), LoRA vs full fine-tuning, and 1 vs 3 epochs per phase. See `docs/20260511_reader.md` for a full discussion.
## Repository layout
```
final/           - end-of-round-3 merged model (THE warm-up checkpoint)
round2/          - end-of-round-2 merged model
round1/          - end-of-round-1 merged model
adapters/        - all 6 LoRA adapters (pi{1,2,3}_phase{A,B})
results/         - per-example eval JSONL (baseline + abstract)
teacher_traces/  - on-policy V_abs traces used as Phase B/A teachers
train_logs/      - per-phase loss + LR curves (verifies cosine fix)
docs/            - run reports (technical + reader-oriented)
```
## How it was trained
Three policy-iteration rounds, each with two phases:
- **Phase A – Bottleneck SFT.** Train on `[prompt; verbal-CoT; ẑ; answer]` with the answer blocked from attending to the verbal CoT (see the mask sketch after this list), forcing all CoT→answer signal through `ẑ`.
  - Round 1: `ẑ` is random V_abs tokens.
  - Rounds 2+: `ẑ` is sampled on-policy from the previous round's model.
- **Phase B – Self-distillation.** Train on `[prompt; ẑ; answer]` with standard causal attention, where `ẑ` is now generated from the prompt alone.
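A minimal sketch of the Phase A bottleneck mask (illustrative PyTorch, not the repo's training code; the helper and its segment indices are hypothetical):

```python
import torch

def bottleneck_mask(seq_len: int, cot_start: int, cot_end: int,
                    answer_start: int) -> torch.Tensor:
    """Causal attention mask in which answer tokens cannot see the verbal CoT.

    Hypothetical helper: cot_start/cot_end/answer_start index into the packed
    [prompt; verbal-CoT; z_hat; answer] sequence. True = may attend.
    """
    # Standard lower-triangular causal mask.
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Block the answer span from attending to the verbal-CoT span, so every
    # bit of CoT->answer signal must route through the abstract tokens z_hat.
    mask[answer_start:, cot_start:cot_end] = False
    return mask  # expand to (batch, 1, seq_len, seq_len) before passing to the model
```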
Training config:
- Base: `Qwen/Qwen3-4B`, extended with V_abs (M=64) + `<beginabstract>` + `<endabstract>` (151,669 → 151,735 tokens).
- LoRA r=32, α=64 on attention + MLP projections (see the config sketch after this list). Embedding table + LM head trained fully (so the new abstract-vocab rows can move freely). 842.9 M / 4.86 B parameters trainable (17.3%).
- Data: 5,000 examples from `allenai/Dolci-Think-SFT-7B`, filtered to assistant messages with `<think>` blocks ≥ 200 chars.
- max_len 8192, batch 32, lr 1e-4, cosine schedule, 5% warmup.
- 2× A100-SXM4-80GB, ~11 hours wall-clock.
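As a sketch, the adapter setup above corresponds roughly to the following `peft` config (module names are Qwen3's projection layers; anything beyond r/α/targets/modules_to_save is an assumption, not the repo's training script):

```python
from peft import LoraConfig

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    # Attention + MLP projections in Qwen3 blocks.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # Train the extended embedding table and LM head fully so the new
    # V_abs / <beginabstract> / <endabstract> rows can move freely.
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
```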
## Using the model
### Inference (vLLM, recommended)
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

# Download the final checkpoint
model_path = snapshot_download(
    "leapeto/Qwen3-4B-AbstractCoT-warmup",
    allow_patterns=["final/*"],
)
tok = AutoTokenizer.from_pretrained(f"{model_path}/final", trust_remote_code=True)

# Abstract token ids: <TOKEN_A> ... <TOKEN_Z>, then <TOKEN_AA> ... <TOKEN_BL>
abs_tokens = []
for i in range(64):
    if i < 26:
        abs_tokens.append(f"<TOKEN_{chr(ord('A') + i)}>")
    else:
        j = i - 26
        abs_tokens.append(f"<TOKEN_{chr(ord('A') + j // 26)}{chr(ord('A') + j % 26)}>")
end_id = tok.convert_tokens_to_ids("<endabstract>")
abs_ids = tok.convert_tokens_to_ids(abs_tokens)
allowed = list(set(abs_ids + [end_id]))

llm = LLM(model=f"{model_path}/final", tensor_parallel_size=2,
          dtype="bfloat16", trust_remote_code=True)

# Two-stage decode: (1) constrained abstract trace, (2) unconstrained answer
prompt = "What is the integral of x^2 from 0 to 1? Put your final answer in \\boxed{}."
messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": prompt},
]
prefix = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
prefix += "<beginabstract>"

# Stage 1: V_abs only, stop at <endabstract>
sp1 = SamplingParams(temperature=0.7, max_tokens=128,
                     allowed_token_ids=allowed, stop_token_ids=[end_id],
                     skip_special_tokens=False)
abstract = llm.generate([prefix], sp1)[0].outputs[0].text
prompt2 = prefix + abstract + "<endabstract>\n"

# Stage 2: unconstrained answer
sp2 = SamplingParams(temperature=0.0, max_tokens=2048)
answer = llm.generate([prompt2], sp2)[0].outputs[0].text
print(answer)
```
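The two-stage decode mirrors the training format: stage 1 restricts sampling to `V_abs` plus `<endabstract>` so the trace cannot drift back into natural language, and stage 2 decodes the answer with no constraint. A single unconstrained `generate` call may also work when the model reliably stays inside `V_abs`, but the constrained variant is more robust.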
### Loading the LoRA adapters (peft)
If you want to inspect individual round outputs without downloading the merged models:
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
from huggingface_hub import snapshot_download

# You'll need the extended base model first: produce it locally via scripts/01_extend_model.sh,
# OR start from one of our merged checkpoints and load a later adapter on top.
base = AutoModelForCausalLM.from_pretrained("path/to/extended/base", trust_remote_code=True)

adapter_path = snapshot_download(
    "leapeto/Qwen3-4B-AbstractCoT-warmup",
    allow_patterns=["adapters/pi3_phaseB/*"],
)
model = PeftModel.from_pretrained(base, f"{adapter_path}/adapters/pi3_phaseB")
```
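If you need a standalone checkpoint rather than base + adapter, `model.merge_and_unload()` (standard `peft`) folds the adapter into the base weights, which is presumably how the merged `round*/` and `final/` directories here were produced.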
## Files of interest
| File | What |
|---|---|
| `final/` | End-of-round-3 merged model. **This is the main artifact.** |
| `round1/`, `round2/` | Intermediate merged models for studying the T=1 → T=2 → T=3 progression |
| `adapters/pi{1,2,3}_phase{A,B}/` | LoRA-only checkpoints from each phase |
| `results/baseline_math500.jsonl` | Qwen3-4B verbal-CoT eval (84.60% / 1045 tok) |
| `results/abstract_math500_T3_N5000.jsonl` | This model's eval (72.00% / 432 tok) |
| `train_logs/*.json` | Per-step loss + LR curves for each phase |
| `docs/20260511.md` | Technical report (full breakdown) |
| `docs/20260511_reader.md` | Reader-oriented report (concepts + reasoning) |
## Citation
```bibtex
@article{ramji2026thinking,
title={Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought},
  author={Ramji, Keshav and Naseem, Tahira and Fernandez Astudillo, Ramón},
journal={arXiv preprint arXiv:2604.22709},
year={2026}
}
```