---
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-4B
pipeline_tag: text-generation
tags:
  - abstract-cot
  - latent-reasoning
  - math-reasoning
  - qwen3
datasets:
  - HuggingFaceH4/MATH-500
  - allenai/Dolci-Think-SFT-7B
---

# Qwen3-4B-AbstractCoT-warmup

Qwen3-4B fine-tuned with the **Abstract Chain-of-Thought (Abstract-CoT)** warm-up procedure from "[Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought](https://arxiv.org/abs/2604.22709v2)" (Ramji, Naseem, Fernandez Astudillo, IBM Research AI, 2026). The model is taught to compress its reasoning into a short sequence (~16–22 tokens) drawn from a reserved 64-symbol *abstract vocabulary* `V_abs = {<TOKEN_A>, …, <TOKEN_BL>}`, used as a discrete latent scratchpad before emitting the answer.

```
prompt  ─►  <beginabstract>  z_1 ... z_m  <endabstract>  answer
            └─────── z̃ ∈ V_abs^m, m ≤ 128 ───────┘
```

This is only the SFT half of the paper; there is no RL stage. The comparison row is the paper's "Abstract-CoT (Warm-up)" line in Table 1.

## Headline result

| Configuration | MATH-500 acc. (%) | Mean tokens |
|---|---|---|
| Paper Baseline (Qwen3-4B verbal CoT) | 83.2 | 1087 |
| **Our Baseline** (Qwen3-4B verbal CoT, this hardware) | 84.60 | 1045 |
| Paper Abstract-CoT Warm-up | 86.2 | 168 |
| **This model** (T=3 policy-iteration rounds, N=5k examples, 1 epoch per phase, LoRA, 8k seq) | **72.00** | **432** |

The accuracy gap to the paper's 86.2 is driven by reduced data scale (5k vs 600k), LoRA vs full fine-tuning, and 1 vs 3 epochs per phase. See `docs/20260511_reader.md` for a full discussion.

## Repository layout

```
final/             ← end-of-round-3 merged model (THE warm-up checkpoint)
round2/            ← end-of-round-2 merged model
round1/            ← end-of-round-1 merged model
adapters/          ← all 6 LoRA adapters (pi{1,2,3}_phase{A,B})
results/           ← per-example eval JSONL (baseline + abstract)
teacher_traces/    ← on-policy V_abs traces used as Phase B/A teachers
train_logs/        ← per-phase loss + LR curves (verifies cosine fix)
docs/              ← run reports (technical + reader-oriented)
```

## How it was trained

Three policy-iteration rounds, each with two phases:

- **Phase A — Bottleneck SFT.** Train on `[prompt; verbal-CoT; z̃; answer]` with the answer blocked from attending to the verbal CoT, forcing all CoT→answer signal through `z̃` (a mask sketch follows this list).
  - Round 1: `z̃` is random V_abs tokens.
  - Rounds 2+: `z̃` is sampled on-policy from the previous round's model.
- **Phase B — Self-distillation.** Train on `[prompt; z̃; answer]` with standard causal attention, where `z̃` is now generated from the prompt alone.
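
The bottleneck in Phase A is an attention-mask constraint rather than a change to the data. Below is a minimal sketch, assuming per-token segment labels (prompt / verbal CoT / abstract trace / answer); the `bottleneck_mask` helper is illustrative only, not the repo's training code, and its boolean output would still have to be converted into whatever additive 4D mask the training stack expects.

```python
import torch

def bottleneck_mask(seg_ids: torch.Tensor) -> torch.Tensor:
    """Boolean [T, T] matrix: True where a query position may attend to a key.

    seg_ids labels each position: 0 = prompt, 1 = verbal CoT,
    2 = abstract trace z~, 3 = answer. Attention stays causal, except
    that answer queries are blocked from verbal-CoT keys, so the only
    CoT-derived signal reaching the answer has to flow through z~.
    """
    T = seg_ids.size(0)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
    answer_query = (seg_ids == 3).unsqueeze(1)   # rows: answer tokens
    cot_key = (seg_ids == 1).unsqueeze(0)        # cols: verbal-CoT tokens
    return causal & ~(answer_query & cot_key)

# Toy check: 2 prompt, 3 verbal-CoT, 2 abstract, 2 answer tokens
seg = torch.tensor([0, 0, 1, 1, 1, 2, 2, 3, 3])
mask = bottleneck_mask(seg)
assert not mask[7, 2]   # an answer token cannot see the verbal CoT
assert mask[7, 5]       # ...but it can see the abstract trace
```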

Training config (a vocabulary-extension and LoRA sketch follows this list):
- Base: `Qwen/Qwen3-4B`, extended with V_abs (M=64) + `<beginabstract>` + `<endabstract>` (151 669 → 151 735 tokens).
- LoRA r=32, α=64 on attention + MLP projections. Embedding table + LM head trained fully (so the new abstract-vocab rows can move freely). 842.9 M / 4.86 B trainable parameters (17.3%).
- Data: 5 000 examples from `allenai/Dolci-Think-SFT-7B`, filtered to assistant messages with `<think>` blocks ≥ 200 chars.
- max_len 8192, batch 32, lr 1e-4, cosine schedule, 5% warmup.
- 2× A100-SXM4-80GB, ~11 hours wall-clock.
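
The following is a minimal sketch of the vocabulary extension and adapter setup described above, not the repository's actual setup script (`scripts/01_extend_model.sh`); the Qwen3 projection names in `target_modules` are assumptions.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "Qwen/Qwen3-4B"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Reserve the 64-symbol abstract vocabulary (<TOKEN_A> ... <TOKEN_BL>)
# plus the two delimiters, then grow the embedding table to match.
abs_tokens = [f"<TOKEN_{chr(ord('A') + i)}>" for i in range(26)]
abs_tokens += [
    f"<TOKEN_{chr(ord('A') + j // 26)}{chr(ord('A') + j % 26)}>" for j in range(38)
]
tok.add_special_tokens(
    {"additional_special_tokens": abs_tokens + ["<beginabstract>", "<endabstract>"]}
)
model.resize_token_embeddings(len(tok))

# LoRA on attention + MLP projections; the embedding table and LM head are
# kept fully trainable so the new abstract-vocab rows can move freely.
lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```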

## Using the model

### Inference (vLLM, recommended)

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

# Download the final checkpoint
model_path = snapshot_download(
    "leapeto/Qwen3-4B-AbstractCoT-warmup",
    allow_patterns=["final/*"],
)

tok = AutoTokenizer.from_pretrained(f"{model_path}/final", trust_remote_code=True)

# Abstract token ids
abs_tokens = []
for i in range(64):
    if i < 26:
        abs_tokens.append(f"<TOKEN_{chr(ord('A')+i)}>")
    else:
        j = i - 26
        abs_tokens.append(f"<TOKEN_{chr(ord('A')+j//26)}{chr(ord('A')+j%26)}>")
end_id = tok.convert_tokens_to_ids("<endabstract>")
abs_ids = tok.convert_tokens_to_ids(abs_tokens)
allowed = list(set(abs_ids + [end_id]))

llm = LLM(model=f"{model_path}/final", tensor_parallel_size=2,
          dtype="bfloat16", trust_remote_code=True)

# Two-stage decode: (1) constrained abstract trace, (2) unconstrained answer
prompt = "What is the integral of x^2 from 0 to 1? Put your final answer in \\boxed{}."
messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": prompt},
]
prefix = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True, enable_thinking=False)
prefix += "<beginabstract>"

# Stage 1: V_abs only, stop at <endabstract>
sp1 = SamplingParams(temperature=0.7, max_tokens=128,
                     allowed_token_ids=allowed, stop_token_ids=[end_id],
                     skip_special_tokens=False)
abstract = llm.generate([prefix], sp1)[0].outputs[0].text
prompt2 = prefix + abstract + "<endabstract>\n"

# Stage 2: unconstrained answer
sp2 = SamplingParams(temperature=0.0, max_tokens=2048)
answer = llm.generate([prompt2], sp2)[0].outputs[0].text
print(answer)
```

### Loading the LoRA adapters (peft)

If you want to inspect individual round outputs without downloading the merged models:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
from huggingface_hub import snapshot_download

# You'll need the extended base model first: produce it locally via scripts/01_extend_model.sh
# OR start from one of our merged checkpoints and load a later adapter on top.

base = AutoModelForCausalLM.from_pretrained("path/to/extended/base", trust_remote_code=True)
adapter_path = snapshot_download(
    "leapeto/Qwen3-4B-AbstractCoT-warmup",
    allow_patterns=["adapters/pi3_phaseB/*"],
)
model = PeftModel.from_pretrained(base, f"{adapter_path}/adapters/pi3_phaseB")
```
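
To materialize a merged checkpoint like the ones in `round1/`, `round2/`, and `final/`, the adapter can be folded back into the base weights. Continuing from the snippet above (the output directory is a placeholder):

```python
merged = model.merge_and_unload()   # fold the adapter back into the base weights
merged.save_pretrained("qwen3-4b-abstractcot-pi3B-merged")
```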

## Files of interest

| File | What |
|---|---|
| `final/` | End-of-round-3 merged model. **This is the main artifact.** |
| `round1/`, `round2/` | Intermediate merged models for studying the T=1 → T=2 → T=3 progression |
| `adapters/pi{1,2,3}_phase{A,B}/` | LoRA-only checkpoints from each phase |
| `results/baseline_math500.jsonl` | Qwen3-4B verbal-CoT eval (84.60% / 1045 tok) |
| `results/abstract_math500_T3_N5000.jsonl` | This model's eval (72.00% / 432 tok) |
| `train_logs/*.json` | Per-step loss + LR curves for each phase |
| `docs/20260511.md` | Technical report (full breakdown) |
| `docs/20260511_reader.md` | Reader-oriented report (concepts + reasoning) |

## Citation

```bibtex
@article{ramji2026thinking,
  title={Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought},
  author={Ramji, Keshav and Naseem, Tahira and Fernandez Astudillo, Ramón},
  journal={arXiv preprint arXiv:2604.22709},
  year={2026}
}
```