Text Generation
Transformers
Safetensors
abstract-cot
latent-reasoning
math-reasoning
qwen3
File size: 12,139 Bytes
a555798
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
# Abstract-CoT (arXiv:2604.22709v2) β€” Production Run on Qwen3-4B

**Date:** 2026-05-11
**Scope:** Full T=3 PI warm-up at N=5000, seq_len=8192, LoRA β€” comparison target is the same paper's "Abstract-CoT (Warm-up)" row in Table 1.
**Hardware:** 2Γ— NVIDIA A100-SXM4-80GB (this machine, fresh clone). New checkpoint.
**Status:** Pipeline ran end-to-end in **~11 hours**, under the 12 hr budget. All three engineering wins (vLLM gen_traces, LR-schedule fix, seq_len 8k) shipped. T=3 on-policy iteration ran cleanly per-round but did **not** improve over T=1 at this LoRA/data scale (within noise).

---

## Headline numbers

| Method | MATH-500 acc | Mean tokens (reasoning + response) |
|---|---|---|
| Paper Baseline (Qwen3-4B, verbal CoT) | 83.2 | 1087 |
| Smoke Baseline (prior repo, 2Γ— A100-40GB) | 83.60 | 1067 |
| **This-run Baseline** (2Γ— A100-80GB, vLLM 0.19.1) | **84.60** | **1045** |
| Paper Abstract-CoT (Warm-up) | 86.2 | 168 |
| Smoke Warm-up (T=1, N=5k, 1ep, LoRA, seq 2k, T=0.7, m_min=16) | 73.20 | 433 |
| **Validation** (T=1, N=500, 1ep, LoRA, seq 8k, T=0.7, m_min=16) | 73.40 | 558 |
| **This-run Warm-up** (T=3, N=5k, 1ep, LoRA, seq 8k, T=0.7, m_min=16) | **72.00** | **432** |

Reading:
- **Baseline reproduces paper.** vLLM 0.19.1 (downgraded from 0.20.2 to match CUDA 12.8) works correctly on this box.
- **T=3 did not beat T=1** at this scale: 72.0 vs 73.2/73.4, well within the noise of temp=0.7 abstract-trace sampling.
- **Mean total tokens dropped** (432 vs smoke 433 and validation 558), suggesting the on-policy traces *did* push the model toward shorter responses; accuracy just didn't lift.

---

## Hardware actually available

```
GPU 0: A100-SXM4-80GB  vol ECC unc: 0  matmul OK, vLLM OK, sustained 100% util for 11 hr clean
GPU 1: A100-SXM4-80GB  vol ECC unc: 0  matmul OK, vLLM OK, sustained 100% util for 11 hr clean
```

Both GPUs usable. CUDA 12.8 / driver 570.195.03. 1.4 TiB system RAM, 128 CPUs. 146 GB free overlay disk at start; ended at 88 GB used.

Compute per GPU is identical to the smoke's 40GB cards (same GA100 silicon). The 80GB lifts the seq_len cap and unblocks the full-FT path (not used here β€” we kept LoRA per the smoke's recommendation for this budget).

**vLLM TP**: Qwen3-4B has 32 attention heads β†’ TP must divide 32. **TP=2 fits perfectly** on this 2-GPU box (no idle card during eval).

---

## What changed vs. the smoke

Listed in rough order of impact / engineering work.

### 1. **vLLM port for `gen_traces`** (biggest engineering win)

Replaced HF `model.generate()` + custom `LogitsProcessor` with vLLM `LLM.generate()` + `SamplingParams.allowed_token_ids` to enforce the V_abs βˆͺ {END_ABS} alphabet directly in the sampler. No custom logits processor needed β€” vLLM's `allowed_token_ids` does exactly this efficiently inside the kernels.

Measured throughput on N=5000:

| Mode | Prefix | max_model_len | Wall | Rate |
|---|---|---|---|---|
| Phase B teacher (no CoT) | ~150 tok | 4096 | **17 s** | **294/s** |
| Phase A teacher (with CoT) | ~5500 tok | 8192 | **887–891 s** | **5.6/s** |
| Smoke HF baseline | β€” | β€” | 11 min (=660 s) on 5k | 7.6/s |

vLLM speedup: **40Γ— on Phase B teacher**, **9Γ— on Phase A teacher** vs. the smoke's HF generate. The Phase A teacher path is prefill-dominated by 7800-token CoT prefixes; even there vLLM beats batched HF generate.

### 2. **Cosine LR schedule bug β€” root-cause fix**

The smoke report described an LR curve that went `1e-4 β†’ 5e-7 β†’ bounce back to 1e-4`, but its diagnosis ("`total_steps` was computed before `accelerator.prepare()`") was wrong β€” the source code already computed `total_steps` after `prepare()`.

**Actual root cause:** `accelerator.prepare(sched)` returns an `AcceleratedScheduler` that, under default settings (`split_batches=False`, `step_with_optimizer=True`), advances the underlying scheduler `num_processes` times per `sched.step()` call. With 2 GPUs, the cosine completes in half the calls, then bounces back to peak (the `get_cosine_schedule_with_warmup` function's `num_cycles=0.5` curve returns to max once `progress > 1`).

**Fix** in `src/train_phase_lora.py`:

```python
total_opt_steps = steps_per_epoch * args.epochs
total_steps = total_opt_steps * accelerator.num_processes   # NEW
sched = get_cosine_schedule_with_warmup(
    opt, num_warmup_steps=max(1, total_steps // 20),
    num_training_steps=total_steps,
)
```

**Verified** end-to-end. Round-3 Phase A train log: peak 1.0e-4 at step 5, monotonic cosine descent to 1.12e-8 at step 155, **no bounce-back**. Identical curve on every Phase A and Phase B of every round.

A standalone reproduction of both the bug and the fix (no GPU needed) is in this session's log; can be reconstructed by instantiating the scheduler with `total_steps = total_opt_steps * 2` and stepping it `total_opt_steps Γ— 2` times.

Also added `train_log.json` "lrs" key alongside "losses" so future audits can verify offline.

### 3. **`max_len` 2048 β†’ 8192**

The smoke truncated 98% of Dolci-Think CoTs from the right (median CoT is 18.8k tokens). At seq_len 8192, ~60% of CoTs fit fully and the rest only have their tail removed β€” meaningful reasoning makes it into the bottleneck. Measured per-step time at seq_len 8192 on this box: **71.9 s/step** (vs. 12.7 s/step at seq_len 2048 in the smoke) β€” a 5.66Γ— slowdown, dominated by self-attention now scaling quadratically over a longer window.

### 4. **T=1 β†’ T=3 (full PI warm-up)**

The smoke only did one PI round (random Z̃ → bottleneck SFT → self-distill). This run did three: round 2 and round 3 use on-policy Z̃ generated via constrained decoding from the previous round's model. Per-round loss curves (Phase A `[bottleneck]`):

| Round | step 5 loss | step 155 loss | Notes |
|---|---|---|---|
| 1 | 3.49 | 0.85 | starts from random Z̃ — model is learning the bottleneck structure |
| 2 | 0.35 | 0.34 | Z̃ now carries signal — model converges fast |
| 3 | 0.27 | 0.35 | even cleaner start; the on-policy traces are doing what they should |

Phase B `[distill]` starting loss: 0.49 β†’ 0.29 β†’ 0.21 across rounds β€” same story.

So the optimizer is clearly working with the on-policy bottleneck signal. The accuracy lift just didn't show up at this LoRA/data scale (see "Quality observation" below).

### 5. **Misc fixes**

- **Shell syntax bug** in `scripts/03_phase_a.sh`: the apostrophe inside `${OUT:?OUT must be the output dir for this phase's LoRA adapter}` opened an unterminated single-quoted string under bash 5.2. Replaced with apostrophe-free wording.
- **`max_model_len` too tight on Phase B teacher** (first production attempt): set to 1024, but some Dolci user prompts are 1.5–2.5k tokens. Validation at N=500 didn't sample the tail. Bumped to **3072 / 4096** (prefix / model_len), and added defensive left-truncation of X when even X alone exceeds the budget. Re-runs completed instantly.
- `run_smoke.sh` now accepts `DATA_FILE` and `SKIP_BASELINE` env overrides so the same 6k-row dolci file can serve both validation (N=500) and production (N=5000).

---

## Per-stage wall times (this run)

| Stage | Per-occurrence | Γ— T=3 |
|---|---|---|
| Phase A (`bottleneck`, 156 opt steps @ ~72 s/step, seq 8k) | **~2.94 hr** | **8.81 hr** |
| Phase B (`distill`, 156 opt steps @ ~6 s/step) | **~15.7 min** | **47 min** |
| gen_traces β€” Phase A teacher (vLLM TP=2, max_model_len=8192) | **~14.85 min** | ~29.7 min (rounds 2, 3 only) |
| gen_traces β€” Phase B teacher (vLLM TP=2, max_model_len=4096) | **17 s** | 51 s |
| Merge LoRA β†’ full HF (CPU-bound write) | ~30 s | 3 min |
| Final eval (MATH-500, vLLM TP=2, T=0.7 abstract / T=0 answer) | β€” | **23 s** for the 3-stage decode |
| Pre-flight (extend Qwen3-4B, baseline calibration) | β€” | ~10 min |
| **Total wall** | | **~11h 0m** (04:07 β†’ 15:08) |

Includes ~5 min of failed gen_traces + restart from the `max_model_len=1024` issue.

---

## Configuration used

```bash
RUNS_DIR=$PWD/runs \
DATA_FILE=$PWD/data/dolci_6000.jsonl \
SKIP_BASELINE=1 \
N=5000 T=3 EPOCHS=1 \
bash scripts/run_smoke.sh
```

With the in-script defaults:
- `MAX_LEN=8192` (Phase A + Phase B training cap)
- `MICRO_BATCH=1`, `GRAD_ACCUM=16`, **effective batch 32** (2 GPUs Γ— 1 Γ— 16)
- `LR=1e-4`, cosine schedule, 5% warmup
- LoRA: `r=32`, `alpha=64`, target `{q,k,v,o,gate,up,down}_proj`, `modules_to_save=["embed_tokens","lm_head"]` (842.9 M / 4.86 B = 17.3% trainable)
- Abstract eval: `m_min=16`, `m_max=128`, `abs_temp=0.7`, answer `temp=0.0`, `tp=2`

---

## Quality observation β€” why didn't T=3 help?

Per-round Phase A starting loss dropped 3.49 → 0.35 → 0.27, showing the on-policy abstract traces are doing what the paper says they should: they begin to carry signal from CoT through Z̃. But MATH-500 accuracy stayed at 72 ± 1.2 across T=1 and T=3.

Hypotheses, in rough order of credibility:

1. **LoRA caps the gain.** With ~17% trainable params and the embedding table dominating those, the model's base "answer-from-prompt" reflex is too strong for the bottleneck to redirect at this scale. The smoke report flagged this as the biggest gap to the paper, and our T=3 result is consistent.
2. **N=5000 is still tiny.** The paper used 600k; we used 5k. On-policy refinement needs enough novel `(x, c)` pairs to keep producing diverse `Z̃` shapes; at 5k the same examples just get revisited with marginally different traces.
3. **Eval stochasticity.** Abstract trace decode uses temp=0.7. 1–2 pt variance between runs at N=500 is normal. The validation result (73.4) was likely a lucky upper; production (72.0) is within noise of the smoke (73.2).
4. **seq_len 8k may have let too much CoT signal "leak" through Z̃ during teacher generation**, making `Z̃` less of a compression target. Counterintuitive — the smoke argued for longer seq — but the bottleneck quality is the *delta* between what reaches Z̃ and what Y can use directly. Worth ablating.

---

## What's next, ranked

1. **Full fine-tuning** instead of LoRA. With 2Γ— 80GB and `enforce_eager=True` for the optimizer side, ZeRO-3 (no offload) becomes feasible on Qwen3-4B at seq_len 8k. Estimated ~16-20 hr at the current config; biggest expected lift.
2. **N β†’ 30k–60k** (still T=3, still seq 8k, still LoRA). Roughly extrapolates to ~30–60 hr β€” out of a one-day budget but the right next step if we get a 2–3 day budget. The full-FT path at 5k would be more diagnostic per hour.
3. **More epochs.** Paper uses 3 epochs/phase; we used 1. 3Γ— more wall but should help the Adam states settle.
4. Re-eval the current `pi3_phaseB_merged` at multiple seeds + temperatures to bound the eval stochasticity tighter. 5–10 min each.

---

## File layout (under `/workspace/ThinkingWithoutWordsRepro/`)

```
runs/
  baseline_math500.jsonl                       # 84.60% (this-run baseline)
  abstract_math500_T3_N5000.jsonl              # 72.00% (final result)
  qwen3-4b-abs/
    base/                                      # Qwen3-4B + V_abs (M=64) + delimiters
    pi1_phaseA/  pi1_phaseA_merged/            # round 1 LoRA + merged
    pi1_phaseB/  pi1_phaseB_merged/            # round 1 Phase B
    pi1_phaseB_teacher_traces.jsonl            # on-policy Z̃ for round-1 self-distill
    pi2_phaseA_teacher_traces.jsonl            # bottleneck teacher for round 2 (full-CoT)
    pi2_phaseA/  pi2_phaseA_merged/            # round 2
    pi2_phaseB_teacher_traces.jsonl
    pi2_phaseB/  pi2_phaseB_merged/
    pi3_phaseA_teacher_traces.jsonl
    pi3_phaseA/  pi3_phaseA_merged/
    pi3_phaseB_teacher_traces.jsonl
    pi3_phaseB/  pi3_phaseB_merged/            # ← FINAL warm-up model

data/
  math500.jsonl                                # 500 problems
  dolci_6000.jsonl                             # 6k filtered Dolci-Think examples (used N=5000 of them)

docs/
  20260510SMOKE_REPORT.md                      # prior run on 2Γ— A100-40GB
  20260511.md                                  # this report
```

Train logs (`runs/qwen3-4b-abs/pi*/train_log.json`) include `losses`, `lrs`, `total_opt_steps`, `num_processes`, `wallclock_s` per phase β€” sufficient to plot and re-verify the LR fix offline.