Abstract-CoT (arXiv:2604.22709v2) β Production Run on Qwen3-4B
Date: 2026-05-11 Scope: Full T=3 PI warm-up at N=5000, seq_len=8192, LoRA β comparison target is the same paper's "Abstract-CoT (Warm-up)" row in Table 1. Hardware: 2Γ NVIDIA A100-SXM4-80GB (this machine, fresh clone). New checkpoint. Status: Pipeline ran end-to-end in ~11 hours, under the 12 hr budget. All three engineering wins (vLLM gen_traces, LR-schedule fix, seq_len 8k) shipped. T=3 on-policy iteration ran cleanly per-round but did not improve over T=1 at this LoRA/data scale (within noise).
Headline numbers
| Method | MATH-500 acc | Mean tokens (reasoning + response) |
|---|---|---|
| Paper Baseline (Qwen3-4B, verbal CoT) | 83.2 | 1087 |
| Smoke Baseline (prior repo, 2Γ A100-40GB) | 83.60 | 1067 |
| This-run Baseline (2Γ A100-80GB, vLLM 0.19.1) | 84.60 | 1045 |
| Paper Abstract-CoT (Warm-up) | 86.2 | 168 |
| Smoke Warm-up (T=1, N=5k, 1ep, LoRA, seq 2k, T=0.7, m_min=16) | 73.20 | 433 |
| Validation (T=1, N=500, 1ep, LoRA, seq 8k, T=0.7, m_min=16) | 73.40 | 558 |
| This-run Warm-up (T=3, N=5k, 1ep, LoRA, seq 8k, T=0.7, m_min=16) | 72.00 | 432 |
Reading:
- Baseline reproduces paper. vLLM 0.19.1 (downgraded from 0.20.2 to match CUDA 12.8) works correctly on this box.
- T=3 did not beat T=1 at this scale: 72.0 vs 73.2/73.4, well within the noise of temp=0.7 abstract-trace sampling.
- Mean total tokens dropped (432 vs smoke 433 and validation 558), suggesting the on-policy traces did push the model toward shorter responses; accuracy just didn't lift.
Hardware actually available
GPU 0: A100-SXM4-80GB vol ECC unc: 0 matmul OK, vLLM OK, sustained 100% util for 11 hr clean
GPU 1: A100-SXM4-80GB vol ECC unc: 0 matmul OK, vLLM OK, sustained 100% util for 11 hr clean
Both GPUs usable. CUDA 12.8 / driver 570.195.03. 1.4 TiB system RAM, 128 CPUs. 146 GB free overlay disk at start; ended at 88 GB used.
Compute per GPU is identical to the smoke's 40GB cards (same GA100 silicon). The 80GB lifts the seq_len cap and unblocks the full-FT path (not used here β we kept LoRA per the smoke's recommendation for this budget).
vLLM TP: Qwen3-4B has 32 attention heads β TP must divide 32. TP=2 fits perfectly on this 2-GPU box (no idle card during eval).
What changed vs. the smoke
Listed in rough order of impact / engineering work.
1. vLLM port for gen_traces (biggest engineering win)
Replaced HF model.generate() + custom LogitsProcessor with vLLM LLM.generate() + SamplingParams.allowed_token_ids to enforce the V_abs βͺ {END_ABS} alphabet directly in the sampler. No custom logits processor needed β vLLM's allowed_token_ids does exactly this efficiently inside the kernels.
Measured throughput on N=5000:
| Mode | Prefix | max_model_len | Wall | Rate |
|---|---|---|---|---|
| Phase B teacher (no CoT) | ~150 tok | 4096 | 17 s | 294/s |
| Phase A teacher (with CoT) | ~5500 tok | 8192 | 887β891 s | 5.6/s |
| Smoke HF baseline | β | β | 11 min (=660 s) on 5k | 7.6/s |
vLLM speedup: 40Γ on Phase B teacher, 9Γ on Phase A teacher vs. the smoke's HF generate. The Phase A teacher path is prefill-dominated by 7800-token CoT prefixes; even there vLLM beats batched HF generate.
2. Cosine LR schedule bug β root-cause fix
The smoke report described an LR curve that went 1e-4 β 5e-7 β bounce back to 1e-4, but its diagnosis ("total_steps was computed before accelerator.prepare()") was wrong β the source code already computed total_steps after prepare().
Actual root cause: accelerator.prepare(sched) returns an AcceleratedScheduler that, under default settings (split_batches=False, step_with_optimizer=True), advances the underlying scheduler num_processes times per sched.step() call. With 2 GPUs, the cosine completes in half the calls, then bounces back to peak (the get_cosine_schedule_with_warmup function's num_cycles=0.5 curve returns to max once progress > 1).
Fix in src/train_phase_lora.py:
total_opt_steps = steps_per_epoch * args.epochs
total_steps = total_opt_steps * accelerator.num_processes # NEW
sched = get_cosine_schedule_with_warmup(
opt, num_warmup_steps=max(1, total_steps // 20),
num_training_steps=total_steps,
)
Verified end-to-end. Round-3 Phase A train log: peak 1.0e-4 at step 5, monotonic cosine descent to 1.12e-8 at step 155, no bounce-back. Identical curve on every Phase A and Phase B of every round.
A standalone reproduction of both the bug and the fix (no GPU needed) is in this session's log; can be reconstructed by instantiating the scheduler with total_steps = total_opt_steps * 2 and stepping it total_opt_steps Γ 2 times.
Also added train_log.json "lrs" key alongside "losses" so future audits can verify offline.
3. max_len 2048 β 8192
The smoke truncated 98% of Dolci-Think CoTs from the right (median CoT is 18.8k tokens). At seq_len 8192, ~60% of CoTs fit fully and the rest only have their tail removed β meaningful reasoning makes it into the bottleneck. Measured per-step time at seq_len 8192 on this box: 71.9 s/step (vs. 12.7 s/step at seq_len 2048 in the smoke) β a 5.66Γ slowdown, dominated by self-attention now scaling quadratically over a longer window.
4. T=1 β T=3 (full PI warm-up)
The smoke only did one PI round (random ZΜ β bottleneck SFT β self-distill). This run did three: round 2 and round 3 use on-policy ZΜ generated via constrained decoding from the previous round's model. Per-round loss curves (Phase A [bottleneck]):
| Round | step 5 loss | step 155 loss | Notes |
|---|---|---|---|
| 1 | 3.49 | 0.85 | starts from random ZΜ β model is learning the bottleneck structure |
| 2 | 0.35 | 0.34 | ZΜ now carries signal β model converges fast |
| 3 | 0.27 | 0.35 | even cleaner start; the on-policy traces are doing what they should |
Phase B [distill] starting loss: 0.49 β 0.29 β 0.21 across rounds β same story.
So the optimizer is clearly working with the on-policy bottleneck signal. The accuracy lift just didn't show up at this LoRA/data scale (see "Quality observation" below).
5. Misc fixes
- Shell syntax bug in
scripts/03_phase_a.sh: the apostrophe inside${OUT:?OUT must be the output dir for this phase's LoRA adapter}opened an unterminated single-quoted string under bash 5.2. Replaced with apostrophe-free wording. max_model_lentoo tight on Phase B teacher (first production attempt): set to 1024, but some Dolci user prompts are 1.5β2.5k tokens. Validation at N=500 didn't sample the tail. Bumped to 3072 / 4096 (prefix / model_len), and added defensive left-truncation of X when even X alone exceeds the budget. Re-runs completed instantly.run_smoke.shnow acceptsDATA_FILEandSKIP_BASELINEenv overrides so the same 6k-row dolci file can serve both validation (N=500) and production (N=5000).
Per-stage wall times (this run)
| Stage | Per-occurrence | Γ T=3 |
|---|---|---|
Phase A (bottleneck, 156 opt steps @ ~72 s/step, seq 8k) |
~2.94 hr | 8.81 hr |
Phase B (distill, 156 opt steps @ ~6 s/step) |
~15.7 min | 47 min |
| gen_traces β Phase A teacher (vLLM TP=2, max_model_len=8192) | ~14.85 min | ~29.7 min (rounds 2, 3 only) |
| gen_traces β Phase B teacher (vLLM TP=2, max_model_len=4096) | 17 s | 51 s |
| Merge LoRA β full HF (CPU-bound write) | ~30 s | 3 min |
| Final eval (MATH-500, vLLM TP=2, T=0.7 abstract / T=0 answer) | β | 23 s for the 3-stage decode |
| Pre-flight (extend Qwen3-4B, baseline calibration) | β | ~10 min |
| Total wall | ~11h 0m (04:07 β 15:08) |
Includes ~5 min of failed gen_traces + restart from the max_model_len=1024 issue.
Configuration used
RUNS_DIR=$PWD/runs \
DATA_FILE=$PWD/data/dolci_6000.jsonl \
SKIP_BASELINE=1 \
N=5000 T=3 EPOCHS=1 \
bash scripts/run_smoke.sh
With the in-script defaults:
MAX_LEN=8192(Phase A + Phase B training cap)MICRO_BATCH=1,GRAD_ACCUM=16, effective batch 32 (2 GPUs Γ 1 Γ 16)LR=1e-4, cosine schedule, 5% warmup- LoRA:
r=32,alpha=64, target{q,k,v,o,gate,up,down}_proj,modules_to_save=["embed_tokens","lm_head"](842.9 M / 4.86 B = 17.3% trainable) - Abstract eval:
m_min=16,m_max=128,abs_temp=0.7, answertemp=0.0,tp=2
Quality observation β why didn't T=3 help?
Per-round Phase A starting loss dropped 3.49 β 0.35 β 0.27, showing the on-policy abstract traces are doing what the paper says they should: they begin to carry signal from CoT through ZΜ. But MATH-500 accuracy stayed at 72 Β± 1.2 across T=1 and T=3.
Hypotheses, in rough order of credibility:
- LoRA caps the gain. With ~17% trainable params and the embedding table dominating those, the model's base "answer-from-prompt" reflex is too strong for the bottleneck to redirect at this scale. The smoke report flagged this as the biggest gap to the paper, and our T=3 result is consistent.
- N=5000 is still tiny. The paper used 600k; we used 5k. On-policy refinement needs enough novel
(x, c)pairs to keep producing diverseZΜshapes; at 5k the same examples just get revisited with marginally different traces. - Eval stochasticity. Abstract trace decode uses temp=0.7. 1β2 pt variance between runs at N=500 is normal. The validation result (73.4) was likely a lucky upper; production (72.0) is within noise of the smoke (73.2).
- seq_len 8k may have let too much CoT signal "leak" through ZΜ during teacher generation, making
ZΜless of a compression target. Counterintuitive β the smoke argued for longer seq β but the bottleneck quality is the delta between what reaches ZΜ and what Y can use directly. Worth ablating.
What's next, ranked
- Full fine-tuning instead of LoRA. With 2Γ 80GB and
enforce_eager=Truefor the optimizer side, ZeRO-3 (no offload) becomes feasible on Qwen3-4B at seq_len 8k. Estimated ~16-20 hr at the current config; biggest expected lift. - N β 30kβ60k (still T=3, still seq 8k, still LoRA). Roughly extrapolates to ~30β60 hr β out of a one-day budget but the right next step if we get a 2β3 day budget. The full-FT path at 5k would be more diagnostic per hour.
- More epochs. Paper uses 3 epochs/phase; we used 1. 3Γ more wall but should help the Adam states settle.
- Re-eval the current
pi3_phaseB_mergedat multiple seeds + temperatures to bound the eval stochasticity tighter. 5β10 min each.
File layout (under /workspace/ThinkingWithoutWordsRepro/)
runs/
baseline_math500.jsonl # 84.60% (this-run baseline)
abstract_math500_T3_N5000.jsonl # 72.00% (final result)
qwen3-4b-abs/
base/ # Qwen3-4B + V_abs (M=64) + delimiters
pi1_phaseA/ pi1_phaseA_merged/ # round 1 LoRA + merged
pi1_phaseB/ pi1_phaseB_merged/ # round 1 Phase B
pi1_phaseB_teacher_traces.jsonl # on-policy ZΜ for round-1 self-distill
pi2_phaseA_teacher_traces.jsonl # bottleneck teacher for round 2 (full-CoT)
pi2_phaseA/ pi2_phaseA_merged/ # round 2
pi2_phaseB_teacher_traces.jsonl
pi2_phaseB/ pi2_phaseB_merged/
pi3_phaseA_teacher_traces.jsonl
pi3_phaseA/ pi3_phaseA_merged/
pi3_phaseB_teacher_traces.jsonl
pi3_phaseB/ pi3_phaseB_merged/ # β FINAL warm-up model
data/
math500.jsonl # 500 problems
dolci_6000.jsonl # 6k filtered Dolci-Think examples (used N=5000 of them)
docs/
20260510SMOKE_REPORT.md # prior run on 2Γ A100-40GB
20260511.md # this report
Train logs (runs/qwen3-4b-abs/pi*/train_log.json) include losses, lrs, total_opt_steps, num_processes, wallclock_s per phase β sufficient to plot and re-verify the LR fix offline.