# Abstract-CoT (arXiv:2604.22709v2) — Production Run on Qwen3-4B

**Date:** 2026-05-11
**Scope:** Full T=3 PI warm-up at N=5000, seq_len=8192, LoRA — comparison target is the same paper's "Abstract-CoT (Warm-up)" row in Table 1.
**Hardware:** 2× NVIDIA A100-SXM4-80GB (this machine, fresh clone). New checkpoint.
**Status:** Pipeline ran end-to-end in **~11 hours**, under the 12 hr budget. All three engineering wins (vLLM gen_traces, LR-schedule fix, seq_len 8k) shipped. T=3 on-policy iteration ran cleanly per-round but did **not** improve over T=1 at this LoRA/data scale (within noise).

---

## Headline numbers

| Method | MATH-500 acc | Mean tokens (reasoning + response) |
|---|---|---|
| Paper Baseline (Qwen3-4B, verbal CoT) | 83.2 | 1087 |
| Smoke Baseline (prior repo, 2× A100-40GB) | 83.60 | 1067 |
| **This-run Baseline** (2× A100-80GB, vLLM 0.19.1) | **84.60** | **1045** |
| Paper Abstract-CoT (Warm-up) | 86.2 | 168 |
| Smoke Warm-up (T=1, N=5k, 1ep, LoRA, seq 2k, T=0.7, m_min=16) | 73.20 | 433 |
| **Validation** (T=1, N=500, 1ep, LoRA, seq 8k, T=0.7, m_min=16) | 73.40 | 558 |
| **This-run Warm-up** (T=3, N=5k, 1ep, LoRA, seq 8k, T=0.7, m_min=16) | **72.00** | **432** |

Reading:

- **Baseline reproduces paper.** vLLM 0.19.1 (downgraded from 0.20.2 to match CUDA 12.8) works correctly on this box.
- **T=3 did not beat T=1** at this scale: 72.0 vs 73.2/73.4, well within the noise of temp=0.7 abstract-trace sampling.
- **Mean total tokens dropped back to the smoke's level** (432, vs 558 for the seq-8k validation run; the smoke's 433 was at seq_len 2k), suggesting the on-policy traces *did* push the model toward shorter responses; accuracy just didn't lift.

---

## Hardware actually available

```
GPU 0: A100-SXM4-80GB  vol ECC unc: 0  matmul OK, vLLM OK, sustained 100% util for 11 hr clean
GPU 1: A100-SXM4-80GB  vol ECC unc: 0  matmul OK, vLLM OK, sustained 100% util for 11 hr clean
```

Both GPUs usable. CUDA 12.8 / driver 570.195.03. 1.4 TiB system RAM, 128 CPUs. 146 GB free overlay disk at start; ended at 88 GB used.
Compute per GPU is identical to the smoke's 40GB cards (same GA100 silicon). The 80GB lifts the seq_len cap and unblocks the full-FT path (not used here — we kept LoRA per the smoke's recommendation for this budget).

**vLLM TP**: Qwen3-4B has 32 attention heads → TP must divide 32. **TP=2 fits perfectly** on this 2-GPU box (no idle card during eval).

---

## What changed vs. the smoke

Listed in rough order of impact / engineering work.

### 1. **vLLM port for `gen_traces`** (biggest engineering win)

Replaced HF `model.generate()` + custom `LogitsProcessor` with vLLM `LLM.generate()` + `SamplingParams.allowed_token_ids` to enforce the V_abs ∪ {END_ABS} alphabet directly in the sampler. No custom logits processor needed — vLLM's `allowed_token_ids` does exactly this efficiently inside the kernels.

Measured throughput on N=5000:

| Mode | Prefix | max_model_len | Wall | Rate |
|---|---|---|---|---|
| Phase B teacher (no CoT) | ~150 tok | 4096 | **17 s** | **294/s** |
| Phase A teacher (with CoT) | ~5500 tok | 8192 | **887–891 s** | **5.6/s** |
| Smoke HF baseline | — | — | 11 min (=660 s) on 5k | 7.6/s |

vLLM speedup vs. the smoke's HF generate: **40× on the Phase B teacher**, **9× on the Phase A teacher** once prefix length is accounted for — the raw 5.6/s is not directly comparable to the smoke's 7.6/s, which was measured at seq_len 2048 on heavily truncated prefixes. The Phase A teacher path is prefill-dominated by its ~5500-token CoT prefixes; even at those lengths vLLM beats batched HF generate per prefill token.

### 2. **Cosine LR schedule bug — root-cause fix**

The smoke report described an LR curve that went `1e-4 → 5e-7 → bounce back to 1e-4`, but its diagnosis ("`total_steps` was computed before `accelerator.prepare()`") was wrong — the source code already computed `total_steps` after `prepare()`.

**Actual root cause:** `accelerator.prepare(sched)` returns an `AcceleratedScheduler` that, under default settings (`split_batches=False`, `step_with_optimizer=True`), advances the underlying scheduler `num_processes` times per `sched.step()` call.
With 2 GPUs, the cosine completes in half the calls, then bounces back to peak (the `get_cosine_schedule_with_warmup` function's `num_cycles=0.5` curve returns to max once `progress > 1`).

**Fix** in `src/train_phase_lora.py`:

```python
total_opt_steps = steps_per_epoch * args.epochs
total_steps = total_opt_steps * accelerator.num_processes  # NEW
sched = get_cosine_schedule_with_warmup(
    opt,
    num_warmup_steps=max(1, total_steps // 20),
    num_training_steps=total_steps,
)
```

**Verified** end-to-end. Round-3 Phase A train log: peak 1.0e-4 at step 5, monotonic cosine descent to 1.12e-8 at step 155, **no bounce-back**. Identical curve on every Phase A and Phase B of every round.

A standalone, GPU-free reproduction of both the bug and the fix is in this session's log: stepping a scheduler built with `num_training_steps=total_opt_steps` a total of `total_opt_steps × 2` times reproduces the bounce; building it with `total_opt_steps × 2` and stepping it the same number of times gives the clean curve. Also added a `train_log.json` "lrs" key alongside "losses" so future audits can verify offline.

### 3. **`max_len` 2048 → 8192**

The smoke truncated 98% of Dolci-Think CoTs from the right (median CoT is 18.8k tokens). At seq_len 8192, ~60% of CoTs fit fully and the rest only have their tail removed — meaningful reasoning makes it into the bottleneck.

Measured per-step time at seq_len 8192 on this box: **71.9 s/step** (vs. 12.7 s/step at seq_len 2048 in the smoke) — a 5.66× slowdown, dominated by self-attention now scaling quadratically over a longer window.

### 4. **T=1 → T=3 (full PI warm-up)**

The smoke only did one PI round (random Z̃ → bottleneck SFT → self-distill). This run did three: rounds 2 and 3 use on-policy Z̃ generated via constrained decoding from the previous round's model.
Per-round loss curves (Phase A `[bottleneck]`):

| Round | step 5 loss | step 155 loss | Notes |
|---|---|---|---|
| 1 | 3.49 | 0.85 | starts from random Z̃ — model is learning the bottleneck structure |
| 2 | 0.35 | 0.34 | Z̃ now carries signal — model converges fast |
| 3 | 0.27 | 0.35 | even cleaner start; the on-policy traces are doing what they should |

Phase B `[distill]` starting loss: 0.49 → 0.29 → 0.21 across rounds — same story. So the optimizer is clearly working with the on-policy bottleneck signal. The accuracy lift just didn't show up at this LoRA/data scale (see "Quality observation" below).

### 5. **Misc fixes**

- **Shell syntax bug** in `scripts/03_phase_a.sh`: the apostrophe inside `${OUT:?OUT must be the output dir for this phase's LoRA adapter}` opened an unterminated single-quoted string under bash 5.2. Replaced with apostrophe-free wording.
- **`max_model_len` too tight on Phase B teacher** (first production attempt): set to 1024, but some Dolci user prompts are 1.5–2.5k tokens. Validation at N=500 didn't sample the tail. Bumped to **3072 / 4096** (prefix / model_len), and added defensive left-truncation of X when even X alone exceeds the budget. Re-runs completed instantly.
- `run_smoke.sh` now accepts `DATA_FILE` and `SKIP_BASELINE` env overrides so the same 6k-row dolci file can serve both validation (N=500) and production (N=5000).
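The GPU-free LR-bug reproduction referenced in fix #2 can be sketched in pure stdlib Python. The function below is a hand-rolled stand-in for the `get_cosine_schedule_with_warmup` lambda (not the library code itself), and the `peak_lr`/`warmup`/`total` values are illustrative, not this run's exact step counts:

```python
import math

# Stand-in for transformers' cosine-with-warmup schedule lambda
# (re-implemented here so the repro needs no GPU and no libraries).
def cosine_lr(step, peak_lr=1e-4, warmup=5, total=100, num_cycles=0.5):
    if step < warmup:
        return peak_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    # progress is NOT clamped: past `total`, the half-cosine keeps going
    # and climbs back toward the peak -- the bounce the smoke observed.
    return peak_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))

# Bug: AcceleratedScheduler advances the wrapped scheduler num_processes
# times per .step(); with 2 GPUs a schedule sized for `total` calls is
# sampled 2*total times -> hits ~0 halfway through, then bounces to peak.
buggy = [cosine_lr(s, total=100) for s in range(200)]
# Fix: scale num_training_steps by num_processes so the doubled stepping
# lands exactly at the end of one half-cosine.
fixed = [cosine_lr(s, total=200) for s in range(200)]
```

Plotting `buggy` shows the `1e-4 → ~0 → back to 1e-4` shape from the smoke report; `fixed` is a single clean warmup-then-descent curve.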
---

## Per-stage wall times (this run)

| Stage | Per-occurrence | × T=3 |
|---|---|---|
| Phase A (`bottleneck`, 156 opt steps @ ~72 s/step, seq 8k) | **~2.94 hr** | **8.81 hr** |
| Phase B (`distill`, 156 opt steps @ ~6 s/step) | **~15.7 min** | **47 min** |
| gen_traces — Phase A teacher (vLLM TP=2, max_model_len=8192) | **~14.85 min** | ~29.7 min (rounds 2, 3 only) |
| gen_traces — Phase B teacher (vLLM TP=2, max_model_len=4096) | **17 s** | 51 s |
| Merge LoRA → full HF (CPU-bound write) | ~30 s | 3 min |
| Final eval (MATH-500, vLLM TP=2, T=0.7 abstract / T=0 answer) | — | **23 s** for the 3-stage decode |
| Pre-flight (extend Qwen3-4B, baseline calibration) | — | ~10 min |
| **Total wall** | | **~11h 0m** (04:07 → 15:08) |

Includes ~5 min of failed gen_traces + restart from the `max_model_len=1024` issue.

---

## Configuration used

```bash
RUNS_DIR=$PWD/runs \
DATA_FILE=$PWD/data/dolci_6000.jsonl \
SKIP_BASELINE=1 \
N=5000 T=3 EPOCHS=1 \
bash scripts/run_smoke.sh
```

With the in-script defaults:

- `MAX_LEN=8192` (Phase A + Phase B training cap)
- `MICRO_BATCH=1`, `GRAD_ACCUM=16`, **effective batch 32** (2 GPUs × 1 × 16)
- `LR=1e-4`, cosine schedule, 5% warmup
- LoRA: `r=32`, `alpha=64`, target `{q,k,v,o,gate,up,down}_proj`, `modules_to_save=["embed_tokens","lm_head"]` (842.9 M / 4.86 B = 17.3% trainable)
- Abstract eval: `m_min=16`, `m_max=128`, `abs_temp=0.7`, answer `temp=0.0`, `tp=2`

---

## Quality observation — why didn't T=3 help?

Per-round Phase A starting loss dropped 3.49 → 0.35 → 0.27, showing the on-policy abstract traces are doing what the paper says they should: they begin to carry signal from CoT through Z̃. But MATH-500 accuracy stayed at 72 ± 1.2 across T=1 and T=3.

Hypotheses, in rough order of credibility:

1. **LoRA caps the gain.** With ~17% trainable params and the embedding table dominating those, the model's base "answer-from-prompt" reflex is too strong for the bottleneck to redirect at this scale.
   The smoke report flagged this as the biggest gap to the paper, and our T=3 result is consistent.
2. **N=5000 is still tiny.** The paper used 600k; we used 5k. On-policy refinement needs enough novel `(x, c)` pairs to keep producing diverse `Z̃` shapes; at 5k the same examples just get revisited with marginally different traces.
3. **Eval stochasticity.** Abstract-trace decoding uses temp=0.7, and 1–2 pt variance between runs at N=500 is normal. The validation result (73.4) was likely on the lucky side of the noise band; production (72.0) is within noise of the smoke (73.2).
4. **seq_len 8k may have let too much CoT signal "leak" through Z̃ during teacher generation**, making `Z̃` less of a compression target. Counterintuitive — the smoke argued for longer seq — but the bottleneck quality is the *delta* between what reaches Z̃ and what Y can use directly. Worth ablating.

---

## What's next, ranked

1. **Full fine-tuning** instead of LoRA. With 2× 80GB, and `enforce_eager=True` to leave memory headroom for optimizer state, ZeRO-3 (no offload) becomes feasible on Qwen3-4B at seq_len 8k. Estimated ~16–20 hr at the current config; biggest expected lift.
2. **N → 30k–60k** (still T=3, still seq 8k, still LoRA). Roughly extrapolates to ~30–60 hr — out of a one-day budget but the right next step if we get a 2–3 day budget. The full-FT path at 5k would be more diagnostic per hour.
3. **More epochs.** Paper uses 3 epochs/phase; we used 1. 3× more wall time, but should help the Adam states settle.
4. **Re-eval the current `pi3_phaseB_merged` at multiple seeds and temperatures** to bound the eval stochasticity tighter; 5–10 min each.
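Hypothesis 3 above can be bounded with arithmetic alone: the binomial standard error of a single 500-problem accuracy number near 72–73% is about 2 points, so the validation-vs-production gap sits inside one sigma before even counting the temp=0.7 decode noise. A minimal stdlib check (the midpoint 0.725 is just the center of the observed range, not a measured value):

```python
import math

# Binomial standard error of one MATH-500 accuracy measurement:
# sqrt(p * (1 - p) / n) with n = 500 problems.
def acc_stderr(acc, n=500):
    return math.sqrt(acc * (1.0 - acc) / n)

se = acc_stderr(0.725)    # ~0.020, i.e. about 2 accuracy points
gap = 0.734 - 0.720       # validation warm-up vs production warm-up
# gap < se: the 1.4-pt spread is within one standard error of a single eval.
```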
---

## File layout (under `/workspace/ThinkingWithoutWordsRepro/`)

```
runs/
  baseline_math500.jsonl            # 84.60% (this-run baseline)
  abstract_math500_T3_N5000.jsonl   # 72.00% (final result)
  qwen3-4b-abs/
    base/                           # Qwen3-4B + V_abs (M=64) + delimiters
    pi1_phaseA/  pi1_phaseA_merged/ # round 1 LoRA + merged
    pi1_phaseB/  pi1_phaseB_merged/ # round 1 Phase B
    pi1_phaseB_teacher_traces.jsonl # on-policy Z̃ for round-1 self-distill
    pi2_phaseA_teacher_traces.jsonl # bottleneck teacher for round 2 (full-CoT)
    pi2_phaseA/  pi2_phaseA_merged/ # round 2
    pi2_phaseB_teacher_traces.jsonl
    pi2_phaseB/  pi2_phaseB_merged/
    pi3_phaseA_teacher_traces.jsonl
    pi3_phaseA/  pi3_phaseA_merged/
    pi3_phaseB_teacher_traces.jsonl
    pi3_phaseB/  pi3_phaseB_merged/ # ← FINAL warm-up model
data/
  math500.jsonl                     # 500 problems
  dolci_6000.jsonl                  # 6k filtered Dolci-Think examples (used N=5000 of them)
docs/
  20260510SMOKE_REPORT.md           # prior run on 2× A100-40GB
  20260511.md                       # this report
```

Train logs (`runs/qwen3-4b-abs/pi*/train_log.json`) include `losses`, `lrs`, `total_opt_steps`, `num_processes`, `wallclock_s` per phase — sufficient to plot and re-verify the LR fix offline.
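The offline re-verification the last paragraph describes can be a few lines of stdlib Python. Only the `lrs` key is from the log format above; the helper name and the hard-coded path in the usage comment are illustrative:

```python
import json

def lr_curve_is_clean(lrs, tol=1e-12):
    """True iff a logged LR curve has no bounce-back: after its peak,
    the curve must be monotonically non-increasing (pure cosine descent)."""
    peak_idx = max(range(len(lrs)), key=lambda i: lrs[i])
    tail = lrs[peak_idx:]
    return all(a >= b - tol for a, b in zip(tail, tail[1:]))

# Usage against a real run (path from the layout above):
# log = json.load(open("runs/qwen3-4b-abs/pi3_phaseA/train_log.json"))
# assert lr_curve_is_clean(log["lrs"])
```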