Instructions to use leapeto/Qwen3-4B-AbstractCoT-warmup with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use leapeto/Qwen3-4B-AbstractCoT-warmup with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="leapeto/Qwen3-4B-AbstractCoT-warmup")

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("leapeto/Qwen3-4B-AbstractCoT-warmup", dtype="auto")

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use leapeto/Qwen3-4B-AbstractCoT-warmup with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "leapeto/Qwen3-4B-AbstractCoT-warmup"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "leapeto/Qwen3-4B-AbstractCoT-warmup",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/leapeto/Qwen3-4B-AbstractCoT-warmup

SGLang

How to use leapeto/Qwen3-4B-AbstractCoT-warmup with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "leapeto/Qwen3-4B-AbstractCoT-warmup" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "leapeto/Qwen3-4B-AbstractCoT-warmup",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "leapeto/Qwen3-4B-AbstractCoT-warmup" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "leapeto/Qwen3-4B-AbstractCoT-warmup",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use leapeto/Qwen3-4B-AbstractCoT-warmup with Docker Model Runner:
```
docker model run hf.co/leapeto/Qwen3-4B-AbstractCoT-warmup
```

Qwen3-4B-AbstractCoT-warmup / docs /20260511.md

leapeto

Add files using upload-large-folder tool

a555798 verified 3 days ago

preview code

raw

history blame contribute delete

12.1 kB

Abstract-CoT (arXiv:2604.22709v2) — Production Run on Qwen3-4B

Date: 2026-05-11 Scope: Full T=3 PI warm-up at N=5000, seq_len=8192, LoRA — comparison target is the same paper's "Abstract-CoT (Warm-up)" row in Table 1. Hardware: 2× NVIDIA A100-SXM4-80GB (this machine, fresh clone). New checkpoint. Status: Pipeline ran end-to-end in ~11 hours, under the 12 hr budget. All three engineering wins (vLLM gen_traces, LR-schedule fix, seq_len 8k) shipped. T=3 on-policy iteration ran cleanly per-round but did not improve over T=1 at this LoRA/data scale (within noise).

Headline numbers

Method	MATH-500 acc	Mean tokens (reasoning + response)
Paper Baseline (Qwen3-4B, verbal CoT)	83.2	1087
Smoke Baseline (prior repo, 2× A100-40GB)	83.60	1067
This-run Baseline (2× A100-80GB, vLLM 0.19.1)	84.60	1045
Paper Abstract-CoT (Warm-up)	86.2	168
Smoke Warm-up (T=1, N=5k, 1ep, LoRA, seq 2k, T=0.7, m_min=16)	73.20	433
Validation (T=1, N=500, 1ep, LoRA, seq 8k, T=0.7, m_min=16)	73.40	558
This-run Warm-up (T=3, N=5k, 1ep, LoRA, seq 8k, T=0.7, m_min=16)	72.00	432

Reading:

Baseline reproduces paper. vLLM 0.19.1 (downgraded from 0.20.2 to match CUDA 12.8) works correctly on this box.
T=3 did not beat T=1 at this scale: 72.0 vs 73.2/73.4, well within the noise of temp=0.7 abstract-trace sampling.
Mean total tokens dropped (432 vs smoke 433 and validation 558), suggesting the on-policy traces did push the model toward shorter responses; accuracy just didn't lift.

Hardware actually available

GPU 0: A100-SXM4-80GB  vol ECC unc: 0  matmul OK, vLLM OK, sustained 100% util for 11 hr clean
GPU 1: A100-SXM4-80GB  vol ECC unc: 0  matmul OK, vLLM OK, sustained 100% util for 11 hr clean

Both GPUs usable. CUDA 12.8 / driver 570.195.03. 1.4 TiB system RAM, 128 CPUs. 146 GB free overlay disk at start; ended at 88 GB used.

Compute per GPU is identical to the smoke's 40GB cards (same GA100 silicon). The 80GB lifts the seq_len cap and unblocks the full-FT path (not used here — we kept LoRA per the smoke's recommendation for this budget).

vLLM TP: Qwen3-4B has 32 attention heads → TP must divide 32. TP=2 fits perfectly on this 2-GPU box (no idle card during eval).

What changed vs. the smoke

Listed in rough order of impact / engineering work.

1. vLLM port for `gen_traces` (biggest engineering win)

Replaced HF model.generate() + custom LogitsProcessor with vLLM LLM.generate() + SamplingParams.allowed_token_ids to enforce the V_abs ∪ {END_ABS} alphabet directly in the sampler. No custom logits processor needed — vLLM's allowed_token_ids does exactly this efficiently inside the kernels.

Measured throughput on N=5000:

Mode	Prefix	max_model_len	Wall	Rate
Phase B teacher (no CoT)	~150 tok	4096	17 s	294/s
Phase A teacher (with CoT)	~5500 tok	8192	887–891 s	5.6/s
Smoke HF baseline	—	—	11 min (=660 s) on 5k	7.6/s

vLLM speedup: 40× on Phase B teacher, 9× on Phase A teacher vs. the smoke's HF generate. The Phase A teacher path is prefill-dominated by 7800-token CoT prefixes; even there vLLM beats batched HF generate.

2. Cosine LR schedule bug — root-cause fix

The smoke report described an LR curve that went 1e-4 → 5e-7 → bounce back to 1e-4, but its diagnosis ("total_steps was computed before accelerator.prepare()") was wrong — the source code already computed total_steps after prepare().

Actual root cause: accelerator.prepare(sched) returns an AcceleratedScheduler that, under default settings (split_batches=False, step_with_optimizer=True), advances the underlying scheduler num_processes times per sched.step() call. With 2 GPUs, the cosine completes in half the calls, then bounces back to peak (the get_cosine_schedule_with_warmup function's num_cycles=0.5 curve returns to max once progress > 1).

Fix in src/train_phase_lora.py:

total_opt_steps = steps_per_epoch * args.epochs
total_steps = total_opt_steps * accelerator.num_processes   # NEW
sched = get_cosine_schedule_with_warmup(
    opt, num_warmup_steps=max(1, total_steps // 20),
    num_training_steps=total_steps,
)

Verified end-to-end. Round-3 Phase A train log: peak 1.0e-4 at step 5, monotonic cosine descent to 1.12e-8 at step 155, no bounce-back. Identical curve on every Phase A and Phase B of every round.

A standalone reproduction of both the bug and the fix (no GPU needed) is in this session's log; can be reconstructed by instantiating the scheduler with total_steps = total_opt_steps * 2 and stepping it total_opt_steps × 2 times.

Also added train_log.json "lrs" key alongside "losses" so future audits can verify offline.

3. `max_len` 2048 → 8192

The smoke truncated 98% of Dolci-Think CoTs from the right (median CoT is 18.8k tokens). At seq_len 8192, ~60% of CoTs fit fully and the rest only have their tail removed — meaningful reasoning makes it into the bottleneck. Measured per-step time at seq_len 8192 on this box: 71.9 s/step (vs. 12.7 s/step at seq_len 2048 in the smoke) — a 5.66× slowdown, dominated by self-attention now scaling quadratically over a longer window.

4. T=1 → T=3 (full PI warm-up)

The smoke only did one PI round (random Z̃ → bottleneck SFT → self-distill). This run did three: round 2 and round 3 use on-policy Z̃ generated via constrained decoding from the previous round's model. Per-round loss curves (Phase A [bottleneck]):

Round	step 5 loss	step 155 loss	Notes
1	3.49	0.85	starts from random Z̃ — model is learning the bottleneck structure
2	0.35	0.34	Z̃ now carries signal — model converges fast
3	0.27	0.35	even cleaner start; the on-policy traces are doing what they should

Phase B [distill] starting loss: 0.49 → 0.29 → 0.21 across rounds — same story.

So the optimizer is clearly working with the on-policy bottleneck signal. The accuracy lift just didn't show up at this LoRA/data scale (see "Quality observation" below).

5. Misc fixes

Shell syntax bug in scripts/03_phase_a.sh: the apostrophe inside ${OUT:?OUT must be the output dir for this phase's LoRA adapter} opened an unterminated single-quoted string under bash 5.2. Replaced with apostrophe-free wording.
max_model_len too tight on Phase B teacher (first production attempt): set to 1024, but some Dolci user prompts are 1.5–2.5k tokens. Validation at N=500 didn't sample the tail. Bumped to 3072 / 4096 (prefix / model_len), and added defensive left-truncation of X when even X alone exceeds the budget. Re-runs completed instantly.
run_smoke.sh now accepts DATA_FILE and SKIP_BASELINE env overrides so the same 6k-row dolci file can serve both validation (N=500) and production (N=5000).

Per-stage wall times (this run)

Stage	Per-occurrence	× T=3
Phase A (`bottleneck`, 156 opt steps @ ~72 s/step, seq 8k)	~2.94 hr	8.81 hr
Phase B (`distill`, 156 opt steps @ ~6 s/step)	~15.7 min	47 min
gen_traces — Phase A teacher (vLLM TP=2, max_model_len=8192)	~14.85 min	~29.7 min (rounds 2, 3 only)
gen_traces — Phase B teacher (vLLM TP=2, max_model_len=4096)	17 s	51 s
Merge LoRA → full HF (CPU-bound write)	~30 s	3 min
Final eval (MATH-500, vLLM TP=2, T=0.7 abstract / T=0 answer)	—	23 s for the 3-stage decode
Pre-flight (extend Qwen3-4B, baseline calibration)	—	~10 min
Total wall		~11h 0m (04:07 → 15:08)

Includes ~5 min of failed gen_traces + restart from the max_model_len=1024 issue.

Configuration used

RUNS_DIR=$PWD/runs \
DATA_FILE=$PWD/data/dolci_6000.jsonl \
SKIP_BASELINE=1 \
N=5000 T=3 EPOCHS=1 \
bash scripts/run_smoke.sh

With the in-script defaults:

MAX_LEN=8192 (Phase A + Phase B training cap)
MICRO_BATCH=1, GRAD_ACCUM=16, effective batch 32 (2 GPUs × 1 × 16)
LR=1e-4, cosine schedule, 5% warmup
LoRA: r=32, alpha=64, target {q,k,v,o,gate,up,down}_proj, modules_to_save=["embed_tokens","lm_head"] (842.9 M / 4.86 B = 17.3% trainable)
Abstract eval: m_min=16, m_max=128, abs_temp=0.7, answer temp=0.0, tp=2

Quality observation — why didn't T=3 help?

Per-round Phase A starting loss dropped 3.49 → 0.35 → 0.27, showing the on-policy abstract traces are doing what the paper says they should: they begin to carry signal from CoT through Z̃. But MATH-500 accuracy stayed at 72 ± 1.2 across T=1 and T=3.

Hypotheses, in rough order of credibility:

LoRA caps the gain. With ~17% trainable params and the embedding table dominating those, the model's base "answer-from-prompt" reflex is too strong for the bottleneck to redirect at this scale. The smoke report flagged this as the biggest gap to the paper, and our T=3 result is consistent.
N=5000 is still tiny. The paper used 600k; we used 5k. On-policy refinement needs enough novel (x, c) pairs to keep producing diverse Z̃ shapes; at 5k the same examples just get revisited with marginally different traces.
Eval stochasticity. Abstract trace decode uses temp=0.7. 1–2 pt variance between runs at N=500 is normal. The validation result (73.4) was likely a lucky upper; production (72.0) is within noise of the smoke (73.2).
seq_len 8k may have let too much CoT signal "leak" through Z̃ during teacher generation, making Z̃ less of a compression target. Counterintuitive — the smoke argued for longer seq — but the bottleneck quality is the delta between what reaches Z̃ and what Y can use directly. Worth ablating.

What's next, ranked

Full fine-tuning instead of LoRA. With 2× 80GB and enforce_eager=True for the optimizer side, ZeRO-3 (no offload) becomes feasible on Qwen3-4B at seq_len 8k. Estimated ~16-20 hr at the current config; biggest expected lift.
N → 30k–60k (still T=3, still seq 8k, still LoRA). Roughly extrapolates to ~30–60 hr — out of a one-day budget but the right next step if we get a 2–3 day budget. The full-FT path at 5k would be more diagnostic per hour.
More epochs. Paper uses 3 epochs/phase; we used 1. 3× more wall but should help the Adam states settle.
Re-eval the current pi3_phaseB_merged at multiple seeds + temperatures to bound the eval stochasticity tighter. 5–10 min each.

File layout (under `/workspace/ThinkingWithoutWordsRepro/`)

runs/
  baseline_math500.jsonl                       # 84.60% (this-run baseline)
  abstract_math500_T3_N5000.jsonl              # 72.00% (final result)
  qwen3-4b-abs/
    base/                                      # Qwen3-4B + V_abs (M=64) + delimiters
    pi1_phaseA/  pi1_phaseA_merged/            # round 1 LoRA + merged
    pi1_phaseB/  pi1_phaseB_merged/            # round 1 Phase B
    pi1_phaseB_teacher_traces.jsonl            # on-policy Z̃ for round-1 self-distill
    pi2_phaseA_teacher_traces.jsonl            # bottleneck teacher for round 2 (full-CoT)
    pi2_phaseA/  pi2_phaseA_merged/            # round 2
    pi2_phaseB_teacher_traces.jsonl
    pi2_phaseB/  pi2_phaseB_merged/
    pi3_phaseA_teacher_traces.jsonl
    pi3_phaseA/  pi3_phaseA_merged/
    pi3_phaseB_teacher_traces.jsonl
    pi3_phaseB/  pi3_phaseB_merged/            # ← FINAL warm-up model

data/
  math500.jsonl                                # 500 problems
  dolci_6000.jsonl                             # 6k filtered Dolci-Think examples (used N=5000 of them)

docs/
  20260510SMOKE_REPORT.md                      # prior run on 2× A100-40GB
  20260511.md                                  # this report

Train logs (runs/qwen3-4b-abs/pi*/train_log.json) include losses, lrs, total_opt_steps, num_processes, wallclock_s per phase — sufficient to plot and re-verify the LR fix offline.