---
license: apache-2.0
library_name: transformers
base_model: Qwen/Qwen3-4B
pipeline_tag: text-generation
tags:
- abstract-cot
- latent-reasoning
- math-reasoning
- qwen3
datasets:
- HuggingFaceH4/MATH-500
- allenai/Dolci-Think-SFT-7B
---

# Qwen3-4B-AbstractCoT-warmup

Qwen3-4B fine-tuned with the **Abstract Chain-of-Thought (Abstract-CoT)** warm-up procedure from "[Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought](https://arxiv.org/abs/2604.22709v2)" (Ramji, Naseem, Fernandez Astudillo, IBM Research AI, 2026).

The model is taught to compress its reasoning into a short sequence (~16–22 tokens) drawn from a reserved 64-symbol *abstract vocabulary* `V_abs`, used as a discrete latent scratchpad before emitting the answer.

```
prompt ─► z_1 ... z_m  answer
└─────── z̃ ∈ V_abs^m, m ≤ 128 ───────┘
```

This is the SFT half of the paper only — no RL stage. The comparison row is the paper's "Abstract-CoT (Warm-up)" line in Table 1.

## Headline result

| Run | MATH-500 acc (%) | Mean tokens |
|---|---|---|
| Paper baseline (Qwen3-4B verbal CoT) | 83.2 | 1087 |
| **Our baseline** (Qwen3-4B verbal CoT, this hardware) | 84.60 | 1045 |
| Paper Abstract-CoT warm-up | 86.2 | 168 |
| **This model** (T=3 PI, N=5k, 1 epoch, LoRA, seq 8k) | **72.00** | **432** |

The accuracy gap to the paper's 86.2 is driven by reduced data scale (5k vs 600k examples), LoRA vs full fine-tuning, and 1 vs 3 epochs per phase. See `docs/20260511_reader.md` for a full discussion.

## Repository layout

```
final/           ← end-of-round-3 merged model (THE warm-up checkpoint)
round2/          ← end-of-round-2 merged model
round1/          ← end-of-round-1 merged model
adapters/        ← all 6 LoRA adapters (pi{1,2,3}_phase{A,B})
results/         ← per-example eval JSONL (baseline + abstract)
teacher_traces/  ← on-policy V_abs traces used as Phase B/A teachers
train_logs/      ← per-phase loss + LR curves (verifies cosine fix)
docs/            ← run reports (technical + reader-oriented)
```

## How it was trained

Three policy-iteration rounds, each with two phases:

- **Phase A — Bottleneck SFT.** Train on `[prompt; verbal-CoT; z̃; answer]` with the answer blocked from attending to the verbal CoT, forcing all CoT→answer signal through `z̃` (a mask sketch follows this section).
  - Round 1: `z̃` is random V_abs tokens.
  - Rounds 2+: `z̃` is sampled on-policy from the previous round's model.
- **Phase B — Self-distillation.** Train on `[prompt; z̃; answer]` with standard causal attention, where `z̃` is now generated from the prompt alone.

Training config:

- Base: `Qwen/Qwen3-4B`, extended with V_abs (M=64) plus the abstract-start/-end special tokens (151 669 → 151 735 tokens).
- LoRA r=32, α=64 on attention + MLP projections. Embedding table + LM head trained fully, so the new abstract-vocab rows can move freely. 842.9 M of 4.86 B parameters trainable (17.3%). Setup sketched after this list.
- Data: 5 000 examples from `allenai/Dolci-Think-SFT-7B`, filtered to assistant messages whose thinking blocks are ≥ 200 chars.
- max_len 8192, batch 32, lr 1e-4, cosine schedule, 5% warmup.
- 2× A100-SXM4-80GB, ~11 hours wall-clock.
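The repo's actual Phase A collator is not reproduced in this card; the following is a minimal PyTorch sketch of the bottleneck idea only, with segment sizes and function names of our own choosing. It builds a causal attention mask in which answer positions cannot attend to verbal-CoT positions, so the only CoT→answer path is the abstract trace `z̃`.

```python
import torch

def phase_a_mask(n_prompt, n_cot, n_abs, n_answer):
    """Boolean (T, T) attention mask for one Phase A example:
    causal everywhere, except answer queries may not attend to
    verbal-CoT keys, so CoT information must route through z̃."""
    T = n_prompt + n_cot + n_abs + n_answer
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))   # standard causal mask
    cot = slice(n_prompt, n_prompt + n_cot)                  # verbal-CoT keys
    ans = slice(n_prompt + n_cot + n_abs, T)                 # answer queries
    mask[ans, cot] = False                                   # block CoT -> answer
    return mask

# Toy example: 4 prompt, 6 CoT, 3 abstract, 5 answer tokens
m = phase_a_mask(4, 6, 3, 5)
assert m[17, 2] and not m[17, 7]   # answer sees the prompt, not the verbal CoT

# If your attention implementation expects an additive float mask instead:
additive = torch.zeros(m.shape).masked_fill(~m, float("-inf"))
```

Phase B drops the verbal-CoT segment entirely, so it uses the ordinary causal mask.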
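The vocabulary extension and LoRA setup described above can likewise be sketched with `transformers` + `peft`. The token strings below are placeholders, not the literal tokens this repo uses (those come from `scripts/01_extend_model.sh`); the hyperparameters mirror the bullets above, and `modules_to_save` is what makes the embedding table and LM head fully trainable rather than LoRA-adapted.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")

# 64 abstract symbols + 2 span delimiters; PLACEHOLDER strings only.
new_tokens = [f"<ABS_{i}>" for i in range(64)] + ["<ABS_START>", "<ABS_END>"]
tok.add_tokens(new_tokens, special_tokens=True)
model.resize_token_embeddings(len(tok))   # give the new token ids embedding / LM-head rows

lora_cfg = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
                    "gate_proj", "up_proj", "down_proj"],     # MLP projections
    modules_to_save=["embed_tokens", "lm_head"],  # trained fully, not via LoRA
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```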
## Using the model

### Inference (vLLM, recommended)

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
from huggingface_hub import snapshot_download

# Download the final checkpoint
model_path = snapshot_download(
    "leapeto/Qwen3-4B-AbstractCoT-warmup",
    allow_patterns=["final/*"],
)
tok = AutoTokenizer.from_pretrained(f"{model_path}/final", trust_remote_code=True)

# Abstract token ids (the literal strings are the 64 abstract-vocab
# entries added to the checkpoint's tokenizer)
abs_tokens = []
for i in range(64):
    if i < 26:
        abs_tokens.append(f"")
    else:
        j = i - 26
        abs_tokens.append(f"")
end_id = tok.convert_tokens_to_ids("")   # abstract-end special token
abs_ids = tok.convert_tokens_to_ids(abs_tokens)
allowed = list(set(abs_ids + [end_id]))

llm = LLM(model=f"{model_path}/final", tensor_parallel_size=2,
          dtype="bfloat16", trust_remote_code=True)

# Two-stage decode: (1) constrained abstract trace, (2) unconstrained answer
prompt = "What is the integral of x^2 from 0 to 1? Put your final answer in \\boxed{}."
messages = [
    {"role": "system", "content": "Please reason step by step, and put your final answer within \\boxed{}."},
    {"role": "user", "content": prompt},
]
prefix = tok.apply_chat_template(messages, tokenize=False,
                                 add_generation_prompt=True, enable_thinking=False)
prefix += ""   # abstract-start special token

# Stage 1: V_abs only, stop at the abstract-end token
sp1 = SamplingParams(temperature=0.7, max_tokens=128, allowed_token_ids=allowed,
                     stop_token_ids=[end_id], skip_special_tokens=False)
abstract = llm.generate([prefix], sp1)[0].outputs[0].text
prompt2 = prefix + abstract + "\n"

# Stage 2: unconstrained answer
sp2 = SamplingParams(temperature=0.0, max_tokens=2048)
answer = llm.generate([prompt2], sp2)[0].outputs[0].text
print(answer)
```

### Loading the LoRA adapters (peft)

If you want to inspect individual round outputs without downloading the merged models:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM
from huggingface_hub import snapshot_download

# You'll need the extended base model first — produce it locally via scripts/01_extend_model.sh,
# OR start from one of our merged checkpoints and load a later adapter on top.
base = AutoModelForCausalLM.from_pretrained("path/to/extended/base", trust_remote_code=True)

adapter_path = snapshot_download(
    "leapeto/Qwen3-4B-AbstractCoT-warmup",
    allow_patterns=["adapters/pi3_phaseB/*"],
)
model = PeftModel.from_pretrained(base, f"{adapter_path}/adapters/pi3_phaseB")
```

## Files of interest

| File | What |
|---|---|
| `final/` | End-of-round-3 merged model. **This is the main artifact.** |
| `round1/`, `round2/` | Intermediate merged models for studying the T=1 → T=2 → T=3 progression |
| `adapters/pi{1,2,3}_phase{A,B}/` | LoRA-only checkpoints from each phase |
| `results/baseline_math500.jsonl` | Qwen3-4B verbal-CoT eval (84.60% / 1045 tok) |
| `results/abstract_math500_T3_N5000.jsonl` | This model's eval (72.00% / 432 tok) |
| `train_logs/*.json` | Per-step loss + LR curves for each phase |
| `docs/20260511.md` | Technical report (full breakdown) |
| `docs/20260511_reader.md` | Reader-oriented report (concepts + reasoning) |

## Citation

```bibtex
@article{ramji2026thinking,
  title={Thinking Without Words: Efficient Latent Reasoning with Abstract Chain-of-Thought},
  author={Ramji, Keshav and Naseem, Tahira and Fernandez Astudillo, Ramón},
  journal={arXiv preprint arXiv:2604.22709},
  year={2026}
}
```