# Qwen 2.5 32B — Nemotron Fine-tuned
A full fine-tune of Qwen/Qwen2.5-32B on the Llama-Nemotron Post-Training Dataset. It adds step-by-step reasoning with `<think>` traces on math, code, and science problems.
## Training
| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-32B (32.5B parameters) |
| Method | Full fine-tune (all parameters, FSDP2 FULL_SHARD) |
| Dataset | Nemotron 90K samples (40K math, 40K code, 20K science) |
| Eval split | 10K held-out samples |
| Hardware | 2 nodes x 8 NVIDIA H200 (16 GPUs, 141 GB VRAM each) |
| Precision | BF16 |
| Batch size | 128 effective (4/GPU x 2 grad_accum x 16 GPUs) |
| Learning rate | 5e-5, cosine decay, 3% warmup |
| Optimizer | AdamW |
| Epochs | 1 |
| Sequence length | 4096 |
| Training time | 4h 55min |
| Final train loss | 0.403 (from 1.336, 70% drop) |
| Eval loss (step 500) | 0.412 (gap 0.009, no overfitting) |
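The schedule arithmetic above can be sanity-checked: with 90K samples and an effective batch of 128, one epoch is about 704 optimizer steps, so a 3% warmup is roughly 21 steps. A minimal sketch (step counts are derived from the table, not taken from the training logs):

```python
import math

samples = 90_000      # Nemotron training split
per_gpu_batch = 4
grad_accum = 2
num_gpus = 16

effective_batch = per_gpu_batch * grad_accum * num_gpus  # 128
steps_per_epoch = math.ceil(samples / effective_batch)   # 704
warmup_steps = round(0.03 * steps_per_epoch)             # ~21

print(effective_batch, steps_per_epoch, warmup_steps)
```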
## Generation Comparison (20 eval samples)
| Metric | Base | Fine-tuned | Delta |
|---|---|---|---|
| ROUGE-L | 0.063 | 0.111 | +76% |
| Reasoning traces | 0% | 100% | +100% |
| Math ROUGE-L | 0.118 | 0.171 | +45% |
| Science ROUGE-L | 0.094 | 0.176 | +87% |
| Code ROUGE-L | 0.015 | 0.050 | +233% |
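ROUGE-L here is the standard LCS-based F-measure. A minimal reference implementation for context (whitespace tokenization assumed; a scoring library such as `rouge-score` additionally applies stemming and its own tokenizer, so exact values will differ):

```python
def lcs_len(a, b):
    # Classic dynamic-programming longest-common-subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate: str, reference: str) -> float:
    # F-measure over the LCS of whitespace tokens.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

score = rouge_l("the cat sat on the mat", "the cat lay on the mat")
print(score)  # 5 shared-subsequence tokens out of 6 on each side -> 5/6
```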
## General Capability Benchmarks
Evaluated with lm-evaluation-harness v0.4.11.
| Benchmark | Base | Fine-tuned | Delta (pts) |
|---|---|---|---|
| MMLU | 80.7% | 79.5% | -1.2 |
| HellaSwag | 84.1% | 83.4% | -0.7 |
| Winogrande | 75.8% | 74.8% | -1.0 |

Average benchmark delta: -1.0 points. General knowledge is preserved.
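The deltas are absolute percentage points, recomputable directly from the base and fine-tuned columns:

```python
scores = {  # benchmark: (base, fine-tuned), in percent, copied from the table
    "MMLU": (80.7, 79.5),
    "HellaSwag": (84.1, 83.4),
    "Winogrande": (75.8, 74.8),
}
deltas = {name: round(ft - base, 1) for name, (base, ft) in scores.items()}
avg_delta = round(sum(deltas.values()) / len(deltas), 1)
print(deltas, avg_delta)
```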
## Evaluation Interpretation
What improved: The fine-tuned model produces step-by-step reasoning (`<think>` traces) on 100% of math, code, and science problems; the base model never produces reasoning traces. ROUGE-L similarity to the expected Nemotron outputs improved 76% overall, with the largest gains on science (+87%) and code (+233%).
What stayed stable: General knowledge benchmarks (MMLU, HellaSwag, Winogrande) drop by at most 1.2 points, within statistical noise. The model retains its broad capabilities while gaining domain-specific reasoning.
Overfitting check: Train loss (0.403) and eval loss (0.412) differ by only 0.009, so the model generalizes well to unseen data. This is expected with 1-epoch training on 90K diverse samples.
Trade-off: Full fine-tuning updates all 32.5B parameters, which gives maximum reasoning improvement but can cause small regressions on some benchmarks. For use cases requiring zero regression, LoRA fine-tuning (updating only 0.6% of parameters) is recommended.
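For readers weighing that alternative, a PEFT LoRA configuration would look roughly like the sketch below. The rank, alpha, and target modules here are illustrative assumptions for a Qwen2.5-style architecture, not the settings behind the 0.6% figure quoted above:

```python
from peft import LoraConfig

# Hypothetical adapter settings -- rank/alpha/dropout/targets are
# illustrative, not a tested configuration for this model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

The config would then be applied with `peft.get_peft_model(model, lora_config)` before training, leaving the base weights frozen.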
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "mahernaija/qwen25-32b-nemotron-finetuned",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mahernaija/qwen25-32b-nemotron-finetuned")

# Build a chat prompt and generate a response with a reasoning trace.
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
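Since the model wraps its reasoning in `<think>` tags, the trace can be split from the final answer with a small helper. This assumes a single `<think>…</think>` block per response, which may not hold for every prompt (and if `<think>` is registered as a special token, decode with `skip_special_tokens=False` so it survives decoding):

```python
import re

def split_reasoning(text: str):
    """Return (reasoning, answer); reasoning is None if no <think> block is found."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return match.group(1).strip(), answer

reasoning, answer = split_reasoning(
    "<think>Assume sqrt(2) = p/q in lowest terms.</think>Therefore sqrt(2) is irrational."
)
print(reasoning)  # Assume sqrt(2) = p/q in lowest terms.
print(answer)     # Therefore sqrt(2) is irrational.
```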
## License
Apache 2.0 (same as base model)