Qwen 2.5 32B — Nemotron Fine-tuned

Full fine-tune of Qwen/Qwen2.5-32B on the Llama-Nemotron Post-Training Dataset. Adds step-by-step reasoning with <think> traces on math, code, and science problems.

Training

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-32B (32.5B parameters) |
| Method | Full fine-tune (all parameters, FSDP2 FULL_SHARD) |
| Dataset | Nemotron, 90K samples (40K math, 40K code, 20K science) |
| Eval split | 10K held-out samples |
| Hardware | 2 nodes × 8 NVIDIA H200 (16 GPUs, 141 GB VRAM each) |
| Precision | BF16 |
| Batch size | 128 effective (4/GPU × 2 grad accum × 16 GPUs) |
| Learning rate | 5e-5, cosine decay, 3% warmup |
| Optimizer | AdamW |
| Epochs | 1 |
| Sequence length | 4096 |
| Training time | 4h 55min |
| Final train loss | 0.403 (down from 1.336, a 70% drop) |
| Eval loss (step 500) | 0.412 (gap 0.009, no overfitting) |
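The schedule above can be sanity-checked with quick arithmetic (a sketch; the exact optimizer-step count depends on dataloader drop-last behavior, which is not stated here):

```python
# Back-of-the-envelope check of the training schedule from the table above.
per_gpu_batch = 4
grad_accum = 2
num_gpus = 16
effective_batch = per_gpu_batch * grad_accum * num_gpus  # 128 samples/step

train_samples = 90_000
total_steps = train_samples // effective_batch   # ~703 optimizer steps for 1 epoch
warmup_steps = round(0.03 * total_steps)         # 3% warmup ≈ 21 steps

print(effective_batch, total_steps, warmup_steps)
```

This also shows why the eval at step 500 sits deep into training: roughly 70% of the single epoch is complete by then.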

Generation Comparison (20 eval samples)

| Metric | Base | Fine-tuned | Delta |
|---|---|---|---|
| ROUGE-L (overall) | 0.063 | 0.111 | +76% |
| Reasoning traces | 0% | 100% | +100% |
| Math ROUGE-L | 0.118 | 0.171 | +45% |
| Science ROUGE-L | 0.094 | 0.176 | +87% |
| Code ROUGE-L | 0.015 | 0.050 | +233% |
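For reference, ROUGE-L measures longest-common-subsequence overlap between generated and expected text. A minimal, self-contained sketch of the token-level F1 variant (real evaluations typically use a library such as rouge_score, which adds stemming and tokenization details this toy version omits):

```python
# Toy ROUGE-L F1: longest common subsequence over whitespace tokens.
def lcs_len(a, b):
    # Classic dynamic-programming LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the square root of 2 is irrational",
                 "sqrt of 2 is irrational"))  # ~0.667
```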

General Capability Benchmarks

Evaluated with lm-evaluation-harness v0.4.11.

| Benchmark | Base | Fine-tuned | Delta |
|---|---|---|---|
| MMLU | 80.7% | 79.5% | -1.2% |
| HellaSwag | 84.1% | 83.4% | -0.7% |
| Winogrande | 75.8% | 74.8% | -0.9% |

Average benchmark delta: -0.9 percentage points. General knowledge is preserved.

Evaluation Interpretation

What improved: The fine-tuned model produces step-by-step reasoning (<think> traces) on 100% of math, code, and science problems. The base model never produces reasoning traces. ROUGE-L similarity to expected Nemotron outputs improved 76% overall, with the largest gains on science (+87%) and code (+233%).

What stayed stable: General knowledge benchmarks (MMLU, HellaSwag, Winogrande) each shift by at most 1.2 percentage points, within typical run-to-run noise. The model retains its broad capabilities while gaining domain-specific reasoning.

Overfitting check: Train loss (0.403) and eval loss (0.412) differ by only 0.009, so the model generalizes well to unseen data. This is expected with single-epoch training on 90K diverse samples.

Trade-off: Full fine-tuning updates all 32.5B parameters, which yields the largest reasoning improvement but can cause small regressions on some benchmarks. For use cases where such regressions are unacceptable, LoRA fine-tuning (which updates only about 0.6% of the parameters) is recommended instead.
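For scale, here is a rough estimate of LoRA's trainable-parameter fraction. The layer shapes, rank, and target modules below are illustrative assumptions, not the exact Qwen2.5-32B configuration or the setup behind the 0.6% figure:

```python
# LoRA adds two low-rank matrices per adapted linear layer:
# A (rank x in_features) and B (out_features x rank).
dims = [  # assumed per-layer linear shapes (illustrative only)
    (5120, 5120), (5120, 1024), (5120, 1024), (5120, 5120),  # q, k, v, o
    (5120, 27648), (5120, 27648), (27648, 5120),             # gate, up, down
]
rank, layers, total = 32, 64, 32.5e9
trainable = layers * sum(rank * (i + o) for i, o in dims)
print(f"~{trainable / 1e6:.0f}M trainable params, {trainable / total:.2%} of the model")
```

Depending on rank and which modules are adapted, this typically lands well under 1% of the model's weights, which is why LoRA disturbs the base model far less than a full fine-tune.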

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# BF16 weights for a 32.5B model need roughly 65 GB of GPU memory;
# device_map="auto" shards the model across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "mahernaija/qwen25-32b-nemotron-finetuned",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mahernaija/qwen25-32b-nemotron-finetuned")

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
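The fine-tuned model emits its reasoning inside <think> tags before the final answer. A minimal sketch for separating the trace from the answer (assumes well-formed, non-nested tags):

```python
import re

def split_think(text):
    # Returns (reasoning, answer); reasoning is None if no <think> block is found.
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_think(
    "<think>Assume sqrt(2) = p/q in lowest terms ...</think>Therefore sqrt(2) is irrational."
)
print(answer)  # Therefore sqrt(2) is irrational.
```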

License

Apache 2.0 (same as base model)
