Qwen 2.5 32B — Nemotron Fine-tuned

Full fine-tune of Qwen/Qwen2.5-32B on the Llama-Nemotron Post-Training Dataset. Adds step-by-step reasoning with <think> traces on math, code, and science problems.

Training

| Parameter | Value |
|---|---|
| Base model | Qwen/Qwen2.5-32B (32.5B parameters) |
| Method | Full fine-tune (all parameters, FSDP2 FULL_SHARD) |
| Dataset | Nemotron, 90K samples (40K math, 40K code, 20K science) |
| Eval split | 10K held-out samples |
| Hardware | 2 nodes × 8 NVIDIA H200 (16 GPUs, 141 GB VRAM each) |
| Precision | BF16 |
| Batch size | 128 effective (4/GPU × 2 grad accum × 16 GPUs) |
| Learning rate | 5e-5, cosine decay, 3% warmup |
| Optimizer | AdamW |
| Epochs | 1 |
| Sequence length | 4096 |
| Training time | 4h 55min |
| Final train loss | 0.403 (down from 1.336, a 70% drop) |
| Eval loss (step 500) | 0.412 (gap 0.009, no overfitting) |
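The schedule above can be sanity-checked with quick arithmetic (a sketch; the exact optimizer-step count depends on dataloader drop-last behavior, which is not stated here):

```python
# Back-of-the-envelope check of the training schedule from the table above.
per_gpu_batch = 4
grad_accum = 2
num_gpus = 16
effective_batch = per_gpu_batch * grad_accum * num_gpus  # 128 samples/step

train_samples = 90_000
total_steps = train_samples // effective_batch   # ~703 optimizer steps for 1 epoch
warmup_steps = round(0.03 * total_steps)         # 3% warmup ≈ 21 steps

print(effective_batch, total_steps, warmup_steps)
```

This also shows why the eval at step 500 sits deep into training: roughly 70% of the single epoch is complete by then.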

Generation Comparison (20 eval samples)

| Metric | Base | Fine-tuned | Delta |
|---|---|---|---|
| ROUGE-L (overall) | 0.063 | 0.111 | +76% |
| Reasoning traces | 0% | 100% | +100% |
| Math ROUGE-L | 0.118 | 0.171 | +45% |
| Science ROUGE-L | 0.094 | 0.176 | +87% |
| Code ROUGE-L | 0.015 | 0.050 | +233% |
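For reference, ROUGE-L measures longest-common-subsequence overlap between generated and expected text. A minimal, self-contained sketch of the token-level F1 variant (real evaluations typically use a library such as rouge_score, which adds stemming and tokenization details this toy version omits):

```python
# Toy ROUGE-L F1: longest common subsequence over whitespace tokens.
def lcs_len(a, b):
    # Classic dynamic-programming LCS length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l_f1(reference, candidate):
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_len(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)

print(rouge_l_f1("the square root of 2 is irrational",
                 "sqrt of 2 is irrational"))  # ~0.667
```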

General Capability Benchmarks

Evaluated with lm-evaluation-harness v0.4.11.

| Benchmark | Base | Fine-tuned | Delta |
|---|---|---|---|
| MMLU | 80.7% | 79.5% | -1.2% |
| HellaSwag | 84.1% | 83.4% | -0.7% |
| Winogrande | 75.8% | 74.8% | -0.9% |

Average benchmark delta: -0.9 percentage points. General knowledge is preserved.

Evaluation Interpretation

What improved: The fine-tuned model produces step-by-step reasoning (<think> traces) on 100% of math, code, and science problems. The base model never produces reasoning traces. ROUGE-L similarity to expected Nemotron outputs improved 76% overall, with the largest gains on science (+87%) and code (+233%).

What stayed stable: General knowledge benchmarks (MMLU, HellaSwag, Winogrande) each shift by at most 1.2 percentage points, within typical run-to-run noise. The model retains its broad capabilities while gaining domain-specific reasoning.

Overfitting check: Train loss (0.403) and eval loss (0.412) differ by only 0.009, so the model generalizes well to unseen data. This is expected with single-epoch training on 90K diverse samples.

Trade-off: Full fine-tuning updates all 32.5B parameters, which yields the largest reasoning improvement but can cause small regressions on some benchmarks. For use cases where such regressions are unacceptable, LoRA fine-tuning (which updates only about 0.6% of the parameters) is recommended instead.
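For scale, here is a rough estimate of LoRA's trainable-parameter fraction. The layer shapes, rank, and target modules below are illustrative assumptions, not the exact Qwen2.5-32B configuration or the setup behind the 0.6% figure:

```python
# LoRA adds two low-rank matrices per adapted linear layer:
# A (rank x in_features) and B (out_features x rank).
dims = [  # assumed per-layer linear shapes (illustrative only)
    (5120, 5120), (5120, 1024), (5120, 1024), (5120, 5120),  # q, k, v, o
    (5120, 27648), (5120, 27648), (27648, 5120),             # gate, up, down
]
rank, layers, total = 32, 64, 32.5e9
trainable = layers * sum(rank * (i + o) for i, o in dims)
print(f"~{trainable / 1e6:.0f}M trainable params, {trainable / total:.2%} of the model")
```

Depending on rank and which modules are adapted, this typically lands well under 1% of the model's weights, which is why LoRA disturbs the base model far less than a full fine-tune.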

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# BF16 weights for a 32.5B model need roughly 65 GB of GPU memory;
# device_map="auto" shards the model across available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "mahernaija/qwen25-32b-nemotron-finetuned",
    torch_dtype="bfloat16",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mahernaija/qwen25-32b-nemotron-finetuned")

messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
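The fine-tuned model emits its reasoning inside <think> tags before the final answer. A minimal sketch for separating the trace from the answer (assumes well-formed, non-nested tags):

```python
import re

def split_think(text):
    # Returns (reasoning, answer); reasoning is None if no <think> block is found.
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_think(
    "<think>Assume sqrt(2) = p/q in lowest terms ...</think>Therefore sqrt(2) is irrational."
)
print(answer)  # Therefore sqrt(2) is irrational.
```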

License

Apache 2.0 (same as base model)
