Beta Release - This model is a beta release; a v2 with more training data and improved training methodology is planned. It is currently fine-tuned exclusively on the nohurry/Opus-4.6-Reasoning-3000x-filtered dataset (2,326 reasoning traces from Claude Opus 4.6).

Nemotron-3-Super-120B-A12B-FP8-Claude-4.6-Opus-Reasoning-Distilled

A fine-tuned version of NVIDIA Nemotron-3-Super-120B-A12B with enhanced reasoning capabilities, distilled from Claude Opus 4.6 reasoning traces.

Model Details

| Property | Value |
|---|---|
| Base Model | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 |
| Architecture | Nemotron-H (Mamba-2 SSM + MoE + Attention hybrid) |
| Parameters | 120B total / 12B active (MoE) |
| Precision | FP8 (NVIDIA ModelOpt) |
| Fine-tuning Method | LoRA (r=32, alpha=64) merged into base weights |
| Training Data | nohurry/Opus-4.6-Reasoning-3000x-filtered |
| Epochs | 3 |
| Final Training Loss | 0.42 |

What's Different

This model has been fine-tuned on 2,326 high-quality reasoning traces from Claude Opus 4.6. The model produces structured reasoning with <think> tags before answering, similar to o1/reasoning-style models.

Training Configuration

  • LoRA: r=32, alpha=64, dropout=0.05
  • Targets: q/k/v/o projections + MoE gate/up/down projections
  • Optimizer: AdamW 8-bit, lr=2e-4 with cosine schedule
  • Batch: effective batch size 8 (1 per GPU x 8 grad accumulation)
  • Sequence Length: 2048 tokens
  • Hardware: 3x NVIDIA B200 GPUs
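
For readers unfamiliar with how "merged into base weights" works, here is a minimal plain-Python sketch of the standard LoRA merge rule with toy matrices; `merge_lora` and `matmul` are illustrative helpers, not part of any training library. With r=32 and alpha=64 as above, the scaling factor alpha/r is 2.0.

```python
# Minimal sketch of merging a LoRA update into a base weight matrix.
# Toy dimensions for illustration; the real model applies this to the
# attention and MoE projection weights with r=32, alpha=64.

R, ALPHA = 32, 64
SCALE = ALPHA / R  # standard LoRA scaling: 64 / 32 = 2.0

def matmul(a, b):
    """Plain-Python matrix multiply for the toy example."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def merge_lora(W, A, B, scale=SCALE):
    """Return W + scale * (B @ A): the merged weight used at inference."""
    delta = matmul(B, A)
    return [[W[i][j] + scale * delta[i][j]
             for j in range(len(W[0]))] for i in range(len(W))]

# Toy 2x2 base weight with a rank-1 LoRA adapter (A: 1x2, B: 2x1).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]   # down-projection to rank r
B = [[1.0], [1.0]] # up-projection back to model dim
merged = merge_lora(W, A, B)
print(merged)  # [[2.0, 1.0], [1.0, 2.0]]
```

After merging, the adapter matrices are discarded and the checkpoint is served like any dense fine-tune, which is why no separate LoRA weights ship with this repository.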

Usage

With vLLM (Recommended)

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="blobbybob/Nemotron-3-Super-120B-A12B-FP8-Claude-4.6-Opus-Reasoning-Distilled",
    dtype="auto",
    tensor_parallel_size=2,
    max_model_len=4096,
    trust_remote_code=True,
)

sampling = SamplingParams(temperature=1.0, top_p=0.955, max_tokens=2048)

messages = [
    {"role": "system", "content": "You are a helpful reasoning assistant. Think step by step before answering."},
    {"role": "user", "content": "What is the sum of all prime numbers less than 20?"},
]

# Render the chat template manually, then generate.
tokenizer = llm.get_tokenizer()
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text)
```
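
Since the model emits its reasoning inside <think> tags before the answer, you may want to separate the two in post-processing. A minimal sketch (the `split_reasoning` helper is illustrative, not part of vLLM; the sample trace below is invented for demonstration):

```python
import re

def split_reasoning(text):
    """Separate <think>...</think> reasoning from the final answer.

    Returns (reasoning, answer); reasoning is None when no think
    block is present in the completion.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return None, text.strip()
    answer = text[match.end():].strip()
    return match.group(1).strip(), answer

# Example with a hand-written trace in the model's output format.
reasoning, answer = split_reasoning(
    "<think>2+3+5+7+11+13+17+19 = 77</think>The sum is 77."
)
print(answer)  # The sum is 77.
```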

Minimum Hardware

  • FP8: 2x H100-80GB or 2x H200 (the FP8 weights alone occupy roughly 120GB of VRAM)
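
A rough weights-only check of that figure: FP8 stores one byte per parameter, and because this is an MoE model all experts must be resident. KV cache, activations, and CUDA overhead come on top, which is why some headroom beyond 120GB is needed.

```python
PARAMS = 120e9          # total parameters (all MoE experts resident)
BYTES_PER_PARAM = 1     # FP8 = 1 byte per weight
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9
print(f"{weights_gb:.0f} GB of weights")  # 120 GB of weights
```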

Note on FP8

This checkpoint uses NVIDIA ModelOpt FP8 quantization. vLLM auto-detects the format via hf_quant_config.json.

Limitations

  • Fine-tuned on only 2,326 examples — may not generalize to all domains
  • Reasoning traces are from Claude Opus 4.6; model behavior reflects that style
  • Mamba-2 layers fall back to naive (non-optimized) kernels when mamba-ssm is not installed
  • Beta release — expect improvements in v2

License

This model inherits the NVIDIA Open Model License from the base model.
