This is a Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct fine-tune, produced through P-E-W's Heretic (v1.1.0) abliteration engine merged with the Magnitude-Preserving Orthogonal Ablation PR.

Heretication Results

Score Metric	Value	Parameter	Value
Refusals	6/100	direction_index	13.20
KL Divergence	0.0081	attn.o_proj.max_weight	1.75
Initial Refusals	90/100	attn.o_proj.max_weight_position	20.89
		attn.o_proj.min_weight	1.20
		attn.o_proj.min_weight_distance	16.40
		mlp.down_proj.max_weight	1.27
		mlp.down_proj.max_weight_position	20.39
		mlp.down_proj.min_weight	0.08
		mlp.down_proj.min_weight_distance	16.06
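
The parameters in the right-hand columns appear to describe how strongly each layer is ablated. As a rough sketch, assuming a piecewise-linear kernel that peaks at `max_weight_position`, falls to `min_weight` over `min_weight_distance` layers, and leaves layers beyond that distance untouched, the per-layer strength could be computed as below. The function name `ablation_weight` is hypothetical, and Heretic's actual implementation may differ.

```python
def ablation_weight(layer, max_weight, max_weight_position,
                    min_weight, min_weight_distance):
    """Hypothetical per-layer ablation strength (assumed piecewise-linear kernel)."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        # Assumption: layers outside the kernel are not modified at all.
        return 0.0
    # Linear interpolation from max_weight (at the peak) to min_weight (at the edge).
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


# attn.o_proj values copied from the table above.
O_PROJ = dict(max_weight=1.75, max_weight_position=20.89,
              min_weight=1.20, min_weight_distance=16.40)

if __name__ == "__main__":
    # Llama-3.1-8B has 32 transformer layers.
    for layer in range(32):
        print(layer, round(ablation_weight(layer, **O_PROJ), 3))
```

Under this assumption, the fractional `max_weight_position` simply places the peak between two integer layer indices.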

Degree of Heretication

The Heresy Index weighs how far the process has corrupted the resulting model (KL Divergence) against its abolition of doctrine (Refusals) to reach a final classification verdict.
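
For readers unfamiliar with the metric, here is a minimal sketch of KL divergence between two discrete next-token distributions. It is the generic textbook definition, not a description of how Heretic measures it; the direction (original model first, modified model second) and the example probabilities are assumptions made for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing; eps guards against division by zero.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


# Made-up next-token probabilities, for illustration only.
original = [0.70, 0.20, 0.10]
modified = [0.68, 0.21, 0.11]

if __name__ == "__main__":
    print(kl_divergence(original, modified))
```

A value near zero means the modified model's output distribution stays close to the original's, which is why a low KL Divergence is read as little collateral damage.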

Index Entry	Classification	Analysis
	Absolute Heresy	Fewer than 10/100 Refusals and less than 0.10 KL Divergence
	Tainted Heresy	Around 11-25/100 Refusals and/or 0.11-0.20 KL Divergence
	Impotent Heresy	Anything above 25/100 Refusals and 0.21 KL Divergence

Note: This is an arbitrary classification inspired by Warhammer 40K and says nothing concrete about the model's performance.
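
The classification table can be read as a simple threshold rule. The sketch below is one possible reading: the function name `heresy_class` is hypothetical, and the handling of boundary values and of the "and/or" wording (the worse of the two metrics decides) is an assumption, since the table does not pin these down.

```python
def heresy_class(refusals, kl_divergence):
    """Map refusals (out of 100) and KL divergence to a Heresy Index class.

    Assumption: both metrics must clear a tier's thresholds to earn that tier.
    """
    if refusals < 10 and kl_divergence < 0.10:
        return "Absolute Heresy"
    if refusals <= 25 and kl_divergence <= 0.20:
        return "Tainted Heresy"
    return "Impotent Heresy"


if __name__ == "__main__":
    # Scores reported for this model: 6/100 refusals, 0.0081 KL divergence.
    print(heresy_class(6, 0.0081))
```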

Model Information

We introduce Nemotron-UltraLong-8B, a series of ultra-long context language models designed to process extensive sequences of text (up to 1M, 2M, and 4M tokens) while maintaining competitive performance on standard benchmarks. Built on Llama-3.1, UltraLong-8B uses a systematic training recipe that combines efficient continued pretraining with instruction tuning to enhance long-context understanding and instruction-following capabilities. This approach enables our models to scale their context windows efficiently without sacrificing general performance.

The UltraLong Models

Uses

With transformers >= 4.43.0, you can run conversational inference using the Transformers pipeline abstraction or the Auto classes with the generate() function.

Make sure to update your transformers installation via pip install --upgrade transformers.

import transformers
import torch

model_id = "nvidia/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
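
The Auto classes route mentioned above could look roughly like the following sketch. The helper name `generate_reply` and the `MESSAGES` constant are hypothetical, introduced only for illustration; the model ID and generation settings mirror the pipeline example.

```python
def generate_reply(model_id, messages, max_new_tokens=256):
    """Generate one assistant reply using the Auto classes and generate()."""
    # Imported lazily so that defining this helper does not load the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # Render the chat with the model's template and append the assistant prompt.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)


MESSAGES = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]
```

Calling `generate_reply("nvidia/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct", MESSAGES)` downloads the weights on first use and returns the reply as a plain string.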

Model Card

Base model: meta-llama/Llama-3.1-8B-Instruct
Continued Pretraining: The training data consists of 1B tokens sourced from a pretraining corpus using per-domain upsampling based on sample length. The model was trained for 125 iterations with a sequence length of 1M and a global batch size of 8.
Supervised fine-tuning (SFT): 1B tokens on open-source instruction datasets across general, mathematics, and code domains. We subsample the data from ‘general_sft_stage2’ in AceMath-Instruct.
Maximum context window: 1M tokens

Evaluation Results

We evaluate Nemotron-UltraLong-8B on a diverse set of benchmarks, including long-context tasks (e.g., RULER, LV-Eval, and InfiniteBench) and standard tasks (e.g., MMLU, MATH, GSM-8K, and HumanEval). UltraLong-8B achieves superior performance on ultra-long context tasks while maintaining competitive results on standard benchmarks.

Needle in a Haystack

Long context evaluation

Standard capability evaluation

Correspondence to

Chejian Xu (chejian2@illinois.edu), Wei Ping (wping@nvidia.com)

Citation

@article{ulralong2025,
  title={From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models},
  author={Xu, Chejian and Ping, Wei and Xu, Peng and Liu, Zihan and Wang, Boxin and Shoeybi, Mohammad and Catanzaro, Bryan},
  journal={arXiv preprint},
  year={2025}
}