This is a Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct fine-tune, produced through P-E-W's Heretic (v1.1.0) abliteration engine merged with the Magnitude-Preserving Orthogonal Ablation PR.
Heretication Results
| Score Metric | Value | Parameter | Value |
|---|---|---|---|
| Refusals | 6/100 | direction_index | 13.20 |
| KL Divergence | 0.0081 | attn.o_proj.max_weight | 1.75 |
| Initial Refusals | 90/100 | attn.o_proj.max_weight_position | 20.89 |
| | | attn.o_proj.min_weight | 1.20 |
| | | attn.o_proj.min_weight_distance | 16.40 |
| | | mlp.down_proj.max_weight | 1.27 |
| | | mlp.down_proj.max_weight_position | 20.39 |
| | | mlp.down_proj.min_weight | 0.08 |
| | | mlp.down_proj.min_weight_distance | 16.06 |
Degree of Heretication
The Heresy Index weighs how far the process corrupted the resulting model (KL Divergence) and how thoroughly it abolished doctrine (Refusals) to reach a final classification verdict.
Note: This is an arbitrary classification inspired by Warhammer 40K; it says nothing tangible about the model's performance.
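The two inputs to the index can be illustrated with a minimal sketch. The refusal counts come from the results table; the two probability distributions are hypothetical toy values, and the exact prompts, KL direction, and weighting used by Heretic are not stated in this card, so none of this should be read as its actual implementation.

```python
import math


def refusal_rate(refused, total):
    """Fraction of test prompts that the model refused."""
    return refused / total


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Refusal counts taken from the results table.
initial = refusal_rate(90, 100)
final = refusal_rate(6, 100)
relative_reduction = 1 - final / initial

# Hypothetical next-token distributions, for illustration only.
original_probs = [0.70, 0.20, 0.10]
modified_probs = [0.68, 0.21, 0.11]
drift = kl_divergence(original_probs, modified_probs)

print(f"refusals: {initial:.0%} -> {final:.0%} ({relative_reduction:.1%} fewer)")
print(f"toy KL divergence: {drift:.4f} nats")
```

A KL divergence of zero would mean the modified model's distribution is identical to the original's; small values such as the 0.0081 reported in the table indicate little drift.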
Model Information
We introduce Nemotron-UltraLong-8B, a series of ultra-long context language models designed to process extensive sequences of text (up to 1M, 2M, and 4M tokens) while maintaining competitive performance on standard benchmarks. Built on Llama-3.1, UltraLong-8B leverages a systematic training recipe that combines efficient continued pretraining with instruction tuning to enhance long-context understanding and instruction-following capabilities. This approach enables our models to efficiently scale their context windows without sacrificing general performance.
The UltraLong Models
- nvidia/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct
- nvidia/Llama-3.1-Nemotron-8B-UltraLong-2M-Instruct
- nvidia/Llama-3.1-Nemotron-8B-UltraLong-4M-Instruct
Uses
With transformers >= 4.43.0, you can run conversational inference using the Transformers pipeline abstraction or the Auto classes with the generate() function.
Make sure to update your transformers installation via pip install --upgrade transformers.
import transformers
import torch

model_id = "nvidia/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
print(outputs[0]["generated_text"][-1])
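The Auto classes route mentioned above could look roughly like the following sketch. The helper names `build_messages` and `generate_reply` are hypothetical and exist only for this illustration; the model id is the one used in the pipeline example, and the generation settings are assumptions rather than recommended values.

```python
def build_messages(user_prompt, system_prompt=None):
    """Build a chat message list in the role/content format used by chat templates."""
    messages = []
    if system_prompt is not None:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


def generate_reply(messages,
                   model_id="nvidia/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct",
                   max_new_tokens=256):
    """Load the model with the Auto classes and generate one assistant reply."""
    # Imported here so the helper above stays usable without the heavy dependencies.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    # Render the conversation with the model's chat template and tokenize it.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated part.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0, prompt_length:], skip_special_tokens=True)
```

Calling `generate_reply(build_messages("Who are you?", "You are a pirate chatbot who always responds in pirate speak!"))` downloads the weights on first use and returns the reply as a plain string.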
Model Card
- Base model: meta-llama/Llama-3.1-8B-Instruct
- Continued Pretraining: The training data consists of 1B tokens sourced from a pretraining corpus using per-domain upsampling based on sample length. The model was trained for 125 iterations with a sequence length of 1M and a global batch size of 8.
- Supervised fine-tuning (SFT): 1B tokens of open-source instruction data across general, mathematics, and code domains. We subsample the data from ‘general_sft_stage2’ in AceMath-Instruct.
- Maximum context window: 1M tokens
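As a quick sanity check on the continued-pretraining figures above, the token budget follows from iterations times global batch size times sequence length. The card does not say whether "1M" means 10^6 or 2^20 tokens, so this sketch computes both readings; the function name is hypothetical.

```python
def tokens_seen(iterations, global_batch_size, sequence_length):
    """Total tokens processed: each iteration consumes one global batch of full sequences."""
    return iterations * global_batch_size * sequence_length


# Figures from the model card: 125 iterations, global batch size 8, sequence length 1M.
decimal_reading = tokens_seen(125, 8, 1_000_000)  # assuming 1M = 10**6
binary_reading = tokens_seen(125, 8, 2**20)       # assuming 1M = 2**20

print(f"1M = 10^6 tokens: {decimal_reading:,} tokens")
print(f"1M = 2^20 tokens: {binary_reading:,} tokens")
```

Either reading lands at roughly 1B tokens, consistent with the stated size of the continued-pretraining data.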
Evaluation Results
We evaluate Nemotron-UltraLong-8B on a diverse set of benchmarks, including long-context tasks (e.g., RULER, LV-Eval, and InfiniteBench) and standard tasks (e.g., MMLU, MATH, GSM-8K, and HumanEval). UltraLong-8B achieves superior performance on ultra-long context tasks while maintaining competitive results on standard benchmarks.
Needle in a Haystack
Long context evaluation
Standard capability evaluation
Correspondence to
Chejian Xu (chejian2@illinois.edu), Wei Ping (wping@nvidia.com)
Citation
@article{ulralong2025,
title={From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models},
author={Xu, Chejian and Ping, Wei and Xu, Peng and Liu, Zihan and Wang, Boxin and Shoeybi, Mohammad and Catanzaro, Bryan},
journal={arXiv preprint},
year={2025}
}