This is a Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct fine-tune, produced through P-E-W's Heretic (v1.1.0) abliteration engine merged with the Magnitude-Preserving Orthogonal Ablation PR.


Heretication Results

Score Metric | Value
Refusals | 6/100
KL Divergence | 0.0081
Initial Refusals | 90/100

Parameter | Value
direction_index | 13.20
attn.o_proj.max_weight | 1.75
attn.o_proj.max_weight_position | 20.89
attn.o_proj.min_weight | 1.20
attn.o_proj.min_weight_distance | 16.40
mlp.down_proj.max_weight | 1.27
mlp.down_proj.max_weight_position | 20.39
mlp.down_proj.min_weight | 0.08
mlp.down_proj.min_weight_distance | 16.06
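
One way to read the per-component parameters is as a weight profile over layers. The sketch below assumes, without confirmation from this card, that the ablation weight peaks at max_weight on the (fractional) layer max_weight_position, falls off linearly to min_weight at a distance of min_weight_distance layers, and is zero beyond that window. The helper name ablation_weight is hypothetical and is not part of Heretic's API; the 32-layer count is that of Llama-3.1-8B.

```python
def ablation_weight(layer, max_weight, max_weight_position,
                    min_weight, min_weight_distance):
    """Hypothetical helper: piecewise-linear ablation weight for one layer."""
    distance = abs(layer - max_weight_position)
    if distance > min_weight_distance:
        # Assumed: layers outside the window are left untouched.
        return 0.0
    # Linear interpolation from max_weight (distance 0)
    # to min_weight (edge of the window).
    return max_weight + (distance / min_weight_distance) * (min_weight - max_weight)


# Values for attn.o_proj copied from the reported parameters.
attn_o_proj = dict(max_weight=1.75, max_weight_position=20.89,
                   min_weight=1.20, min_weight_distance=16.40)

# Assumes 32 decoder layers, as in Llama-3.1-8B.
for layer in range(32):
    print(layer, round(ablation_weight(layer, **attn_o_proj), 3))
```

Under this reading, attn.o_proj is ablated with weight between 1.20 and 1.75 across most of the upper layers, while mlp.down_proj tapers to nearly zero (0.08) away from its peak.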

Degree of Heretication

The Heresy Index weighs the resulting model's corruption by the process (KL Divergence) and its abolition of doctrine (Refusals) to reach a final classification.

Index Entry | Classification | Analysis
Absolute | Absolute Heresy | Less than 10/100 Refusals and 0.10 KL Divergence
Tainted | Tainted Heresy | Around 11-25/100 Refusals and/or 0.11-0.20 KL Divergence
Impotent | Impotent Heresy | Anything above 25/100 Refusals and 0.21 KL Divergence

Note: This is an arbitrary classification inspired by Warhammer 40K. It gives no tangible indication of the model's performance.
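
The classification can be sketched as a small function. This is one literal reading of the index, not a tool shipped with the model: the function name classify_heresy is hypothetical, the exact boundary handling is an assumption (the index leaves values such as exactly 10/100 refusals unassigned), and anything that is neither Absolute nor Impotent is treated as Tainted, following the "and/or" wording.

```python
def classify_heresy(refusals, kl_divergence):
    """Hypothetical helper: map (refusals out of 100, KL divergence) to a class.

    Boundary handling is an assumption; the index does not define edge cases.
    """
    if refusals < 10 and kl_divergence < 0.10:
        return "Absolute Heresy"
    if refusals > 25 and kl_divergence >= 0.21:
        return "Impotent Heresy"
    # Catch-all, following the "and/or" wording of the Tainted entry.
    return "Tainted Heresy"


# Scores reported for this model: 6/100 refusals, 0.0081 KL divergence.
print(classify_heresy(6, 0.0081))
```

With the reported scores, this reading puts the model in the Absolute Heresy class.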


Model Information

We introduce Nemotron-UltraLong-8B, a series of ultra-long context language models designed to process extensive sequences of text (up to 1M, 2M, and 4M tokens) while maintaining competitive performance on standard benchmarks. Built on Llama-3.1, UltraLong-8B uses a systematic training recipe that combines efficient continued pretraining with instruction tuning to enhance long-context understanding and instruction-following capabilities. This approach enables our models to scale their context windows efficiently without sacrificing general performance.

The UltraLong Models

Uses

With transformers >= 4.43.0, you can run conversational inference using the Transformers pipeline abstraction or by using the Auto classes with the generate() function.

Make sure to update your transformers installation via pip install --upgrade transformers.

import transformers
import torch

model_id = "nvidia/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

outputs = pipeline(
    messages,
    max_new_tokens=256,
)
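# generated_text holds the whole conversation; the last entry is the
# assistant's reply as a dict with "role" and "content" keys.
print(outputs[0]["generated_text"][-1])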
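
The Auto classes route mentioned earlier could look roughly like the following. This is a minimal sketch, not code from the model authors: build_messages and generate_reply are hypothetical helper names, and the heavy imports are kept inside generate_reply so that the message helper can be used without loading the model.

```python
def build_messages(system_prompt, user_prompt):
    """Hypothetical helper: build a chat in the role/content message format."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]


def generate_reply(model_id, messages, max_new_tokens=256):
    """Hypothetical helper: load the model and generate one assistant reply."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )
    # Render the chat with the model's template and tokenize it.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)


# Example call (downloads the model weights):
# reply = generate_reply(
#     "nvidia/Llama-3.1-Nemotron-8B-UltraLong-1M-Instruct",
#     build_messages("You are a pirate chatbot who always responds in pirate speak!",
#                    "Who are you?"),
# )
```

Compared with the pipeline, this form gives direct access to the token tensors, which can help when checking how much of the context window a long prompt consumes.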

Model Card

  • Base model: meta-llama/Llama-3.1-8B-Instruct
  • Continued Pretraining: The training data consists of 1B tokens sourced from a pretraining corpus using per-domain upsampling based on sample length. The model was trained for 125 iterations with a sequence length of 1M and a global batch size of 8.
  • Supervised fine-tuning (SFT): 1B tokens from open-source instruction datasets across general, mathematics, and code domains. The data is subsampled from ‘general_sft_stage2’ in AceMath-Instruct.
  • Maximum context window: 1M tokens
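
As a quick consistency check, the continued-pretraining figures multiply out to the stated token budget. The snippet assumes "1M" means 10**6 tokens per sequence; with 2**20 tokens the total would be about 5% higher.

```python
iterations = 125
global_batch_size = 8          # sequences per iteration
sequence_length = 1_000_000    # assumption: "1M" taken as 10**6 tokens

total_tokens = iterations * global_batch_size * sequence_length
print(f"{total_tokens:,} tokens")  # prints "1,000,000,000 tokens"
```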

Evaluation Results

We evaluate Nemotron-UltraLong-8B on a diverse set of benchmarks, including long-context tasks (e.g., RULER, LV-Eval, and InfiniteBench) and standard tasks (e.g., MMLU, MATH, GSM-8K, and HumanEval). UltraLong-8B achieves superior performance on ultra-long context tasks while maintaining competitive results on standard benchmarks.

Needle in a Haystack

[Figure: Needle in a Haystack results]
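
For readers unfamiliar with the test, a needle-in-a-haystack prompt hides one target sentence at a chosen depth inside long filler text and then asks the model to retrieve it. The sketch below only illustrates the idea; it is not the authors' evaluation harness, and build_haystack_prompt and its parameters are hypothetical names.

```python
def build_haystack_prompt(needle, filler_sentence, total_sentences, depth):
    """Hypothetical helper: insert `needle` into repeated filler text.

    depth is a fraction in [0, 1]: 0 puts the needle at the start,
    1 puts it at the end.
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    sentences = [filler_sentence] * total_sentences
    position = round(depth * total_sentences)
    sentences.insert(position, needle)
    return " ".join(sentences), position


haystack, position = build_haystack_prompt(
    needle="The secret passkey is 48213.",
    filler_sentence="The grass is green and the sky is blue.",
    total_sentences=1000,
    depth=0.5,
)
question = "What is the secret passkey mentioned in the text above?"
prompt = haystack + "\n\n" + question
```

Sweeping depth and total_sentences produces the usual grid of context length against needle position.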

Long context evaluation

[Figure: long context evaluation results]

Standard capability evaluation

[Figure: standard capability evaluation results]

Correspondence to

Chejian Xu (chejian2@illinois.edu), Wei Ping (wping@nvidia.com)

Citation

@article{ulralong2025,
  title={From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models},
  author={Xu, Chejian and Ping, Wei and Xu, Peng and Liu, Zihan and Wang, Boxin and Shoeybi, Mohammad and Catanzaro, Bryan},
  journal={arXiv preprint},
  year={2025}
}