
Actual context length

#12
by yuchsiao - opened

Thank you to the Nemotron team for open-sourcing the model weights, data, and training recipe!

There are some discrepancies in the actual context length supported by the model:

  1. The technical report says the model supports 1M tokens.
  2. The config.json file's max_position_embeddings is 262,144. This is the number vLLM uses to estimate batch size given the allocated VRAM.
  3. The HF README file states a maximum input size of 128K tokens and a maximum output size of 128K tokens.

I'm not sure how to reconcile these specs. Could the team provide some more insight?

Thanks!

I used VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve to extend the limit to 1M tokens. I don't know if it is a good approach.

This is a bit confusing. I think what it means is: you can feed it at most 128K tokens as input, and it will generate a maximum of 128K tokens after that as the response, so ~256K tokens in total.
It also means that if you input, say, 200K tokens as a prompt and expect only a ~4K-token response, performance will probably degrade, since the initial prompt exceeds 128K tokens.

Yes, this puts items 2 and 3 into one picture, but item 1 remains inconsistent. I hope the dev team can clarify whether the model can actually work with 1M tokens.

NVIDIA org

Sorry for the confusion on the 1M context; we will fix the documentation.

The way to get this to work is by setting the environment variable VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and also passing the --max-model-len 1M engine argument.

for example,

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve YOUR_MODEL_PATH --max-model-len 1M

This will override the max_position_embeddings parameter.

You can read more about max-model-len in the vLLM engine arguments documentation.

I hope it helps
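Once the server from the command above is running, a quick way to exercise it is through vLLM's OpenAI-compatible REST endpoint. The sketch below only builds the request (URL, model name, and prompt are placeholders, and the actual network call is left commented out so the snippet stands on its own):

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int) -> urllib.request.Request:
    """Build a POST request for vLLM's OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example (requires a running server started with the vllm serve command above):
# req = build_completion_request("http://localhost:8000", "YOUR_MODEL_PATH",
#                                "Hello", max_tokens=64)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```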

Thank you @lvega-nv for the clear guidance.

In this case, wouldn't updating the config.json directly be a more straightforward solution?
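For a local checkpoint, that would amount to patching the field in place. A minimal sketch (my own illustration, not official guidance; it assumes a local copy of the config.json discussed above):

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path, new_limit: int) -> int:
    """Patch max_position_embeddings in a local copy of config.json.

    Only this one field is changed; everything else in the file is preserved.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["max_position_embeddings"] = new_limit
    path.write_text(json.dumps(config, indent=2))
    return config["max_position_embeddings"]
```

The trade-off is that this edits your checkpoint, while the environment-variable approach leaves the upstream config untouched and makes the override explicit per deployment.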

Following up on this discussion with some actual benchmark data. I ran needle-in-haystack tests via the NIM API (build.nvidia.com) to measure real retrieval quality at different context lengths.

Test Setup

  • Model: nvidia/llama-3.3-nemotron-super-49b-v1 via NIM API
  • Method: Insert a unique fact ("needle") into filler text ("haystack"), ask model to retrieve it
  • Needle positions tested: start (10%), middle (50%), end (90%)
  • Metric: Exact match + partial fact retrieval (city, year, item, color, number)
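The prompt-construction step of such a test can be sketched roughly as follows (the filler sentence, needle wording, and position logic are my own illustration, not the exact scripts used for the results below):

```python
def build_haystack_prompt(needle: str, n_filler: int, position: float) -> str:
    """Embed a 'needle' fact at a fractional position inside filler text.

    position: 0.1 places it near the start, 0.5 in the middle, 0.9 near the end.
    """
    filler = "The sky was clear and the market was quiet that day."
    sentences = [filler] * n_filler
    idx = int(position * n_filler)  # sentence index where the needle lands
    sentences.insert(idx, needle)
    question = "What is the secret number mentioned in the text above?"
    return " ".join(sentences) + "\n\n" + question
```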

Results

| Context Size | Prompt Tokens | Retrieval | Latency | Notes |
|---|---|---|---|---|
| 2K | ~1.3K | ✅ 100% | 1-3s | Perfect |
| 8K | ~4.9K | ✅ 100% | 1-2s | Perfect |
| 16K | ~9.7K | ✅ 100% | 1-2s | Perfect |
| 32K | ~19K | ✅ 100% | 2-3s | Perfect |
| 64K | ~38K | ✅ 100% | 4-5s | Perfect |
| 128K | ~77K | ✅ 100% | 9-11s | Perfect, all positions |
| 160K | ~96K | ⚠️ Inconsistent | 15s | Model said "NOT FOUND" but processed the needle |
| 200K | ~120K | ✅ Found | 17s | Worked but beyond safe zone |
| 256K | n/a | ❌ Rejected | n/a | API error: "max context length is 131072 tokens" |

Key Findings

  1. NIM API hard limit: 131,072 tokens (exactly 128K, i.e. 128 × 1024), not 262K and not 1M. The API rejects requests exceeding this.

  2. Up to 128K: flawless. 100% retrieval accuracy across all needle positions (start/middle/end), with no degradation observed.

  3. Beyond the 128K test size: quality degrades. At the 160K test (~96K prompt tokens), the model started giving inconsistent answers, saying "NOT FOUND" while clearly having processed the content.

  4. Latency scales roughly linearly: about 1s per 10K tokens at the NIM API endpoint.

Practical Recommendation

For reliable results via NIM API:

  • Safe zone: ≤128K total context (prompt + expected completion)
  • Maximum input: ~120K tokens if you need a ~10K completion budget
  • Don't trust the 1M claim: the NIM API enforces the 131,072-token limit
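A pre-flight guard along these lines can catch over-budget requests before they hit the API. The limit constant is the one observed above; the helper function itself is my own sketch:

```python
NIM_CONTEXT_LIMIT = 131_072  # hard limit reported by the NIM API error above


def fits_context(prompt_tokens: int, completion_budget: int,
                 limit: int = NIM_CONTEXT_LIMIT) -> bool:
    """Return True if prompt plus expected completion fit within the context limit."""
    return prompt_tokens + completion_budget <= limit
```

For example, a ~77K-token prompt with a 10K completion budget fits comfortably, while a 128K prompt with an 8K budget does not.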

The confusion between "1M context" (technical report), "262K" (config.json), and "128K input + 128K output" (README) seems to be: the model architecture supports longer contexts, but the deployed NIM API enforces a 131K limit.

Hope this helps others hitting similar confusion. Test scripts available at nemotron-community-testing if anyone wants to reproduce.

@lvega-nv β€” would love to hear if this matches your expectations for NIM API behavior, or if there are plans to raise the 131K limit to match the 262K in config.json. Happy to run more targeted tests if useful.


Tested Feb 4, 2026 against build.nvidia.com NIM API
