Actual context length
Thank you to the Nemotron team for open-sourcing the model weights, data, and training recipe!
There are some discrepancies about the actual context length supported by the model:
- The technical report says the model supports 1M tokens.
- The `config.json` file's `max_position_embeddings` is 262,144. This is the number vLLM uses to estimate the batch size given the allocated VRAM.
- The HF README file states: Maximum input size: 128K tokens and Maximum output size: 128K tokens.
I'm not sure how to reconcile these specs. Could the team provide some more insight?
Thanks!
I used "VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve" to extend to 1M tokens. I don't know if it is a good approach.
This is a bit confusing; I think what it means is:
You can feed it a maximum of 128K tokens as input, and it will generate up to 128K tokens after that as the response.
So in total ~256K tokens.
It also means that if you input, say, a 200K-token prompt and expect only a ~4K-token response, performance will probably degrade, since the prompt alone already exceeds 128K tokens.
Yes, this puts items 2 and 3 in one picture, but item 1 remains inconsistent. I hope the dev team can clarify whether the model can actually work with 1M tokens.
Sorry for the confusion on the 1M context; we will fix the documentation.
The way to get this to work is to set the environment variable VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and also pass the --max-model-len engine argument. For example:
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve YOUR_MODEL_PATH --max-model-len 1M
This will override the max_position_embeddings parameter.
You can read more about max-model-len here.
I hope it helps
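If you're using vLLM's offline Python API rather than vllm serve, the equivalent override looks roughly like this. This is a minimal sketch, not an official recipe: the model path and the 1M-token budget are placeholders, and you still need enough GPU memory for the KV cache at that length.

```python
import os

# Allow max_model_len to exceed max_position_embeddings from config.json.
os.environ["VLLM_ALLOW_LONG_MAX_MODEL_LEN"] = "1"

from vllm import LLM, SamplingParams

# Placeholder path; replace with your local checkpoint or HF repo id.
llm = LLM(
    model="YOUR_MODEL_PATH",
    max_model_len=1_000_000,  # overrides the 262,144 in config.json
)

params = SamplingParams(max_tokens=512, temperature=0.0)
outputs = llm.generate(["Summarize the following document: ..."], params)
print(outputs[0].outputs[0].text)
```

Note that raising max_model_len only changes what the engine will accept; whether quality holds up at that length is a separate question (see the benchmark results below).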
Following up on this discussion with some actual benchmark data. I ran needle-in-haystack tests via the NIM API (build.nvidia.com) to measure real retrieval quality at different context lengths.
Test Setup
- Model: nvidia/llama-3.3-nemotron-super-49b-v1 via NIM API
- Method: insert a unique fact ("needle") into filler text ("haystack"), then ask the model to retrieve it
- Needle positions tested: start (10%), middle (50%), end (90%)
- Metric: Exact match + partial fact retrieval (city, year, item, color, number)
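For anyone who wants to reproduce this, here is a minimal sketch of a single probe. The needle/haystack text, the API-key handling, and the rough token counts are illustrative assumptions; the full scripts are linked at the end of this post. build.nvidia.com models are served through an OpenAI-compatible endpoint, which is what the sketch uses.

```python
import os
from openai import OpenAI

# build.nvidia.com exposes an OpenAI-compatible endpoint for NIM models.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],  # assumed env var holding your key
)

NEEDLE = "The secret city is Tromso, chosen in 1987 for its blue lantern festival."
FILLER = "The quick brown fox jumps over the lazy dog. " * 4000  # very rough ~45K tokens

def probe(position: float) -> str:
    """Insert the needle at a relative position in the haystack and ask for it back."""
    cut = int(len(FILLER) * position)
    haystack = FILLER[:cut] + NEEDLE + FILLER[cut:]
    resp = client.chat.completions.create(
        model="nvidia/llama-3.3-nemotron-super-49b-v1",
        messages=[{
            "role": "user",
            "content": haystack + "\n\nWhat is the secret city, and in which year was it chosen?",
        }],
        max_tokens=100,
        temperature=0.0,
    )
    return resp.choices[0].message.content

for pos in (0.1, 0.5, 0.9):  # needle at start, middle, end
    print(pos, probe(pos))
```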
Results
| Context Size | Prompt Tokens | Retrieval | Latency | Notes |
|---|---|---|---|---|
| 2K | ~1.3K | ✅ 100% | 1-3s | Perfect |
| 8K | ~4.9K | ✅ 100% | 1-2s | Perfect |
| 16K | ~9.7K | ✅ 100% | 1-2s | Perfect |
| 32K | ~19K | ✅ 100% | 2-3s | Perfect |
| 64K | ~38K | ✅ 100% | 4-5s | Perfect |
| 128K | ~77K | ✅ 100% | 9-11s | Perfect, all positions |
| 160K | ~96K | ⚠️ Inconsistent | 15s | Model said "NOT FOUND" but processed the needle |
| 200K | ~120K | ✅ Found | 17s | Worked but beyond safe zone |
| 256K | N/A | ❌ Rejected | N/A | API error: "max context length is 131072 tokens" |
Key Findings
- NIM API hard limit: 131,072 tokens, not 128K, not 262K, not 1M. The API rejects requests exceeding this.
- Up to 128K: flawless. 100% retrieval accuracy across all needle positions (start/middle/end); no degradation observed.
- Beyond 128K: quality degrades. At ~96K prompt tokens (the 160K test), the model started giving inconsistent answers, saying "NOT FOUND" while clearly having processed the content.
- Latency scales roughly linearly: about 1s per 10K prompt tokens at the NIM API endpoint.
Practical Recommendation
For reliable results via NIM API:
- Safe zone: ≤128K total context (prompt + expected completion)
- Maximum input: ~120K tokens if you need ~10K completion budget
- Don't trust the 1M claim: the NIM API enforces 131K (a token-budget check is sketched right after this list)
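As a rough guard, you can count tokens locally with the model's tokenizer before sending a request. This is a minimal sketch under a few assumptions: the repo id below mirrors the NIM model name and may need adjusting to the actual Hugging Face checkpoint, and the limit values are the ones observed in the table above.

```python
from transformers import AutoTokenizer

# Assumed repo id; adjust if the HF checkpoint name differs from the NIM model name.
MODEL_ID = "nvidia/llama-3.3-nemotron-super-49b-v1"
SAFE_CONTEXT = 128_000      # total budget that tested as reliable via the NIM API
COMPLETION_BUDGET = 10_000  # tokens reserved for the response

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

def fits_safe_zone(prompt: str) -> bool:
    """Return True if the prompt leaves room for the completion within the safe zone."""
    n_tokens = len(tokenizer.encode(prompt))
    return n_tokens + COMPLETION_BUDGET <= SAFE_CONTEXT

prompt = "..."  # your long document plus the question
if not fits_safe_zone(prompt):
    raise ValueError("Prompt too long for the reliable NIM context window; truncate or chunk it.")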
The confusion between "1M context" (technical report), "262K" (config.json), and "128K input + 128K output" (README) seems to be: the model architecture supports longer contexts, but the deployed NIM API enforces a 131K limit.
Hope this helps others hitting similar confusion. Test scripts available at nemotron-community-testing if anyone wants to reproduce.
@lvega-nv, would love to hear if this matches your expectations for NIM API behavior, or if there are plans to raise the 131K limit to match the 262K in config.json. Happy to run more targeted tests if useful.
Tested Feb 4, 2026 against build.nvidia.com NIM API