
Actual context length

#12
by yuchsiao - opened

Thank you to the Nemotron team for open-sourcing the model weights, data, and training recipe!

There are some discrepancies in the actual context length supported by the model:

  1. The technical report says the model supports 1M tokens.
  2. The config.json file's max_position_embeddings is 262,144. This is the number vLLM uses to estimate batch size given the allocated VRAM.
  3. The HF README file states a maximum input size of 128K tokens and a maximum output size of 128K tokens.

I'm not sure how to reconcile these specs. Could the team provide some more insight?

Thanks!

I used VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve to extend the limit to 1M tokens. I don't know if it is a good approach.

This is a bit confusing. I think what it means is: you can feed it at most 128K tokens as input, and it will generate a maximum of 128K tokens after that as the response, so ~256K tokens in total.
It also means that if you input, say, 200K tokens as a prompt and expect only a ~4K-token response, performance will probably degrade, since the initial prompt exceeds 128K tokens.

Yes, this puts items 2 and 3 into one picture, but item 1 remains inconsistent. I hope the dev team can clarify whether the model can actually work with 1M tokens.

NVIDIA org

Sorry for the confusion on the 1M context; we will fix the documentation.

The way to get this to work is by setting the environment variable VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 and also passing the --max-model-len 1M engine argument.

for example,

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve YOUR_MODEL_PATH --max-model-len 1M

This will override the max_position_embeddings parameter.

You can read more about max-model-len in the vLLM engine arguments documentation.

I hope it helps
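Once the server from the command above is running, a quick way to exercise it is through vLLM's OpenAI-compatible REST endpoint. The sketch below only builds the request (URL, model name, and prompt are placeholders, and the actual network call is left commented out so the snippet stands on its own):

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int) -> urllib.request.Request:
    """Build a POST request for vLLM's OpenAI-compatible /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example (requires a running server started with the vllm serve command above):
# req = build_completion_request("http://localhost:8000", "YOUR_MODEL_PATH",
#                                "Hello", max_tokens=64)
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```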

Thank you @lvega-nv for the clear guidance.

In this case, wouldn't updating the config.json directly be a more straightforward solution?
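For a local checkpoint, that would amount to patching the field in place. A minimal sketch (my own illustration, not official guidance; it assumes a local copy of the config.json discussed above):

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path, new_limit: int) -> int:
    """Patch max_position_embeddings in a local copy of config.json.

    Only this one field is changed; everything else in the file is preserved.
    """
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["max_position_embeddings"] = new_limit
    path.write_text(json.dumps(config, indent=2))
    return config["max_position_embeddings"]
```

The trade-off is that this edits your checkpoint, while the environment-variable approach leaves the upstream config untouched and makes the override explicit per deployment.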

Following up on this discussion with some actual benchmark data. I ran needle-in-haystack tests via the NIM API (build.nvidia.com) to measure real retrieval quality at different context lengths.

Test Setup

  • Model: nvidia/llama-3.3-nemotron-super-49b-v1 via NIM API
  • Method: Insert a unique fact ("needle") into filler text ("haystack"), ask model to retrieve it
  • Needle positions tested: start (10%), middle (50%), end (90%)
  • Metric: Exact match + partial fact retrieval (city, year, item, color, number)
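The prompt-construction step of such a test can be sketched roughly as follows (the filler sentence, needle wording, and position logic are my own illustration, not the exact scripts used for the results below):

```python
def build_haystack_prompt(needle: str, n_filler: int, position: float) -> str:
    """Embed a 'needle' fact at a fractional position inside filler text.

    position: 0.1 places it near the start, 0.5 in the middle, 0.9 near the end.
    """
    filler = "The sky was clear and the market was quiet that day."
    sentences = [filler] * n_filler
    idx = int(position * n_filler)  # sentence index where the needle lands
    sentences.insert(idx, needle)
    question = "What is the secret number mentioned in the text above?"
    return " ".join(sentences) + "\n\n" + question
```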

Results

| Context Size | Prompt Tokens | Retrieval | Latency | Notes |
|---|---|---|---|---|
| 2K | ~1.3K | ✅ 100% | 1-3s | Perfect |
| 8K | ~4.9K | ✅ 100% | 1-2s | Perfect |
| 16K | ~9.7K | ✅ 100% | 1-2s | Perfect |
| 32K | ~19K | ✅ 100% | 2-3s | Perfect |
| 64K | ~38K | ✅ 100% | 4-5s | Perfect |
| 128K | ~77K | ✅ 100% | 9-11s | Perfect, all positions |
| 160K | ~96K | ⚠️ Inconsistent | 15s | Model said "NOT FOUND" but processed the needle |
| 200K | ~120K | ✅ Found | 17s | Worked but beyond safe zone |
| 256K | n/a | ❌ Rejected | n/a | API error: "max context length is 131072 tokens" |

Key Findings

  1. NIM API hard limit: 131,072 tokens (exactly 128K, i.e. 128 × 1024), not 262K and not 1M. The API rejects requests exceeding this.

  2. Up to 128K: flawless. 100% retrieval accuracy across all needle positions (start/middle/end), with no degradation observed.

  3. Beyond the 128K test size: quality degrades. At the 160K test (~96K prompt tokens), the model started giving inconsistent answers, saying "NOT FOUND" while clearly having processed the content.

  4. Latency scales roughly linearly: about 1s per 10K tokens at the NIM API endpoint.

Practical Recommendation

For reliable results via NIM API:

  • Safe zone: ≤128K total context (prompt + expected completion)
  • Maximum input: ~120K tokens if you need a ~10K completion budget
  • Don't trust the 1M claim: the NIM API enforces the 131,072-token limit
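A pre-flight guard along these lines can catch over-budget requests before they hit the API. The limit constant is the one observed above; the helper function itself is my own sketch:

```python
NIM_CONTEXT_LIMIT = 131_072  # hard limit reported by the NIM API error above


def fits_context(prompt_tokens: int, completion_budget: int,
                 limit: int = NIM_CONTEXT_LIMIT) -> bool:
    """Return True if prompt plus expected completion fit within the context limit."""
    return prompt_tokens + completion_budget <= limit
```

For example, a ~77K-token prompt with a 10K completion budget fits comfortably, while a 128K prompt with an 8K budget does not.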

The confusion between "1M context" (technical report), "262K" (config.json), and "128K input + 128K output" (README) seems to be: the model architecture supports longer contexts, but the deployed NIM API enforces a 131K limit.

Hope this helps others hitting similar confusion. Test scripts available at nemotron-community-testing if anyone wants to reproduce.

@lvega-nv β€” would love to hear if this matches your expectations for NIM API behavior, or if there are plans to raise the 131K limit to match the 262K in config.json. Happy to run more targeted tests if useful.


Tested Feb 4, 2026 against build.nvidia.com NIM API
