The Critical Context Length Fix
This is the #1 issue people hit when deploying tool calling with vLLM.
The Problem
vLLM often defaults to context windows of 16K-32K tokens. That seems fine for normal chat, but tool calling needs significantly more context:
```
System prompt:                 3,000 -  5,000 tokens
Tool definitions (per tool):     500 -  2,000 tokens
  × 5-10 tools:                2,500 - 20,000 tokens
User message:                    100 -  1,000 tokens
Previous conversation:         1,000 - 10,000 tokens
Tool responses:                2,000 - 20,000 tokens
Safety margin:                 5,000 tokens
─────────────────────────────────────────────────
Total needed:                 13,600 - 61,000 tokens
```
With a 16K context window, the model runs out of space mid-generation. The tool call gets silently truncated: you see incomplete JSON, missing arguments, or the model suddenly stops generating.
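To make the budget arithmetic reproducible, here is a minimal sketch; the (low, high) token ranges are the same rough estimates as the breakdown above, not measured values:

```python
# Rough context-budget estimator for tool calling.
# Ranges mirror the estimates above; "tool_definitions" is already the
# aggregate for 5-10 tools (500-2,000 tokens each), so the per-tool
# line is not double-counted.
BUDGET = {
    "system_prompt": (3_000, 5_000),
    "tool_definitions": (2_500, 20_000),
    "user_message": (100, 1_000),
    "previous_conversation": (1_000, 10_000),
    "tool_responses": (2_000, 20_000),
    "safety_margin": (5_000, 5_000),
}

def required_context() -> tuple[int, int]:
    """Sum the low and high ends of every line item."""
    low = sum(lo for lo, _ in BUDGET.values())
    high = sum(hi for _, hi in BUDGET.values())
    return low, high

low, high = required_context()
print(f"Total needed: {low:,} - {high:,} tokens")  # 13,600 - 61,000
```

Even the low end (13,600 tokens) leaves almost no generation headroom in a 16K window.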
Symptoms
- Tool calls end mid-JSON: `{"name": "get_weather", "arguments": {"loc`
- Model stops generating after the first tool call in a multi-step workflow
- Second or third tool call in a conversation is always malformed
- Model "forgets" tool definitions and responds with plain text
- Works fine with 1-2 tools but fails with 5+
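One cheap way to catch the first symptom in application code is to try parsing the tool-call arguments before dispatching them; a minimal sketch:

```python
import json

def is_complete_tool_call(raw_arguments: str) -> bool:
    """Return True only if the arguments string is complete, parseable JSON.

    Truncated output such as '{"loc' raises JSONDecodeError, which is a
    strong hint the context window was exhausted mid-generation.
    """
    try:
        json.loads(raw_arguments)
        return True
    except json.JSONDecodeError:
        return False

print(is_complete_tool_call('{"loc'))                   # False
print(is_complete_tool_call('{"location": "Berlin"}'))  # True
```

This does not fix the root cause, but it turns silent truncation into an explicit error you can log and alert on.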
The Fix
```bash
# BEFORE (broken): default context, too small for tool calling
python -m vllm.entrypoints.openai.api_server \
  --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
  --max-model-len 16384
```

```bash
# AFTER (working): 128K context (the model's full supported length),
# with concurrency reduced so the KV cache still fits
python -m vllm.entrypoints.openai.api_server \
  --model NousResearch/Hermes-3-Llama-3.1-70B-FP8 \
  --max-model-len 131072 \
  --max-num-seqs 4 \
  --max-num-batched-tokens 132000 \
  --gpu-memory-utilization 0.90
```
Memory Math
The tradeoff is between context length and concurrent requests:
70B FP8 Model on 96GB GPU
```
Model weights (FP8, ~1 byte/param):  ~70 GB
Available for KV cache:              ~16 GB (at 0.90 utilization)

KV cache per token per request (rough estimate):
  70B model ≈ 0.5 KB per token per layer

Cost per concurrent request at 128K context:
  128,000 × 0.5 KB = 64 MB per layer; × 80 layers ≈ 5 GB

Max concurrent requests:
  16 GB ÷ ~5 GB/request ≈ 3-4 requests

→ Use --max-num-seqs 4
```
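The same arithmetic in executable form. This is a back-of-envelope sketch: the ~70 GB weight figure assumes ~1 byte per parameter at FP8, and 0.5 KB of KV cache per token per layer is a rough estimate, not a measured value:

```python
# Back-of-envelope KV-cache budget for a 70B FP8 model on a 96 GB GPU.
# Assumptions (not measured): ~1 byte/param FP8 weights, 80 layers,
# ~0.5 KB of KV cache per token per layer.
GPU_GB = 96
UTILIZATION = 0.90
WEIGHTS_GB = 70          # 70B params x ~1 byte/param (FP8)
LAYERS = 80
KV_KB_PER_TOKEN_PER_LAYER = 0.5
CONTEXT_TOKENS = 128_000

kv_pool_gb = GPU_GB * UTILIZATION - WEIGHTS_GB
per_request_gb = CONTEXT_TOKENS * KV_KB_PER_TOKEN_PER_LAYER * LAYERS / 1e6
theoretical = int(kv_pool_gb // per_request_gb)

print(f"KV cache pool:     {kv_pool_gb:.1f} GB")      # ~16.4 GB
print(f"Per 128K request:  {per_request_gb:.2f} GB")  # ~5.12 GB
print(f"Fits concurrently: ~{theoretical} requests")
```

The result lands at 3-4 full-context requests, which is consistent with capping `--max-num-seqs` at 4 (vLLM preempts sequences when the cache pool is exhausted, so the cap is a scheduling hint, not a hard crash boundary).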
12B FP8 Model on 96GB GPU
```
Model weights (FP8): ~15 GB
Available for KV cache: ~71 GB (at 0.90 utilization)

→ Use --max-num-seqs 8 (or more)
```
Concurrency vs Context Tradeoff
| Context Length | Max Seqs (70B) | Max Seqs (12B) | Tool Calling |
|---|---|---|---|
| 16K | 16 | 32+ | Broken for multi-tool |
| 32K | 8 | 16 | Marginal |
| 64K | 6 | 12 | Good for simple workflows |
| 128K | 4 | 8 | Reliable for complex workflows |
Recommendation: Always use 128K. The reduced concurrency is worth it. If you need more throughput, use a smaller model rather than reducing context.
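If you script your launch configs, the table above can be encoded as a simple lookup; a sketch (the values are the rough guidelines from the table, not measured limits):

```python
# Guideline --max-num-seqs values from the table above (illustrative).
MAX_SEQS = {
    # context_len: (70B, 12B)
    16_384: (16, 32),
    32_768: (8, 16),
    65_536: (6, 12),
    131_072: (4, 8),
}

def suggested_max_num_seqs(context_len: int, model_size: str = "70B") -> int:
    """Pick the guideline --max-num-seqs for a given context length."""
    col = 0 if model_size == "70B" else 1
    return MAX_SEQS[context_len][col]

print(suggested_max_num_seqs(131_072))          # 4
print(suggested_max_num_seqs(131_072, "12B"))   # 8
```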
Why This Isn't Obvious
- vLLM doesn't warn you when context is too small; it just generates truncated output
- The default `max-model-len` varies by model and may not match the model's actual capability
- Simple tests with 1-2 tools often pass even at 16K, so the issue only appears in production
- The truncation looks like a model quality issue, not a configuration issue
Verification
After increasing context, verify with:
```bash
curl http://localhost:8000/v1/models | python -m json.tool
```
Check the `max_model_len` field in the response to confirm it is set to 131072.
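If you want the check scripted rather than eyeballed, parse the response and assert on the field. A sketch, assuming the payload is shaped like vLLM's OpenAI-compatible `/v1/models` response (where `max_model_len` appears as a vLLM-specific extension on each model entry):

```python
def get_max_model_len(models_response: dict) -> int:
    """Extract max_model_len from a /v1/models response payload."""
    return models_response["data"][0]["max_model_len"]

# Example payload shaped like vLLM's response (abbreviated):
payload = {
    "object": "list",
    "data": [
        {
            "id": "NousResearch/Hermes-3-Llama-3.1-70B-FP8",
            "object": "model",
            "max_model_len": 131072,
        }
    ],
}

assert get_max_model_len(payload) == 131072
print("context window OK:", get_max_model_len(payload))
```

In a deployment pipeline, fetch the live payload with an HTTP client and fail the rollout if the assertion does not hold.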