Tool Call Formats Explained
VLLM supports multiple tool call formats. Each model family uses a different native format, but VLLM converts them all to OpenAI-compatible JSON.
Format Comparison
1. Hermes Format (ChatML + XML)
Used by: Hermes-3, Hermes-2-Pro, Qwen2 (via hermes parser)
Parser flag: --tool-call-parser hermes
Model outputs:
<tool_call>
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
</tool_call>
Tool responses are formatted as:
<tool_response>
{"temperature": 22, "condition": "Sunny"}
</tool_response>
Characteristics:
- XML tags make tool calls easy to parse reliably
- Supports parallel calls via a `tool_calls` array inside the tags
- Most reliable format for structured output
- ChatML-based (`<|im_start|>`, `<|im_end|>`)
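If you ever need to extract Hermes-style calls yourself (for example, when a parser isn't available), the XML tags make it a short job. A minimal sketch; the regex and function name are illustrative, not VLLM internals:

```python
import json
import re

# Matches each <tool_call>...</tool_call> block; DOTALL lets the JSON span lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def parse_hermes_tool_calls(text: str) -> list[dict]:
    """Extract every tool call from a Hermes-format completion."""
    return [json.loads(m.group(1)) for m in TOOL_CALL_RE.finditer(text)]

output = (
    '<tool_call>\n'
    '{"name": "get_weather", "arguments": {"location": "San Francisco"}}\n'
    '</tool_call>'
)
calls = parse_hermes_tool_calls(output)
```

Multiple `<tool_call>` blocks in one completion simply yield multiple entries in the returned list.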
2. Llama 3 JSON Format
Used by: Llama-3.1, Llama-3.3
Parser flag: --tool-call-parser llama3_json
Model outputs:
{"name": "get_weather", "parameters": {"location": "San Francisco"}}
Characteristics:
- Pure JSON, no XML wrapping
- Uses `parameters` instead of `arguments` (VLLM normalizes this)
- Works natively with Open WebUI
- Supports the special `<|python_tag|>` token for code execution
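The `parameters` → `arguments` normalization is easy to replicate if you're handling raw model output yourself. A sketch of the idea (illustrative code, not VLLM's actual implementation); note that OpenAI's schema expects `arguments` as a JSON *string*, not an object:

```python
import json

def normalize_llama3_call(raw: str) -> dict:
    """Convert a Llama 3 JSON tool call into OpenAI tool_call shape.

    Llama 3 emits {"name": ..., "parameters": {...}}; the OpenAI schema
    wants {"function": {"name": ..., "arguments": "<json string>"}}.
    """
    call = json.loads(raw)
    args = call.get("parameters", call.get("arguments", {}))
    return {
        "type": "function",
        "function": {
            "name": call["name"],
            "arguments": json.dumps(args),
        },
    }

raw = '{"name": "get_weather", "parameters": {"location": "San Francisco"}}'
normalized = normalize_llama3_call(raw)
```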
3. Mistral Format
Used by: Mistral-Nemo, Mistral-7B, Mistral-Small
Parser flag: --tool-call-parser mistral
Model outputs:
[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]
Characteristics:
- Uses a `[TOOL_CALLS]` prefix token
- Tool calls are a JSON array (natural parallel calling)
- Clean, minimal format
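Because the payload after the prefix is already a JSON array, hand-parsing Mistral output is one `find` plus one `json.loads`. A sketch under the same caveat as above (illustrative, not VLLM's parser):

```python
import json

PREFIX = "[TOOL_CALLS]"

def parse_mistral_tool_calls(text: str) -> list[dict]:
    """Extract the JSON array of tool calls after the [TOOL_CALLS] prefix."""
    idx = text.find(PREFIX)
    if idx == -1:
        return []  # no tool calls in this completion
    return json.loads(text[idx + len(PREFIX):].strip())

output = '[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]'
calls = parse_mistral_tool_calls(output)
```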
What Your Application Receives
Regardless of format, VLLM converts everything to OpenAI-compatible JSON:
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\": \"San Francisco\"}"
        }
      }]
    }
  }]
}
Your application code is the same regardless of which model or parser you use.
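As a concrete illustration, here is one way client-side dispatch on that response shape might look. Plain-dict handling is shown so the sketch stays self-contained; the `get_weather` stub and `run_tool_calls` helper are illustrative names, not part of any SDK:

```python
import json

def get_weather(location: str) -> dict:
    # Stand-in for a real weather lookup.
    return {"temperature": 22, "condition": "Sunny", "location": location}

TOOLS = {"get_weather": get_weather}

def run_tool_calls(response: dict) -> list[dict]:
    """Execute every tool call in an OpenAI-format chat completion."""
    results = []
    message = response["choices"][0]["message"]
    for call in message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        # "arguments" arrives as a JSON string, so decode before calling.
        args = json.loads(call["function"]["arguments"])
        results.append({"tool_call_id": call["id"], "result": fn(**args)})
    return results

response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"San Francisco\"}",
                },
            }],
        }
    }]
}
results = run_tool_calls(response)
```

Swapping the backing model from Hermes to Llama 3.3 to Mistral changes nothing in this code, which is the point of the normalization.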
Which Parser for Which Model?
| Model | Parser | Why |
|---|---|---|
| Hermes-3 (any size) | hermes | Fine-tuned on ChatML + XML format |
| Hermes-2-Pro | hermes | Same format family |
| Llama-3.1 (any size) | llama3_json | Native Llama 3 format |
| Llama-3.3 (any size) | llama3_json | Same format as 3.1 |
| Qwen2 | hermes | ChatML-compatible, works with hermes parser |
| Mistral-Nemo | mistral | Native Mistral format |
| Mistral-7B | mistral | Same format family |
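Putting the table into practice, a typical server launch looks like this (the model name is an example; substitute your own deployment):

```shell
# Serve a Hermes-family model with tool calling enabled.
# --enable-auto-tool-choice lets the model decide when to call tools;
# --tool-call-parser must match the model family per the table above.
vllm serve NousResearch/Hermes-3-Llama-3.1-8B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```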
Custom Middleware vs VLLM Parser
When to use VLLM's built-in parser:
- Standard OpenAI-compatible API usage
- Open WebUI or similar frontends
- Any application expecting OpenAI format
When to build custom middleware:
- You need to intercept and modify tool calls before execution
- You're doing validation/retry logic at the tool call level
- Your Hermes model outputs `<tool_call>` tags but VLLM's parser isn't available
- You need custom error handling per tool call
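A minimal sketch of the validation side of such middleware, checking a tool call before execution so a failure can drive a retry prompt. All names here (including the expected-arguments table) are illustrative, not a VLLM API:

```python
import json

# Expected argument keys per tool (illustrative schema).
EXPECTED_ARGS = {"get_weather": {"location"}}

def validate_tool_call(call: dict) -> tuple[bool, str]:
    """Check one OpenAI-format tool call; returns (ok, reason).

    A False result can be fed back to the model as a retry instruction.
    """
    name = call["function"]["name"]
    if name not in EXPECTED_ARGS:
        return False, f"unknown tool: {name}"
    try:
        args = json.loads(call["function"]["arguments"])
    except json.JSONDecodeError as exc:
        return False, f"malformed arguments JSON: {exc}"
    missing = EXPECTED_ARGS[name] - set(args)
    if missing:
        return False, f"missing arguments: {sorted(missing)}"
    return True, "ok"

good = {"function": {"name": "get_weather", "arguments": '{"location": "SF"}'}}
bad = {"function": {"name": "get_weather", "arguments": '{"units": "C"}'}}
```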
For custom parsing, see `examples/robust_json_extraction.py`, which handles all the edge cases.
Common Mistakes
- Wrong parser for model: using the `hermes` parser with Llama 3.3 (or vice versa) silently produces no tool calls
- Missing `--enable-auto-tool-choice`: without this flag, the model never generates tool calls even with the right parser
- Custom system prompt overriding format: if you add `<tool_call>` instructions to a Llama 3.3 system prompt, the model outputs XML but the `llama3_json` parser can't parse it
- Assuming all models use the same format: they don't. Always match parser to model.