vllm-tool-calling-guide / guides / TOOL_CALL_FORMATS.md
Author: Joshua Odmark (initial release of the vLLM tool calling guide for open source models)

# Tool Call Formats Explained

vLLM supports multiple tool call formats. Each model family uses a different native format, but vLLM converts them all to OpenAI-compatible JSON.

## Format Comparison

### 1. Hermes Format (ChatML + XML)

**Used by:** Hermes-3, Hermes-2-Pro, Qwen2 (via the hermes parser)
**Parser flag:** `--tool-call-parser hermes`

Model outputs:

```xml
<tool_call>
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
</tool_call>
```

Tool responses are formatted as:

```xml
<tool_response>
{"temperature": 22, "condition": "Sunny"}
</tool_response>
```

Characteristics:

- XML tags make tool calls easy to parse reliably
- Supports parallel calls via multiple `<tool_call>` blocks
- Most reliable format for structured output
- ChatML-based (`<|im_start|>`, `<|im_end|>`)
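If the hermes parser isn't available in your deployment and you need to recover these calls yourself, the extraction can be sketched in a few lines. This is illustrative only; the regex and function name below are not part of vLLM:

```python
# Sketch: extracting Hermes-style tool calls from raw model text.
# vLLM's hermes parser does this for you when --tool-call-parser hermes is set.
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def extract_hermes_tool_calls(text: str) -> list[dict]:
    """Return every JSON object found inside <tool_call> tags."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip a malformed block rather than failing the whole response
    return calls

output = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "San Francisco"}}\n</tool_call>'
print(extract_hermes_tool_calls(output))
```

Parallel calls fall out naturally: each `<tool_call>` block yields one entry in the returned list.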

### 2. Llama 3 JSON Format

**Used by:** Llama-3.1, Llama-3.3
**Parser flag:** `--tool-call-parser llama3_json`

Model outputs:

```json
{"name": "get_weather", "parameters": {"location": "San Francisco"}}
```

Characteristics:

- Pure JSON, no XML wrapping
- Uses `parameters` instead of `arguments` (vLLM normalizes this)
- Works natively with Open WebUI
- Supports the special `<|python_tag|>` token for code execution
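The `parameters`-vs-`arguments` mismatch is the main thing a hand-rolled client has to handle. A sketch of that normalization, assuming you have the raw model text; vLLM's `llama3_json` parser performs the equivalent step internally:

```python
# Sketch: normalizing Llama 3's "parameters" key to the OpenAI-style
# "arguments" key. Illustrative only, not vLLM's actual code.
import json

def normalize_llama3_call(raw: str) -> dict:
    call = json.loads(raw)
    if "parameters" in call and "arguments" not in call:
        call["arguments"] = call.pop("parameters")
    return call

raw = '{"name": "get_weather", "parameters": {"location": "San Francisco"}}'
print(normalize_llama3_call(raw))
```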

### 3. Mistral Format

**Used by:** Mistral-Nemo, Mistral-7B, Mistral-Small
**Parser flag:** `--tool-call-parser mistral`

Model outputs:

```
[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]
```

Characteristics:

- Uses a `[TOOL_CALLS]` prefix token
- Tool calls are a JSON array, so parallel calling is natural
- Clean, minimal format
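Parsing this format by hand is mostly prefix handling. A minimal sketch, mirroring the example output above (the function name is illustrative, not part of vLLM):

```python
# Sketch: splitting Mistral's [TOOL_CALLS] prefix from the JSON array.
# Illustrative only; vLLM's mistral parser handles this for you.
import json

PREFIX = "[TOOL_CALLS]"

def extract_mistral_tool_calls(text: str) -> list[dict]:
    text = text.strip()
    if not text.startswith(PREFIX):
        return []  # no tool calls in this response
    return json.loads(text[len(PREFIX):].strip())

output = '[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]'
print(extract_mistral_tool_calls(output))
```

Because the payload is already a JSON array, parallel calls need no extra handling: a two-call response simply yields a two-element list.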

## What Your Application Receives

Regardless of format, vLLM converts everything to OpenAI-compatible JSON:

```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\": \"San Francisco\"}"
        }
      }]
    }
  }]
}
```

Your application code is the same regardless of which model or parser you use.
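For example, dispatching tool calls from that response shape looks the same no matter which model or parser produced it. A sketch, using the response from above as a plain dict (the handler table and stub function are illustrative):

```python
# Sketch: dispatching OpenAI-format tool calls from a vLLM response.
# The response dict mirrors the JSON example above.
import json

def get_weather(location: str) -> dict:
    return {"temperature": 22, "condition": "Sunny"}  # stub handler

HANDLERS = {"get_weather": get_weather}

response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"San Francisco\"}",
                },
            }],
        }
    }]
}

for call in response["choices"][0]["message"]["tool_calls"]:
    fn = call["function"]
    args = json.loads(fn["arguments"])  # arguments arrive as a JSON *string*
    result = HANDLERS[fn["name"]](**args)
    print(call["id"], result)
```

Note the one easy-to-miss detail: `function.arguments` is a JSON-encoded string, not a nested object, so it must be passed through `json.loads` before use.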

## Which Parser for Which Model?

| Model | Parser | Why |
|-------|--------|-----|
| Hermes-3 (any size) | `hermes` | Fine-tuned on ChatML + XML format |
| Hermes-2-Pro | `hermes` | Same format family |
| Llama-3.1 (any size) | `llama3_json` | Native Llama 3 format |
| Llama-3.3 (any size) | `llama3_json` | Same format as 3.1 |
| Qwen2 | `hermes` | ChatML-compatible, works with the hermes parser |
| Mistral-Nemo | `mistral` | Native Mistral format |
| Mistral-7B | `mistral` | Same format family |

## Custom Middleware vs vLLM Parser

**When to use vLLM's built-in parser:**

- Standard OpenAI-compatible API usage
- Open WebUI or similar frontends
- Any application expecting OpenAI format

**When to build custom middleware:**

- You need to intercept and modify tool calls before execution
- You're doing validation/retry logic at the tool-call level
- Your Hermes model outputs `<tool_call>` tags but vLLM's parser isn't available
- You need custom error handling per tool call

For custom parsing, see `examples/robust_json_extraction.py`, which handles all the edge cases.

## Common Mistakes

1. **Wrong parser for the model.** Using the `hermes` parser with Llama 3.3 (or vice versa) silently produces no tool calls.
2. **Missing `--enable-auto-tool-choice`.** Without this flag, the model never generates tool calls even with the right parser.
3. **Custom system prompt overriding the format.** If you add `<tool_call>` instructions to a Llama 3.3 system prompt, the model outputs XML that the `llama3_json` parser can't parse.
4. **Assuming all models use the same format.** They don't. Always match the parser to the model.