vllm-tool-calling-guide / guides / TOOL_CALL_FORMATS.md
Author: Joshua Odmark (initial release of the vLLM tool calling guide for open source models)

# Tool Call Formats Explained

vLLM supports multiple tool call formats. Each model family uses a different native format, but vLLM converts them all to OpenAI-compatible JSON.

## Format Comparison

### 1. Hermes Format (ChatML + XML)

**Used by:** Hermes-3, Hermes-2-Pro, Qwen2 (via the hermes parser)
**Parser flag:** `--tool-call-parser hermes`

Model outputs:

```xml
<tool_call>
{"name": "get_weather", "arguments": {"location": "San Francisco"}}
</tool_call>
```

Tool responses are formatted as:

```xml
<tool_response>
{"temperature": 22, "condition": "Sunny"}
</tool_response>
```

Characteristics:

- XML tags make tool calls easy to parse reliably
- Supports parallel calls via multiple `<tool_call>` blocks
- Most reliable format for structured output
- ChatML-based (`<|im_start|>`, `<|im_end|>`)
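If the hermes parser isn't available in your deployment and you need to recover these calls yourself, the extraction can be sketched in a few lines. This is illustrative only; the regex and function name below are not part of vLLM:

```python
# Sketch: extracting Hermes-style tool calls from raw model text.
# vLLM's hermes parser does this for you when --tool-call-parser hermes is set.
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(.*?)\s*</tool_call>", re.DOTALL)

def extract_hermes_tool_calls(text: str) -> list[dict]:
    """Return every JSON object found inside <tool_call> tags."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip a malformed block rather than failing the whole response
    return calls

output = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "San Francisco"}}\n</tool_call>'
print(extract_hermes_tool_calls(output))
```

Parallel calls fall out naturally: each `<tool_call>` block yields one entry in the returned list.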

### 2. Llama 3 JSON Format

**Used by:** Llama-3.1, Llama-3.3
**Parser flag:** `--tool-call-parser llama3_json`

Model outputs:

```json
{"name": "get_weather", "parameters": {"location": "San Francisco"}}
```

Characteristics:

- Pure JSON, no XML wrapping
- Uses `parameters` instead of `arguments` (vLLM normalizes this)
- Works natively with Open WebUI
- Supports the special `<|python_tag|>` token for code execution
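The `parameters`-vs-`arguments` mismatch is the main thing a hand-rolled client has to handle. A sketch of that normalization, assuming you have the raw model text; vLLM's `llama3_json` parser performs the equivalent step internally:

```python
# Sketch: normalizing Llama 3's "parameters" key to the OpenAI-style
# "arguments" key. Illustrative only, not vLLM's actual code.
import json

def normalize_llama3_call(raw: str) -> dict:
    call = json.loads(raw)
    if "parameters" in call and "arguments" not in call:
        call["arguments"] = call.pop("parameters")
    return call

raw = '{"name": "get_weather", "parameters": {"location": "San Francisco"}}'
print(normalize_llama3_call(raw))
```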

### 3. Mistral Format

**Used by:** Mistral-Nemo, Mistral-7B, Mistral-Small
**Parser flag:** `--tool-call-parser mistral`

Model outputs:

```
[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]
```

Characteristics:

- Uses a `[TOOL_CALLS]` prefix token
- Tool calls are a JSON array, so parallel calling is natural
- Clean, minimal format
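Parsing this format by hand is mostly prefix handling. A minimal sketch, mirroring the example output above (the function name is illustrative, not part of vLLM):

```python
# Sketch: splitting Mistral's [TOOL_CALLS] prefix from the JSON array.
# Illustrative only; vLLM's mistral parser handles this for you.
import json

PREFIX = "[TOOL_CALLS]"

def extract_mistral_tool_calls(text: str) -> list[dict]:
    text = text.strip()
    if not text.startswith(PREFIX):
        return []  # no tool calls in this response
    return json.loads(text[len(PREFIX):].strip())

output = '[TOOL_CALLS] [{"name": "get_weather", "arguments": {"location": "San Francisco"}}]'
print(extract_mistral_tool_calls(output))
```

Because the payload is already a JSON array, parallel calls need no extra handling: a two-call response simply yields a two-element list.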

## What Your Application Receives

Regardless of format, vLLM converts everything to OpenAI-compatible JSON:

```json
{
  "choices": [{
    "message": {
      "role": "assistant",
      "content": null,
      "tool_calls": [{
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"location\": \"San Francisco\"}"
        }
      }]
    }
  }]
}
```

Your application code is the same regardless of which model or parser you use.
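For example, dispatching tool calls from that response shape looks the same no matter which model or parser produced it. A sketch, using the response from above as a plain dict (the handler table and stub function are illustrative):

```python
# Sketch: dispatching OpenAI-format tool calls from a vLLM response.
# The response dict mirrors the JSON example above.
import json

def get_weather(location: str) -> dict:
    return {"temperature": 22, "condition": "Sunny"}  # stub handler

HANDLERS = {"get_weather": get_weather}

response = {
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"San Francisco\"}",
                },
            }],
        }
    }]
}

for call in response["choices"][0]["message"]["tool_calls"]:
    fn = call["function"]
    args = json.loads(fn["arguments"])  # arguments arrive as a JSON *string*
    result = HANDLERS[fn["name"]](**args)
    print(call["id"], result)
```

Note the one easy-to-miss detail: `function.arguments` is a JSON-encoded string, not a nested object, so it must be passed through `json.loads` before use.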

## Which Parser for Which Model?

| Model | Parser | Why |
|-------|--------|-----|
| Hermes-3 (any size) | `hermes` | Fine-tuned on ChatML + XML format |
| Hermes-2-Pro | `hermes` | Same format family |
| Llama-3.1 (any size) | `llama3_json` | Native Llama 3 format |
| Llama-3.3 (any size) | `llama3_json` | Same format as 3.1 |
| Qwen2 | `hermes` | ChatML-compatible, works with the hermes parser |
| Mistral-Nemo | `mistral` | Native Mistral format |
| Mistral-7B | `mistral` | Same format family |

## Custom Middleware vs vLLM Parser

**When to use vLLM's built-in parser:**

- Standard OpenAI-compatible API usage
- Open WebUI or similar frontends
- Any application expecting OpenAI format

**When to build custom middleware:**

- You need to intercept and modify tool calls before execution
- You're doing validation/retry logic at the tool-call level
- Your Hermes model outputs `<tool_call>` tags but vLLM's parser isn't available
- You need custom error handling per tool call

For custom parsing, see `examples/robust_json_extraction.py`, which handles all the edge cases.

## Common Mistakes

1. **Wrong parser for the model.** Using the `hermes` parser with Llama 3.3 (or vice versa) silently produces no tool calls.
2. **Missing `--enable-auto-tool-choice`.** Without this flag, the model never generates tool calls even with the right parser.
3. **Custom system prompt overriding the format.** If you add `<tool_call>` instructions to a Llama 3.3 system prompt, the model outputs XML that the `llama3_json` parser can't parse.
4. **Assuming all models use the same format.** They don't. Always match the parser to the model.