
# Model Comparison for Tool Calling

Detailed comparison of open-source models tested for tool calling with vLLM on an NVIDIA RTX 6000 Pro Blackwell (96GB).

## Full Comparison Table

|                    | Hermes-3 70B | Llama-3.3 70B | Qwen2 72B | Mistral-Nemo 12B |
|--------------------|--------------|---------------|-----------|------------------|
| Model ID           | `NousResearch/Hermes-3-Llama-3.1-70B-FP8` | `nvidia/Llama-3.3-70B-Instruct-FP8` | `RedHatAI/Qwen2-72B-Instruct-FP8` | `RedHatAI/Mistral-Nemo-Instruct-2407-FP8` |
| Size               | 70B          | 70B           | 72B       | 12B              |
| Quantization       | FP8 (compressed-tensors) | FP8 (native e4m3) | FP8 | FP8         |
| vLLM Parser        | `hermes`     | `llama3_json` | `hermes`  | `mistral`        |
| Context Window     | 128K         | 128K          | 128K      | 128K             |
| Speed              | 25-35 tok/s  | 60-90 tok/s   | 60-90 tok/s | 100-150 tok/s  |
| VRAM Usage         | ~40GB        | ~40GB         | ~45GB     | ~15GB            |
| Tool Call Quality  | Excellent    | Excellent     | Very Good | Good             |
| Multi-Tool         | Excellent    | Good          | Good      | Fair             |
| JSON Compliance    | Very High    | High          | High      | Medium           |
| Open WebUI Support | No           | Yes           | Yes       | Yes              |
| Multilingual       | Good         | Good          | Excellent | Good             |
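
Whichever model you pick, vLLM serves it behind the same OpenAI-compatible API, so the request shape does not change: tools are declared in the OpenAI function-calling schema. A minimal sketch of such a request body (the `get_weather` tool and the user prompt are invented for illustration; the model field takes any Model ID from the table):

```python
import json

# A tool definition in the OpenAI function-calling schema, accepted by
# vLLM's OpenAI-compatible server for every model in the table above.
# The weather tool itself is a placeholder for illustration.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

request_body = {
    # Any Model ID from the comparison table works here.
    "model": "NousResearch/Hermes-3-Llama-3.1-70B-FP8",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",
}

# This JSON body would be POSTed to the server's /v1/chat/completions endpoint.
payload = json.dumps(request_body)
```

The per-model differences live server-side: the `--tool-call-parser` value from the table tells vLLM how to turn each model's native output format back into structured `tool_calls` in the response.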

## Detailed Notes

### Hermes-3-Llama-3.1-70B-FP8

**Best for:** tool calling quality and reliability

- Purpose-built for function calling by NousResearch
- Uses ChatML format with XML `<tool_call>` tags, the most reliable format for structured output
- Slowest of the 70B models due to compressed-tensors quantization (doesn't use native Blackwell FP8)
- Does NOT work with Open WebUI for tool calling (format incompatibility)
- Best at handling complex multi-step workflows with many tools
- Lowest hallucination rate for tool names and parameters
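
The `<tool_call>` convention wraps a JSON object in XML-style tags inside the model's text output, which is what makes it easy to extract reliably. A minimal client-side sketch of pulling those calls out of a completion (vLLM's `hermes` parser does this server-side; the sample completion string here is invented for illustration):

```python
import json
import re

# Matches a JSON object wrapped in Hermes-style <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Extract every <tool_call>{...}</tool_call> block as a parsed dict."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

# Invented sample completion in the Hermes ChatML tool-call format.
sample = (
    "I'll look that up.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Oslo"}}\n</tool_call>'
)

calls = extract_tool_calls(sample)
# calls[0] is {"name": "get_weather", "arguments": {"city": "Oslo"}}
```

The explicit tags make the boundary between prose and structured output unambiguous, which is a large part of why this format scores highest on JSON compliance above.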

### Llama-3.3-70B-Instruct-FP8

**Best for:** Open WebUI and general use

- Official NVIDIA FP8 quantization, the fastest 70B model on Blackwell
- Works out of the box with Open WebUI, no custom configuration
- Native FP8 (fp8_e4m3) leverages Blackwell's hardware acceleration
- Tool calling quality is nearly as good as Hermes-3 for most tasks
- Better at general conversation alongside tool use

### Qwen2-72B-Instruct-FP8

**Best for:** multilingual tool calling

- Strongest multilingual support (Chinese, Japanese, Korean, European languages)
- Good reasoning capabilities alongside tool calling
- Uses the `hermes` parser despite not being a Hermes model (its prompt format is ChatML-compatible)
- FP8 KV cache support saves VRAM
- Slightly larger memory footprint than the Llama models

### Mistral-Nemo-Instruct-2407-FP8

**Best for:** fast iteration and development

- Extremely fast: 100-150 tok/s (3-5x faster than the 70B models)
- Very low memory: ~15GB leaves room for other processes
- Good enough for simple tool calling (1-3 tools)
- Struggles with complex multi-step workflows
- Great for testing and prototyping before deploying a 70B model

## Recommendations by Use Case

| Use Case                 | Recommended Model | Why                               |
|--------------------------|-------------------|-----------------------------------|
| Production tool calling  | Hermes-3 70B      | Best reliability and accuracy     |
| Open WebUI deployment    | Llama-3.3 70B     | Works out of the box              |
| Multilingual applications| Qwen2 72B         | Best language coverage            |
| Development/testing      | Mistral-Nemo 12B  | Fastest iteration speed           |
| Multi-step workflows     | Hermes-3 70B      | Best at complex orchestration     |
| Simple single-tool calls | Any               | All models handle basic tools well|
| Memory-constrained       | Mistral-Nemo 12B  | Only ~15GB VRAM                   |

## Memory Budget (96GB GPU)

Hermes-3 70B FP8:

    Model weights:    ~40GB
    KV cache (128K):  ~45GB (4 concurrent requests)
    Overhead:          ~5GB
    Total:            ~90GB  ← fits on 96GB

Mistral-Nemo 12B FP8:

    Model weights:    ~15GB
    KV cache (128K):  ~20GB (8 concurrent requests)
    Overhead:          ~3GB
    Total:            ~38GB  ← leaves 58GB free
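
The KV-cache figures above can be sanity-checked with back-of-envelope arithmetic: keys plus values, per layer, per KV head, per token. A sketch, assuming standard Llama-3.x 70B attention geometry (80 layers, 8 grouped-query KV heads, head dimension 128, which comes from the Llama architecture rather than from this guide) and a 1-byte FP8 KV cache:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_value: int = 1) -> float:
    """Rough KV-cache size in GiB: keys + values for every layer and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * tokens / 1024**3

# Assumed Llama-3.x 70B geometry: 80 layers, 8 KV heads, head dim 128.
full_context = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, tokens=128 * 1024)
# → 20.0 GiB for a single request at the full 128K context.
```

A single full-context request therefore needs roughly 20GB of KV cache on its own, so the ~45GB pool above supports 4 concurrent requests only when they average well under the 128K maximum; vLLM's paged KV cache makes that sharing work in practice.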