
# Model Comparison for Tool Calling

Detailed comparison of open-source models tested for tool calling with vLLM on an NVIDIA RTX 6000 Pro Blackwell (96GB).

## Full Comparison Table

|                    | Hermes-3 70B | Llama-3.3 70B | Qwen2 72B | Mistral-Nemo 12B |
|--------------------|--------------|---------------|-----------|------------------|
| Model ID           | `NousResearch/Hermes-3-Llama-3.1-70B-FP8` | `nvidia/Llama-3.3-70B-Instruct-FP8` | `RedHatAI/Qwen2-72B-Instruct-FP8` | `RedHatAI/Mistral-Nemo-Instruct-2407-FP8` |
| Size               | 70B          | 70B           | 72B       | 12B              |
| Quantization       | FP8 (compressed-tensors) | FP8 (native e4m3) | FP8 | FP8         |
| vLLM Parser        | `hermes`     | `llama3_json` | `hermes`  | `mistral`        |
| Context Window     | 128K         | 128K          | 128K      | 128K             |
| Speed              | 25-35 tok/s  | 60-90 tok/s   | 60-90 tok/s | 100-150 tok/s  |
| VRAM Usage         | ~40GB        | ~40GB         | ~45GB     | ~15GB            |
| Tool Call Quality  | Excellent    | Excellent     | Very Good | Good             |
| Multi-Tool         | Excellent    | Good          | Good      | Fair             |
| JSON Compliance    | Very High    | High          | High      | Medium           |
| Open WebUI Support | No           | Yes           | Yes       | Yes              |
| Multilingual       | Good         | Good          | Excellent | Good             |
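
Whichever model you pick, vLLM serves it behind the same OpenAI-compatible API, so the request shape does not change: tools are declared in the OpenAI function-calling schema. A minimal sketch of such a request body (the `get_weather` tool and the user prompt are invented for illustration; the model field takes any Model ID from the table):

```python
import json

# A tool definition in the OpenAI function-calling schema, accepted by
# vLLM's OpenAI-compatible server for every model in the table above.
# The weather tool itself is a placeholder for illustration.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name"},
            },
            "required": ["city"],
        },
    },
}

request_body = {
    # Any Model ID from the comparison table works here.
    "model": "NousResearch/Hermes-3-Llama-3.1-70B-FP8",
    "messages": [{"role": "user", "content": "What's the weather in Oslo?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",
}

# This JSON body would be POSTed to the server's /v1/chat/completions endpoint.
payload = json.dumps(request_body)
```

The per-model differences live server-side: the `--tool-call-parser` value from the table tells vLLM how to turn each model's native output format back into structured `tool_calls` in the response.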

## Detailed Notes

### Hermes-3-Llama-3.1-70B-FP8

**Best for:** tool calling quality and reliability

- Purpose-built for function calling by NousResearch
- Uses ChatML format with XML `<tool_call>` tags, the most reliable format for structured output
- Slowest of the 70B models due to compressed-tensors quantization (doesn't use native Blackwell FP8)
- Does NOT work with Open WebUI for tool calling (format incompatibility)
- Best at handling complex multi-step workflows with many tools
- Lowest hallucination rate for tool names and parameters
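
The `<tool_call>` convention wraps a JSON object in XML-style tags inside the model's text output, which is what makes it easy to extract reliably. A minimal client-side sketch of pulling those calls out of a completion (vLLM's `hermes` parser does this server-side; the sample completion string here is invented for illustration):

```python
import json
import re

# Matches a JSON object wrapped in Hermes-style <tool_call> ... </tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Extract every <tool_call>{...}</tool_call> block as a parsed dict."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

# Invented sample completion in the Hermes ChatML tool-call format.
sample = (
    "I'll look that up.\n"
    '<tool_call>\n{"name": "get_weather", "arguments": {"city": "Oslo"}}\n</tool_call>'
)

calls = extract_tool_calls(sample)
# calls[0] is {"name": "get_weather", "arguments": {"city": "Oslo"}}
```

The explicit tags make the boundary between prose and structured output unambiguous, which is a large part of why this format scores highest on JSON compliance above.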

### Llama-3.3-70B-Instruct-FP8

**Best for:** Open WebUI and general use

- Official NVIDIA FP8 quantization, the fastest 70B model on Blackwell
- Works out of the box with Open WebUI, no custom configuration
- Native FP8 (fp8_e4m3) leverages Blackwell's hardware acceleration
- Tool calling quality is nearly as good as Hermes-3 for most tasks
- Better at general conversation alongside tool use

### Qwen2-72B-Instruct-FP8

**Best for:** multilingual tool calling

- Strongest multilingual support (Chinese, Japanese, Korean, European languages)
- Good reasoning capabilities alongside tool calling
- Uses the `hermes` parser despite not being a Hermes model (its prompt format is ChatML-compatible)
- FP8 KV cache support saves VRAM
- Slightly larger memory footprint than the Llama models

### Mistral-Nemo-Instruct-2407-FP8

**Best for:** fast iteration and development

- Extremely fast: 100-150 tok/s (3-5x faster than the 70B models)
- Very low memory: ~15GB leaves room for other processes
- Good enough for simple tool calling (1-3 tools)
- Struggles with complex multi-step workflows
- Great for testing and prototyping before deploying a 70B model

## Recommendations by Use Case

| Use Case                 | Recommended Model | Why                               |
|--------------------------|-------------------|-----------------------------------|
| Production tool calling  | Hermes-3 70B      | Best reliability and accuracy     |
| Open WebUI deployment    | Llama-3.3 70B     | Works out of the box              |
| Multilingual applications| Qwen2 72B         | Best language coverage            |
| Development/testing      | Mistral-Nemo 12B  | Fastest iteration speed           |
| Multi-step workflows     | Hermes-3 70B      | Best at complex orchestration     |
| Simple single-tool calls | Any               | All models handle basic tools well|
| Memory-constrained       | Mistral-Nemo 12B  | Only ~15GB VRAM                   |

## Memory Budget (96GB GPU)

Hermes-3 70B FP8:

    Model weights:    ~40GB
    KV cache (128K):  ~45GB (4 concurrent requests)
    Overhead:          ~5GB
    Total:            ~90GB  ← fits on 96GB

Mistral-Nemo 12B FP8:

    Model weights:    ~15GB
    KV cache (128K):  ~20GB (8 concurrent requests)
    Overhead:          ~3GB
    Total:            ~38GB  ← leaves 58GB free
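
The KV-cache figures above can be sanity-checked with back-of-envelope arithmetic: keys plus values, per layer, per KV head, per token. A sketch, assuming standard Llama-3.x 70B attention geometry (80 layers, 8 grouped-query KV heads, head dimension 128, which comes from the Llama architecture rather than from this guide) and a 1-byte FP8 KV cache:

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                tokens: int, bytes_per_value: int = 1) -> float:
    """Rough KV-cache size in GiB: keys + values for every layer and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token * tokens / 1024**3

# Assumed Llama-3.x 70B geometry: 80 layers, 8 KV heads, head dim 128.
full_context = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, tokens=128 * 1024)
# → 20.0 GiB for a single request at the full 128K context.
```

A single full-context request therefore needs roughly 20GB of KV cache on its own, so the ~45GB pool above supports 4 concurrent requests only when they average well under the 128K maximum; vLLM's paged KV cache makes that sharing work in practice.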