# Model Comparison for Tool Calling

Detailed comparison of open-source models tested for tool calling with vLLM on an NVIDIA RTX 6000 Pro Blackwell (96GB).
## Full Comparison Table
| | Hermes-3 70B | Llama-3.3 70B | Qwen2 72B | Mistral-Nemo 12B |
|---|---|---|---|---|
| Model ID | `NousResearch/Hermes-3-Llama-3.1-70B-FP8` | `nvidia/Llama-3.3-70B-Instruct-FP8` | `RedHatAI/Qwen2-72B-Instruct-FP8` | `RedHatAI/Mistral-Nemo-Instruct-2407-FP8` |
| Size | 70B | 70B | 72B | 12B |
| Quantization | FP8 (compressed-tensors) | FP8 (native e4m3) | FP8 | FP8 |
| vLLM Parser | `hermes` | `llama3_json` | `hermes` | `mistral` |
| Context Window | 128K | 128K | 128K | 128K |
| Speed | 25-35 tok/s | 60-90 tok/s | 60-90 tok/s | 100-150 tok/s |
| VRAM Usage | ~40GB | ~40GB | ~45GB | ~15GB |
| Tool Call Quality | Excellent | Excellent | Very Good | Good |
| Multi-Tool | Excellent | Good | Good | Fair |
| JSON Compliance | Very High | High | High | Medium |
| Open WebUI | No | Yes | Yes | Yes |
| Multilingual | Good | Good | Excellent | Good |
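Whichever parser is configured, clients talk to all four models through the same OpenAI-compatible `/v1/chat/completions` API. A minimal sketch of a tool-calling request body follows; the `get_weather` tool and its schema are illustrative assumptions, not part of any model's API:

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema,
# which vLLM's OpenAI-compatible server accepts for all four models.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",  # illustrative tool name
        "description": "Look up current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}

def build_request(model_id: str, user_msg: str) -> dict:
    """Assemble a chat-completions request body with tools attached."""
    return {
        "model": model_id,
        "messages": [{"role": "user", "content": user_msg}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",  # let the model decide whether to call a tool
    }

body = build_request("NousResearch/Hermes-3-Llama-3.1-70B-FP8",
                     "What's the weather in Berlin?")
print(json.dumps(body, indent=2))
```

The same body works for every model in the table; only the `model` field and the server-side `--tool-call-parser` setting change.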
## Detailed Notes
### Hermes-3-Llama-3.1-70B-FP8

Best for: Tool calling quality and reliability

- Purpose-built for function calling by NousResearch
- Uses ChatML format with XML `<tool_call>` tags, the most reliable format for structured output
- Slowest of the 70B models due to `compressed-tensors` quantization (doesn't use native Blackwell FP8)
- Does NOT work with Open WebUI for tool calling (format incompatibility)
- Best at handling complex multi-step workflows with many tools
- Lowest hallucination rate for tool names and parameters
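The `<tool_call>` wrapper is what makes Hermes-style output easy to extract reliably. vLLM's `hermes` parser does this server-side, but a client-side sketch shows the idea; the regex and sample text below are illustrative assumptions:

```python
import json
import re

# Hermes-style output wraps a JSON object in <tool_call>...</tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return every tool call embedded in raw model output."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(text)]

# Illustrative model output mixing prose with a tool call
raw = ('Sure, checking that now.\n'
       '<tool_call>\n'
       '{"name": "get_weather", "arguments": {"city": "Berlin"}}\n'
       '</tool_call>')
calls = extract_tool_calls(raw)
print(calls[0]["name"])  # get_weather
```

Because the tags unambiguously delimit the JSON payload, surrounding prose never corrupts the extraction, which is one reason this format scores highest on JSON compliance.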
### Llama-3.3-70B-Instruct-FP8

Best for: Open WebUI and general use

- Official NVIDIA FP8 quantization; the fastest 70B model on Blackwell
- Works out of the box with Open WebUI, no custom configuration
- Native FP8 (`fp8_e4m3`) leverages Blackwell's hardware acceleration
- Tool calling quality is nearly as good as Hermes-3 for most tasks
- Better at general conversation alongside tool use
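The `llama3_json` parser exists because Llama 3.x emits a bare JSON object (`{"name": ..., "parameters": ...}`, per Llama 3's documented format) rather than XML tags. A client-side sketch of recognizing such output; the sample strings are illustrative, and vLLM performs this step server-side:

```python
import json

def parse_llama_tool_call(text: str):
    """Interpret raw output as a Llama-3-style JSON tool call, if it is one.

    Llama 3.x emits {"name": ..., "parameters": {...}} with no XML wrapper,
    which is why the `llama3_json` parser is needed instead of `hermes`.
    Returns the parsed call, or None for plain prose.
    """
    try:
        obj = json.loads(text.strip())
    except json.JSONDecodeError:
        return None  # not valid JSON, so treat it as ordinary text
    if isinstance(obj, dict) and "name" in obj and "parameters" in obj:
        return obj
    return None

call = parse_llama_tool_call('{"name": "get_weather", "parameters": {"city": "Oslo"}}')
print(call["name"] if call else "no tool call")  # get_weather
```

The lack of a delimiter means any stray prose around the JSON breaks parsing, which is consistent with this model's slightly lower JSON-compliance rating than Hermes-3.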
### Qwen2-72B-Instruct-FP8

Best for: Multilingual tool calling

- Strongest multilingual support (Chinese, Japanese, Korean, European languages)
- Good reasoning capabilities alongside tool calling
- Uses the `hermes` parser despite not being a Hermes model (ChatML-compatible)
- FP8 KV cache support saves VRAM
- Slightly larger memory footprint than the Llama models
### Mistral-Nemo-Instruct-2407-FP8

Best for: Fast iteration and development

- Extremely fast: 100-150 tok/s (3-5x faster than the 70B models)
- Very low memory: ~15GB leaves room for other processes
- Good enough for simple tool calling (1-3 tools)
- Struggles with complex multi-step workflows
- Great for testing and prototyping before deploying a 70B model
## Recommendations by Use Case
| Use Case | Recommended Model | Why |
|---|---|---|
| Production tool calling | Hermes-3 70B | Best reliability and accuracy |
| Open WebUI deployment | Llama-3.3 70B | Works out of the box |
| Multilingual applications | Qwen2 72B | Best language coverage |
| Development/testing | Mistral-Nemo 12B | Fastest iteration speed |
| Multi-step workflows | Hermes-3 70B | Best at complex orchestration |
| Simple single-tool calls | Any | All models handle basic tools well |
| Memory-constrained | Mistral-Nemo 12B | Only 15GB VRAM |
## Memory Budget (96GB GPU)
```
Hermes-3 70B FP8:
  Model weights:    ~40GB
  KV cache (128K):  ~45GB (4 concurrent requests)
  Overhead:         ~5GB
  Total:            ~90GB  ← fits on 96GB

Mistral-Nemo 12B FP8:
  Model weights:    ~15GB
  KV cache (128K):  ~20GB (8 concurrent requests)
  Overhead:         ~3GB
  Total:            ~38GB  ← leaves 58GB free
```
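The KV-cache figures above can be sanity-checked from model geometry. A rough per-token estimate, assuming Llama-70B-class dimensions (80 layers, grouped-query attention with 8 KV heads, head dim 128) and 1-byte FP8 KV entries; these dimensions are assumptions from the Llama architecture family, not values stated in this document:

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_value: int = 1) -> int:
    """Bytes of KV cache per token: 2 tensors (K and V) per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# Assumed Llama-3.x-70B geometry: 80 layers, 8 KV heads (GQA), head dim 128
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
print(per_token)  # 163840 bytes, i.e. 160 KiB per token at FP8

# Tokens that fit in a ~45GB KV pool under these assumptions,
# shared across all concurrent requests
pool_tokens = (45 * 1024**3) // per_token
print(pool_tokens)  # 294912
```

Under these assumed dimensions, a ~45GB pool holds roughly 295K tokens in total, so four concurrent requests fit comfortably as long as they do not all use the full 128K context at once.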