MetalRT: The Fastest AI Inference Engine for Apple Silicon. Here Are the Numbers.
MetalRT is the first and only inference engine that accelerates all three AI modalities on Apple Silicon: LLMs, Speech-to-Text, and Text-to-Speech. We benchmarked it against every major engine. It won.
The headline numbers:
- 658 tok/s LLM decode (Qwen3-0.6B, M4 Max)
- 101ms to transcribe 70 seconds of audio (Whisper)
- 178ms to synthesize speech (Kokoro TTS)
- 1.67x faster than llama.cpp on LLM decode
- 4.6x faster than Apple MLX on speech-to-text
Part 1: LLM Decode Performance
We benchmarked MetalRT against four engines across four models on a single M4 Max.
Engines Tested
| Engine | Language | Benchmark Method |
|---|---|---|
| MetalRT | C++ | Native binary |
| uzu | Rust | Native CLI bench |
| mlx-lm | Python + MLX C++ | Python API |
| llama.cpp | C/C++ | llama-bench v8190 |
| Ollama | Go + llama.cpp | REST API (streaming) |
- Hardware: Apple M4 Max, 64 GB unified memory, macOS 26.3
- Models: Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, LFM2.5-1.2B (all 4-bit quantized)
- Runs: 5 per engine per model, best reported
- Fairness: MetalRT and mlx-lm use the exact same model files. Ollama uses the same GGUF files as llama.cpp, with REST API overhead included.
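The "5 runs, best reported" methodology is simple enough to sketch. Here is a minimal, engine-agnostic timing harness; the `generate` callable is a hypothetical stand-in for a real engine's decode call, not MetalRT's actual API:

```python
import time

def best_decode_speed(generate, n_runs=5):
    """Best tokens/second over n_runs, mirroring the methodology above.

    `generate` performs one full decode pass and returns the number of
    tokens it produced; timing it from the outside keeps the harness
    engine-agnostic.
    """
    best = 0.0
    for _ in range(n_runs):
        start = time.perf_counter()
        n_tokens = generate()
        elapsed = time.perf_counter() - start
        best = max(best, n_tokens / elapsed)
    return best
```

Reporting the best run rather than the mean filters out warm-up and background-scheduler noise, which is why single-run numbers can look worse than the tables below.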
Decode Speed (tok/s)
Decode speed is how fast tokens stream to the user. It is the metric that matters most for interactive chat.
| Model | MetalRT | uzu | mlx-lm | llama.cpp | Ollama |
|---|---|---|---|---|---|
| Qwen3-0.6B | 658 | 627 | 552 | 295* | 274* |
| Qwen3-4B | 186 | 165 | 170 | 87 | 120 |
| Llama-3.2-3B | 184 | 222 | 210 | 137 | 131 |
| LFM2.5-1.2B | 570 | 550 | 509 | 372 | 313 |
\*Qwen3-0.6B results for llama.cpp and Ollama use Q8_0 (8-bit) quantization, so they are not directly comparable.
MetalRT wins 3 of 4 models. The speedups:
- 1.10-1.19x vs mlx-lm (same model files)
- 1.35-2.14x vs llama.cpp
- 1.41-2.40x vs Ollama
uzu wins Llama-3.2-3B at 222 tok/s. We report this honestly.
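As a sanity check, the per-model ratios and the 1.67x headline figure can be reproduced from the decode table; a sketch, assuming the headline is the arithmetic mean over the three directly comparable 4-bit models:

```python
# Decode speeds (tok/s) from the table above. Qwen3-0.6B is excluded
# because llama.cpp used Q8_0 there, not 4-bit.
metalrt   = {"Qwen3-4B": 186, "Llama-3.2-3B": 184, "LFM2.5-1.2B": 570}
llama_cpp = {"Qwen3-4B": 87,  "Llama-3.2-3B": 137, "LFM2.5-1.2B": 372}

speedups = {m: metalrt[m] / llama_cpp[m] for m in metalrt}
average = sum(speedups.values()) / len(speedups)

print({m: round(s, 2) for m, s in speedups.items()})
print(round(average, 2))  # 1.67
```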
MetalRT vs Apple MLX and llama.cpp
mlx-lm is Apple's official inference framework. MetalRT and mlx-lm use the exact same model files, so this is a pure engine-to-engine comparison.
MetalRT is 1.10-1.19x faster than mlx-lm on decode (same model files) and 1.35-2.14x faster than llama.cpp across the board.
Part 2: Speech-to-Text Performance
We tested Whisper STT across four audio lengths. MetalRT won every single one.
Engines Tested
| Engine | Type | Notes |
|---|---|---|
| MetalRT | Native | Complete AI inference engine (LLM + STT + TTS) |
| mlx-whisper | MLX | Apple's official framework (pip install mlx-whisper) |
| sherpa-onnx | ONNX | Cross-platform baseline (pip install sherpa-onnx) |
- Model: Whisper Tiny (4-bit)
- Runs: 10 per engine, best reported
Whisper Tiny (4-bit) Latency
Lower latency is better
| Audio Duration | MetalRT | mlx-whisper | sherpa-onnx | Winner |
|---|---|---|---|---|
| Short (4s) | 31.9ms | 42.1ms | 64.9ms | MetalRT |
| Medium (11s) | 52.3ms | 59.6ms | 175ms | MetalRT |
| Long (33s) | 104ms | 134ms | 469ms | MetalRT |
| Extra-long (70s) | 101ms | 463ms | 554ms | MetalRT |
The 70-second result is not a typo. MetalRT transcribes over a minute of audio in 101 milliseconds.
Real-Time Factor (processing time divided by audio duration): 0.0014, lower is better. That is 714x faster than real-time.
What 714x Real-Time Means in Practice
- 1-hour podcast: ~5 seconds to process
- 3-hour meeting: ~15 seconds
- Live captioning: zero perceptible delay
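The arithmetic behind these estimates, as a quick check (note that 714x is the inverse of the rounded RTF of 0.0014; the unrounded ratio is closer to 693x):

```python
latency_s = 0.101          # 101 ms to transcribe the clip
audio_s   = 70.0           # 70-second clip

rtf = latency_s / audio_s  # real-time factor: processing time / audio duration

podcast_s = 1 * 3600 * rtf   # seconds to process a 1-hour podcast (~5 s)
meeting_s = 3 * 3600 * rtf   # seconds to process a 3-hour meeting
```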
Part 3: Text-to-Speech Performance
We tested Kokoro-82M across typical voice assistant response lengths.
Kokoro-82M Results
Lower synthesis time is better
| Text Length | MetalRT | mlx-audio | sherpa-onnx | Winner |
|---|---|---|---|---|
| 4 words | 178ms | 493ms | 504ms | MetalRT |
| 10 words | 230ms | 522ms | 723ms | MetalRT |
| 18 words | 381ms | 600ms | 1,395ms | MetalRT |
| 36 words | 604ms | 706ms | 2,115ms | MetalRT |
MetalRT is 2.8x faster than mlx-audio on short phrases, exactly what voice assistants need.
MetalRT vs The Competition: Summary
LLM Decode:
- 1.67x faster than llama.cpp (average over the three directly comparable 4-bit models)
- up to 1.19x faster than Apple MLX (same model files)
- 1.59x faster than Ollama (average over comparable models)
Speech-to-Text (70s audio):
- 4.6x faster than mlx-whisper
- 5.5x faster than sherpa-onnx
Text-to-Speech (4 words):
- 2.8x faster than mlx-audio
- 2.8x faster than sherpa-onnx
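The speech ratios above follow directly from the earlier tables; a quick check using the 70-second STT row and the 4-word TTS row:

```python
# 70-second audio, latency in ms (from the Whisper table)
stt = {"MetalRT": 101, "mlx-whisper": 463, "sherpa-onnx": 554}
# 4-word phrase, synthesis time in ms (from the Kokoro table)
tts = {"MetalRT": 178, "mlx-audio": 493, "sherpa-onnx": 504}

stt_speedups = {k: v / stt["MetalRT"] for k, v in stt.items() if k != "MetalRT"}
tts_speedups = {k: v / tts["MetalRT"] for k, v in tts.items() if k != "MetalRT"}

print({k: round(v, 1) for k, v in stt_speedups.items()})  # 4.6x, 5.5x
print({k: round(v, 1) for k, v in tts_speedups.items()})  # 2.8x, 2.8x
```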
What MetalRT Is Built For
| Use Case | Why MetalRT |
|---|---|
| Chat apps | 186 tok/s on a 4B model, responses stream instantly |
| Structured output / tool calling | Faster decode means faster JSON and function call generation |
| Agent workflows | Compound latency savings across sequential LLM calls |
| Coding assistants | Sub-7ms time-to-first-token on small models |
| Privacy-first apps | Cloud-competitive speed, entirely on-device |
| Voice pipelines | 101ms STT + 178ms TTS. Hear and respond in under 300ms. |
| Real-time transcription | 714x faster than real-time on Whisper |
| Medical / secure environments | Complete privacy. Zero cloud dependency. |
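The sub-300ms voice round trip in the table is just the sum of the measured STT and TTS latencies (LLM generation time for the reply is extra, but at these decode speeds a short response adds only a few milliseconds per token):

```python
stt_ms = 101  # transcribe the user's speech (worst case, 70 s clip)
tts_ms = 178  # synthesize a short 4-word reply

round_trip_ms = stt_ms + tts_ms  # 279 ms, under the 300 ms budget
```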
All Numbers at a Glance
LLM Performance:
- 658 tok/s peak decode (Qwen3-0.6B)
- 6.6ms time-to-first-token (Qwen3-0.6B)
- 186 tok/s on a 4B parameter model (Qwen3-4B)
STT Performance:
- 101ms for 70 seconds of audio
- 714x faster than real-time
- 4.6x faster than Apple MLX
TTS Performance:
- 178ms for typical voice responses
- 2.8x faster than Apple MLX
- Sub-400ms for most use cases
Quality: Identical output across all engines. The model is the same. The speed is not.
About MetalRT
MetalRT is the inference engine behind RunAnywhere, a production-grade on-device AI platform. RunAnywhere provides cross-platform SDKs for iOS, Android, Web, React Native, and Flutter, with MetalRT powering the Apple Silicon runtime.
MetalRT is written in C++ and talks directly to Apple's Metal GPU API. No Python overhead. No abstraction layers. Just raw compute.
Benchmarked on Apple M4 Max, 64 GB, macOS 26.3. LLM models: Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, LFM2.5-1.2B, all 4-bit. Speech models: Whisper Tiny (4-bit), Kokoro-82M. LLM: greedy decoding, 5 runs, best reported. Speech: 10 runs, best reported. MetalRT and mlx-lm share identical MLX 4-bit model files. llama.cpp and Ollama use GGUF Q4_K_M (Q8_0 for Qwen3-0.6B). Ollama v0.17.4 via REST API.