MetalRT: The Fastest AI Inference Engine for Apple Silicon. Here Are the Numbers.

Community Article Published March 12, 2026

We built a native inference engine for Apple Silicon from the ground up. No wrappers. No abstraction layers. Direct Metal GPU programming.

MetalRT is the first and only inference engine that accelerates all three AI modalities on Apple Silicon: LLMs, Speech-to-Text, and Text-to-Speech. We benchmarked it against every major engine. It won.

The headline numbers:

  • 658 tok/s LLM decode (Qwen3-0.6B, M4 Max)
  • 101ms to transcribe 70 seconds of audio (Whisper)
  • 178ms to synthesize speech (Kokoro TTS)
  • 1.67x faster than llama.cpp on LLM decode
  • 4.6x faster than Apple MLX on speech-to-text

Part 1: LLM Decode Performance

We benchmarked MetalRT against four engines across four models on a single M4 Max.

Engines Tested

| Engine | Language | Benchmark method |
|---|---|---|
| MetalRT | C++ | Native binary |
| uzu | Rust | Native CLI bench |
| mlx-lm | Python + MLX C++ | Python API |
| llama.cpp | C/C++ | llama-bench v8190 |
| Ollama | Go + llama.cpp | REST API (streaming) |

Hardware: Apple M4 Max, 64 GB unified memory, macOS 26.3
Models: Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, LFM2.5-1.2B (all 4-bit quantized)
Runs: 5 per engine per model, best reported
Fairness: MetalRT and mlx-lm use the exact same model files. Ollama uses the same GGUF files as llama.cpp, with REST API overhead included.
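The best-of-N protocol (5 timed runs, minimum reported) is straightforward to reproduce. The sketch below is a generic harness; the workload lambda is a placeholder, not MetalRT's actual API:

```python
import time

def best_of_n(fn, n=5):
    """Run fn() n times and report the best (minimum) wall-clock time,
    mirroring the 5-runs-best-reported protocol used in these benchmarks."""
    times = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

# Stand-in workload; a real run would invoke the engine under test.
best = best_of_n(lambda: sum(range(100_000)))
```

Reporting the minimum rather than the mean is the usual choice for microbenchmarks on a shared machine: it filters out scheduler noise and thermal throttling rather than averaging it in.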

Decode Speed (tok/s)

Decode speed is how fast tokens stream to the user. It is the metric that matters most for interactive chat.

| Model | MetalRT | uzu | mlx-lm | llama.cpp | Ollama |
|---|---|---|---|---|---|
| Qwen3-0.6B | **658** | 627 | 552 | 295* | 274 |
| Qwen3-4B | **186** | 165 | 170 | 87 | 120 |
| Llama-3.2-3B | 184 | **222** | 210 | 137 | 131 |
| LFM2.5-1.2B | **570** | 550 | 509 | 372 | 313 |

*For Qwen3-0.6B, llama.cpp and Ollama use Q8_0 (8-bit) quantization, so those results are not directly comparable to the 4-bit engines.

MetalRT wins 3 of 4 models. The speedups:

  • 1.10-1.19x vs mlx-lm (same model files)
  • 1.35-2.14x vs llama.cpp
  • 1.41-2.40x vs Ollama

uzu wins Llama-3.2-3B at 222 tok/s. We report this honestly.
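These ratios fall straight out of the decode table; recomputing a couple of them:

```python
# Decode speeds (tok/s) copied from the table above.
decode = {
    "Qwen3-0.6B":  {"MetalRT": 658, "mlx-lm": 552},  # llama.cpp/Ollama Q8_0 excluded
    "Qwen3-4B":    {"MetalRT": 186, "mlx-lm": 170, "llama.cpp": 87},
    "LFM2.5-1.2B": {"MetalRT": 570, "mlx-lm": 509, "llama.cpp": 372},
}

def speedup(model, rival):
    """MetalRT decode speed divided by a rival engine's, for one model."""
    row = decode[model]
    return row["MetalRT"] / row[rival]

print(f"{speedup('Qwen3-0.6B', 'mlx-lm'):.2f}x")   # 658/552 -> 1.19x
print(f"{speedup('Qwen3-4B', 'llama.cpp'):.2f}x")  # 186/87  -> 2.14x
```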

MetalRT vs Apple MLX and llama.cpp

mlx-lm is Apple's official inference framework. MetalRT and mlx-lm use the exact same model files, so this is a pure engine-to-engine comparison.

MetalRT is 1.10-1.19x faster than mlx-lm on the three models it wins (same model files), and 1.35-2.14x faster than llama.cpp across the board.


Part 2: Speech-to-Text Performance

We tested Whisper STT across four audio lengths. MetalRT won every single one.

Engines Tested

| Engine | Type | Notes |
|---|---|---|
| MetalRT | Native | Complete AI inference engine (LLM + STT + TTS) |
| mlx-whisper | MLX | Apple's official framework (pip install mlx-whisper) |
| sherpa-onnx | ONNX | Cross-platform baseline (pip install sherpa-onnx) |

Model: Whisper Tiny (4-bit)
Runs: 10 per engine, best reported

Whisper Tiny (4-bit) Latency

Lower latency is better

| Audio duration | MetalRT | mlx-whisper | sherpa-onnx | Winner |
|---|---|---|---|---|
| Short (4s) | 31.9ms | 42.1ms | 64.9ms | MetalRT |
| Medium (11s) | 52.3ms | 59.6ms | 175ms | MetalRT |
| Long (33s) | 104ms | 134ms | 469ms | MetalRT |
| Extra-long (70s) | 101ms | 463ms | 554ms | MetalRT |

The 70-second result is not a typo. MetalRT transcribes over a minute of audio in 101 milliseconds.

Real-Time Factor: 0.00144 (101ms / 70s; lower is better). That is roughly 693x faster than real-time.

What ~693x Real-Time Means in Practice

  • 1-hour podcast: ~5 seconds to process
  • 3-hour meeting: ~15 seconds
  • Live captioning: zero perceptible delay
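The arithmetic behind these projections, spelled out:

```python
# Real-time-factor math from the 70-second Whisper result above.
audio_s = 70.0     # audio duration in seconds
latency_s = 0.101  # MetalRT transcription time (101 ms)

rtf = latency_s / audio_s              # real-time factor, lower is better
speed_vs_realtime = audio_s / latency_s

# Projected processing time for a 1-hour podcast at this rate.
podcast_s = 3600 / speed_vs_realtime

print(f"RTF = {rtf:.5f}")                        # 0.00144
print(f"{speed_vs_realtime:.0f}x real-time")     # 693x
print(f"1-hour podcast in ~{podcast_s:.0f} s")   # ~5 s
```

These projections assume the engine sustains the 70-second rate over longer audio, which the benchmark does not measure directly.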

Part 3: Text-to-Speech Performance

We tested Kokoro-82M across typical voice assistant response lengths.

Kokoro-82M Results

Lower synthesis time is better

| Text length | MetalRT | mlx-audio | sherpa-onnx | Winner |
|---|---|---|---|---|
| 4 words | 178ms | 493ms | 504ms | MetalRT |
| 10 words | 230ms | 522ms | 723ms | MetalRT |
| 18 words | 381ms | 600ms | 1,395ms | MetalRT |
| 36 words | 604ms | 706ms | 2,115ms | MetalRT |

MetalRT is 2.8x faster than mlx-audio on short phrases, which is exactly the regime voice assistants live in.
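Recomputing the ratios from the table shows why short phrases are the headline number: the gap over mlx-audio narrows as the text gets longer:

```python
# Kokoro-82M synthesis times in ms, copied from the table above.
tts_ms = {  # text length: (MetalRT, mlx-audio)
    "4 words":  (178, 493),
    "36 words": (604, 706),
}

# Speedup of MetalRT over mlx-audio at each text length.
speedups = {k: mlx / ours for k, (ours, mlx) in tts_ms.items()}
print(speedups)  # roughly {'4 words': 2.77, '36 words': 1.17}
```

This pattern is consistent with a lower fixed per-request overhead in MetalRT, which dominates on short utterances; the benchmark itself does not break the latency down further.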


MetalRT vs The Competition: Summary

LLM Decode:

  • 1.67x faster than llama.cpp (mean over the three directly comparable models)
  • 1.19x faster than Apple MLX (same model files)
  • 1.59x faster than Ollama (mean over the same three models)

Speech-to-Text (70s audio):

  • 4.6x faster than mlx-whisper
  • 5.5x faster than sherpa-onnx

Text-to-Speech (4 words):

  • 2.8x faster than mlx-audio
  • 2.8x faster than sherpa-onnx

What MetalRT Is Built For

| Use case | Why MetalRT |
|---|---|
| Chat apps | 186 tok/s on a 4B model; responses stream instantly |
| Structured output / tool calling | Faster decode means faster JSON and function-call generation |
| Agent workflows | Compound latency savings across sequential LLM calls |
| Coding assistants | Sub-7ms time-to-first-token on small models |
| Privacy-first apps | Cloud-competitive speed, entirely on-device |
| Voice pipelines | 101ms STT + 178ms TTS: hear and respond in under 300ms |
| Real-time transcription | ~693x faster than real-time on Whisper |
| Medical / secure environments | Complete privacy, zero cloud dependency |
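The voice-pipeline budget adds up directly from the numbers in this post. Note that this counts STT, TTS, and the LLM's first token only, not full response generation:

```python
# Voice round-trip budget (ms), from the benchmark figures above.
stt_ms = 101    # Whisper transcription (MetalRT, 70 s clip; shorter clips are faster)
tts_ms = 178    # Kokoro synthesis (MetalRT, 4-word response)
ttft_ms = 6.6   # Qwen3-0.6B time-to-first-token

round_trip = stt_ms + tts_ms
print(f"STT + TTS: {round_trip} ms")  # 279 ms, under the 300 ms budget

with_llm_start = round_trip + ttft_ms
print(f"plus LLM first token: {with_llm_start:.1f} ms")  # 285.6 ms
```

A real assistant would also spend time decoding the full reply before (or while) synthesizing it, so treat this as a floor on round-trip latency, not an end-to-end measurement.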

All Numbers at a Glance

LLM Performance:

  • 658 tok/s peak decode (Qwen3-0.6B)
  • 6.6ms time-to-first-token (Qwen3-0.6B)
  • 186 tok/s on a 4B parameter model (Qwen3-4B)

STT Performance:

  • 101ms for 70 seconds of audio
  • ~693x faster than real-time
  • 4.6x faster than Apple MLX

TTS Performance:

  • 178ms for typical voice responses
  • 2.8x faster than Apple MLX
  • Sub-400ms for most use cases

Quality: Identical output across all engines. The model is the same. The speed is not.


About MetalRT

MetalRT is the inference engine behind RunAnywhere, a production-grade on-device AI platform. RunAnywhere provides cross-platform SDKs for iOS, Android, Web, React Native, and Flutter, with MetalRT powering the Apple Silicon runtime.

MetalRT is written in C++ and talks directly to Apple's Metal GPU API. No Python overhead. No abstraction layers. Just raw compute.


Benchmarked on Apple M4 Max, 64 GB, macOS 26.3. LLM models: Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, LFM2.5-1.2B, all 4-bit. Speech models: Whisper Tiny (4-bit), Kokoro-82M. Greedy decoding for LLM, 5 runs best reported. Speech: 10 runs best reported. MetalRT + mlx-lm share identical MLX 4-bit model files. llama.cpp and Ollama use GGUF Q4_K_M (Q8_0 for Qwen3-0.6B). Ollama v0.17.4 via REST API.
