MetalRT: The Fastest AI Inference Engine for Apple Silicon. Here Are the Numbers.

Community Article Published March 12, 2026

We built a native inference engine for Apple Silicon from the ground up. No wrappers. No abstraction layers. Direct Metal GPU programming.

MetalRT is the first and only inference engine that accelerates all three AI modalities on Apple Silicon: LLMs, Speech-to-Text, and Text-to-Speech. We benchmarked it against every major engine. It won.

The headline numbers:

  • 658 tok/s LLM decode (Qwen3-0.6B, M4 Max)
  • 101ms to transcribe 70 seconds of audio (Whisper)
  • 178ms to synthesize speech (Kokoro TTS)
  • 1.67x faster than llama.cpp on LLM decode
  • 4.6x faster than Apple MLX on speech-to-text

Part 1: LLM Decode Performance

We benchmarked MetalRT against four engines across four models on a single M4 Max.

Engines Tested

| Engine | Language | Benchmark method |
|---|---|---|
| MetalRT | C++ | Native binary |
| uzu | Rust | Native CLI bench |
| mlx-lm | Python + MLX C++ | Python API |
| llama.cpp | C/C++ | llama-bench v8190 |
| Ollama | Go + llama.cpp | REST API (streaming) |

Hardware: Apple M4 Max, 64 GB unified memory, macOS 26.3
Models: Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, LFM2.5-1.2B (all 4-bit quantized)
Runs: 5 per engine per model, best reported
Fairness: MetalRT and mlx-lm use the exact same model files. Ollama uses the same GGUF files as llama.cpp, with REST API overhead included.
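The best-of-N protocol (5 timed runs, minimum reported) is straightforward to reproduce. The sketch below is a generic harness; the workload lambda is a placeholder, not MetalRT's actual API:

```python
import time

def best_of_n(fn, n=5):
    """Run fn() n times and report the best (minimum) wall-clock time,
    mirroring the 5-runs-best-reported protocol used in these benchmarks."""
    times = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        times.append(time.perf_counter() - start)
    return min(times)

# Stand-in workload; a real run would invoke the engine under test.
best = best_of_n(lambda: sum(range(100_000)))
```

Reporting the minimum rather than the mean is the usual choice for microbenchmarks on a shared machine: it filters out scheduler noise and thermal throttling rather than averaging it in.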

Decode Speed (tok/s)

Decode speed is how fast tokens stream to the user. It is the metric that matters most for interactive chat.

| Model | MetalRT | uzu | mlx-lm | llama.cpp | Ollama |
|---|---|---|---|---|---|
| Qwen3-0.6B | **658** | 627 | 552 | 295* | 274 |
| Qwen3-4B | **186** | 165 | 170 | 87 | 120 |
| Llama-3.2-3B | 184 | **222** | 210 | 137 | 131 |
| LFM2.5-1.2B | **570** | 550 | 509 | 372 | 313 |

*For Qwen3-0.6B, llama.cpp and Ollama use Q8_0 (8-bit) quantization, so those results are not directly comparable to the 4-bit engines.

MetalRT wins 3 of 4 models. The speedups:

  • 1.10-1.19x vs mlx-lm (same model files)
  • 1.35-2.14x vs llama.cpp
  • 1.41-2.40x vs Ollama

uzu wins Llama-3.2-3B at 222 tok/s. We report this honestly.
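These ratios fall straight out of the decode table; recomputing a couple of them:

```python
# Decode speeds (tok/s) copied from the table above.
decode = {
    "Qwen3-0.6B":  {"MetalRT": 658, "mlx-lm": 552},  # llama.cpp/Ollama Q8_0 excluded
    "Qwen3-4B":    {"MetalRT": 186, "mlx-lm": 170, "llama.cpp": 87},
    "LFM2.5-1.2B": {"MetalRT": 570, "mlx-lm": 509, "llama.cpp": 372},
}

def speedup(model, rival):
    """MetalRT decode speed divided by a rival engine's, for one model."""
    row = decode[model]
    return row["MetalRT"] / row[rival]

print(f"{speedup('Qwen3-0.6B', 'mlx-lm'):.2f}x")   # 658/552 -> 1.19x
print(f"{speedup('Qwen3-4B', 'llama.cpp'):.2f}x")  # 186/87  -> 2.14x
```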

MetalRT vs Apple MLX and llama.cpp

mlx-lm is Apple's official inference framework. MetalRT and mlx-lm use the exact same model files, so this is a pure engine-to-engine comparison.

MetalRT is 1.10-1.19x faster than mlx-lm on the three models it wins (same model files), and 1.35-2.14x faster than llama.cpp across the board.


Part 2: Speech-to-Text Performance

We tested Whisper STT across four audio lengths. MetalRT won every single one.

Engines Tested

| Engine | Type | Notes |
|---|---|---|
| MetalRT | Native | Complete AI inference engine (LLM + STT + TTS) |
| mlx-whisper | MLX | Apple's official framework (pip install mlx-whisper) |
| sherpa-onnx | ONNX | Cross-platform baseline (pip install sherpa-onnx) |

Model: Whisper Tiny (4-bit)
Runs: 10 per engine, best reported

Whisper Tiny (4-bit) Latency

Lower latency is better

| Audio duration | MetalRT | mlx-whisper | sherpa-onnx | Winner |
|---|---|---|---|---|
| Short (4s) | 31.9ms | 42.1ms | 64.9ms | MetalRT |
| Medium (11s) | 52.3ms | 59.6ms | 175ms | MetalRT |
| Long (33s) | 104ms | 134ms | 469ms | MetalRT |
| Extra-long (70s) | 101ms | 463ms | 554ms | MetalRT |

The 70-second result is not a typo. MetalRT transcribes over a minute of audio in 101 milliseconds.

Real-Time Factor: 0.00144 (101ms / 70s; lower is better). That is roughly 693x faster than real-time.

What ~693x Real-Time Means in Practice

  • 1-hour podcast: ~5 seconds to process
  • 3-hour meeting: ~15 seconds
  • Live captioning: zero perceptible delay
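The arithmetic behind these projections, spelled out:

```python
# Real-time-factor math from the 70-second Whisper result above.
audio_s = 70.0     # audio duration in seconds
latency_s = 0.101  # MetalRT transcription time (101 ms)

rtf = latency_s / audio_s              # real-time factor, lower is better
speed_vs_realtime = audio_s / latency_s

# Projected processing time for a 1-hour podcast at this rate.
podcast_s = 3600 / speed_vs_realtime

print(f"RTF = {rtf:.5f}")                        # 0.00144
print(f"{speed_vs_realtime:.0f}x real-time")     # 693x
print(f"1-hour podcast in ~{podcast_s:.0f} s")   # ~5 s
```

These projections assume the engine sustains the 70-second rate over longer audio, which the benchmark does not measure directly.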

Part 3: Text-to-Speech Performance

We tested Kokoro-82M across typical voice assistant response lengths.

Kokoro-82M Results

Lower synthesis time is better

| Text length | MetalRT | mlx-audio | sherpa-onnx | Winner |
|---|---|---|---|---|
| 4 words | 178ms | 493ms | 504ms | MetalRT |
| 10 words | 230ms | 522ms | 723ms | MetalRT |
| 18 words | 381ms | 600ms | 1,395ms | MetalRT |
| 36 words | 604ms | 706ms | 2,115ms | MetalRT |

MetalRT is 2.8x faster than mlx-audio on short phrases, which is exactly the regime voice assistants live in.
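Recomputing the ratios from the table shows why short phrases are the headline number: the gap over mlx-audio narrows as the text gets longer:

```python
# Kokoro-82M synthesis times in ms, copied from the table above.
tts_ms = {  # text length: (MetalRT, mlx-audio)
    "4 words":  (178, 493),
    "36 words": (604, 706),
}

# Speedup of MetalRT over mlx-audio at each text length.
speedups = {k: mlx / ours for k, (ours, mlx) in tts_ms.items()}
print(speedups)  # roughly {'4 words': 2.77, '36 words': 1.17}
```

This pattern is consistent with a lower fixed per-request overhead in MetalRT, which dominates on short utterances; the benchmark itself does not break the latency down further.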


MetalRT vs The Competition: Summary

LLM Decode:

  • 1.67x faster than llama.cpp (mean over the three directly comparable models)
  • 1.19x faster than Apple MLX (same model files)
  • 1.59x faster than Ollama (mean over the same three models)

Speech-to-Text (70s audio):

  • 4.6x faster than mlx-whisper
  • 5.5x faster than sherpa-onnx

Text-to-Speech (4 words):

  • 2.8x faster than mlx-audio
  • 2.8x faster than sherpa-onnx

What MetalRT Is Built For

| Use case | Why MetalRT |
|---|---|
| Chat apps | 186 tok/s on a 4B model; responses stream instantly |
| Structured output / tool calling | Faster decode means faster JSON and function-call generation |
| Agent workflows | Compound latency savings across sequential LLM calls |
| Coding assistants | Sub-7ms time-to-first-token on small models |
| Privacy-first apps | Cloud-competitive speed, entirely on-device |
| Voice pipelines | 101ms STT + 178ms TTS: hear and respond in under 300ms |
| Real-time transcription | ~693x faster than real-time on Whisper |
| Medical / secure environments | Complete privacy, zero cloud dependency |
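The voice-pipeline budget adds up directly from the numbers in this post. Note that this counts STT, TTS, and the LLM's first token only, not full response generation:

```python
# Voice round-trip budget (ms), from the benchmark figures above.
stt_ms = 101    # Whisper transcription (MetalRT, 70 s clip; shorter clips are faster)
tts_ms = 178    # Kokoro synthesis (MetalRT, 4-word response)
ttft_ms = 6.6   # Qwen3-0.6B time-to-first-token

round_trip = stt_ms + tts_ms
print(f"STT + TTS: {round_trip} ms")  # 279 ms, under the 300 ms budget

with_llm_start = round_trip + ttft_ms
print(f"plus LLM first token: {with_llm_start:.1f} ms")  # 285.6 ms
```

A real assistant would also spend time decoding the full reply before (or while) synthesizing it, so treat this as a floor on round-trip latency, not an end-to-end measurement.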

All Numbers at a Glance

LLM Performance:

  • 658 tok/s peak decode (Qwen3-0.6B)
  • 6.6ms time-to-first-token (Qwen3-0.6B)
  • 186 tok/s on a 4B parameter model (Qwen3-4B)

STT Performance:

  • 101ms for 70 seconds of audio
  • ~693x faster than real-time
  • 4.6x faster than Apple MLX

TTS Performance:

  • 178ms for typical voice responses
  • 2.8x faster than Apple MLX
  • Sub-400ms for most use cases

Quality: Identical output across all engines. The model is the same. The speed is not.


About MetalRT

MetalRT is the inference engine behind RunAnywhere, a production-grade on-device AI platform. RunAnywhere provides cross-platform SDKs for iOS, Android, Web, React Native, and Flutter, with MetalRT powering the Apple Silicon runtime.

MetalRT is written in C++ and talks directly to Apple's Metal GPU API. No Python overhead. No abstraction layers. Just raw compute.


Benchmarked on Apple M4 Max, 64 GB, macOS 26.3. LLM models: Qwen3-0.6B, Qwen3-4B, Llama-3.2-3B, LFM2.5-1.2B, all 4-bit. Speech models: Whisper Tiny (4-bit), Kokoro-82M. Greedy decoding for LLM, 5 runs best reported. Speech: 10 runs best reported. MetalRT + mlx-lm share identical MLX 4-bit model files. llama.cpp and Ollama use GGUF Q4_K_M (Q8_0 for Qwen3-0.6B). Ollama v0.17.4 via REST API.
