# Qwen3-0.6B Q8_0 · ExecuTorch + GGML Backend

Qwen3-0.6B with Q8_0 quantization, running on Metal (Apple Silicon) and CUDA (NVIDIA GPUs) via the executorch-ggml backend.

The .pte file contains only the compute graph (213 KB). Weights are loaded from the standard .gguf file at runtime: no weight duplication, no extra overhead.
## Performance

### Decode throughput (tok/s, tg128)
| Platform | executorch-ggml | llama.cpp | vs llama.cpp |
|---|---|---|---|
| NVIDIA A100-SXM4-40GB | 411 | 377 | 109% |
| Apple M4 Max | 323 | 309 | 104% |
With QKV + gate/up projection fusion: MUL_MAT reduced from 197 to 113 per decode step.
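The before/after counts are consistent with the fusions listed under "Optimizations Applied", assuming the usual 7 matmuls per layer (Q, K, V, attention output, gate, up, down) across 28 layers plus one lm_head matmul. This breakdown is inferred, not taken from the exporter:

```python
# Sanity-check the MUL_MAT counts (assumed per-layer breakdown).
layers = 28
per_layer_unfused = 7              # Q, K, V, attn-out, gate, up, down
unfused = layers * per_layer_unfused + 1   # +1 for lm_head
# QKV fusion: 3 matmuls -> 1 (saves 2); gate/up fusion: 2 -> 1 (saves 1)
fused = layers * (per_layer_unfused - 3) + 1
print(unfused, fused)  # 197 113
```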
### Per-step breakdown (decode, steady state)
| Phase | Metal (M4 Max) | CUDA (A100) |
|---|---|---|
| build_graph | 0.0 ms (cached) | 0.0 ms (cached) |
| sched_alloc | 0.0 ms (cached) | 0.0 ms (cached) |
| compute | 0.3 ms (async) | 0.4 ms (async) |
| output sync | 3.0 ms | 2.0 ms |
| total | ~3.1 ms | ~2.4 ms |
1 split, 113 MUL_MAT (fused QKV + gate/up), graph cache HIT.
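The per-step totals line up with the throughput table: 1000 ms divided by the step time gives roughly the measured tok/s (the CUDA figure overshoots the measured 411 slightly because the 2.4 ms total is rounded):

```python
# Decode throughput implied by the per-step totals above (illustrative).
for name, step_ms in [("Metal (M4 Max)", 3.1), ("CUDA (A100)", 2.4)]:
    print(f"{name}: ~{1000 / step_ms:.0f} tok/s")
# Metal (M4 Max): ~323 tok/s
# CUDA (A100): ~417 tok/s
```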
## Files

| File | Size | Description |
|---|---|---|
| `qwen3_q8_0.pte` | 213 KB | Compute graph only (no weights) |
| `Qwen3-0.6B-Q8_0.gguf` | 610 MB | Q8_0 weights (standard GGUF) |
## Quick Start

### 1. Download

```bash
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('larryliu0820/Qwen3-0.6B-Q8_0-ExecuTorch-GGML',
                  local_dir='models/qwen3')
"
```
### 2. Clone and build

```bash
git clone https://github.com/larryliu0820/executorch-ggml
cd executorch-ggml
git submodule update --init --recursive
```

Metal (macOS):

```bash
cmake -B build_native \
  -DEXECUTORCH_GGML_BUILD_LLAMA_RUNNER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build_native --target benchmark_llm --parallel 16
```

CUDA:

```bash
cmake -B build_native \
  -DEXECUTORCH_GGML_BUILD_LLAMA_RUNNER=ON \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=80 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build_native --target benchmark_llm --parallel 16
```
### 3. Run (C++)

```bash
# Benchmark: .pte (graph) + .gguf (weights)
./build_native/benchmark/benchmark_llm \
  models/qwen3/qwen3_q8_0.pte \
  --gguf models/qwen3/Qwen3-0.6B-Q8_0.gguf \
  --n-decode 128 --prompt-len 5
```
### 4. Run (Python)

```python
import time

import torch
from executorch_ggml.gguf_module import GGUFModule

# Load graph from .pte, weights from .gguf
module = GGUFModule("models/qwen3/qwen3_q8_0.pte",
                    "models/qwen3/Qwen3-0.6B-Q8_0.gguf")

# Print model info
module.print_info()

# Prefill
input_ids = torch.tensor([[1, 2, 3, 4, 5]], dtype=torch.long)
cache_pos = torch.arange(5, dtype=torch.long)
logits = module.forward(input_ids, cache_pos)
next_token = logits[0][:, -1, :].argmax(dim=-1).item()

# Decode loop (greedy)
tokens = [next_token]
t0 = time.time()
for i in range(127):
    tok_input = torch.tensor([[next_token]], dtype=torch.long)
    pos_input = torch.tensor([5 + i], dtype=torch.long)
    logits = module.forward(tok_input, pos_input)
    next_token = logits[0][0, 0, :].argmax(dim=-1).item()
    tokens.append(next_token)
dt = time.time() - t0
print(f"{len(tokens) / dt:.1f} tok/s")
```
### 5. Profiling

```bash
# Per-call timing breakdown
GGML_PERF_LOG=1 ./build_native/benchmark/benchmark_llm \
  models/qwen3/qwen3_q8_0.pte \
  --gguf models/qwen3/Qwen3-0.6B-Q8_0.gguf \
  --n-decode 32

# Per-op timing (adds sync overhead; use for relative comparison only)
GGML_PROFILE=1 ./build_native/benchmark/benchmark_llm \
  models/qwen3/qwen3_q8_0.pte \
  --gguf models/qwen3/Qwen3-0.6B-Q8_0.gguf \
  --n-decode 5
```
### 6. llama.cpp baseline (comparison)

```bash
cd third-party/llama.cpp
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release  # or -DGGML_CUDA=ON
cmake --build build --target llama-bench --parallel 16
cd ../..
third-party/llama.cpp/build/bin/llama-bench \
  -m models/qwen3/Qwen3-0.6B-Q8_0.gguf \
  -ngl 99 -p 5 -n 128 -r 5
```
## How It Works

Export (one-time):

```text
GGUF file ──> GGUFAnalyzer ──> model config + weight names
          ──> PyTorch model (no weights loaded)
          ──> torch.export + GGML partitioner
          ──> .pte with GGUF tensor names as data_keys
              (213 KB, graph only)
```

Runtime:

```text
.pte (graph)    ──> ExecuTorch Program
.gguf (weights) ──> GGUFNamedDataMap (implements NamedDataMap)
                ──> Backend loads weights via get_data(key)
                ──> Same performance as embedded weights
```
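For intuition, the NamedDataMap contract amounts to a keyed lookup from GGUF tensor names to weight bytes: the .pte stores only the names, and the backend resolves each name against the memory-mapped .gguf at load time. A toy Python sketch of that pattern (the real interface is C++; this class and its contents are purely illustrative):

```python
class ToyNamedDataMap:
    """Illustrative stand-in for GGUFNamedDataMap: name -> weight bytes."""

    def __init__(self, tensors):
        # In the real backend this would wrap mmap'd GGUF tensor data.
        self._tensors = tensors

    def get_data(self, key):
        # The backend calls this once per weight during load.
        return self._tensors[key]


weights = ToyNamedDataMap({"blk.0.attn_q.weight": b"\x00" * 16})
print(len(weights.get_data("blk.0.attn_q.weight")))  # 16
```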
## Export the .pte yourself

```python
from executorch_ggml.export_gguf import export_gguf_to_pte, GGUFExportConfig

config = GGUFExportConfig(
    max_seq_len=128,
    preserve_dynamic_shapes=True,
    enable_quantization=True,
)
export_gguf_to_pte("Qwen3-0.6B-Q8_0.gguf", "qwen3_q8_0.pte", config)
```
## Optimizations Applied

| Optimization | Effect |
|---|---|
| Fused RMSNorm (`swap_rms_norm`) | 8 ops/norm -> 1 |
| Fused RoPE (`fuse_rope_in_graph`) | 9 ops per Q/K -> 1 |
| GQA strip (`strip_gqa_expand`) | Remove expand/repeat |
| RMSNorm weight fold | Absorb into downstream linear |
| QKV projection fusion | 3 matmuls -> 1 per layer |
| Gate/Up projection fusion | 2 matmuls -> 1 per layer |
| CSE (post-export) | Merge duplicate linear nodes |
| Mask conversion cache | Deduplicate across 28 layers |
| RESHAPE collapse + PERMUTE compose | Eliminate redundant layout ops |
| SiLU-gate fusion | `swiglu_split` single kernel |
| Graph cache (default on) | 0 ms rebuild on cache HIT |
| Mutable KV cache on GPU | `ggml_set_rows`, no CPU fallback, 1 split |
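As an illustration of what the gate/up projection fusion plus SiLU-gate fusion compute: the unfused path runs two matmuls and then `silu(gate) * up` as separate ops, while the fused path runs one matmul against concatenated weights and a single split-and-gate kernel. A PyTorch sketch of the equivalence (not the ggml kernel itself):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)
w_gate, w_up = torch.randn(8, 16), torch.randn(8, 16)

# Unfused: two matmuls, then SiLU and multiply as separate ops.
unfused = F.silu(x @ w_gate) * (x @ w_up)

# Fused: one matmul against concatenated weights, then one kernel
# that splits the result and applies the SiLU gate (swiglu_split).
w_cat = torch.cat([w_gate, w_up], dim=1)
gate, up = (x @ w_cat).chunk(2, dim=1)
fused = F.silu(gate) * up

print(torch.allclose(unfused, fused, atol=1e-5))  # True
```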
## Environment Variables

| Variable | Values | Description |
|---|---|---|
| `GGML_PERF_LOG` | 1 | Per-call timing breakdown |
| `GGML_PROFILE` | 1 | Per-op timing (adds sync overhead) |
| `GGML_NO_GRAPH_CACHE` | 1 | Disable graph caching (debug) |
| `GGML_DEBUG_DUMP` | `<path>` | Per-node tensor stats |
| `GGML_SKIP_OUTPUT_COPY` | 1 | Skip logits GPU->CPU copy (CUDA only) |
## Model Details
- Base model: Qwen/Qwen3-0.6B
- Parameters: 596M
- Architecture: 28 layers, 16 attention heads, 8 KV heads, head_dim=128
- Quantization: Q8_0 (weights only; KV cache is F32)
- Max sequence length: 128 (exported)
- Framework: ExecuTorch with ggml backend
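Given these dimensions, the F32 KV cache at the exported max sequence length is small. Illustrative arithmetic, assuming K and V are each stored per KV head at head_dim width:

```python
layers, kv_heads, head_dim, max_seq, f32_bytes = 28, 8, 128, 128, 4
# Factor of 2 covers both the K and the V cache.
kv_bytes = layers * 2 * kv_heads * head_dim * max_seq * f32_bytes
print(f"{kv_bytes / 2**20:.0f} MiB")  # 28 MiB
```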