# Qwen3-0.6B Q8_0 · ExecuTorch + GGML Backend

Qwen3-0.6B with Q8_0 quantization, running on Metal (Apple Silicon) and CUDA (NVIDIA GPUs) via the executorch-ggml backend.

The .pte file contains only the compute graph (213 KB). Weights are loaded from the standard .gguf file at runtime: no weight duplication, no extra overhead.
## Performance

### Decode throughput (tok/s, tg128)
| Platform | executorch-ggml | llama.cpp | vs llama.cpp |
|---|---|---|---|
| NVIDIA A100-SXM4-40GB | 411 | 377 | 109% |
| Apple M4 Max | 323 | 309 | 104% |
With QKV + gate/up projection fusion: MUL_MAT reduced from 197 to 113 per decode step.
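The before/after counts are consistent with the fusions listed under "Optimizations Applied", assuming the usual 7 matmuls per layer (Q, K, V, attention output, gate, up, down) across 28 layers plus one lm_head matmul. This breakdown is inferred, not taken from the exporter:

```python
# Sanity-check the MUL_MAT counts (assumed per-layer breakdown).
layers = 28
per_layer_unfused = 7              # Q, K, V, attn-out, gate, up, down
unfused = layers * per_layer_unfused + 1   # +1 for lm_head
# QKV fusion: 3 matmuls -> 1 (saves 2); gate/up fusion: 2 -> 1 (saves 1)
fused = layers * (per_layer_unfused - 3) + 1
print(unfused, fused)  # 197 113
```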
### Per-step breakdown (decode, steady state)
| Phase | Metal (M4 Max) | CUDA (A100) |
|---|---|---|
| build_graph | 0.0 ms (cached) | 0.0 ms (cached) |
| sched_alloc | 0.0 ms (cached) | 0.0 ms (cached) |
| compute | 0.3 ms (async) | 0.4 ms (async) |
| output sync | 3.0 ms | 2.0 ms |
| total | ~3.1 ms | ~2.4 ms |
1 split, 113 MUL_MAT (fused QKV + gate/up), graph cache HIT.
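The per-step totals line up with the throughput table: 1000 ms divided by the step time gives roughly the measured tok/s (the CUDA figure overshoots the measured 411 slightly because the 2.4 ms total is rounded):

```python
# Decode throughput implied by the per-step totals above (illustrative).
for name, step_ms in [("Metal (M4 Max)", 3.1), ("CUDA (A100)", 2.4)]:
    print(f"{name}: ~{1000 / step_ms:.0f} tok/s")
# Metal (M4 Max): ~323 tok/s
# CUDA (A100): ~417 tok/s
```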
## Files

| File | Size | Description |
|---|---|---|
| `qwen3_q8_0.pte` | 213 KB | Compute graph only (no weights) |
| `Qwen3-0.6B-Q8_0.gguf` | 610 MB | Q8_0 weights (standard GGUF) |
## Quick Start

### 1. Download

```bash
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('larryliu0820/Qwen3-0.6B-Q8_0-ExecuTorch-GGML',
                  local_dir='models/qwen3')
"
```
### 2. Clone and build

```bash
git clone https://github.com/larryliu0820/executorch-ggml
cd executorch-ggml
git submodule update --init --recursive
```

Metal (macOS):

```bash
cmake -B build_native \
  -DEXECUTORCH_GGML_BUILD_LLAMA_RUNNER=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build_native --target benchmark_llm --parallel 16
```

CUDA:

```bash
cmake -B build_native \
  -DEXECUTORCH_GGML_BUILD_LLAMA_RUNNER=ON \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES=80 \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build_native --target benchmark_llm --parallel 16
```
### 3. Run (C++)

```bash
# Benchmark: .pte (graph) + .gguf (weights)
./build_native/benchmark/benchmark_llm \
  models/qwen3/qwen3_q8_0.pte \
  --gguf models/qwen3/Qwen3-0.6B-Q8_0.gguf \
  --n-decode 128 --prompt-len 5
```
### 4. Run (Python)

```python
import time

import torch
from executorch_ggml.gguf_module import GGUFModule

# Load graph from .pte, weights from .gguf
module = GGUFModule("models/qwen3/qwen3_q8_0.pte",
                    "models/qwen3/Qwen3-0.6B-Q8_0.gguf")

# Print model info
module.print_info()

# Prefill
input_ids = torch.tensor([[1, 2, 3, 4, 5]], dtype=torch.long)
cache_pos = torch.arange(5, dtype=torch.long)
logits = module.forward(input_ids, cache_pos)
next_token = logits[0][:, -1, :].argmax(dim=-1).item()

# Decode loop (greedy)
tokens = [next_token]
t0 = time.time()
for i in range(127):
    tok_input = torch.tensor([[next_token]], dtype=torch.long)
    pos_input = torch.tensor([5 + i], dtype=torch.long)
    logits = module.forward(tok_input, pos_input)
    next_token = logits[0][0, 0, :].argmax(dim=-1).item()
    tokens.append(next_token)
dt = time.time() - t0
print(f"{len(tokens) / dt:.1f} tok/s")
```
### 5. Profiling

```bash
# Per-call timing breakdown
GGML_PERF_LOG=1 ./build_native/benchmark/benchmark_llm \
  models/qwen3/qwen3_q8_0.pte \
  --gguf models/qwen3/Qwen3-0.6B-Q8_0.gguf \
  --n-decode 32

# Per-op timing (adds sync overhead; use for relative comparison only)
GGML_PROFILE=1 ./build_native/benchmark/benchmark_llm \
  models/qwen3/qwen3_q8_0.pte \
  --gguf models/qwen3/Qwen3-0.6B-Q8_0.gguf \
  --n-decode 5
```
### 6. llama.cpp baseline (comparison)

```bash
cd third-party/llama.cpp
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release  # or -DGGML_CUDA=ON
cmake --build build --target llama-bench --parallel 16
cd ../..
third-party/llama.cpp/build/bin/llama-bench \
  -m models/qwen3/Qwen3-0.6B-Q8_0.gguf \
  -ngl 99 -p 5 -n 128 -r 5
```
## How It Works

Export (one-time):

```text
GGUF file ──> GGUFAnalyzer ──> model config + weight names
          ──> PyTorch model (no weights loaded)
          ──> torch.export + GGML partitioner
          ──> .pte with GGUF tensor names as data_keys
              (213 KB, graph only)
```

Runtime:

```text
.pte (graph)    ──> ExecuTorch Program
.gguf (weights) ──> GGUFNamedDataMap (implements NamedDataMap)
                ──> Backend loads weights via get_data(key)
                ──> Same performance as embedded weights
```
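For intuition, the NamedDataMap contract amounts to a keyed lookup from GGUF tensor names to weight bytes: the .pte stores only the names, and the backend resolves each name against the memory-mapped .gguf at load time. A toy Python sketch of that pattern (the real interface is C++; this class and its contents are purely illustrative):

```python
class ToyNamedDataMap:
    """Illustrative stand-in for GGUFNamedDataMap: name -> weight bytes."""

    def __init__(self, tensors):
        # In the real backend this would wrap mmap'd GGUF tensor data.
        self._tensors = tensors

    def get_data(self, key):
        # The backend calls this once per weight during load.
        return self._tensors[key]


weights = ToyNamedDataMap({"blk.0.attn_q.weight": b"\x00" * 16})
print(len(weights.get_data("blk.0.attn_q.weight")))  # 16
```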
## Export the .pte yourself

```python
from executorch_ggml.export_gguf import export_gguf_to_pte, GGUFExportConfig

config = GGUFExportConfig(
    max_seq_len=128,
    preserve_dynamic_shapes=True,
    enable_quantization=True,
)
export_gguf_to_pte("Qwen3-0.6B-Q8_0.gguf", "qwen3_q8_0.pte", config)
```
## Optimizations Applied

| Optimization | Effect |
|---|---|
| Fused RMSNorm (`swap_rms_norm`) | 8 ops/norm -> 1 |
| Fused RoPE (`fuse_rope_in_graph`) | 9 ops per Q/K -> 1 |
| GQA strip (`strip_gqa_expand`) | Remove expand/repeat |
| RMSNorm weight fold | Absorb into downstream linear |
| QKV projection fusion | 3 matmuls -> 1 per layer |
| Gate/Up projection fusion | 2 matmuls -> 1 per layer |
| CSE (post-export) | Merge duplicate linear nodes |
| Mask conversion cache | Deduplicate across 28 layers |
| RESHAPE collapse + PERMUTE compose | Eliminate redundant layout ops |
| SiLU-gate fusion | `swiglu_split` single kernel |
| Graph cache (default on) | 0 ms rebuild on cache HIT |
| Mutable KV cache on GPU | `ggml_set_rows`, no CPU fallback, 1 split |
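As an illustration of what the gate/up projection fusion plus SiLU-gate fusion compute: the unfused path runs two matmuls and then `silu(gate) * up` as separate ops, while the fused path runs one matmul against concatenated weights and a single split-and-gate kernel. A PyTorch sketch of the equivalence (not the ggml kernel itself):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 8)
w_gate, w_up = torch.randn(8, 16), torch.randn(8, 16)

# Unfused: two matmuls, then SiLU and multiply as separate ops.
unfused = F.silu(x @ w_gate) * (x @ w_up)

# Fused: one matmul against concatenated weights, then one kernel
# that splits the result and applies the SiLU gate (swiglu_split).
w_cat = torch.cat([w_gate, w_up], dim=1)
gate, up = (x @ w_cat).chunk(2, dim=1)
fused = F.silu(gate) * up

print(torch.allclose(unfused, fused, atol=1e-5))  # True
```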
## Environment Variables

| Variable | Values | Description |
|---|---|---|
| `GGML_PERF_LOG` | 1 | Per-call timing breakdown |
| `GGML_PROFILE` | 1 | Per-op timing (adds sync overhead) |
| `GGML_NO_GRAPH_CACHE` | 1 | Disable graph caching (debug) |
| `GGML_DEBUG_DUMP` | `<path>` | Per-node tensor stats |
| `GGML_SKIP_OUTPUT_COPY` | 1 | Skip logits GPU->CPU copy (CUDA only) |
## Model Details
- Base model: Qwen/Qwen3-0.6B
- Parameters: 596M
- Architecture: 28 layers, 16 attention heads, 8 KV heads, head_dim=128
- Quantization: Q8_0 (weights only; KV cache is F32)
- Max sequence length: 128 (exported)
- Framework: ExecuTorch with ggml backend
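Given these dimensions, the F32 KV cache at the exported max sequence length is small. Illustrative arithmetic, assuming K and V are each stored per KV head at head_dim width:

```python
layers, kv_heads, head_dim, max_seq, f32_bytes = 28, 8, 128, 128, 4
# Factor of 2 covers both the K and the V cache.
kv_bytes = layers * 2 * kv_heads * head_dim * max_seq * f32_bytes
print(f"{kv_bytes / 2**20:.0f} MiB")  # 28 MiB
```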