# Qwen3-0.6B Q8_0 – ExecuTorch + GGML Backend

Qwen3-0.6B with Q8_0 quantization, running on Metal (Apple Silicon) and CUDA (NVIDIA GPUs) via the executorch-ggml backend.

The .pte file contains only the compute graph (213 KB). Weights are loaded from the standard .gguf file at runtime – no weight duplication, zero overhead.

## Performance

### Decode throughput (tok/s, tg128)

| Platform | executorch-ggml | llama.cpp | vs llama.cpp |
|---|---|---|---|
| NVIDIA A100-SXM4-40GB | 411 | 377 | 109% |
| Apple M4 Max | 323 | 309 | 104% |

With QKV + gate/up projection fusion: MUL_MAT reduced from 197 to 113 per decode step.

### Per-step breakdown (decode, steady state)

| Phase | Metal (M4 Max) | CUDA (A100) |
|---|---|---|
| build_graph | 0.0 ms (cached) | 0.0 ms (cached) |
| sched_alloc | 0.0 ms (cached) | 0.0 ms (cached) |
| compute | 0.3 ms (async) | 0.4 ms (async) |
| output sync | 3.0 ms | 2.0 ms |
| total | ~3.1 ms | ~2.4 ms |

1 split, 113 MUL_MAT (fused QKV + gate/up), graph cache HIT.
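As an illustrative sanity check (not a figure from the measurements), the steady-state step times are consistent with the decode throughput numbers above: the reciprocal of the per-step wall time gives tokens per second.

```python
# Steady-state step time (ms) implies decode throughput (tok/s).
metal_tok_s = 1000 / 3.1   # ~323 tok/s, matching the M4 Max row
cuda_tok_s = 1000 / 2.4    # ~417 tok/s, close to the 411 measured on the A100
print(f"Metal: {metal_tok_s:.0f} tok/s, CUDA: {cuda_tok_s:.0f} tok/s")
```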

## Files

| File | Size | Description |
|---|---|---|
| qwen3_q8_0.pte | 213 KB | Compute graph only (no weights) |
| Qwen3-0.6B-Q8_0.gguf | 610 MB | Q8_0 weights (standard GGUF) |

## Quick Start

### 1. Download

```shell
pip install huggingface_hub
python -c "
from huggingface_hub import snapshot_download
snapshot_download('larryliu0820/Qwen3-0.6B-Q8_0-ExecuTorch-GGML',
                  local_dir='models/qwen3')
"
```

### 2. Clone and build

```shell
git clone https://github.com/larryliu0820/executorch-ggml
cd executorch-ggml
git submodule update --init --recursive
```

Metal (macOS):

```shell
cmake -B build_native \
    -DEXECUTORCH_GGML_BUILD_LLAMA_RUNNER=ON \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build_native --target benchmark_llm --parallel 16
```

CUDA:

```shell
cmake -B build_native \
    -DEXECUTORCH_GGML_BUILD_LLAMA_RUNNER=ON \
    -DGGML_CUDA=ON \
    -DCMAKE_CUDA_ARCHITECTURES=80 \
    -DCMAKE_BUILD_TYPE=Release
cmake --build build_native --target benchmark_llm --parallel 16
```

### 3. Run (C++)

```shell
# Benchmark: .pte (graph) + .gguf (weights)
./build_native/benchmark/benchmark_llm \
    models/qwen3/qwen3_q8_0.pte \
    --gguf models/qwen3/Qwen3-0.6B-Q8_0.gguf \
    --n-decode 128 --prompt-len 5
```

### 4. Run (Python)

```python
import torch
from executorch_ggml.gguf_module import GGUFModule

# Load graph from .pte, weights from .gguf
module = GGUFModule("models/qwen3/qwen3_q8_0.pte",
                    "models/qwen3/Qwen3-0.6B-Q8_0.gguf")

# Print model info
module.print_info()

# Prefill
input_ids = torch.tensor([[1, 2, 3, 4, 5]], dtype=torch.long)
cache_pos = torch.arange(5, dtype=torch.long)
logits = module.forward(input_ids, cache_pos)
next_token = logits[0][:, -1, :].argmax(dim=-1).item()

# Decode loop
import time
tokens = [next_token]
t0 = time.time()
for i in range(127):
    tok_input = torch.tensor([[next_token]], dtype=torch.long)
    pos_input = torch.tensor([5 + i], dtype=torch.long)
    logits = module.forward(tok_input, pos_input)
    next_token = logits[0][0, 0, :].argmax(dim=-1).item()
    tokens.append(next_token)
dt = time.time() - t0
print(f"{len(tokens) / dt:.1f} tok/s")
```
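The loop above decodes greedily (argmax). For non-deterministic output, a minimal temperature-sampling variant can be substituted; the `sample` helper below is hypothetical, not part of the executorch_ggml API.

```python
import math
import random

def sample(logits, temperature=0.8, seed=None):
    """Sample a token id from raw logits with temperature-scaled softmax."""
    if seed is not None:
        random.seed(seed)
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract max for numerical stability
    weights = [math.exp(x - m) for x in scaled]  # random.choices normalizes weights
    return random.choices(range(len(weights)), weights=weights, k=1)[0]

# Drop-in replacement for the argmax line in the decode loop:
# next_token = sample(logits[0][0, 0, :].tolist(), temperature=0.8)
```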

### 5. Profiling

```shell
# Per-call timing breakdown
GGML_PERF_LOG=1 ./build_native/benchmark/benchmark_llm \
    models/qwen3/qwen3_q8_0.pte \
    --gguf models/qwen3/Qwen3-0.6B-Q8_0.gguf \
    --n-decode 32

# Per-op timing (adds sync overhead; use for relative comparison only)
GGML_PROFILE=1 ./build_native/benchmark/benchmark_llm \
    models/qwen3/qwen3_q8_0.pte \
    --gguf models/qwen3/Qwen3-0.6B-Q8_0.gguf \
    --n-decode 5
```

### 6. llama.cpp baseline (comparison)

```shell
cd third-party/llama.cpp
cmake -B build -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release  # or -DGGML_CUDA=ON
cmake --build build --target llama-bench --parallel 16
cd ../..

third-party/llama.cpp/build/bin/llama-bench \
    -m models/qwen3/Qwen3-0.6B-Q8_0.gguf \
    -ngl 99 -p 5 -n 128 -r 5
```

## How It Works

Export (one-time):

```
GGUF file --> GGUFAnalyzer --> model config + weight names
                           --> PyTorch model (no weights loaded)
                           --> torch.export + GGML partitioner
                           --> .pte with GGUF tensor names as data_keys
                               (213 KB, graph only)
```

Runtime:

```
.pte (graph)    --> ExecuTorch Program
.gguf (weights) --> GGUFNamedDataMap (implements NamedDataMap)
                --> Backend loads weights via get_data(key)
                --> Same performance as embedded weights
```
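Conceptually, GGUFNamedDataMap resolves the tensor names stored as data_keys in the .pte against the GGUF file. A dictionary-backed sketch of that lookup contract (illustrative only; `ToyNamedDataMap` is a made-up stand-in, and the real implementation is C++ against a memory-mapped file):

```python
class ToyNamedDataMap:
    """Illustrative stand-in for GGUFNamedDataMap: map tensor names to bytes."""

    def __init__(self, tensors: dict) -> None:
        self._tensors = tensors

    def get_data(self, key: str) -> bytes:
        # The backend calls get_data(key) for each data_key in the .pte.
        if key not in self._tensors:
            raise KeyError(f"tensor {key!r} not found in GGUF file")
        return self._tensors[key]

# Tensor names follow GGUF conventions, e.g. "blk.0.attn_q.weight".
dm = ToyNamedDataMap({"blk.0.attn_q.weight": b"\x00" * 16})
assert dm.get_data("blk.0.attn_q.weight") == b"\x00" * 16
```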

### Export the .pte yourself

```python
from executorch_ggml.export_gguf import export_gguf_to_pte, GGUFExportConfig

config = GGUFExportConfig(
    max_seq_len=128,
    preserve_dynamic_shapes=True,
    enable_quantization=True,
)
export_gguf_to_pte("Qwen3-0.6B-Q8_0.gguf", "qwen3_q8_0.pte", config)
```

## Optimizations Applied

| Optimization | Effect |
|---|---|
| Fused RMSNorm (swap_rms_norm) | 8 ops/norm -> 1 |
| Fused RoPE (fuse_rope_in_graph) | 9 ops/Q,K -> 1 |
| GQA strip (strip_gqa_expand) | Remove expand/repeat |
| RMSNorm weight fold | Absorb into downstream linear |
| QKV projection fusion | 3 matmuls -> 1 per layer |
| Gate/Up projection fusion | 2 matmuls -> 1 per layer |
| CSE (post-export) | Merge duplicate linear nodes |
| Mask conversion cache | Deduplicate across 28 layers |
| RESHAPE collapse + PERMUTE compose | Eliminate redundant layout ops |
| SiLU-gate fusion | swiglu_split single kernel |
| Graph cache (default on) | 0 ms rebuild on cache HIT |
| Mutable KV cache on GPU | ggml_set_rows, no CPU fallback, 1 split |
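The two projection fusions can be illustrated outside GGML: concatenating the Q, K, and V weight matrices column-wise lets one matmul replace three, after which the output is split back apart. A NumPy sketch with arbitrary toy shapes (the actual fusion is applied to the exported GGML graph, not done at the Python level):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_q, d_kv = 64, 64, 32          # toy dims; with GQA, d_kv < d_q
x = rng.standard_normal((5, d_model))    # 5 tokens

W_q = rng.standard_normal((d_model, d_q))
W_k = rng.standard_normal((d_model, d_kv))
W_v = rng.standard_normal((d_model, d_kv))

# Unfused: three matmuls per layer.
q, k, v = x @ W_q, x @ W_k, x @ W_v

# Fused: one matmul against the concatenated weight, then split.
W_qkv = np.concatenate([W_q, W_k, W_v], axis=1)
fused = x @ W_qkv
q2, k2, v2 = np.split(fused, [d_q, d_q + d_kv], axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

The gate/up fusion works the same way, concatenating the two MLP input projections into one matmul.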

## Environment Variables

| Variable | Values | Description |
|---|---|---|
| GGML_PERF_LOG | 1 | Per-call timing breakdown |
| GGML_PROFILE | 1 | Per-op timing (adds sync overhead) |
| GGML_NO_GRAPH_CACHE | 1 | Disable graph caching (debug) |
| GGML_DEBUG_DUMP | <path> | Per-node tensor stats |
| GGML_SKIP_OUTPUT_COPY | 1 | Skip logits GPU->CPU copy (CUDA only) |

## Model Details

- Base model: Qwen/Qwen3-0.6B
- Parameters: 596M
- Architecture: 28 layers, 16 attention heads, 8 KV heads, head_dim=128
- Quantization: Q8_0 (weights only; KV cache is F32)
- Max sequence length: 128 (exported)
- Framework: ExecuTorch with ggml backend
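From these figures, the F32 KV cache footprint at the exported maximum sequence length can be estimated (a back-of-the-envelope calculation, not a measured number):

```python
layers, kv_heads, head_dim, seq_len = 28, 8, 128, 128
bytes_f32 = 4
# K and V caches, one pair per layer, F32 elements.
kv_bytes = layers * 2 * kv_heads * head_dim * seq_len * bytes_f32
print(f"{kv_bytes / 2**20:.0f} MiB")  # 28 MiB
```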