Qwen3-0.6B-kl-baseline-20k-MNN

Small Qwen3-0.6B draft model exported for TokForge + MNN speculative decoding on Android.

This is a practical, mobile-oriented draft bundle rather than a standard Transformers checkpoint. It is intended to be paired with larger Qwen3 targets inside TokForge.

Why this repo exists

This model is one of the strongest lightweight draft candidates we have for mobile speculative decoding:

  • trained on 20K teacher samples
  • KL-distilled from a Qwen3-8B teacher
  • exported as a ready-to-use MNN bundle
  • tuned for the classic CPU draft + GPU/CPU target TokForge flow

Best-known use

  • Draft model backend: CPU
  • Draft threads: 2
  • Draft predict length: d=3
  • Typical target pairing: Qwen3-8B in TokForge

Benchmark snapshot

On a RedMagic device (SM8850) with a Qwen3-8B target:

  • AR baseline: 13.9 tok/s
  • This draft model: 18.1 tok/s
  • Uplift: about +30%

Training-time acceptance rate (alpha) at the final logged epoch:

  • 0.7178
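Under the standard idealized model of speculative decoding (per-token acceptance treated as independent with rate alpha), an acceptance rate alpha and draft length d yield an expected (1 - alpha^(d+1)) / (1 - alpha) tokens per target verification pass. A small sketch plugging in this card's alpha = 0.7178 and d = 3; this is the textbook formula, not TokForge's exact internal accounting:

```python
def expected_tokens_per_pass(alpha: float, d: int) -> float:
    """Expected tokens emitted per target verification pass under the
    idealized speculative-decoding model: sum of alpha**k for k = 0..d,
    which closes to (1 - alpha**(d + 1)) / (1 - alpha)."""
    return (1 - alpha ** (d + 1)) / (1 - alpha)

# Values from this card: final-epoch acceptance rate and draft length.
alpha, d = 0.7178, 3
print(f"expected tokens/pass: {expected_tokens_per_pass(alpha, d):.2f}")  # ~2.60

# Measured uplift from the benchmark above: 18.1 vs 13.9 tok/s.
print(f"measured uplift: {18.1 / 13.9 - 1:.0%}")  # ~30%
```

The ~2.6 expected tokens per pass is an upper-bound-style estimate; real uplift (here ~30%) is lower because each draft round also pays the draft model's own forward cost.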

Included files

  • llm.mnn
  • llm.mnn.weight
  • llm_config.json
  • config.json
  • config_cpu.json
  • tokenizer files
  • ONNX export artifact for reference

Usage

This bundle targets the TokForge / MNN runtime; it will not load with standard Hugging Face Inference.

Typical TokForge recipe:

{
  "backend_type": "opencl",
  "thread_num": 4,
  "precision": "low",
  "memory": "low",
  "sampler_type": "greedy",
  "speculative_type": "draftmodel",
  "draft_predict_length": 3,
  "draft_config_path": "/path/to/config_cpu.json"
}

Known-good draft-side config:

{
  "backend_type": "cpu",
  "thread_num": 2,
  "precision": "low",
  "memory": "low",
  "sampler_type": "greedy"
}
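The two configs above plug together: the target-side config points at the draft-side one through draft_config_path. A minimal sketch that writes both files to a bundle directory; the temp-dir path is illustrative (on-device you would use the real bundle location), and TokForge/MNN may accept additional fields beyond those shown:

```python
import json
import os
import tempfile

# Hypothetical bundle directory; substitute your on-device path.
bundle = tempfile.mkdtemp()

# Known-good draft-side config from this card: CPU backend, 2 threads.
draft_cfg = {
    "backend_type": "cpu",
    "thread_num": 2,
    "precision": "low",
    "memory": "low",
    "sampler_type": "greedy",
}
draft_path = os.path.join(bundle, "config_cpu.json")
with open(draft_path, "w") as f:
    json.dump(draft_cfg, f, indent=2)

# Target-side recipe: GPU (OpenCL) target, d=3 draft tokens per round,
# wired to the draft config via draft_config_path.
target_cfg = {
    "backend_type": "opencl",
    "thread_num": 4,
    "precision": "low",
    "memory": "low",
    "sampler_type": "greedy",
    "speculative_type": "draftmodel",
    "draft_predict_length": 3,
    "draft_config_path": draft_path,
}
target_path = os.path.join(bundle, "config.json")
with open(target_path, "w") as f:
    json.dump(target_cfg, f, indent=2)
```

Keeping the draft on CPU while the target runs on OpenCL matches the "CPU draft + GPU target" flow described above, so the two models don't contend for the same backend.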

Notes

  • This is not a drop-in Transformers repo for HF Inference.
  • It is a runtime bundle for TokForge / MNN.
  • If you want the safest current mobile draft option, this is one of the best places to start.

Limitations and Intended Use

  • Intended for speculative decoding with larger Qwen3 targets inside TokForge.
  • The strongest current evidence is with a Qwen3-8B target; smaller or differently paired targets may behave differently.
  • Mobile results depend heavily on backend routing, prompt length, and device thermals.
  • This is a specialized runtime artifact, not a general-purpose pretrained release.

Collection

TokForge

If you benchmark this on your own device, feel free to share results in Discord.
