Qwen3-0.6B-kl-baseline-20k-MNN

Small Qwen3-0.6B draft model exported for TokForge + MNN speculative decoding on Android.

This is a practical, mobile-oriented draft bundle rather than a standard Transformers checkpoint. It is intended to be paired with larger Qwen3 targets inside TokForge.

Why this repo exists

This model is one of the strongest lightweight draft candidates we have for mobile speculative decoding:

  • trained on 20K teacher samples
  • KL-distilled from a Qwen3-8B teacher
  • exported as a ready-to-use MNN bundle
  • tuned for the classic CPU draft + GPU/CPU target TokForge flow

Best-known use

  • Draft model backend: CPU
  • Draft threads: 2
  • Draft predict length: d=3
  • Typical target pairing: Qwen3-8B in TokForge

Benchmark snapshot

On a RedMagic device (SM8850) with a Qwen3-8B target:

  • AR baseline: 13.9 tok/s
  • This draft model: 18.1 tok/s
  • Uplift: about +30%

Training-time acceptance rate (alpha) at the final logged epoch:

  • 0.7178
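Under the standard idealized model of speculative decoding (per-token acceptance treated as independent with rate alpha), an acceptance rate alpha and draft length d yield an expected (1 - alpha^(d+1)) / (1 - alpha) tokens per target verification pass. A small sketch plugging in this card's alpha = 0.7178 and d = 3; this is the textbook formula, not TokForge's exact internal accounting:

```python
def expected_tokens_per_pass(alpha: float, d: int) -> float:
    """Expected tokens emitted per target verification pass under the
    idealized speculative-decoding model: sum of alpha**k for k = 0..d,
    which closes to (1 - alpha**(d + 1)) / (1 - alpha)."""
    return (1 - alpha ** (d + 1)) / (1 - alpha)

# Values from this card: final-epoch acceptance rate and draft length.
alpha, d = 0.7178, 3
print(f"expected tokens/pass: {expected_tokens_per_pass(alpha, d):.2f}")  # ~2.60

# Measured uplift from the benchmark above: 18.1 vs 13.9 tok/s.
print(f"measured uplift: {18.1 / 13.9 - 1:.0%}")  # ~30%
```

The ~2.6 expected tokens per pass is an upper-bound-style estimate; real uplift (here ~30%) is lower because each draft round also pays the draft model's own forward cost.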

Included files

  • llm.mnn
  • llm.mnn.weight
  • llm_config.json
  • config.json
  • config_cpu.json
  • tokenizer files
  • ONNX export artifact for reference

Usage

This bundle targets the TokForge / MNN runtime; it will not load with standard Hugging Face Inference.

Typical TokForge recipe:

{
  "backend_type": "opencl",
  "thread_num": 4,
  "precision": "low",
  "memory": "low",
  "sampler_type": "greedy",
  "speculative_type": "draftmodel",
  "draft_predict_length": 3,
  "draft_config_path": "/path/to/config_cpu.json"
}

Known-good draft-side config:

{
  "backend_type": "cpu",
  "thread_num": 2,
  "precision": "low",
  "memory": "low",
  "sampler_type": "greedy"
}
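The two configs above plug together: the target-side config points at the draft-side one through draft_config_path. A minimal sketch that writes both files to a bundle directory; the temp-dir path is illustrative (on-device you would use the real bundle location), and TokForge/MNN may accept additional fields beyond those shown:

```python
import json
import os
import tempfile

# Hypothetical bundle directory; substitute your on-device path.
bundle = tempfile.mkdtemp()

# Known-good draft-side config from this card: CPU backend, 2 threads.
draft_cfg = {
    "backend_type": "cpu",
    "thread_num": 2,
    "precision": "low",
    "memory": "low",
    "sampler_type": "greedy",
}
draft_path = os.path.join(bundle, "config_cpu.json")
with open(draft_path, "w") as f:
    json.dump(draft_cfg, f, indent=2)

# Target-side recipe: GPU (OpenCL) target, d=3 draft tokens per round,
# wired to the draft config via draft_config_path.
target_cfg = {
    "backend_type": "opencl",
    "thread_num": 4,
    "precision": "low",
    "memory": "low",
    "sampler_type": "greedy",
    "speculative_type": "draftmodel",
    "draft_predict_length": 3,
    "draft_config_path": draft_path,
}
target_path = os.path.join(bundle, "config.json")
with open(target_path, "w") as f:
    json.dump(target_cfg, f, indent=2)
```

Keeping the draft on CPU while the target runs on OpenCL matches the "CPU draft + GPU target" flow described above, so the two models don't contend for the same backend.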

Notes

  • This is not a drop-in Transformers repo for HF Inference.
  • It is a runtime bundle for TokForge / MNN.
  • If you want the safest current mobile draft option, this is one of the best places to start.

Limitations and Intended Use

  • Intended for speculative decoding with larger Qwen3 targets inside TokForge.
  • The strongest current evidence is with a Qwen3-8B target; smaller or differently paired targets may behave differently.
  • Mobile results depend heavily on backend routing, prompt length, and device thermals.
  • This is a specialized runtime artifact, not a general-purpose pretrained release.

Collection

TokForge

If you benchmark this on your own device, feel free to share results in Discord.
