# Qwen3-8B-abliterated-v2 (MNN)
Pre-converted Qwen3-8B abliterated model in MNN format for on-device inference.
## Model Details
- Architecture: Qwen3 (standard attention, 36 layers)
- Parameters: 8B (4-bit quantized)
- Format: MNN (Alibaba Mobile Neural Network)
- Vocab: 151,936 tokens
- Quantization: W4A16 (4-bit weights, 16-bit activations)
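To make the W4A16 scheme concrete, here is a minimal pure-Python sketch of symmetric blockwise 4-bit weight quantization with a block size of 128 (matching the `--quant_block 128` export setting noted below). This is illustrative only, not MNN's exact implementation:

```python
def quantize_w4_blockwise(weights, block_size=128):
    """Symmetric 4-bit blockwise quantization sketch: each block of
    block_size weights shares one fp scale; values map to integers
    in [-8, 7]. Illustrative only -- not MNN's exact scheme."""
    quants, scales = [], []
    for i in range(0, len(weights), block_size):
        block = weights[i:i + block_size]
        # One scale per block so outliers only affect their own block.
        scale = max((abs(v) for v in block), default=0.0) / 7.0 or 1.0
        scales.append(scale)
        quants.append([max(-8, min(7, round(v / scale))) for v in block])
    return quants, scales

def dequantize(quants, scales):
    """Reconstruct approximate fp weights from 4-bit ints + scales."""
    out = []
    for block, scale in zip(quants, scales):
        out.extend(q * scale for q in block)
    return out
```

Activations stay in 16-bit floating point at runtime, which is why only the weight file shrinks to ~4.4 GB while the embedding table ships separately in BF16.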
## Files

| File | Size | Description |
|---|---|---|
| `llm.mnn` | 631 KB | Model graph |
| `llm.mnn.weight` | 4.4 GB | Quantized weights |
| `embeddings_bf16.bin` | 1.2 GB | BF16 embedding table (required) |
| `llm_config.json` | 4.5 KB | Model config with Jinja chat template |
| `tokenizer.txt` | 3.0 MB | Tokenizer |
| `config.json` | 210 B | MNN runtime config |
## Usage with TokForge

This model is packaged for TokForge, an Android app for on-device LLM inference.
## Performance (Speculative Decoding)
| Device | SoC | Backend | AR tok/s | Spec Decode tok/s | Uplift |
|---|---|---|---|---|---|
| S26 Ultra | SM8850 | OpenCL | ~14 | 17.8 | +27% |
| RedMagic 11 Pro | SM8850 | OpenCL | ~14 | 17.8 | +27% |
| Lenovo TB520FU | SM8650 | OpenCL | 9.9 | 12.2 | +23% |
Draft model: Qwen3-0.6B
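The uplift column above is simply the relative throughput gain of speculative decoding over plain autoregressive (AR) decoding, which can be checked directly from the table's tok/s figures:

```python
def uplift(ar_tps, spec_tps):
    """Percent throughput gain of speculative decoding over
    plain autoregressive decoding."""
    return (spec_tps / ar_tps - 1.0) * 100.0

# Figures from the table above:
print(round(uplift(9.9, 12.2)))   # Lenovo TB520FU (SM8650)
print(round(uplift(14.0, 17.8)))  # SM8850 devices
```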
## Abliteration
This model has been abliterated (safety filters removed) for unrestricted conversation. Use responsibly.
## Limitations and Intended Use
- Intended for TokForge / MNN on-device inference, especially Android phones and tablets.
- The best-known uplift for this model comes from pairing it with a small CPU draft model for speculative decoding.
- Real throughput varies by SoC, thermal state, backend, and generation length.
- This repo is a runtime bundle, not a standard Transformers training checkpoint.
## Community
- Website: tokforge.ai
- Discord: Join the Discord
## Export
Converted using MNN's `llmexport` pipeline with `--quant_bit 4 --quant_block 128`.
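For reference, the conversion step looks roughly like the following. The `--quant_bit` and `--quant_block` flags match the settings stated above; the script name, source path, and other flags are assumptions based on MNN's `llmexport` tooling and may differ in your checkout:

```shell
# Sketch of the MNN conversion step (paths are placeholders).
python llmexport.py \
    --path /path/to/Qwen3-8B-abliterated-v2 \
    --export mnn \
    --quant_bit 4 \
    --quant_block 128
```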