# Qwen3.5-4B Quantized for Mobile GPU (MLC-LLM) - Research & Development
First-ever native compilation of Qwen 3.5 for Android GPU (OpenCL/Vulkan) via MLC-LLM. This is an experimental R&D release intended to push the boundaries of React Native edge AI integration.
This repository contains the 4-bit quantized weights and the pre-compiled Android GPU shader library for running Qwen/Qwen3.5-4B entirely offline, on-device, via MLC-LLM.
## Technical Specifications
| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-4B |
| Parameters | 4.66 Billion |
| Quantization Method | q4f16_0 (4-bit asymmetric weights, float16 compute) |
| Original Disk Footprint | 9.3 GB |
| Quantized Footprint | ~2.2 GB |
| Effective VRAM Requirement | ~2.5 GB |
| Target Device Backend | Android GPU (OpenCL/Vulkan) - ARM64 |
| Architecture Profile | Hybrid Attention (Linear + Standard) |
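As a back-of-envelope sanity check on the table above (assuming 2 bytes per parameter in float16 and 0.5 bytes per parameter at 4 bits, and ignoring per-group quantization scale overhead):

```python
# Back-of-envelope check of the disk footprints listed above.
# Assumes 2 bytes/param for float16 and 0.5 bytes/param for 4-bit
# weights, ignoring per-group scale/zero-point overhead.
params = 4.66e9                      # 4.66 billion parameters

fp16_gb = params * 2.0 / 1e9         # original float16 checkpoint
q4_gb = params * 0.5 / 1e9           # 4-bit quantized weights

print(f"float16: {fp16_gb:.1f} GB")  # ~9.3 GB, matching the table
print(f"q4:      {q4_gb:.2f} GB")    # ~2.33 GB, close to the ~2.2 GB figure
```

The small gap between the ideal 2.33 GB and the actual ~2.2 GB/~2.5 GB figures comes from quantization metadata, runtime buffers, and the KV cache, which the raw weight math does not capture.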
## R&D Breakthroughs & The Compilation Journey
Bringing this model to run natively on an Android device via React Native required working around significant limitations in standard mobile AI compilers.
Key R&D achievements during this build:
- Bypassing TVM Compiler Crashes (Monkey-Patching): Qwen 3.5 uses a hybrid attention architecture (SBlock) that crashes the Apache TVM OpenCL/Vulkan compiler's vectorization pass. We engineered a Python-level monkey-patch to `s_tir.Schedule.vectorize` that selectively skips GPU vectorization for the incompatible unrolled loops while preserving it everywhere else.
- Deep NDK Cross-Compilation: We rebuilt the MLC-LLM (`mlc4j`) native C++ and Java bridges from source via the Android NDK and CMake, integrating the Rust tokenizers and resolving deep submodule dependency trees (NVIDIA CUTLASS, SentencePiece).
- React Native Injection: The resulting `libtvm4j_runtime_packed.so` and JNI bindings were hooked directly into a standalone `@react-native-ai/mlc` production build for immediate offline inference.
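The monkey-patch pattern can be sketched as follows. This is an illustrative stand-in, not the real TVM API: the `Schedule` class, the loop names, and the failure mode are hypothetical placeholders, but the real schedule method is wrapped the same way.

```python
# Illustrative sketch of the monkey-patching pattern (NOT the real TVM API):
# wrap the schedule's vectorize method so loops the GPU backend cannot
# handle fall back to scalar codegen instead of crashing the compiler.

class Schedule:
    """Hypothetical stand-in for the TVM schedule object."""

    def __init__(self, incompatible_loops):
        self.incompatible_loops = set(incompatible_loops)

    def vectorize(self, loop_name):
        # Simulate the crash triggered by hybrid-attention unroll loops.
        if loop_name in self.incompatible_loops:
            raise RuntimeError(f"cannot vectorize {loop_name}")
        return f"vectorized:{loop_name}"


_original_vectorize = Schedule.vectorize


def _safe_vectorize(self, loop_name):
    """Skip vectorization only where it would crash; keep it elsewhere."""
    try:
        return _original_vectorize(self, loop_name)
    except RuntimeError:
        return f"scalar:{loop_name}"  # sequential fallback (the speed cost)


# Apply the patch: every Schedule now tolerates incompatible loops.
Schedule.vectorize = _safe_vectorize
```

The scalar fallback is exactly where the performance cost described in the next section comes from: loops that would have crashed the compiler now compile, but run sequentially.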
## Performance Limitations (Why this is an R&D build)
While the LLM runs completely offline (tested successfully on the Samsung Galaxy S24 Ultra / Snapdragon 8 Gen 3), generation speed is heavily throttled.
This is the direct trade-off of our structural hacks:
- The Vectorization Sacrifice: Skipping `vectorize` to bypass the compiler crash means the GPU computes large operations sequentially rather than in optimized parallel blocks.
- Zero-Shot Tuning: This shader was compiled without 48+ hours of `MetaSchedule` Adreno tuning logs, falling back to default, unoptimized L1/L2 cache layouts.
For high-performance offline inference in production (20+ tokens/s), natively supported, pre-tuned architectures (e.g., Llama-3.2, Qwen2.5, Phi-3) remain the standard path.
## Credits