Qwen3.5-4B β€” Quantized for Mobile GPU (MLC-LLM) - Research & Development

First-ever native compilation of Qwen 3.5 for Android GPU (OpenCL/Vulkan) via MLC-LLM. This is an experimental R&D release intended to push the boundaries of React Native edge AI integration.

This repository contains the 4-bit quantized weights and the pre-compiled Android GPU shader for running Qwen/Qwen3.5-4B entirely offline, on-device, using MLC-LLM.

πŸ“Š Technical Specifications

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-4B |
| Parameters | 4.66 billion |
| Quantization Method | q4f16_0 (4-bit asymmetric weights, float16 compute) |
| Original Disk Footprint | 9.3 GB |
| Quantized Footprint | ~2.2 GB |
| Effective VRAM Requirement | ~2.5 GB |
| Target Device Backend | Android GPU (OpenCL/Vulkan), ARM64 |
| Architecture Profile | Hybrid Attention (Linear + Standard) |
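As a sanity check, the footprint figures above follow from the parameter count. This is a rough sketch: it ignores the per-group fp16 scale overhead of q4f16_0 and assumes every parameter is quantized, so it lands near (not exactly on) the ~2.2 GB measured on disk.

```python
# Back-of-envelope check of the table's footprint numbers.
params = 4.66e9                  # parameter count from the table
fp16_gb = params * 2 / 1e9       # 2 bytes/param  -> original fp16 weights
q4_gb = params * 0.5 / 1e9       # 4 bits/param   -> pure 4-bit estimate
print(f"{fp16_gb:.1f} GB -> {q4_gb:.1f} GB")  # prints "9.3 GB -> 2.3 GB"
```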

πŸ§ͺ R&D Breakthroughs & The Compilation Journey

Running this model natively on an Android device via React Native required working around significant limitations in standard mobile AI compilers.

Key R&D achievements during this build:

  1. Bypassing TVM Compiler Crashes (Monkey-Patching): Qwen 3.5 uses a hybrid attention architecture (SBlock) that crashes the vectorization pass of the Apache TVM OpenCL/Vulkan compiler. We engineered a Python-level monkey-patch of tvm.tir.Schedule.vectorize to selectively skip GPU vectorization for the incompatible unroll loops while preserving it everywhere else.
  2. Deep NDK Cross-Compilation: We rebuilt the MLC-LLM (mlc4j) native C++ and Java bridges from source via the Android NDK and CMake, integrating the Rust tokenizers and resolving deep submodule dependency trees (NVIDIA CUTLASS, SentencePiece).
  3. React Native Injection: The resulting libtvm4j_runtime_packed.so and its JNI bindings were hooked directly into a standalone @react-native-ai/mlc production build for immediate offline inference.
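A minimal sketch of the skip-on-failure monkey-patch pattern from step 1. The `Schedule` class below is a hypothetical stand-in (the real patch wraps `tvm.tir.Schedule.vectorize`); the loop representation and error type are illustrative only.

```python
class Schedule:
    """Hypothetical stand-in for tvm.tir.Schedule."""

    def vectorize(self, loop):
        # The real TVM pass crashes on hybrid-attention unroll loops.
        if loop["hybrid"]:
            raise RuntimeError("vectorize failed on hybrid-attention loop")
        return f"vectorized:{loop['name']}"

_orig_vectorize = Schedule.vectorize

def _safe_vectorize(self, loop):
    """Try to vectorize; fall back to scalar code instead of crashing."""
    try:
        return _orig_vectorize(self, loop)
    except RuntimeError:
        return f"scalar:{loop['name']}"  # skip vectorization, keep compiling

Schedule.vectorize = _safe_vectorize  # monkey-patch applied before the build

sch = Schedule()
print(sch.vectorize({"name": "mlp", "hybrid": False}))    # vectorized:mlp
print(sch.vectorize({"name": "sblock", "hybrid": True}))  # scalar:sblock
```

The key design choice is that the patch is selective: compatible loops still get the original, fully vectorized code path, and only the loops that would crash the compiler fall back to scalar code.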

⚠️ Performance Limitations (Why this is an R&D build)

While the LLM runs completely offline (tested successfully on the Samsung Galaxy S24 Ultra / Snapdragon 8 Gen 3), generation speed is heavily throttled.

This is the direct trade-off of our structural hacks:

  1. The Vectorization Sacrifice: With vectorize skipped to avoid the compiler crash, the GPU executes large tensor operations element-by-element instead of in optimized parallel vector blocks.
  2. Zero-Shot Tuning: This shader was compiled without a 48+ hour MetaSchedule tuning run for the Adreno GPU, so it falls back to default, unoptimized L1/L2 cache layouts.
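A toy cost model for point 1 above (illustrative numbers, not a benchmark): without vectorization, each element of a tensor operation needs its own GPU instruction, whereas a vectorized kernel covers several elements per instruction.

```python
# Instruction-count sketch of the scalar fallback vs. a vectorized kernel.
N = 4096            # elements in one tensor operation (illustrative)
VECTOR_WIDTH = 4    # lanes a vectorized OpenCL op processes at once

scalar_instructions = N                   # fallback path after the patch
vector_instructions = N // VECTOR_WIDTH   # what vectorize would have emitted

print(scalar_instructions // vector_instructions)  # prints 4: 4x the work issued
```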

For high-performance offline inference in production (20+ tokens/s), natively supported, pre-tuned architectures (e.g., Llama-3.2, Qwen2.5, Phi-3) remain the standard path.

🀝 Credits

  • Base Model Architecture: Qwen Team
  • Quantization, Mobile Integration & Build R&D: ClΓ©ment Bellet-Odent
  • Technology Watch & R&D Strategy: Lucas Gerhardt
  • Edge Inference Engine: MLC-LLM by MLC-AI