Qwen3.5-4B β€” Quantized for Mobile GPU (MLC-LLM) - Research & Development

First-ever native compilation of Qwen 3.5 for Android GPU (OpenCL/Vulkan) via MLC-LLM. This is an experimental R&D release intended to push the boundaries of React Native edge AI integration.

This repository contains the 4-bit quantized weights and the pre-compiled Android GPU shader for running Qwen/Qwen3.5-4B entirely offline, on-device, using MLC-LLM.

πŸ“Š Technical Specifications

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-4B |
| Parameters | 4.66 billion |
| Quantization Method | q4f16_0 (4-bit asymmetric weights, float16 compute) |
| Original Disk Footprint | 9.3 GB |
| Quantized Footprint | ~2.2 GB |
| Effective VRAM Requirement | ~2.5 GB |
| Target Device Backend | Android GPU (OpenCL/Vulkan), ARM64 |
| Architecture Profile | Hybrid Attention (Linear + Standard) |
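As a sanity check, the footprint figures above follow from the parameter count. This is a rough sketch: it ignores the per-group fp16 scale overhead of q4f16_0 and assumes every parameter is quantized, so it lands near (not exactly on) the ~2.2 GB measured on disk.

```python
# Back-of-envelope check of the table's footprint numbers.
params = 4.66e9                  # parameter count from the table
fp16_gb = params * 2 / 1e9       # 2 bytes/param  -> original fp16 weights
q4_gb = params * 0.5 / 1e9       # 4 bits/param   -> pure 4-bit estimate
print(f"{fp16_gb:.1f} GB -> {q4_gb:.1f} GB")  # prints "9.3 GB -> 2.3 GB"
```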

πŸ§ͺ R&D Breakthroughs & The Compilation Journey

Running this model natively on an Android device via React Native required working around significant limitations in standard mobile AI compilers.

Key R&D achievements during this build:

  1. Bypassing TVM Compiler Crashes (Monkey-Patching): Qwen 3.5 uses a hybrid attention architecture (SBlock) that crashes the vectorization pass of the Apache TVM OpenCL/Vulkan compiler. We engineered a Python-level monkey-patch of tvm.tir.Schedule.vectorize to selectively skip GPU vectorization for the incompatible unroll loops while preserving it everywhere else.
  2. Deep NDK Cross-Compilation: We rebuilt the MLC-LLM (mlc4j) native C++ and Java bridges from source via the Android NDK and CMake, integrating the Rust tokenizers and resolving deep submodule dependency trees (NVIDIA CUTLASS, SentencePiece).
  3. React Native Injection: The resulting libtvm4j_runtime_packed.so and its JNI bindings were hooked directly into a standalone @react-native-ai/mlc production build for immediate offline inference.
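A minimal sketch of the skip-on-failure monkey-patch pattern from step 1. The `Schedule` class below is a hypothetical stand-in (the real patch wraps `tvm.tir.Schedule.vectorize`); the loop representation and error type are illustrative only.

```python
class Schedule:
    """Hypothetical stand-in for tvm.tir.Schedule."""

    def vectorize(self, loop):
        # The real TVM pass crashes on hybrid-attention unroll loops.
        if loop["hybrid"]:
            raise RuntimeError("vectorize failed on hybrid-attention loop")
        return f"vectorized:{loop['name']}"

_orig_vectorize = Schedule.vectorize

def _safe_vectorize(self, loop):
    """Try to vectorize; fall back to scalar code instead of crashing."""
    try:
        return _orig_vectorize(self, loop)
    except RuntimeError:
        return f"scalar:{loop['name']}"  # skip vectorization, keep compiling

Schedule.vectorize = _safe_vectorize  # monkey-patch applied before the build

sch = Schedule()
print(sch.vectorize({"name": "mlp", "hybrid": False}))    # vectorized:mlp
print(sch.vectorize({"name": "sblock", "hybrid": True}))  # scalar:sblock
```

The key design choice is that the patch is selective: compatible loops still get the original, fully vectorized code path, and only the loops that would crash the compiler fall back to scalar code.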

⚠️ Performance Limitations (Why this is an R&D build)

While the LLM runs completely offline (tested successfully on the Samsung Galaxy S24 Ultra / Snapdragon 8 Gen 3), generation speed is heavily throttled.

This is the direct trade-off of our structural hacks:

  1. The Vectorization Sacrifice: With vectorize skipped to avoid the compiler crash, the GPU executes large tensor operations element-by-element instead of in optimized parallel vector blocks.
  2. Zero-Shot Tuning: This shader was compiled without a 48+ hour MetaSchedule tuning run for the Adreno GPU, so it falls back to default, unoptimized L1/L2 cache layouts.
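A toy cost model for point 1 above (illustrative numbers, not a benchmark): without vectorization, each element of a tensor operation needs its own GPU instruction, whereas a vectorized kernel covers several elements per instruction.

```python
# Instruction-count sketch of the scalar fallback vs. a vectorized kernel.
N = 4096            # elements in one tensor operation (illustrative)
VECTOR_WIDTH = 4    # lanes a vectorized OpenCL op processes at once

scalar_instructions = N                   # fallback path after the patch
vector_instructions = N // VECTOR_WIDTH   # what vectorize would have emitted

print(scalar_instructions // vector_instructions)  # prints 4: 4x the work issued
```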

For high-performance offline inference in production (20+ tokens/s), natively supported, pre-tuned architectures (e.g., Llama-3.2, Qwen2.5, Phi-3) remain the standard path.

🀝 Credits

  • Base Model Architecture: Qwen Team
  • Quantization, Mobile Integration & Build R&D: ClΓ©ment Bellet-Odent
  • Technology Watch & R&D Strategy: Lucas Gerhardt
  • Edge Inference Engine: MLC-LLM by MLC-AI