# Qwen 3.5 2B → LiteRT-LM (.litertlm)
Qwen 3.5 2B bundled as .litertlm for on-device inference with LiteRT-LM.
First working LiteRT-LM bundle of Qwen 3.5's hybrid GatedDeltaNet architecture.
## Usage
```kotlin
import com.google.ai.edge.litertlm.*

val engine = Engine(
    EngineConfig(
        modelPath = "/path/to/qwen35_2b.litertlm",
        backend = Backend.GPU(), // or Backend.CPU()
    )
)
engine.initialize()

engine.createConversation().use { conversation ->
    conversation.sendMessageAsync("Hello!").collect { print(it) }
}
```
## What's inside
The .litertlm bundle contains:
- TFLite model (int8 dynamic quantized, ~1.9 GB)
- BPE tokenizer (zlib compressed, from Qwen 3.5 2B)
## Architecture
| Property | Value |
|---|---|
| Base model | Qwen/Qwen3.5-2B |
| Layers | 24 total: 18× GatedDeltaNet linear attention + 6× GQA full attention |
| Quantization | int8 dynamic |
| Size | ~1.9 GB |
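The 18/6 split follows from the every-4th-layer schedule: 24 layers with full attention on every fourth one leaves 6 full-attention and 18 linear-attention layers. A minimal sketch (which specific indices are full-attention is an assumption; the card only states the counts):

```kotlin
// Hybrid layer schedule: full attention every 4th layer, GatedDeltaNet otherwise.
// The exact placement (here: layers 4, 8, 12, ...) is assumed for illustration.
val layerTypes = List(24) { i -> if ((i + 1) % 4 == 0) "full" else "linear" }

fun main() {
    println(layerTypes.count { it == "linear" })  // 18 GatedDeltaNet layers
    println(layerTypes.count { it == "full" })    // 6 GQA layers
}
```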
### GatedDeltaNet Linear Attention
Recurrent state-space attention with A_log decay, a short conv1d, and output gating. No KV cache is needed: each head maintains a fixed-size recurrent state.
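A schematic sketch of one delta-rule update step for a single head. The function name and the exact gating arithmetic are illustrative assumptions (simplified from published GatedDeltaNet formulations, not the LiteRT-LM kernels); the point is that the state is a fixed dK×dV matrix, so memory does not grow with sequence length:

```kotlin
// Simplified gated delta-rule recurrence for one head (illustrative only).
// State S has fixed shape [dK][dV] regardless of how many tokens are processed.
fun deltaNetStep(
    s: Array<DoubleArray>,   // recurrent state S, shape [dK][dV], updated in place
    q: DoubleArray,          // query, size dK
    k: DoubleArray,          // key, size dK
    v: DoubleArray,          // value, size dV
    alpha: Double,           // decay gate in (0, 1), derived from A_log in the model
    beta: Double             // write-strength gate in (0, 1)
): DoubleArray {
    val dK = k.size
    val dV = v.size
    // Read the current memory at key k: (S^T k)_j = sum_i k_i * S[i][j]
    val kS = DoubleArray(dV) { j -> (0 until dK).sumOf { i -> k[i] * s[i][j] } }
    // Decay the old state by alpha, then write beta * k (v - S^T k)^T
    // (simplified: the correction term here uses the undecayed state)
    for (i in 0 until dK) for (j in 0 until dV) {
        s[i][j] = alpha * s[i][j] + beta * k[i] * (v[j] - kS[j])
    }
    // Output is a read of the updated state at the query: o = S^T q
    return DoubleArray(dV) { j -> (0 until dK).sumOf { i -> q[i] * s[i][j] } }
}
```

Each step costs O(dK·dV) time and memory per head, independent of sequence length, which is why no KV cache is required.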
### Full Attention (every 4th layer)
Grouped-query attention with asymmetric head dimensions (Q=512 with partial rotary embedding, KV=256). Standard softmax attention with a KV cache.
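Only these 6 layers contribute to KV-cache growth; the linear-attention layers contribute none. A rough sizing sketch, reading KV=256 as the per-head dimension (an assumption) and leaving the KV head count as a hypothetical parameter, since it is not stated on this card:

```kotlin
// KV-cache footprint for the full-attention layers only (hypothetical sizing sketch).
// nKvHeads is a placeholder; the card gives only the per-head KV dim and layer count.
fun kvCacheBytes(
    nKvHeads: Long,          // hypothetical: number of KV heads per GQA layer
    seqLen: Long,            // tokens cached so far
    nFullLayers: Long = 6,   // full-attention layers in this model
    kvHeadDim: Long = 256,   // per-head KV dimension (assumed reading of "KV=256")
    bytesPerElem: Long = 1   // e.g. 1 for an int8 cache
): Long = 2 * nFullLayers * nKvHeads * seqLen * kvHeadDim * bytesPerElem  // x2: K and V
```

Because the cache scales linearly in `seqLen` but only across 6 of the 24 layers, long contexts stay far cheaper than in an all-full-attention model of the same depth.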
## Conversion
Custom model authoring via the litert-torch Generative API, with GatedDeltaNet implemented as standard TFLite ops; no custom GPU kernels are required.
Source: `allot/tools/model-export`