Qwen 3.5 2B – LiteRT-LM (.litertlm)

Qwen 3.5 2B bundled as .litertlm for on-device inference with LiteRT-LM.

First working LiteRT-LM bundle of Qwen 3.5's hybrid GatedDeltaNet architecture.

Usage

import com.google.ai.edge.litertlm.*

val engine = Engine(EngineConfig(
    modelPath = "/path/to/qwen35_2b.litertlm",
    backend = Backend.GPU(),  // or Backend.CPU()
))
engine.initialize()

engine.createConversation().use { conversation ->
    conversation.sendMessageAsync("Hello!").collect { print(it) }
}

What's inside

The .litertlm bundle contains:

  • TFLite model (int8 dynamic quantized, ~1.9 GB)
  • BPE tokenizer (zlib compressed, from Qwen 3.5 2B)

Architecture

Base model: Qwen/Qwen3.5-2B
Layers: 24 total: 18× GatedDeltaNet linear attention + 6× GQA full attention
Quantization: int8 dynamic
Size: ~1.9 GB
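The ~1.9 GB figure follows from the quantization scheme: int8 dynamic stores one byte per weight, so a model of roughly 2.0 billion parameters (an assumed round count, not stated on this card) lands near that size. A quick back-of-envelope check:

```kotlin
// Back-of-envelope size check for int8 dynamic quantization:
// one byte per weight, so ~2.0B parameters (assumed count) come
// out near the ~1.9 GB bundle size listed above.
fun main() {
    val params = 2_000_000_000L   // assumed parameter count
    val bytesPerWeight = 1L       // int8: one byte per weight
    val gib = params * bytesPerWeight / (1024.0 * 1024.0 * 1024.0)
    println("approx size: %.2f GiB".format(gib))  // prints ~1.86 GiB
}
```

The small remainder on top of the raw weights is the tokenizer, per-tensor quantization scales, and bundle metadata.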

GatedDeltaNet Linear Attention

Recurrent state-space attention with A_log decay, a short conv1d, and output gating. No KV cache is needed: each head maintains a fixed-size recurrent state instead.
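A toy single-head sketch of why this needs no KV cache, using plain arrays and hypothetical dimensions (this is an illustration of the gated delta rule, not the production kernel): the state is a fixed dK×dV matrix updated in place, so memory does not grow with sequence length.

```kotlin
// Toy GatedDeltaNet-style recurrence (simplified assumption, not the
// real kernel). State S is a fixed dK x dV matrix per head -- its size
// never grows with sequence length, hence no KV cache.
class LinearAttnHead(private val dK: Int, private val dV: Int) {
    private val s = Array(dK) { DoubleArray(dV) }  // recurrent state

    // alpha: decay gate in (0,1] (derived from A_log in the real model),
    // beta: write strength; q/k/v: per-token projections.
    fun step(q: DoubleArray, k: DoubleArray, v: DoubleArray,
             alpha: Double, beta: Double): DoubleArray {
        // 1) gated decay of the old state
        for (i in 0 until dK) for (j in 0 until dV) s[i][j] *= alpha
        // 2) delta rule: write only the prediction error v - S^T k
        for (j in 0 until dV) {
            var pred = 0.0
            for (i in 0 until dK) pred += k[i] * s[i][j]
            val err = v[j] - pred
            for (i in 0 until dK) s[i][j] += beta * k[i] * err
        }
        // 3) read out with the query: o = S^T q
        return DoubleArray(dV) { j ->
            var o = 0.0
            for (i in 0 until dK) o += q[i] * s[i][j]
            o
        }
    }
}

fun main() {
    val head = LinearAttnHead(dK = 4, dV = 4)
    // Feed 1000 tokens: memory stays O(dK * dV), independent of length.
    repeat(1000) { t ->
        val x = DoubleArray(4) { i -> ((t + i) % 7) / 7.0 }
        head.step(q = x, k = x, v = x, alpha = 0.95, beta = 0.5)
    }
    println("state stays 4x4 after 1000 tokens")
}
```

The delta rule writes only the prediction error into the state, and the alpha gate lets the model forget stale content, which is what keeps a constant-size state usable over long contexts.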

Full Attention (every 4th layer)

Grouped-query attention with asymmetric head dimensions (Q = 512 with partial rotary embedding, KV = 256). Standard softmax attention with a KV cache.
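The bookkeeping behind GQA can be sketched as follows. Head counts here (16 query heads sharing 4 KV heads) are hypothetical, since the card only gives head dimensions; the point is that each group of query heads shares one cached K/V head, and the cache grows per token, unlike the fixed-size linear-attention state above.

```kotlin
// Sketch of grouped-query attention bookkeeping (head counts are
// hypothetical; the model card only gives head dims Q=512, KV=256).
// Each group of query heads shares one cached K/V head, and the cache
// grows by one entry per token.
class GqaKvCache(private val numQHeads: Int, private val numKvHeads: Int) {
    init { require(numQHeads % numKvHeads == 0) }
    private val groupSize = numQHeads / numKvHeads
    // cache[kvHead] = list of (key, value) pairs, one per past token
    private val cache = Array(numKvHeads) {
        mutableListOf<Pair<DoubleArray, DoubleArray>>()
    }

    // Query head h reads K/V from its group's shared KV head.
    fun kvHeadFor(qHead: Int) = qHead / groupSize

    fun append(keys: Array<DoubleArray>, values: Array<DoubleArray>) {
        for (h in 0 until numKvHeads) cache[h].add(keys[h] to values[h])
    }

    fun cachedTokens() = cache[0].size
}

fun main() {
    val kv = GqaKvCache(numQHeads = 16, numKvHeads = 4)  // assumed counts
    println(kv.kvHeadFor(0))   // prints 0: query heads 0..3 share KV head 0
    println(kv.kvHeadFor(5))   // prints 1: query heads 4..7 share KV head 1
    repeat(128) {
        kv.append(Array(4) { DoubleArray(256) }, Array(4) { DoubleArray(256) })
    }
    println(kv.cachedTokens()) // prints 128: cache grows with context
}
```

With only 6 of the 24 layers using this cache, total KV memory stays far below that of a fully softmax-attention model of the same depth.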

Conversion

Authored with the litert-torch Generative API, with GatedDeltaNet expressed entirely as standard TFLite ops; no custom GPU kernels are required.

Source: allot/tools/model-export
