Qwen 3.5 2B – LiteRT-LM (.litertlm)

Qwen 3.5 2B bundled as .litertlm for on-device inference with LiteRT-LM.

First working LiteRT-LM bundle of Qwen 3.5's hybrid GatedDeltaNet architecture.

Usage

import com.google.ai.edge.litertlm.*

val engine = Engine(EngineConfig(
    modelPath = "/path/to/qwen35_2b.litertlm",
    backend = Backend.GPU(),  // or Backend.CPU()
))
engine.initialize()

engine.createConversation().use { conversation ->
    conversation.sendMessageAsync("Hello!").collect { print(it) }
}

What's inside

The .litertlm bundle contains:

  • TFLite model (int8 dynamic quantized, ~1.9 GB)
  • BPE tokenizer (zlib compressed, from Qwen 3.5 2B)

Architecture

Base model: Qwen/Qwen3.5-2B
Layers: 24 total: 18× GatedDeltaNet linear attention + 6× GQA full attention
Quantization: int8 dynamic
Size: ~1.9 GB
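The ~1.9 GB figure follows from the quantization scheme: int8 dynamic stores one byte per weight, so a model of roughly 2.0 billion parameters (an assumed round count, not stated on this card) lands near that size. A quick back-of-envelope check:

```kotlin
// Back-of-envelope size check for int8 dynamic quantization:
// one byte per weight, so ~2.0B parameters (assumed count) come
// out near the ~1.9 GB bundle size listed above.
fun main() {
    val params = 2_000_000_000L   // assumed parameter count
    val bytesPerWeight = 1L       // int8: one byte per weight
    val gib = params * bytesPerWeight / (1024.0 * 1024.0 * 1024.0)
    println("approx size: %.2f GiB".format(gib))  // prints ~1.86 GiB
}
```

The small remainder on top of the raw weights is the tokenizer, per-tensor quantization scales, and bundle metadata.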

GatedDeltaNet Linear Attention

Recurrent state-space attention with A_log decay, a short conv1d, and output gating. No KV cache is needed: each head maintains a fixed-size recurrent state instead.
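A toy single-head sketch of why this needs no KV cache, using plain arrays and hypothetical dimensions (this is an illustration of the gated delta rule, not the production kernel): the state is a fixed dK×dV matrix updated in place, so memory does not grow with sequence length.

```kotlin
// Toy GatedDeltaNet-style recurrence (simplified assumption, not the
// real kernel). State S is a fixed dK x dV matrix per head -- its size
// never grows with sequence length, hence no KV cache.
class LinearAttnHead(private val dK: Int, private val dV: Int) {
    private val s = Array(dK) { DoubleArray(dV) }  // recurrent state

    // alpha: decay gate in (0,1] (derived from A_log in the real model),
    // beta: write strength; q/k/v: per-token projections.
    fun step(q: DoubleArray, k: DoubleArray, v: DoubleArray,
             alpha: Double, beta: Double): DoubleArray {
        // 1) gated decay of the old state
        for (i in 0 until dK) for (j in 0 until dV) s[i][j] *= alpha
        // 2) delta rule: write only the prediction error v - S^T k
        for (j in 0 until dV) {
            var pred = 0.0
            for (i in 0 until dK) pred += k[i] * s[i][j]
            val err = v[j] - pred
            for (i in 0 until dK) s[i][j] += beta * k[i] * err
        }
        // 3) read out with the query: o = S^T q
        return DoubleArray(dV) { j ->
            var o = 0.0
            for (i in 0 until dK) o += q[i] * s[i][j]
            o
        }
    }
}

fun main() {
    val head = LinearAttnHead(dK = 4, dV = 4)
    // Feed 1000 tokens: memory stays O(dK * dV), independent of length.
    repeat(1000) { t ->
        val x = DoubleArray(4) { i -> ((t + i) % 7) / 7.0 }
        head.step(q = x, k = x, v = x, alpha = 0.95, beta = 0.5)
    }
    println("state stays 4x4 after 1000 tokens")
}
```

The delta rule writes only the prediction error into the state, and the alpha gate lets the model forget stale content, which is what keeps a constant-size state usable over long contexts.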

Full Attention (every 4th layer)

Grouped-query attention with asymmetric head dimensions (Q = 512 with partial rotary embedding, KV = 256). Standard softmax attention with a KV cache.
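The bookkeeping behind GQA can be sketched as follows. Head counts here (16 query heads sharing 4 KV heads) are hypothetical, since the card only gives head dimensions; the point is that each group of query heads shares one cached K/V head, and the cache grows per token, unlike the fixed-size linear-attention state above.

```kotlin
// Sketch of grouped-query attention bookkeeping (head counts are
// hypothetical; the model card only gives head dims Q=512, KV=256).
// Each group of query heads shares one cached K/V head, and the cache
// grows by one entry per token.
class GqaKvCache(private val numQHeads: Int, private val numKvHeads: Int) {
    init { require(numQHeads % numKvHeads == 0) }
    private val groupSize = numQHeads / numKvHeads
    // cache[kvHead] = list of (key, value) pairs, one per past token
    private val cache = Array(numKvHeads) {
        mutableListOf<Pair<DoubleArray, DoubleArray>>()
    }

    // Query head h reads K/V from its group's shared KV head.
    fun kvHeadFor(qHead: Int) = qHead / groupSize

    fun append(keys: Array<DoubleArray>, values: Array<DoubleArray>) {
        for (h in 0 until numKvHeads) cache[h].add(keys[h] to values[h])
    }

    fun cachedTokens() = cache[0].size
}

fun main() {
    val kv = GqaKvCache(numQHeads = 16, numKvHeads = 4)  // assumed counts
    println(kv.kvHeadFor(0))   // prints 0: query heads 0..3 share KV head 0
    println(kv.kvHeadFor(5))   // prints 1: query heads 4..7 share KV head 1
    repeat(128) {
        kv.append(Array(4) { DoubleArray(256) }, Array(4) { DoubleArray(256) })
    }
    println(kv.cachedTokens()) // prints 128: cache grows with context
}
```

With only 6 of the 24 layers using this cache, total KV memory stays far below that of a fully softmax-attention model of the same depth.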

Conversion

Authored with the litert-torch Generative API, with GatedDeltaNet expressed entirely as standard TFLite ops; no custom GPU kernels are required.

Source: allot/tools/model-export
