Gemma 4 E2B converted to LiteRT-LM for on-device inference on Android, embedded Linux, and desktop.

Exported from the base PyTorch model via `litert_torch.generative.export_hf` with `dynamic_wi4_afp32` quantization.
| Property | Value |
|---|---|
| Parameters | 5.1B total, 2.3B effective (PLE) |
| Quantization | dynamic_wi4_afp32 (INT4 weights, FP32 activations) |
| Format | .litertlm (LiteRT-LM) |
| File size | 2.39 GB |
| Context length | 32K tokens |
| Prefill lengths | 128, 512 |
| KV cache length | 4096 |
| Modalities | Text (+ image/audio with multimodal backends) |
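The `dynamic_wi4_afp32` scheme stores weights as INT4 with a floating-point scale and dequantizes them to FP32 at run time; activations stay FP32. The minimal sketch below illustrates the idea with a symmetric per-channel scheme, which is an assumption for illustration, not the actual LiteRT-LM kernel.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-output-channel INT4 quantization to [-8, 7].

    The per-channel granularity is an assumption; the real converter
    may use a different grouping.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover FP32 weights from INT4 values and their scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

The rounding error per weight is bounded by half the channel scale, which is why 4-bit weight quantization with FP32 activations tends to preserve quality at a quarter of the FP16 storage cost.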
| File | Size | Description |
|---|---|---|
| `model.litertlm` | 2.39 GB | Model weights + embedded tokenizer |
| `config.json` | 0.4 KB | Inference metadata |
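The file size is consistent with 4-bit weights for the full 5.1B parameters. A quick back-of-envelope check (treating the reported GB as GiB, which is an assumption):

```python
# 5.1B parameters at 4 bits (0.5 byte) per weight
params = 5.1e9
bytes_total = params * 4 / 8
gib = bytes_total / 2**30
print(f"{gib:.2f} GiB")  # ~2.37, close to the 2.39 GB file
```

The small remainder plausibly covers the embedded tokenizer and metadata, though the exact breakdown is not documented here.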
Benchmarked on macOS ARM64 (Apple Silicon), CPU backend, LiteRT-LM 0.10.1:
| Prompt tokens | TTFT (ms) | Decode (tok/s) | Peak memory |
|---|---|---|---|
| 16 | 465 | 165.8 | 1.37 GB |
| 64 | 482 | 167.4 | 1.39 GB |
| 128 | 3,504 | 169.2 | 2.08 GB |
| 256 | 3,528 | 166.9 | 2.09 GB |
Model load time: 652 ms.
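Note the TTFT step between 64 and 128 prompt tokens. This model exports fixed prefill signatures of length 128 and 512, and runtimes with fixed signatures typically pad or chunk the prompt up to the exported lengths. The sketch below shows one plausible selection strategy; it is a hypothetical illustration, not the actual LiteRT-LM scheduler.

```python
PREFILL_LENGTHS = (128, 512)  # exported with this model

def prefill_chunks(prompt_len: int, lengths=PREFILL_LENGTHS) -> list[int]:
    """Return the sequence of prefill-signature lengths used for a prompt.

    Picks the smallest exported length that covers the remaining tokens,
    falling back to the largest length and chunking when the prompt is
    longer than any single signature.
    """
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        fit = next((l for l in lengths if l >= remaining), lengths[-1])
        chunks.append(fit)
        remaining -= fit
    return chunks

print(prefill_chunks(64))   # [128]
print(prefill_chunks(600))  # [512, 128]
```

Under this scheme a 64-token and a 128-token prompt both run one 128-length prefill, so the measured jump may also reflect other runtime effects; the numbers above are the observed behavior either way.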
Android reference (Samsung S26 Ultra, from Google):
| Backend | Decode (tok/s) | TTFT |
|---|---|---|
| GPU | 52.1 | 0.3s |
| CPU | 46.9 | 1.8s |
```python
import litert_lm

# Load the model on the CPU backend
engine = litert_lm.Engine(
    model_path="model.litertlm",
    backend=litert_lm.Backend.CPU,
)

# Conversations manage chat state; the context manager releases resources
with engine.create_conversation() as conv:
    response = conv.send_message("Hello, how are you?")
    print(response)
```
Install the Python API:

```shell
pip install litert-lm-api
```

To benchmark from the command line:

```shell
litert_lm_advanced_main --model_path=model.litertlm --backend=cpu --benchmark=true
```
Converted from `google/gemma-4-E2B-it` using litert-torch-nightly (0.9.0.dev20260403). Conversion took ~8 minutes on Apple Silicon (M-series, 64 GB RAM).