Gemma 4 E2B converted to LiteRT-LM for on-device inference on Android, embedded Linux, and desktop.

Exported from the base PyTorch model via `litert_torch.generative.export_hf` with `dynamic_wi4_afp32` quantization.
| Property | Value |
|---|---|
| Parameters | 5.1B total, 2.3B effective (PLE) |
| Quantization | dynamic_wi4_afp32 (INT4 weights, FP32 activations) |
| Format | .litertlm (LiteRT-LM) |
| File size | 2.39 GB |
| Context length | 32K tokens |
| Prefill lengths | 128, 512 |
| KV cache length | 4096 |
| Modalities | Text (+ image/audio with multimodal backends) |
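The `dynamic_wi4_afp32` scheme stores weights as INT4 with a floating-point scale and dequantizes them to FP32 at run time; activations stay FP32. The minimal sketch below illustrates the idea with a symmetric per-channel scheme, which is an assumption for illustration, not the actual LiteRT-LM kernel.

```python
import numpy as np

def quantize_int4(w: np.ndarray):
    """Symmetric per-output-channel INT4 quantization to [-8, 7].

    The per-channel granularity is an assumption; the real converter
    may use a different grouping.
    """
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover FP32 weights from INT4 values and their scales."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 64)).astype(np.float32)
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)
print("max abs error:", float(np.abs(w - w_hat).max()))
```

The rounding error per weight is bounded by half the channel scale, which is why 4-bit weight quantization with FP32 activations tends to preserve quality at a quarter of the FP16 storage cost.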
| File | Size | Description |
|---|---|---|
| `model.litertlm` | 2.39 GB | Model weights + embedded tokenizer |
| `config.json` | 0.4 KB | Inference metadata |
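The file size is consistent with 4-bit weights for the full 5.1B parameters. A quick back-of-envelope check (treating the reported GB as GiB, which is an assumption):

```python
# 5.1B parameters at 4 bits (0.5 byte) per weight
params = 5.1e9
bytes_total = params * 4 / 8
gib = bytes_total / 2**30
print(f"{gib:.2f} GiB")  # ~2.37, close to the 2.39 GB file
```

The small remainder plausibly covers the embedded tokenizer and metadata, though the exact breakdown is not documented here.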
Benchmarked on macOS ARM64 (Apple Silicon), CPU backend, LiteRT-LM 0.10.1:
| Prompt tokens | TTFT (ms) | Decode (tok/s) | Peak memory |
|---|---|---|---|
| 16 | 465 | 165.8 | 1.37 GB |
| 64 | 482 | 167.4 | 1.39 GB |
| 128 | 3,504 | 169.2 | 2.08 GB |
| 256 | 3,528 | 166.9 | 2.09 GB |
Model load time: 652 ms.
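Note the TTFT step between 64 and 128 prompt tokens. This model exports fixed prefill signatures of length 128 and 512, and runtimes with fixed signatures typically pad or chunk the prompt up to the exported lengths. The sketch below shows one plausible selection strategy; it is a hypothetical illustration, not the actual LiteRT-LM scheduler.

```python
PREFILL_LENGTHS = (128, 512)  # exported with this model

def prefill_chunks(prompt_len: int, lengths=PREFILL_LENGTHS) -> list[int]:
    """Return the sequence of prefill-signature lengths used for a prompt.

    Picks the smallest exported length that covers the remaining tokens,
    falling back to the largest length and chunking when the prompt is
    longer than any single signature.
    """
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        fit = next((l for l in lengths if l >= remaining), lengths[-1])
        chunks.append(fit)
        remaining -= fit
    return chunks

print(prefill_chunks(64))   # [128]
print(prefill_chunks(600))  # [512, 128]
```

Under this scheme a 64-token and a 128-token prompt both run one 128-length prefill, so the measured jump may also reflect other runtime effects; the numbers above are the observed behavior either way.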
Android reference (Samsung S26 Ultra, from Google):
| Backend | Decode (tok/s) | TTFT |
|---|---|---|
| GPU | 52.1 | 0.3s |
| CPU | 46.9 | 1.8s |
```python
import litert_lm

# Load the model on the CPU backend
engine = litert_lm.Engine(
    model_path="model.litertlm",
    backend=litert_lm.Backend.CPU,
)

# Conversations manage chat state; the context manager releases resources
with engine.create_conversation() as conv:
    response = conv.send_message("Hello, how are you?")
    print(response)
```
Install the Python API:

```shell
pip install litert-lm-api
```

To benchmark from the command line:

```shell
litert_lm_advanced_main --model_path=model.litertlm --backend=cpu --benchmark=true
```
Converted from `google/gemma-4-E2B-it` using litert-torch-nightly (0.9.0.dev20260403). Conversion took ~8 minutes on Apple Silicon (M-series, 64 GB RAM).