# Gemma 4 E2B – LiteRT-LM

Gemma 4 E2B converted to LiteRT-LM for on-device inference on Android, embedded Linux, and desktop.

Exported from the base PyTorch model via `litert_torch.generative.export_hf` with `dynamic_wi4_afp32` quantization.

## Model

| Property | Value |
|---|---|
| Parameters | 5.1B total, 2.3B effective (PLE) |
| Quantization | `dynamic_wi4_afp32` (INT4 weights, FP32 activations) |
| Format | `.litertlm` (LiteRT-LM) |
| File size | 2.39 GB |
| Context length | 32K tokens |
| Prefill lengths | 128, 512 |
| KV cache length | 4096 |
| Modalities | Text (+ image/audio with multimodal backends) |
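The `dynamic_wi4_afp32` scheme stores weights as 4-bit integers and keeps activations in FP32, dequantizing weights at compute time. A minimal plain-Python sketch of symmetric per-tensor INT4 quantization (illustrative only; the actual LiteRT-LM kernels and scaling granularity differ):

```python
# Illustrative symmetric per-tensor INT4 weight quantization (a sketch,
# not the actual LiteRT-LM implementation): weight codes live in [-8, 7],
# activations stay FP32, and weights are dequantized before the matmul.

def quantize_int4(weights):
    """Map FP32 weights to INT4 codes plus one FP32 scale."""
    scale = max(abs(w) for w in weights) / 7.0  # 7 = max positive INT4 value
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 weights at compute time."""
    return [v * scale for v in q]

w = [0.12, -0.5, 0.33, 0.07]
q, s = quantize_int4(w)
w_hat = dequantize(q, s)
print(q)      # INT4 codes in [-8, 7]
print(w_hat)  # approximate reconstruction of w
```

Per-channel or per-group scales (common in practice) reduce the reconstruction error further at a small storage cost.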

## Files

| File | Size | Description |
|---|---|---|
| `model.litertlm` | 2.39 GB | Model weights + embedded tokenizer |
| `config.json` | 0.4 KB | Inference metadata |
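As a back-of-envelope check (my arithmetic, not from the card): 5.1B parameters at 4 bits each comes to about 2.55 GB, or roughly 2.37 GiB, consistent with the reported 2.39 GB file once the embedded tokenizer and metadata are included:

```python
# Rough file-size sanity check: 5.1B parameters x 4-bit weights.
params = 5.1e9
bytes_total = params * 4 / 8           # 4 bits per weight
gb = bytes_total / 1e9                 # decimal gigabytes
gib = bytes_total / 2**30              # binary gibibytes
print(f"{gb:.2f} GB / {gib:.2f} GiB")  # ~2.55 GB / ~2.37 GiB
```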

## Performance

Benchmarked on macOS ARM64 (Apple Silicon), CPU backend, LiteRT-LM 0.10.1:

| Prompt tokens | TTFT (ms) | Decode (tok/s) | Peak memory |
|---|---|---|---|
| 16 | 465 | 165.8 | 1.37 GB |
| 64 | 482 | 167.4 | 1.39 GB |
| 128 | 3,504 | 169.2 | 2.08 GB |
| 256 | 3,528 | 166.9 | 2.09 GB |

Model load time: 652 ms.
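End-to-end latency can be estimated from the table as TTFT plus decode time for the generated tokens. A rough model using the 128-token-prompt row (real latency varies with prompt content and system load):

```python
# Estimated wall-clock time to answer a 128-token prompt with 200 new tokens,
# using the macOS CPU numbers above: TTFT + new_tokens / decode_rate.
ttft_ms = 3504        # time to first token, ms (128-token prompt row)
decode_tok_s = 169.2  # steady-state decode speed, tokens/s
new_tokens = 200

total_s = ttft_ms / 1000 + new_tokens / decode_tok_s
print(f"{total_s:.2f} s")  # ~4.69 s
```

Note the TTFT jump between the 64- and 128-token rows, which lines up with the 128-token prefill chunk boundary listed in the model table.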

Android reference numbers (Samsung S26 Ultra, as published by Google):

| Backend | Decode (tok/s) | TTFT |
|---|---|---|
| GPU | 52.1 | 0.3 s |
| CPU | 46.9 | 1.8 s |

## Usage

### Python

```python
import litert_lm

engine = litert_lm.Engine(
    model_path="model.litertlm",
    backend=litert_lm.Backend.CPU,
)

with engine.create_conversation() as conv:
    response = conv.send_message("Hello, how are you?")
    print(response)
```

### CLI

```bash
pip install litert-lm-api
litert_lm_advanced_main --model_path=model.litertlm --backend=cpu --benchmark=true
```

## Source

Converted from `google/gemma-4-E2B-it` using `litert-torch-nightly` (0.9.0.dev20260403).

Conversion took ~8 minutes on Apple Silicon (M-series, 64 GB RAM).
