# Gemma 4 E2B on AXERA NPU

Ready-to-run deployment package for google/gemma-4-E2B-it on AX650 / NPU3.

- This release packages the w8a16 AXERA NPU runtime.
- Compatible with Pulsar2 5.2 and later.
- Includes the tokenizer/config files required at runtime.
- Includes compiled Gemma 4 text `.axmodel` files and Vision `.axmodel` files.
- Supports both text-only chat and single-image multimodal inference.
## Conversion References
If you need the original model files or want to rebuild the deployment artifacts, start with:
- Original Hugging Face model: google/gemma-4-E2B-it
- AXERA conversion and deployment workflow: AXERA-TECH/gemma-4-E2B-it.axera
## Supported Platform

- AX650 / NPU3

## Validated Devices

This package has been validated on the following AX650-based devices:

- AX650N Demo Board
- M4N-Dock (AXera-Pi Pro)
- M.2 Accelerator Card
## Performance

All measurements below were taken on AX650 / NPU3. TTFT stands for time to first token.

- w8a16: TTFT is approximately 1664 ms, with a decode throughput of approximately 10.44 tokens/s.
- w4a16: TTFT is approximately 1233.7 ms, with a decode throughput of approximately 15.22 tokens/s.

The packaged text runtime in this release is the w8a16 build located in `gemma_4_e2b_it_ax650n_axmodel`. The w4a16 numbers are provided for reference only.
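As a quick sanity check, the TTFT and decode-throughput figures above can be combined into a rough end-to-end latency estimate. The helper below is an illustrative sketch, not part of the package:

```python
# Rough end-to-end latency estimate from the w8a16 numbers above
# (TTFT ~1664 ms, decode ~10.44 tokens/s). Illustrative only; actual
# timings depend on prompt length and board load.

def estimate_latency_s(ttft_ms, tokens_per_s, new_tokens):
    """TTFT plus decode time for `new_tokens` generated tokens, in seconds."""
    return ttft_ms / 1000.0 + new_tokens / tokens_per_s

# ~256 new tokens with the packaged w8a16 build:
print(round(estimate_latency_s(1664, 10.44, 256), 1))  # prints 26.2
```

So a 256-token reply on the w8a16 build takes roughly 26 seconds end to end, dominated by decode time rather than TTFT.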
## Vision Encoder Latency

| Model | Resolution | Soft Tokens | Time (ms) |
|---|---|---|---|
| gemma4_vision_h336_w480_t70.axmodel | 336×480 | 70 | 87.966 |
| gemma4_vision_h480_w672_t140.axmodel | 480×672 | 140 | 258.329 |
| gemma4_vision_h672_w960_t280.axmodel | 672×960 | 280 | 750.429 |
## Package Layout

```
.
├── README.md
├── config.json
├── infer_axmodel.py
├── gradio_demo.py
├── assets/
├── gemma_4_e2b_it_tokenizer/
├── gemma_4_e2b_it_ax650n_axmodel/
├── vit_models/
└── utils/
```
The runtime scripts auto-detect the packaged directories above. If you keep this layout unchanged, you can run the examples below without passing extra path arguments.
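The auto-detection amounts to resolving the packaged directories relative to the script's own location, so the examples work from any working directory. A minimal sketch of that idea (hypothetical helper and names; infer_axmodel.py's actual internals may differ):

```python
from pathlib import Path

# Hypothetical sketch of default-path resolution; the packaged script's
# real logic may differ. Paths are anchored at the script file itself,
# so the layout is found regardless of the current working directory.
PACKAGE_ROOT = Path(__file__).resolve().parent

DEFAULTS = {
    "hf_model": PACKAGE_ROOT / "gemma_4_e2b_it_tokenizer",
    "axmodel_path": PACKAGE_ROOT / "gemma_4_e2b_it_ax650n_axmodel",
    "vit_model_path": PACKAGE_ROOT / "vit_models",
}

def resolve(arg_value, key):
    """Use an explicitly passed path if given, else the packaged default."""
    return Path(arg_value) if arg_value else DEFAULTS[key]
```

If you keep the layout unchanged, every `resolve(None, ...)` call lands on a packaged directory and no path flags are needed.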
## Runtime Requirements

Install the following packages on the AX board:

- pyaxengine
- transformers>=5.5.0
- numpy
- ml_dtypes
- pillow
- gradio (required only for the web demo)

If your board image ships with an older transformers stack, you can use a pure-Python overlay directory instead:

```
export PYTHONPATH=/path/to/your/gemma4_pydeps:$PYTHONPATH
```
## Quick Start

Enter the package directory on the board:

```
cd /path/to/your/gemma-4-E2B-it
```

### Text-Only Inference

Run the following command:

```
python3 infer_axmodel.py \
    --prompt "What is the capital of the United States?" \
    --max_new_tokens 256
```
A typical output looks like this:

```
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Init InferenceSession: 100%|██████████| 35/35 [00:02<00:00, 12.73it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
answer >> The capital of the United States is **Washington, D.C.**
```
### Multimodal Inference

Use the sample image shown below: assets/sample.png

Recommended profile: 70 soft tokens at 336×480.

```
python3 infer_axmodel.py \
    --image_path ./assets/sample.png \
    --prompt "Describe this image in detail." \
    --system_prompt "" \
    --max_new_tokens 1024
```
A typical output looks like this:

```
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Init InferenceSession: 100%|██████████| 35/35 [00:28<00:00, 1.22it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
answer >> This image is a vibrant, cartoon-style illustration of a **red crab**.

Here's a detailed description:

*   **Subject:** The central subject is a crab, depicted in a bright, saturated red color.
*   **Style:** The illustration is highly stylized and cartoonish, featuring thick outlines and exaggerated features, giving it a playful or energetic feel.
*   **Appearance of the Crab:**
    *   **Color:** The crab is predominantly bright red.
    *   **Body:** It has a segmented body typical of a crab, with visible claws and legs.
    *   **Claws (Chelipeds):** The claws are prominent and appear muscular. The crab is shown with its claws raised, suggesting action or excitement.
    *   **Eyes/Face:** It has a somewhat expressive face, though simplified.
*   **Composition:** The crab is positioned centrally and appears to be moving or posed dynamically.
*   **Background:** The background is plain white, which makes the red crab stand out sharply.
*   **Outline/Effect:** The illustration has a distinct, thick black outline, and there is a subtle white or light-colored outline effect around the edges, suggesting it might be a sticker, icon, or graphic element.

**Overall Impression:** The image is energetic, bold, and eye-catching, suitable for use as a mascot, icon, or graphic design element.
```
The package ships three Vision model profiles: the default 336×480 model plus two higher-resolution options:

| VIT file | Resolution | Soft tokens |
|---|---|---|
| vit_models/gemma4_vision_h336_w480_t70.axmodel | 336×480 | 70 |
| vit_models/gemma4_vision_h480_w672_t140.axmodel | 480×672 | 140 |
| vit_models/gemma4_vision_h672_w960_t280.axmodel | 672×960 | 280 |
To use a different profile, pass --vit_model_path explicitly. The runtime will infer the matching soft-token count from the filename:

```
python3 infer_axmodel.py \
    --image_path ./assets/sample.png \
    --prompt "Describe this image in detail." \
    --system_prompt "" \
    --vit_model_path ./vit_models/gemma4_vision_h480_w672_t140.axmodel \
    --max_new_tokens 256
```

```
python3 infer_axmodel.py \
    --image_path ./assets/sample.png \
    --prompt "Describe this image in detail." \
    --system_prompt "" \
    --vit_model_path ./vit_models/gemma4_vision_h672_w960_t280.axmodel \
    --max_new_tokens 1024
```
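The soft-token inference presumably keys off the `_t<count>` suffix in the VIT filename. A minimal sketch of that idea (hypothetical helper; the packaged runtime's actual parsing may differ):

```python
import re

def soft_tokens_from_filename(path):
    """Extract the soft-token count from a VIT filename such as
    'gemma4_vision_h480_w672_t140.axmodel'. Hypothetical helper; the
    packaged runtime's actual filename parsing may differ."""
    m = re.search(r"_t(\d+)\.axmodel$", path)
    if m is None:
        raise ValueError(f"cannot infer soft-token count from {path!r}")
    return int(m.group(1))

print(soft_tokens_from_filename("vit_models/gemma4_vision_h480_w672_t140.axmodel"))  # prints 140
```

Anchoring the pattern to the `.axmodel` extension keeps the height/width fields (`h480`, `w672`) from being misread as the token count.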
Example output with the 672×960 / t280 profile:

```
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Init InferenceSession: 100%|██████████| 35/35 [00:27<00:00, 1.29it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Model loaded successfully!
[WARN] Image token block (group_id=0, pos 5-284) spans 3 prefill slices. Bidirectional attention within earlier slices is partial (chunked prefill limitation).
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> This is a digital illustration of a cartoonish, anthropomorphic **red lobster**.

Here is a detailed description:

*   **Subject:** The central subject is a lobster, depicted in a vibrant, glossy red color.
*   **Style:** The illustration is rendered in a bold cartoon style, characterized by thick outlines and bright colors that give it a playful, energetic feel.
*   **Expression and Pose:** The lobster has a cheerful, confident expression with a wide, toothy smile. It is posed dynamically, as if flexing or striking a pose, with its claws raised.
    *   Its claws (chelipeds) are prominent and muscular, and one of them appears to be flexed.
    *   Its body is curved, suggesting motion.
*   **Details:** The lobster has visible antennae on its head. The overall design emphasizes its bright red color and gives it a strong, assertive personality.
*   **Outline and Background:** The character is outlined in black, which helps define its shape against the plain white background and makes the red lobster stand out prominently.
*   **Format:** The image resembles a sticker or clip-art graphic because of its clean, isolated presentation.

In summary, it is a cheerful, stylized, red cartoon lobster flexing its claws.
```
## Gradio Demo

```
python3 gradio_demo.py \
    --host 0.0.0.0 \
    --port 7860
```

After the server starts, open http://<board-ip>:7860 in your browser.
## Packaged Runtime Paths

The release package uses the following default paths:

- Tokenizer and config: ./gemma_4_e2b_it_tokenizer
- Text LLM axmodels: ./gemma_4_e2b_it_ax650n_axmodel
- Vision axmodels: ./vit_models

If you move any of these directories, pass the new values with --hf_model, --axmodel_path, and --vit_model_path.
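For example, if the directories were relocated to /data, an invocation might look like the following (a sketch with placeholder paths; adapt to your actual locations):

```shell
# Placeholder paths below (/data/...) are examples, not packaged defaults.
python3 infer_axmodel.py \
    --hf_model /data/gemma_4_e2b_it_tokenizer \
    --axmodel_path /data/gemma_4_e2b_it_ax650n_axmodel \
    --vit_model_path /data/vit_models/gemma4_vision_h336_w480_t70.axmodel \
    --prompt "What is the capital of the United States?" \
    --max_new_tokens 256
```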
## Notes

- `.axmodel` execution is board-only and is not supported on x86 hosts.
- The default multimodal profile uses 70 image soft tokens and matches the packaged 336×480 Vision model.
- The current text runtime package contains 35 decoder layers and kv_cache_len=2047.
- The packaged runtime already includes the embedding and per-layer weight files needed by Gemma 4. Original model.safetensors weights are not required for board-side inference.
- Files under assets/ are demo inputs for the inference examples.
## Discussion

- GitHub Issues
- QQ group: 139953715