# Gemma 4 E2B on AXERA NPU

Ready-to-run deployment package for google/gemma-4-E2B-it on AX650 / NPU3.

- This release packages the w8a16 AXERA NPU runtime.
- Compatible with Pulsar2 5.2 and later.
- Includes the tokenizer/config files required at runtime.
- Includes compiled Gemma 4 text `.axmodel` files and Vision `.axmodel` files.
- Supports both text-only chat and single-image multimodal inference.
## Conversion References
If you need the original model files or want to rebuild the deployment artifacts, start with:
- Original Hugging Face model: google/gemma-4-E2B-it
- AXERA conversion and deployment workflow: AXERA-TECH/gemma-4-E2B-it.axera
## Supported Platform

- AX650 / NPU3

## Validated Devices

This package has been validated on the following AX650-based devices:

- AX650N Demo Board
- M4N-Dock (AXera-Pi Pro)
- M.2 Accelerator Card
## Performance

All measurements below were taken on AX650 / NPU3. TTFT stands for time to first token.

- w8a16: TTFT is approximately 1664 ms, with a decode throughput of approximately 10.44 tokens/s.
- w4a16: TTFT is approximately 1233.7 ms, with a decode throughput of approximately 15.22 tokens/s.

The packaged text runtime in this release is the w8a16 build located in `gemma_4_e2b_it_ax650n_axmodel`. The w4a16 numbers are provided for reference only.
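As a quick sanity check, the TTFT and decode-throughput figures above can be combined into a rough end-to-end latency estimate. The helper below is an illustrative sketch, not part of the package:

```python
# Rough end-to-end latency estimate from the w8a16 numbers above
# (TTFT ~1664 ms, decode ~10.44 tokens/s). Illustrative only; actual
# timings depend on prompt length and board load.

def estimate_latency_s(ttft_ms, tokens_per_s, new_tokens):
    """TTFT plus decode time for `new_tokens` generated tokens, in seconds."""
    return ttft_ms / 1000.0 + new_tokens / tokens_per_s

# ~256 new tokens with the packaged w8a16 build:
print(round(estimate_latency_s(1664, 10.44, 256), 1))  # prints 26.2
```

So a 256-token reply on the w8a16 build takes roughly 26 seconds end to end, dominated by decode time rather than TTFT.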
## Vision Encoder Latency

| Model | Resolution | Soft Tokens | Time (ms) |
|---|---|---|---|
| gemma4_vision_h336_w480_t70.axmodel | 336×480 | 70 | 87.966 |
| gemma4_vision_h480_w672_t140.axmodel | 480×672 | 140 | 258.329 |
| gemma4_vision_h672_w960_t280.axmodel | 672×960 | 280 | 750.429 |
## Package Layout

```
.
├── README.md
├── config.json
├── infer_axmodel.py
├── gradio_demo.py
├── assets/
├── gemma_4_e2b_it_tokenizer/
├── gemma_4_e2b_it_ax650n_axmodel/
├── vit_models/
└── utils/
```
The runtime scripts auto-detect the packaged directories above. If you keep this layout unchanged, you can run the examples below without passing extra path arguments.
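The auto-detection amounts to resolving the packaged directories relative to the script's own location, so the examples work from any working directory. A minimal sketch of that idea (hypothetical helper and names; infer_axmodel.py's actual internals may differ):

```python
from pathlib import Path

# Hypothetical sketch of default-path resolution; the packaged script's
# real logic may differ. Paths are anchored at the script file itself,
# so the layout is found regardless of the current working directory.
PACKAGE_ROOT = Path(__file__).resolve().parent

DEFAULTS = {
    "hf_model": PACKAGE_ROOT / "gemma_4_e2b_it_tokenizer",
    "axmodel_path": PACKAGE_ROOT / "gemma_4_e2b_it_ax650n_axmodel",
    "vit_model_path": PACKAGE_ROOT / "vit_models",
}

def resolve(arg_value, key):
    """Use an explicitly passed path if given, else the packaged default."""
    return Path(arg_value) if arg_value else DEFAULTS[key]
```

If you keep the layout unchanged, every `resolve(None, ...)` call lands on a packaged directory and no path flags are needed.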
## Runtime Requirements

Install the following packages on the AX board:

- pyaxengine
- transformers>=5.5.0
- numpy
- ml_dtypes
- pillow
- gradio (required only for the web demo)

If your board image ships with an older transformers stack, you can use a pure-Python overlay directory instead:

```
export PYTHONPATH=/path/to/your/gemma4_pydeps:$PYTHONPATH
```
## Quick Start

Enter the package directory on the board:

```
cd /path/to/your/gemma-4-E2B-it
```

### Text-Only Inference

Run the following command:

```
python3 infer_axmodel.py \
    --prompt "What is the capital of the United States?" \
    --max_new_tokens 256
```
A typical output looks like this:

```
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Init InferenceSession: 100%|██████████| 35/35 [00:02<00:00, 12.73it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
answer >> The capital of the United States is **Washington, D.C.**
```
### Multimodal Inference

Use the sample image shown below: assets/sample.png

Recommended profile: 70 soft tokens at 336×480.

```
python3 infer_axmodel.py \
    --image_path ./assets/sample.png \
    --prompt "Describe this image in detail." \
    --system_prompt "" \
    --max_new_tokens 1024
```
A typical output looks like this:

```
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Init InferenceSession: 100%|██████████| 35/35 [00:28<00:00, 1.22it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Model loaded successfully!
slice_indices: [0]
Slice prefill done: 0
answer >> This image is a vibrant, cartoon-style illustration of a **red crab**.

Here's a detailed description:

*   **Subject:** The central subject is a crab, depicted in a bright, saturated red color.
*   **Style:** The illustration is highly stylized and cartoonish, featuring thick outlines and exaggerated features, giving it a playful or energetic feel.
*   **Appearance of the Crab:**
    *   **Color:** The crab is predominantly bright red.
    *   **Body:** It has a segmented body typical of a crab, with visible claws and legs.
    *   **Claws (Chelipeds):** The claws are prominent and appear muscular. The crab is shown with its claws raised, suggesting action or excitement.
    *   **Eyes/Face:** It has a somewhat expressive face, though simplified.
*   **Composition:** The crab is positioned centrally and appears to be moving or posed dynamically.
*   **Background:** The background is plain white, which makes the red crab stand out sharply.
*   **Outline/Effect:** The illustration has a distinct, thick black outline, and there is a subtle white or light-colored outline effect around the edges, suggesting it might be a sticker, icon, or graphic element.

**Overall Impression:** The image is energetic, bold, and eye-catching, suitable for use as a mascot, icon, or graphic design element.
```
The package ships three Vision model profiles: the default 336×480 model plus two higher-resolution options:

| VIT file | Resolution | Soft tokens |
|---|---|---|
| vit_models/gemma4_vision_h336_w480_t70.axmodel | 336×480 | 70 |
| vit_models/gemma4_vision_h480_w672_t140.axmodel | 480×672 | 140 |
| vit_models/gemma4_vision_h672_w960_t280.axmodel | 672×960 | 280 |
To use a different profile, pass --vit_model_path explicitly. The runtime will infer the matching soft-token count from the filename:

```
python3 infer_axmodel.py \
    --image_path ./assets/sample.png \
    --prompt "Describe this image in detail." \
    --system_prompt "" \
    --vit_model_path ./vit_models/gemma4_vision_h480_w672_t140.axmodel \
    --max_new_tokens 256
```

```
python3 infer_axmodel.py \
    --image_path ./assets/sample.png \
    --prompt "Describe this image in detail." \
    --system_prompt "" \
    --vit_model_path ./vit_models/gemma4_vision_h672_w960_t280.axmodel \
    --max_new_tokens 1024
```
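The soft-token inference presumably keys off the `_t<count>` suffix in the VIT filename. A minimal sketch of that idea (hypothetical helper; the packaged runtime's actual parsing may differ):

```python
import re

def soft_tokens_from_filename(path):
    """Extract the soft-token count from a VIT filename such as
    'gemma4_vision_h480_w672_t140.axmodel'. Hypothetical helper; the
    packaged runtime's actual filename parsing may differ."""
    m = re.search(r"_t(\d+)\.axmodel$", path)
    if m is None:
        raise ValueError(f"cannot infer soft-token count from {path!r}")
    return int(m.group(1))

print(soft_tokens_from_filename("vit_models/gemma4_vision_h480_w672_t140.axmodel"))  # prints 140
```

Anchoring the pattern to the `.axmodel` extension keeps the height/width fields (`h480`, `w672`) from being misread as the token count.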
Example output with the 672×960 / t280 profile:

```
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Init InferenceSession: 100%|██████████| 35/35 [00:27<00:00, 1.29it/s]
[INFO] Using provider: AxEngineExecutionProvider
[INFO] Model type: 2 (triple core)
[INFO] Compiler version: 5.1-patch1-dirty ac823c48-dirty
Model loaded successfully!
[WARN] Image token block (group_id=0, pos 5-284) spans 3 prefill slices. Bidirectional attention within earlier slices is partial (chunked prefill limitation).
slice_indices: [0, 1, 2]
Slice prefill done: 0
Slice prefill done: 1
Slice prefill done: 2
answer >> This is a digital illustration of a cartoonish, anthropomorphic **red lobster**.

Here is a detailed description:

*   **Subject:** The central subject is a lobster, depicted in a vibrant, glossy red color.
*   **Style:** The illustration is rendered in a bold cartoon style, characterized by thick outlines and bright colors that give it a playful, energetic feel.
*   **Expression and Pose:** The lobster has a cheerful, confident expression with a wide, toothy smile. It is posed dynamically, as if flexing or striking a pose, with its claws raised.
    *   Its claws (chelipeds) are prominent and muscular, and one of them appears to be flexed.
    *   Its body is curved, suggesting motion.
*   **Details:** The lobster has visible antennae on its head. The overall design emphasizes its bright red color and gives it a strong, assertive personality.
*   **Outline and Background:** The character is outlined in black, which helps define its shape against the plain white background and makes the red lobster stand out prominently.
*   **Format:** The image resembles a sticker or clip-art graphic because of its clean, isolated presentation.

In summary, it is a cheerful, stylized, red cartoon lobster flexing its claws.
```
## Gradio Demo

```
python3 gradio_demo.py \
    --host 0.0.0.0 \
    --port 7860
```

After the server starts, open http://<board-ip>:7860 in your browser.
## Packaged Runtime Paths

The release package uses the following default paths:

- Tokenizer and config: ./gemma_4_e2b_it_tokenizer
- Text LLM axmodels: ./gemma_4_e2b_it_ax650n_axmodel
- Vision axmodels: ./vit_models

If you move any of these directories, pass the new values with --hf_model, --axmodel_path, and --vit_model_path.
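For example, if the directories were relocated to /data, an invocation might look like the following (a sketch with placeholder paths; adapt to your actual locations):

```shell
# Placeholder paths below (/data/...) are examples, not packaged defaults.
python3 infer_axmodel.py \
    --hf_model /data/gemma_4_e2b_it_tokenizer \
    --axmodel_path /data/gemma_4_e2b_it_ax650n_axmodel \
    --vit_model_path /data/vit_models/gemma4_vision_h336_w480_t70.axmodel \
    --prompt "What is the capital of the United States?" \
    --max_new_tokens 256
```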
## Notes

- `.axmodel` execution is board-only and is not supported on x86 hosts.
- The default multimodal profile uses 70 image soft tokens and matches the packaged 336×480 Vision model.
- The current text runtime package contains 35 decoder layers and kv_cache_len=2047.
- The packaged runtime already includes the embedding and per-layer weight files needed by Gemma 4. Original model.safetensors weights are not required for board-side inference.
- Files under assets/ are demo inputs for the inference examples.
## Discussion

- GitHub Issues
- QQ group: 139953715