---
license: apache-2.0
language:
- ar
- en
base_model:
- google/gemma-4-e2b-it
tags:
- gemma4
- image-text-to-text
- arabic
- english
- pruned
- vision
- multilingual
library_name: transformers
pipeline_tag: image-text-to-text
---

# Gemma 4 E2B – Arabic + English Vision (Pruned)

A pruned version of [google/gemma-4-e2b-it](https://huggingface.co/google/gemma-4-e2b-it) optimized for **Arabic and English** vision-language tasks. Tokens for unneeded languages and the audio encoder have been removed, resulting in a smaller, more efficient model that retains full vision and text-generation capabilities.

## What was done

| Component | Original | Pruned |
|-----------|----------|--------|
| Vocabulary | 262,144 tokens | **209,836 tokens** (Arabic + English + special) |
| Vision encoder | ✅ 16 layers | ✅ **Kept** |
| Audio encoder | ✅ 12 layers | ❌ **Removed** |
| Decoder layers | 35 layers | 35 layers (untouched) |
| PLE tables | 262,144 × 256 × 35 | **209,836 × 256 × 35** |
| **Total size (bf16)** | **10.2 GB** | **8.5 GB** |
| **Size reduction** | – | **17%** |

## Token breakdown

- Arabic tokens kept: 8,460
- English/Latin tokens kept: 201,096
- Special tokens: 24
- Byte fallbacks: 256
- **Total kept: 209,836** (80.0% of original)
- Dropped: 52,308 (CJK, Cyrillic, Devanagari, Thai, etc.)
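
The pruning script is not published with this card, but the selection logic can be sketched as a Unicode-script filter over the decoded vocabulary. A minimal sketch, assuming a keep rule of "ASCII/Latin or Arabic only"; the exact ranges and rules used for this model are assumptions:

```python
# Illustrative script filter; the real pruning criteria may differ.
ARABIC_RANGES = [
    (0x0600, 0x06FF),  # Arabic
    (0x0750, 0x077F),  # Arabic Supplement
    (0x08A0, 0x08FF),  # Arabic Extended-A
    (0xFB50, 0xFDFF),  # Arabic Presentation Forms-A
    (0xFE70, 0xFEFF),  # Arabic Presentation Forms-B
]

def is_arabic(ch: str) -> bool:
    return any(lo <= ord(ch) <= hi for lo, hi in ARABIC_RANGES)

def keep_token(token: str) -> bool:
    """Keep a token if every character is ASCII (English/Latin) or Arabic."""
    return all(ch.isascii() or is_arabic(ch) for ch in token)

print(keep_token("مرحبا"))  # True  -> kept (Arabic)
print(keep_token("hello"))  # True  -> kept (ASCII/Latin)
print(keep_token("你好"))    # False -> dropped (CJK)
```
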
## Why this exists

Gemma 4's Per-Layer Embedding (PLE) architecture stores a separate embedding table for each of its 35 decoder layers. With a 262K vocabulary, that's **4.7 GB of PLE tables alone** (nearly half the model). By pruning to Arabic + English only, we cut this to ~3.8 GB.
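
The arithmetic checks out from the dimensions above (256-dim PLE embeddings, 35 layers, 2 bytes per bf16 parameter):

```python
# Sanity-check the PLE table sizes quoted above (bf16 = 2 bytes/param).
def ple_gb(vocab_size: int, dim: int = 256, layers: int = 35) -> float:
    return vocab_size * dim * layers * 2 / 1e9

print(f"original PLE: {ple_gb(262_144):.1f} GB")  # 4.7 GB
print(f"pruned PLE:   {ple_gb(209_836):.1f} GB")  # 3.8 GB
```
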
The vision encoder is kept intact for image understanding tasks.

## Benchmark: Arabic text recognition (zero-shot, no fine-tuning)

Evaluated on [loay/ar_stage1_probe](https://huggingface.co/datasets/loay/ar_stage1_probe) (100 samples, seed=42):

| Model | Params | VRAM | CER ↓ | Exact Match |
|-------|--------|------|-------|-------------|
| This model (pruned) | 4.25B | 8.5 GB | 0.1168 | 23/100 |
| google/gemma-4-e2b-it (original) | 5.10B | 10.2 GB | 0.1168 | 23/100 |

**Identical output** on all 100 samples; on this benchmark, the pruning is lossless.
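
A sketch of the kind of evaluation loop behind these numbers; the dataset split and column names are assumptions, and `transcribe()` is a hypothetical helper that runs one image through the model (see the Usage section below):

```python
import random
from datasets import load_dataset
from jiwer import cer  # character error rate: edit distance / reference length

ds = load_dataset("loay/ar_stage1_probe", split="train")  # split assumed
indices = random.Random(42).sample(range(len(ds)), 100)   # 100 samples, seed=42

cer_scores, exact = [], 0
for i in indices:
    sample = ds[i]  # assumed columns: "image" (PIL image), "text" (reference)
    prediction = transcribe(sample["image"])  # hypothetical helper (see Usage)
    cer_scores.append(cer(sample["text"], prediction))
    exact += prediction == sample["text"]

print(f"CER: {sum(cer_scores) / len(cer_scores):.4f}   Exact: {exact}/100")
```
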
## Intended use

- Arabic + English vision-language tasks
- Document understanding
- Fine-tuning base for bilingual Arabic/English applications
- Any task where CJK/Cyrillic/Indic language support is not needed

## Usage

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the pruned checkpoint in bf16 and shard it across available devices.
model = AutoModelForImageTextToText.from_pretrained(
    "ml-agent-explorers/gemma-4-e2b-arabic-english-vision",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "ml-agent-explorers/gemma-4-e2b-arabic-english-vision"
)
```
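
Generation then follows the standard transformers chat-template flow; a minimal sketch in which the image path and prompt are placeholders:

```python
from PIL import Image

image = Image.open("document.png")  # placeholder path
messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "اقرأ النص في هذه الصورة."},  # "Read the text in this image."
]}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(
    output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))
```
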
## Base model

- [google/gemma-4-e2b-it](https://huggingface.co/google/gemma-4-e2b-it)