---
license: apache-2.0
language:
- ar
- en
base_model:
- google/gemma-4-e2b-it
tags:
- gemma4
- image-text-to-text
- arabic
- english
- pruned
- vision
- multilingual
library_name: transformers
pipeline_tag: image-text-to-text
---

# Gemma 4 E2B — Arabic + English Vision (Pruned)

A pruned version of [google/gemma-4-e2b-it](https://huggingface.co/google/gemma-4-e2b-it) optimized for **Arabic and English** vision-language tasks. Tokens for unneeded languages and the audio encoder have been removed, yielding a smaller, more efficient model that retains full vision and text-generation capability.

## What was done

| Component | Original | Pruned |
|-----------|----------|--------|
| Vocabulary | 262,144 tokens | **209,836 tokens** (Arabic + English + special) |
| Vision encoder | ✅ 16 layers | ✅ **Kept** |
| Audio encoder | ✅ 12 layers | ❌ **Removed** |
| Decoder layers | 35 layers | 35 layers (untouched) |
| PLE tables | 262,144 × 256 × 35 | **209,836 × 256 × 35** |
| **Total size (bf16)** | **10.2 GB** | **8.5 GB** |
| **Size reduction** | — | **17%** |

## Token breakdown

- Arabic tokens kept: 8,460
- English/Latin tokens kept: 201,096
- Special tokens: 24
- Byte fallbacks: 256
- **Total kept: 209,836** (80.0% of original)
- Dropped: 52,308 (CJK, Cyrillic, Devanagari, Thai, etc.)
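The exact selection procedure is not published here, but counts like these typically come from a Unicode-script filter over the tokenizer vocabulary. A minimal sketch, assuming script-based filtering and the SentencePiece `<0xNN>` byte-fallback convention (the real criterion may differ):

```python
import unicodedata

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-e2b-it")

def keep_piece(piece: str) -> bool:
    # Byte fallbacks ("<0x00>".."<0xFF>") and special tokens ("<bos>", ...)
    # are always kept so arbitrary input can still be encoded.
    if piece.startswith("<") and piece.endswith(">"):
        return True
    # Keep a piece only if every alphabetic character is Arabic or Latin;
    # digits, punctuation, and the SentencePiece "▁" marker pass through.
    for ch in piece:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if not name.startswith(("ARABIC", "LATIN")):
                return False
    return True

kept = [p for p in tokenizer.get_vocab() if keep_piece(p)]
print(f"kept {len(kept)} / {len(tokenizer)} pieces")
```

## Why this exists

Gemma 4's Per-Layer Embedding (PLE) architecture stores a separate embedding table for each of its 35 decoder layers. With a 262K-token vocabulary, that is **4.7 GB of PLE tables alone**, nearly half of the 10.2 GB model. Pruning the vocabulary to Arabic + English cuts this to ~3.8 GB. The vision encoder is kept intact for image-understanding tasks.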
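The arithmetic behind those figures, at 2 bytes per bf16 value (a quick sanity check, not part of any released conversion code):

```python
def ple_gb(vocab_size: int, ple_dim: int = 256, layers: int = 35) -> float:
    """Total size of the per-layer embedding tables in decimal GB (bf16)."""
    return vocab_size * ple_dim * layers * 2 / 1e9

print(f"{ple_gb(262_144):.2f} GB")  # ~4.70 GB, the original PLE footprint
print(f"{ple_gb(209_836):.2f} GB")  # ~3.76 GB, the "~3.8 GB" after pruning
```

## Benchmark: Arabic text recognition (zero-shot, no fine-tuning)

Evaluated on [loay/ar_stage1_probe](https://huggingface.co/datasets/loay/ar_stage1_probe) (100 samples, seed=42):

| Model | Params | VRAM | CER ↓ | Exact match |
|-------|--------|------|-------|-------------|
| This model (pruned) | 4.25B | 8.5 GB | 0.1168 | 23/100 |
| google/gemma-4-e2b-it (original) | 5.10B | 10.2 GB | 0.1168 | 23/100 |

**Identical output** on all 100 samples: on this probe, pruning is lossless for Arabic and English text.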
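CER is character error rate: character-level edit distance divided by reference length (lower is better). A minimal scorer, assuming plain Levenshtein distance over raw strings with no text normalization (the probe's exact scoring setup is not specified here):

```python
def cer(reference: str, prediction: str) -> float:
    """Character error rate: Levenshtein distance / reference length."""
    # Classic dynamic-programming edit distance over characters.
    prev = list(range(len(prediction) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, p in enumerate(prediction, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != p)))  # substitution
        prev = curr
    return prev[-1] / max(len(reference), 1)

print(cer("مرحبا بالعالم", "مرحبا بالعلم"))  # one missing letter -> ~0.077
```

## Intended use

- Arabic + English vision-language tasks
- Document understanding
- Fine-tuning base for bilingual Arabic/English applications
- Any task where CJK/Cyrillic/Indic language support is not needed

## Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "ml-agent-explorers/gemma-4-e2b-arabic-english-vision",
    dtype="bfloat16",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(
    "ml-agent-explorers/gemma-4-e2b-arabic-english-vision"
)
```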
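For inference, the usual `transformers` chat-template flow for image-text-to-text models should apply. A sketch, continuing from the loading code above (the image URL and prompt are placeholders):

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/sign.png"},
            {"type": "text", "text": "اقرأ النص في هذه الصورة."},  # "Read the text in this image."
        ],
    }
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

## Base model

- [google/gemma-4-e2b-it](https://huggingface.co/google/gemma-4-e2b-it)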