---
license: apache-2.0
language:
  - he
  - ar
  - en
  - fa
tags:
  - multilingual
  - hebrew
  - arabic
  - farsi
  - persian
  - semitic
  - gpt
  - causal-lm
  - low-resource
  - efficient-training
datasets:
  - CulturaX
  - OSCAR
  - CC-100
  - allenai/dolma
model-index:
  - name: SemiticGPT-3B
    results:
      - task:
          type: text-generation
        dataset:
          type: facebook/belebele
          name: Belebele
        metrics:
          - type: accuracy
            name: English
            value: 31.8
          - type: accuracy
            name: Hebrew
            value: 27.0
          - type: accuracy
            name: Arabic
            value: 28.4
          - type: accuracy
            name: Farsi
            value: 28.2
---

# SemiticGPT-3B 🌍

A 3.04-billion-parameter multilingual language model trained from scratch for **Hebrew, Arabic, English, and Farsi** — four languages spanning three scripts (Latin, Hebrew, Arabic).

## Highlights

- **3.04B parameters** trained from scratch on ~50B tokens
- **Custom 32K multilingual BPE tokenizer** optimized for script-diverse languages
- **Hebrew-anchored design**: Hebrew as the primary low-resource target, with cross-lingual transfer from the other three languages
- **Budget-efficient**: trained on a single p4de.24xlarge instance
- **SFT variant included**: instruction-tuned with multilingual supervised data

## Model Variants

| Variant | File | Size | Description |
|---------|------|------|-------------|
| Base (pretrained) | `checkpoints/best_model.pt` | 11.7 GB | Best pretrained checkpoint (step 20,000) |
| SFT (instruction-tuned) | `checkpoints/sft_model.pt` | 5.7 GB | Multilingual SFT on Hebrew, Arabic, English, and Farsi data |

## Architecture

- **Type**: GPT-2 style decoder-only transformer
- **Parameters**: 3.04B
- **Layers**: 32
- **Hidden dim**: 2560
- **Attention heads**: 32
- **Vocabulary**: 32,000 (custom multilingual BPE)
- **Context length**: 2048 tokens
- **Tokenizer**: SentencePiece BPE trained on a balanced multilingual corpus

## Training Data

Pretrained on ~50B tokens from:

- **CulturaX** (Hebrew, Arabic, Farsi, English)
- **OSCAR** (multilingual web crawl)
- **CC-100** (Common Crawl monolingual)
- **Dolma** (high-quality English)

The language distribution is weighted toward Hebrew as the anchor language.

## Tokenizer

Custom 32K vocabulary trained on a balanced multilingual corpus:

| Language | Fertility (tokens/word) |
|----------|-------------------------|
| Hebrew | 1.75 (best) |
| Farsi | 3.14 |
| Arabic | 3.73 |
| English | 3.83 |

The tokenizer is specifically designed for script-diverse languages, avoiding the vocabulary dilution that occurs when a fixed-size vocabulary is spread across the many languages of a large multilingual tokenizer.

## Benchmark Results

### Belebele (reading comprehension, 4-way multiple choice)

| Language | Accuracy |
|----------|----------|
| English | 31.8% |
| Hebrew | 27.0% |
| Arabic | 28.4% |
| Farsi | 28.2% |
| **Overall** | **28.9%** |

*Note: the random baseline is 25%. This is a 3B model trained on a budget; scores are modest but above chance, and should be read relative to scale.*

### SFT Generation Quality

- **Hebrew**: 🔥 Excellent — fluent, factual responses in domain-specific Hebrew
- **English**: coherent and factual
- **Farsi**: good, coherent
- **Arabic**: weak (data quality issue: machine-translated Alpaca)

## Training Details

### Pretraining

- **Hardware**: 1× p4de.24xlarge (8× A100 80 GB)
- **Framework**: PyTorch FSDP
- **Steps**: 20,000
- **Batch size**: 512K tokens
- **Learning rate**: 3e-4 (cosine decay)
- **Optimizer**: AdamW

### SFT

- **Hardware**: 1× g6e.xlarge (L40S 48 GB)
- **Steps**: 4,000 (best val_loss 2.1164 at step 1,600)
- **Data**: ~27K Hebrew samples (native domain data) + Aya multilingual + translated Alpaca

## Files

```
SemiticGPT/
├── checkpoints/
│   ├── best_model.pt            # Pretrained base model
│   └── sft_model.pt             # SFT instruction-tuned model
├── tokenizer/
│   ├── multilingual_32k.model   # SentencePiece tokenizer
│   └── multilingual_32k.vocab   # Vocabulary file
├── eval/
│   ├── belebele_3b_results.json
│   └── belebele_3b.log
├── training_scripts/
│   ├── train_multilingual_3b_fsdp.py
│   ├── train_sft_3b.py
│   └── prepare_sft_data_v2.py
└── README.md
```

## Usage

```python
import torch
import sentencepiece as spm

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer/multilingual_32k.model")

# Quick tokenizer round-trip check
ids = sp.encode("שלום עולם")
print(sp.decode(ids))

# Load model weights (custom architecture — see training_scripts/).
# The model uses a custom GPT implementation, not HuggingFace AutoModel;
# on PyTorch >= 2.6, pass weights_only=False to load the full checkpoint dict.
checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu", weights_only=False)

# See train_multilingual_3b_fsdp.py for the model class definition.
```

## Known Limitations

- **Weak Arabic generation**, due to machine-translated SFT data; native Arabic instruction data would significantly improve this.
- **Small scale**: 3B parameters is modest by current standards; this is an efficiency-focused research model.
- **Custom architecture**: not directly compatible with HuggingFace `AutoModel`; requires the model class from the training scripts.
- **Baseline-level benchmark scores**: the model is designed for research into efficient multilingual pretraining, not benchmark competition.

## Citation

```bibtex
@misc{slasky2026semiticgpt,
  title={SemiticGPT: Efficient Multilingual Pretraining for Low-Resource Script-Diverse Languages},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/SemiticGPT}
}
```

## License

Apache 2.0
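The fertility figures in the tokenizer table above are tokens per word (lower means the tokenizer fragments words less). As a minimal sketch of how the metric is computed: the `chunk2` tokenizer below is a toy stand-in for illustration only, not the actual SentencePiece model.

```python
def fertility(tokenize, text):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    n_tokens = sum(len(tokenize(w)) for w in words)
    return n_tokens / len(words)

# Toy tokenizer: splits each word into 2-character chunks.
# (A stand-in for sp.encode from the real SentencePiece model.)
def chunk2(word):
    return [word[i:i + 2] for i in range(0, len(word), 2)]

print(fertility(chunk2, "the quick brown fox"))  # 10 tokens / 4 words -> 2.5
```

With the real tokenizer, `tokenize` would be `sp.encode` and `text` a held-out sample for each language.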