# SemiticGPT-3B

A 3.04 billion parameter multilingual language model trained from scratch for Hebrew, Arabic, English, and Farsi: four languages spanning three scripts (Latin, Hebrew, Arabic).

## Highlights
- 3.04B parameters trained from scratch on ~50B tokens
- Custom 32K multilingual BPE tokenizer optimized for script-diverse languages
- Hebrew-anchored design: Hebrew as primary low-resource target with cross-lingual transfer
- Budget-efficient: Trained on a single p4de.24xlarge
- SFT variant included: Instruction-tuned with multilingual supervised data
## Model Variants

| Variant | File | Size | Description |
|---|---|---|---|
| Base (pretrained) | checkpoints/best_model.pt | 11.7 GB | Best pretrained checkpoint (step 20,000) |
| SFT (instruction-tuned) | checkpoints/sft_model.pt | 5.7 GB | Multilingual SFT on Hebrew, Arabic, English, and Farsi data |
## Architecture
- Type: GPT-2 style decoder-only transformer
- Parameters: 3.04B
- Layers: 32
- Hidden dim: 2560
- Attention heads: 32
- Vocabulary: 32,000 (custom multilingual BPE)
- Context length: 2048 tokens
- Tokenizer: SentencePiece BPE trained on balanced multilingual corpus
## Training Data

Pretrained on ~50B tokens from:
- CulturaX (Hebrew, Arabic, Farsi, English)
- OSCAR (multilingual web crawl)
- CC-100 (Common Crawl monolingual)
- Dolma (English high-quality)
The language distribution is weighted toward Hebrew, the anchor language.
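The card does not say how the Hebrew-anchored weighting was implemented. One common approach in multilingual pretraining is temperature-scaled sampling over per-language corpus sizes; the sketch below is an illustration of that technique, and the token counts in it are invented, not the model's actual data mix.

```python
# Hypothetical sketch: temperature-scaled language sampling weights.
# The token counts below are illustrative only.
def sampling_weights(token_counts, alpha=0.3):
    """Return per-language sampling probabilities p_l proportional to f_l**alpha.

    alpha < 1 upsamples low-resource languages relative to their raw
    share of the corpus; alpha = 1 reproduces the natural distribution.
    """
    total = sum(token_counts.values())
    scaled = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    norm = sum(scaled.values())
    return {lang: w / norm for lang, w in scaled.items()}

counts = {"en": 30e9, "ar": 10e9, "fa": 6e9, "he": 4e9}  # illustrative
weights = sampling_weights(counts, alpha=0.3)
```

With `alpha=0.3`, Hebrew's sampling share rises well above its 8% natural share of this illustrative corpus, which is the effect a Hebrew-anchored mix is after.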
## Tokenizer

Custom 32K vocabulary trained on a balanced multilingual corpus. Fertility (tokens per whitespace word; lower is better):

| Language | Fertility (tokens/word) |
|---|---|
| Hebrew | 1.75 (best) |
| Farsi | 3.14 |
| Arabic | 3.73 |
| English | 3.83 |
The tokenizer is specifically designed for script-diverse languages, avoiding the vocabulary dilution that occurs with large multilingual tokenizers.
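Fertility as used above can be measured with a few lines of code. This is a minimal sketch; the commented usage assumes the repo's `tokenizer/multilingual_32k.model` file and the standard SentencePiece Python API.

```python
# Sketch: measure tokenizer fertility (tokens per whitespace-delimited word).
# Works with any callable mapping a string to a list of tokens.
def fertility(tokenize, texts):
    tokens = sum(len(tokenize(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words

# Usage with SentencePiece (path is the repo's tokenizer file):
#   import sentencepiece as spm
#   sp = spm.SentencePieceProcessor(model_file="tokenizer/multilingual_32k.model")
#   fertility(lambda t: sp.encode(t, out_type=str), hebrew_corpus)
```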
## Benchmark Results

### Belebele (reading comprehension, 4-way multiple choice)
| Language | Accuracy |
|---|---|
| English | 31.8% |
| Hebrew | 27.0% |
| Arabic | 28.4% |
| Farsi | 28.2% |
| Overall | 28.9% |
Note: the random baseline is 25%. This is a 3B model trained on a budget; performance is competitive relative to scale.
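The "Overall" row is consistent with an unweighted macro-average of the four per-language accuracies:

```python
# Check that the reported overall Belebele score is the unweighted mean
# of the per-language accuracies in the table above.
scores = {"English": 31.8, "Hebrew": 27.0, "Arabic": 28.4, "Farsi": 28.2}
overall = sum(scores.values()) / len(scores)
# overall = 28.85, matching the reported 28.9% up to rounding
```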
### SFT Generation Quality

- Hebrew: Excellent; fluent, factual responses in domain-specific Hebrew
- English: Coherent, factual
- Farsi: Good, coherent
- Arabic: Weak (data quality issue: machine-translated Alpaca)
## Training Details

### Pretraining

- Hardware: 1× p4de.24xlarge (8× A100 80GB)
- Framework: PyTorch FSDP
- Steps: 20,000
- Batch size: 512K tokens
- Learning rate: 3e-4 (cosine decay)
- Optimizer: AdamW
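The cosine-decay schedule can be sketched as follows. The card states only the peak learning rate (3e-4), the decay shape, and the step count; the warmup length and floor value below are illustrative assumptions.

```python
import math

def lr_at(step, total_steps=20_000, peak=3e-4, warmup=500, floor=0.0):
    """Cosine decay from `peak` to `floor` with linear warmup.

    `warmup` and `floor` are illustrative choices, not documented values.
    """
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total_steps - warmup)
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))
```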
### SFT

- Hardware: 1× g6e.xlarge (L40S 48GB)
- Steps: 4,000 (best val_loss at step 1,600: 2.1164)
- Data: ~27K Hebrew samples (native domain data) + Aya multilingual + translated Alpaca
## Files

```
SemiticGPT/
├── checkpoints/
│   ├── best_model.pt            # Pretrained base model
│   └── sft_model.pt             # SFT instruction-tuned model
├── tokenizer/
│   ├── multilingual_32k.model   # SentencePiece tokenizer
│   └── multilingual_32k.vocab   # Vocabulary file
├── eval/
│   ├── belebele_3b_results.json
│   └── belebele_3b.log
├── training_scripts/
│   ├── train_multilingual_3b_fsdp.py
│   ├── train_sft_3b.py
│   └── prepare_sft_data_v2.py
└── README.md
```
## Usage

```python
import torch
import sentencepiece as spm

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer/multilingual_32k.model")

# Load model weights (custom architecture, not HuggingFace AutoModel;
# see train_multilingual_3b_fsdp.py for the model class definition)
checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu")
```
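Since the checkpoint uses a custom model class, generation has to be wired up manually. A minimal greedy-decoding sketch, assuming the loaded model is a callable that maps an input-id tensor of shape `(batch, seq)` to logits of shape `(batch, seq, vocab)`; the actual forward signature and EOS id should be taken from `train_multilingual_3b_fsdp.py` and the tokenizer.

```python
import torch

def greedy_generate(model, input_ids, max_new_tokens=50, eos_id=2):
    """Greedy decoding loop for a decoder-only LM.

    `model` is assumed to return logits of shape (batch, seq, vocab);
    adapt to the real forward signature in train_multilingual_3b_fsdp.py.
    """
    ids = input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == eos_id:
            break
    return ids
```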
## Known Limitations
- Arabic generation is weak due to machine-translated SFT data. Native Arabic instruction data would significantly improve this.
- Small scale: 3B parameters is modest by current standards. This is an efficiency-focused research model.
- Custom architecture: Not directly compatible with HuggingFace AutoModel; it requires the training script's model class.
- Benchmark scores are baseline-level: The model is designed for research into efficient multilingual pretraining, not benchmark competition.
## Citation

```bibtex
@misc{slasky2026semiticgpt,
  title={SemiticGPT: Efficient Multilingual Pretraining for Low-Resource Script-Diverse Languages},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/SemiticGPT}
}
```
## License

Apache 2.0