SemiticGPT-3B 🌍

A 3.04 billion parameter multilingual language model trained from scratch for Hebrew, Arabic, English, and Farsi, four languages spanning three scripts (Latin, Hebrew, Arabic).

Highlights

  • 3.04B parameters trained from scratch on ~50B tokens
  • Custom 32K multilingual BPE tokenizer optimized for script-diverse languages
  • Hebrew-anchored design: Hebrew as primary low-resource target with cross-lingual transfer
  • Budget-efficient: Trained on a single p4de.24xlarge
  • SFT variant included: Instruction-tuned with multilingual supervised data

Model Variants

Variant                  File                       Size     Description
Base (pretrained)        checkpoints/best_model.pt  11.7 GB  Best pretrained checkpoint (step 20,000)
SFT (instruction-tuned)  checkpoints/sft_model.pt   5.7 GB   Multilingual SFT on Hebrew, Arabic, English, Farsi data

Architecture

  • Type: GPT-2 style decoder-only transformer
  • Parameters: 3.04B
  • Layers: 32
  • Hidden dim: 2560
  • Attention heads: 32
  • Vocabulary: 32,000 (custom multilingual BPE)
  • Context length: 2048 tokens
  • Tokenizer: SentencePiece BPE trained on balanced multilingual corpus
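The hyperparameters above can be collected into a small configuration sketch. The field names below are illustrative, not the actual attribute names used in train_multilingual_3b_fsdp.py:

```python
from dataclasses import dataclass

@dataclass
class SemiticGPTConfig:
    # Values taken from the Architecture list above;
    # field names are illustrative assumptions.
    n_layer: int = 32         # transformer blocks
    n_embd: int = 2560        # hidden dimension
    n_head: int = 32          # attention heads
    vocab_size: int = 32_000  # custom multilingual BPE
    block_size: int = 2048    # context length

    @property
    def head_dim(self) -> int:
        # Per-head dimension: hidden dim split evenly across heads.
        return self.n_embd // self.n_head

cfg = SemiticGPTConfig()
print(cfg.head_dim)  # 80
```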

Training Data

Pretrained on ~50B tokens from:

  • CulturaX (Hebrew, Arabic, Farsi, English)
  • OSCAR (multilingual web crawl)
  • CC-100 (Common Crawl monolingual)
  • Dolma (English high-quality)

Language distribution weighted toward Hebrew as anchor language.
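The card does not publish the exact language weights. One common way to up-weight a low-resource anchor language is temperature-scaled sampling (p_i ∝ f_i^α with α < 1), sketched here with hypothetical token counts:

```python
def sampling_weights(token_counts, alpha=0.5):
    """Temperature-scaled sampling: p_i is proportional to count**alpha.
    alpha < 1 flattens the distribution, boosting low-resource languages."""
    scaled = {lang: n ** alpha for lang, n in token_counts.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}

# Hypothetical raw token counts (NOT the actual SemiticGPT mixture).
counts = {"en": 30e9, "ar": 10e9, "fa": 6e9, "he": 4e9}
weights = sampling_weights(counts, alpha=0.5)
# Hebrew's sampling share rises above its 8% share of raw tokens.
print(weights["he"] > 4e9 / 50e9)  # True
```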

Tokenizer

Custom 32K vocabulary trained on balanced multilingual corpus:

Language  BPB (bits per byte, lower is better)
Hebrew    1.75 (best)
Farsi     3.14
Arabic    3.73
English   3.83

The tokenizer is specifically designed for script-diverse languages, avoiding the vocabulary dilution that occurs with large multilingual tokenizers.
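The per-language numbers above are reported in BPB (bits per byte). A sketch of how BPB is typically derived from a model's mean token-level negative log-likelihood (an assumed formula, not taken from this repo's eval scripts):

```python
import math

def bits_per_byte(nll_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean token NLL (in nats) to bits per byte.
    Total nats / ln(2) gives total bits; divide by the byte count of the raw text."""
    total_bits = nll_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes

# Toy example: 8 tokens covering 8 bytes, each costing ln(2) nats (= 1 bit):
print(bits_per_byte(math.log(2), 8, 8))  # 1.0
```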

Benchmark Results

Belebele (reading comprehension, 4-way multiple choice)

Language  Accuracy
English   31.8%
Hebrew    27.0%
Arabic    28.4%
Farsi     28.2%
Overall   28.9%

Note: The random baseline is 25%. This is a 3B model trained on a budget; performance is competitive relative to its scale.
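The overall score is consistent with a simple macro-average over the four languages:

```python
# Per-language Belebele accuracies from the table above.
belebele = {"English": 31.8, "Hebrew": 27.0, "Arabic": 28.4, "Farsi": 28.2}

# Unweighted macro-average across languages.
macro_avg = sum(belebele.values()) / len(belebele)
print(round(macro_avg, 2))  # 28.85, matching the reported overall 28.9%
```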

SFT Generation Quality

  • Hebrew: 🔥 Excellent, fluent and factual responses in domain-specific Hebrew
  • English: Coherent, factual
  • Farsi: Good, coherent
  • Arabic: Weak (data quality issue: machine-translated Alpaca)

Training Details

Pretraining

  • Hardware: 1× p4de.24xlarge (8× A100 80GB)
  • Framework: PyTorch FSDP
  • Steps: 20,000
  • Batch size: 512K tokens
  • Learning rate: 3e-4 (cosine decay)
  • Optimizer: AdamW
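A sketch of the cosine-decay schedule named above. The warmup length and floor learning rate are assumptions; the card only states a 3e-4 peak with cosine decay:

```python
import math

def cosine_lr(step, max_steps=20_000, max_lr=3e-4, warmup=500, min_lr=0.0):
    """Linear warmup followed by cosine decay to min_lr.
    warmup=500 and min_lr=0.0 are illustrative defaults, not from the card."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(500))     # 0.0003 (peak at end of warmup)
print(cosine_lr(20_000))  # ~0.0 (fully decayed)
```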

SFT

  • Hardware: 1× g6e.xlarge (L40S 48GB)
  • Steps: 4,000 (best val_loss at step 1,600: 2.1164)
  • Data: ~27K Hebrew samples (native domain data) + Aya multilingual + translated Alpaca

Files

SemiticGPT/
├── checkpoints/
│   ├── best_model.pt          # Pretrained base model
│   └── sft_model.pt           # SFT instruction-tuned model
├── tokenizer/
│   ├── multilingual_32k.model # SentencePiece tokenizer
│   └── multilingual_32k.vocab # Vocabulary file
├── eval/
│   ├── belebele_3b_results.json
│   └── belebele_3b.log
├── training_scripts/
│   ├── train_multilingual_3b_fsdp.py
│   ├── train_sft_3b.py
│   └── prepare_sft_data_v2.py
└── README.md

Usage

import torch
import sentencepiece as spm

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer/multilingual_32k.model")

# Load model (custom architecture; see training_scripts/)
# The model uses a custom GPT implementation, not HuggingFace AutoModel
checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu")
# See train_multilingual_3b_fsdp.py for model class definition
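Checkpoints saved under FSDP or torch.compile sometimes carry key prefixes such as module. or _orig_mod.; a defensive loader can strip them before calling load_state_dict. This is a generic sketch, not code from this repo:

```python
def strip_prefixes(state_dict, prefixes=("module.", "_orig_mod.")):
    """Remove common wrapper prefixes (DDP/FSDP/torch.compile) from checkpoint keys.
    The prefix list is an assumption about how the checkpoint was saved."""
    cleaned = {}
    for key, value in state_dict.items():
        for prefix in prefixes:
            if key.startswith(prefix):
                key = key[len(prefix):]
        cleaned[key] = value
    return cleaned

# Example: keys as saved from a compiled, wrapped model
raw = {"_orig_mod.transformer.wte.weight": 1, "lm_head.weight": 2}
print(strip_prefixes(raw))
# {'transformer.wte.weight': 1, 'lm_head.weight': 2}
```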

Known Limitations

  • Arabic generation is weak due to machine-translated SFT data. Native Arabic instruction data would significantly improve this.
  • Small scale: 3B parameters is modest by current standards. This is an efficiency-focused research model.
  • Custom architecture: Not directly compatible with HuggingFace AutoModel; requires the training script's model class.
  • Benchmark scores are baseline-level: The model is designed for research into efficient multilingual pretraining, not benchmark competition.

Citation

@misc{slasky2026semiticgpt,
  title={SemiticGPT: Efficient Multilingual Pretraining for Low-Resource Script-Diverse Languages},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/SemiticGPT}
}

License

Apache 2.0
