SemiticGPT-3B 🌍

A 3.04 billion parameter multilingual language model trained from scratch for Hebrew, Arabic, English, and Farsi, four languages spanning three scripts (Latin, Hebrew, Arabic).

Highlights

  • 3.04B parameters trained from scratch on ~50B tokens
  • Custom 32K multilingual BPE tokenizer optimized for script-diverse languages
  • Hebrew-anchored design: Hebrew as primary low-resource target with cross-lingual transfer
  • Budget-efficient: Trained on a single p4de.24xlarge
  • SFT variant included: Instruction-tuned with multilingual supervised data

Model Variants

Variant                  File                       Size     Description
Base (pretrained)        checkpoints/best_model.pt  11.7 GB  Best pretrained checkpoint (step 20,000)
SFT (instruction-tuned)  checkpoints/sft_model.pt   5.7 GB   Multilingual SFT on Hebrew, Arabic, English, Farsi data

Architecture

  • Type: GPT-2 style decoder-only transformer
  • Parameters: 3.04B
  • Layers: 32
  • Hidden dim: 2560
  • Attention heads: 32
  • Vocabulary: 32,000 (custom multilingual BPE)
  • Context length: 2048 tokens
  • Tokenizer: SentencePiece BPE trained on balanced multilingual corpus
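The hyperparameters above can be collected into a small configuration sketch. The field names below are illustrative, not the actual attribute names used in train_multilingual_3b_fsdp.py:

```python
from dataclasses import dataclass

@dataclass
class SemiticGPTConfig:
    # Values taken from the Architecture list above;
    # field names are illustrative assumptions.
    n_layer: int = 32         # transformer blocks
    n_embd: int = 2560        # hidden dimension
    n_head: int = 32          # attention heads
    vocab_size: int = 32_000  # custom multilingual BPE
    block_size: int = 2048    # context length

    @property
    def head_dim(self) -> int:
        # Per-head dimension: hidden dim split evenly across heads.
        return self.n_embd // self.n_head

cfg = SemiticGPTConfig()
print(cfg.head_dim)  # 80
```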

Training Data

Pretrained on ~50B tokens from:

  • CulturaX (Hebrew, Arabic, Farsi, English)
  • OSCAR (multilingual web crawl)
  • CC-100 (Common Crawl monolingual)
  • Dolma (English high-quality)

Language distribution weighted toward Hebrew as anchor language.
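The card does not publish the exact language weights. One common way to up-weight a low-resource anchor language is temperature-scaled sampling (p_i ∝ f_i^α with α < 1), sketched here with hypothetical token counts:

```python
def sampling_weights(token_counts, alpha=0.5):
    """Temperature-scaled sampling: p_i is proportional to count**alpha.
    alpha < 1 flattens the distribution, boosting low-resource languages."""
    scaled = {lang: n ** alpha for lang, n in token_counts.items()}
    total = sum(scaled.values())
    return {lang: s / total for lang, s in scaled.items()}

# Hypothetical raw token counts (NOT the actual SemiticGPT mixture).
counts = {"en": 30e9, "ar": 10e9, "fa": 6e9, "he": 4e9}
weights = sampling_weights(counts, alpha=0.5)
# Hebrew's sampling share rises above its 8% share of raw tokens.
print(weights["he"] > 4e9 / 50e9)  # True
```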

Tokenizer

Custom 32K vocabulary trained on balanced multilingual corpus:

Language  BPB (bits per byte, lower is better)
Hebrew    1.75 (best)
Farsi     3.14
Arabic    3.73
English   3.83

The tokenizer is specifically designed for script-diverse languages, avoiding the vocabulary dilution that occurs with large multilingual tokenizers.
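The per-language numbers above are reported in BPB (bits per byte). A sketch of how BPB is typically derived from a model's mean token-level negative log-likelihood (an assumed formula, not taken from this repo's eval scripts):

```python
import math

def bits_per_byte(nll_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean token NLL (in nats) to bits per byte.
    Total nats / ln(2) gives total bits; divide by the byte count of the raw text."""
    total_bits = nll_nats_per_token * n_tokens / math.log(2)
    return total_bits / n_bytes

# Toy example: 8 tokens covering 8 bytes, each costing ln(2) nats (= 1 bit):
print(bits_per_byte(math.log(2), 8, 8))  # 1.0
```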

Benchmark Results

Belebele (reading comprehension, 4-way multiple choice)

Language  Accuracy
English   31.8%
Hebrew    27.0%
Arabic    28.4%
Farsi     28.2%
Overall   28.9%

Note: The random baseline is 25%. This is a 3B model trained on a budget; performance is competitive relative to its scale.
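The overall score is consistent with a simple macro-average over the four languages:

```python
# Per-language Belebele accuracies from the table above.
belebele = {"English": 31.8, "Hebrew": 27.0, "Arabic": 28.4, "Farsi": 28.2}

# Unweighted macro-average across languages.
macro_avg = sum(belebele.values()) / len(belebele)
print(round(macro_avg, 2))  # 28.85, matching the reported overall 28.9%
```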

SFT Generation Quality

  • Hebrew: 🔥 Excellent, fluent and factual responses in domain-specific Hebrew
  • English: Coherent, factual
  • Farsi: Good, coherent
  • Arabic: Weak (data quality issue: machine-translated Alpaca)

Training Details

Pretraining

  • Hardware: 1× p4de.24xlarge (8× A100 80GB)
  • Framework: PyTorch FSDP
  • Steps: 20,000
  • Batch size: 512K tokens
  • Learning rate: 3e-4 (cosine decay)
  • Optimizer: AdamW
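A sketch of the cosine-decay schedule named above. The warmup length and floor learning rate are assumptions; the card only states a 3e-4 peak with cosine decay:

```python
import math

def cosine_lr(step, max_steps=20_000, max_lr=3e-4, warmup=500, min_lr=0.0):
    """Linear warmup followed by cosine decay to min_lr.
    warmup=500 and min_lr=0.0 are illustrative defaults, not from the card."""
    if step < warmup:
        return max_lr * step / warmup
    progress = (step - warmup) / (max_steps - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(cosine_lr(500))     # 0.0003 (peak at end of warmup)
print(cosine_lr(20_000))  # ~0.0 (fully decayed)
```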

SFT

  • Hardware: 1× g6e.xlarge (L40S 48GB)
  • Steps: 4,000 (best val_loss at step 1,600: 2.1164)
  • Data: ~27K Hebrew samples (native domain data) + Aya multilingual + translated Alpaca

Files

SemiticGPT/
├── checkpoints/
│   ├── best_model.pt          # Pretrained base model
│   └── sft_model.pt           # SFT instruction-tuned model
├── tokenizer/
│   ├── multilingual_32k.model # SentencePiece tokenizer
│   └── multilingual_32k.vocab # Vocabulary file
├── eval/
│   ├── belebele_3b_results.json
│   └── belebele_3b.log
├── training_scripts/
│   ├── train_multilingual_3b_fsdp.py
│   ├── train_sft_3b.py
│   └── prepare_sft_data_v2.py
└── README.md

Usage

import torch
import sentencepiece as spm

# Load tokenizer
sp = spm.SentencePieceProcessor()
sp.load("tokenizer/multilingual_32k.model")

# Load model (custom architecture; see training_scripts/)
# The model uses a custom GPT implementation, not HuggingFace AutoModel
checkpoint = torch.load("checkpoints/best_model.pt", map_location="cpu")
# See train_multilingual_3b_fsdp.py for model class definition
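Checkpoints saved under FSDP or torch.compile sometimes carry key prefixes such as module. or _orig_mod.; a defensive loader can strip them before calling load_state_dict. This is a generic sketch, not code from this repo:

```python
def strip_prefixes(state_dict, prefixes=("module.", "_orig_mod.")):
    """Remove common wrapper prefixes (DDP/FSDP/torch.compile) from checkpoint keys.
    The prefix list is an assumption about how the checkpoint was saved."""
    cleaned = {}
    for key, value in state_dict.items():
        for prefix in prefixes:
            if key.startswith(prefix):
                key = key[len(prefix):]
        cleaned[key] = value
    return cleaned

# Example: keys as saved from a compiled, wrapped model
raw = {"_orig_mod.transformer.wte.weight": 1, "lm_head.weight": 2}
print(strip_prefixes(raw))
# {'transformer.wte.weight': 1, 'lm_head.weight': 2}
```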

Known Limitations

  • Arabic generation is weak due to machine-translated SFT data. Native Arabic instruction data would significantly improve this.
  • Small scale: 3B parameters is modest by current standards. This is an efficiency-focused research model.
  • Custom architecture: Not directly compatible with HuggingFace AutoModel; requires the training script's model class.
  • Benchmark scores are baseline-level: The model is designed for research into efficient multilingual pretraining, not benchmark competition.

Citation

@misc{slasky2026semiticgpt,
  title={SemiticGPT: Efficient Multilingual Pretraining for Low-Resource Script-Diverse Languages},
  author={Slasky, Ronnen},
  year={2026},
  url={https://huggingface.co/Slasky/SemiticGPT}
}

License

Apache 2.0
