# SemiticGPT-3B

A 3.14B-parameter multilingual language model trained from scratch for Hebrew, Arabic, Persian (Farsi), and English: a script-diverse, low-resource language cluster centered on Semitic languages.
## Model Details
| Property | Value |
|---|---|
| Parameters | 3.14B |
| Architecture | GPT (RoPE, SwiGLU, RMSNorm, fused QKV) |
| Vocab Size | 32,000 (custom multilingual SentencePiece BPE) |
| Max Seq Length | 2,048 |
| Pretraining Data | 4.48B tokens (HE 40%, AR 20%, FA 20%, EN 20%) |
| SFT Data | 36,980 samples (sentiment + translation) |
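The 3.14B figure can be sanity-checked against the architecture row. The sketch below estimates the parameter count of a GPT with fused QKV, SwiGLU, RMSNorm, and tied embeddings; the specific dimensions (d_model=2816, 32 layers, SwiGLU hidden size 7680) are hypothetical choices that happen to land near 3.14B, not values stated in this card:

```python
def gpt_param_count(d_model, n_layers, d_ff, vocab_size):
    """Rough parameter count for a GPT block with fused QKV,
    a SwiGLU MLP, RMSNorm, and tied input/output embeddings."""
    attn = 4 * d_model * d_model   # fused QKV (3*d^2) + output projection (d^2)
    mlp = 3 * d_model * d_ff       # SwiGLU: gate, up, and down projections
    norms = 2 * d_model            # two RMSNorm scales per layer
    embed = vocab_size * d_model   # tied embedding / LM head
    return n_layers * (attn + mlp + norms) + embed + d_model  # + final norm

# Hypothetical dimensions chosen only to land near 3.14B:
print(gpt_param_count(d_model=2816, n_layers=32, d_ff=7680, vocab_size=32_000))
```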
## Key Results

### Sentiment Classification (v4, clean balanced eval)
| Language | Base → SFT (logprob accuracy) | Generative |
|---|---|---|
| 🇮🇱 Hebrew | 53.0% → 84.5% | 82% |
| 🇸🇦 Arabic | 45.0% → 60.5% | 64% |
| 🇮🇷 Farsi | 60.5% → 78.5% | 74% |
| 🇺🇸 English | 51.5% → 73.0% | 64% |
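The logprob and generative columns reflect two evaluation modes: scoring each candidate label's continuation versus free-form generation. A minimal sketch of logprob label scoring, assuming a model callable that returns `(batch, seq, vocab)` logits (the helper name is illustrative, not the card's actual eval code):

```python
import torch
import torch.nn.functional as F

def label_logprob(model, prompt_ids, label_ids):
    """Sum the model's log-probabilities of `label_ids` appended to
    `prompt_ids`; the label with the higher total wins."""
    ids = torch.tensor([prompt_ids + label_ids])
    with torch.no_grad():
        logits = model(ids)                  # (1, seq, vocab)
    logprobs = F.log_softmax(logits, dim=-1)
    total = 0.0
    for i, tok in enumerate(label_ids):
        pos = len(prompt_ids) + i - 1        # logits at pos predict token pos+1
        total += logprobs[0, pos, tok].item()
    return total
```

Classification then reduces to comparing `label_logprob` over the tokenized label strings (e.g. "positive" vs. "negative").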
### Cross-lingual Transfer (Experiment B)

English-only SFT barely transfers to the non-English languages, showing that multilingual SFT is necessary in this setup:
| Language | Base | EN-SFT | Multi-SFT |
|---|---|---|---|
| Hebrew | 53.0% | 51.5% | 84.5% |
| Arabic | 45.0% | 46.5% | 60.5% |
| Farsi | 60.5% | 58.5% | 78.5% |
| English | 51.5% | 52.0% | 73.0% |
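The gap between the EN-SFT and Multi-SFT columns quantifies how little English-only tuning transfers. The per-language gains can be read directly off the table above:

```python
# Accuracy (%) from the cross-lingual transfer table.
results = {
    "Hebrew":  {"base": 53.0, "en_sft": 51.5, "multi_sft": 84.5},
    "Arabic":  {"base": 45.0, "en_sft": 46.5, "multi_sft": 60.5},
    "Farsi":   {"base": 60.5, "en_sft": 58.5, "multi_sft": 78.5},
    "English": {"base": 51.5, "en_sft": 52.0, "multi_sft": 73.0},
}

for lang, r in results.items():
    en_gain = r["en_sft"] - r["base"]        # English-only SFT vs. base
    multi_gain = r["multi_sft"] - r["base"]  # multilingual SFT vs. base
    print(f"{lang}: EN-SFT {en_gain:+.1f} pts, Multi-SFT {multi_gain:+.1f} pts")
```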
### Tokenizer Efficiency (Experiment C)

Our tokenizer uses 49–69% fewer tokens than Llama-2's for Hebrew, Arabic, and Farsi:
| Language | Ours (tok/byte) | Llama-2 (tok/byte) | Improvement |
|---|---|---|---|
| Hebrew | 0.195 | 0.569 | +65.6% |
| Arabic | 0.288 | 0.565 | +49.1% |
| Farsi | 0.175 | 0.561 | +68.8% |
| English | 0.270 | 0.264 | -2.2% |
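Tokens-per-byte is the token count divided by the UTF-8 byte length of the text, and the improvement column is the relative reduction versus the Llama-2 baseline. A sketch of both metrics (the function names are illustrative; small discrepancies with the table come from rounding the reported rates):

```python
def tokens_per_byte(n_tokens, text):
    """Fertility metric: tokens emitted per UTF-8 byte of input."""
    return n_tokens / len(text.encode("utf-8"))

def improvement_pct(ours, baseline):
    """Relative token reduction vs. a baseline tokenizer, in percent."""
    return (baseline - ours) / baseline * 100

# Hebrew row from the table: 0.195 (ours) vs. 0.569 (Llama-2).
print(f"{improvement_pct(0.195, 0.569):.1f}%")
```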
## Files

- `base_model.pt`: pretrained base model (no SFT)
- `sft_model_v4.pt`: fine-tuned model (v4, sentiment + translation)
- `multilingual_32k.model`: SentencePiece tokenizer
- `config.json`: model configuration
- `exp_ab_results.json`: Experiment A+B results
- `exp_c_tokenizer_ablation.json`: Experiment C results
## Usage
```python
import torch
import sentencepiece as spm

# Load tokenizer
sp = spm.SentencePieceProcessor(model_file='multilingual_32k.model')

# Load model (see model_arch.py for the architecture)
from model_arch import GPT

model = GPT()
state = torch.load('sft_model_v4.pt', map_location='cpu', weights_only=True)
model.load_state_dict(state['model_state_dict'])
model.eval()

# Generate. The Hebrew prompt reads: "Classify the sentiment of the
# following text (positive/negative): I love this book!"
prompt = "<|user|> סווג את הרגש של הטקסט הבא (חיובי/שלילי):\nאני אוהב את הספר הזה!\n<|assistant|> "
ids = sp.encode(prompt)
x = torch.tensor([ids])
with torch.no_grad():
    for _ in range(20):
        logits = model(x)
        next_id = logits[0, -1].argmax().item()
        if next_id == 2:  # EOS
            break
        x = torch.cat([x, torch.tensor([[next_id]])], dim=1)
print(sp.decode(x[0, len(ids):].tolist()))
# → חיובי ("positive")
```
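The loop above decodes greedily (argmax), which is deterministic. A temperature-sampling variant of the next-token step is sketched below; the helper name and default values are illustrative, not part of this repository:

```python
import torch

def sample_next(logits, temperature=0.8, top_k=50):
    """Sample a token id from last-position logits using
    temperature scaling and top-k filtering."""
    logits = logits / temperature
    topk = torch.topk(logits, top_k)             # keep the k most likely tokens
    probs = torch.softmax(topk.values, dim=-1)   # renormalize over the top-k
    idx = torch.multinomial(probs, 1)            # draw one token
    return topk.indices[idx].item()
```

In the generation loop, `next_id = sample_next(logits[0, -1])` would replace the argmax line.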
## Citation

Paper forthcoming.

## License

Apache 2.0