SemiticGPT-3B

A 3.14B parameter multilingual language model trained from scratch for Hebrew, Arabic, Persian (Farsi), and English — a script-diverse, low-resource language cluster centered on Semitic languages.

Model Details

| Property | Value |
|---|---|
| Parameters | 3.14B |
| Architecture | GPT (RoPE, SwiGLU, RMSNorm, fused QKV) |
| Vocab Size | 32,000 (custom multilingual SentencePiece BPE) |
| Max Seq Length | 2,048 |
| Pretraining Data | 4.48B tokens (HE 40%, AR 20%, FA 20%, EN 20%) |
| SFT Data | 36,980 samples (sentiment + translation) |
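
The architecture column lists the now-standard GPT refinements. As a rough illustration only (this is not the actual `model_arch.py`, and the function names here are made up), RMSNorm and a SwiGLU feed-forward can be sketched in NumPy:

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square of the features,
    # with a learned per-feature gain (no mean subtraction, no bias).
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def swiglu(x, W_gate, W_up, W_down):
    # SwiGLU feed-forward: a SiLU-gated linear unit, then a down projection.
    gate = x @ W_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(g) = g * sigmoid(g)
    return (silu * (x @ W_up)) @ W_down

x = np.array([[3.0, 4.0]])
print(rms_norm(x, gain=np.ones(2)))  # ~[[0.8485, 1.1314]]
```

RMSNorm drops LayerNorm's mean-centering and bias, which is cheaper and empirically just as stable; SwiGLU replaces the plain MLP's single activation with a learned gate.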

Key Results

Sentiment Classification (v4, clean balanced eval)

| Language | Base → SFT (Logprob) | Generative |
|---|---|---|
| 🇮🇱 Hebrew | 53.0% → 84.5% | 82% |
| 🇸🇦 Arabic | 45.0% → 60.5% | 64% |
| 🇮🇷 Farsi | 60.5% → 78.5% | 74% |
| 🇺🇸 English | 51.5% → 73.0% | 64% |
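
The "Logprob" column scores classification by comparing label likelihoods rather than by free generation. A minimal sketch of that evaluation style, with a toy stand-in for the model (the function names here are hypothetical, not this repo's API):

```python
import math

def classify_by_logprob(label_logprob, prompt, labels):
    # Pick the label whose continuation gets the highest total log-probability.
    # label_logprob(prompt, label) is assumed to return
    # sum(log p(token_i | prompt, tokens_<i)) over the label's tokens.
    scores = {label: label_logprob(prompt, label) for label in labels}
    return max(scores, key=scores.get)

# Toy stand-in for a real model: scores a label by substring overlap with the prompt.
def toy_logprob(prompt, label):
    return math.log(1e-3 + sum(w in prompt for w in label.split()))

print(classify_by_logprob(toy_logprob, "I love this book", ["love it", "hate it"]))
# → love it
```

Logprob scoring sidesteps parsing free-form output, which is why it tends to read a few points higher than the generative numbers.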

Cross-lingual Transfer (Experiment B)

English-only SFT barely transfers to the non-English languages, indicating that multilingual SFT is necessary:

| Language | Base | EN-SFT | Multi-SFT |
|---|---|---|---|
| Hebrew | 53.0% | 51.5% | 84.5% |
| Arabic | 45.0% | 46.5% | 60.5% |
| Farsi | 60.5% | 58.5% | 78.5% |
| English | 51.5% | 52.0% | 73.0% |

Tokenizer Efficiency (Experiment C)

Our tokenizer uses 49-69% fewer tokens than Llama-2 for Hebrew/Arabic/Farsi:

| Language | Ours (tok/byte) | Llama-2 (tok/byte) | Improvement |
|---|---|---|---|
| Hebrew | 0.195 | 0.569 | +65.6% |
| Arabic | 0.288 | 0.565 | +49.1% |
| Farsi | 0.175 | 0.561 | +68.8% |
| English | 0.270 | 0.264 | -2.2% |
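
The metric here is tokens emitted per UTF-8 byte of input text (lower is better). A sketch of how it can be computed, using a toy whitespace tokenizer as a stand-in for SentencePiece:

```python
def tokens_per_byte(encode, text):
    # Fertility metric: tokens per UTF-8 byte of input (lower is better).
    return len(encode(text)) / len(text.encode("utf-8"))

def improvement(ours, theirs):
    # Relative reduction in tokens/byte versus the baseline tokenizer.
    return (1.0 - ours / theirs) * 100.0

text = "hello world"
print(tokens_per_byte(str.split, text))           # 2 tokens / 11 bytes ≈ 0.18
print(round(improvement(0.195, 0.569), 1))        # → 65.7
```

Applied to the rounded ratios in the Hebrew row this gives 65.7%, matching the reported +65.6% up to rounding of the per-language ratios.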

Files

  • base_model.pt — Pretrained base model (no SFT)
  • sft_model_v4.pt — Fine-tuned model (v4, sentiment + translation)
  • multilingual_32k.model — SentencePiece tokenizer
  • config.json — Model configuration
  • exp_ab_results.json — Experiment A+B results
  • exp_c_tokenizer_ablation.json — Experiment C results

Usage

import torch
import sentencepiece as spm

# Load tokenizer
sp = spm.SentencePieceProcessor(model_file='multilingual_32k.model')

# Load model (see model_arch.py for architecture)
from model_arch import GPT
model = GPT()
state = torch.load('sft_model_v4.pt', map_location='cpu', weights_only=True)
model.load_state_dict(state['model_state_dict'])
model.eval()

# Generate with greedy decoding.
# Prompt (Hebrew): "Classify the sentiment of the following text
# (positive/negative): I love this book!"
prompt = "<|user|> סווג את הרגש של הטקסט הבא (חיובי/שלילי):\nאני אוהב את הספר הזה!\n<|assistant|> "
ids = sp.encode(prompt)
x = torch.tensor([ids])
with torch.no_grad():
    for _ in range(20):
        logits = model(x)
        next_id = logits[0, -1].argmax().item()  # greedy: take the top token
        if next_id == 2:  # EOS
            break
        x = torch.cat([x, torch.tensor([[next_id]])], dim=1)
print(sp.decode(x[0, len(ids):].tolist()))
# → חיובי ("positive")
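
The loop above decodes greedily; for more varied output you can sample with temperature instead. A minimal, model-independent sketch operating on a plain list of logits:

```python
import math
import random

def sample_with_temperature(logits, temperature=0.8, rng=random):
    # Softmax over temperature-scaled logits, then draw one token id.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# As temperature -> 0 this approaches greedy argmax decoding:
print(sample_with_temperature([0.0, 10.0, 1.0], temperature=0.01))  # → 1
```

To use it in the generation loop, replace the `argmax` line with `next_id = sample_with_temperature(logits[0, -1].tolist())`.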

Citation

Paper forthcoming.

License

Apache 2.0
