---
license: apache-2.0
language:
  - he
  - ar
  - fa
  - en
tags:
  - multilingual
  - hebrew
  - arabic
  - persian
  - semitic
  - sentiment-analysis
  - cross-lingual
pipeline_tag: text-generation
---

# SemiticGPT-3B

A 3.14B-parameter multilingual language model trained from scratch for **Hebrew, Arabic, Persian (Farsi), and English** — a script-diverse, low-resource language cluster centered on Semitic languages.

## Model Details

| Property | Value |
|----------|-------|
| Parameters | 3.14B |
| Architecture | GPT (RoPE, SwiGLU, RMSNorm, fused QKV) |
| Vocab Size | 32,000 (custom multilingual SentencePiece BPE) |
| Max Seq Length | 2,048 |
| Pretraining Data | 4.48B tokens (HE 40%, AR 20%, FA 20%, EN 20%) |
| SFT Data | 36,980 samples (sentiment + translation) |

## Key Results

### Sentiment Classification (v4, clean balanced eval)

| Language | Base → SFT (Logprob) | Generative |
|----------|---------------------|------------|
| 🇮🇱 Hebrew | 53.0% → **84.5%** | **82%** |
| 🇸🇦 Arabic | 45.0% → **60.5%** | **64%** |
| 🇮🇷 Farsi | 60.5% → **78.5%** | **74%** |
| 🇺🇸 English | 51.5% → **73.0%** | **64%** |

### Cross-lingual Transfer (Experiment B)

English-only SFT barely transfers to the non-English languages, showing that **multilingual SFT is necessary**:

| Language | Base | EN-SFT | Multi-SFT |
|----------|------|--------|-----------|
| Hebrew | 53.0% | 51.5% | **84.5%** |
| Arabic | 45.0% | 46.5% | **60.5%** |
| Farsi | 60.5% | 58.5% | **78.5%** |
| English | 51.5% | 52.0% | **73.0%** |

### Tokenizer Efficiency (Experiment C)

Our tokenizer uses **49–69% fewer tokens** than Llama-2 for Hebrew, Arabic, and Farsi (lower tokens/byte is better):

| Language | Ours (tok/byte) | Llama-2 (tok/byte) | Improvement |
|----------|----------------|-------------------|-------------|
| Hebrew | 0.195 | 0.569 | **+65.6%** |
| Arabic | 0.288 | 0.565 | **+49.1%** |
| Farsi | 0.175 | 0.561 | **+68.8%** |
| English | 0.270 | 0.264 | -2.2% |

## Files

- `base_model.pt` — Pretrained base model (no SFT)
- `sft_model_v4.pt` — Fine-tuned model (v4, sentiment + translation)
- `multilingual_32k.model` — SentencePiece tokenizer
- `config.json` — Model configuration
- `exp_ab_results.json` — Experiment A+B results
- `exp_c_tokenizer_ablation.json` — Experiment C results

## Usage

```python
import torch
import sentencepiece as spm

# Load tokenizer
sp = spm.SentencePieceProcessor(model_file='multilingual_32k.model')

# Load model (see model_arch.py for the architecture definition)
from model_arch import GPT
model = GPT()
state = torch.load('sft_model_v4.pt', map_location='cpu', weights_only=True)
model.load_state_dict(state['model_state_dict'])
model.eval()

# Greedy generation.
# Prompt (Hebrew): "Classify the sentiment of the following text
# (positive/negative):\nI love this book!"
prompt = "<|user|> סווג את הרגש של הטקסט הבא (חיובי/שלילי):\nאני אוהב את הספר הזה!\n<|assistant|> "
ids = sp.encode(prompt)
x = torch.tensor([ids])
with torch.no_grad():
    for _ in range(20):
        logits = model(x)
        next_id = logits[0, -1].argmax().item()
        if next_id == 2:  # EOS
            break
        x = torch.cat([x, torch.tensor([[next_id]])], dim=1)
print(sp.decode(x[0, len(ids):].tolist()))  # → חיובי ("positive")
```

## Citation

Paper forthcoming.

## License

Apache 2.0
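
## Appendix: Reproducing the Experiment C "Improvement" column

A minimal sketch of how the "Improvement" figures above can be derived from the tokens-per-byte rates, assuming improvement means relative token savings versus Llama-2; the helper name `relative_savings` is ours for illustration, not code from this release.

```python
# Sketch (assumption): improvement = percent fewer tokens than Llama-2
# for the same bytes of text. A negative value means more tokens.
def relative_savings(ours: float, llama2: float) -> float:
    """Relative token savings vs. Llama-2, in percent."""
    return (llama2 - ours) / llama2 * 100

rates = {  # tokens per byte, copied from the Experiment C table
    "Hebrew":  (0.195, 0.569),
    "Arabic":  (0.288, 0.565),
    "Farsi":   (0.175, 0.561),
    "English": (0.270, 0.264),
}

for lang, (ours, llama2) in rates.items():
    print(f"{lang}: {relative_savings(ours, llama2):+.1f}%")
```

With the table's rounded tok/byte values this recovers the published figures to within about 0.1 percentage points (e.g. +68.8% for Farsi); small discrepancies come from rounding in the reported rates.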