---
license: apache-2.0
language:
  - he
  - ar
  - fa
  - en
tags:
  - multilingual
  - hebrew
  - arabic
  - persian
  - semitic
  - sentiment-analysis
  - cross-lingual
pipeline_tag: text-generation
---

# SemiticGPT-3B

A 3.14B-parameter multilingual language model trained from scratch for **Hebrew, Arabic, Persian (Farsi), and English** — a script-diverse, low-resource language cluster centered on Semitic languages.

## Model Details

| Property | Value |
|----------|-------|
| Parameters | 3.14B |
| Architecture | GPT (RoPE, SwiGLU, RMSNorm, fused QKV) |
| Vocab Size | 32,000 (custom multilingual SentencePiece BPE) |
| Max Seq Length | 2,048 |
| Pretraining Data | 4.48B tokens (HE 40%, AR 20%, FA 20%, EN 20%) |
| SFT Data | 36,980 samples (sentiment + translation) |

## Key Results

### Sentiment Classification (v4, clean balanced eval)

| Language | Base → SFT (Logprob) | Generative |
|----------|---------------------|------------|
| 🇮🇱 Hebrew | 53.0% → **84.5%** | **82%** |
| 🇸🇦 Arabic | 45.0% → **60.5%** | **64%** |
| 🇮🇷 Farsi | 60.5% → **78.5%** | **74%** |
| 🇺🇸 English | 51.5% → **73.0%** | **64%** |

### Cross-lingual Transfer (Experiment B)

English-only SFT barely transfers to the non-English languages, showing that **multilingual SFT is necessary**:

| Language | Base | EN-SFT | Multi-SFT |
|----------|------|--------|-----------|
| Hebrew | 53.0% | 51.5% | **84.5%** |
| Arabic | 45.0% | 46.5% | **60.5%** |
| Farsi | 60.5% | 58.5% | **78.5%** |
| English | 51.5% | 52.0% | **73.0%** |

### Tokenizer Efficiency (Experiment C)

Our tokenizer uses **49–69% fewer tokens** than Llama-2 for Hebrew, Arabic, and Farsi (lower tokens/byte is better):

| Language | Ours (tok/byte) | Llama-2 (tok/byte) | Improvement |
|----------|----------------|-------------------|-------------|
| Hebrew | 0.195 | 0.569 | **+65.6%** |
| Arabic | 0.288 | 0.565 | **+49.1%** |
| Farsi | 0.175 | 0.561 | **+68.8%** |
| English | 0.270 | 0.264 | -2.2% |

## Files

- `base_model.pt` — Pretrained base model (no SFT)
- `sft_model_v4.pt` — Fine-tuned model (v4, sentiment + translation)
- `multilingual_32k.model` — SentencePiece tokenizer
- `config.json` — Model configuration
- `exp_ab_results.json` — Experiment A+B results
- `exp_c_tokenizer_ablation.json` — Experiment C results

## Usage

```python
import torch
import sentencepiece as spm

# Load tokenizer
sp = spm.SentencePieceProcessor(model_file='multilingual_32k.model')

# Load model (see model_arch.py for the architecture definition)
from model_arch import GPT
model = GPT()
state = torch.load('sft_model_v4.pt', map_location='cpu', weights_only=True)
model.load_state_dict(state['model_state_dict'])
model.eval()

# Greedy generation.
# Prompt (Hebrew): "Classify the sentiment of the following text
# (positive/negative):\nI love this book!"
prompt = "<|user|> סווג את הרגש של הטקסט הבא (חיובי/שלילי):\nאני אוהב את הספר הזה!\n<|assistant|> "
ids = sp.encode(prompt)
x = torch.tensor([ids])
with torch.no_grad():
    for _ in range(20):
        logits = model(x)
        next_id = logits[0, -1].argmax().item()
        if next_id == 2:  # EOS
            break
        x = torch.cat([x, torch.tensor([[next_id]])], dim=1)
print(sp.decode(x[0, len(ids):].tolist()))  # → חיובי ("positive")
```

## Citation

Paper forthcoming.

## License

Apache 2.0
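
## Appendix: Reproducing the Experiment C "Improvement" column

A minimal sketch of how the "Improvement" figures above can be derived from the tokens-per-byte rates, assuming improvement means relative token savings versus Llama-2; the helper name `relative_savings` is ours for illustration, not code from this release.

```python
# Sketch (assumption): improvement = percent fewer tokens than Llama-2
# for the same bytes of text. A negative value means more tokens.
def relative_savings(ours: float, llama2: float) -> float:
    """Relative token savings vs. Llama-2, in percent."""
    return (llama2 - ours) / llama2 * 100

rates = {  # tokens per byte, copied from the Experiment C table
    "Hebrew":  (0.195, 0.569),
    "Arabic":  (0.288, 0.565),
    "Farsi":   (0.175, 0.561),
    "English": (0.270, 0.264),
}

for lang, (ours, llama2) in rates.items():
    print(f"{lang}: {relative_savings(ours, llama2):+.1f}%")
```

With the table's rounded tok/byte values this recovers the published figures to within about 0.1 percentage points (e.g. +68.8% for Farsi); small discrepancies come from rounding in the reported rates.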