ronnengmail committed
Commit ebf013f (verified) · 1 parent: 47ef246

Add model card, config, tokenizer, and architecture code
README.md ADDED
@@ -0,0 +1,192 @@
---
license: mit
language:
- he
- ar
- fa
- en
tags:
- multilingual
- semitic
- hebrew
- arabic
- farsi
- decoder-only
- from-scratch
- cross-lingual-transfer
datasets:
- oscar-corpus/OSCAR-2301
- wikimedia/wikipedia
- allenai/c4
pipeline_tag: text-generation
model-index:
- name: SemiticGPT-3B
  results:
  - task:
      type: text-generation
      name: Language Modeling
    metrics:
    - name: BPB Hebrew
      type: bpb
      value: 0.876
    - name: BPB Arabic
      type: bpb
      value: 0.726
    - name: BPB Farsi
      type: bpb
      value: 0.657
    - name: BPB English
      type: bpb
      value: 0.964
---

# SemiticGPT-3B

A 3.04-billion-parameter multilingual decoder-only language model trained **from scratch** for Hebrew, Arabic, Farsi, and English — a Semitic-centered language cluster.

## Model Description

SemiticGPT is trained from scratch (no fine-tuning of existing models) with a custom balanced tokenizer designed for multi-script coverage. The model demonstrates meaningful cross-lingual semantic transfer between linguistically related languages.

| Property | Value |
|----------|-------|
| Parameters | 3.04B |
| Architecture | Decoder-only Transformer |
| Layers | 36 |
| Hidden dim | 2,560 |
| Attention heads | 20 (head dim 128) |
| Vocabulary | 32,768 (custom BPE) |
| Sequence length | 2,048 |
| Position encoding | RoPE |
| Activation | SwiGLU |
| Normalization | RMSNorm |
| Training tokens | ~20B |
| Training cost | ~$1,456 (AWS spot instances) |

## Key Results

### Cross-lingual Sentiment Transfer (Headline Result)

Training on **Hebrew sentiment data only** improves Arabic sentiment accuracy from 5.5% → 49% (a 9× improvement) with **zero Arabic task data**. This demonstrates emergent cross-lingual transfer between Semitic languages.

Critically, Farsi (which shares the Arabic script but belongs to a different language family) shows no comparable transfer (0.5% → 1.5%), suggesting linguistic family relatedness matters more than script similarity.

### Language Modeling (BPB)

| Language | Base | D-SFT |
|----------|------|-------|
| Hebrew | 0.879 | 0.876 |
| Arabic | 0.731 | 0.726 |
| Farsi | 0.663 | 0.657 |
| English | 0.972 | 0.964 |

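Bits-per-byte is reported instead of perplexity because it normalizes by raw text bytes rather than tokens, making scores comparable across tokenizers with different fertilities. It can be derived from a model's mean cross-entropy; a minimal sketch, where the 2.0 nats/token loss is illustrative (not the model's actual eval number) and the token/byte counts are the Hebrew sample from `fertility_report.json`:

```python
import math

def bits_per_byte(nll_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) over an eval set to bits-per-byte."""
    total_bits = nll_nats_per_token * n_tokens / math.log(2)  # nats -> bits
    return total_bits / n_bytes

# Illustrative: 2.0 nats/token at ~5.83 bytes/token (Hebrew sample)
print(bits_per_byte(2.0, 150_214, 876_334))  # ≈ 0.495
```

A lower-fertility tokenizer packs more bytes into each token, so the same per-token loss yields a lower (better) BPB.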
### Cross-lingual Retrieval

90% accuracy on EN↔HE cross-lingual retrieval (10-way, chance = 10%) — emerging purely from multilingual pretraining, without any alignment objective.

### Translation

Best: 18.7 chrF for AR→FA with direct parallel data. Key finding: English-mediated parallel data does NOT enable direct translation between non-English pairs.

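chrF scores character n-gram overlap between hypothesis and reference, which makes it more robust than BLEU for morphologically rich languages. A simplified sketch of the metric (real implementations, e.g. sacreBLEU's chrF, use the same n-gram F-β idea with β = 2 but handle whitespace and averaging slightly differently):

```python
from collections import Counter

def char_ngrams(text: str, n: int) -> Counter:
    text = text.replace(" ", "")  # chrF ignores whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: mean char n-gram precision/recall combined with F-beta."""
    precs, recs = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue
        overlap = sum((h & r).values())  # clipped n-gram matches
        precs.append(overlap / sum(h.values()))
        recs.append(overlap / sum(r.values()))
    if not precs:
        return 0.0
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

On this 0–1 scale, 18.7 chrF corresponds to 0.187 — low overlap, consistent with the limitations noted below.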
## Files

| File | Size | Description |
|------|------|-------------|
| `best_model.pt` | 12.5 GB | Pretrained base model (3.04B params) |
| `sft_model.pt` | 6 GB | SFT model (D-baseline: 5K steps, all 4 langs) |
| `multilingual_32k.model` | 817 KB | SentencePiece tokenizer |
| `multilingual_32k.vocab` | 551 KB | Tokenizer vocabulary |
| `config.json` | - | Model configuration |

## Usage

```python
from model_arch import load_model, load_tokenizer

# Load tokenizer (SentencePiece)
sp = load_tokenizer("multilingual_32k.model")

# Load model (custom architecture — load_model in model_arch.py handles
# checkpoint-prefix stripping and tied embeddings)
model = load_model("sft_model.pt", device="cuda")
model = model.half()

# Generate
prompt = "<|user|>\nמה הבירה של ישראל?\n<|assistant|>\n"  # "What is the capital of Israel?"
tokens = sp.Encode(prompt)
# ... (see repo for full generation code)
```
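The elided generation step is a standard autoregressive loop. Below is a minimal greedy-decoding sketch written against a generic `logits_fn` (a hypothetical callable, not part of this repo) that maps the current token ids to next-token scores; with the model loaded above, it would wrap a forward pass, take the last position's logits, and finish with `sp.Decode(ids)`:

```python
def greedy_decode(logits_fn, prompt_ids, eos_id=2, max_new_tokens=64):
    """Repeatedly append the highest-scoring next token until EOS or the budget runs out."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)  # scores over the vocabulary for the next token
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:    # </s> has id 2 per fertility_report.json
            break
    return ids
```

For chat-style prompts, stopping on the `<|user|>` special token as well is a common refinement.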

## Training Data

| Language | Share | Sources |
|----------|-------|---------|
| Hebrew | 40% | Wikipedia, OSCAR, news, government docs |
| Arabic | 25% | Wikipedia, OSCAR, news, UN corpus |
| English | 20% | Wikipedia, OpenWebText, books |
| Farsi | 15% | Wikipedia, OSCAR, news |

Hebrew is intentionally overrepresented as the "anchor language" — strong anchor representations transfer to linguistically related languages.

## Tokenizer

Custom 32K BPE tokenizer trained on a balanced 25%-per-language sample:
- Hebrew fertility: 1.4 tokens/word
- Arabic fertility: 1.5 tokens/word
- Farsi fertility: 1.6 tokens/word
- English fertility: 1.2 tokens/word

(vs. 2.5+ tokens/word for Hebrew under mBERT's tokenizer and 3+ for Arabic under LLaMA's)

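The practical payoff of lower fertility is effective context: at a fixed 2,048-token window, fewer tokens per word means more words fit. A quick comparison using the Hebrew figures above:

```python
# Hebrew tokens/word, from the fertility figures above
fertility = {"SemiticGPT": 1.4, "mBERT": 2.5}

seq_len = 2048
for name, tokens_per_word in fertility.items():
    words = seq_len / tokens_per_word
    print(f"{name}: ~{words:.0f} Hebrew words per {seq_len}-token window")
```

Roughly 1.8× more Hebrew text per window than mBERT's tokenizer, with a proportional saving in compute per document.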
## Training Recipe

- **Optimizer**: AdamW (β₁=0.9, β₂=0.95)
- **Learning rate**: 3e-4, cosine decay
- **Batch size**: 512K tokens
- **Hardware**: AWS spot instances (L40S 48GB, H100 80GB)
- **Parallelism**: FSDP for multi-GPU, gradient accumulation for single-GPU
- **Validation**: recipe validated via 32 proxy experiments at 110M scale

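The schedule above (3e-4 peak, cosine decay, plus the 2,000-step warmup listed in config.json) can be sketched as follows. The `min_lr` floor is an assumption — the card only specifies the peak LR and cosine decay — and at ~20B tokens with 512K-token batches the run is roughly 38K steps:

```python
import math

def learning_rate(step, max_steps, peak_lr=3e-4, warmup=2000, min_lr=3e-5):
    # min_lr is an assumed floor; the card states only peak LR and cosine decay
    if step < warmup:
        return peak_lr * step / warmup  # linear warmup
    progress = (step - warmup) / max(1, max_steps - warmup)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

max_steps = 20_000_000_000 // 524_288  # ≈ 38,146 steps at 512K tokens/batch
```

The cosine term decays smoothly from `peak_lr` at the end of warmup to `min_lr` at `max_steps`.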
## Limitations

- 3B parameters — below the threshold for complex reasoning (Belebele near chance)
- Single-run results without confidence intervals
- Farsi underperforms (15% data share plus typological distance from the Semitic languages)
- Translation quality remains low (max 18.7 chrF)
- Not competitive with frontier models — this is a research/recipe contribution

## Paper

**SemiticGPT: A Low-Cost Recipe for Multilingual Foundation Models in an Under-Resourced Semitic-Centered Language Cluster**

Ronnen Slasky, Independent Researcher, April 2026

## Code

Full training pipeline, evaluation scripts, and reproducibility artifacts:

🔗 [GitHub: semitic-gpt](https://github.com/fatherRonnen/semitic-gpt)

## Citation

```bibtex
@article{slasky2026semiticgpt,
  title={SemiticGPT: A Low-Cost Recipe for Multilingual Foundation Models in an Under-Resourced Semitic-Centered Language Cluster},
  author={Slasky, Ronnen},
  year={2026}
}
```

## Acknowledgments

Built using the [autoresearch](https://github.com/karpathy/autoresearch) methodology for proxy-scale recipe validation. Training infrastructure on AWS.
config.json ADDED
@@ -0,0 +1,32 @@
{
  "model_type": "gpt",
  "architectures": ["GPT"],
  "vocab_size": 32768,
  "hidden_size": 2560,
  "num_hidden_layers": 36,
  "num_attention_heads": 20,
  "head_dim": 128,
  "max_position_embeddings": 2048,
  "intermediate_size": 6912,
  "activation_function": "swiglu",
  "normalization": "rmsnorm",
  "position_encoding": "rope",
  "total_params": "3.04B",
  "tokenizer_type": "sentencepiece",
  "tokenizer_vocab_size": 32768,
  "bos_token": "<s>",
  "eos_token": "</s>",
  "pad_token": "<pad>",
  "special_tokens": ["<|user|>", "<|assistant|>", "<s>", "</s>", "<pad>"],
  "training": {
    "optimizer": "AdamW",
    "learning_rate": 3e-4,
    "schedule": "cosine_decay",
    "warmup_steps": 2000,
    "batch_size_tokens": 524288,
    "weight_decay": 0.1,
    "gradient_clip": 1.0,
    "precision": "fp16",
    "total_tokens": "~20B"
  }
}
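As a sanity check, the headline parameter count can be re-derived from this config. A rough estimate that counts the attention and SwiGLU projections per layer plus the embedding matrices, ignoring norm weights, and assuming untied input/output embeddings (model_arch.py ties them, which would subtract ~84M):

```python
cfg = {"vocab_size": 32768, "hidden_size": 2560,
       "num_hidden_layers": 36, "intermediate_size": 6912}

d, v = cfg["hidden_size"], cfg["vocab_size"]
per_layer = 4 * d * d + 3 * d * cfg["intermediate_size"]  # qkv + out proj, SwiGLU w1/w2/w3
total = cfg["num_hidden_layers"] * per_layer + 2 * v * d  # + input embedding and output head
print(f"~{total / 1e9:.2f}B")  # ≈ 3.02B, close to the card's 3.04B
```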
fertility_report.json ADDED
@@ -0,0 +1,55 @@
{
  "model": "multilingual_32k",
  "vocab_size": 32000,
  "bos_id": 1,
  "eos_id": 2,
  "config": {
    "character_coverage": 0.9995,
    "model_type": "bpe",
    "byte_fallback": true,
    "split_digits": true,
    "max_sentence_length": 16384,
    "input_sentence_size": 10000000
  },
  "data_sources": {
    "en": "allenai/c4 (en)",
    "ar": "wikimedia/wikipedia (20231101.ar)",
    "he": "wikimedia/wikipedia (20231101.he)",
    "fa": "wikimedia/wikipedia (20231101.fa)"
  },
  "languages": {
    "en": {
      "num_tokens": 131858,
      "num_bytes": 502591,
      "num_words": 85508,
      "num_chars": 500000,
      "bytes_per_token": 3.81,
      "tokens_per_word": 1.54
    },
    "ar": {
      "num_tokens": 138572,
      "num_bytes": 900643,
      "num_words": 81698,
      "num_chars": 500000,
      "bytes_per_token": 6.5,
      "tokens_per_word": 1.7
    },
    "he": {
      "num_tokens": 150214,
      "num_bytes": 876334,
      "num_words": 81962,
      "num_chars": 500000,
      "bytes_per_token": 5.83,
      "tokens_per_word": 1.83
    },
    "fa": {
      "num_tokens": 129491,
      "num_bytes": 902876,
      "num_words": 91425,
      "num_chars": 500000,
      "bytes_per_token": 6.97,
      "tokens_per_word": 1.42
    }
  },
  "timestamp": "2026-04-01T14:12:42Z"
}
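The derived fields in this report follow directly from the raw counts. A quick check reproducing `tokens_per_word` and `bytes_per_token` for two of the languages:

```python
# Raw counts copied from fertility_report.json
stats = {
    "en": {"num_tokens": 131858, "num_bytes": 502591, "num_words": 85508},
    "he": {"num_tokens": 150214, "num_bytes": 876334, "num_words": 81962},
}
for lang, s in stats.items():
    tokens_per_word = round(s["num_tokens"] / s["num_words"], 2)
    bytes_per_token = round(s["num_bytes"] / s["num_tokens"], 2)
    print(lang, tokens_per_word, bytes_per_token)
```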
model_arch.py ADDED
@@ -0,0 +1,152 @@
"""Shared model architecture for multilingual 3B GPT — must match training exactly."""
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

# Architecture hyperparameters (kept in sync with config.json and the model card)
VOCAB_SIZE = 32768
DIM = 2560
DEPTH = 36
N_HEADS = 20
HEAD_DIM = DIM // N_HEADS  # 128
MAX_SEQ_LEN = 2048
ROPE_THETA = 10000.0
HIDDEN_DIM = 6912  # SwiGLU hidden size (intermediate_size in config.json)


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm = x.float().pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return (x.float() * norm).type_as(x) * self.weight


def precompute_freqs_cis(dim, max_seq_len, theta=ROPE_THETA):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    t = torch.arange(max_seq_len, dtype=torch.float32)
    freqs = torch.outer(t, freqs)
    return torch.polar(torch.ones_like(freqs), freqs)


def apply_rotary_emb(x, freqs_cis):
    # x: (B, n_heads, S, head_dim)
    B, H, S, D = x.shape
    x_complex = torch.view_as_complex(x.float().reshape(B, H, S, D // 2, 2))
    freqs = freqs_cis[:S].unsqueeze(0).unsqueeze(1)  # (1, 1, S, D//2)
    x_rot = torch.view_as_real(x_complex * freqs).reshape(B, H, S, D)
    return x_rot.type_as(x)


class FusedAttention(nn.Module):
    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = dim // n_heads
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        self.out_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x, freqs_cis, mask=None):
        B, S, D = x.shape
        qkv = self.qkv(x).reshape(B, S, 3, self.n_heads, self.head_dim)
        q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]
        q = q.transpose(1, 2)  # (B, H, S, head_dim)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)
        q = apply_rotary_emb(q, freqs_cis)
        k = apply_rotary_emb(k, freqs_cis)
        # Scaled dot-product attention
        scale = math.sqrt(self.head_dim)
        attn = (q @ k.transpose(-2, -1)) / scale
        if mask is not None:
            attn = attn + mask
        attn = F.softmax(attn, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, S, D)
        return self.out_proj(out)


class SwiGLUFFN(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class TransformerBlock(nn.Module):
    def __init__(self, dim, n_heads, hidden_dim):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = FusedAttention(dim, n_heads)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLUFFN(dim, hidden_dim)

    def forward(self, x, freqs_cis, mask=None):
        x = x + self.attn(self.attn_norm(x), freqs_cis, mask)
        x = x + self.ffn(self.ffn_norm(x))
        return x


class MultilingualGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB_SIZE, DIM)
        self.layers = nn.ModuleList([
            TransformerBlock(DIM, N_HEADS, HIDDEN_DIM) for _ in range(DEPTH)
        ])
        self.norm = RMSNorm(DIM)
        self.head = nn.Linear(DIM, VOCAB_SIZE, bias=False)
        # Tied embeddings
        self.head.weight = self.tok_emb.weight
        # Precompute RoPE
        self.register_buffer('freqs_cis', precompute_freqs_cis(HEAD_DIM, MAX_SEQ_LEN))

    def forward(self, tokens, targets=None):
        B, S = tokens.shape
        x = self.tok_emb(tokens)
        mask = torch.triu(torch.full((S, S), float('-inf'), device=tokens.device), diagonal=1)
        mask = mask.unsqueeze(0).unsqueeze(0)  # (1, 1, S, S)
        for layer in self.layers:
            x = layer(x, self.freqs_cis, mask)
        x = self.norm(x)
        logits = self.head(x)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, VOCAB_SIZE), targets.view(-1))
        return logits, loss


def load_model(path, device='cuda'):
    """Load model from checkpoint, stripping torch.compile/DDP prefixes."""
    model = MultilingualGPT()
    ckpt = torch.load(path, map_location='cpu', weights_only=False)
    state = ckpt.get('model_state_dict', ckpt)
    # Strip prefixes
    cleaned = {}
    for k, v in state.items():
        new_k = k
        for prefix in ['_orig_mod.', 'module.']:
            if new_k.startswith(prefix):
                new_k = new_k[len(prefix):]
        cleaned[new_k] = v
    # Handle tied weights — drop head.weight if identical (it will be tied to tok_emb)
    if 'head.weight' in cleaned and 'tok_emb.weight' in cleaned:
        if torch.equal(cleaned['head.weight'], cleaned['tok_emb.weight']):
            del cleaned['head.weight']
    model.load_state_dict(cleaned, strict=False)
    model = model.to(device).eval()
    return model


def load_tokenizer(path):
    """Load SentencePiece tokenizer."""
    import sentencepiece as spm
    sp = spm.SentencePieceProcessor()
    sp.Load(path)
    return sp
multilingual_32k.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cc439f6b64e14b6d1d900a246aa246cd639ae03464bc2f3aa5dc215d4f14b83c
size 836449
multilingual_32k.vocab ADDED
The diff for this file is too large to render. See raw diff