ronnengmail committed
Commit 1122dc8 · verified · 1 Parent(s): 925c5eb

Update model card: LoRA Phase 2 (PPL 15.78, 97.3% instruction following)

Files changed (1): README.md (+101 -75)
README.md CHANGED
@@ -6,6 +6,8 @@ tags:
  - hebrew
  - instruction-tuning
  - sft
  - language-model
  - text-generation
  - mamba
@@ -13,20 +15,40 @@ tags:
  pipeline_tag: text-generation
  model-index:
  - name: HebrewGPT-1B-Instruct
- results: []
  ---

- # HebrewGPT-1B-Instruct

- A **1.08 billion parameter** Hebrew instruction-tuned language model, fine-tuned from [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) on 61K balanced Hebrew instruction examples.

  ## Model Details

  | Property | Value |
  |----------|-------|
- | **Parameters** | 1.08B |
  | **Architecture** | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP) |
  | **Base Model** | [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (pretrained with Muon optimizer + SWA) |
  | **Context Length** | 2,048 tokens |
  | **Tokenizer** | SentencePiece BPE, 8,192 vocab, Hebrew morphology-aware with prefix splitting |
  | **License** | Apache 2.0 |
@@ -41,57 +63,47 @@ HebrewGPT-1B-Instruct uses the same hybrid architecture as the base model:
  - **MLP:** SwiGLU activation
  - **Positional encoding:** Rotary Position Embeddings (RoPE)

- ## Base Model: HebrewGPT-1B

- Built on [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B), a 1.08B parameter model trained from scratch on Hebrew text.

- ### Pre-Training Data (12 Hebrew Datasets, 9.8B tokens)
-
- | Dataset | Share | Description |
- |---------|-------|-------------|
- | Hebrew Wikipedia | 12% | Encyclopedia articles |
- | Supreme Court Rulings | 22% | Israeli legal corpus |
- | Ben Yehuda Project | 23% | Classic Hebrew literature |
- | C4 Hebrew | 20% | Web-crawled text (cleaned) |
- | CC100 Hebrew | 19% | CommonCrawl filtered |
- | Task-specific | 4% | QA, NLI, sentiment prompts |

- ### Pre-Training Details
-
- - **Tokens:** 9.8B (3.9 epochs over 2.48B unique)
- - **Hardware:** 8×H100 80GB (p5.48xlarge), 8 hours
- - **Optimizer:** Muon + SWA (12.3% better BPB than AdamW at 1B scale)
- - **Perplexity:** 29.75 (SWA)
- - **Research:** 200 autonomous experiments across 4 versions, 100% hit rate in v4
- - **Paper:** [Autonomous AI-Driven Hebrew Language Model Research](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
- - **Ablation:** [HebrewGPT-1B-AdamW](https://huggingface.co/Slasky/HebrewGPT-1B-AdamW) (same architecture, AdamW optimizer)
-
- ## Training
-
- ### SFT Configuration
- - **Method:** Full Supervised Fine-Tuning (SFT)
- - **Training steps:** 3,000
- - **Best validation loss:** 2.9598
- - **Hardware:** Single NVIDIA A10G GPU (AWS g5.2xlarge)
- - **Training time:** ~6.5 hours
- - **SFT fine-tuning tokens:** ~20.3M
- - **Base model pre-training:** 9.8B tokens (12 diverse Hebrew datasets including Wikipedia, Supreme Court, Ben Yehuda, C4, CC100)
-
- ### Instruction Dataset (61K examples)
-
- The model was fine-tuned on a balanced mix of Hebrew instruction-following tasks:
-
- | Category | Examples | Description |
- |----------|----------|-------------|
- | QA (HeQ) | 15,000 | Hebrew question answering |
- | Sentiment | 10,000 | Hebrew sentiment analysis |
- | NLI | 2,938 | Natural language inference |
- | Summarization (HeSum) | 10,000 | Hebrew text summarization |
- | Translation | 15,000 | Hebrew-English translation |
- | Alpaca | 5,000 | General instruction following (translated) |
- | Dolly | 2,000 | Open-domain instruction following |
- | Chat | 1,000 | Conversational Hebrew |
- | Winograd | 278 | Coreference resolution |

  ## Usage
 
@@ -124,19 +136,44 @@ The model was trained with a structured instruction format:
  {response}
  ```

- ## Evaluation

- Evaluation on Hebrew benchmarks requires GPU inference. Base model (HebrewGPT-1B) results for comparison:

- | Task | Base Model | Instruct (SFT) |
- |------|-----------|----------------|
- | SNLI | 50% | *Pending* |
- | Sentiment | 33% | *Pending* |
- | QA | 20% | *Pending* |
- | Trivia | 13% | *Pending* |
- | **Average** | **29.2%** | *Pending* |

- SFT evaluation will be run on GPU and updated here. The instruction-tuned model is expected to show significant improvements on structured tasks (QA, sentiment, NLI) that were part of the SFT training mix.

  ## Infrastructure
 
@@ -144,29 +181,18 @@ SFT evaluation will be run on GPU and updated here. The instruction-tuned model
  - **Training Compute:** AWS EC2 g5.2xlarge (NVIDIA A10G)
  - **Data Pipeline:** Automated dataset collection, translation, and balancing

- ## Files
-
- - `model.pt` – SFT fine-tuned model state dict (2.1 GB)
- - `tokenizer.model` – SentencePiece BPE tokenizer (8,192 vocab)
-
  ## Citation

  ```bibtex
  @misc{hebrewgpt1b-instruct-2026,
-   title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model},
    author={Slasky, Ronnen},
    year={2026},
-   url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct}
  }
  ```

- ## Limitations
-
- - Small vocabulary (8,192 tokens) may limit performance on rare words
- - 2,048 context window limits long-document tasks
- - Trained primarily on structured instruction tasks; open-ended generation quality may vary
- - Hebrew-specific model – limited multilingual capability beyond Hebrew-English translation
-
  ## License

  Apache 2.0
 
  - hebrew
  - instruction-tuning
  - sft
+ - lora
+ - curriculum-distillation
  - language-model
  - text-generation
  - mamba
 
  pipeline_tag: text-generation
  model-index:
  - name: HebrewGPT-1B-Instruct
+   results:
+   - task:
+       type: text-generation
+       name: Language Modeling
+     metrics:
+     - name: Perplexity
+       type: perplexity
+       value: 15.78
+     - name: Instruction Following
+       type: accuracy
+       value: 97.3
+     - name: Repetition Rate
+       type: custom
+       value: 0.001
  ---

+ # HebrewGPT-1B-Instruct (LoRA Phase 2) 🇮🇱

+ A **1.08 billion parameter** Hebrew instruction-tuned language model, fine-tuned from [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) using **LoRA Phase 2 curriculum distillation** on 65K Hebrew instruction examples.
+
+ This is the latest and best-performing instruct variant, achieving **PPL 15.78** (↓47% from the base pretrained model) with **97.3% instruction following** and near-zero repetition, trained for ~$12 on a single A10G GPU.
+
+ - 📄 **Paper**: [Autonomous AI-Driven Hebrew Language Model Research](https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html)
+ - 💻 **GitHub**: [AgenticResearcher](https://github.com/fatherRonnen/AgenticResearcher)
+ - 🏗️ **Base Model**: [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B)

  ## Model Details

  | Property | Value |
  |----------|-------|
+ | **Parameters** | 1.08B (44.7M trainable via LoRA, 4%) |
  | **Architecture** | Custom Mamba-Transformer hybrid (interleaved RoPE attention + Mamba SSM, SwiGLU MLP) |
  | **Base Model** | [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B) (pretrained with Muon optimizer + SWA) |
+ | **Fine-Tuning** | LoRA SFT (rank=64, alpha=128) |
  | **Context Length** | 2,048 tokens |
  | **Tokenizer** | SentencePiece BPE, 8,192 vocab, Hebrew morphology-aware with prefix splitting |
  | **License** | Apache 2.0 |
 
  - **MLP:** SwiGLU activation
  - **Positional encoding:** Rotary Position Embeddings (RoPE)

+ ## Training: LoRA Phase 2

+ ### Method
+ - **LoRA SFT** with rank=64, alpha=128
+ - **Target modules:** qkv, proj, gate, up, down
+ - **Trainable parameters:** 44.7M / 1.08B (4%)
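To make the configuration above concrete, here is a minimal sketch of a LoRA-wrapped linear layer with this card's rank=64 and alpha=128. The wrapper is illustrative only (the actual training code lives in the AgenticResearcher repo); each of the listed target modules (qkv, proj, gate, up, down) would be wrapped in this fashion.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update.
    Illustrative sketch -- not the repo's actual implementation."""
    def __init__(self, base: nn.Linear, rank: int = 64, alpha: int = 128):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # freeze the base weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # update starts as a no-op
        self.scale = alpha / rank              # alpha=128, rank=64 -> 2.0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Only the two adapter matrices train: in*rank + rank*out parameters.
layer = LoRALinear(nn.Linear(512, 512))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # 65536
```

Freezing the base and training only the two small adapter matrices is what yields the 4% trainable-parameter ratio reported above.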

+ ### Data
+ - **65K examples** combined from a two-phase curriculum:
+   - **Phase 1 (ELI5 simple):** 28.5K examples – simple explanations for foundational instruction following
+   - **Phase 2 (Sonnet/Nemotron complex):** 36.5K examples – advanced, diverse instruction data

+ ### Two-Phase Curriculum
+ The training uses a curriculum distillation approach: it starts with simple ELI5-style examples to establish instruction-following behavior, then progresses to complex Sonnet/Nemotron-generated examples for advanced capabilities.

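The two-phase schedule can be sketched as a simple ordered data feed. The phase names and sizes come from this card; the scheduling function itself is only illustrative, not the repo's pipeline.

```python
def curriculum(phase1, phase2):
    """Yield Phase 1 (simple) examples before Phase 2 (complex) ones.
    Illustrative scheduling only; the real pipeline lives in the repo."""
    yield from phase1    # ELI5-style: establish instruction following
    yield from phase2    # Sonnet/Nemotron: advanced, diverse instructions

# Phase sizes from this card: 28.5K + 36.5K = 65K examples total.
PHASE_SIZES = {"phase1_eli5": 28_500, "phase2_complex": 36_500}
assert sum(PHASE_SIZES.values()) == 65_000

order = list(curriculum(["simple_example"], ["complex_example"]))
print(order)  # ['simple_example', 'complex_example']
```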
+ ### Training Details
+
+ | Property | Value |
+ |----------|-------|
+ | **Hardware** | NVIDIA A10G (AWS g5.2xlarge) |
+ | **Training time** | ~8 hours |
+ | **Best validation loss** | 2.4768 (BPB 3.57) |
+ | **Early stopping** | Step ~1000 (patience 5) |
+ | **Total cost** | ~$12 |
+
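The BPB figure in the table is consistent with the validation loss under a nats-to-bits conversion (loss / ln 2), and perplexity is the exponential of the mean loss. A quick check (note the headline PPL 15.78 is measured on a separate evaluation set):

```python
import math

val_loss = 2.4768                     # best validation loss (nats per token)

bits = val_loss / math.log(2)         # nats -> bits
print(round(bits, 2))                 # 3.57, matching the reported BPB

# Perplexity is exp(mean cross-entropy in nats); the card's headline
# PPL 15.78 comes from a separate evaluation set, not this loss.
print(round(math.exp(val_loss), 2))
```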
+ ## Evaluation Results
+
+ | Metric | Base Model | LoRA Phase 2 | Delta |
+ |--------|-----------|-------------|-------|
+ | Perplexity | 25.14 | **15.78** | **-37%** |
+ | Instruction Following | – | **97.3%** | – |
+ | MCQA | – | 10% | – |
+ | Repetition Rate | 0.006 | **0.001** | **-83%** |
+ | High-rep Outputs | – | **0%** | – |
+
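The card does not define its custom repetition metric. One common stand-in is the fraction of n-grams that already appeared earlier in the output, sketched here as a hypothetical example of how such a rate could be computed:

```python
def repeated_ngram_rate(tokens, n: int = 4) -> float:
    """Fraction of n-grams that occurred earlier in the sequence.
    A hypothetical stand-in for the card's unspecified repetition metric."""
    seen, repeats, total = set(), 0, 0
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        if gram in seen:
            repeats += 1
        seen.add(gram)
        total += 1
    return repeats / total if total else 0.0

print(repeated_ngram_rate(list("abcdefg"), n=2))  # 0.0 -- no repeats
```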
+ ## Key Improvements
+
+ - **Perplexity:** 29.75 → 15.78 (**-47%** from the base pretrained model)
+ - **Near-zero repetition** – Phase 1 distillation suffered severe repetition loops; LoRA Phase 2 cuts the repetition rate to 0.001 with 0% high-repetition outputs
+ - **Fluent Hebrew generation** across diverse topics
+ - **97.3% instruction following rate** – the model reliably follows the instruction format
+ - **Total post-training cost:** ~$12 on a single NVIDIA A10G GPU
 
  ## Usage

  {response}
  ```

+ For inference, provide the instruction and input, then let the model generate after `### תשובה:` ("Answer:").
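A minimal sketch of assembling such a prompt for generation. Only the `### תשובה:` response header is taken from this card; the `### Instruction:`/`### Input:` header strings below are hypothetical placeholders, so substitute the exact headers from the template in the Usage section.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Assemble a structured instruction prompt ending with the response
    header, so the model generates the answer as a continuation.
    NOTE: the Instruction/Input headers are illustrative placeholders."""
    parts = [f"### Instruction:\n{instruction}"]
    if input_text:
        parts.append(f"### Input:\n{input_text}")
    parts.append("### תשובה:\n")  # "Answer:" -- the model continues here
    return "\n\n".join(parts)

prompt = build_prompt("Summarize the following text in one sentence.", "...")
print(prompt.endswith("### תשובה:\n"))  # True
```

Tokenize the resulting string with the released `tokenizer.model` (SentencePiece) and sample until an end-of-text token or a length limit.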
+
+ ## Files
+
+ - `model.pt` – LoRA Phase 2 weights with the adapters merged into the base model (2.1 GB)
+ - `tokenizer.model` – SentencePiece BPE tokenizer (8,192 vocab)
+
+ ## Limitations
+
+ - **Limited factual accuracy** – expected for a 1B-parameter model
+ - **HTML entity artifacts** from training data contamination (e.g., `…` appearing in outputs)
+ - **MCQA still weak (10%)** – needs MCQA-specific training data to improve
+ - **2,048 context window** limits long-document tasks
+ - **Small vocabulary (8,192 tokens)** may limit performance on rare words
+ - Hebrew-specific model – limited multilingual capability
+
+ ## Base Model: HebrewGPT-1B

+ Built on [HebrewGPT-1B](https://huggingface.co/Slasky/HebrewGPT-1B), a 1.08B parameter model trained from scratch on 9.8B tokens of Hebrew text.

+ ### Pre-Training Data (12 Hebrew Datasets, 9.8B tokens)
+
+ | Dataset | Share | Description |
+ |---------|-------|-------------|
+ | Hebrew Wikipedia | 12% | Encyclopedia articles |
+ | Supreme Court Rulings | 22% | Israeli legal corpus |
+ | Ben Yehuda Project | 23% | Classic Hebrew literature |
+ | C4 Hebrew | 20% | Web-crawled text (cleaned) |
+ | CC100 Hebrew | 19% | CommonCrawl filtered |
+ | Task-specific | 4% | QA, NLI, sentiment prompts |

+ ### Pre-Training Details
+
+ - **Tokens:** 9.8B (3.9 epochs over 2.48B unique)
+ - **Hardware:** 8×H100 80GB (p5.48xlarge), 8 hours
+ - **Optimizer:** Muon + SWA (12.3% better BPB than AdamW at 1B scale)
+ - **Perplexity:** 29.75 (SWA)
+ - **Research:** 200 autonomous experiments across 4 versions, 100% hit rate in v4
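For illustration, the mixture shares from the pre-training data table can drive a simple weighted source sampler. The corpus keys below are made-up shorthand for this sketch, not real dataset identifiers.

```python
import random

# Mixture shares from the pre-training data table; keys are shorthand.
SHARES = {
    "hebrew_wikipedia": 0.12,
    "supreme_court": 0.22,
    "ben_yehuda": 0.23,
    "c4_hebrew": 0.20,
    "cc100_hebrew": 0.19,
    "task_specific": 0.04,
}
assert abs(sum(SHARES.values()) - 1.0) < 1e-9  # shares cover the corpus

def sample_source(rng: random.Random) -> str:
    """Pick the next corpus to draw a document from, weighted by share."""
    return rng.choices(list(SHARES), weights=list(SHARES.values()), k=1)[0]

print(sample_source(random.Random(0)))
```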
 
  ## Infrastructure

  - **Training Compute:** AWS EC2 g5.2xlarge (NVIDIA A10G)
  - **Data Pipeline:** Automated dataset collection, translation, and balancing

  ## Citation

  ```bibtex
  @misc{hebrewgpt1b-instruct-2026,
+   title={HebrewGPT-1B-Instruct: A Hebrew Instruction-Tuned Language Model via LoRA Curriculum Distillation},
    author={Slasky, Ronnen},
    year={2026},
+   url={https://huggingface.co/Slasky/HebrewGPT-1B-Instruct},
+   note={Paper: https://d11k83yu06biio.cloudfront.net/paper/hebrew-autoresearch.html}
  }
  ```

  ## License

  Apache 2.0