hellosindh committed (verified)
Commit 4409eca · Parent(s): 60e393a

Update README.md

Files changed (1): README.md (+199 -74)

# Indus Script Models

Trained models for validating, predicting, and generating sequences in the undeciphered
Indus Valley Script (2600–1900 BCE). Built on 3,310 real archaeological inscriptions.

---

## Quick Start (3 steps)

```bash
# Step 1 — Clone the repo
git clone https://huggingface.co/hellosindh/indus-script-models
cd indus-script-models

# Step 2 — Install dependencies
pip install torch transformers

# Step 3 — Run the demo
python inference.py --task demo
```

That is it. No downloads, no configuration. The models are already in the repo.

---

## What you can do

### 1. Validate a sequence

Is this inscription grammatically valid?

```bash
python inference.py --task validate --sequence "T638 T177 T420 T122"
```

Output:
```
Sequence : T638 T177 T420 T122
BERT     : 0.9650
N-gram   : 0.8930
ELECTRA  : 0.9410
Ensemble : 0.9410
Verdict  : VALID (>=85%)
```
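
The ensemble verdict above can be reproduced by hand. A minimal sketch of the weighted combination described in the pipeline section (50% BERT, 25% N-gram, 25% ELECTRA); `ensemble_score` is an illustrative helper, not a function from `inference.py`:

```python
# Illustrative helper, not part of inference.py.
def ensemble_score(bert: float, ngram: float, electra: float) -> float:
    """Weighted ensemble: 50% BERT, 25% N-gram, 25% ELECTRA."""
    return 0.50 * bert + 0.25 * ngram + 0.25 * electra

score = ensemble_score(0.9650, 0.8930, 0.9410)
print(f"Ensemble: {score:.4f}")                  # Ensemble: 0.9410
print("VALID" if score >= 0.85 else "REJECTED")  # VALID
```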

### 2. Predict a masked sign

What sign most likely fills the missing position?

```bash
python inference.py --task predict --sequence "T638 [MASK] T420 T122"
```

Output:
```
Position 1 predictions:
  T177   18.3%
  T243   12.1%
  T653    9.4%
  T684    7.2%
  T650    5.8%
```
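
Conceptually, that prediction list is a softmax over the MLM's logits at the masked position, keeping the top k. A toy sketch with made-up logits (the real model scores all 641 sign tokens):

```python
import math

def top_k_predictions(logits: dict, k: int = 5):
    """Softmax over per-sign logits; return the k most probable signs."""
    z = max(logits.values())  # subtract max for numerical stability
    exps = {s: math.exp(v - z) for s, v in logits.items()}
    total = sum(exps.values())
    ranked = sorted(((s, e / total) for s, e in exps.items()),
                    key=lambda sp: sp[1], reverse=True)
    return ranked[:k]

# Made-up logits for six signs, for illustration only.
logits = {"T177": 2.9, "T243": 2.5, "T653": 2.2,
          "T684": 2.0, "T650": 1.8, "T122": 0.4}
for sign, p in top_k_predictions(logits):
    print(f"{sign:>5}  {p:.1%}")
```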

### 3. Generate new sequences

```bash
# Generate 10 sequences (default threshold 85%)
python inference.py --task generate --count 10

# More variety, less strict
python inference.py --task generate --count 20 --threshold 0.78

# High quality only
python inference.py --task generate --count 5 --threshold 0.92
```

### 4. Score any sequence

```bash
python inference.py --task score --sequence "T604 T123 T609"
```

---

## Generating more diverse or longer sequences

Open `inference.py` and find the `task_generate` function. Change the temperature list:

**More random — forces rare signs to appear:**
```python
# Change this line:
temps = [0.85, 0.90, 1.00, 1.10]
# To:
temps = [1.10, 1.20, 1.30, 1.40]
```

**Longer sequences:**
Find the `generate()` method inside `load_nanogpt()` and change `max_len`:
```python
# Default (avg 7 signs):
def generate(self, temperature=0.85, top_k=40, max_len=15):

# For longer sequences:
def generate(self, temperature=0.85, top_k=40, max_len=25):

# For shorter sequences:
def generate(self, temperature=0.85, top_k=40, max_len=6):
```
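
To see what these knobs do mechanically, here is a self-contained sketch of temperature plus top-k sampling on toy logits. It illustrates the mechanism only; it is not the actual NanoGPT sampling code:

```python
import math
import random

def sample_next(logits, temperature=0.85, top_k=3, rng=None):
    """Sample one sign from logits with temperature and top-k filtering."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    # Keep only the top_k highest-scoring candidates.
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    exps = {s: math.exp(v / temperature) for s, v in top}
    total = sum(exps.values())
    r, acc = rng.random(), 0.0
    for sign, e in exps.items():
        acc += e / total
        if r <= acc:
            return sign
    return sign  # guard against float rounding

# Toy logits over five hypothetical signs.
logits = {"T177": 3.1, "T243": 2.4, "T653": 1.9, "T684": 1.2, "T650": 0.8}
print(sample_next(logits, temperature=0.85, top_k=3))   # favors common signs
print(sample_next(logits, temperature=1.40, top_k=5))   # rare signs appear more often
```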

---

## Pros and cons of tuning

| Setting | Effect | Good for | Watch out for |
|---|---|---|---|
| Temperature 0.7–0.8 | Very focused, repeats common signs | High quality outputs | Low diversity |
| Temperature 0.9–1.0 | Balanced — default | General use | Nothing |
| Temperature 1.1–1.3 | More variety, rare signs appear | Exploring vocabulary | Some unusual sequences |
| Temperature above 1.4 | Very random | Stress testing | Most sequences fail the quality gate |
| Threshold 0.85 | Strict — default | Publication quality | Slower generation |
| Threshold 0.75 | Relaxed | Larger datasets | Lower average quality |
| Threshold 0.92 | Very strict | Highest confidence only | Very few sequences pass |
| max_len 6 | Short sequences | Matching real length distribution | Misses complex patterns |
| max_len 20+ | Long sequences | Complex grammar patterns | Not representative of real seals |

---

## Displaying Indus glyphs

Sequences use sign IDs like T638 and T177. To see the actual glyphs:

1. Search for **indus-brahmi-font** and download it
2. Install the font on your system (on most systems, double-click the .ttf file)
3. The `glyphs` field in the output shows the rendered glyph characters
4. Open `data/id_to_glyph.json` to see the full sign-to-character mapping

Without the font installed, glyphs render as boxes or question marks.
The sign IDs (T638, T177, etc.) always work regardless of the font.
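
To inspect the mapping programmatically, a small sketch that converts sign IDs to glyphs via `data/id_to_glyph.json`. The `"T638"`-style key format is an assumption; check the file if lookups come back empty:

```python
import json
import os

def to_glyphs(sequence: str, mapping: dict) -> str:
    """Map space-separated sign IDs to glyph characters."""
    # Fall back to the sign ID itself when a glyph is missing.
    return "".join(mapping.get(sign, sign) for sign in sequence.split())

# Use the repo's mapping file when running from the repo root.
if os.path.exists("data/id_to_glyph.json"):
    with open("data/id_to_glyph.json", encoding="utf-8") as f:
        id_to_glyph = json.load(f)
    print(to_glyphs("T638 T177 T420 T122", id_to_glyph))
```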

---

## Repo structure

```
indus-script-models/
├── inference.py          run this for all tasks
├── indus_ngram.py        required by ngram_model.pkl — do not move
├── README.md
├── models/
│   ├── nanogpt_indus.pt  NanoGPT generator (153K params, PPL 13.3)
│   ├── ngram_model.pkl   N-gram RTL model (88.2% pairwise accuracy)
│   ├── mlm/              TinyBERT masked language model (val loss 2.06)
│   ├── cls/              TinyBERT classifier (89.0% test accuracy)
│   ├── electra/          ELECTRA discriminator (95.1% token accuracy)
│   └── deberta/          DeBERTa discriminator (87.1% test accuracy)
└── data/
    ├── id_to_glyph.json  641 sign ID → glyph character mappings
    └── indus_tokenizer/  custom tokenizer for Indus Script
```

---

## How the pipeline works

**Stage 1 — Train on 3,310 real inscriptions:**

Five models trained independently, each learning a different aspect of grammar:

- **TinyBERT MLM** — learns which signs can fill a masked position in a sequence
- **TinyBERT Classifier** — learns to tell valid sequences from corrupted ones
- **N-gram RTL** — learns right-to-left transition probabilities between signs
- **ELECTRA** — learns token-level discrimination between real and fake signs
- **NanoGPT** — learns to generate new sequences from scratch

**Stage 2 — Generate and filter:**

NanoGPT generates candidate sequences in RTL order, then flips them to LTR.
Each candidate is scored by three models: BERT (50%) + N-gram (25%) + ELECTRA (25%).
Only sequences scoring 85% or higher are kept as valid synthetic sequences.
Sequences that exactly match real inscriptions are set aside as seal reproductions.
Result: 5,000 novel sequences, with 752 exact seal matches as validation evidence.

**Stage 3 — Retrain on combined data:**

The 5,000 synthetic sequences were combined with the 3,310 real sequences (8,310 total)
and all models were retrained on the larger dataset. Results improved significantly:

| Model | Before | After |
|---|---|---|
| TinyBERT accuracy | 78.4% | 89.0% |
| NanoGPT perplexity | 32.5 | 13.3 |
| DeBERTa accuracy | 80.5% | 87.1% |

The final 5,000 sequences in the dataset were generated with these retrained models.

---

## Key findings

- **RTL reading confirmed** — right-to-left shows 12% stronger grammatical structure than LTR
- **Grammar proven** — entropy chain H1 → H2 → H3 = 6.03 → 3.41 → 2.39 bits (language-like decay)
- **Zipf's law confirmed** — R² = 0.968, a language-like token distribution
- **752 seal reproductions** — the model independently reproduced real archaeological inscriptions
- **Sign roles discovered:**
  - PREFIX signs at the reading end: T638, T604, T406, T496
  - SUFFIX signs at the reading start: T123, T122, T701, T741
  - CORE signs in the middle: T101, T268, T177, T243
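
The entropy-decay idea can be illustrated on a toy corpus. A minimal sketch (not the project's evaluation code) of H1 (unigram entropy) and H2 (conditional entropy of a sign given the previous sign); H2 < H1 indicates sequential structure:

```python
import math
from collections import Counter

def unigram_entropy(seqs):
    """H1: entropy of the sign distribution, in bits."""
    counts = Counter(s for seq in seqs for s in seq)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(seqs):
    """H2: entropy of the next sign given the previous sign, in bits."""
    pairs = Counter((a, b) for seq in seqs for a, b in zip(seq, seq[1:]))
    ctx = Counter()
    for (a, _b), c in pairs.items():
        ctx[a] += c
    total = sum(pairs.values())
    return -sum((c / total) * math.log2(c / ctx[a])
                for (a, _b), c in pairs.items())

# Toy corpus with repeated sequential patterns.
corpus = [
    ["T638", "T177", "T122"],
    ["T638", "T243", "T122"],
    ["T638", "T177", "T122"],
]
h1, h2 = unigram_entropy(corpus), conditional_entropy(corpus)
print(f"H1 = {h1:.2f} bits, H2 = {h2:.2f} bits")  # H2 < H1: language-like decay
```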

---

## Known limitations

**DeBERTa calibration issue:**
DeBERTa scores near zero for all sequences due to a confidence-calibration failure.
It is logged in the output but excluded from the quality gate;
BERT, N-gram, and ELECTRA handle all scoring.

**Vocabulary coverage:**
Only about 26% of the 641 known Indus signs appear reliably in generated sequences.
475 signs appear 10 times or fewer in the real corpus — too rare for the model to learn.
This is a property of the archaeological record, not a model bug:
no synthetic corpus can reliably generate signs that barely exist in the training data.
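
The rarity effect is easy to measure on any corpus of tokenized sequences. A toy sketch using a frequency cutoff (the real analysis runs over the 3,310-inscription corpus):

```python
from collections import Counter

def rare_signs(seqs, cutoff=10):
    """Return signs occurring `cutoff` times or fewer, sorted by ID."""
    counts = Counter(sign for seq in seqs for sign in seq)
    return sorted(sign for sign, c in counts.items() if c <= cutoff)

# Toy corpus; T177, T243, and T999 each appear only once.
corpus = [["T638", "T177", "T122"], ["T638", "T243", "T122"], ["T999"]]
print(rare_signs(corpus, cutoff=1))  # ['T177', 'T243', 'T999']
```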

**Short sequences:**
The model rarely generates length-2 sequences even though they are common in real inscriptions.
If you need shorter outputs, set `max_len=4` in the `generate()` method.

---

## Dataset

The 5,000 synthetic sequences, with full scores and a sign index, are available at:

[hellosindh/indus-script-synthetic](https://huggingface.co/datasets/hellosindh/indus-script-synthetic)