# Indus Script Models

Trained models for validating, predicting, and generating sequences in the undeciphered Indus Valley Script (2600–1900 BCE). Built on 3,310 real archaeological inscriptions.

---

## Quick Start (3 steps)
```bash
# Step 1: Clone the repo
git clone https://huggingface.co/hellosindh/indus-script-models
cd indus-script-models

# Step 2: Install dependencies
pip install torch transformers

# Step 3: Run the demo
python inference.py --task demo
```
That is it. No downloads, no configuration. The models are already in the repo.

---

## What you can do
### 1. Validate a sequence

Is this inscription grammatically valid?

```bash
python inference.py --task validate --sequence "T638 T177 T420 T122"
```

Output:

```
Sequence : T638 T177 T420 T122
BERT     : 0.9650
N-gram   : 0.8930
ELECTRA  : 0.9410
Ensemble : 0.9410
Verdict  : VALID (>=85%)
```
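The ensemble line can be reproduced by hand. Below is a minimal sketch of the weighted combination, assuming the 50% BERT + 25% N-gram + 25% ELECTRA split the pipeline documents; the real logic lives in `inference.py`, and the function names here are illustrative only.

```python
# Minimal sketch of the ensemble verdict, assuming the documented
# 50% BERT + 25% N-gram + 25% ELECTRA weighting. Function names are
# illustrative; the actual implementation is in inference.py.

def ensemble_score(bert: float, ngram: float, electra: float) -> float:
    """Weighted combination of the three per-model scores."""
    return 0.50 * bert + 0.25 * ngram + 0.25 * electra

def verdict(score: float, threshold: float = 0.85) -> str:
    """Apply the 85% quality gate."""
    return "VALID" if score >= threshold else "REJECTED"

# Scores from the example output above:
score = ensemble_score(bert=0.9650, ngram=0.8930, electra=0.9410)
print(f"Ensemble : {score:.4f}")   # 0.9410
print(f"Verdict  : {verdict(score)} (>=85%)")
```

Note that 0.50 · 0.9650 + 0.25 · 0.8930 + 0.25 · 0.9410 = 0.9410, matching the sample output.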
### 2. Predict a masked sign

What sign most likely fills the missing position?

```bash
python inference.py --task predict --sequence "T638 [MASK] T420 T122"
```

Output:

```
Position 1 predictions:
  T177  18.3%
  T243  12.1%
  T653   9.4%
  T684   7.2%
  T650   5.8%
```
### 3. Generate new sequences

```bash
# Generate 10 sequences (default threshold 85%)
python inference.py --task generate --count 10

# More variety, less strict
python inference.py --task generate --count 20 --threshold 0.78

# High quality only
python inference.py --task generate --count 5 --threshold 0.92
```
### 4. Score any sequence

```bash
python inference.py --task score --sequence "T604 T123 T609"
```
---

## Generating more diverse or longer sequences

Open `inference.py` and find the `task_generate` function. Change the temperature list:

**More random (forces rare signs to appear):**

```python
# Change this line:
temps = [0.85, 0.90, 1.00, 1.10]
# To:
temps = [1.10, 1.20, 1.30, 1.40]
```
**Longer sequences:**

Find the `generate()` method inside `load_nanogpt()` and change `max_len`:

```python
# Default (avg 7 signs):
def generate(self, temperature=0.85, top_k=40, max_len=15):

# For longer sequences:
def generate(self, temperature=0.85, top_k=40, max_len=25):

# For shorter sequences:
def generate(self, temperature=0.85, top_k=40, max_len=6):
```
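Why does raising the temperature list produce rarer signs? Before sampling, the model's logits are divided by the temperature and passed through a softmax; values above 1 flatten the next-sign distribution, values below 1 sharpen it. The toy logits below are made up for illustration and are not the repo's model:

```python
import math

# Toy illustration of temperature scaling: dividing logits by T before
# softmax flattens (T > 1) or sharpens (T < 1) the sampling distribution.

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 0.5]                 # e.g. toy scores for three signs
cold = softmax_with_temperature(logits, 0.7)
hot = softmax_with_temperature(logits, 1.4)

# Low temperature concentrates mass on the top sign; high spreads it out,
# so rare signs get sampled more often.
print(f"T=0.7 top-sign prob: {cold[0]:.3f}")
print(f"T=1.4 top-sign prob: {hot[0]:.3f}")
```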
---

## Pros and cons of tuning

| Setting | Effect | Good for | Watch out for |
|---|---|---|---|
| Temperature 0.7–0.8 | Very focused, repeats common signs | High-quality outputs | Low diversity |
| Temperature 0.9–1.0 | Balanced (default) | General use | Nothing |
| Temperature 1.1–1.3 | More variety, rare signs appear | Exploring vocabulary | Some unusual sequences |
| Temperature above 1.4 | Very random | Stress testing | Most sequences fail the quality gate |
| Threshold 0.85 | Strict (default) | Publication quality | Slower generation |
| Threshold 0.75 | Relaxed | Larger datasets | Lower average quality |
| Threshold 0.92 | Very strict | Highest confidence only | Very few sequences pass |
| max_len 6 | Short sequences | Matching real length distribution | Misses complex patterns |
| max_len 20+ | Long sequences | Complex grammar patterns | Not representative of real seals |
---

## Displaying Indus glyphs

Sequences use sign IDs like T638, T177. To see actual glyphs:

1. Search for **indus-brahmi-font** and download it
2. Install the font on your system (double-click the .ttf or .woff2 file)
3. The `glyphs` field in the output shows the rendered glyph characters
4. Open `data/id_to_glyph.json` to see the full sign-to-character mapping

Without the font installed, glyphs show as boxes or question marks. The sign IDs (T638, T177, etc.) always work regardless of font.
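A short sketch of how a sequence can be rendered with the mapping file. The dictionary below uses placeholder codepoints, not the real entries; in practice, load `data/id_to_glyph.json` as shown in the comment:

```python
import json

# Sketch of rendering a sequence via data/id_to_glyph.json. The mapping
# below is a stand-in with placeholder private-use glyphs; load the real
# file instead:
#     with open("data/id_to_glyph.json") as f:
#         id_to_glyph = json.load(f)
id_to_glyph = {"T638": "\uE27E", "T177": "\uE0B1", "T420": "\uE1A4"}  # placeholders

def render(sequence: str, mapping: dict) -> str:
    """Map each sign ID to its glyph; unknown IDs fall back to the ID itself."""
    return " ".join(mapping.get(sign, sign) for sign in sequence.split())

print(render("T638 T177 T999", id_to_glyph))  # T999 passes through unchanged
```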
---

## Repo structure

```
indus-script-models/
├── inference.py          run this for all tasks
├── indus_ngram.py        required by ngram_model.pkl; do not move
├── README.md
├── models/
│   ├── nanogpt_indus.pt  NanoGPT generator (153K params, PPL 13.3)
│   ├── ngram_model.pkl   N-gram RTL model (88.2% pairwise accuracy)
│   ├── mlm/              TinyBERT masked language model (val loss 2.06)
│   ├── cls/              TinyBERT classifier (89.0% test accuracy)
│   ├── electra/          ELECTRA discriminator (95.1% token accuracy)
│   └── deberta/          DeBERTa discriminator (87.1% test accuracy)
└── data/
    ├── id_to_glyph.json  641 sign ID to glyph character mappings
    └── indus_tokenizer/  custom tokenizer for the Indus Script
```
---

## How the pipeline works

**Stage 1: Train on 3,310 real inscriptions.**

Five models are trained independently, each learning a different aspect of the grammar:

- **TinyBERT MLM** learns which signs can fill a masked position in a sequence
- **TinyBERT Classifier** learns to tell valid sequences from corrupted ones
- **N-gram RTL** learns right-to-left transition probabilities between signs
- **ELECTRA** learns token-level discrimination between real and fake signs
- **NanoGPT** learns to generate new sequences from scratch
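The N-gram RTL idea can be sketched in a few lines: reverse each inscription so transitions are counted in reading order, then estimate conditional probabilities from the counts. This is a toy stand-in with made-up signs, not the repo's `indus_ngram.py`:

```python
from collections import Counter, defaultdict

# Toy sketch of an RTL bigram model: reverse each inscription so that
# transitions are counted in reading order (right to left), then
# estimate P(next_sign | current_sign). Not the repo's indus_ngram.py.

corpus = [
    ["T638", "T177", "T122"],   # inscriptions stored left-to-right
    ["T638", "T177", "T420"],
]

counts = defaultdict(Counter)
for seq in corpus:
    rtl = seq[::-1]                       # read right-to-left
    for cur, nxt in zip(rtl, rtl[1:]):
        counts[cur][nxt] += 1

def transition_prob(cur: str, nxt: str) -> float:
    """Maximum-likelihood estimate of P(nxt | cur) in RTL reading order."""
    total = sum(counts[cur].values())
    return counts[cur][nxt] / total if total else 0.0

print(transition_prob("T177", "T638"))    # T638 always follows T177 here
```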
**Stage 2: Generate and filter.**

NanoGPT generates candidate sequences in RTL order, then flips them to LTR. Each candidate is scored by three models: BERT (50%) + N-gram (25%) + ELECTRA (25%). Only sequences scoring 85% or higher are kept as valid synthetic sequences. Sequences that exactly match real inscriptions are separated out as seal reproductions. Result: 5,000 novel sequences, with 752 exact seal matches kept as validation evidence.
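The filtering step can be sketched as below. The generator and ensemble scorer are replaced by a hard-coded candidate list with stub scores, and the real-corpus set is a toy stand-in; only the gate-then-split logic reflects the pipeline described above:

```python
# Sketch of the Stage 2 filter. Candidates and scores are stubs standing
# in for NanoGPT and the ensemble; the real corpus here is a toy set.
# Candidates at or above the 85% gate are kept; exact matches to real
# inscriptions are split out as seal reproductions, not counted as novel.

real_corpus = {"T638 T177 T122"}             # toy stand-in for 3,310 seals

candidates = [
    ("T638 T177 T122", 0.96),                # matches a real seal
    ("T638 T243 T122", 0.91),                # novel, passes the gate
    ("T101 T101 T101", 0.42),                # rejected by the gate
]

novel, reproductions = [], []
for seq, score in candidates:
    if score < 0.85:
        continue                             # fails the quality gate
    (reproductions if seq in real_corpus else novel).append(seq)

print(f"novel={len(novel)} reproductions={len(reproductions)}")  # novel=1 reproductions=1
```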
**Stage 3: Retrain on combined data.**

The 5,000 synthetic sequences were combined with the 3,310 real sequences (8,310 total), and all models were retrained on the larger dataset. Results improved significantly:

| Model | Before | After |
|---|---|---|
| TinyBERT accuracy | 78.4% | 89.0% |
| NanoGPT perplexity | 32.5 | 13.3 |
| DeBERTa accuracy | 80.5% | 87.1% |

The final 5,000 sequences in the dataset were generated with these retrained models.
---

## Key findings

- **RTL reading confirmed:** right-to-left order shows 12% stronger grammatical structure than LTR
- **Grammar proven:** entropy chain H1 → H2 → H3 = 6.03 → 3.41 → 2.39 bits (language-like decay)
- **Zipf's law confirmed:** R² = 0.968, a language-like token distribution
- **752 seal reproductions:** the model independently reproduced real archaeological inscriptions
- **Sign roles discovered:**
  - PREFIX signs at the reading end: T638, T604, T406, T496
  - SUFFIX signs at the reading start: T123, T122, T701, T741
  - CORE signs in the middle: T101, T268, T177, T243
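One common way to estimate an entropy chain like the one above: H1 is the unigram entropy, and H2 is the conditional entropy of a sign given its predecessor. The sketch below runs on a three-sequence toy corpus for illustration; the reported 6.03 / 3.41 / 2.39 bits come from the real inscription data, not from this code:

```python
import math
from collections import Counter

# Toy illustration of the entropy chain. H1 = unigram entropy; H2 =
# conditional entropy H(X_n | X_{n-1}) = H(bigram) - H(context).
# A decaying chain means context constrains the next sign, as in language.

corpus = [["A", "B", "C"], ["A", "B", "D"], ["B", "C", "A"]]

def unigram_entropy(seqs):
    counts = Counter(s for seq in seqs for s in seq)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(seqs):
    pairs = Counter((a, b) for seq in seqs for a, b in zip(seq, seq[1:]))
    contexts = Counter(a for (a, _b) in pairs.elements())
    n = sum(pairs.values())
    h_pair = -sum((c / n) * math.log2(c / n) for c in pairs.values())
    h_ctx = -sum((c / n) * math.log2(c / n) for c in contexts.values())
    return h_pair - h_ctx

print(f"H1 = {unigram_entropy(corpus):.2f} bits")
print(f"H2 = {conditional_entropy(corpus):.2f} bits")   # smaller than H1
```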
---

## Known limitations

**DeBERTa calibration issue:**
DeBERTa scores near zero for all sequences due to a confidence calibration failure. It is logged in the output but excluded from the quality gate; BERT, N-gram, and ELECTRA handle all scoring.

**Vocabulary coverage:**
Only about 26% of the 641 known Indus signs appear reliably in generated sequences. 475 signs appear 10 times or fewer in the real corpus, too rare for the models to learn. This is a property of the archaeological record, not a model bug: no synthetic corpus can reliably generate signs that barely exist in the training data.

**Short sequences:**
The model rarely generates length-2 sequences even though they are common in real inscriptions. If you need shorter outputs, set `max_len=4` in the `generate()` function.
---

## Dataset

The 5,000 synthetic sequences, with full scores and a sign index, are available at:

[hellosindh/indus-script-synthetic](https://huggingface.co/datasets/hellosindh/indus-script-synthetic)