hellosindh committed (verified)
Commit 4409eca · Parent(s): 60e393a

Update README.md

Files changed (1): README.md (+199 -74)

# Indus Script Models

Trained models for validating, predicting, and generating sequences in the undeciphered
Indus Valley Script (2600–1900 BCE). Built on 3,310 real archaeological inscriptions.

---

## Quick Start (3 steps)

```bash
# Step 1 — Clone the repo
git clone https://huggingface.co/hellosindh/indus-script-models
cd indus-script-models

# Step 2 — Install dependencies
pip install torch transformers

# Step 3 — Run the demo
python inference.py --task demo
```

That is it. No downloads, no configuration. The models are already in the repo.

---

## What you can do

### 1. Validate a sequence

Is this inscription grammatically valid?

```bash
python inference.py --task validate --sequence "T638 T177 T420 T122"
```

Output:
```
Sequence : T638 T177 T420 T122
BERT     : 0.9650
N-gram   : 0.8930
ELECTRA  : 0.9410
Ensemble : 0.9410
Verdict  : VALID (>=85%)
```
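
The ensemble verdict above can be reproduced by hand. A minimal sketch of the weighted combination described in the pipeline section (50% BERT, 25% N-gram, 25% ELECTRA); `ensemble_score` is an illustrative helper, not a function from `inference.py`:

```python
# Illustrative helper, not part of inference.py.
def ensemble_score(bert: float, ngram: float, electra: float) -> float:
    """Weighted ensemble: 50% BERT, 25% N-gram, 25% ELECTRA."""
    return 0.50 * bert + 0.25 * ngram + 0.25 * electra

score = ensemble_score(0.9650, 0.8930, 0.9410)
print(f"Ensemble: {score:.4f}")                  # Ensemble: 0.9410
print("VALID" if score >= 0.85 else "REJECTED")  # VALID
```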

### 2. Predict a masked sign

What sign most likely fills the missing position?

```bash
python inference.py --task predict --sequence "T638 [MASK] T420 T122"
```

Output:
```
Position 1 predictions:
  T177   18.3%
  T243   12.1%
  T653    9.4%
  T684    7.2%
  T650    5.8%
```
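
Conceptually, that prediction list is a softmax over the MLM's logits at the masked position, keeping the top k. A toy sketch with made-up logits (the real model scores all 641 sign tokens):

```python
import math

def top_k_predictions(logits: dict, k: int = 5):
    """Softmax over per-sign logits; return the k most probable signs."""
    z = max(logits.values())  # subtract max for numerical stability
    exps = {s: math.exp(v - z) for s, v in logits.items()}
    total = sum(exps.values())
    ranked = sorted(((s, e / total) for s, e in exps.items()),
                    key=lambda sp: sp[1], reverse=True)
    return ranked[:k]

# Made-up logits for six signs, for illustration only.
logits = {"T177": 2.9, "T243": 2.5, "T653": 2.2,
          "T684": 2.0, "T650": 1.8, "T122": 0.4}
for sign, p in top_k_predictions(logits):
    print(f"{sign:>5}  {p:.1%}")
```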

### 3. Generate new sequences

```bash
# Generate 10 sequences (default threshold 85%)
python inference.py --task generate --count 10

# More variety, less strict
python inference.py --task generate --count 20 --threshold 0.78

# High quality only
python inference.py --task generate --count 5 --threshold 0.92
```

### 4. Score any sequence

```bash
python inference.py --task score --sequence "T604 T123 T609"
```

---

## Generating more diverse or longer sequences

Open `inference.py` and find the `task_generate` function. Change the temperature list:

**More random — forces rare signs to appear:**
```python
# Change this line:
temps = [0.85, 0.90, 1.00, 1.10]
# To:
temps = [1.10, 1.20, 1.30, 1.40]
```

**Longer sequences:**
Find the `generate()` method inside `load_nanogpt()` and change `max_len`:
```python
# Default (avg 7 signs):
def generate(self, temperature=0.85, top_k=40, max_len=15):

# For longer sequences:
def generate(self, temperature=0.85, top_k=40, max_len=25):

# For shorter sequences:
def generate(self, temperature=0.85, top_k=40, max_len=6):
```
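
To see what these knobs do mechanically, here is a self-contained sketch of temperature plus top-k sampling on toy logits. It illustrates the mechanism only; it is not the actual NanoGPT sampling code:

```python
import math
import random

def sample_next(logits, temperature=0.85, top_k=3, rng=None):
    """Sample one sign from logits with temperature and top-k filtering."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    # Keep only the top_k highest-scoring candidates.
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    exps = {s: math.exp(v / temperature) for s, v in top}
    total = sum(exps.values())
    r, acc = rng.random(), 0.0
    for sign, e in exps.items():
        acc += e / total
        if r <= acc:
            return sign
    return sign  # guard against float rounding

# Toy logits over five hypothetical signs.
logits = {"T177": 3.1, "T243": 2.4, "T653": 1.9, "T684": 1.2, "T650": 0.8}
print(sample_next(logits, temperature=0.85, top_k=3))   # favors common signs
print(sample_next(logits, temperature=1.40, top_k=5))   # rare signs appear more often
```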

---

## Pros and cons of tuning

| Setting | Effect | Good for | Watch out for |
|---|---|---|---|
| Temperature 0.7–0.8 | Very focused, repeats common signs | High quality outputs | Low diversity |
| Temperature 0.9–1.0 | Balanced — default | General use | Nothing |
| Temperature 1.1–1.3 | More variety, rare signs appear | Exploring vocabulary | Some unusual sequences |
| Temperature above 1.4 | Very random | Stress testing | Most sequences fail the quality gate |
| Threshold 0.85 | Strict — default | Publication quality | Slower generation |
| Threshold 0.75 | Relaxed | Larger datasets | Lower average quality |
| Threshold 0.92 | Very strict | Highest confidence only | Very few sequences pass |
| max_len 6 | Short sequences | Matching real length distribution | Misses complex patterns |
| max_len 20+ | Long sequences | Complex grammar patterns | Not representative of real seals |

---

## Displaying Indus glyphs

Sequences use sign IDs like T638 and T177. To see the actual glyphs:

1. Search for **indus-brahmi-font** and download it
2. Install the font on your system (on most systems, double-click the .ttf file)
3. The `glyphs` field in the output shows the rendered glyph characters
4. Open `data/id_to_glyph.json` to see the full sign-to-character mapping

Without the font installed, glyphs render as boxes or question marks.
The sign IDs (T638, T177, etc.) always work regardless of the font.
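
To inspect the mapping programmatically, a small sketch that converts sign IDs to glyphs via `data/id_to_glyph.json`. The `"T638"`-style key format is an assumption; check the file if lookups come back empty:

```python
import json
import os

def to_glyphs(sequence: str, mapping: dict) -> str:
    """Map space-separated sign IDs to glyph characters."""
    # Fall back to the sign ID itself when a glyph is missing.
    return "".join(mapping.get(sign, sign) for sign in sequence.split())

# Use the repo's mapping file when running from the repo root.
if os.path.exists("data/id_to_glyph.json"):
    with open("data/id_to_glyph.json", encoding="utf-8") as f:
        id_to_glyph = json.load(f)
    print(to_glyphs("T638 T177 T420 T122", id_to_glyph))
```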

---

## Repo structure

```
indus-script-models/
├── inference.py          run this for all tasks
├── indus_ngram.py        required by ngram_model.pkl — do not move
├── README.md
├── models/
│   ├── nanogpt_indus.pt  NanoGPT generator (153K params, PPL 13.3)
│   ├── ngram_model.pkl   N-gram RTL model (88.2% pairwise accuracy)
│   ├── mlm/              TinyBERT masked language model (val loss 2.06)
│   ├── cls/              TinyBERT classifier (89.0% test accuracy)
│   ├── electra/          ELECTRA discriminator (95.1% token accuracy)
│   └── deberta/          DeBERTa discriminator (87.1% test accuracy)
└── data/
    ├── id_to_glyph.json  641 sign ID → glyph character mappings
    └── indus_tokenizer/  custom tokenizer for Indus Script
```

---

## How the pipeline works

**Stage 1 — Train on 3,310 real inscriptions:**

Five models trained independently, each learning a different aspect of grammar:

- **TinyBERT MLM** — learns which signs can fill a masked position in a sequence
- **TinyBERT Classifier** — learns to tell valid sequences from corrupted ones
- **N-gram RTL** — learns right-to-left transition probabilities between signs
- **ELECTRA** — learns token-level discrimination between real and fake signs
- **NanoGPT** — learns to generate new sequences from scratch

**Stage 2 — Generate and filter:**

NanoGPT generates candidate sequences in RTL order, then flips them to LTR.
Each candidate is scored by three models: BERT (50%) + N-gram (25%) + ELECTRA (25%).
Only sequences scoring 85% or higher are kept as valid synthetic sequences.
Sequences that exactly match real inscriptions are set aside as seal reproductions.
Result: 5,000 novel sequences, with 752 exact seal matches as validation evidence.

**Stage 3 — Retrain on combined data:**

The 5,000 synthetic sequences were combined with the 3,310 real sequences (8,310 total)
and all models were retrained on the larger dataset. Results improved significantly:

| Model | Before | After |
|---|---|---|
| TinyBERT accuracy | 78.4% | 89.0% |
| NanoGPT perplexity | 32.5 | 13.3 |
| DeBERTa accuracy | 80.5% | 87.1% |

The final 5,000 sequences in the dataset were generated with these retrained models.

---

## Key findings

- **RTL reading confirmed** — right-to-left shows 12% stronger grammatical structure than LTR
- **Grammar proven** — entropy chain H1 → H2 → H3 = 6.03 → 3.41 → 2.39 bits (language-like decay)
- **Zipf's law confirmed** — R² = 0.968, a language-like token distribution
- **752 seal reproductions** — the model independently reproduced real archaeological inscriptions
- **Sign roles discovered:**
  - PREFIX signs at the reading end: T638, T604, T406, T496
  - SUFFIX signs at the reading start: T123, T122, T701, T741
  - CORE signs in the middle: T101, T268, T177, T243
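
The entropy-decay idea can be illustrated on a toy corpus. A minimal sketch (not the project's evaluation code) of H1 (unigram entropy) and H2 (conditional entropy of a sign given the previous sign); H2 < H1 indicates sequential structure:

```python
import math
from collections import Counter

def unigram_entropy(seqs):
    """H1: entropy of the sign distribution, in bits."""
    counts = Counter(s for seq in seqs for s in seq)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def conditional_entropy(seqs):
    """H2: entropy of the next sign given the previous sign, in bits."""
    pairs = Counter((a, b) for seq in seqs for a, b in zip(seq, seq[1:]))
    ctx = Counter()
    for (a, _b), c in pairs.items():
        ctx[a] += c
    total = sum(pairs.values())
    return -sum((c / total) * math.log2(c / ctx[a])
                for (a, _b), c in pairs.items())

# Toy corpus with repeated sequential patterns.
corpus = [
    ["T638", "T177", "T122"],
    ["T638", "T243", "T122"],
    ["T638", "T177", "T122"],
]
h1, h2 = unigram_entropy(corpus), conditional_entropy(corpus)
print(f"H1 = {h1:.2f} bits, H2 = {h2:.2f} bits")  # H2 < H1: language-like decay
```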

---

## Known limitations

**DeBERTa calibration issue:**
DeBERTa scores near zero for all sequences due to a confidence-calibration failure.
It is logged in the output but excluded from the quality gate;
BERT, N-gram, and ELECTRA handle all scoring.

**Vocabulary coverage:**
Only about 26% of the 641 known Indus signs appear reliably in generated sequences.
475 signs appear 10 times or fewer in the real corpus — too rare for the model to learn.
This is a property of the archaeological record, not a model bug:
no synthetic corpus can reliably generate signs that barely exist in the training data.
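
The rarity effect is easy to measure on any corpus of tokenized sequences. A toy sketch using a frequency cutoff (the real analysis runs over the 3,310-inscription corpus):

```python
from collections import Counter

def rare_signs(seqs, cutoff=10):
    """Return signs occurring `cutoff` times or fewer, sorted by ID."""
    counts = Counter(sign for seq in seqs for sign in seq)
    return sorted(sign for sign, c in counts.items() if c <= cutoff)

# Toy corpus; T177, T243, and T999 each appear only once.
corpus = [["T638", "T177", "T122"], ["T638", "T243", "T122"], ["T999"]]
print(rare_signs(corpus, cutoff=1))  # ['T177', 'T243', 'T999']
```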

**Short sequences:**
The model rarely generates length-2 sequences even though they are common in real inscriptions.
If you need shorter outputs, set `max_len=4` in the `generate()` method.

---

## Dataset

The 5,000 synthetic sequences, with full scores and a sign index, are available at:

[hellosindh/indus-script-synthetic](https://huggingface.co/datasets/hellosindh/indus-script-synthetic)