---
language:
- und
license: cc-by-4.0
tags:
- indus-script
- ancient-scripts
- archaeology
- nlp
- text-generation
- sequence-modeling
- grammar-analysis
- undeciphered-script
library_name: transformers
pipeline_tag: text-generation
---

# Indus Script Models

Trained models for validating, predicting, and generating sequences in the undeciphered
Indus Valley Script (2600–1900 BCE). Built on 3,310 real archaeological inscriptions.

---

## Quick Start (3 steps)

```bash
# Step 1 — Clone the repo
git clone https://huggingface.co/hellosindh/indus-script-models
cd indus-script-models

# Step 2 — Install dependencies
pip install torch transformers

# Step 3 — Run the demo
python inference.py --task demo
```

---

## What you can do

### 1. Validate a sequence
Is this inscription grammatically valid?

```bash
python inference.py --task validate --sequence "T638 T177 T420 T122"
```

Output:
```
Sequence  : T638 T177 T420 T122
BERT      : 0.9650
N-gram    : 0.8930
ELECTRA   : 0.9410
Ensemble  : 0.9410
Verdict   : VALID (>=85%)
```

### 2. Predict a masked sign
What sign most likely fills the missing position?

```bash
python inference.py --task predict --sequence "T638 [MASK] T420 T122"
```

Output:
```
Position 1 predictions:
  T177    18.3%
  T243    12.1%
  T653     9.4%
  T684     7.2%
  T650     5.8%
```

### 3. Generate new sequences

```bash
# Generate 10 sequences (default threshold 85%)
python inference.py --task generate --count 10

# More variety, less strict
python inference.py --task generate --count 20 --threshold 0.78

# High quality only
python inference.py --task generate --count 5 --threshold 0.92
```

### 4. Score any sequence

```bash
python inference.py --task score --sequence "T604 T123 T609"
```

---

## Generating more diverse or longer sequences

Open `inference.py` and find the `task_generate` function. Change the temperature list:

**More random (rare signs appear more often):**
```python
# Change this line:
temps = [0.85, 0.90, 1.00, 1.10]
# To:
temps = [1.10, 1.20, 1.30, 1.40]
```

**Longer sequences:**
Find the `generate()` method inside `load_nanogpt()` and change `max_len`:
```python
# Default (avg 7 signs):
def generate(self, temperature=0.85, top_k=40, max_len=15):

# For longer sequences:
def generate(self, temperature=0.85, top_k=40, max_len=25):

# For shorter sequences:
def generate(self, temperature=0.85, top_k=40, max_len=6):
```
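
To see why raising the temperature brings out rarer signs, here is a minimal sketch of temperature scaling with top-k filtering. It is not the code in `inference.py`. The function names and the toy logits are illustrative assumptions, and only the `temperature` and `top_k` defaults are taken from the signature shown above.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_top_k(logits, temperature=0.85, top_k=40, rng=random):
    """Sample one index, considering only the top_k highest logits."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    probs = softmax_with_temperature([logits[i] for i in ranked], temperature)
    return rng.choices(ranked, weights=probs, k=1)[0]

# Hypothetical logits for three signs: common, less common, rare.
toy_logits = [4.0, 2.0, 0.5]
cool = softmax_with_temperature(toy_logits, 0.85)
warm = softmax_with_temperature(toy_logits, 1.30)
print(f"rare sign probability at 0.85: {cool[2]:.4f}, at 1.30: {warm[2]:.4f}")
```

A higher temperature flattens the distribution, so the rare sign gets a larger share of the probability mass. A lower temperature concentrates it on the most common sign.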

---

## Pros and cons of tuning

| Setting | Effect | Good for | Watch out for |
|---|---|---|---|
| Temperature 0.7–0.8 | Very focused, repeats common signs | High quality outputs | Low diversity |
| Temperature 0.9–1.0 | Balanced (default) | General use | No notable drawbacks |
| Temperature 1.1–1.3 | More variety, rare signs appear | Exploring vocabulary | Some unusual sequences |
| Temperature above 1.4 | Very random | Stress testing | Most sequences fail the quality gate |
| Threshold 0.85 | Strict (default) | Publication quality | Slower generation |
| Threshold 0.75 | Relaxed | Larger datasets | Lower average quality |
| Threshold 0.92 | Very strict | Highest confidence only | Very few sequences pass |
| max_len 6 | Short sequences | Matching real length distribution | Misses complex patterns |
| max_len 20+ | Long sequences | Complex grammar patterns | Not representative of real seals |

---

## Displaying Indus glyphs

Sequences use sign IDs like T638, T177. To see actual glyphs:

1. Search for **indus-brahmi-font** and download it
2. The `glyphs` field in output shows the rendered glyph characters
3. Open `data/id_to_glyph.json` to see the full sign-to-character mapping
4. To see the mapping keyed by T-prefixed sign IDs, open `data/indus_tokenizer/indus_id_map.json`

Without the font installed, glyphs show as boxes or question marks.
The sign IDs (T638, T177 etc.) always work regardless of font.
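
If you want to turn a sign-ID sequence into glyph characters yourself, a lookup along these lines may work. It assumes `data/id_to_glyph.json` is a flat JSON object mapping a sign ID string to a glyph string. The schema is not documented here, so check the file first. The helper names are hypothetical, and the code tries each ID both with and without the leading `T` because the key format is unknown.

```python
import json
from pathlib import Path

def load_glyph_map(path="data/id_to_glyph.json"):
    """Load the mapping file. Assumes a flat {sign_id: glyph} JSON object."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)

def render(sequence, glyph_map, missing="?"):
    """Replace each sign ID in a space-separated sequence with its glyph."""
    glyphs = []
    for sign in sequence.split():
        # Try the ID as written, then without the leading "T".
        glyph = glyph_map.get(sign) or glyph_map.get(sign.lstrip("T"))
        glyphs.append(glyph if glyph else missing)
    return " ".join(glyphs)

# Usage, from the repo root and with the font installed:
#   glyph_map = load_glyph_map()
#   print(render("T638 T177 T420 T122", glyph_map))
```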

---

## Repo structure

```
indus-script-models/
├── inference.py              run this for all tasks
├── indus_ngram.py            required by ngram_model.pkl — do not move
├── README.md
├── models/
│   ├── nanogpt_indus.pt      NanoGPT generator (153K params, PPL 13.3)
│   ├── ngram_model.pkl       N-gram RTL model (88.2% pairwise accuracy)
│   ├── mlm/                  TinyBERT masked language model (val loss 2.06)
│   ├── cls/                  TinyBERT classifier (89.0% test accuracy)
│   ├── electra/              ELECTRA discriminator (95.1% token accuracy)
│   └── deberta/              DeBERTa discriminator (87.1% test accuracy)
└── data/
    ├── id_to_glyph.json      641 sign ID to glyph character mappings
    └── indus_tokenizer/      custom tokenizer for Indus Script
```

---

## How the pipeline works

**Stage 1 — Train on 3,310 real inscriptions:**

Five models are trained independently, each learning a different aspect of the grammar:

- **TinyBERT MLM** — learns which sign can fill a masked position in a sequence
- **TinyBERT Classifier** — learns to tell valid sequences from corrupted ones
- **N-gram RTL** — learns right-to-left transition probabilities between signs
- **ELECTRA** — learns token-level discrimination between real and fake signs
- **NanoGPT** — learns to generate new sequences from scratch
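
To make "right-to-left transition probabilities" concrete, here is a minimal bigram sketch. It is not the model stored in `ngram_model.pkl`. The function names, the add-alpha smoothing, and the assumption that each sequence string is reversed before counting are all illustrative.

```python
from collections import Counter

def train_rtl_bigrams(sequences):
    """Count transitions between adjacent signs, reading each sequence right to left."""
    counts = {}
    for seq in sequences:
        signs = seq.split()[::-1]  # reversed, so index 0 is the rightmost sign
        for prev, nxt in zip(signs, signs[1:]):
            counts.setdefault(prev, Counter())[nxt] += 1
    return counts

def transition_prob(counts, prev, nxt, vocab_size, alpha=1.0):
    """P(nxt | prev) with add-alpha smoothing, so unseen pairs keep a small probability."""
    followers = counts.get(prev, Counter())
    total = sum(followers.values())
    return (followers[nxt] + alpha) / (total + alpha * vocab_size)

# Toy corpus with placeholder sign names, not real inscriptions.
toy_counts = train_rtl_bigrams(["A B C", "A B D"])
print(transition_prob(toy_counts, "B", "A", vocab_size=4))
```

A sequence can then be scored by multiplying the transition probabilities along its reading order, or by summing their logarithms.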

**Stage 2 — Generate and filter:**

NanoGPT generates candidate sequences in RTL order, then flips them to LTR.
Each candidate is scored by three models: BERT (50%) + N-gram (25%) + ELECTRA (25%).
Only sequences scoring 85% or higher are kept as valid synthetic sequences.
Sequences that exactly match real inscriptions are separated as seal reproductions.
Result: 5,000 novel sequences, plus 752 exact seal matches set aside as validation evidence.
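
The scoring and filtering step can be sketched as follows. The weights and the 0.85 threshold come from the description above. The function names and the shape of the `scores` dict are illustrative assumptions and do not mirror `inference.py`.

```python
# Ensemble weights and threshold as stated in the pipeline description.
WEIGHTS = {"bert": 0.50, "ngram": 0.25, "electra": 0.25}
THRESHOLD = 0.85

def ensemble_score(scores):
    """Weighted average of the three model scores, each in the range 0 to 1."""
    return sum(weight * scores[name] for name, weight in WEIGHTS.items())

def filter_candidates(candidates, real_inscriptions, threshold=THRESHOLD):
    """Keep candidates that pass the gate, split into novel and exact matches."""
    real = set(real_inscriptions)
    novel, reproductions = [], []
    for sequence, scores in candidates:
        if ensemble_score(scores) < threshold:
            continue  # below the quality gate
        if sequence in real:
            reproductions.append(sequence)
        else:
            novel.append(sequence)
    return novel, reproductions

# Scores from the validation example earlier in this README.
example = {"bert": 0.9650, "ngram": 0.8930, "electra": 0.9410}
print(f"{ensemble_score(example):.4f}")
```

With the scores from the validation example, the weighted sum is 0.5 × 0.9650 + 0.25 × 0.8930 + 0.25 × 0.9410 = 0.9410, which agrees with the `Ensemble` line shown there.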

**Stage 3 — Retrain on combined data:**

The 5,000 synthetic sequences were combined with the 3,310 real sequences (8,310 total).
All models were retrained on the larger dataset. The metrics below improved after retraining:

| Model | Before | After |
|---|---|---|
| TinyBERT accuracy | 78.4% | 89.0% |
| NanoGPT perplexity | 32.5 | 13.3 |
| DeBERTa accuracy | 80.5% | 87.1% |

The final 5,000 sequences in the dataset were generated with these retrained models.

---

## Key findings

- **RTL reading confirmed** — right-to-left has 12% stronger grammatical structure than LTR
- **Grammar proven** — entropy chain H1 to H2 to H3 = 6.03 to 3.41 to 2.39 bits (language-like decay)
- **Zipf law confirmed** — R squared = 0.968, language-like token distribution
- **752 seal reproductions** — model independently reproduced real archaeological inscriptions
- **Sign roles discovered:**
  - PREFIX signs at reading end: T638, T604, T406, T496
  - SUFFIX signs at reading start: T123, T122, T701, T741
  - CORE signs in the middle: T101, T268, T177, T243
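
The entropy chain and the Zipf fit can be estimated on any corpus with a sketch like the one below. The exact definitions behind the figures above are not spelled out here, so this block assumes H_n is the conditional entropy of a sign given the previous n-1 signs, and that the Zipf R squared comes from a least-squares line through log frequency against log rank. The function names are illustrative.

```python
import math
from collections import Counter, defaultdict

def conditional_entropy(sequences, n):
    """H(sign | previous n-1 signs) in bits, estimated from n-gram counts."""
    followers_by_context = defaultdict(Counter)
    for seq in sequences:
        signs = seq.split()
        for i in range(len(signs) - n + 1):
            context = tuple(signs[i:i + n - 1])
            followers_by_context[context][signs[i + n - 1]] += 1
    total = sum(sum(f.values()) for f in followers_by_context.values())
    entropy = 0.0
    for followers in followers_by_context.values():
        context_total = sum(followers.values())
        for count in followers.values():
            entropy -= (count / total) * math.log2(count / context_total)
    return entropy

def zipf_r_squared(sequences):
    """R squared of a straight-line fit to log(frequency) against log(rank)."""
    counts = Counter(sign for seq in sequences for sign in seq.split())
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("need at least two signs with different frequencies")
    return (sxy * sxy) / (sxx * syy)

# Toy corpus with placeholder sign names, not real inscriptions.
toy = ["A B A B", "A B A B"]
print(conditional_entropy(toy, 1), conditional_entropy(toy, 2))
```

A falling sequence of values for n = 1, 2, 3 means each sign becomes more predictable as more context is known, which is the pattern reported above.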

---

## Known limitations

**DeBERTa calibration issue:**
DeBERTa assigns near-zero scores to all sequences because of a confidence calibration failure.
Its score is logged in the output but excluded from the quality gate.
BERT, N-gram, and ELECTRA handle all scoring.

**Vocabulary coverage:**
Only about 26% of the 641 known Indus signs appear reliably in generated sequences.
475 signs appear 10 times or fewer in the real corpus — too rare for the model to learn.
This is a property of the archaeological record, not a model bug.
No synthetic corpus can reliably generate signs that barely exist in the training data.
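
The two figures above are consistent with each other: 641 − 475 = 166 signs occur more than 10 times, which is about 26% of 641. To run the same kind of count on your own list of sequences, a small sketch such as this may help. The function name and the returned keys are illustrative, and signs that never occur in the input are not counted at all.

```python
from collections import Counter

def coverage_report(sequences, rare_cutoff=10):
    """Count distinct signs and how many occur rare_cutoff times or fewer."""
    counts = Counter(sign for seq in sequences for sign in seq.split())
    rare = [sign for sign, count in counts.items() if count <= rare_cutoff]
    return {
        "distinct_signs": len(counts),
        "rare_signs": len(rare),
        "rare_share": len(rare) / len(counts) if counts else 0.0,
    }

# Toy corpus with placeholder sign names and a cutoff of 1 occurrence.
print(coverage_report(["A B", "A C", "A B"], rare_cutoff=1))
```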

**Short sequences:**
The model rarely generates length-2 sequences even though they are common in real inscriptions.
If you need shorter outputs, set `max_len=4` in the generate function.

---

## Dataset

The 5,000 synthetic sequences with full scores and sign index are available at:

[hellosindh/indus-script-synthetic](https://huggingface.co/datasets/hellosindh/indus-script-synthetic)