---
language:
- und
license: cc-by-4.0
tags:
- indus-script
- ancient-scripts
- archaeology
- nlp
- text-generation
- sequence-modeling
- grammar-analysis
- undeciphered-script
library_name: transformers
pipeline_tag: text-generation
---
# Indus Script Models
Trained models for validating, predicting, and generating sequences in the undeciphered
Indus Valley Script (2600–1900 BCE). Built on 3,310 real archaeological inscriptions.
---
## Quick Start (3 steps)
```bash
# Step 1 - Clone the repo
git clone https://huggingface.co/hellosindh/indus-script-models
cd indus-script-models
# Step 2 - Install dependencies
pip install torch transformers
# Step 3 - Run the demo
python inference.py --task demo
```
---
## What you can do
### 1. Validate a sequence
Is this inscription grammatically valid?
```bash
python inference.py --task validate --sequence "T638 T177 T420 T122"
```
Output:
```
Sequence : T638 T177 T420 T122
BERT : 0.9650
N-gram : 0.8930
ELECTRA : 0.9410
Ensemble : 0.9410
Verdict : VALID (>=85%)
```
### 2. Predict a masked sign
What sign most likely fills the missing position?
```bash
python inference.py --task predict --sequence "T638 [MASK] T420 T122"
```
Output:
```
Position 1 predictions:
T177 18.3%
T243 12.1%
T653 9.4%
T684 7.2%
T650 5.8%
```
### 3. Generate new sequences
```bash
# Generate 10 sequences (default threshold 85%)
python inference.py --task generate --count 10
# More variety, less strict
python inference.py --task generate --count 20 --threshold 0.78
# High quality only
python inference.py --task generate --count 5 --threshold 0.92
```
### 4. Score any sequence
```bash
python inference.py --task score --sequence "T604 T123 T609"
```
---
## Generating more diverse or longer sequences
Open `inference.py` and find the `task_generate` function. Change the temperature list:
**More random - forces rare signs to appear:**
```python
# Change this line:
temps = [0.85, 0.90, 1.00, 1.10]
# To:
temps = [1.10, 1.20, 1.30, 1.40]
```
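
To see why higher temperatures let rare signs through, the sketch below applies temperature scaling to a toy set of logits. The logits are made-up illustration values, not output from `nanogpt_indus.pt`, and `softmax_with_temperature` is a hypothetical helper rather than a function in `inference.py`.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three signs (illustrative values only).
toy_logits = {"T177": 3.0, "T243": 2.0, "T653": 0.5}

for temp in (0.85, 1.40):
    probs = softmax_with_temperature(list(toy_logits.values()), temp)
    row = ", ".join(f"{sign} {p:.2f}" for sign, p in zip(toy_logits, probs))
    print(f"temperature={temp}: {row}")
```

At the higher temperature the most likely sign loses probability mass and the least likely sign gains it, which is the effect the edited `temps` list relies on.
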
**Longer sequences:**
Find the `generate()` method inside `load_nanogpt()` and change `max_len`:
```python
# Default (avg 7 signs):
def generate(self, temperature=0.85, top_k=40, max_len=15):
# For longer sequences:
def generate(self, temperature=0.85, top_k=40, max_len=25):
# For shorter sequences:
def generate(self, temperature=0.85, top_k=40, max_len=6):
```
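
As a rough illustration of how `top_k` and `max_len` interact, here is a minimal sampling loop over a toy distribution. It is a sketch under stated assumptions, not the code in `load_nanogpt()`: `top_k_filter`, `sample_sequence`, the `next_probs` callback, and the `<EOS>` end token are all hypothetical names introduced for this example.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable signs and renormalise their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {sign: p / total for sign, p in kept}

def sample_sequence(next_probs, max_len, top_k, end_token="<EOS>", seed=0):
    """Sample signs until the end token appears or max_len signs are produced."""
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < max_len:
        probs = top_k_filter(next_probs(sequence), top_k)
        signs = list(probs)
        sign = rng.choices(signs, weights=[probs[s] for s in signs])[0]
        if sign == end_token:  # assumed stop condition
            break
        sequence.append(sign)
    return sequence

def toy_next_probs(prefix):
    """Stand-in for the model: the same distribution at every step."""
    return {"T638": 0.4, "T177": 0.3, "T420": 0.2, "<EOS>": 0.1}

print(sample_sequence(toy_next_probs, max_len=6, top_k=3))
```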
---
## Pros and cons of tuning
| Setting | Effect | Good for | Watch out for |
|---|---|---|---|
| Temperature 0.7–0.8 | Very focused, repeats common signs | High quality outputs | Low diversity |
| Temperature 0.9–1.0 | Balanced (default) | General use | Nothing |
| Temperature 1.1–1.3 | More variety, rare signs appear | Exploring vocabulary | Some unusual sequences |
| Temperature above 1.4 | Very random | Stress testing | Most sequences fail quality gate |
| Threshold 0.85 | Strict (default) | Publication quality | Slower generation |
| Threshold 0.75 | Relaxed | Larger datasets | Lower average quality |
| Threshold 0.92 | Very strict | Highest confidence only | Very few sequences pass |
| max_len 6 | Short sequences | Matching real length distribution | Misses complex patterns |
| max_len 20+ | Long sequences | Complex grammar patterns | Not representative of real seals |
---
## Displaying Indus glyphs
Sequences use sign IDs like T638, T177. To see actual glyphs:
1. Search for **indus-brahmi-font** and download it
2. The `glyphs` field in output shows the rendered glyph characters
3. Open `data/id_to_glyph.json` to see the full sign to character mapping
4. If you want to see the mapping with T-prefixed sign IDs, open `data/indus_tokenizer/indus_id_map.json`
Without the font installed, glyphs show as boxes or question marks.
The sign IDs (T638, T177 etc.) always work regardless of font.
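
The lookup described above can be sketched in a few lines. This assumes `data/id_to_glyph.json` is a flat JSON object mapping each sign ID to a glyph string; the real file layout and key format may differ, so the helper tries the ID both with and without the leading `T`. `load_glyph_map` and `render_glyphs` are hypothetical names, and the demo uses a small stand-in dictionary with placeholder letters instead of the real file.

```python
import json
from pathlib import Path

def load_glyph_map(path="data/id_to_glyph.json"):
    """Read the sign-to-glyph mapping (assumed to be a flat JSON object)."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)

def render_glyphs(sequence, glyph_map, missing="?"):
    """Replace each sign ID in a space-separated sequence with its glyph."""
    glyphs = []
    for sign_id in sequence.split():
        # Try the ID as written, then without the leading "T".
        bare_id = sign_id.lstrip("T")
        glyph = glyph_map.get(sign_id, glyph_map.get(bare_id, missing))
        glyphs.append(str(glyph))
    return " ".join(glyphs)

# Stand-in mapping with placeholder letters; use load_glyph_map() for the real one.
toy_map = {"638": "A", "177": "B"}
print(render_glyphs("T638 T177 T420", toy_map))
```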
---
## Repo structure
```
indus-script-models/
├── inference.py           run this for all tasks
├── indus_ngram.py         required by ngram_model.pkl - do not move
├── README.md
├── models/
│   ├── nanogpt_indus.pt   NanoGPT generator (153K params, PPL 13.3)
│   ├── ngram_model.pkl    N-gram RTL model (88.2% pairwise accuracy)
│   ├── mlm/               TinyBERT masked language model (val loss 2.06)
│   ├── cls/               TinyBERT classifier (89.0% test accuracy)
│   ├── electra/           ELECTRA discriminator (95.1% token accuracy)
│   └── deberta/           DeBERTa discriminator (87.1% test accuracy)
└── data/
    ├── id_to_glyph.json   641 sign ID to glyph character mappings
    └── indus_tokenizer/   custom tokenizer for Indus Script
```
---
## How the pipeline works
**Stage 1 - Train on 3,310 real inscriptions:**
Five models are trained independently, each learning a different aspect of grammar:
- **TinyBERT MLM**: learns which sign can fill a masked position in a sequence
- **TinyBERT Classifier**: learns to tell valid sequences from corrupted ones
- **N-gram RTL**: learns right-to-left transition probabilities between signs
- **ELECTRA**: learns token-level discrimination between real and fake signs
- **NanoGPT**: learns to generate new sequences from scratch
**Stage 2 - Generate and filter:**
NanoGPT generates candidate sequences in RTL order, then flips them to LTR.
Each candidate is scored by three models: BERT (50%) + N-gram (25%) + ELECTRA (25%).
Only sequences scoring 85% or higher are kept as valid synthetic sequences.
Sequences that exactly match real inscriptions are separated as seal reproductions.
Result: 5,000 novel sequences with 752 exact seal matches as validation evidence.
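
The scoring and filtering step can be sketched as follows. The weights (BERT 50%, N-gram 25%, ELECTRA 25%) and the 85% threshold come from the description above; everything else is an assumption made for illustration, including the function names, the order of the checks, and the idea that real inscriptions are stored in LTR order. It is not the code in `inference.py`.

```python
WEIGHTS = {"bert": 0.50, "ngram": 0.25, "electra": 0.25}

def ensemble_score(scores, weights=WEIGHTS):
    """Weighted average of the three per-model scores."""
    return sum(weights[name] * scores[name] for name in weights)

def triage(candidates, real_inscriptions, threshold=0.85):
    """Split gated candidates into novel sequences and seal reproductions.

    candidates: list of (signs_in_rtl_order, per_model_scores) pairs.
    """
    real = {tuple(seq) for seq in real_inscriptions}
    novel, reproductions = [], []
    for rtl_signs, scores in candidates:
        if ensemble_score(scores) < threshold:
            continue  # fails the quality gate
        ltr = tuple(reversed(rtl_signs))  # flip generation order to LTR
        (reproductions if ltr in real else novel).append(ltr)
    return novel, reproductions

# Scores taken from the validate example earlier in this README.
example = {"bert": 0.9650, "ngram": 0.8930, "electra": 0.9410}
print(f"Ensemble: {ensemble_score(example):.4f}")  # prints "Ensemble: 0.9410"
```

With those three scores the weighted sum is 0.4825 + 0.22325 + 0.23525 = 0.9410, matching the `Ensemble` line in the validate output.
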
**Stage 3 - Retrain on combined data:**
The 5,000 synthetic sequences were combined with 3,310 real sequences (8,310 total).
All models were retrained on the larger dataset. Results improved significantly:
| Model | Before | After |
|---|---|---|
| TinyBERT accuracy | 78.4% | 89.0% |
| NanoGPT perplexity | 32.5 | 13.3 |
| DeBERTa accuracy | 80.5% | 87.1% |
The final 5,000 sequences in the dataset were generated with these retrained models.
---
## Key findings
- **RTL reading confirmed**: right-to-left has 12% stronger grammatical structure than LTR
- **Grammar proven**: entropy falls from H1 = 6.03 to H2 = 3.41 to H3 = 2.39 bits (language-like decay)
- **Zipf law confirmed**: R² = 0.968, language-like token distribution
- **752 seal reproductions**: model independently reproduced real archaeological inscriptions
- **Sign roles discovered:**
- PREFIX signs at reading end: T638, T604, T406, T496
- SUFFIX signs at reading start: T123, T122, T701, T741
- CORE signs in the middle: T101, T268, T177, T243
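
For readers who want to compute an entropy chain like the one above on their own data, here is a minimal sketch. It assumes H1 is the unigram entropy and that H2 and H3 are the entropies of a sign given the previous one and two signs; the README does not spell out the exact definition, so treat that reading as an assumption. The toy corpus is illustrative and will not reproduce the reported figures.

```python
import math
from collections import Counter

def conditional_entropy(sequences, order):
    """Entropy in bits of a sign given the previous (order - 1) signs."""
    ngrams = Counter()
    for seq in sequences:
        for i in range(len(seq) - order + 1):
            ngrams[tuple(seq[i:i + order])] += 1
    # Count each context as the sum of the n-grams that start with it.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    total = sum(ngrams.values())
    return -sum(
        (count / total) * math.log2(count / contexts[gram[:-1]])
        for gram, count in ngrams.items()
    )

# Toy corpus in reading order (illustrative only).
toy_corpus = [["T638", "T177", "T122"], ["T638", "T177", "T123"], ["T604", "T177", "T122"]]
for order in (1, 2, 3):
    print(f"H{order} = {conditional_entropy(toy_corpus, order):.2f} bits")
```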
---
## Known limitations
**DeBERTa calibration issue:**
DeBERTa returns near-zero scores for all sequences because of a confidence calibration failure.
It is logged in output but excluded from the quality gate.
BERT, N-gram, and ELECTRA handle all scoring.
**Vocabulary coverage:**
Only about 26% of the 641 known Indus signs appear reliably in generated sequences.
475 signs appear 10 times or fewer in the real corpus, too rare for the model to learn.
This is a property of the archaeological record, not a model bug.
No synthetic corpus can reliably generate signs that barely exist in the training data.
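
A frequency count like the one behind these numbers can be sketched as below. The function name and the toy corpus are hypothetical; running it on the real inscriptions would be needed to reproduce the 475 figure.

```python
from collections import Counter

def sign_frequency_report(sequences, rare_cutoff=10):
    """Count each sign and list the ones seen rare_cutoff times or fewer."""
    counts = Counter(sign for seq in sequences for sign in seq)
    rare = sorted(sign for sign, n in counts.items() if n <= rare_cutoff)
    return counts, rare

# Toy corpus with a low cutoff so the effect is visible.
toy_corpus = [["T638", "T177", "T122"], ["T638", "T177"], ["T638", "T999"]]
counts, rare = sign_frequency_report(toy_corpus, rare_cutoff=1)
print(f"{len(rare)} of {len(counts)} signs are rare: {rare}")
```

Signs that never occur in the corpus do not show up in `counts` at all, so they need to be added separately when comparing against the full 641-sign inventory.
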
**Short sequences:**
The model rarely generates length-2 sequences even though they are common in real inscriptions.
If you need shorter outputs, set `max_len=4` in the generate function.
---
## Dataset
The 5,000 synthetic sequences with full scores and sign index are available at:
[hellosindh/indus-script-synthetic](https://huggingface.co/datasets/hellosindh/indus-script-synthetic)