Indus Script Models
Trained models for validating, predicting, and generating sequences in the undeciphered Indus Valley Script (2600–1900 BCE). Built on 3,310 real archaeological inscriptions.
Quick Start (3 steps)
# Step 1: Clone the repo
git clone https://huggingface.co/hellosindh/indus-script-models
cd indus-script-models
# Step 2: Install dependencies
pip install torch transformers
# Step 3: Run the demo
python inference.py --task demo
What you can do
1. Validate a sequence
Is this inscription grammatically valid?
python inference.py --task validate --sequence "T638 T177 T420 T122"
Output:
Sequence : T638 T177 T420 T122
BERT : 0.9650
N-gram : 0.8930
ELECTRA : 0.9410
Ensemble : 0.9410
Verdict : VALID (>=85%)
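To use these scores in a script, the printed report can be parsed into a dictionary. This is a minimal sketch that assumes the output keeps the "Name : value" layout shown above; parse_report is a hypothetical helper and is not part of inference.py.

```python
def parse_report(text):
    """Parse 'Name : value' lines into a dict; numeric values become floats."""
    report = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        try:
            report[key] = float(value)
        except ValueError:
            report[key] = value  # the sequence itself or the verdict text
    return report


# Sample text copied from the output shown above.
sample = """Sequence : T638 T177 T420 T122
BERT : 0.9650
N-gram : 0.8930
ELECTRA : 0.9410
Ensemble : 0.9410
Verdict : VALID (>=85%)"""

report = parse_report(sample)
print(report["Ensemble"], report["Verdict"])
```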
2. Predict a masked sign
What sign most likely fills the missing position?
python inference.py --task predict --sequence "T638 [MASK] T420 T122"
Output:
Position 1 predictions:
T177 18.3%
T243 12.1%
T653 9.4%
T684 7.2%
T650 5.8%
3. Generate new sequences
# Generate 10 sequences (default threshold 85%)
python inference.py --task generate --count 10
# More variety, less strict
python inference.py --task generate --count 20 --threshold 0.78
# High quality only
python inference.py --task generate --count 5 --threshold 0.92
4. Score any sequence
python inference.py --task score --sequence "T604 T123 T609"
Generating more diverse or longer sequences
Open inference.py and find the task_generate function. Change the temperature list:
More random (forces rare signs to appear):
# Change this line:
temps = [0.85, 0.90, 1.00, 1.10]
# To:
temps = [1.10, 1.20, 1.30, 1.40]
Longer sequences:
Find the generate() method inside load_nanogpt() and change max_len:
# Default (avg 7 signs):
def generate(self, temperature=0.85, top_k=40, max_len=15):
# For longer sequences:
def generate(self, temperature=0.85, top_k=40, max_len=25):
# For shorter sequences:
def generate(self, temperature=0.85, top_k=40, max_len=6):
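For intuition about what temperature and top_k do, here is a minimal sketch of temperature-scaled top-k sampling over a list of logits. It is not the code in inference.py: sample_next, top_probability, and the toy logits are hypothetical, and the real generate() presumably repeats a step like this up to max_len times.

```python
import math
import random


def sample_next(logits, temperature=0.85, top_k=40, rng=random):
    """Sample one index from logits using temperature scaling and top-k filtering."""
    # Keep only the top_k highest-scoring candidates.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Divide by temperature: below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in ranked]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(ranked, weights=weights, k=1)[0]


def top_probability(logits, temperature):
    """Probability of the most likely candidate after temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    return max(weights) / sum(weights)


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four signs
for t in (0.7, 1.0, 1.4):
    print(t, round(top_probability(toy_logits, t), 3))
```

The probability of the top candidate shrinks as temperature rises, which is why higher values let rarer signs through.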
Pros and cons of tuning
| Setting | Effect | Good for | Watch out for |
|---|---|---|---|
| Temperature 0.7–0.8 | Very focused, repeats common signs | High quality outputs | Low diversity |
| Temperature 0.9–1.0 | Balanced (default) | General use | Nothing |
| Temperature 1.1–1.3 | More variety, rare signs appear | Exploring vocabulary | Some unusual sequences |
| Temperature above 1.4 | Very random | Stress testing | Most sequences fail quality gate |
| Threshold 0.85 | Strict (default) | Publication quality | Slower generation |
| Threshold 0.75 | Relaxed | Larger datasets | Lower average quality |
| Threshold 0.92 | Very strict | Highest confidence only | Very few sequences pass |
| max_len 6 | Short sequences | Matching real length distribution | Misses complex patterns |
| max_len 20+ | Long sequences | Complex grammar patterns | Not representative of real seals |
Displaying Indus glyphs
Sequences use sign IDs like T638, T177. To see actual glyphs:
- Search for indus-brahmi-font and download it
- The glyphs field in the output shows the rendered glyph characters
- Open data/id_to_glyph.json to see the full sign-to-character mapping
- If you want to see the mapping keyed by T-prefixed IDs, open data/indus_tokenizer/indus_id_map.json
Without the font installed, glyphs show as boxes or question marks. The sign IDs (T638, T177 etc.) always work regardless of font.
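A sign sequence can also be converted to glyph characters in a few lines of Python. The sketch below assumes data/id_to_glyph.json is a flat JSON object from sign ID to glyph string; the exact key format (with or without the T prefix) is not documented here, so the hypothetical to_glyphs helper tries both.

```python
import json
from pathlib import Path


def load_mapping(path="data/id_to_glyph.json"):
    """Load the sign-to-glyph table (assumed to be a flat JSON object)."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def to_glyphs(sequence, mapping, unknown="?"):
    """Convert 'T638 T177' into glyph characters, trying keys with and without 'T'."""
    glyphs = []
    for sign in sequence.split():
        glyph = mapping.get(sign, mapping.get(sign.lstrip("T"), unknown))
        glyphs.append(glyph)
    return "".join(glyphs)


# Toy mapping for illustration only; real values come from the JSON file.
toy = {"638": "A", "177": "B"}
print(to_glyphs("T638 T177 T999", toy))  # prints "AB?"
```

With the repo cloned, to_glyphs("T638 T177", load_mapping()) would use the real table, provided the assumption about its structure holds.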
Repo structure
indus-script-models/
├── inference.py            run this for all tasks
├── indus_ngram.py          required by ngram_model.pkl; do not move
├── README.md
├── models/
│   ├── nanogpt_indus.pt    NanoGPT generator (153K params, PPL 13.3)
│   ├── ngram_model.pkl     N-gram RTL model (88.2% pairwise accuracy)
│   ├── mlm/                TinyBERT masked language model (val loss 2.06)
│   ├── cls/                TinyBERT classifier (89.0% test accuracy)
│   ├── electra/            ELECTRA discriminator (95.1% token accuracy)
│   └── deberta/            DeBERTa discriminator (87.1% test accuracy)
└── data/
    ├── id_to_glyph.json    641 sign ID to glyph character mappings
    └── indus_tokenizer/    custom tokenizer for Indus Script
How the pipeline works
Stage 1: Train on 3,310 real inscriptions
Five models were trained independently, each learning a different aspect of grammar:
- TinyBERT MLM: learns which sign can fill a masked position in a sequence
- TinyBERT Classifier: learns to tell valid sequences from corrupted ones
- N-gram RTL: learns right-to-left transition probabilities between signs
- ELECTRA: learns token-level discrimination between real and fake signs
- NanoGPT: learns to generate new sequences from scratch
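To make "right-to-left transition probabilities" concrete, here is a minimal bigram sketch. It is not the model in indus_ngram.py. It assumes sequences are written left to right in text, so each one is reversed before counting; rtl_bigram_probs and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def rtl_bigram_probs(sequences):
    """Estimate P(next | current) over signs read right to left."""
    counts = defaultdict(Counter)
    for seq in sequences:
        signs = seq.split()[::-1]  # reverse so index 0 is the rightmost sign
        for current, nxt in zip(signs, signs[1:]):
            counts[current][nxt] += 1
    return {
        current: {nxt: n / sum(following.values()) for nxt, n in following.items()}
        for current, following in counts.items()
    }


# Toy corpus written left to right; not real inscriptions.
corpus = ["T1 T2 T3", "T4 T2 T3", "T1 T5 T3"]
probs = rtl_bigram_probs(corpus)
print(probs["T3"])
```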
Stage 2: Generate and filter
NanoGPT generates candidate sequences in RTL order, then flips them to LTR. Each candidate is scored by three models: BERT (50%) + N-gram (25%) + ELECTRA (25%). Only sequences scoring 85% or higher are kept as valid synthetic sequences. Sequences that exactly match real inscriptions are separated as seal reproductions. Result: 5,000 novel sequences with 752 exact seal matches as validation evidence.
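The scoring and filtering step can be sketched as follows. The weights and the 0.85 threshold come from the description above; the function names, the data layout, and the choice to apply the quality gate before separating reproductions are assumptions made for illustration.

```python
WEIGHTS = {"bert": 0.50, "ngram": 0.25, "electra": 0.25}


def ensemble_score(scores, weights=WEIGHTS):
    """Weighted average of per-model scores, each in [0, 1]."""
    return sum(weights[name] * scores[name] for name in weights)


def filter_candidates(candidates, real_corpus, threshold=0.85):
    """Split candidates into novel sequences and exact reproductions of real ones."""
    real = set(real_corpus)
    novel, reproductions = [], []
    for rtl_signs, scores in candidates:
        ltr = " ".join(reversed(rtl_signs))  # flip RTL generation order to LTR
        if ensemble_score(scores) < threshold:
            continue  # fails the quality gate
        (reproductions if ltr in real else novel).append(ltr)
    return novel, reproductions


# Scores taken from the validation example earlier in this README.
example = {"bert": 0.9650, "ngram": 0.8930, "electra": 0.9410}
print(round(ensemble_score(example), 4))  # 0.941
```

The printed value matches the Ensemble line in the validation example: 0.5 × 0.9650 + 0.25 × 0.8930 + 0.25 × 0.9410 = 0.9410.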
Stage 3: Retrain on combined data
The 5,000 synthetic sequences were combined with 3,310 real sequences (8,310 total). All models were retrained on the larger dataset. Results improved significantly:
| Model | Before | After |
|---|---|---|
| TinyBERT accuracy | 78.4% | 89.0% |
| NanoGPT perplexity | 32.5 | 13.3 |
| DeBERTa accuracy | 80.5% | 87.1% |
The final 5,000 sequences in the dataset were generated with these retrained models.
Key findings
- RTL reading confirmed: right-to-left has 12% stronger grammatical structure than LTR
- Grammar proven: entropy chain H1 to H2 to H3 = 6.03 to 3.41 to 2.39 bits (language-like decay)
- Zipf law confirmed: R squared = 0.968, language-like token distribution
- 752 seal reproductions: model independently reproduced real archaeological inscriptions
- Sign roles discovered:
  - PREFIX signs at reading end: T638, T604, T406, T496
  - SUFFIX signs at reading start: T123, T122, T701, T741
  - CORE signs in the middle: T101, T268, T177, T243
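One common way to compute an entropy chain like the one above is to take H1 as the unigram entropy and H2, H3 as the conditional entropy of a sign given the one or two signs before it. Whether the figures above were computed exactly this way is an assumption; the sketch below uses a toy corpus and a hypothetical conditional_entropy helper.

```python
import math
from collections import Counter


def conditional_entropy(sequences, order):
    """H(sign | previous `order - 1` signs) in bits, from raw n-gram counts."""
    ngrams, contexts = Counter(), Counter()
    for seq in sequences:
        signs = seq.split()
        for i in range(len(signs) - order + 1):
            gram = tuple(signs[i:i + order])
            ngrams[gram] += 1
            contexts[gram[:-1]] += 1
    total = sum(ngrams.values())
    # H = -sum over n-grams of p(context, sign) * log2 p(sign | context)
    return -sum(
        (n / total) * math.log2(n / contexts[gram[:-1]])
        for gram, n in ngrams.items()
    )


# Toy corpus, not real inscriptions: two signs of context fully determine the next.
corpus = ["A B A C A B A C"]
h1, h2, h3 = (conditional_entropy(corpus, k) for k in (1, 2, 3))
print(h1, h2, h3)
```

In this toy corpus the entropy falls as context grows, which is the same qualitative pattern the finding describes.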
Known limitations
DeBERTa calibration issue: DeBERTa scores near-zero for all sequences due to confidence calibration failure. It is logged in output but excluded from the quality gate. BERT, N-gram, and ELECTRA handle all scoring.
Vocabulary coverage: Only about 26% of the 641 known Indus signs appear reliably in generated sequences. 475 signs appear 10 times or fewer in the real corpus, which is too rare for the model to learn. This is a property of the archaeological record, not a model bug. No synthetic corpus can reliably generate signs that barely exist in the training data.
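The two figures are consistent with each other if the 26% is read as the share of signs that occur more than 10 times in the real corpus. That reading is an assumption; the arithmetic is:

```python
TOTAL_SIGNS = 641  # known signs, from the text above
RARE_SIGNS = 475   # signs seen 10 times or fewer in the real corpus

learnable = TOTAL_SIGNS - RARE_SIGNS
coverage = learnable / TOTAL_SIGNS
print(learnable, f"{coverage:.1%}")  # 166 25.9%
```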
Short sequences:
The model rarely generates length-2 sequences even though they are common in real inscriptions.
If you need shorter outputs, set max_len=4 in the generate function.
Dataset
The 5,000 synthetic sequences with full scores and sign index are available at: