---
language:
- und
license: cc-by-4.0
tags:
- indus-script
- ancient-scripts
- archaeology
- nlp
- text-generation
- sequence-modeling
- grammar-analysis
- undeciphered-script
library_name: transformers
pipeline_tag: text-generation
---

# Indus Script Models

Trained models for validating, predicting, and generating sequences in the undeciphered Indus Valley Script (2600–1900 BCE). Built on 3,310 real archaeological inscriptions.

---

## Quick Start (3 steps)

```bash
# Step 1: Clone the repo
git clone https://huggingface.co/hellosindh/indus-script-models
cd indus-script-models

# Step 2: Install dependencies
pip install torch transformers

# Step 3: Run the demo
python inference.py --task demo
```

---

## What you can do

### 1. Validate a sequence

Is this inscription grammatically valid?

```bash
python inference.py --task validate --sequence "T638 T177 T420 T122"
```

Output:

```
Sequence : T638 T177 T420 T122
BERT     : 0.9650
N-gram   : 0.8930
ELECTRA  : 0.9410
Ensemble : 0.9410
Verdict  : VALID (>=85%)
```

### 2. Predict a masked sign

What sign most likely fills the missing position?

```bash
python inference.py --task predict --sequence "T638 [MASK] T420 T122"
```

Output:

```
Position 1 predictions:
  T177  18.3%
  T243  12.1%
  T653   9.4%
  T684   7.2%
  T650   5.8%
```

### 3. Generate new sequences

```bash
# Generate 10 sequences (default threshold 85%)
python inference.py --task generate --count 10

# More variety, less strict
python inference.py --task generate --count 20 --threshold 0.78

# High quality only
python inference.py --task generate --count 5 --threshold 0.92
```

### 4. Score any sequence

```bash
python inference.py --task score --sequence "T604 T123 T609"
```

---

## Generating more diverse or longer sequences

Open `inference.py` and find the `task_generate` function.
Change the temperature list:

**More random (forces rare signs to appear):**

```python
# Change this line:
temps = [0.85, 0.90, 1.00, 1.10]

# To:
temps = [1.10, 1.20, 1.30, 1.40]
```

**Longer sequences:**

Find the `generate()` method inside `load_nanogpt()` and change `max_len`:

```python
# Default (avg 7 signs):
def generate(self, temperature=0.85, top_k=40, max_len=15):

# For longer sequences:
def generate(self, temperature=0.85, top_k=40, max_len=25):

# For shorter sequences:
def generate(self, temperature=0.85, top_k=40, max_len=6):
```

---

## Pros and cons of tuning

| Setting | Effect | Good for | Watch out for |
|---|---|---|---|
| Temperature 0.7–0.8 | Very focused, repeats common signs | High quality outputs | Low diversity |
| Temperature 0.9–1.0 | Balanced (default) | General use | Nothing |
| Temperature 1.1–1.3 | More variety, rare signs appear | Exploring vocabulary | Some unusual sequences |
| Temperature above 1.4 | Very random | Stress testing | Most sequences fail quality gate |
| Threshold 0.85 | Strict (default) | Publication quality | Slower generation |
| Threshold 0.75 | Relaxed | Larger datasets | Lower average quality |
| Threshold 0.92 | Very strict | Highest confidence only | Very few sequences pass |
| max_len 6 | Short sequences | Matching real length distribution | Misses complex patterns |
| max_len 20+ | Long sequences | Complex grammar patterns | Not representative of real seals |

---

## Displaying Indus glyphs

Sequences use sign IDs like T638, T177. To see actual glyphs:

1. Search for **indus-brahmi-font** and download it
2. The `glyphs` field in output shows the rendered glyph characters
3. Open `data/id_to_glyph.json` to see the full sign-to-character mapping
4. If you want to see the mapping with T-prefixed IDs, open `data/indus_tokenizer/indus_id_map.json`

Without the font installed, glyphs show as boxes or question marks. The sign IDs (T638, T177, etc.) always work regardless of font.

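
The lookup described in the steps above can be sketched in a few lines. This is a minimal illustration, assuming the mapping file is a flat JSON object from sign key to glyph string and that its keys match the IDs you pass in; neither is confirmed here, so inspect the JSON files first. The helper names `load_mapping` and `render_glyphs` are hypothetical and not part of `inference.py`.

```python
import json
from pathlib import Path


def load_mapping(path):
    """Load a JSON object mapping sign keys to glyph characters (assumed layout)."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def render_glyphs(sequence, mapping, missing="?"):
    """Replace each whitespace-separated sign ID with its glyph.

    IDs absent from the mapping are shown as `missing` instead of raising.
    """
    return "".join(mapping.get(sign, missing) for sign in sequence.split())


# Toy mapping for illustration only; real glyph characters come from the JSON file.
demo_mapping = {"T638": "A", "T177": "B"}
demo_output = render_glyphs("T638 T177 T420", demo_mapping)
```

Once the actual key format is confirmed, pass the dictionary returned by `load_mapping` in place of `demo_mapping`. Falling back to a placeholder keeps rare or unmapped signs visible in the output.
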

---

## Repo structure

```
indus-script-models/
├── inference.py            run this for all tasks
├── indus_ngram.py          required by ngram_model.pkl (do not move)
├── README.md
├── models/
│   ├── nanogpt_indus.pt    NanoGPT generator (153K params, PPL 13.3)
│   ├── ngram_model.pkl     N-gram RTL model (88.2% pairwise accuracy)
│   ├── mlm/                TinyBERT masked language model (val loss 2.06)
│   ├── cls/                TinyBERT classifier (89.0% test accuracy)
│   ├── electra/            ELECTRA discriminator (95.1% token accuracy)
│   └── deberta/            DeBERTa discriminator (87.1% test accuracy)
└── data/
    ├── id_to_glyph.json    641 sign ID to glyph character mappings
    └── indus_tokenizer/    custom tokenizer for Indus Script
```

---

## How the pipeline works

**Stage 1: Train on 3,310 real inscriptions**

Five models were trained independently, each learning a different aspect of grammar:

- **TinyBERT MLM**: learns which sign can fill a masked position in a sequence
- **TinyBERT Classifier**: learns to tell valid sequences from corrupted ones
- **N-gram RTL**: learns right-to-left transition probabilities between signs
- **ELECTRA**: learns token-level discrimination between real and fake signs
- **NanoGPT**: learns to generate new sequences from scratch

**Stage 2: Generate and filter**

NanoGPT generates candidate sequences in RTL order, then flips them to LTR. Each candidate is scored by three models: BERT (50%) + N-gram (25%) + ELECTRA (25%). Only sequences scoring 85% or higher are kept as valid synthetic sequences. Sequences that exactly match real inscriptions are separated as seal reproductions.

Result: 5,000 novel sequences with 752 exact seal matches as validation evidence.

**Stage 3: Retrain on combined data**

The 5,000 synthetic sequences were combined with 3,310 real sequences (8,310 total). All models were retrained on the larger dataset.
Results improved significantly:

| Model | Before | After |
|---|---|---|
| TinyBERT accuracy | 78.4% | 89.0% |
| NanoGPT perplexity | 32.5 | 13.3 |
| DeBERTa accuracy | 80.5% | 87.1% |

The final 5,000 sequences in the dataset were generated with these retrained models.

---

## Key findings

- **RTL reading confirmed**: right-to-left has 12% stronger grammatical structure than LTR
- **Grammar proven**: entropy chain H1 to H2 to H3 = 6.03 to 3.41 to 2.39 bits (language-like decay)
- **Zipf law confirmed**: R² = 0.968, language-like token distribution
- **752 seal reproductions**: model independently reproduced real archaeological inscriptions
- **Sign roles discovered:**
  - PREFIX signs at reading end: T638, T604, T406, T496
  - SUFFIX signs at reading start: T123, T122, T701, T741
  - CORE signs in the middle: T101, T268, T177, T243

---

## Known limitations

**DeBERTa calibration issue:** DeBERTa scores near-zero for all sequences due to confidence calibration failure. It is logged in output but excluded from the quality gate. BERT, N-gram, and ELECTRA handle all scoring.

**Vocabulary coverage:** Only about 26% of the 641 known Indus signs appear reliably in generated sequences. 475 signs appear 10 times or fewer in the real corpus, which is too rare for the model to learn. This is a property of the archaeological record, not a model bug. No synthetic corpus can reliably generate signs that barely exist in the training data.

**Short sequences:** The model rarely generates length-2 sequences even though they are common in real inscriptions. If you need shorter outputs, set `max_len=4` in the generate function.

---

## Dataset

The 5,000 synthetic sequences with full scores and sign index are available at:
[hellosindh/indus-script-synthetic](https://huggingface.co/datasets/hellosindh/indus-script-synthetic)
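
If you want to re-check a verdict from the per-model scores, the Stage 2 quality gate can be sketched as a weighted average. This assumes the ensemble is a plain weighted sum with the stated 50/25/25 weights, which reproduces the 0.9410 shown in the validation example; `inference.py` may compute it differently, and the function names below are hypothetical.

```python
# Weights as stated in "How the pipeline works"; the combination rule is an assumption.
WEIGHTS = {"bert": 0.50, "ngram": 0.25, "electra": 0.25}


def ensemble_score(bert, ngram, electra):
    """Weighted average of the three scorer outputs."""
    return (
        WEIGHTS["bert"] * bert
        + WEIGHTS["ngram"] * ngram
        + WEIGHTS["electra"] * electra
    )


def passes_gate(score, threshold=0.85):
    """True when the ensemble score meets the quality threshold."""
    return score >= threshold


# Scores taken from the validation example earlier in this README.
example = ensemble_score(bert=0.9650, ngram=0.8930, electra=0.9410)
example_verdict = "VALID" if passes_gate(example) else "INVALID"
```

Raising `threshold` to 0.92 or lowering it to 0.75 mirrors the `--threshold` flag used in the generate task.
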