| --- |
| language: |
| - und |
| license: cc-by-4.0 |
| tags: |
| - indus-script |
| - ancient-scripts |
| - archaeology |
| - nlp |
| - text-generation |
| - sequence-modeling |
| - grammar-analysis |
| - undeciphered-script |
| library_name: transformers |
| pipeline_tag: text-generation |
| --- |
| |
| # Indus Script Models |
|
|
| Trained models for validating, predicting, and generating sequences in the undeciphered |
| Indus Valley Script (2600–1900 BCE). Built on 3,310 real archaeological inscriptions. |
|
|
| --- |
|
|
| ## Quick Start (3 steps) |
|
|
| ```bash |
| # Step 1: Clone the repo |
| git clone https://huggingface.co/hellosindh/indus-script-models |
| cd indus-script-models |
| |
| # Step 2: Install dependencies |
| pip install torch transformers |
| |
| # Step 3: Run the demo |
| python inference.py --task demo |
| ``` |
|
|
| --- |
|
|
| ## What you can do |
|
|
| ### 1. Validate a sequence |
| Is this inscription grammatically valid? |
|
|
| ```bash |
| python inference.py --task validate --sequence "T638 T177 T420 T122" |
| ``` |
|
|
| Output: |
| ``` |
| Sequence : T638 T177 T420 T122 |
| BERT : 0.9650 |
| N-gram : 0.8930 |
| ELECTRA : 0.9410 |
| Ensemble : 0.9410 |
| Verdict : VALID (>=85%) |
| ``` |
|
|
| ### 2. Predict a masked sign |
| What sign most likely fills the missing position? |
|
|
| ```bash |
| python inference.py --task predict --sequence "T638 [MASK] T420 T122" |
| ``` |
|
|
| Output: |
| ``` |
| Position 1 predictions: |
| T177 18.3% |
| T243 12.1% |
| T653 9.4% |
| T684 7.2% |
| T650 5.8% |
| ``` |
|
|
| ### 3. Generate new sequences |
|
|
| ```bash |
| # Generate 10 sequences (default threshold 85%) |
| python inference.py --task generate --count 10 |
| |
| # More variety, less strict |
| python inference.py --task generate --count 20 --threshold 0.78 |
| |
| # High quality only |
| python inference.py --task generate --count 5 --threshold 0.92 |
| ``` |
|
|
| ### 4. Score any sequence |
|
|
| ```bash |
| python inference.py --task score --sequence "T604 T123 T609" |
| ``` |
|
|
| --- |
|
|
| ## Generating more diverse or longer sequences |
|
|
| Open `inference.py` and find the `task_generate` function. Change the temperature list: |
|
|
| **More random (rare signs appear more often):** |
| ```python |
| # Change this line: |
| temps = [0.85, 0.90, 1.00, 1.10] |
| # To: |
| temps = [1.10, 1.20, 1.30, 1.40] |
| ``` |
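To see why a hotter temperature list brings rare signs forward, here is a minimal sketch of temperature-scaled top-k sampling probabilities. It is not the code in `inference.py`: the function name `sampling_probs` and the logit values are hypothetical, chosen only to illustrate the effect.

```python
import math

def sampling_probs(logits, temperature=0.85, top_k=40):
    """Probabilities after top-k filtering and temperature scaling (illustrative)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the indices of the top_k largest logits.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    scaled = [logits[i] / temperature for i in keep]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [0.0] * len(logits)  # signs outside the top_k keep probability 0
    for i, w in zip(keep, weights):
        probs[i] = w / total
    return probs

# Hypothetical logits for three candidate signs: common, less common, rare.
logits = [3.0, 1.5, -1.0]
cold = sampling_probs(logits, temperature=0.85)
hot = sampling_probs(logits, temperature=1.40)
print(f"rare-sign probability at T=0.85: {cold[2]:.4f}")
print(f"rare-sign probability at T=1.40: {hot[2]:.4f}")
```

Dividing the logits by a larger temperature flattens the distribution, so the rare sign receives a larger share of the probability mass, at the cost of more candidates failing the quality gate.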
|
|
| **Longer sequences:** |
| Find the `generate()` method inside `load_nanogpt()` and change `max_len`: |
| ```python |
| # Default (avg 7 signs): |
| def generate(self, temperature=0.85, top_k=40, max_len=15): |
| |
| # For longer sequences: |
| def generate(self, temperature=0.85, top_k=40, max_len=25): |
| |
| # For shorter sequences: |
| def generate(self, temperature=0.85, top_k=40, max_len=6): |
| ``` |
|
|
| --- |
|
|
| ## Pros and cons of tuning |
|
|
| | Setting | Effect | Good for | Watch out for | |
| |---|---|---|---| |
| Temperature 0.7–0.8 | Very focused, repeats common signs | High quality outputs | Low diversity | |
| Temperature 0.9–1.0 | Balanced (default) | General use | Nothing | |
| Temperature 1.1–1.3 | More variety, rare signs appear | Exploring vocabulary | Some unusual sequences | |
| Temperature above 1.4 | Very random | Stress testing | Most sequences fail quality gate | |
| Threshold 0.85 | Strict (default) | Publication quality | Slower generation | |
| | Threshold 0.75 | Relaxed | Larger datasets | Lower average quality | |
| | Threshold 0.92 | Very strict | Highest confidence only | Very few sequences pass | |
| | max_len 6 | Short sequences | Matching real length distribution | Misses complex patterns | |
| | max_len 20+ | Long sequences | Complex grammar patterns | Not representative of real seals | |
|
|
| --- |
|
|
| ## Displaying Indus glyphs |
|
|
| Sequences use sign IDs like T638, T177. To see actual glyphs: |
|
|
| 1. Search for **indus-brahmi-font** and download it |
| 2. The `glyphs` field in output shows the rendered glyph characters |
| 3. Open `data/id_to_glyph.json` to see the full sign to character mapping |
| 4. To see the mapping keyed by `T`-prefixed sign IDs, open `data/indus_tokenizer/indus_id_map.json` |
|
|
| Without the font installed, glyphs show as boxes or question marks. |
| The sign IDs (T638, T177 etc.) always work regardless of font. |
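As a rough sketch of how the mapping file could be used from Python, the snippet below swaps sign IDs for glyph characters. It assumes `data/id_to_glyph.json` is a flat JSON object whose keys match the sign IDs used in sequences; if the file is keyed differently, the lookup needs adapting. The helper names `load_glyph_map` and `render_glyphs` are hypothetical, and the demo mapping uses placeholder strings rather than real glyphs.

```python
import json
from pathlib import Path

def load_glyph_map(path="data/id_to_glyph.json"):
    """Load the sign-to-glyph mapping; assumes a flat JSON object of strings."""
    with Path(path).open(encoding="utf-8") as fh:
        return json.load(fh)

def render_glyphs(sequence, glyph_map):
    """Replace each sign ID with its glyph, keeping the ID when no glyph is known."""
    return " ".join(glyph_map.get(sign, sign) for sign in sequence.split())

# Placeholder mapping for illustration only; real glyphs come from the JSON file.
demo_map = {"T638": "<glyph-638>", "T177": "<glyph-177>"}
print(render_glyphs("T638 T177 T420", demo_map))
```

Falling back to the sign ID keeps the output readable when a sign has no entry in the mapping.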
|
|
| --- |
|
|
| ## Repo structure |
|
|
| ``` |
| indus-script-models/ |
| ├── inference.py            run this for all tasks |
| ├── indus_ngram.py          required by ngram_model.pkl; do not move |
| ├── README.md |
| ├── models/ |
| │   ├── nanogpt_indus.pt    NanoGPT generator (153K params, PPL 13.3) |
| │   ├── ngram_model.pkl     N-gram RTL model (88.2% pairwise accuracy) |
| │   ├── mlm/                TinyBERT masked language model (val loss 2.06) |
| │   ├── cls/                TinyBERT classifier (89.0% test accuracy) |
| │   ├── electra/            ELECTRA discriminator (95.1% token accuracy) |
| │   └── deberta/            DeBERTa discriminator (87.1% test accuracy) |
| └── data/ |
|     ├── id_to_glyph.json    641 sign ID to glyph character mappings |
|     └── indus_tokenizer/    custom tokenizer for Indus Script |
| ``` |
|
|
| --- |
|
|
| ## How the pipeline works |
|
|
| **Stage 1: Train on 3,310 real inscriptions** |
|
|
| Five models are trained independently, each learning a different aspect of the sign grammar: |
|
|
| - **TinyBERT MLM**: learns which sign can fill a masked position in a sequence |
| - **TinyBERT Classifier**: learns to tell valid sequences from corrupted ones |
| - **N-gram RTL**: learns right-to-left transition probabilities between signs |
| - **ELECTRA**: learns token-level discrimination between real and fake signs |
| - **NanoGPT**: learns to generate new sequences from scratch |
|
|
| **Stage 2: Generate and filter** |
|
|
| NanoGPT generates candidate sequences in RTL order, then flips them to LTR. |
| Each candidate is scored by a weighted ensemble of three models: BERT (50%) + N-gram (25%) + ELECTRA (25%). |
| Only sequences scoring 85% or higher are kept as valid synthetic sequences. |
| Sequences that exactly match real inscriptions are set aside as seal reproductions. |
| Result: 5,000 novel sequences, plus 752 exact matches to real seals that serve as validation evidence. |
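The weighting and the quality gate can be sketched as follows. This is an illustration built from the percentages stated above, not the code in `inference.py`; the names `ensemble_score` and `filter_candidates` are hypothetical, and the sketch assumes seal reproductions must also pass the 85% gate, which the description leaves open.

```python
# Weights as stated in the pipeline description: BERT 50%, N-gram 25%, ELECTRA 25%.
WEIGHTS = {"bert": 0.50, "ngram": 0.25, "electra": 0.25}

def ensemble_score(scores, weights=WEIGHTS):
    """Weighted average of the per-model scores."""
    return sum(weights[name] * scores[name] for name in weights)

def filter_candidates(candidates, real_inscriptions, threshold=0.85):
    """Split gated candidates into novel sequences and seal reproductions.

    candidates: iterable of (sequence_string, per_model_scores) pairs.
    Assumption: reproductions are only counted if they pass the gate too.
    """
    real = set(real_inscriptions)
    novel, reproductions = [], []
    for sequence, scores in candidates:
        if ensemble_score(scores) < threshold:
            continue  # fails the quality gate
        (reproductions if sequence in real else novel).append(sequence)
    return novel, reproductions

# Per-model scores from the validation example earlier in this README.
example = {"bert": 0.9650, "ngram": 0.8930, "electra": 0.9410}
print(f"{ensemble_score(example):.4f}")  # 0.9410
```

With the scores from the validation example, the weighted sum is 0.4825 + 0.22325 + 0.23525 = 0.9410, which matches the `Ensemble` line in that output.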
|
|
| **Stage 3: Retrain on combined data** |
|
|
| The 5,000 synthetic sequences were combined with the 3,310 real sequences (8,310 total). |
| All models were retrained on the larger dataset, and all three reported metrics improved: |
|
|
| | Model | Before | After | |
| |---|---|---| |
| | TinyBERT accuracy | 78.4% | 89.0% | |
| | NanoGPT perplexity | 32.5 | 13.3 | |
| | DeBERTa accuracy | 80.5% | 87.1% | |
|
|
| The final 5,000 sequences in the dataset were generated with these retrained models. |
|
|
| --- |
|
|
| ## Key findings |
|
|
| - **RTL reading confirmed**: right-to-left has 12% stronger grammatical structure than LTR |
| - **Grammar proven**: entropy chain H1 to H2 to H3 = 6.03 to 3.41 to 2.39 bits (language-like decay) |
| - **Zipf's law confirmed**: R² = 0.968, language-like token distribution |
| - **752 seal reproductions**: the model independently reproduced real archaeological inscriptions |
| - **Sign roles discovered:** |
|   - PREFIX signs at reading end: T638, T604, T406, T496 |
|   - SUFFIX signs at reading start: T123, T122, T701, T741 |
|   - CORE signs in the middle: T101, T268, T177, T243 |
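An entropy chain of this kind is commonly computed as conditional entropies of increasing order. The sketch below assumes H1 is the unigram entropy, H2 the entropy of a sign given the previous sign, and H3 the entropy given the previous two; the README does not define the terms, so this reading is an assumption. The function name `conditional_entropy` is hypothetical and the toy sequences use placeholder symbols, not real signs.

```python
import math
from collections import Counter

def conditional_entropy(sequences, n):
    """Entropy in bits of a symbol given the n-1 symbols before it."""
    ngrams = Counter(
        tuple(seq[i:i + n]) for seq in sequences for i in range(len(seq) - n + 1)
    )
    total = sum(ngrams.values())
    if total == 0:
        return 0.0  # no n-grams of this length in the data
    # Count how often each (n-1)-symbol context occurs as an n-gram prefix.
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    # H = -sum p(context, symbol) * log2 p(symbol | context)
    return -sum(
        count / total * math.log2(count / contexts[gram[:-1]])
        for gram, count in ngrams.items()
    )

# Toy data: a strictly alternating pattern is fully predictable after one symbol.
toy = [["A", "B", "A", "B"], ["A", "B", "A", "B"]]
h1, h2, h3 = (conditional_entropy(toy, n) for n in (1, 2, 3))
print(h1, h2, h3)
```

A falling chain means each extra symbol of context removes uncertainty about the next sign; for a right-to-left analysis, the sequences would be reversed before counting.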
|
|
| --- |
|
|
| ## Known limitations |
|
|
| **DeBERTa calibration issue:** |
| DeBERTa returns near-zero scores for all sequences because its confidence is badly calibrated. |
| Its score is logged in the output but excluded from the quality gate. |
| BERT, N-gram, and ELECTRA handle all scoring. |
|
|
| **Vocabulary coverage:** |
| Only about 26% of the 641 known Indus signs appear reliably in generated sequences. |
| 475 signs appear 10 times or fewer in the real corpus, which is too rare for the model to learn. |
| This is a property of the archaeological record, not a model bug. |
| No model trained on this corpus can reliably generate signs that barely exist in the training data. |
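A quick way to check this kind of coverage figure on any tokenized corpus is to count how many distinct signs fall at or below a frequency cutoff. The sketch below is illustrative only: `rare_sign_share` is a hypothetical helper, the toy corpus uses placeholder symbols, and signs that never occur in the corpus are not counted.

```python
from collections import Counter

def rare_sign_share(sequences, max_count=10):
    """Fraction of distinct signs that occur at most max_count times."""
    counts = Counter(sign for seq in sequences for sign in seq)
    if not counts:
        return 0.0
    rare = sum(1 for c in counts.values() if c <= max_count)
    return rare / len(counts)

# Toy corpus: "A" occurs 13 times, "B" 12 times, "C" once.
toy_corpus = [["A", "B"]] * 12 + [["A", "C"]]
print(f"{rare_sign_share(toy_corpus):.2%} of distinct signs are rare")
```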
|
|
| **Short sequences:** |
| The model rarely generates length-2 sequences even though they are common in real inscriptions. |
| If you need shorter outputs, set `max_len=4` in the generate function. |
|
|
| --- |
|
|
| ## Dataset |
|
|
| The 5,000 synthetic sequences with full scores and sign index are available at: |
|
|
| [hellosindh/indus-script-synthetic](https://huggingface.co/datasets/hellosindh/indus-script-synthetic) |
|
|