IndicBERT Probing and Fine-tuning Analysis
Extension of the paper: IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages? (Aravapalli et al., 2024)
This repository reproduces the probing experiments from the IndicSentEval paper on IndicBERT (by AI4Bharat) and extends the analysis with two new contributions — stronger MLP probing and end-to-end fine-tuning — to measure IndicBERT's true linguistic capacity beyond what standard probing reveals.
Repository Structure
```
IndicBertology/
│
├── src/
│   ├── run_indicbert.py            # Probing script — reproduces paper's LogReg methodology
│   ├── finetune_indicbert.py       # Fine-tuning + Frozen MLP comparison (our extension)
│   ├── classifier.py               # Original repo classifier script
│   ├── ada.py                      # Original repo embedding extraction script
│   └── probingData/                # Dataset CSVs (6 languages × 9 tasks)
│       ├── hindi/
│       ├── marathi/
│       ├── telugu/
│       ├── malayalam/
│       ├── kannada/
│       └── urdu/
│
├── utils/
│   └── ssfAPI.py                   # Parses SSF (Shakti Standard Format) annotated data
│
├── guidelines/                     # Project guidelines and SSF format documentation
│
├── results/
│   ├── indicbert_results.json      # Probing results — all 12 layers, all tasks
│   ├── finetuning_comparison.json  # Frozen MLP vs Fine-tuned comparison
│   └── finetuning_comparison.csv   # Same results in CSV format
│
├── requirements.txt
└── README.md
```
Our Three Experimental Approaches
Approach 1 — LogisticRegression Probe (Paper's Method, Reproduced)
IndicBERT (frozen) → extract all 12 layer embeddings → LogisticRegression → best layer accuracy
Answers: Which layer best encodes each linguistic property?
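A minimal sketch of this pipeline, assuming sentences and labels for a single task are already loaded (the actual `src/run_indicbert.py` handles the probing CSVs, perturbations, and all six languages):

```python
# Sketch of the layer-wise LogisticRegression probe; placeholder data, not the repo's loader.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_DIR = "./indic-bert-local/indic-bert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, keep_accents=True)
model = AutoModel.from_pretrained(MODEL_DIR, output_hidden_states=True).eval()

def cls_embeddings(sentences, layer):
    """Return the CLS vector of each sentence from a given layer (1..12)."""
    vecs = []
    with torch.no_grad():
        for sent in sentences:
            enc = tokenizer(sent, return_tensors="pt", truncation=True, max_length=128)
            hidden = model(**enc).hidden_states   # tuple: embedding output + 12 layers
            vecs.append(hidden[layer][0, 0].numpy())
    return np.stack(vecs)

# Placeholders: replace with sentences/labels from the CSVs under src/probingData/<language>/
train_sents, train_labels = [...], [...]
test_sents, test_labels = [...], [...]

best_acc = 0.0
for layer in range(1, 13):
    probe = LogisticRegression(max_iter=1000)
    probe.fit(cls_embeddings(train_sents, layer), train_labels)
    best_acc = max(best_acc, probe.score(cls_embeddings(test_sents, layer), test_labels))
print(f"Best-layer accuracy: {best_acc:.3f}")
```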
Approach 2 — Frozen MLP Probe (Our Extension)
IndicBERT (frozen) → CLS token from final layer → 2-layer MLP (768→256→ReLU→n_classes) → accuracy
Answers: Can a stronger non-linear classifier extract more from the same frozen representations?
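A minimal sketch of the probe head; the training loop (cross-entropy over frozen CLS embeddings) is omitted:

```python
import torch.nn as nn

class FrozenMLPProbe(nn.Module):
    """2-layer MLP probe: 768 -> 256 -> ReLU -> n_classes, trained on frozen CLS embeddings."""
    def __init__(self, n_classes, hidden_size=768, probe_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, n_classes),
        )

    def forward(self, cls_embedding):      # cls_embedding: (batch, 768)
        return self.net(cls_embedding)     # logits: (batch, n_classes)
```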
Approach 3 — End-to-End Fine-tuning (Our Extension)
IndicBERT (trainable) + MLP head → trained end-to-end with differential LRs → accuracy
Answers: What is IndicBERT's true maximum linguistic capacity for each task?
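A minimal sketch of the fine-tuning setup with differential learning rates (2e-5 for the encoder, 1e-3 for the head, per the implementation notes below); the training loop and early stopping are omitted:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("./indic-bert-local/indic-bert")      # weights stay trainable
head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 8))   # e.g. 8 classes for SentLen

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 2e-5},   # encoder learning rate
    {"params": head.parameters(), "lr": 1e-3},      # classifier-head learning rate
])

def logits(batch):
    """batch: a dict of tokenized inputs (input_ids, attention_mask, ...)."""
    cls = encoder(**batch).last_hidden_state[:, 0]  # CLS token of the final layer
    return head(cls)
```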
Probing Tasks
| Probing Task | Category | Number of Classes |
|---|---|---|
| Sentence Length (SentLen) | Surface | 8 |
| Bigram Shift (BShift) | Syntactic | 2 |
| Tree Depth (TreeDepth) | Syntactic | 5 |
| Subject Number (SubjNum) | Semantic | 2 |
| Object Number (ObjNum) | Semantic | 2 |
| Word Content | Semantic | varies by language |
| Gender | Semantic | 4 |
| Number | Semantic | 3 |
| Person | Semantic | 7 |
Examples of Each Class (Hindi)
Subject Number
| Type | Example |
|---|---|
| sg | वैश्विक रूप से यह संक्रमण उन जगहों पर अधिक आम है जहां पर रोग - प्रतिरोधकता कम है । |
| pl | सामान्य लोगों में से आधे लोगों में ये छोटे जीव नींद के दौरान प्रवेश करते हैं । |
Object Number
| Type | Example |
|---|---|
| sg | उस वक्त के अंगरेज शासक लार्ड डलहौजी के नाम पर इसका नाम ' डलहौजी ' रख दिया गया । |
| pl | अगस्त के बाद गर्म कपड़े साथ रखें । |
Bigram Shift
| Type | Example |
|---|---|
| 0 (original) | पंजपुला एक सैरगाह है । |
| 1 (shifted) | गर्मियों में ठंडी सड़क को सब लुभाती है । |
Gender
| Type | Example |
|---|---|
| any | पंजपुला एक सैरगाह है । |
| m | पहले चश्मे में इतना पानी होता था कि 7 धाराएँ बनती थीं । |
| f | आज चश्मे में से एक धारा निकलती है । |
Number
| Type | Example |
|---|---|
| sg | डलहौजी में भीड़भाड़ के स्थान पर शांत माहौल है । |
| pl | लंबी छुट्टियाँ गुजारने वाले एकांतपसंद लोग डलहौजी बड़ी संख्या में आते हैं । |
| any | सामान्य स्वेटर व शाल हर मौसम में चाहिए । |
Person
| Type | Example |
|---|---|
| 1 | हम पूरी तरह तैयार होकर नहीं जाते । |
| 1h | हम क्या सबसे झगड़ा करती फिरती हैं का ? |
| 2 | चना दाल को दो घंटे पहले पानी में भिगो दें । |
| 2h | अगस्त के बाद गर्म कपड़े साथ रखें । |
| 3 | पंजपुला एक सैरगाह है । |
| 3h | सुभाष चंद्र बोस कुछ दिन यहाँ आ कर रहे थे । |
| any | कभी पंजपुला से धर्मशाला की ओर एक पैदल मार्ग जाता था । |
Languages Covered
| Language | Code | Sentences | Family | Script |
|---|---|---|---|---|
| Hindi | hi | 12,292 | Indo-Aryan | Devanagari |
| Marathi | mr | 12,029 | Indo-Aryan | Devanagari |
| Telugu | te | 3,222 | Dravidian | Telugu |
| Malayalam | ml | 7,667 | Dravidian | Malayalam |
| Kannada | kn | 9,806 | Dravidian | Kannada |
| Urdu | ur | 2,363 | Indo-Aryan | Nastaliq |
Perturbations (From Original Paper)
The original paper applies 13 text perturbations to test model robustness. These are defined in the src/ folder; a few of the simpler ones are sketched after the table.
| Perturbation Type | Description |
|---|---|
| Append Irrelevant | Append an irrelevant sentence |
| DropAll | Drop all content words |
| DropAllNouns | Drop all noun tokens |
| DropAllVerbs | Drop all verb tokens |
| DropFirst | Drop first word |
| DropLast | Drop last word |
| DropFirstLast | Drop first and last word |
| DropRandomNoun | Drop a random noun |
| DropRandomVerb | Drop a random verb |
| KeepBoth | Keep only subject and object |
| KeepOnlyNoun | Keep only nouns |
| KeepOnlyVerb | Keep only verbs |
| Shuffle | Randomly shuffle all words |
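For illustration, the simpler word-level perturbations might look like the following (the actual implementations in `src/` may differ in tokenization, and the noun/verb perturbations additionally require POS information):

```python
import random

def shuffle(sentence: str) -> str:
    """Shuffle: randomly reorder all words."""
    words = sentence.split()
    random.shuffle(words)
    return " ".join(words)

def drop_first(sentence: str) -> str:
    """DropFirst: remove the first word."""
    return " ".join(sentence.split()[1:])

def drop_first_last(sentence: str) -> str:
    """DropFirstLast: remove the first and last word."""
    return " ".join(sentence.split()[1:-1])
```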
Word Content — Mid-Frequency Words per Language
| Language | Vocabulary Size |
|---|---|
| Hindi | 781 |
| Marathi | 626 |
| Kannada | 416 |
| Malayalam | 188 |
| Urdu | 153 |
| Telugu | 34 |
Key Results
Probing Results — Best Layer Accuracy (LogisticRegression, Paper Method)
| Task | Hindi | Marathi | Telugu | Malayalam | Kannada | Urdu |
|---|---|---|---|---|---|---|
| SentLen | 63.1% | 25.1% | — | 30.6% | — | 18.2% |
| SubjNum | 67.5% | 71.3% | 73.2% | 76.0% | 79.7% | 89.3% |
| ObjNum | 70.1% | 88.2% | 64.7% | 79.1% | 80.9% | 86.1% |
| TreeDepth | 36.3% | 22.9% | — | 21.8% | 27.9% | 20.3% |
| BShift | 71.3% | 73.2% | 75.2% | 71.5% | 72.8% | 74.0% |
| Gender | 51.4% | 41.5% | 45.9% | — | 78.6% | 46.1% |
| Number | 60.5% | 76.4% | 55.8% | — | 76.3% | 67.2% |
| Person | 66.2% | 88.2% | 68.5% | — | 90.7% | 72.8% |
Fine-tuning Comparison — Telugu (Frozen MLP vs Fine-tuned)
| Task | Frozen MLP | Fine-tuned | Δ |
|---|---|---|---|
| SentLen | 82.6% | 96.9% | +14.3% |
| SubjNum | 71.9% | 83.0% | +11.1% |
| ObjNum | 66.4% | 66.4% | 0.0% |
| TreeDepth | 55.8% | 59.7% | +3.9% |
| BShift | 71.8% | 78.6% | +6.8% |
| Gender | 47.9% | 82.4% | +34.5% |
| Number | 51.6% | 89.8% | +38.1% |
| Person | 69.3% | 88.2% | +18.9% |
Key Findings
1. Probing underestimates surface task capacity — SentLen accuracy in Urdu rises from 18.2% (LogReg probing) to 84.1% (fine-tuned), a gain of roughly 66 percentage points. IndicBERT has strong surface-encoding capacity that probing alone cannot reveal.
2. Telugu morphology shows the largest fine-tuning gains — Number: +38.1%, Gender: +34.5%. Frozen representations fail to capture Telugu morphological features that the model can actively learn.
3. TreeDepth remains hard regardless of approach — Even with full fine-tuning, tree depth peaks at ~59.7% (Telugu), indicating a fundamental limitation in IndicBERT's syntactic depth encoding.
4. Kannada Person is already near-perfectly encoded — Frozen MLP: 89.6%, Fine-tuned: 90.8% (Δ = +1.2%). Pretraining alone already captures Kannada person morphology; fine-tuning adds almost nothing.
5. Delta (Δ) analysis is our new contribution — a large Δ means probing underestimates model capacity, a zero Δ means the frozen representations are already optimal, and a negative Δ suggests fine-tuning overfit the small training data.
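A small sketch of how the Δ column can be recomputed from the saved results (the column names used here are assumptions, not necessarily the exact schema of `results/finetuning_comparison.csv`):

```python
import pandas as pd

df = pd.read_csv("results/finetuning_comparison.csv")
# Assumed columns: "task", "frozen_mlp_acc", "finetuned_acc"
df["delta"] = df["finetuned_acc"] - df["frozen_mlp_acc"]
print(df.sort_values("delta", ascending=False))
```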
Setup
1. Clone the repository
```bash
git clone https://github.com/YOUR_USERNAME/IndicBertology.git
cd IndicBertology
```
2. Install dependencies
```bash
pip install -r requirements.txt
```
3. Download IndicBERT
The model is gated on HuggingFace. Request access at https://huggingface.co/ai4bharat/indic-bert, then place the model files at ./indic-bert-local/indic-bert/
The folder should contain:
```
indic-bert-local/indic-bert/
├── config.json
├── pytorch_model.bin
├── spiece.model
└── spiece.vocab
```
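A quick sanity check that the local copy loads (a minimal sketch; the path follows the layout above):

```python
from transformers import AutoModel, AutoTokenizer

MODEL_DIR = "./indic-bert-local/indic-bert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, keep_accents=True)
model = AutoModel.from_pretrained(MODEL_DIR)
print(model.config.num_hidden_layers, model.config.hidden_size)  # expected: 12 768
```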
Running
Probing experiments — reproduces the paper's methodology
```bash
python src/run_indicbert.py
```
Extracts embeddings from all 12 IndicBERT layers, trains LogisticRegression on each, reports best-layer accuracy.
Results saved to results/indicbert_results.json
Fine-tuning comparison — our extension
```bash
python src/finetune_indicbert.py
```
Trains frozen MLP probe and fine-tunes IndicBERT end-to-end for each task.
Results saved to results/finetuning_comparison.json and results/finetuning_comparison.csv
Implementation Notes
- `keep_accents=True` in the tokenizer — critical for preserving Indic vowel matras (diacritics)
- CLS token (index 0 of the last hidden state) used as the sentence representation
- Train / Dev / Test split: 70% / 10% / 20% with stratified sampling (sketched after this list)
- Early stopping with patience=3 on validation accuracy
- Fine-tuning uses differential learning rates: `2e-5` for IndicBERT, `1e-3` for the classifier head
- Fresh IndicBERT weights loaded for each task to prevent cross-task interference
- MPS (Apple Silicon GPU) used for acceleration
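The stratified split above can be reproduced with two calls to scikit-learn's `train_test_split` (a sketch; `sentences`, `labels`, and the random seed are placeholders, not the repo's exact code):

```python
from sklearn.model_selection import train_test_split

# 70% train, then split the remaining 30% into 10% dev and 20% test (stratified throughout)
train_x, rest_x, train_y, rest_y = train_test_split(
    sentences, labels, test_size=0.30, stratify=labels, random_state=42)
dev_x, test_x, dev_y, test_y = train_test_split(
    rest_x, rest_y, test_size=2/3, stratify=rest_y, random_state=42)
```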
Requirements
```
torch>=1.13.0
transformers>=4.30.0
sentencepiece>=0.1.99
numpy>=1.23.0
scikit-learn>=1.2.0
pandas>=1.5.0
```
About IndicBERT
IndicBERT is an ALBERT-based multilingual model created by AI4Bharat (IIT Madras) and pretrained on ~9 billion tokens across 12 Indian languages using Masked Language Modeling (MLM) and Sentence Order Prediction (SOP). It has 12 transformer layers with a hidden dimension of 768.
- Model: `ai4bharat/indic-bert`
- Architecture: ALBERT-base
- Hidden size: 768
- Layers: 12
- Vocabulary: 200,000 SentencePiece tokens
Original Paper and Codebase
This repository builds on the original IndicBertology codebase:
- Paper: IndicSentEval — Aravapalli et al., 2024. arXiv: 2410.02611
- Original repo: https://github.com/aforakhilesh/IndicBertology
```bibtex
@article{aravapalli2024indicsenteval,
  title   = {IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?},
  author  = {Aravapalli, Akhilesh and others},
  journal = {arXiv preprint arXiv:2410.02611},
  year    = {2024},
  url     = {https://arxiv.org/abs/2410.02611}
}
```
Acknowledgements
- AI4Bharat for the IndicBERT model and IndicCorp dataset
- Authors of IndicSentEval for the probing dataset and methodology