
IndicBERT Probing and Fine-tuning Analysis

Extension of the paper: IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages? (Aravapalli et al., 2024)

This repository reproduces the probing experiments from the IndicSentEval paper on IndicBERT (by AI4Bharat) and extends the analysis with two new contributions — stronger MLP probing and end-to-end fine-tuning — to measure IndicBERT's true linguistic capacity beyond what standard probing reveals.


Repository Structure

```
IndicBertology/
│
├── src/
│   ├── run_indicbert.py           # Probing script — reproduces paper's LogReg methodology
│   ├── finetune_indicbert.py      # Fine-tuning + frozen MLP comparison (our extension)
│   ├── classifier.py              # Original repo classifier script
│   ├── ada.py                     # Original repo embedding extraction script
│   └── probingData/               # Dataset CSVs (6 languages × 9 tasks)
│       ├── hindi/
│       ├── marathi/
│       ├── telugu/
│       ├── malayalam/
│       ├── kannada/
│       └── urdu/
│
├── utils/
│   └── ssfAPI.py                  # Parses SSF (Shakti Standard Format) annotated data
│
├── guidelines/                    # Project guidelines and SSF format documentation
│
├── results/
│   ├── indicbert_results.json     # Probing results — all 12 layers, all tasks
│   ├── finetuning_comparison.json # Frozen MLP vs fine-tuned comparison
│   └── finetuning_comparison.csv  # Same results in CSV format
│
├── requirements.txt
└── README.md
```


Our Three Experimental Approaches

Approach 1 — LogisticRegression Probe (Paper's Method, Reproduced)

IndicBERT (frozen) → extract all 12 layer embeddings → LogisticRegression → best layer accuracy

Answers: Which layer best encodes each linguistic property?
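
A minimal sketch of this per-layer probe, assuming the 12 layer embeddings have already been extracted from the frozen encoder (variable and function names here are illustrative, not the script's):

```python
# Sketch of the layer-wise LogisticRegression probe (Approach 1).
# `layer_embeddings` is a hypothetical list of 12 arrays of shape
# (n_sentences, 768), one per IndicBERT layer; `labels` holds task labels.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def best_layer_accuracy(layer_embeddings, labels):
    best_layer, best_acc = -1, 0.0
    for layer_idx, X in enumerate(layer_embeddings):
        # Hold out a stratified test split per layer
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, labels, test_size=0.2, stratify=labels, random_state=42)
        probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        acc = probe.score(X_te, y_te)
        if acc > best_acc:
            best_layer, best_acc = layer_idx, acc
    return best_layer, best_acc
```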

Approach 2 — Frozen MLP Probe (Our Extension)

IndicBERT (frozen) → CLS token from final layer → 2-layer MLP (768→256→ReLU→n_classes) → accuracy

Answers: Can a stronger non-linear classifier extract more from the same frozen representations?
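
A sketch of the probe head with the dimensions stated above (the class name is illustrative):

```python
# Sketch of the frozen-MLP probe head (Approach 2): a 2-layer MLP on the
# CLS embedding of IndicBERT's final layer, 768 -> 256 -> ReLU -> n_classes.
import torch.nn as nn

class MLPProbe(nn.Module):
    def __init__(self, hidden_size=768, probe_dim=256, n_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, n_classes),
        )

    def forward(self, cls_embedding):
        # cls_embedding: (batch, 768) from the frozen encoder
        return self.net(cls_embedding)  # (batch, n_classes) logits
```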

Approach 3 — End-to-End Fine-tuning (Our Extension)

IndicBERT (trainable) + MLP head → trained end-to-end with differential learning rates → accuracy

Answers: What is IndicBERT's true maximum linguistic capacity for each task?
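
A sketch of the optimizer setup this implies, using the learning rates listed in the Implementation Notes (the variable names are illustrative):

```python
# Sketch of differential learning rates for end-to-end fine-tuning
# (Approach 3): a small LR for the pretrained encoder, a larger one for
# the randomly initialized MLP head.
import torch
import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("./indic-bert-local/indic-bert")
head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 2e-5},  # pretrained IndicBERT
    {"params": head.parameters(), "lr": 1e-3},     # classifier head
])
```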


Probing Tasks

| Probing Task | Category | Number of Classes |
|---|---|---|
| Sentence Length (SentLen) | Surface | 8 |
| Bigram Shift (BShift) | Syntactic | 2 |
| Tree Depth (TreeDepth) | Syntactic | 5 |
| Subject Number (SubjNum) | Semantic | 2 |
| Object Number (ObjNum) | Semantic | 2 |
| Word Content | Semantic | varies by language (see table below) |
| Gender | Semantic | 4 |
| Number | Semantic | 3 |
| Person | Semantic | 7 |

Examples of Each Class (Hindi)

Subject Number

| Type | Example |
|---|---|
| sg | वैश्विक रूप से यह संक्रमण उन जगहों पर अधिक आम है जहां पर रोग - प्रतिरोधकता कम है । |
| pl | सामान्य लोगों में से आधे लोगों में ये छोटे जीव नींद के दौरान प्रवेश करते हैं । |

Object Number

| Type | Example |
|---|---|
| sg | उस वक्त के अंगरेज शासक लार्ड डलहौजी के नाम पर इसका नाम ' डलहौजी ' रख दिया गया । |
| pl | अगस्त के बाद गर्म कपड़े साथ रखें । |

Bigram Shift

| Type | Example |
|---|---|
| 0 (original) | पंजपुला एक सैरगाह है । |
| 1 (shifted) | गर्मियों में ठंडी सड़क को सब लुभाती है । |

Gender

| Type | Example |
|---|---|
| any | पंजपुला एक सैरगाह है । |
| m | पहले चश्मे में इतना पानी होता था कि 7 धाराएँ बनती थीं । |
| f | आज चश्मे में से एक धारा निकलती है । |

Number

| Type | Example |
|---|---|
| sg | डलहौजी में भीड़भाड़ के स्थान पर शांत माहौल है । |
| pl | लंबी छुट्टियाँ गुजारने वाले एकांतपसंद लोग डलहौजी बड़ी संख्या में आते हैं । |
| any | सामान्य स्वेटर व शाल हर मौसम में चाहिए । |

Person

| Type | Example |
|---|---|
| 1 | हम पूरी तरह तैयार होकर नहीं जाते । |
| 1h | हम क्या सबसे झगड़ा करती फिरती हैं का ? |
| 2 | चना दाल को दो घंटे पहले पानी में भिगो दें । |
| 2h | अगस्त के बाद गर्म कपड़े साथ रखें । |
| 3 | पंजपुला एक सैरगाह है । |
| 3h | सुभाष चंद्र बोस कुछ दिन यहाँ आ कर रहे थे । |
| any | कभी पंजपुला से धर्मशाला की ओर एक पैदल मार्ग जाता था । |

Languages Covered

| Language | Code | Sentences | Family | Script |
|---|---|---|---|---|
| Hindi | hi | 12,292 | Indo-Aryan | Devanagari |
| Marathi | mr | 12,029 | Indo-Aryan | Devanagari |
| Telugu | te | 3,222 | Dravidian | Telugu |
| Malayalam | ml | 7,667 | Dravidian | Malayalam |
| Kannada | kn | 9,806 | Dravidian | Kannada |
| Urdu | ur | 2,363 | Indo-Aryan | Nastaliq |

Perturbations (From Original Paper)

The original paper applies 13 text perturbations to test model robustness. These are defined in the src/ folder; a hedged sketch of two of the simpler ones follows the table below.

| Perturbation Type | Description |
|---|---|
| Append Irrelevant | Append an irrelevant sentence |
| DropAll | Drop all content words |
| DropAllNouns | Drop all noun tokens |
| DropAllVerbs | Drop all verb tokens |
| DropFirst | Drop first word |
| DropLast | Drop last word |
| DropFirstLast | Drop first and last word |
| DropRandomNoun | Drop a random noun |
| DropRandomVerb | Drop a random verb |
| KeepBoth | Keep only subject and object |
| KeepOnlyNoun | Keep only nouns |
| KeepOnlyVerb | Keep only verbs |
| Shuffle | Randomly shuffle all words |
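
For illustration, here is a minimal sketch of two of the simpler perturbations (Shuffle and DropFirst); the actual implementations in src/ may differ in tokenization details:

```python
# Hedged sketches of two perturbations from the table above. Whitespace
# tokenization is an assumption for illustration only.
import random

def shuffle_words(sentence: str) -> str:
    """Shuffle: randomly shuffle all words in the sentence."""
    words = sentence.split()
    random.shuffle(words)
    return " ".join(words)

def drop_first(sentence: str) -> str:
    """DropFirst: drop the first word of the sentence."""
    return " ".join(sentence.split()[1:])
```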

Word Content — Mid-Frequency Words per Language

| Language | Vocabulary Size |
|---|---|
| Hindi | 781 |
| Marathi | 626 |
| Kannada | 416 |
| Malayalam | 188 |
| Urdu | 153 |
| Telugu | 34 |

Key Results

Probing Results — Best Layer Accuracy (LogisticRegression, Paper Method)

| Task | Hindi | Marathi | Telugu | Malayalam | Kannada | Urdu |
|---|---|---|---|---|---|---|
| SentLen | 63.1% | 25.1% | 30.6% | — | — | 18.2% |
| SubjNum | 67.5% | 71.3% | 73.2% | 76.0% | 79.7% | 89.3% |
| ObjNum | 70.1% | 88.2% | 64.7% | 79.1% | 80.9% | 86.1% |
| TreeDepth | 36.3% | 22.9% | 21.8% | 27.9% | 20.3% | — |
| BShift | 71.3% | 73.2% | 75.2% | 71.5% | 72.8% | 74.0% |
| Gender | 51.4% | 41.5% | 45.9% | 78.6% | 46.1% | — |
| Number | 60.5% | 76.4% | 55.8% | 76.3% | 67.2% | — |
| Person | 66.2% | 88.2% | 68.5% | 90.7% | 72.8% | — |

Fine-tuning Comparison — Telugu (Frozen MLP vs Fine-tuned)

| Task | Frozen MLP | Fine-tuned | Δ |
|---|---|---|---|
| SentLen | 82.6% | 96.9% | +14.3% |
| SubjNum | 71.9% | 83.0% | +11.1% |
| ObjNum | 66.4% | 66.4% | 0.0% |
| TreeDepth | 55.8% | 59.7% | +3.9% |
| BShift | 71.8% | 78.6% | +6.8% |
| Gender | 47.9% | 82.4% | +34.5% |
| Number | 51.6% | 89.8% | +38.1% |
| Person | 69.3% | 88.2% | +18.9% |

Key Findings

1. Probing underestimates surface task capacity — SentLen accuracy in Urdu rises from 18.2% (LogReg probing) to 84.1% (fine-tuned), a gain of nearly 66 percentage points. IndicBERT has strong surface encoding capacity that probing alone cannot reveal.

2. Telugu morphology shows the largest fine-tuning gains — Number: +38.1%, Gender: +34.5%. Frozen representations fail to capture Telugu morphological features that the model can actively learn.

3. TreeDepth remains hard regardless of approach — Even with full fine-tuning, tree depth peaks at ~59.7% (Telugu), indicating a fundamental limitation in IndicBERT's syntactic depth encoding.

4. Kannada Person is already near-perfectly encoded — Frozen MLP: 89.6%, fine-tuned: 90.8% (Δ = +1.2%). Pretraining alone already captures Kannada person morphology.

5. Delta (Δ) analysis is our new contribution — Large Δ means probing underestimates model capacity. Zero Δ means frozen representations are already optimal. Negative Δ means fine-tuning overfit on small data.
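
As a sketch, the Δ column can be recomputed from the exported CSV; the column names used here ("task", "frozen_mlp", "finetuned") are assumptions about its layout, shown for illustration only:

```python
# Recompute the delta analysis from results/finetuning_comparison.csv.
# Column names are hypothetical — adjust to the actual CSV header.
import pandas as pd

df = pd.read_csv("results/finetuning_comparison.csv")
df["delta"] = df["finetuned"] - df["frozen_mlp"]
print(df.sort_values("delta", ascending=False))  # largest gains first
```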


Setup

1. Clone the repository

```bash
git clone https://github.com/YOUR_USERNAME/IndicBertology.git
cd IndicBertology
```

2. Install dependencies

```bash
pip install -r requirements.txt
```

3. Download IndicBERT

The model is gated on HuggingFace. Request access at https://huggingface.co/ai4bharat/indic-bert, then place the model files at ./indic-bert-local/indic-bert/.

The folder should contain:

```
indic-bert-local/indic-bert/
├── config.json
├── pytorch_model.bin
├── spiece.model
└── spiece.vocab
```
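
If you prefer to fetch the files programmatically, one option (after authenticating with `huggingface-cli login` or an HF token, and once access has been granted) is huggingface_hub's snapshot_download:

```python
# Download the gated model into the local path expected by the scripts.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="ai4bharat/indic-bert",
    local_dir="indic-bert-local/indic-bert",
)
```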


Running

Probing experiments — reproduces the paper's methodology

```bash
python src/run_indicbert.py
```

Extracts embeddings from all 12 IndicBERT layers, trains a LogisticRegression probe on each, and reports the best-layer accuracy. Results are saved to results/indicbert_results.json.

Fine-tuning comparison — our extension

```bash
python src/finetune_indicbert.py
```

Trains the frozen MLP probe and fine-tunes IndicBERT end-to-end for each task. Results are saved to results/finetuning_comparison.json and results/finetuning_comparison.csv.


Implementation Notes

  • keep_accents=True in tokenizer — critical for preserving Indic vowel matras (diacritics)
  • CLS token (index 0 of last hidden state) used as sentence representation (both shown in the sketch after this list)
  • Train / Dev / Test split: 70% / 10% / 20% with stratified sampling
  • Early stopping with patience=3 on validation accuracy
  • Fine-tuning uses differential learning rates: 2e-5 for IndicBERT, 1e-3 for classifier head
  • Fresh IndicBERT weights loaded for each task to prevent cross-task interference
  • MPS (Apple Silicon GPU) used for acceleration
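
A minimal sketch of the first two notes (the tokenizer flag and the CLS extraction), using the local model path from the Setup section:

```python
# Load tokenizer with keep_accents=True to preserve vowel matras, then
# take index 0 of the last hidden state as the sentence representation.
import torch
from transformers import AutoModel, AutoTokenizer

path = "./indic-bert-local/indic-bert"
tokenizer = AutoTokenizer.from_pretrained(path, keep_accents=True)
model = AutoModel.from_pretrained(path)

inputs = tokenizer("पंजपुला एक सैरगाह है ।", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
cls_vec = outputs.last_hidden_state[:, 0, :]  # (1, 768) CLS embedding
```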

Requirements

```
torch>=1.13.0
transformers>=4.30.0
sentencepiece>=0.1.99
numpy>=1.23.0
scikit-learn>=1.2.0
pandas>=1.5.0
```


About IndicBERT

IndicBERT is an ALBERT-based multilingual model created by AI4Bharat (IIT Madras) and pretrained on ~9 billion tokens across 12 Indian languages using Masked Language Modeling (MLM) and Sentence Order Prediction (SOP). It has 12 transformer layers with a hidden dimension of 768.

  • Model: ai4bharat/indic-bert
  • Architecture: ALBERT-base
  • Hidden size: 768
  • Layers: 12
  • Vocabulary: 200,000 SentencePiece tokens
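
A quick way to confirm these specs against the local copy from the Setup section (printed values are what the text above leads us to expect):

```python
# Inspect the model config to verify architecture, depth, and width.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("./indic-bert-local/indic-bert")
print(config.model_type)         # expected: "albert"
print(config.hidden_size)        # expected: 768
print(config.num_hidden_layers)  # expected: 12
print(config.vocab_size)         # expected: 200000
```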

Original Paper and Codebase

This repository builds on the original IndicBertology codebase:

```bibtex
@article{aravapalli2024indicsenteval,
  title   = {IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?},
  author  = {Aravapalli, Akhilesh and others},
  journal = {arXiv preprint arXiv:2410.02611},
  year    = {2024},
  url     = {https://arxiv.org/abs/2410.02611}
}
```


Acknowledgements

  • AI4Bharat for the IndicBERT model and IndicCorp dataset
  • Authors of IndicSentEval for the probing dataset and methodology