IndicBERT Probing and Fine-tuning Analysis
Extension of the paper: IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages? (Aravapalli et al., 2024)
This repository reproduces the probing experiments from the IndicSentEval paper on IndicBERT (by AI4Bharat) and extends the analysis with two new contributions — stronger MLP probing and end-to-end fine-tuning — to measure IndicBERT's true linguistic capacity beyond what standard probing reveals.
Repository Structure
```
IndicBertology/
│
├── src/
│   ├── run_indicbert.py            # Probing script — reproduces paper's LogReg methodology
│   ├── finetune_indicbert.py       # Fine-tuning + Frozen MLP comparison (our extension)
│   ├── classifier.py               # Original repo classifier script
│   ├── ada.py                      # Original repo embedding extraction script
│   └── probingData/                # Dataset CSVs (6 languages × 9 tasks)
│       ├── hindi/
│       ├── marathi/
│       ├── telugu/
│       ├── malayalam/
│       ├── kannada/
│       └── urdu/
│
├── utils/
│   └── ssfAPI.py                   # Parses SSF (Shakti Standard Format) annotated data
│
├── guidelines/                     # Project guidelines and SSF format documentation
│
├── results/
│   ├── indicbert_results.json      # Probing results — all 12 layers, all tasks
│   ├── finetuning_comparison.json  # Frozen MLP vs Fine-tuned comparison
│   └── finetuning_comparison.csv   # Same results in CSV format
│
├── requirements.txt
└── README.md
```
Our Three Experimental Approaches
Approach 1 — LogisticRegression Probe (Paper's Method, Reproduced)
IndicBERT (frozen) → extract all 12 layer embeddings → LogisticRegression → best layer accuracy
Answers: Which layer best encodes each linguistic property?
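A minimal sketch of this pipeline, assuming sentences and labels for a single task are already loaded (the actual `src/run_indicbert.py` handles the probing CSVs, perturbations, and all six languages):

```python
# Sketch of the layer-wise LogisticRegression probe; placeholder data, not the repo's loader.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

MODEL_DIR = "./indic-bert-local/indic-bert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, keep_accents=True)
model = AutoModel.from_pretrained(MODEL_DIR, output_hidden_states=True).eval()

def cls_embeddings(sentences, layer):
    """Return the CLS vector of each sentence from a given layer (1..12)."""
    vecs = []
    with torch.no_grad():
        for sent in sentences:
            enc = tokenizer(sent, return_tensors="pt", truncation=True, max_length=128)
            hidden = model(**enc).hidden_states   # tuple: embedding output + 12 layers
            vecs.append(hidden[layer][0, 0].numpy())
    return np.stack(vecs)

# Placeholders: replace with sentences/labels from the CSVs under src/probingData/<language>/
train_sents, train_labels = [...], [...]
test_sents, test_labels = [...], [...]

best_acc = 0.0
for layer in range(1, 13):
    probe = LogisticRegression(max_iter=1000)
    probe.fit(cls_embeddings(train_sents, layer), train_labels)
    best_acc = max(best_acc, probe.score(cls_embeddings(test_sents, layer), test_labels))
print(f"Best-layer accuracy: {best_acc:.3f}")
```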
Approach 2 — Frozen MLP Probe (Our Extension)
IndicBERT (frozen) → CLS token from final layer → 2-layer MLP (768→256→ReLU→n_classes) → accuracy
Answers: Can a stronger non-linear classifier extract more from the same frozen representations?
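A minimal sketch of the probe head; the training loop (cross-entropy over frozen CLS embeddings) is omitted:

```python
import torch.nn as nn

class FrozenMLPProbe(nn.Module):
    """2-layer MLP probe: 768 -> 256 -> ReLU -> n_classes, trained on frozen CLS embeddings."""
    def __init__(self, n_classes, hidden_size=768, probe_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, probe_dim),
            nn.ReLU(),
            nn.Linear(probe_dim, n_classes),
        )

    def forward(self, cls_embedding):      # cls_embedding: (batch, 768)
        return self.net(cls_embedding)     # logits: (batch, n_classes)
```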
Approach 3 — End-to-End Fine-tuning (Our Extension)
IndicBERT (trainable) + MLP head → trained end-to-end with differential LRs → accuracy
Answers: What is IndicBERT's true maximum linguistic capacity for each task?
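A minimal sketch of the fine-tuning setup with differential learning rates (2e-5 for the encoder, 1e-3 for the head, per the implementation notes below); the training loop and early stopping are omitted:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

encoder = AutoModel.from_pretrained("./indic-bert-local/indic-bert")      # weights stay trainable
head = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 8))   # e.g. 8 classes for SentLen

optimizer = torch.optim.AdamW([
    {"params": encoder.parameters(), "lr": 2e-5},   # encoder learning rate
    {"params": head.parameters(), "lr": 1e-3},      # classifier-head learning rate
])

def logits(batch):
    """batch: a dict of tokenized inputs (input_ids, attention_mask, ...)."""
    cls = encoder(**batch).last_hidden_state[:, 0]  # CLS token of the final layer
    return head(cls)
```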
Probing Tasks
| Probing Task | Category | Number of Classes |
|---|---|---|
| Sentence Length (SentLen) | Surface | 8 |
| Bigram Shift (BShift) | Syntactic | 2 |
| Tree Depth (TreeDepth) | Syntactic | 5 |
| Subject Number (SubjNum) | Semantic | 2 |
| Object Number (ObjNum) | Semantic | 2 |
| Word Content | Semantic | varies by language |
| Gender | Semantic | 4 |
| Number | Semantic | 3 |
| Person | Semantic | 7 |
Examples of Each Class (Hindi)
Subject Number
| Type | Example |
|---|---|
| sg | वैश्विक रूप से यह संक्रमण उन जगहों पर अधिक आम है जहां पर रोग - प्रतिरोधकता कम है । |
| pl | सामान्य लोगों में से आधे लोगों में ये छोटे जीव नींद के दौरान प्रवेश करते हैं । |
Object Number
| Type | Example |
|---|---|
| sg | उस वक्त के अंगरेज शासक लार्ड डलहौजी के नाम पर इसका नाम ' डलहौजी ' रख दिया गया । |
| pl | अगस्त के बाद गर्म कपड़े साथ रखें । |
Bigram Shift
| Type | Example |
|---|---|
| 0 (original) | पंजपुला एक सैरगाह है । |
| 1 (shifted) | गर्मियों में ठंडी सड़क को सब लुभाती है । |
Gender
| Type | Example |
|---|---|
| any | पंजपुला एक सैरगाह है । |
| m | पहले चश्मे में इतना पानी होता था कि 7 धाराएँ बनती थीं । |
| f | आज चश्मे में से एक धारा निकलती है । |
Number
| Type | Example |
|---|---|
| sg | डलहौजी में भीड़भाड़ के स्थान पर शांत माहौल है । |
| pl | लंबी छुट्टियाँ गुजारने वाले एकांतपसंद लोग डलहौजी बड़ी संख्या में आते हैं । |
| any | सामान्य स्वेटर व शाल हर मौसम में चाहिए । |
Person
| Type | Example |
|---|---|
| 1 | हम पूरी तरह तैयार होकर नहीं जाते । |
| 1h | हम क्या सबसे झगड़ा करती फिरती हैं का ? |
| 2 | चना दाल को दो घंटे पहले पानी में भिगो दें । |
| 2h | अगस्त के बाद गर्म कपड़े साथ रखें । |
| 3 | पंजपुला एक सैरगाह है । |
| 3h | सुभाष चंद्र बोस कुछ दिन यहाँ आ कर रहे थे । |
| any | कभी पंजपुला से धर्मशाला की ओर एक पैदल मार्ग जाता था । |
Languages Covered
| Language | Code | Sentences | Family | Script |
|---|---|---|---|---|
| Hindi | hi | 12,292 | Indo-Aryan | Devanagari |
| Marathi | mr | 12,029 | Indo-Aryan | Devanagari |
| Telugu | te | 3,222 | Dravidian | Telugu |
| Malayalam | ml | 7,667 | Dravidian | Malayalam |
| Kannada | kn | 9,806 | Dravidian | Kannada |
| Urdu | ur | 2,363 | Indo-Aryan | Nastaliq |
Perturbations (From Original Paper)
The original paper applies 13 text perturbations to test model robustness. These are defined in the src/ folder; a few of the simpler ones are sketched after the table.
| Perturbation Type | Description |
|---|---|
| Append Irrelevant | Append an irrelevant sentence |
| DropAll | Drop all content words |
| DropAllNouns | Drop all noun tokens |
| DropAllVerbs | Drop all verb tokens |
| DropFirst | Drop first word |
| DropLast | Drop last word |
| DropFirstLast | Drop first and last word |
| DropRandomNoun | Drop a random noun |
| DropRandomVerb | Drop a random verb |
| KeepBoth | Keep only subject and object |
| KeepOnlyNoun | Keep only nouns |
| KeepOnlyVerb | Keep only verbs |
| Shuffle | Randomly shuffle all words |
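For illustration, the simpler word-level perturbations might look like the following (the actual implementations in `src/` may differ in tokenization, and the noun/verb perturbations additionally require POS information):

```python
import random

def shuffle(sentence: str) -> str:
    """Shuffle: randomly reorder all words."""
    words = sentence.split()
    random.shuffle(words)
    return " ".join(words)

def drop_first(sentence: str) -> str:
    """DropFirst: remove the first word."""
    return " ".join(sentence.split()[1:])

def drop_first_last(sentence: str) -> str:
    """DropFirstLast: remove the first and last word."""
    return " ".join(sentence.split()[1:-1])
```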
Word Content — Mid-Frequency Words per Language
| Language | Vocabulary Size |
|---|---|
| Hindi | 781 |
| Marathi | 626 |
| Kannada | 416 |
| Malayalam | 188 |
| Urdu | 153 |
| Telugu | 34 |
Key Results
Probing Results — Best Layer Accuracy (LogisticRegression, Paper Method)
| Task | Hindi | Marathi | Telugu | Malayalam | Kannada | Urdu |
|---|---|---|---|---|---|---|
| SentLen | 63.1% | 25.1% | — | 30.6% | — | 18.2% |
| SubjNum | 67.5% | 71.3% | 73.2% | 76.0% | 79.7% | 89.3% |
| ObjNum | 70.1% | 88.2% | 64.7% | 79.1% | 80.9% | 86.1% |
| TreeDepth | 36.3% | 22.9% | — | 21.8% | 27.9% | 20.3% |
| BShift | 71.3% | 73.2% | 75.2% | 71.5% | 72.8% | 74.0% |
| Gender | 51.4% | 41.5% | 45.9% | — | 78.6% | 46.1% |
| Number | 60.5% | 76.4% | 55.8% | — | 76.3% | 67.2% |
| Person | 66.2% | 88.2% | 68.5% | — | 90.7% | 72.8% |
Fine-tuning Comparison — Telugu (Frozen MLP vs Fine-tuned)
| Task | Frozen MLP | Fine-tuned | Δ |
|---|---|---|---|
| SentLen | 82.6% | 96.9% | +14.3% |
| SubjNum | 71.9% | 83.0% | +11.1% |
| ObjNum | 66.4% | 66.4% | 0.0% |
| TreeDepth | 55.8% | 59.7% | +3.9% |
| BShift | 71.8% | 78.6% | +6.8% |
| Gender | 47.9% | 82.4% | +34.5% |
| Number | 51.6% | 89.8% | +38.1% |
| Person | 69.3% | 88.2% | +18.9% |
Key Findings
1. Probing underestimates surface task capacity — SentLen accuracy in Urdu rises from 18.2% (LogReg probing) to 84.1% (fine-tuned), a gain of roughly 66 percentage points. IndicBERT has strong surface-encoding capacity that probing alone cannot reveal.
2. Telugu morphology shows the largest fine-tuning gains — Number: +38.1%, Gender: +34.5%. Frozen representations fail to capture Telugu morphological features that the model can actively learn.
3. TreeDepth remains hard regardless of approach — Even with full fine-tuning, tree depth peaks at ~59.7% (Telugu), indicating a fundamental limitation in IndicBERT's syntactic depth encoding.
4. Kannada Person is already near-perfectly encoded — Frozen MLP: 89.6%, Fine-tuned: 90.8% (Δ = +1.2%). Pretraining alone already captures Kannada person morphology; fine-tuning adds almost nothing.
5. Delta (Δ) analysis is our new contribution — a large Δ means probing underestimates model capacity, a zero Δ means the frozen representations are already optimal, and a negative Δ suggests fine-tuning overfit the small training data.
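A small sketch of how the Δ column can be recomputed from the saved results (the column names used here are assumptions, not necessarily the exact schema of `results/finetuning_comparison.csv`):

```python
import pandas as pd

df = pd.read_csv("results/finetuning_comparison.csv")
# Assumed columns: "task", "frozen_mlp_acc", "finetuned_acc"
df["delta"] = df["finetuned_acc"] - df["frozen_mlp_acc"]
print(df.sort_values("delta", ascending=False))
```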
Setup
1. Clone the repository
```bash
git clone https://github.com/YOUR_USERNAME/IndicBertology.git
cd IndicBertology
```
2. Install dependencies
```bash
pip install -r requirements.txt
```
3. Download IndicBERT
The model is gated on HuggingFace. Request access at https://huggingface.co/ai4bharat/indic-bert, then place the model files at ./indic-bert-local/indic-bert/
The folder should contain:
```
indic-bert-local/indic-bert/
├── config.json
├── pytorch_model.bin
├── spiece.model
└── spiece.vocab
```
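A quick sanity check that the local copy loads (a minimal sketch; the path follows the layout above):

```python
from transformers import AutoModel, AutoTokenizer

MODEL_DIR = "./indic-bert-local/indic-bert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, keep_accents=True)
model = AutoModel.from_pretrained(MODEL_DIR)
print(model.config.num_hidden_layers, model.config.hidden_size)  # expected: 12 768
```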
Running
Probing experiments — reproduces the paper's methodology
```bash
python src/run_indicbert.py
```
Extracts embeddings from all 12 IndicBERT layers, trains LogisticRegression on each, reports best-layer accuracy.
Results saved to results/indicbert_results.json
Fine-tuning comparison — our extension
```bash
python src/finetune_indicbert.py
```
Trains frozen MLP probe and fine-tunes IndicBERT end-to-end for each task.
Results saved to results/finetuning_comparison.json and results/finetuning_comparison.csv
Implementation Notes
- `keep_accents=True` in the tokenizer — critical for preserving Indic vowel matras (diacritics)
- CLS token (index 0 of the last hidden state) used as the sentence representation
- Train / Dev / Test split: 70% / 10% / 20% with stratified sampling (sketched after this list)
- Early stopping with patience=3 on validation accuracy
- Fine-tuning uses differential learning rates: `2e-5` for IndicBERT, `1e-3` for the classifier head
- Fresh IndicBERT weights loaded for each task to prevent cross-task interference
- MPS (Apple Silicon GPU) used for acceleration
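The stratified split above can be reproduced with two calls to scikit-learn's `train_test_split` (a sketch; `sentences`, `labels`, and the random seed are placeholders, not the repo's exact code):

```python
from sklearn.model_selection import train_test_split

# 70% train, then split the remaining 30% into 10% dev and 20% test (stratified throughout)
train_x, rest_x, train_y, rest_y = train_test_split(
    sentences, labels, test_size=0.30, stratify=labels, random_state=42)
dev_x, test_x, dev_y, test_y = train_test_split(
    rest_x, rest_y, test_size=2/3, stratify=rest_y, random_state=42)
```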
Requirements
```
torch>=1.13.0
transformers>=4.30.0
sentencepiece>=0.1.99
numpy>=1.23.0
scikit-learn>=1.2.0
pandas>=1.5.0
```
About IndicBERT
IndicBERT is an ALBERT-based multilingual model created by AI4Bharat (IIT Madras) and pretrained on ~9 billion tokens across 12 Indian languages using Masked Language Modeling (MLM) and Sentence Order Prediction (SOP). It has 12 transformer layers with a hidden dimension of 768.
- Model: `ai4bharat/indic-bert`
- Architecture: ALBERT-base
- Hidden size: 768
- Layers: 12
- Vocabulary: 200,000 SentencePiece tokens
Original Paper and Codebase
This repository builds on the original IndicBertology codebase:
- Paper: IndicSentEval — Aravapalli et al., 2024. arXiv: 2410.02611
- Original repo: https://github.com/aforakhilesh/IndicBertology
```bibtex
@article{aravapalli2024indicsenteval,
  title   = {IndicSentEval: How Effectively do Multilingual Transformer Models encode Linguistic Properties for Indic Languages?},
  author  = {Aravapalli, Akhilesh and others},
  journal = {arXiv preprint arXiv:2410.02611},
  year    = {2024},
  url     = {https://arxiv.org/abs/2410.02611}
}
```
Acknowledgements
- AI4Bharat for the IndicBERT model and IndicCorp dataset
- Authors of IndicSentEval for the probing dataset and methodology