Hindi BabyLM: Data-Efficient Language Modeling for Hindi

License: MIT · Python 3.8+ · PyTorch · GitHub

📖 Model Overview

A collection of data-efficient Hindi language models trained as part of the BabyLM Challenge adaptation for morphologically rich languages. This repository contains 13 model variants spanning 2 architectures (GPT-2, DeBERTa), 5 tokenizer types (BPE, WordPiece, SentencePiece Unigram, Character-Bigram, Character-Level), and 2 data scales (10M and 100M words).

Research Question: How do tokenization strategies and model architectures affect linguistic competence when training with limited data on a morphologically rich language like Hindi?

Key Findings:

  • BPE tokenization with 32K vocabulary achieves the best overall balance of task performance and syntactic competence
  • GPT-2 significantly outperforms DeBERTa in the low-data regime (10M words)
  • Character-level tokenization catastrophically fails on syntactic evaluation (18.3% MultiBLiMP) despite reasonable task accuracy
  • Scaling from 10M to 100M words yields +1.6% IndicGLUE and +4.9% MultiBLiMP improvement
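To make the tokenizer contrast concrete, here is a minimal sketch of how one Hindi word decomposes at the two character granularities. The non-overlapping pairing shown for the bigram scheme is an assumption for illustration; the repository's actual character-bigram tokenizer is trained statistically and may segment differently.

```python
# Hypothetical illustration of two tokenization granularities on one word;
# the repository's real tokenizers are learned from the corpus.
word = "लड़कियों"  # "girls" — number and case are marked in the suffix

# Character-level: one token per Unicode code point (vocab ≈ 200)
chars = list(word)

# Character-bigram: non-overlapping pairs of code points (vocab ≈ 2K);
# this pairing rule is an assumption, not taken from the repo
bigrams = [word[i:i + 2] for i in range(0, len(word), 2)]

print(chars)    # each matra and nukta becomes its own token
print(bigrams)  # roughly half as many tokens per word
```

The morphological information that MultiBLiMP probes lives almost entirely in such suffixes, which is one plausible reason subword vocabularies fare better than raw characters.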

📊 Results

GPT-2 Models (10M words, 110M parameters)

| Model | Tokenizer | Vocab | IndicGLUE (Avg) | MultiBLiMP (Avg) | Perplexity |
|---|---|---|---|---|---|
| gpt_10M_bpe_32k | BPE | 32K | 60.45% | 87.69% | 129.25 |
| gpt_10M_character_bigram | Char-Bigram | 2K | 59.94% | 88.32% | 6.03 |
| gpt_10M_sentencepiece_unigram | SP Unigram | 32K | 59.82% | 87.29% | 124.56 |
| gpt_10M_bpe_8k | BPE | 8K | 58.53% | 86.51% | 84.15 |
| gpt_10M_wordpiece | WordPiece | 32K | 58.20% | 86.14% | 124.70 |
| gpt_10M_bpe_16k | BPE | 16K | 57.62% | 87.96% | 113.17 |
| gpt_10M_character_level | Char-Level | 200 | 53.84% | 18.32% | 3.93 |

GPT-2 Model (100M words) — Best Overall

| Model | Tokenizer | Vocab | IndicGLUE (Avg) | MultiBLiMP (Avg) | Perplexity |
|---|---|---|---|---|---|
| gpt_100M_large | BPE | 32K | 62.09% | 92.54% | 83.50 |

DeBERTa Models (10M words, 86M parameters)

| Model | Tokenizer | Vocab | IndicGLUE (Avg) | MultiBLiMP (Avg) | Perplexity |
|---|---|---|---|---|---|
| deberta_10M_bpe_32K | BPE | 32K | 47.70% | 69.74% | 616.69 |
| deberta_10M_wordpiece_32K | WordPiece | 32K | 40.92% | 69.69% | 525.93 |
| deberta_10M_sentencepiece_unigram_32K | SP Unigram | 32K | 38.93% | 68.51% | 644.42 |

Note: Perplexity values are not directly comparable across tokenizers with different vocabulary sizes, as the prediction space differs. Character-level models have naturally lower perplexity due to smaller vocabularies.
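The vocabulary effect can be seen from first principles: a model that spreads probability uniformly over a vocabulary of size V has perplexity exactly V, so both the floor and the scale of the metric shift with the tokenizer. A small illustration:

```python
import math

def perplexity(nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood."""
    return math.exp(sum(nlls) / len(nlls))

def uniform_baseline(vocab_size, n_tokens=100):
    """A uniform guesser assigns each token probability 1/V, so every
    per-token NLL is ln(V) and the perplexity equals V."""
    return perplexity([math.log(vocab_size)] * n_tokens)

# A 200-symbol character vocabulary vs. a 32K subword vocabulary:
print(uniform_baseline(200))     # ≈ 200 — already near the char-level models' scale
print(uniform_baseline(32_000))  # ≈ 32000
```

This is why the character-level model's perplexity of 3.93 says little about its linguistic competence relative to the BPE models' triple-digit values.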

🏆 Best Model: IndicGLUE Per-Task Breakdown (gpt_100M_large)

| Task | Accuracy |
|---|---|
| BBC Articles Classification | 78.06% |
| Product Review Sentiment | 73.42% |
| Discourse Mode | 73.32% |
| Choice of Plausible Alternatives | 63.64% |
| Movie Review Sentiment | 61.94% |
| Wikipedia Section Title Prediction | 43.70% |
| Cloze-style Multiple-Choice QA | 40.70% |
| Average (7 tasks) | 62.09% |

WinogradNLI was skipped as the dataset contains only the entailment class in the train/validation splits.

🧠 MultiBLiMP Syntactic Evaluation (gpt_100M_large)

| Phenomenon | Accuracy | Correct / Total |
|---|---|---|
| Subject-Verb Agreement: Person (SV-P) | 96.60% | 398 / 412 |
| Subject-Predicate Agreement: Number (SP-#) | 95.00% | 95 / 100 |
| Subject-Predicate Agreement: Gender (SP-G) | 92.66% | 101 / 109 |
| Subject-Verb Agreement: Gender (SV-G) | 90.21% | 378 / 419 |
| Subject-Verb Agreement: Number (SV-#) | 88.21% | 359 / 407 |
| Overall (1,447 pairs) | 92.54% | |
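MultiBLiMP credits the model whenever it assigns the grammatical member of a minimal pair a higher probability than the ungrammatical one. A model-agnostic sketch of that evaluation loop (the toy `score_fn` below is hypothetical; in practice it would sum the language model's token log-probabilities over the sentence):

```python
def minimal_pair_accuracy(pairs, score_fn):
    """pairs: iterable of (grammatical, ungrammatical) sentence tuples.
    score_fn: maps a sentence to a log-probability (higher = more likely).
    A pair counts as correct when the grammatical sentence scores higher."""
    pairs = list(pairs)
    correct = sum(1 for good, bad in pairs if score_fn(good) > score_fn(bad))
    return correct / len(pairs)

# Toy scorer standing in for a real LM, with made-up log-probabilities:
toy_scores = {
    "वह जाती है": -12.3,   # grammatical: singular subject, singular verb
    "वह जाती हैं": -15.8,  # ungrammatical: number mismatch
}
print(minimal_pair_accuracy([("वह जाती है", "वह जाती हैं")], toy_scores.get))  # 1.0
```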

🏗️ Architecture Details

GPT-2 Small (Primary Model)

  • Type: Causal Language Model (autoregressive)
  • Parameters: 110M
  • Layers: 12 | Hidden Size: 768 | Attention Heads: 12
  • Activation: GELU
  • Context Length: 512 tokens
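The listed hyperparameters match a standard GPT-2 small configuration. A sketch of how it might be instantiated with `transformers` (the run's dropout and initializer settings are not stated here, so library defaults are assumed):

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Hypothetical reconstruction of the listed architecture; unstated
# settings (dropout, initializer range, etc.) fall back to defaults.
config = GPT2Config(
    vocab_size=32_000,  # the BPE 32K variant
    n_positions=512,    # context length
    n_embd=768,         # hidden size
    n_layer=12,
    n_head=12,
    activation_function="gelu_new",
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters():,} parameters")  # ≈ 110M with a 32K vocab
```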

DeBERTa Small

  • Type: Masked Language Model (bidirectional)
  • Parameters: 86M
  • Layers: 12 | Hidden Size: 768
  • Key Feature: Disentangled attention mechanism

💻 Usage

```python
import torch
from transformers import GPT2LMHeadModel, PreTrainedTokenizerFast

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained("Ayush-Talreja/hindi-babylm")
tokenizer = PreTrainedTokenizerFast.from_pretrained("Ayush-Talreja/hindi-babylm")
model.eval()  # disable dropout for inference

# Generate text
input_text = "भारत एक"
input_ids = tokenizer.encode(input_text, return_tensors="pt")

with torch.no_grad():
    output = model.generate(
        input_ids,
        max_new_tokens=50,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.8,
    )

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)
```

📦 Dataset

Hindi BabyLM Corpus — A curated, multi-source dataset for data-efficient Hindi language modeling.

Data Splits

| Split | Words | Documents |
|---|---|---|
| Training | 100M | 113,266 |
| Validation | 10M | 180,259 |
| Test | 10M | 180,399 |

Sources

| Source | Proportion | Description |
|---|---|---|
| IndicCorp V2 | ~50% | Curated news and general web text |
| Hindi Wikipedia | ~30% | Encyclopedia and reference material |
| IndicDialogue | ~15% | Movie and TV show subtitles |
| Children's Literature | ~5% | Stories and educational content |

Data Quality Pipeline

  • Unicode normalization (NFC)
  • Length filtering (30–2,000 characters)
  • Language detection (≥80% Devanagari script)
  • Word count filtering (2–10,000 words per document)
  • Near-duplicate removal (MinHash LSH, 256 permutations, threshold: 0.8)
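To illustrate the near-duplicate step, here is a toy, pure-Python version of the MinHash idea: 256 seeded min-hashes per document, where the fraction of matching signature slots estimates Jaccard similarity. This is a sketch only; a real pipeline would use an LSH index (e.g. the `datasketch` library) rather than comparing every pair.

```python
import hashlib

def shingles(text, n=3):
    """Character n-grams used as the document's feature set."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(features, num_perm=256):
    """One minimum over a seeded hash per 'permutation'; two signatures
    agree on a slot with probability equal to the sets' Jaccard similarity."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{f}".encode()).digest()[:8], "big")
            for f in features
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots ≈ Jaccard similarity of the documents."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc = "भारत एक विशाल देश है और यहाँ अनेक भाषाएँ बोली जाती हैं।"
near_dup = "भारत एक विशाल देश है और यहाँ कई भाषाएँ बोली जाती हैं।"
sim = estimated_jaccard(minhash_signature(shingles(doc)),
                        minhash_signature(shingles(near_dup)))
print(sim)  # far higher than for unrelated text; ≥ 0.8 pairs would be dropped
```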

⚙️ Training Configuration

| Parameter | Value |
|---|---|
| Optimizer | AdamW (β₁=0.9, β₂=0.999) |
| Learning Rate | 3e-4 |
| LR Schedule | Cosine with warmup |
| Batch Size | 32 × 8 gradient accumulation = 256 effective |
| Epochs | 10 |
| Weight Decay | 0.01 |
| Gradient Clipping | max_norm = 1.0 |
| Mixed Precision | BF16 |
| Hardware | NVIDIA GPU (LRZ HPC cluster) |
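"Cosine with warmup" is taken here to mean linear warmup to the peak rate followed by cosine decay toward zero; a minimal sketch of that schedule (the step counts are illustrative, not the actual run's):

```python
import math

def cosine_lr(step, total_steps, warmup_steps, peak_lr=3e-4):
    """Linear warmup to peak_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(1_000, 10_000, 1_000))   # peak (3e-4) at the end of warmup
print(cosine_lr(10_000, 10_000, 1_000))  # back to ~0 at the final step
```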

⚠️ Limitations and Biases

  • Limited training data: Models trained on 10M words are not competitive with full-scale language models. They are designed for research on data-efficient learning, not production use.
  • Hindi only: No multilingual capability. Performance on code-mixed Hindi-English text is untested.
  • Source biases: The training corpus over-represents formal written Hindi (news, Wikipedia) and movie dialogue. Spoken Hindi, regional dialects, and informal text are underrepresented.
  • Character-level tokenization failure: The character-level tokenizer achieves only 18.3% on MultiBLiMP (below chance), indicating that character-level representations alone are insufficient for capturing Hindi syntactic patterns at this data scale.
  • Evaluation scope: IndicGLUE covers 7 tasks (WinogradNLI skipped due to data issues). Results may not generalize to all Hindi NLP applications.

📚 Citation

If you use these models or the dataset, please cite:

@misc{talreja2025hindibabylm,
  title={Hindi BabyLM: Data-Efficient Language Modeling for Hindi},
  author={Talreja, Ayush},
  year={2025},
  howpublished={\url{https://huggingface.co/Ayush-Talreja/hindi-babylm}},
  note={BabyLM Challenge adaptation for morphologically rich languages}
}

🙏 Acknowledgments

  • AI4Bharat for IndicCorp, IndicGLUE, and Hindi NLP resources
  • BabyLM Challenge organizers for the research motivation
  • Hugging Face for model hosting and community support
  • LRZ (Leibniz-Rechenzentrum) for HPC compute resources

⚖️ License

This project is licensed under the MIT License — see the LICENSE file for details.

📧 Contact
