RSL-BHARATI v3 — Multilingual IKS Tokenizer

BHARATI (भारती — Goddess of Speech and Learning, and a name for India) is a SentencePiece BPE tokenizer trained on seven languages (English plus six Indic languages), with native IKS domain vocabulary pre-seeded.

Unlike general-purpose tokenizers that fragment Indic scripts into byte-level tokens, BHARATI provides native subword tokenization for Sanskrit, Hindi, Tamil, Telugu, Kannada, and Malayalam — critical for IKS educational content.

Tokenizer Summary

| Property | Value |
|---|---|
| Algorithm | BPE (SentencePiece) |
| Vocabulary size | 32,000 |
| Languages | English, Hindi, Sanskrit, Tamil, Telugu, Kannada, Malayalam |
| Model size | 957 KB (`.model`) + 708 KB (`.vocab`) |
| Character coverage | 99.98% |
| Byte fallback | Enabled (for unseen scripts) |
| Special tokens | `<pad>`, `<unk>`, `<s>`, `</s>` |
| License | CC BY-NC 4.0 |

Why A Custom Tokenizer?

Standard LLM tokenizers (LLaMA, GPT-4) treat Indian scripts poorly:

| Tokenizer | Tamil "திருக்குறள்" | Tokens |
|---|---|---|
| LLaMA | `<0xE0><0xAE><0xA4>`... (byte fallback) | 15 |
| GPT-4 (tiktoken) | similar byte-level fragmentation | 12 |
| BHARATI v3 | திருக்குறள் (single token) | 1 |
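The fragmentation in the first two rows follows directly from UTF-8: each Tamil code point occupies three bytes, so a byte-fallback tokenizer starts from dozens of raw byte tokens and can only merge the sequences its (largely non-Tamil) training data covered. A minimal illustration in plain Python:

```python
# The Tamil word from the comparison above
word = "திருக்குறள்"

# Code points vs. raw UTF-8 bytes
code_points = len(word)            # 11 characters
utf8_bytes = word.encode("utf-8")  # 3 bytes per Tamil code point

print(code_points, len(utf8_bytes))  # 11 33

# Byte-fallback tokenizers render unseen bytes as <0xNN> tokens,
# e.g. for the first character த (U+0BA4):
first_char_tokens = [f"<0x{b:02X}>" for b in "த".encode("utf-8")]
print(first_char_tokens)  # ['<0xE0>', '<0xAE>', '<0xA4>']
```

This is why a dedicated Indic vocabulary pays off: 33 bytes is the floor a byte-level tokenizer merges down from, while a tokenizer whose vocabulary contains the word directly emits one token.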

BHARATI was trained on ~850 MB of multilingual text (English plus six Indic languages) specifically to handle IKS educational content efficiently.

Pre-Seeded IKS Domain Tokens

The vocabulary includes hand-seeded IKS technical terms in Devanagari, Tamil, Telugu, and Kannada scripts. These reserved tokens ensure IKS terminology is tokenized as single units rather than fragmented into subword pieces.
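SentencePiece supports this kind of seeding at training time via its `user_defined_symbols` option, which reserves each listed string as a single unsplittable token. A sketch of how such a vocabulary could be trained — the corpus path and term list here are illustrative, not the actual training configuration:

```shell
spm_train \
  --input=iks_corpus_all_languages.txt \
  --model_type=bpe \
  --model_prefix=iks_sp_tokenizer_v3 \
  --vocab_size=32000 \
  --character_coverage=0.9998 \
  --byte_fallback=true \
  --user_defined_symbols=திருக்குறள்,भगवद्गीता
```

`--byte_fallback=true` matches the summary table above: any script absent from training still round-trips through byte tokens instead of mapping to `<unk>`.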

Training Data (~850 MB)

| Language | Sources | Size |
|---|---|---|
| English | Gutenberg IKS filtered + books | 74 MB |
| Sanskrit | Sangraha Sanskrit + Bhagavad Gita + instructions | 56+ MB |
| Hindi | Sangraha Hindi + instruction JSONL | 127+ MB |
| Tamil | Sangraha Tamil + Sangam + Thirukkural + instructions | 135+ MB |
| Telugu | Sangraha Telugu + instruction JSONL | 142 MB |
| Kannada | Sangraha Kannada + instruction JSONL | 141 MB |
| Malayalam | Sangraha Malayalam + instruction JSONL | 141 MB |

Files

| File | Description | SHA-256 |
|---|---|---|
| `iks_sp_tokenizer_v3.model` | SentencePiece model (957 KB) | `32944da13ad4dd148e27a3b984827770784540c9b4f00b9440819139987e9d68` |
| `iks_sp_tokenizer_v3.vocab` | Human-readable vocabulary (708 KB, 32,000 entries) | `9a89c7bba1d79047a1b9b9359f0fdd143dd73a9708b6b106e692b144eec899e2` |
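Downloaded files can be checked against the digests above with a few lines of standard-library Python. The helper below is generic (not part of this repository); pass it the downloaded path and compare against the table:

```python
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Example (expected digest abbreviated; use the full value from the table):
# assert sha256_of("iks_sp_tokenizer_v3.model").startswith("32944da1")
```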

How to Use

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="iks_sp_tokenizer_v3.model")

# Tokenize English
tokens = sp.encode("Apply Krama Patha for number systems", out_type=str)
# ['▁Apply', '▁Krama', '▁Patha', '▁for', '▁number', '▁systems']

# Tokenize Tamil
tokens = sp.encode("திருக்குறள் அறத்துப்பால்", out_type=str)
# ['திருக்குறள்', '▁அறத்துப்பால்']

# Tokenize Hindi
tokens = sp.encode("भगवद्गीता का अध्ययन करें", out_type=str)
# ['भगवद्गीता', '▁का', '▁अध्ययन', '▁करें']

# Encode to IDs (for model input)
ids = sp.encode("Apply Dhyana-based Focus Protocol")
# [234, 5678, ...]
```

Version History

| Version | Vocab | Languages | Key change |
|---|---|---|---|
| v1 | 32K | English + Sanskrit | Tamil fragmented to byte level (15 tokens/word) |
| v2 | 32K | + Hindi, Tamil | Telugu/Kannada/Malayalam still at 0.4 chars/token |
| v3 | 32K | All 7 | Every Indic script gets native subword tokens |
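The chars-per-token figure cited for v2 is a simple fertility metric: characters of input divided by tokens produced (higher means less fragmentation). The token counts below mirror the comparison table earlier in this card; they are illustrative, not re-measured here:

```python
def chars_per_token(text: str, num_tokens: int) -> float:
    """Fertility: input characters per emitted token (higher is better)."""
    return len(text) / num_tokens

word = "திருக்குறள்"  # 11 code points

print(round(chars_per_token(word, 15), 2))  # byte-fallback tokenizer: 0.73
print(round(chars_per_token(word, 1), 2))   # single native token: 11.0
```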

Citation

@misc{rsl_bharati_v3,
  title={RSL-BHARATI v3: 7-Language IKS Tokenizer with Domain-Seeded Vocabulary},
  author={Sivasubramani, Santhosh},
  year={2026},
  institution={INTRINSIC Lab, RSL Foundation, IIT Delhi},
  url={https://huggingface.co/RSL-INTRINSICLab-IIT/RSL-BHARATI-v3}
}

Related Resources

  • RSL-SETU-Classifier-15M — Uses this tokenizer for technique classification
  • RSL-PRAJNA-v2 — Benchmark containing multilingual evaluation questions
  • RSL-SHRUTI-Thirukkural — Tamil classical text dataset

License

CC BY-NC 4.0 — Free for research and educational use. Commercial use requires a license from IIT Delhi.

Acknowledgment

Demonstrated at the Bharat Bodhan AI Conclave (New Delhi), anchored and driven by the Ministry of Education and IIT Madras.

Contact

Prof. Santhosh Sivasubramani
Director, INTRINSIC Laboratory
RSL Foundation, Centre for SeNSE, IIT Delhi
ssivasub@iitd.ac.in
https://intrinsic.iitd.ac.in
