# RSL-BHARATI v3 — Multilingual IKS Tokenizer

BHARATI (भारती — Goddess of Speech & Learning / India) is a SentencePiece BPE tokenizer trained on seven languages (English plus six Indic languages), with native IKS (Indian Knowledge Systems) domain vocabulary pre-seeded.
Unlike general-purpose tokenizers that fragment Indic scripts into byte-level tokens, BHARATI provides native subword tokenization for Sanskrit, Hindi, Tamil, Telugu, Kannada, and Malayalam — critical for IKS educational content.
## Tokenizer Summary
| Property | Value |
|---|---|
| Algorithm | BPE (SentencePiece) |
| Vocabulary size | 32,000 |
| Languages | English, Hindi, Sanskrit, Tamil, Telugu, Kannada, Malayalam |
| Model size | 957 KB (.model) + 708 KB (.vocab) |
| Character coverage | 99.98% |
| Byte fallback | Enabled (for unseen scripts) |
| Special tokens | `<pad>`, `<unk>`, `<s>`, `</s>` |
| License | CC BY-NC 4.0 |
## Why a Custom Tokenizer?

Standard LLM tokenizers (e.g. LLaMA's, GPT-4's tiktoken) handle Indic scripts poorly:
| Tokenizer | Tamil "திருக்குறள்" | Tokens |
|---|---|---|
| LLaMA | `<0xE0><0xAE><0xA4>...` (byte fallback) | 15 tokens |
| GPT-4 (tiktoken) | Similar byte-level fragmentation | 12 tokens |
| BHARATI v3 | திருக்குறள் (single token) | 1 token |
BHARATI was trained on ~850 MB of Indic text specifically to handle IKS educational content efficiently.
## Pre-Seeded IKS Domain Tokens
The vocabulary includes hand-seeded IKS technical terms in Devanagari, Tamil, Telugu, and Kannada scripts. These reserved tokens ensure IKS terminology is tokenized as single units rather than fragmented into subword pieces.
## Training Data (~850 MB)
| Language | Sources | Size |
|---|---|---|
| English | Gutenberg IKS filtered + books | 74 MB |
| Sanskrit | Sangraha Sanskrit + Bhagavad Gita + instructions | 56+ MB |
| Hindi | Sangraha Hindi + instruction JSONL | 127+ MB |
| Tamil | Sangraha Tamil + Sangam + Thirukkural + instructions | 135+ MB |
| Telugu | Sangraha Telugu + instruction JSONL | 142 MB |
| Kannada | Sangraha Kannada + instruction JSONL | 141 MB |
| Malayalam | Sangraha Malayalam + instruction JSONL | 141 MB |
## Files
| File | Description | SHA-256 |
|---|---|---|
| `iks_sp_tokenizer_v3.model` | SentencePiece model (957 KB) | `32944da13ad4dd148e27a3b984827770784540c9b4f00b9440819139987e9d68` |
| `iks_sp_tokenizer_v3.vocab` | Human-readable vocabulary (708 KB, 32,000 entries) | `9a89c7bba1d79047a1b9b9359f0fdd143dd73a9708b6b106e692b144eec899e2` |
## How to Use

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="iks_sp_tokenizer_v3.model")

# Tokenize English
tokens = sp.encode("Apply Krama Patha for number systems", out_type=str)
# ['▁Apply', '▁Krama', '▁Patha', '▁for', '▁number', '▁systems']

# Tokenize Tamil
tokens = sp.encode("திருக்குறள் அறத்துப்பால்", out_type=str)
# ['திருக்குறள்', '▁அறத்துப்பால்']

# Tokenize Hindi
tokens = sp.encode("भगवद्गीता का अध्ययन करें", out_type=str)
# ['भगवद्गीता', '▁का', '▁अध्ययन', '▁करें']

# Encode to IDs (for model input)
ids = sp.encode("Apply Dhyana-based Focus Protocol")
# [234, 5678, ...]
```
## Version History
| Version | Vocab | Languages | Key Change |
|---|---|---|---|
| v1 | 32K | English + Sanskrit | Tamil fragmented to byte-level (15 tokens/word) |
| v2 | 32K | + Hindi, Tamil | Telugu/Kannada/Malayalam still 0.4 chars/token |
| v3 | 32K | All 7 | Every Indic script gets native subword tokens |
## Citation

```bibtex
@misc{rsl_bharati_v3,
  title={RSL-BHARATI v3: 7-Language IKS Tokenizer with Domain-Seeded Vocabulary},
  author={Sivasubramani, Santhosh},
  year={2026},
  institution={INTRINSIC Lab, RSL Foundation, IIT Delhi},
  url={https://huggingface.co/RSL-INTRINSICLab-IIT/RSL-BHARATI-v3}
}
```
## Related Resources
- RSL-SETU-Classifier-15M — Uses this tokenizer for technique classification
- RSL-PRAJNA-v2 — Benchmark containing multilingual evaluation questions
- RSL-SHRUTI-Thirukkural — Tamil classical text dataset
## License
CC BY-NC 4.0 — Free for research and educational use. Commercial use requires a license from IIT Delhi.
## Acknowledgment

Demonstrated at the Bharat Bodhan AI Conclave (New Delhi), anchored and driven by the Ministry of Education and IIT Madras.
## Contact

Prof. Santhosh Sivasubramani  
Director, INTRINSIC Laboratory  
RSL Foundation, Centre for SeNSE, IIT Delhi  
ssivasub@iitd.ac.in  
https://intrinsic.iitd.ac.in