# RSL-BHARATI v3 — Multilingual IKS Tokenizer

BHARATI (भारती — Goddess of Speech & Learning / India) is a SentencePiece BPE tokenizer trained on seven languages (English plus six Indic languages), with native IKS (Indian Knowledge Systems) domain vocabulary pre-seeded.
Unlike general-purpose tokenizers that fragment Indic scripts into byte-level tokens, BHARATI provides native subword tokenization for Sanskrit, Hindi, Tamil, Telugu, Kannada, and Malayalam — critical for IKS educational content.
## Tokenizer Summary
| Property | Value |
|---|---|
| Algorithm | BPE (SentencePiece) |
| Vocabulary size | 32,000 |
| Languages | English, Hindi, Sanskrit, Tamil, Telugu, Kannada, Malayalam |
| Model size | 957 KB (.model) + 708 KB (.vocab) |
| Character coverage | 99.98% |
| Byte fallback | Enabled (for unseen scripts) |
| Special tokens | `<pad>`, `<unk>`, `<s>`, `</s>` |
| License | CC BY-NC 4.0 |
## Why a Custom Tokenizer?

Standard LLM tokenizers (e.g. LLaMA's, GPT-4's tiktoken) handle Indic scripts poorly:
| Tokenizer | Tamil "திருக்குறள்" | Tokens |
|---|---|---|
| LLaMA | `<0xE0><0xAE><0xA4>...` (byte fallback) | 15 tokens |
| GPT-4 (tiktoken) | Similar byte-level fragmentation | 12 tokens |
| BHARATI v3 | திருக்குறள் (single token) | 1 token |
BHARATI was trained on ~850 MB of Indic text specifically to handle IKS educational content efficiently.
## Pre-Seeded IKS Domain Tokens
The vocabulary includes hand-seeded IKS technical terms in Devanagari, Tamil, Telugu, and Kannada scripts. These reserved tokens ensure IKS terminology is tokenized as single units rather than fragmented into subword pieces.
## Training Data (~850 MB)
| Language | Sources | Size |
|---|---|---|
| English | Gutenberg IKS filtered + books | 74 MB |
| Sanskrit | Sangraha Sanskrit + Bhagavad Gita + instructions | 56+ MB |
| Hindi | Sangraha Hindi + instruction JSONL | 127+ MB |
| Tamil | Sangraha Tamil + Sangam + Thirukkural + instructions | 135+ MB |
| Telugu | Sangraha Telugu + instruction JSONL | 142 MB |
| Kannada | Sangraha Kannada + instruction JSONL | 141 MB |
| Malayalam | Sangraha Malayalam + instruction JSONL | 141 MB |
## Files
| File | Description | SHA-256 |
|---|---|---|
| `iks_sp_tokenizer_v3.model` | SentencePiece model (957 KB) | `32944da13ad4dd148e27a3b984827770784540c9b4f00b9440819139987e9d68` |
| `iks_sp_tokenizer_v3.vocab` | Human-readable vocabulary (708 KB, 32,000 entries) | `9a89c7bba1d79047a1b9b9359f0fdd143dd73a9708b6b106e692b144eec899e2` |
## How to Use

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="iks_sp_tokenizer_v3.model")

# Tokenize English
tokens = sp.encode("Apply Krama Patha for number systems", out_type=str)
# ['▁Apply', '▁Krama', '▁Patha', '▁for', '▁number', '▁systems']

# Tokenize Tamil
tokens = sp.encode("திருக்குறள் அறத்துப்பால்", out_type=str)
# ['திருக்குறள்', '▁அறத்துப்பால்']

# Tokenize Hindi
tokens = sp.encode("भगवद्गीता का अध्ययन करें", out_type=str)
# ['भगवद्गीता', '▁का', '▁अध्ययन', '▁करें']

# Encode to IDs (for model input)
ids = sp.encode("Apply Dhyana-based Focus Protocol")
# [234, 5678, ...]
```
## Version History
| Version | Vocab | Languages | Key Change |
|---|---|---|---|
| v1 | 32K | English + Sanskrit | Tamil fragmented to byte-level (15 tokens/word) |
| v2 | 32K | + Hindi, Tamil | Telugu/Kannada/Malayalam still 0.4 chars/token |
| v3 | 32K | All 7 | Every Indic script gets native subword tokens |
## Citation

```bibtex
@misc{rsl_bharati_v3,
  title={RSL-BHARATI v3: 7-Language IKS Tokenizer with Domain-Seeded Vocabulary},
  author={Sivasubramani, Santhosh},
  year={2026},
  institution={INTRINSIC Lab, RSL Foundation, IIT Delhi},
  url={https://huggingface.co/RSL-INTRINSICLab-IIT/RSL-BHARATI-v3}
}
```
## Related Resources
- RSL-SETU-Classifier-15M — Uses this tokenizer for technique classification
- RSL-PRAJNA-v2 — Benchmark containing multilingual evaluation questions
- RSL-SHRUTI-Thirukkural — Tamil classical text dataset
## License
CC BY-NC 4.0 — Free for research and educational use. Commercial use requires a license from IIT Delhi.
## Acknowledgment

Demonstrated at the Bharat Bodhan AI Conclave (New Delhi), anchored and driven by the Ministry of Education and IIT Madras.
## Contact

Prof. Santhosh Sivasubramani  
Director, INTRINSIC Laboratory  
RSL Foundation, Centre for SeNSE, IIT Delhi  
ssivasub@iitd.ac.in  
https://intrinsic.iitd.ac.in