# Hindi Unigram Tokenizer (32K)

A unigram language-model tokenizer for Hindi (Devanagari script), built with SentencePiece and trained on Hindi Wikipedia.
- Vocab size: 32,000
- Algorithm: Unigram (SentencePiece)
- Training data: Hindi Wikipedia (~519 MB, 688K lines)
- Character coverage: 0.9995
- Normalization: nmt_nfkc
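For reference, a configuration like the one listed above corresponds roughly to the following `spm_train` invocation. This is a sketch: the input path is illustrative, and the exact flags used for this model are not published in the card.

```shell
spm_train \
  --input=hi_wikipedia.txt \
  --model_prefix=hindi_unigram \
  --model_type=unigram \
  --vocab_size=32000 \
  --character_coverage=0.9995 \
  --normalization_rule_name=nmt_nfkc
```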
Fertility is roughly 1.8–2.0 tokens per word on Hindi text. Because the unigram model assigns probabilities to alternative segmentations, the tokenizer supports subword regularization (sampling a different segmentation of the same text each pass) when training downstream models.
## Usage

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("hindi_unigram.model")

# Segment into subword pieces, or directly into vocabulary ids
tokens = sp.EncodeAsPieces("भारत एक महान देश है।")
ids = sp.EncodeAsIds("भारत एक महान देश है।")

# Decoding the ids reconstructs the original text
print(sp.DecodeIds(ids))
```
## From the Hugging Face Hub

```python
from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Download the model file from the Hub, then load it as usual
model_path = hf_hub_download(
    repo_id="adityaghai07/ag_hindi_uni_tokenizer_32k",
    filename="hindi_unigram.model",
)

sp = spm.SentencePieceProcessor()
sp.Load(model_path)
```