Hindi Unigram Tokenizer (32K)

Unigram language model tokenizer for Hindi (Devanagari script) using SentencePiece. Trained on Hindi Wikipedia.

  • Vocab size: 32,000
  • Algorithm: Unigram (SentencePiece)
  • Training data: Hindi Wikipedia (~519 MB, 688K lines)
  • Character coverage: 0.9995
  • Normalization: nmt_nfkc

Fertility is roughly 1.8–2.0 tokens per word on Hindi text. Because the unigram model is probabilistic, it also supports subword regularization: segmentations can be sampled during downstream model training rather than always taking the single best one.

Usage

import sentencepiece as spm

# Load the trained tokenizer model
sp = spm.SentencePieceProcessor()
sp.Load("hindi_unigram.model")

# Encode to subword pieces and to integer ids, then decode back
tokens = sp.EncodeAsPieces("भारत एक महान देश है।")
ids = sp.EncodeAsIds("भारत एक महान देश है।")
print(sp.DecodeIds(ids))  # भारत एक महान देश है।

From HuggingFace Hub

from huggingface_hub import hf_hub_download
import sentencepiece as spm

# Download the model file from the Hub, then load it as usual
model_path = hf_hub_download(
    repo_id="adityaghai07/ag_hindi_uni_tokenizer_32k",
    filename="hindi_unigram.model"
)
sp = spm.SentencePieceProcessor()
sp.Load(model_path)

Source

GitHub
