Khmer 12k Tokenizer (SentencePiece Exact + Byte Fallback)

Repo: Msok99/khmer_12k_tok_sp_exact

This repository provides a Khmer SentencePiece tokenizer (12k vocab) packaged for Hugging Face such that:

  • βœ… Token IDs match the original SentencePieceProcessor outputs
  • βœ… Byte fallback is preserved (emoji, ZWSP, symbols like /, %, etc.)
  • βœ… Round-trip decode is safe for byte-fallback cases (no silent <unk> loss)

Why this repo exists: Hugging Face "fast" Unigram tokenizers (tokenizer.json) often do not implement SentencePiece byte fallback, so characters absent from the vocabulary (emoji, ZWSP, some symbols) get silently mapped to <unk>. This repo avoids that by wrapping SentencePieceProcessor directly.
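With byte fallback, a character that has no piece in the vocabulary is split into its UTF-8 bytes, and each byte maps to a dedicated <0xNN> token, so nothing collapses to <unk>. A minimal sketch of that representation (illustrative helper, not part of this repo's code):

```python
def byte_fallback_tokens(text: str) -> list[str]:
    """Render a string as the <0xNN> byte tokens SentencePiece emits
    when the piece is absent from the vocabulary (byte fallback)."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

print(byte_fallback_tokens("\u200b"))  # ZWSP -> ['<0xE2>', '<0x80>', '<0x8B>']
print(byte_fallback_tokens("🔥"))      # 4-byte emoji -> ['<0xF0>', '<0x9F>', '<0x94>', '<0xA5>']
```

Because every possible byte has a token, any Unicode input remains losslessly representable.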


Files in this repo

  • tokenizer.model β€” the original SentencePiece model (source of truth)
  • tokenization_khmer_sp.py β€” custom tokenizer wrapper that calls SentencePieceProcessor directly
  • tokenizer_config.json, special_tokens_map.json β€” HF metadata for loading
  • (optional) README.md β€” this file

Loading (Recommended)

βœ… Hugging Face Transformers

from transformers import AutoTokenizer

HF_TOKEN = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # <-- paste your token here
repo_id = "Msok99/khmer_12k_tok_sp_exact"

tok = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,
    token=HF_TOKEN,   # for private/gated repos; safe to include even if public
)

tests = [
    "αžαŸ’αž‰αž»αŸ†αžŸαŸ’αžšαž›αžΆαž‰αŸ‹αž—αžΆαžŸαžΆαžαŸ’αž˜αŸ‚αžšαŸ”",
    "αžαŸ’αž„αŸƒαž“αŸαŸ‡αžαŸ’αž‰αž»αŸ†αž‘αŸ…αž’αŸ’αžœαžΎαž€αžΆαžš\u200bαž“αŸ… Wing BankαŸ”",     # ZWSP
    "αžαž˜αŸ’αž›αŸƒαžŸαŸαžœαžΆ: 25,000αŸ› / month (VAT 10%).",         # / %
    "αž’αžΌαž™ αž˜αŸ’αžŸαž·αž›αž˜αž·αž‰αž‚αžΊαžšαŸ†αž—αžΎαž”αžŽαžΆαžŸαŸ‹ πŸ˜‚πŸ”₯",                  # emoji
]

for t in tests:
    ids = tok.encode(t, add_special_tokens=False)
    toks = tok.convert_ids_to_tokens(ids)
    dec = tok.decode(ids)
    print("=" * 80)
    print("TEXT:", t)
    print("TOKENS:", toks)
    print("DECODE:", dec)
    print("ROUND-TRIP:", dec == t)
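The "safe round-trip" claim depends on the decoder reassembling runs of <0xNN> byte tokens back into UTF-8 text. A minimal sketch of that reassembly logic, assuming SentencePiece's standard byte-token and ▁ word-boundary conventions (this is an illustration, not the wrapper's actual implementation):

```python
import re

def decode_byte_tokens(tokens: list[str]) -> str:
    """Reassemble a mixed stream of plain pieces and <0xNN> byte-fallback
    tokens into text, decoding contiguous byte runs as UTF-8."""
    byte_tok = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")
    out, buf = [], bytearray()
    for tok in tokens:
        m = byte_tok.match(tok)
        if m:
            buf.append(int(m.group(1), 16))  # accumulate raw bytes
            continue
        if buf:
            out.append(buf.decode("utf-8"))  # flush a completed byte run
            buf = bytearray()
        out.append(tok.replace("\u2581", " "))  # ▁ marks a word boundary
    if buf:
        out.append(buf.decode("utf-8"))
    return "".join(out)

print(decode_byte_tokens(["\u2581VAT", "\u258110", "<0x25>"]))  # -> ' VAT 10%'
```

A fast tokenizer without this step would leave literal <0xNN> strings (or <unk>) in the decoded output, which is exactly the failure mode this repo works around.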