Khmer 12k Tokenizer (SentencePiece Exact + Byte Fallback)
Repo: Msok99/khmer_12k_tok_sp_exact
This repository provides a Khmer SentencePiece tokenizer (12k vocab) packaged for Hugging Face such that:
- ✅ Token IDs match the original `SentencePieceProcessor` outputs
- ✅ Byte fallback is preserved (emoji, ZWSP, symbols like `/`, `%`, etc.)
- ✅ Round-trip decode is safe for byte-fallback cases (no silent `<unk>` loss)

Why this repo exists: HF "fast" Unigram tokenizers (`tokenizer.json`) often do not implement SentencePiece byte fallback, which can turn emoji/ZWSP into `<unk>`. This repo avoids that by using a SentencePiece-backed tokenizer wrapper.
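To make the byte-fallback behavior concrete, here is a minimal, self-contained sketch (illustrative only, not the SentencePiece library code): when a character is out of vocabulary, byte fallback emits one `<0xNN>` token per UTF-8 byte instead of collapsing it to `<unk>`, so the original bytes survive decoding.

```python
# Illustrative sketch of SentencePiece-style byte fallback (not the repo's code):
# each UTF-8 byte of an out-of-vocabulary character becomes a <0xNN> token.

def byte_fallback_tokens(text: str) -> list[str]:
    """Render each UTF-8 byte of `text` as a SentencePiece byte token."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

def decode_byte_tokens(tokens: list[str]) -> str:
    """Inverse: collect the byte values and decode them back to text."""
    data = bytes(int(t[3:-1], 16) for t in tokens)
    return data.decode("utf-8")

# A zero-width space (U+200B) is E2 80 8B in UTF-8, so it survives as
# three byte tokens rather than becoming <unk>:
print(byte_fallback_tokens("\u200b"))  # ['<0xE2>', '<0x80>', '<0x8B>']
```

Because the byte tokens are lossless, `decode_byte_tokens(byte_fallback_tokens(s))` returns `s` for any string, which is exactly the round-trip property the checkmarks above claim for this tokenizer.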
Files in this repo

- `tokenizer.model` – the original SentencePiece model (source of truth)
- `tokenization_khmer_sp.py` – custom tokenizer wrapper that calls `SentencePieceProcessor` directly
- `tokenizer_config.json`, `special_tokens_map.json` – HF metadata for loading
- (optional) `README.md` – this file
Loading (Recommended)

Via Hugging Face Transformers:
```python
from transformers import AutoTokenizer

HF_TOKEN = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # <-- paste your token here

repo_id = "Msok99/khmer_12k_tok_sp_exact"
tok = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,  # required: loads the custom tokenization_khmer_sp.py wrapper
    token=HF_TOKEN,          # for private/gated repos; safe to include even if public
)

tests = [
    "αααα»ααααααΆααααΆααΆαααααα",
    "ααααααααααα»αααααααΎααΆα\u200bαα Wing Bankα",  # ZWSP
    "αααααααααΆ: 25,000α / month (VAT 10%).",  # / %
    "α’αΌα αααα·ααα·αααΊααααΎαααΆαα 🚀🔥",  # emoji
]

for t in tests:
    ids = tok.encode(t, add_special_tokens=False)
    toks = tok.convert_ids_to_tokens(ids)
    dec = tok.decode(ids)
    print("=" * 80)
    print("TEXT:", t)
    print("TOKENS:", toks)
    print("DECODE:", dec)
    print("ROUND-TRIP:", dec == t)
```