Khmer 12k Tokenizer (SentencePiece Exact + Byte Fallback)
Repo: Msok99/khmer_12k_tok_sp_exact
This repository provides a Khmer SentencePiece tokenizer (12k vocab) packaged for Hugging Face such that:
- ✅ Token IDs match the original `SentencePieceProcessor` outputs
- ✅ Byte fallback is preserved (emoji, ZWSP, symbols like `/`, `%`, etc.)
- ✅ Round-trip decode is safe for byte-fallback cases (no silent `<unk>` loss)

Why this repo exists: HF "fast" Unigram tokenizers (`tokenizer.json`) often do not implement SentencePiece byte fallback, which can turn emoji/ZWSP into `<unk>`. This repo avoids that by using a SentencePiece-backed tokenizer wrapper.
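To make the byte-fallback behavior concrete, here is a minimal, self-contained sketch (illustrative only, not the SentencePiece library code): when a character is out of vocabulary, byte fallback emits one `<0xNN>` token per UTF-8 byte instead of collapsing it to `<unk>`, so the original bytes survive decoding.

```python
# Illustrative sketch of SentencePiece-style byte fallback (not the repo's code):
# each UTF-8 byte of an out-of-vocabulary character becomes a <0xNN> token.

def byte_fallback_tokens(text: str) -> list[str]:
    """Render each UTF-8 byte of `text` as a SentencePiece byte token."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

def decode_byte_tokens(tokens: list[str]) -> str:
    """Inverse: collect the byte values and decode them back to text."""
    data = bytes(int(t[3:-1], 16) for t in tokens)
    return data.decode("utf-8")

# A zero-width space (U+200B) is E2 80 8B in UTF-8, so it survives as
# three byte tokens rather than becoming <unk>:
print(byte_fallback_tokens("\u200b"))  # ['<0xE2>', '<0x80>', '<0x8B>']
```

Because the byte tokens are lossless, `decode_byte_tokens(byte_fallback_tokens(s))` returns `s` for any string, which is exactly the round-trip property the checkmarks above claim for this tokenizer.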
Files in this repo

- `tokenizer.model` – the original SentencePiece model (source of truth)
- `tokenization_khmer_sp.py` – custom tokenizer wrapper that calls `SentencePieceProcessor` directly
- `tokenizer_config.json`, `special_tokens_map.json` – HF metadata for loading
- (optional) `README.md` – this file
Loading (Recommended)

Via Hugging Face Transformers:
```python
from transformers import AutoTokenizer

HF_TOKEN = "hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"  # <-- paste your token here

repo_id = "Msok99/khmer_12k_tok_sp_exact"
tok = AutoTokenizer.from_pretrained(
    repo_id,
    trust_remote_code=True,  # required: loads the custom tokenization_khmer_sp.py wrapper
    token=HF_TOKEN,          # for private/gated repos; safe to include even if public
)

tests = [
    "αααα»ααααααΆααααΆααΆαααααα",
    "ααααααααααα»αααααααΎααΆα\u200bαα Wing Bankα",  # ZWSP
    "αααααααααΆ: 25,000α / month (VAT 10%).",  # / %
    "α’αΌα αααα·ααα·αααΊααααΎαααΆαα 🚀🔥",  # emoji
]

for t in tests:
    ids = tok.encode(t, add_special_tokens=False)
    toks = tok.convert_ids_to_tokens(ids)
    dec = tok.decode(ids)
    print("=" * 80)
    print("TEXT:", t)
    print("TOKENS:", toks)
    print("DECODE:", dec)
    print("ROUND-TRIP:", dec == t)
```