Tibetan Normalisation - 5-gram KenLM Language Model (Tokenised)

A character-level 5-gram KenLM language model for Standard Classical Tibetan, trained on a large split of the ACTib corpus (Meelen & Roux 2020) that was pre-segmented using a customised version of the Botok Tibetan tokeniser (see Data Preparation). This model is not a standalone normaliser: it is designed to act as a re-ranker for beam search candidates produced by the pagantibet/normalisationS2S-tokenised sequence-to-sequence (S2S) model during Tibetan text normalisation.

By scoring multiple beam search hypotheses from the S2S model, the KenLM model selects the output most consistent with fluent tokenised Standard Classical Tibetan, improving normalisation quality when working in a tokenised pipeline. For the non-tokenised equivalent (recommended for most workflows), see pagantibet/5gram-kenLM_char.

This model is part of the PaganTibet project and accompanies the paper:

Meelen, M. & Griffiths, R.M. (2026) 'Historical Tibetan Normalisation: rule-based vs neural & n-gram LM methods for extremely low-resource languages' in Proceedings of the AI4CHIEF conference, Springer.

Please cite the paper and the code repository when using this model.


Model Overview

KenLM is an efficient n-gram language model toolkit that estimates the probability of a sequence of tokens given its preceding context, using smoothed n-gram statistics. This model operates at the character level on tokenised Tibetan Unicode text, i.e. text that has been pre-segmented into tokens using a customised version of the Botok tokeniser before character-level processing. It assigns log-probabilities to character sequences and thereby scores the fluency of candidate normalisation outputs within a tokenised pipeline.

At inference time, the tokenised S2S model generates multiple beam search hypotheses. The KenLM model scores each hypothesis, and the final normalised output is selected by combining the neural model score with the KenLM score, preferring whichever candidate is most consistent with standard tokenised written Tibetan.
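Conceptually, the re-ranking step can be sketched as an interpolation of the two log-scores. The function, candidate strings, and score values below are illustrative assumptions, not the inference script's exact formula:

```python
def rerank(hypotheses, lm_weight=0.2):
    """Pick the hypothesis with the best interpolated log-score.

    hypotheses: list of (text, neural_logprob, kenlm_logprob) tuples.
    The weighting mirrors the --lm_weight idea conceptually:
    (1 - w) * neural + w * LM.
    """
    def combined(h):
        _, neural, lm = h
        return (1 - lm_weight) * neural + lm_weight * lm
    return max(hypotheses, key=combined)[0]

# Toy beam of three candidates with made-up log-probabilities:
beam = [
    ("candidate_a", -12.0, -40.0),  # best neural score, poor LM score
    ("candidate_b", -13.0, -20.0),  # slightly worse neural, much better LM
    ("candidate_c", -15.0, -25.0),
]
best = rerank(beam, lm_weight=0.2)  # the LM term tips the choice to candidate_b
```

With lm_weight=0.0 the neural score alone decides; raising the weight lets the language model override marginal neural preferences.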

Pipeline note: Results from Meelen & Griffiths (2026) show that tokenisation is best applied after normalisation for most workflows. This tokenised KenLM is provided as the paired ranker for the tokenised S2S model and for experimental comparison. For new pipelines, the non-tokenised models (pagantibet/normalisationS2S-nontokenised + pagantibet/5gram-kenLM_char) are recommended.

Model Specifications

  • Type: Character-level n-gram language model (KenLM)
  • Order: 5-gram
  • Smoothing: Modified Kneser-Ney discounting
  • Pruning thresholds: 0 1 1 2 2
  • Format: ARPA (.arpa)
  • Script: Tokenised Tibetan Unicode characters
  • Training environment: Google Colab (host a290405c6e16, Linux 6.6.105+)
  • Training time: <5 minutes

Full parameter settings are reported in the Appendix of Meelen & Griffiths (2026).


Training Data

The model was trained on an 8-million-line tokenised split of the ACTib corpus (Standard Classical Tibetan; >180 million words total), available from Zenodo (Meelen & Roux 2020).

Before training, the ACTib corpus was cleaned and split into lines of varying, manuscript-like length using the createTiblines.py script (see Data_Preparation), which removes non-Tibetan content (e.g. page numbers) and introduces artificial linebreaks to match the sequence lengths encountered during normalisation. Lines were then tokenised with a customised version of the Botok Tibetan tokeniser via the botokenise_src-tgt.py script, before being converted to space-separated character sequences for KenLM training.
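As an illustration of the artificial-linebreaking idea (a sketch only, not the actual createTiblines.py logic), a long passage can be broken greedily at tsheg (U+0F0B) syllable boundaries:

```python
def split_to_lines(text, max_chars=4):
    """Greedily break a passage into shorter lines at Tibetan tsheg
    (U+0F0B) syllable boundaries. Illustrative only; for simplicity
    a trailing tsheg is re-attached to every syllable."""
    syllables = text.split("\u0f0b")
    lines, current = [], ""
    for syl in syllables:
        piece = syl + "\u0f0b"
        if current and len(current) + len(piece) > max_chars:
            lines.append(current)
            current = piece
        else:
            current += piece
    if current:
        lines.append(current)
    return lines

# Four single-letter syllables become two short lines:
lines = split_to_lines("\u0f40\u0f0b\u0f41\u0f0b\u0f42\u0f0b\u0f44", max_chars=4)
```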


Training the Model

The training notebook KenLM_trainforNormalisation.ipynb walks through the full training process and can be run on Google Colab. The key steps are as follows:

Step 1 β€” Install KenLM

sudo apt-get install -y build-essential cmake libboost-all-dev \
    libboost-program-options-dev libboost-system-dev \
    libboost-thread-dev libboost-test-dev libeigen3-dev zlib1g-dev
git clone https://github.com/kpu/kenlm.git
cd kenlm && mkdir build && cd build
cmake .. && make -j 4

Step 2 β€” Tokenise the Training Text

Before character-level conversion, tokenise the cleaned ACTib lines using the Botok-based script:

python3 botokenise_src-tgt.py

See the tokenisation ReadMe for full usage details.

Step 3 β€” Convert to Space-Separated Characters

KenLM requires space-separated tokens. For character-level modelling, each character must be separated by a space. This conversion can be done in Python:

with open("actib_8m_lines_tokenised.txt", encoding="utf-8") as f_in, \
     open("actib_char_tok.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        f_out.write(" ".join(line.strip()) + "\n")
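Note that Python iterates strings by Unicode code point, so this conversion also separates combining marks (e.g. vowel signs) from their base consonants, which is what character-level modelling here operates on. A quick illustration:

```python
# Each Unicode code point becomes its own "character" token,
# including combining vowel signs such as the naro (U+0F7C):
syllable = "\u0f56\u0f7c\u0f51"   # the syllable "bod"
spaced = " ".join(syllable)       # three space-separated code points
```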

Step 4 β€” Train the 5-gram KenLM

Using the KenLM lmplz binary with the pruning thresholds and modified Kneser-Ney smoothing used in Meelen & Griffiths (2026):

./kenlm/build/bin/lmplz \
    -o 5 \
    --prune 0 1 1 2 2 \
    --discount_fallback \
    < actib_char_tok.txt \
    > model_5gram_char_tok.arpa

Parameter notes:

  • -o 5 – 5-gram order
  • --prune 0 1 1 2 2 – per-order pruning thresholds: all unigrams are kept, while bigrams/trigrams occurring once and 4-/5-grams occurring twice or fewer are pruned
  • --discount_fallback – enables modified Kneser-Ney discounting with fallback discounts, recommended for robustness on this data

Training completes in under 5 minutes on Google Colab. Full parameter settings are reported in the Appendix of Meelen & Griffiths (2026).

Step 5 β€” (Optional) Convert to Binary Format

For faster loading with the compiled KenLM Python backend:

./kenlm/build/bin/build_binary \
    model_5gram_char_tok.arpa \
    model_5gram_char_tok.bin

Note: The pure Python ARPA backend used in the inference scripts requires the .arpa format. The .bin format is only compatible with the compiled kenlm Python package. The .arpa file hosted here works with both backends.
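For orientation, an ARPA file is a plain-text listing of log10 probabilities per n-gram, organised into per-order sections. The snippet below is a minimal sketch (not the actual pure Python backend) that extracts just the unigram section from a tiny in-memory fragment:

```python
def read_unigrams(arpa_text):
    """Parse the \\1-grams: section of ARPA text into
    {token: log10_probability}, ignoring backoff columns."""
    unigrams, in_section = {}, False
    for line in arpa_text.splitlines():
        line = line.strip()
        if line == "\\1-grams:":
            in_section = True
            continue
        if in_section:
            if line.startswith("\\") or not line:
                break  # a new section header or blank line ends the unigrams
            fields = line.split("\t")
            unigrams[fields[1]] = float(fields[0])
    return unigrams

# A made-up three-entry ARPA fragment for demonstration:
arpa = (
    "\\data\\\n"
    "ngram 1=3\n"
    "\n"
    "\\1-grams:\n"
    "-1.5\t<s>\t-0.3\n"
    "-0.9\ta\t-0.2\n"
    "-2.1\t</s>\n"
    "\n"
    "\\end\\\n"
)
probs = read_unigrams(arpa)
```

The real backend additionally reads the higher-order sections and backoff weights so it can score full 5-gram contexts.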


How to Use: KenLM-Assisted Normalisation

This model is used exclusively through the flexible inference script provided in the PaganTibet normalisation repository, paired with the tokenised S2S model. Input text must be tokenised before inference (see Step 2 above). Three of the six available inference modes make use of the KenLM ranker:

  • neural+lm – Seq2Seq beam search re-ranked by KenLM
  • neural+lm+rules – as above, with rule-based postprocessing (recommended)
  • rules+neural+lm – rule-based preprocessing, then Seq2Seq + KenLM

Recommended Mode (neural+lm+rules)

python3 tibetan-inference-flexible.py \
    --mode neural+lm+rules \
    --model_path tibetan_model_tokenized_allchars.pt \
    --kenlm_path model_5gram_char_tok.arpa \
    --lm_backend python \
    --rules_dict abbreviations.txt \
    --input_file input_tokenised.txt

Neural + KenLM Only (no rule-based processing)

python3 tibetan-inference-flexible.py \
    --mode neural+lm \
    --model_path tibetan_model_tokenized_allchars.pt \
    --kenlm_path model_5gram_char_tok.arpa \
    --lm_backend python \
    --input_file input_tokenised.txt

LM Backend Options

The inference script supports two backends for loading the KenLM model:

  • --lm_backend kenlm – Fast (~50–100 texts/sec); requires the compiled KenLM Python package
  • --lm_backend python – Slower (~5–20 texts/sec); uses a built-in pure Python ARPA reader, no installation needed; requires .arpa format
  • --lm_backend auto – Auto-detects which backend is available (default)
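The auto setting can be thought of as roughly the following check (a sketch; the script's actual detection logic may differ):

```python
import importlib.util

def pick_lm_backend(requested="auto"):
    """Resolve the --lm_backend choice: use the compiled kenlm package
    when it is importable, otherwise fall back to the pure Python
    ARPA reader."""
    if requested != "auto":
        return requested
    return "kenlm" if importlib.util.find_spec("kenlm") else "python"
```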

Key Inference Parameters

The balance between the neural model score and the KenLM score can be tuned at inference time:

  • --lm_weight (default 0.2) – weight given to the KenLM score relative to the neural score (range 0.0–1.0)
  • --beam_width (default 5) – number of beam search hypotheses to generate and rank
  • --length_penalty (default 0.6) – penalty for output length; higher values favour longer outputs

For challenging diplomatic corpora, --lm_weight 0.25 and --beam_width 7 are reasonable starting points. See the Inference ReadMe for full details and troubleshooting guidance.
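As a rough illustration of how a length penalty of this kind works (the inference script's exact formula may differ), dividing each hypothesis's log-score by its length raised to the penalty exponent means longer outputs are penalised less as the exponent grows:

```python
def length_normalised(logprob, length, penalty=0.6):
    """Length-normalise a (negative) log-score by dividing by
    length**penalty; higher `penalty` values favour longer
    hypotheses. Illustrative only."""
    return logprob / (length ** penalty)

# A short and a long candidate with made-up raw log-scores:
short = length_normalised(-10.0, 10)   # -10 / 10**0.6
long_ = length_normalised(-18.0, 25)   # -18 / 25**0.6
```

With penalty=0.0 the raw scores are compared directly, which systematically favours shorter outputs; at 1.0 scores become per-character averages.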


Installation

The KenLM Python package is optional but recommended for speed. The inference scripts also include a pure Python ARPA reader that works with the hosted .arpa file and requires no additional installation.

# Optional: install KenLM Python bindings for fastest inference
sudo apt-get install build-essential cmake libboost-all-dev
pip install https://github.com/kpu/kenlm/archive/master.zip

Related Models and Resources

All models and datasets from the PaganTibet normalisation project are collected in the Normalisation collection on Hugging Face.

  • Non-tokenised KenLM (paired with the non-tokenised S2S): pagantibet/5gram-kenLM_char
  • Tokenised S2S model: pagantibet/normalisationS2S-tokenised
  • Non-tokenised S2S model: pagantibet/normalisationS2S-nontokenised
  • Training dataset: pagantibet/normalisation-S2S-training
  • Abbreviation dictionary: pagantibet/Tibetan-abbreviation-dictionary
  • KenLM training notebook: KenLM_trainforNormalisation.ipynb
  • Inference scripts: github.com/pagantibet/normalisation/Inference
  • ACTib corpus: Zenodo (Meelen & Roux 2020)
  • PaganTibet project: pagantibet.com

License

This model is released under CC BY-NC-SA 4.0. It may be used freely for non-commercial research and educational purposes, with attribution and under the same licence terms.


Funding

This work was partially funded by the European Union (ERC, Pagan Tibet, grant no. 101097364). Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency.
