Tibetan Normalisation - 5-gram KenLM Language Model (Tokenised)
A character-level 5-gram KenLM language model for Standard Classical Tibetan, trained on a large split of the ACTib corpus (Meelen & Roux 2020) that was pre-segmented using a customised version of the Botok Tibetan tokeniser (see Data Preparation). This model is not a standalone normaliser: it is designed to act as a re-ranker for beam search candidates produced by the pagantibet/normalisationS2S-tokenised sequence-to-sequence (S2S) model during Tibetan text normalisation.
By scoring multiple beam search hypotheses from the S2S model, the KenLM selects the output most consistent with fluent tokenised Standard Classical Tibetan, improving normalisation quality when working in a tokenised pipeline. For the non-tokenised equivalent (recommended for most workflows), see pagantibet/5gram-kenLM_char.
This model is part of the PaganTibet project and accompanies the paper:
Meelen, M. & Griffiths, R.M. (2026) 'Historical Tibetan Normalisation: rule-based vs neural & n-gram LM methods for extremely low-resource languages' in Proceedings of the AI4CHIEF conference, Springer.
Please cite the paper and the code repository when using this model.
Model Overview
KenLM is an efficient n-gram language model toolkit that estimates the probability of a sequence of tokens given its preceding context, using smoothed n-gram statistics. This model operates at the character level on tokenised Tibetan Unicode text, i.e. text that has been pre-segmented into tokens using a customised version of the Botok tokeniser before character-level processing. It assigns log-probabilities to character sequences and thereby scores the fluency of candidate normalisation outputs within a tokenised pipeline.
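At its core, an n-gram model derives such probabilities from corpus counts. The following is a minimal maximum-likelihood sketch for character bigrams; KenLM's 5-gram estimates additionally apply modified Kneser-Ney smoothing and pruning, and the names `char_ngrams` and `p_mle` are illustrative, not part of the toolkit:

```python
from collections import Counter

def char_ngrams(line, n):
    """All n-grams over a space-separated character line."""
    chars = line.split()
    return [tuple(chars[i:i + n]) for i in range(len(chars) - n + 1)]

# A single space-separated character line, as fed to KenLM:
corpus = "བ ོ ད ་ ཡ ི ག"
unigrams = Counter(char_ngrams(corpus, 1))
bigrams = Counter(char_ngrams(corpus, 2))

def p_mle(prev, nxt):
    """Maximum-likelihood P(nxt | prev) = count(prev, nxt) / count(prev)."""
    return bigrams[(prev, nxt)] / unigrams[(prev,)]

print(p_mle("བ", "ོ"))  # 1.0: in this toy corpus, བ is always followed by ོ
```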
At inference time, the tokenised S2S model generates multiple beam search hypotheses. The KenLM scores each hypothesis, and the final normalised output is selected by combining the neural model score with the KenLM score, preferring whichever candidate is most consistent with standard tokenised written Tibetan.
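The score combination described above can be pictured as a simple weighted re-ranking. The tuple layout, weight, and function name below are illustrative assumptions, not the inference script's actual interface:

```python
def rerank(hypotheses, lm_weight=0.2):
    """Pick the hypothesis maximising neural score + lm_weight * KenLM score.

    `hypotheses` holds (text, neural_logprob, kenlm_logprob) tuples; this is
    an illustrative sketch, not the PaganTibet script's actual interface.
    """
    return max(hypotheses, key=lambda h: h[1] + lm_weight * h[2])[0]

# Hypothetical log-probabilities for three beam candidates:
beams = [
    ("candidate_a", -4.0, -30.0),  # best neural score, poor LM fluency
    ("candidate_b", -4.5, -12.0),  # slightly worse neural, far better LM
    ("candidate_c", -6.0, -10.0),
]
print(rerank(beams))  # candidate_b wins once the LM score is factored in
```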
Pipeline note: Results from Meelen & Griffiths (2026) show that tokenisation is best applied after normalisation for most workflows. This tokenised KenLM is provided as the paired ranker for the tokenised S2S model and for experimental comparison. For new pipelines, the non-tokenised models (pagantibet/normalisationS2S-nontokenised + pagantibet/5gram-kenLM_char) are recommended.
Model Specifications
- Type: Character-level n-gram language model (KenLM)
- Order: 5-gram
- Smoothing: Modified Kneser-Ney discounting
- Pruning thresholds: 0 1 1 2 2
- Format: ARPA (.arpa)
- Script: Tokenised Tibetan Unicode characters
- Training environment: Google Colab (host a290405c6e16, Linux 6.6.105+)
- Training time: <5 minutes
Full parameter settings are reported in the Appendix of Meelen & Griffiths (2026).
Training Data
The model was trained on an 8-million-line tokenised split of the ACTib corpus (Standard Classical Tibetan; >180 million words total), available from Zenodo (Meelen & Roux 2020).
Before training, the ACTib corpus was cleaned and split into lines of varying, manuscript-like length using the createTiblines.py script (see Data_Preparation), which removes non-Tibetan content (e.g. page numbers) and introduces artificial linebreaks to match the sequence lengths encountered during normalisation. Lines were then tokenised using a customised version of the Botok Tibetan tokeniser via the botokenise_src-tgt.py script, before being converted to space-separated character sequences for KenLM training.
Training the Model
The training notebook KenLM_trainforNormalisation.ipynb walks through the full training process and can be run on Google Colab. The key steps are as follows:
Step 1 – Install KenLM
sudo apt-get install -y build-essential cmake libboost-all-dev \
libboost-program-options-dev libboost-system-dev \
libboost-thread-dev libboost-test-dev libeigen3-dev zlib1g-dev
git clone https://github.com/kpu/kenlm.git
cd kenlm && mkdir build && cd build
cmake .. && make -j 4
Step 2 – Tokenise the Training Text
Before character-level conversion, tokenise the cleaned ACTib lines using the Botok-based script:
python3 botokenise_src-tgt.py
See the tokenisation ReadMe for full usage details.
Step 3 – Convert to Space-Separated Characters
KenLM requires space-separated tokens. For character-level modelling, each character must be separated by a space. This conversion can be done in Python:
with open("actib_8m_lines_tokenised.txt", encoding="utf-8") as f_in, \
     open("actib_char_tok.txt", "w", encoding="utf-8") as f_out:
    for line in f_in:
        f_out.write(" ".join(line.strip()) + "\n")
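Note that `str.join` iterates per Unicode code point, so an existing token-boundary space is itself treated as a character and ends up flanked by inserted spaces. The triple-space gap this produces is presumably how token boundaries remain distinguishable from ordinary character boundaries in the character stream:

```python
line = "བོད་ ཡིག"  # hypothetical tokenised input: two Botok tokens
spaced = " ".join(line.strip())
# The original token-boundary space becomes three spaces in the output,
# so token gaps stay distinguishable from single-character gaps.
print(spaced)
```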
Step 4 – Train the 5-gram KenLM
Using the KenLM lmplz binary with the pruning thresholds and modified Kneser-Ney smoothing used in Meelen & Griffiths (2026):
./kenlm/build/bin/lmplz \
-o 5 \
--prune 0 1 1 2 2 \
--discount_fallback \
< actib_char_tok.txt \
> model_5gram_char_tok.arpa
Parameter notes:
- `-o 5` – 5-gram order
- `--prune 0 1 1 2 2` – pruning thresholds per n-gram order (unigrams kept, higher-order singletons pruned)
- `--discount_fallback` – enables modified Kneser-Ney discounting with fallback, recommended for robustness on this data
Training completes in under 5 minutes on Google Colab. Full parameter settings are reported in the Appendix of Meelen & Griffiths (2026).
Step 5 – (Optional) Convert to Binary Format
For faster loading with the compiled KenLM Python backend:
./kenlm/build/bin/build_binary \
model_5gram_char_tok.arpa \
model_5gram_char_tok.bin
Note: The pure Python ARPA backend used in the inference scripts requires the .arpa format. The .bin format is only compatible with the compiled kenlm Python package. The .arpa file hosted here works with both backends.
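To illustrate what a pure Python ARPA backend has to do, the sketch below parses only the unigram section of a tiny inline ARPA fragment and sums unigram log10-probabilities. The real reader handles all five orders plus backoff weights; every name here is illustrative, not the script's actual API:

```python
# A tiny hand-written ARPA fragment (tab-separated: logprob, token[, backoff]).
ARPA = """\\data\\
ngram 1=4

\\1-grams:
-0.30103\t<s>\t0
-0.60206\t</s>
-0.47712\tA\t-0.30103
-0.77815\tB

\\end\\
"""

def load_unigrams(arpa_text):
    """Parse the \\1-grams: section into {token: log10 probability}."""
    probs, in_unigrams = {}, False
    for raw in arpa_text.splitlines():
        line = raw.strip()
        if line == "\\1-grams:":
            in_unigrams = True
        elif in_unigrams and line.startswith("\\"):
            break  # reached the next section (or \\end\\)
        elif in_unigrams and line:
            fields = line.split("\t")
            probs[fields[1]] = float(fields[0])
    return probs

def unigram_score(chars, probs, oov=-7.0):
    """Sum of unigram log10 probs for a space-separated character string."""
    return sum(probs.get(c, oov) for c in chars.split())

uni = load_unigrams(ARPA)
print(unigram_score("A B", uni))  # sum of the two unigram log-probs
```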
How to Use: KenLM-Assisted Normalisation
This model is used exclusively through the flexible inference script provided in the PaganTibet normalisation repository, paired with the tokenised S2S model. Input text must be tokenised before inference (see Step 2 above). Three of the six available inference modes make use of the KenLM ranker:
| Mode | Description |
|---|---|
| `neural+lm` | Seq2Seq beam search re-ranked by KenLM |
| `neural+lm+rules` | As above, with rule-based postprocessing (recommended) |
| `rules+neural+lm` | Rule-based preprocessing, then Seq2Seq + KenLM |
Recommended Mode (neural+lm+rules)
python3 tibetan-inference-flexible.py \
--mode neural+lm+rules \
--model_path tibetan_model_tokenized_allchars.pt \
--kenlm_path model_5gram_char_tok.arpa \
--lm_backend python \
--rules_dict abbreviations.txt \
--input_file input_tokenised.txt
Neural + KenLM Only (no rule-based processing)
python3 tibetan-inference-flexible.py \
--mode neural+lm \
--model_path tibetan_model_tokenized_allchars.pt \
--kenlm_path model_5gram_char_tok.arpa \
--lm_backend python \
--input_file input_tokenised.txt
LM Backend Options
The inference script supports two backends for loading the KenLM model:
- `--lm_backend kenlm` – fast (~50–100 texts/sec); requires the compiled KenLM Python package
- `--lm_backend python` – slower (~5–20 texts/sec); uses a built-in pure Python ARPA reader, no installation needed; requires `.arpa` format
- `--lm_backend auto` – auto-detects which backend is available (default)
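The auto-detection can be pictured as a simple import probe. This is a sketch of the behaviour described above; the script's actual logic may differ:

```python
def choose_backend(requested="auto"):
    """Resolve --lm_backend: compiled bindings if importable, else pure Python."""
    if requested in ("kenlm", "auto"):
        try:
            import kenlm  # compiled bindings: fastest option
            return "kenlm"
        except ImportError:
            if requested == "kenlm":
                raise  # explicitly requested but not installed
    return "python"  # built-in pure Python ARPA reader

print(choose_backend())
```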
Key Inference Parameters
The balance between the neural model score and the KenLM score can be tuned at inference time:
| Parameter | Default | Description |
|---|---|---|
| `--lm_weight` | 0.2 | Weight given to KenLM score relative to neural score (range 0.0–1.0) |
| `--beam_width` | 5 | Number of beam search hypotheses to generate and rank |
| `--length_penalty` | 0.6 | Penalty for output length; higher values favour longer outputs |
For challenging diplomatic corpora, --lm_weight 0.25 and --beam_width 7 are reasonable starting points. See the Inference ReadMe for full details and troubleshooting guidance.
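One common way such a length penalty is applied is GNMT-style length normalisation. Whether the PaganTibet script uses exactly this formulation is an assumption here, but it illustrates why higher values favour longer outputs:

```python
def length_normalised(logprob, length, alpha=0.6):
    """Divide a (negative) log-probability by a length penalty term.

    GNMT-style: lp = ((5 + length) / 6) ** alpha. A larger alpha divides
    longer hypotheses' negative scores by more, penalising them less.
    (Sketch only; the script's exact --length_penalty semantics may differ.)
    """
    return logprob / (((5 + length) / 6) ** alpha)

# Same total log-prob, different lengths: normalisation favours the longer one.
print(length_normalised(-10.0, 20) > length_normalised(-10.0, 5))  # True
```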
Installation
The KenLM Python package is optional but recommended for speed. The inference scripts also include a pure Python ARPA reader that works with the hosted .arpa file and requires no additional installation.
# Optional: install KenLM Python bindings for fastest inference
sudo apt-get install build-essential cmake libboost-all-dev
pip install https://github.com/kpu/kenlm/archive/master.zip
Related Models and Resources
All models and datasets from the PaganTibet normalisation project are collected in the Normalisation collection on Hugging Face.
| Resource | Link |
|---|---|
| Non-tokenised KenLM (paired with non-tokenised S2S) | pagantibet/5gram-kenLM_char |
| Tokenised S2S model | pagantibet/normalisationS2S-tokenised |
| Non-tokenised S2S model | pagantibet/normalisationS2S-nontokenised |
| Training dataset | pagantibet/normalisation-S2S-training |
| Abbreviation dictionary | pagantibet/Tibetan-abbreviation-dictionary |
| KenLM training notebook | KenLM_trainforNormalisation.ipynb |
| Inference scripts | github.com/pagantibet/normalisation/Inference |
| ACTib corpus | Zenodo (Meelen & Roux 2020) |
| PaganTibet project | pagantibet.com |
License
This model is released under CC BY-NC-SA 4.0. It may be used freely for non-commercial research and educational purposes, with attribution and under the same licence terms.
Funding
This work was partially funded by the European Union (ERC, Pagan Tibet, grant no. 101097364). Views and opinions expressed are those of the authors only and do not necessarily reflect those of the European Union or the European Research Council Executive Agency.