# Tibetan Normalisation - S2S Model (Tokenised)

A character-level sequence-to-sequence (S2S) encoder-decoder transformer model for the normalisation of Old/Classical Tibetan, converting diplomatic (non-standard, abbreviated) Tibetan manuscript text into Standard Classical Tibetan. This is the **tokenised** variant of the model — input and output have been pre-segmented into tokens using a customised version of the [Botok Tibetan tokeniser](https://www.github.com/OpenPecha/botok) prior to training (see [Data Preparation](https://github.com/pagantibet/normalisation/tree/main/Data_Preparation)).

**Important**: Results from Meelen & Griffiths (2026) indicate that for most use cases, normalisation performs better when applied to **non-tokenised** text. Tokenisation is best deferred until *after* normalisation in the processing pipeline. For general use, the non-tokenised model [`pagantibet/normalisationS2S-nontokenised`](https://huggingface.co/pagantibet/normalisationS2S-nontokenised) is therefore recommended. The tokenised model is provided for research purposes and for direct comparison of the two approaches.

This model is part of the [PaganTibet](https://www.pagantibet.com/) project and accompanies the paper:

Meelen, M. & Griffiths, R.M. (2026) 'Historical Tibetan Normalisation: rule-based vs neural & n-gram LM methods for extremely low-resource languages' in *Proceedings of the AI4CHIEF conference*, Springer.

Please cite the paper and the [code repository](https://github.com/pagantibet/normalisation) when using this model.

---

## Model Overview

Old/Classical Tibetan manuscripts present major normalisation challenges: extensive abbreviations, non-standard orthography, scribal variation, and a near-complete absence of gold-standard parallel data. This model addresses these challenges using a hybrid approach combining a neural sequence-to-sequence transformer with optional rule-based pre-/post-processing and KenLM n-gram language model ranking (the latter applied at inference time; see the [Inference scripts](https://github.com/pagantibet/normalisation/tree/main/Inference)).

The model operates at the character level on tokenised input — that is, the source text has been segmented into Tibetan word tokens using a customised version of [Botok](https://www.github.com/OpenPecha/botok) before being passed to the model (see [Data Preparation](https://github.com/pagantibet/normalisation/tree/main/Data_Preparation)). Both source (diplomatic) and target (normalised) sequences in the training data were tokenised in this way. At inference time, input text must likewise be tokenised using the same tool before being fed to this model.
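The expected input format can be illustrated with a minimal sketch. This is *not* the project's customised Botok (which performs dictionary-based word tokenisation); as a stand-in it simply splits on the tsheg (U+0F0B) syllable delimiter, and the space-joined token format is an assumption for illustration, not necessarily the exact format used in training:

```python
# Minimal sketch of the pre-segmentation step (illustrative only, NOT the
# project's customised Botok). As a stand-in for word tokenisation, we split
# on the tsheg (U+0F0B) syllable delimiter; real Botok output differs.

def naive_segment(text: str, sep: str = "\u0f0b") -> list[str]:
    """Split Tibetan text into syllables on the tsheg delimiter."""
    return [s for s in text.split(sep) if s]

def to_model_input(tokens: list[str]) -> str:
    # Assumed format: tokens re-joined with an explicit separator before
    # the model's character-level encoding is applied.
    return " ".join(tokens)

tokens = naive_segment("བཀྲ་ཤིས་བདེ་ལེགས")  # ['བཀྲ', 'ཤིས', 'བདེ', 'ལེགས']
print(to_model_input(tokens))
```

In the actual pipeline this step is performed by `botokenise_src-tgt.py` (see Usage below), which wraps the customised Botok tokeniser.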
### Architecture
This model is intended for:

- **Research** comparing tokenised and non-tokenised approaches to Classical Tibetan normalisation, as described in Meelen & Griffiths (2026).
- **Normalisation of diplomatic Old/Classical Tibetan texts** in workflows where tokenised input is already available or required.
- **Digital humanities** work on historical Tibetan manuscripts, particularly when studying the interaction between tokenisation and normalisation.

**Note on pipeline order**: Results in Meelen & Griffiths (2026) show that tokenisation is best left until *after* normalisation in the processing pipeline. For most use cases, the non-tokenised model [`pagantibet/normalisationS2S-nontokenised`](https://huggingface.co/pagantibet/normalisationS2S-nontokenised) is recommended. For particularly challenging diplomatic corpora, combining either model with the KenLM n-gram ranker and rule-based pre/post-processing (see [Inference](https://github.com/pagantibet/normalisation/tree/main/Inference)) yields the best results.

---
 
Input text must first be tokenised using a customised version of the [Botok Tibetan tokeniser](https://www.github.com/OpenPecha/botok):

```bash
python3 botokenise_src-tgt.py
```

See the [customised Botok ReadMe](https://github.com/pagantibet/normalisation/blob/main/Data_Preparation/botokenise_ReadMe.md) for full tokenisation details.

The model can then be used with the inference scripts provided in the [PaganTibet normalisation repository](https://github.com/pagantibet/normalisation/tree/main/Inference). Six inference modes are available, ranging from rule-based only to combined neural + n-gram + rule-based pipelines; each mode is documented in the [Inference directory](https://github.com/pagantibet/normalisation/tree/main/Inference).
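The n-gram ranking idea behind the combined pipelines can be sketched without KenLM: train a small language model on normalised text, then keep whichever candidate normalisation it scores highest. The following is a hypothetical pure-Python illustration of the ranking principle using a character-bigram LM, not the project's KenLM implementation:

```python
import math
from collections import Counter

# Illustrative stand-in for the KenLM ranking step (hypothetical code, not
# the project's pipeline): score candidate normalisations with an add-one
# smoothed character-bigram LM trained on normalised text, keep the best.

def train_bigram_lm(corpus: list[str]):
    bigrams, unigrams = Counter(), Counter()
    for line in corpus:
        chars = ["<s>"] + list(line)
        unigrams.update(chars)
        bigrams.update(zip(chars, chars[1:]))
    return bigrams, unigrams

def score(text: str, lm, vocab_size: int = 100) -> float:
    """Smoothed log-probability of the character sequence under the LM."""
    bigrams, unigrams = lm
    chars = ["<s>"] + list(text)
    return sum(
        math.log((bigrams[(a, b)] + 1) / (unigrams[a] + vocab_size))
        for a, b in zip(chars, chars[1:])
    )

def rank(candidates: list[str], lm) -> str:
    return max(candidates, key=lambda c: score(c, lm))

lm = train_bigram_lm(["abc abc", "abd abc", "abc abd"])
best = rank(["abz", "abc"], lm)  # 'abc' is far more probable under this LM
```

KenLM plays the same role with a higher-order model and proper back-off smoothing; the candidates to be ranked come from the neural model's beam.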
 
## Evaluation

The training script includes a built-in beam search evaluation. Separate evaluation is available via the evaluation scripts, which report:

- **CER** (Character Error Rate)
- **Precision, Recall, F1**
- **Correction Precision (CP) and Correction Recall (CR)** (following [Huang et al. 2023](https://www.isca-archive.org/sigul_2023/huang23_sigul.html)) for a more accurate picture of normalisation effectiveness
- **Bootstrapped Confidence Intervals** (1,000 iterations) for small test sets (optional)
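As a reference point, CER is the character-level Levenshtein (edit) distance between hypothesis and reference, normalised by reference length. A minimal sketch of the metric (illustrative only; not the project's `evaluate_model.py`):

```python
# Minimal CER sketch: Levenshtein edit distance between hypothesis and
# reference, divided by the reference length, computed at character level.

def edit_distance(a: str, b: str) -> int:
    """Row-by-row dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    return edit_distance(hypothesis, reference) / max(len(reference), 1)

print(cer("abcd", "abed"))  # one substitution over 4 chars -> 0.25
```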
Two versions of the evaluation script are available:

- `evaluate_model.py` — the standard script
- `evaluate-model-withCIs.py` — an extended version that additionally computes 95% bootstrap confidence intervals (CIs) for all metrics
```bash
sbatch evaluate-model.sh
# or
python3 evaluate_model.py
# or
python3 evaluate-model-withCIs.py
```

Full evaluation results, including confidence intervals and example predictions, are available in the [tokenised Evaluations directory](https://github.com/pagantibet/normalisation/tree/main/Evaluations/Gold-tokenised-CI) of the repository.
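The bootstrapped intervals amount to a percentile bootstrap over per-sentence scores: resample the scores with replacement many times and take percentiles of the resampled means. A hypothetical illustration of the technique (not the code in `evaluate-model-withCIs.py`):

```python
import random

# Illustrative percentile-bootstrap sketch (hypothetical; not the project's
# evaluate-model-withCIs.py): resample per-sentence scores with replacement
# and take percentiles of the resampled means as a 95% interval.

def bootstrap_ci(scores, iterations=1000, alpha=0.05, seed=0):
    rng = random.Random(seed)  # fixed seed for reproducibility
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(iterations)
    )
    lo = means[int((alpha / 2) * iterations)]
    hi = means[int((1 - alpha / 2) * iterations) - 1]
    return lo, hi

# Hypothetical per-sentence CER values for a small test set:
low, high = bootstrap_ci([0.10, 0.12, 0.08, 0.15, 0.11])
```

With small test sets the interval width makes clear how much the point estimates can be trusted, which is why the CI variant is recommended there.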