# Tibetan Normalisation - S2S Model (Tokenised)
A character-level sequence-to-sequence (S2S) encoder-decoder transformer model for the normalisation of Old/Classical Tibetan, converting diplomatic (non-standard, abbreviated) Tibetan manuscript text into Standard Classical Tibetan. This is the **tokenised** variant of the model — input and output have been pre-segmented into tokens using a customised version of the [Botok Tibetan tokeniser](https://www.github.com/OpenPecha/botok) prior to training (see [Data Preparation](https://github.com/pagantibet/normalisation/tree/main/Data_Preparation)).
**Important**: Results from Meelen & Griffiths (2026) indicate that for most use cases, normalisation performs better when applied to **non-tokenised** text. Tokenisation is best deferred until *after* normalisation in the processing pipeline. For general use, the non-tokenised model [`pagantibet/normalisationS2S-nontokenised`](https://huggingface.co/pagantibet/normalisationS2S-nontokenised) is therefore recommended. The tokenised model is provided for research purposes and for direct comparison of the two approaches.
This model is part of the [PaganTibet](https://www.pagantibet.com/) project and accompanies the paper:
Meelen, M. & Griffiths, R.M. (2026) 'Historical Tibetan Normalisation: rule-based vs neural & n-gram LM methods for extremely low-resource languages' in *Proceedings of the AI4CHIEF conference*, Springer.
Please cite the paper and the [code repository](https://github.com/pagantibet/normalisation) when using this model.
---
## Model Overview
Old/Classical Tibetan manuscripts present major normalisation challenges: extensive abbreviations, non-standard orthography, scribal variation, and a near-complete absence of gold-standard parallel data. This model addresses these challenges using a hybrid approach combining a neural sequence-to-sequence transformer with optional rule-based pre-/post-processing and KenLM n-gram language model ranking (the latter applied at inference time; see the [Inference scripts](https://github.com/pagantibet/normalisation/tree/main/Inference)).
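As a concrete illustration, the hybrid architecture can be sketched as a four-stage pipeline. Everything below is a stand-in: the rule table, the beam-search stub, and the LM score are hypothetical placeholders for the project's actual components.

```python
# Sketch of the hybrid pipeline: rule-based pre-processing, a neural
# S2S step (stubbed), n-gram LM ranking (stubbed), and rule-based
# post-processing. Rules and scores are illustrative placeholders only.

# Toy rule table expanding two invented manuscript abbreviations.
PRE_RULES = {"thaMd": "thams cad", "rnaMs": "rnams"}

def apply_rules(text: str) -> str:
    """Rule-based pre/post-processing: longest-match replacement."""
    for src, tgt in sorted(PRE_RULES.items(), key=lambda kv: -len(kv[0])):
        text = text.replace(src, tgt)
    return text

def neural_candidates(text: str, beams: int = 3) -> list[str]:
    """Stub for the S2S transformer's beam-search hypotheses."""
    return [text] * beams  # a real model returns distinct hypotheses

def lm_score(text: str) -> float:
    """Stub for a KenLM n-gram log-probability."""
    return -len(text)

def normalise(text: str) -> str:
    pre = apply_rules(text)          # 1. rule-based pre-processing
    cands = neural_candidates(pre)   # 2. neural beam hypotheses
    best = max(cands, key=lm_score)  # 3. n-gram LM ranking
    return apply_rules(best)         # 4. rule-based post-processing
```

Under the toy rule table, `normalise("thaMd kyi")` yields `"thams cad kyi"`.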
The model operates at the character level on tokenised input — that is, the source text has been segmented into Tibetan word tokens using a customised version of [Botok](https://www.github.com/OpenPecha/botok) before being passed to the model (see [Data Preparation](https://github.com/pagantibet/normalisation/tree/main/Data_Preparation)). Both source (diplomatic) and target (normalised) sequences in the training data were tokenised in this way. At inference time, input text must likewise be tokenised using the same tool before being fed to this model.
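Concretely, "character level on tokenised input" means each pre-segmented line is unrolled into individual characters, with token boundaries kept as an explicit symbol. A minimal sketch (the boundary symbol and the Wylie example are hypothetical, not this model's actual vocabulary):

```python
# Sketch: turning a whitespace-tokenised line into the character
# sequence a character-level S2S model consumes. The boundary symbol
# is an illustrative placeholder, not this model's actual vocab item.
BOUNDARY = "<b>"  # marks a token boundary between characters

def to_char_sequence(tokenised_line: str) -> list[str]:
    chars = []
    for i, token in enumerate(tokenised_line.split()):
        if i > 0:
            chars.append(BOUNDARY)
        chars.extend(token)  # a str iterates over its characters
    return chars
```

For example, `to_char_sequence("bcom ldan")` returns `['b', 'c', 'o', 'm', '<b>', 'l', 'd', 'a', 'n']`.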
### Architecture
This model is intended for:
- **Research** comparing tokenised and non-tokenised approaches to Classical Tibetan normalisation, as described in Meelen & Griffiths (2026).
- **Normalisation of diplomatic Old/Classical Tibetan texts** in workflows where tokenised input is already available or required.
- **Digital humanities** work on historical Tibetan manuscripts, particularly when studying the interaction between tokenisation and normalisation.
**Note on pipeline order**: Results in Meelen & Griffiths (2026) show that tokenisation is best left until *after* normalisation in the processing pipeline. For most use cases, the non-tokenised model [`pagantibet/normalisationS2S-nontokenised`](https://huggingface.co/pagantibet/normalisationS2S-nontokenised) is recommended. For particularly challenging diplomatic corpora, combining either model with the KenLM n-gram ranker and rule-based pre/post-processing (see [Inference](https://github.com/pagantibet/normalisation/tree/main/Inference)) yields the best results.
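The n-gram ranking step can be illustrated with a tiny character-bigram LM standing in for KenLM (the training lines, add-one smoothing, and length normalisation below are illustrative, not the repository's actual setup):

```python
# Toy character-bigram LM used to re-rank beam hypotheses; a stand-in
# for the KenLM ranker used in the repository's inference scripts.
import math
from collections import Counter

def train_bigram_lm(corpus: list[str]) -> dict:
    """Add-one-smoothed character-bigram counts from normalised text."""
    bigrams, unigrams = Counter(), Counter()
    for line in corpus:
        padded = "^" + line + "$"       # sentence boundary markers
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded, padded[1:]))
    return {"bi": bigrams, "uni": unigrams}

def avg_logprob(lm: dict, text: str) -> float:
    """Length-normalised log-probability, so candidates of
    different lengths remain comparable."""
    padded = "^" + text + "$"
    vocab = len(lm["uni"]) + 1
    logps = [
        math.log((lm["bi"][(a, b)] + 1) / (lm["uni"][a] + vocab))
        for a, b in zip(padded, padded[1:])
    ]
    return sum(logps) / len(logps)

def rank(lm: dict, candidates: list[str]) -> str:
    """Pick the hypothesis the n-gram LM prefers."""
    return max(candidates, key=lambda c: avg_logprob(lm, c))
```

Trained on normalised text, the ranker prefers candidates whose character sequences look like Standard Classical Tibetan.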
---

Input text must first be tokenised using a customised version of the [Botok Tibetan tokeniser](https://www.github.com/OpenPecha/botok):

```bash
python3 botokenise_src-tgt.py
```
See the [Custom Botok ReadMe](https://github.com/pagantibet/normalisation/blob/main/Data_Preparation/botokenise_ReadMe.md) for full tokenisation details.
The model can then be used with the inference scripts provided in the [PaganTibet normalisation repository](https://github.com/pagantibet/normalisation/tree/main/Inference). Six inference modes are available, ranging from rule-based only to combined neural + n-gram + rule-based pipelines:
## Evaluation
The training script includes a built-in beam search evaluation. Separate evaluation is available via the evaluation scripts, which report:
- **CER** (Character Error Rate)
- **Precision, Recall, F1**
- **Correction Precision (CP) and Correction Recall (CR)** (following [Huang et al. 2023](https://www.isca-archive.org/sigul_2023/huang23_sigul.html)) for a more accurate picture of normalisation effectiveness
- **Bootstrapped Confidence Intervals** (1,000 iterations) for small test sets (optional)
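For orientation, CER and the correction metrics can be sketched as follows. The CP/CR computation is one simplified token-level reading of Huang et al. (2023); the evaluation scripts' exact definitions may differ.

```python
# Sketch of CER and Correction Precision/Recall. Simplified
# illustrations only; the repository's scripts may define these
# metrics differently (especially CP/CR alignment).

def edit_distance(a: str, b: str) -> int:
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    """Character Error Rate: edits needed / reference length
    (reference assumed non-empty)."""
    return edit_distance(hypothesis, reference) / len(reference)

def correction_pr(src: list[str], hyp: list[str], ref: list[str]):
    """CP = correct edits / edits made; CR = correct edits / edits
    needed, over position-aligned tokens (a simplification)."""
    made = [i for i, (s, h) in enumerate(zip(src, hyp)) if s != h]
    needed = [i for i, (s, r) in enumerate(zip(src, ref)) if s != r]
    correct = [i for i in made if hyp[i] == ref[i]]
    cp = len(correct) / len(made) if made else 1.0
    cr = len(correct) / len(needed) if needed else 1.0
    return cp, cr
```

Unlike plain precision/recall, CP and CR only credit tokens the system actually changed, which is why they give a sharper picture of normalisation effectiveness.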
Two versions of the evaluation script are available:
- `evaluate_model.py` — the standard script
- `evaluate-model-withCIs.py` — an extended version that additionally computes 95% bootstrap confidence intervals (CIs) for all metrics
```bash
sbatch evaluate-model.sh
# or
python3 evaluate_model.py
# or
python3 evaluate-model-withCIs.py
```
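The bootstrapped intervals reported for small test sets can be sketched as a percentile bootstrap over per-sentence scores (the resampling details here are illustrative and may differ from what `evaluate-model-withCIs.py` actually does):

```python
# Percentile-bootstrap 95% CI for the mean of per-sentence scores
# (e.g. CERs) on a small test set. Illustrative sketch only.
import random

def bootstrap_ci(scores, iters=1000, alpha=0.05, seed=0):
    """Resample the score list with replacement `iters` times and
    take the (alpha/2, 1 - alpha/2) percentiles of the means."""
    rng = random.Random(seed)   # fixed seed for reproducibility
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(iters)
    )
    lo = means[int(iters * alpha / 2)]
    hi = means[int(iters * (1 - alpha / 2)) - 1]
    return lo, hi
```

With 1,000 resamples the interval stabilises well enough for the small gold test sets this model is evaluated on.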
Full evaluation results including confidence intervals and example predictions are available in the [tokenised Evaluations directory](https://github.com/pagantibet/normalisation/tree/main/Evaluations/Gold-tokenised-CI) of the repository.