mT5-small with continual pre-training + fine-tuning for Kyrgyz text normalization

google/mt5-small continually pre-trained on a 538 MB Kyrgyz corpus (news portals + books) with T5-style span corruption, then fine-tuned on 1.67M noisy–clean text pairs for Kyrgyz text normalization.

This is the continual pre-training + fine-tuning (PT+FT) variant from the camera-ready paper "Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches" (MeLLM Workshop @ ACL 2026). For the fine-tuning-only variant see Zarinaaa/mt5-small-kyrgyz-normalization.

Choosing between the two variants: in our experiments the additional continual pre-training step did not improve over direct fine-tuning (CER 0.0825 for PT+FT vs. 0.0796 for fine-tuning only, p = 0.06). The main observable difference is a higher rate of hallucination (input repetition) in PT+FT failure cases. For most users we recommend the fine-tune-only variant, unless you specifically want the slightly better Digit–Word category performance (see Evaluation below).

Usage

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the PT+FT checkpoint and its tokenizer from the Hugging Face Hub.
model_id = "Zarinaaa/mt5-small-kyrgyz-normalization-ptft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

noisy = "барды жакшы болсун коркунучту жерлерди тазалаш керек"
# Prepend the task prefix and truncate to the 256-token training length.
inputs = tokenizer("correct: " + noisy, return_tensors="pt", truncation=True, max_length=256)
# Beam search with 4 beams, generating at most 256 new tokens.
out = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))

The prefix "correct: " is required.

Training procedure

Stage 1 — Continual pre-training

  • Corpus: 538 MB clean Kyrgyz text from news portals and books
  • Objective: T5-style span corruption (mask rate 0.15, mean span length 3)
  • Epochs: 3
  • Train/validation split: 98 / 2, seed 42; best checkpoint by validation loss
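To make the Stage 1 objective concrete, here is a toy sketch of T5-style span corruption on a whitespace-tokenized sentence. It is an illustration only, not the training code: the function name `span_corrupt` is hypothetical, spans use a fixed length equal to the mean span length instead of sampled lengths, and the `<extra_id_N>` sentinel naming follows the T5 tokenizer convention.

```python
import random


def span_corrupt(tokens, mask_rate=0.15, mean_span_len=3, seed=0):
    """Toy span corruption: returns (corrupted_input, target).

    Each masked span becomes one sentinel in the input. The target lists
    every sentinel followed by the tokens it replaced, then a final sentinel.
    Simplification (assumption): every span has length `mean_span_len`.
    """
    n = len(tokens)
    if n == 0:
        return [], []
    rng = random.Random(seed)
    budget = max(1, round(n * mask_rate))  # number of tokens to mask
    masked = [False] * n
    attempts = 0
    while budget > 0 and attempts < 1000:
        attempts += 1
        length = min(mean_span_len, budget)
        start = rng.randrange(0, n - length + 1)
        lo, hi = max(0, start - 1), min(n, start + length + 1)
        if any(masked[lo:hi]):
            continue  # keep spans non-overlapping and non-adjacent
        for i in range(start, start + length):
            masked[i] = True
        budget -= length

    inputs, targets, sentinel, i = [], [], 0, 0
    while i < n:
        if masked[i]:
            tag = f"<extra_id_{sentinel}>"
            inputs.append(tag)
            targets.append(tag)
            while i < n and masked[i]:
                targets.append(tokens[i])
                i += 1
            sentinel += 1
        else:
            inputs.append(tokens[i])
            i += 1
    targets.append(f"<extra_id_{sentinel}>")  # closing sentinel
    return inputs, targets


# Placeholder tokens stand in for clean Kyrgyz text.
demo_tokens = [f"w{i}" for i in range(20)]
corrupted, target = span_corrupt(demo_tokens)
print(" ".join(corrupted))
print(" ".join(target))
```

Note that the decoder target consists of verbatim input tokens, which is the property the copy-bias hypothesis in the failure analysis refers to.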

Stage 2 — Fine-tuning

Identical to the fine-tune-only variant:

  • Effective batch size: 64 (4 × 16 gradient accumulation)
  • Learning rate: 3e-4, cosine schedule, 500 warmup steps
  • Epochs: 5
  • Max sequence length: 256
  • Train/validation split: 95 / 5, seed 42; best checkpoint by validation loss
  • Hardware: 1× NVIDIA RTX 5080 (16 GB VRAM)
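The learning-rate settings above can be sketched as a plain function. This assumes linear warmup from zero and cosine decay to zero, which is a common default but is not confirmed by this card; `lr_at` and the `total_steps` value are hypothetical and used only for illustration.

```python
import math


def lr_at(step, total_steps, peak_lr=3e-4, warmup_steps=500):
    """Learning rate at `step`: linear warmup, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Effective batch size as listed above: 4 x 16 = 64 examples per optimizer step.
effective_batch = 4 * 16

# Hypothetical run length, chosen only to show the shape of the schedule.
total_steps = 10_000
for step in (0, 250, 500, 5_000, 10_000):
    print(step, f"{lr_at(step, total_steps):.2e}")
```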

Evaluation

Automatic metrics on the held-out 1,000-example test set:

Metric        Value
CER           0.0825 ± 0.004
WER           0.2017
Exact Match   0.184
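For readers unfamiliar with these metrics, the sketch below shows their standard per-example definitions: CER and WER as Levenshtein distance over characters and whitespace-separated words, divided by the reference length, and Exact Match as string equality. The paper's exact preprocessing and averaging scheme are not stated in this card, so treat the helper names and details as assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of `hyp` against reference `ref`."""
    return edit_distance(ref, hyp) / max(1, len(ref))


def wer(ref, hyp):
    """Word error rate over whitespace-separated tokens."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(1, len(ref_words))


def exact_match(ref, hyp):
    """1.0 if the hypothesis equals the reference exactly, else 0.0."""
    return float(ref == hyp)


reference = "Hello, world!"
hypothesis = "hello world"
print(cer(reference, hypothesis), wer(reference, hypothesis), exact_match(reference, hypothesis))
```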

Compared with the fine-tune-only variant (CER 0.0796), a paired bootstrap test gives a two-sided p = 0.06. We treat this as insufficient evidence to reject the null hypothesis of no difference, not as evidence of equivalence: n = 1,000 is underpowered for detecting small effects in either direction.
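A paired bootstrap comparison of this kind can be sketched as follows. This is one common variant (resample examples with replacement, then check how often the sign of the summed per-example difference flips); the paper's exact procedure is not given in this card, and both the function name and the synthetic scores are illustrative assumptions.

```python
import random


def paired_bootstrap_p(scores_a, scores_b, n_resamples=2000, seed=0):
    """Two-sided paired bootstrap p-value for a difference in mean score."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    at_or_below, at_or_above = 0, 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs together.
        total = sum(diffs[rng.randrange(n)] for _ in range(n))
        if total <= 0:
            at_or_below += 1
        if total >= 0:
            at_or_above += 1
    return min(1.0, 2 * min(at_or_below, at_or_above) / n_resamples)


# Synthetic per-example CER values, NOT the real evaluation data.
gen = random.Random(1)
ft_only = [max(0.0, gen.gauss(0.08, 0.05)) for _ in range(200)]
pt_ft = [max(0.0, s + gen.gauss(0.003, 0.03)) for s in ft_only]
print(paired_bootstrap_p(pt_ft, ft_only, n_resamples=500))
```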

Human evaluation (200 examples, 2 native-speaker annotators): 99.8% rated correct (Wilson 95% CI [98.6%, 99.96%]); PABAK = 0.990, Gwet's AC1 = 0.995. These figures are identical to those of the fine-tune-only variant, and both sit at the ceiling.
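The two agreement coefficients can be computed from a 2×2 table of annotator judgments, as in the sketch below. The counts used here (199 items rated correct by both annotators, 1 item rated correct by only one) are a reconstruction that happens to reproduce the reported values; they are an assumption, not numbers taken from the paper.

```python
def agreement_stats(both_yes, a_only, b_only, both_no):
    """Observed agreement, PABAK, and Gwet's AC1 for two raters and two categories."""
    n = both_yes + a_only + b_only + both_no
    po = (both_yes + both_no) / n                    # observed agreement
    pabak = 2 * po - 1                               # prevalence- and bias-adjusted kappa
    pi_yes = (2 * both_yes + a_only + b_only) / (2 * n)  # mean share of "correct" ratings
    pe = 2 * pi_yes * (1 - pi_yes)                   # AC1 chance agreement (2 categories)
    ac1 = (po - pe) / (1 - pe)
    return po, pabak, ac1


# Assumed counts: 199 joint "correct" ratings and a single disagreement.
po, pabak, ac1 = agreement_stats(both_yes=199, a_only=1, b_only=0, both_no=0)
print(f"Po={po:.3f}  PABAK={pabak:.3f}  AC1={ac1:.3f}")
```

Both coefficients are used here because ordinary Cohen's kappa becomes unstable when nearly every item falls in one category.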

Per-category CER

Category         N     FT-only CER   PT+FT CER
Punctuation      849   0.078         0.081
Capitalization   62    0.084         0.085
All-caps         39    0.084         0.083
Digit–Word       41    0.076         0.067

PT+FT is numerically slightly better on Digit–Word compounds; with N = 41 we do not treat this as a robust advantage.

Failure analysis

In 40 examples where FT outperforms PT+FT by more than 0.05 CER, hallucination (input repetition) is the dominant error mode (35/40 = 87.5%, 95% Wilson CI [74%, 95%]). Two non-exclusive hypotheses (see paper §6.1):

  1. Copy bias from span corruption: T5-style span corruption trains the decoder to reconstruct spans of the input verbatim, which may reinforce copying behavior that is harmful for normalization (where the target is usually not a superset of the input).
  2. Register mismatch: continual pre-training used clean, formal text (news and books), whereas fine-tuning targets the normalization of noisy, informal social-media text. The register gap may push the model toward fluent formal continuations that read as hallucinations.
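The Wilson interval quoted above for the 35/40 hallucination rate can be checked with a few lines of Python. This is the standard Wilson score formula with z = 1.96; the function name is ours.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


low, high = wilson_interval(35, 40)
print(f"35/40 = {35 / 40:.1%}, 95% Wilson CI [{low:.1%}, {high:.1%}]")
```

Unlike the simple normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small n and proportions near the boundary, which is why it suits a 40-example sample.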

Limitations

Same as the fine-tune-only variant, plus:

  • Higher hallucination rate in failure cases — if you need maximum robustness, use the FT-only variant.
  • No measurable benefit from the additional pre-training at this scale and corpus composition; results suggest a more targeted continual objective (in-domain noisy text, denoising closer to the normalization target) would be needed.

Citation

@inproceedings{uvalieva2026kyrgyz,
  title={Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches},
  author={Uvalieva, Zarina and Kumarbai uulu, Bektemir and Metinov, Adilet and Tashbaltaev, Tynchtykbek and Alibekov, Nurtilek},
  booktitle={Proceedings of the MeLLM Workshop at ACL 2026},
  year={2026}
}

License

MIT. Code: github.com/Zarina33/Kyrgyz-Text-Normalization-Conference.
