---
language:
- ky
license: mit
library_name: transformers
pipeline_tag: text-generation
base_model: google/mt5-small
tags:
- mt5
- text-normalization
- kyrgyz
- low-resource
- turkic
- continual-pretraining
datasets:
- Zarinaaa/kyrgyz-text-normalization
metrics:
- cer
- wer
- exact_match
---
# mT5-small with continual pre-training + fine-tuning for Kyrgyz text normalization

`google/mt5-small` continually pre-trained on a 538 MB Kyrgyz corpus (news portals + books) with T5-style span corruption, then fine-tuned on 1.67M noisy–clean text pairs for Kyrgyz text normalization.

This is the **continual pre-training + fine-tuning** (PT+FT) variant from the camera-ready paper *"Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches"* (MeLLM Workshop @ ACL 2026). For the fine-tuning-only variant, see [Zarinaaa/mt5-small-kyrgyz-normalization](https://huggingface.co/Zarinaaa/mt5-small-kyrgyz-normalization).

**Note on choosing between the two variants:** in our experiments the additional continual pre-training step did **not** improve over direct fine-tuning (CER 0.0825 vs. 0.0796 for fine-tuning only, p = 0.06). The main observable difference is a higher rate of hallucination (input repetition) in failure cases. For most users we recommend the fine-tune-only variant, unless you specifically want the numerically lower Digit–Word CER (0.067 vs. 0.076 on N = 41, which we do not treat as a robust advantage; see Evaluation below).
## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load the tokenizer and the seq2seq model from the Hub.
model_id = "Zarinaaa/mt5-small-kyrgyz-normalization-ptft"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

noisy = "барды жакшы болсун коркунучту жерлерди тазалаш керек"

# The "correct: " task prefix must precede the input text.
inputs = tokenizer("correct: " + noisy, return_tensors="pt", truncation=True, max_length=256)
out = model.generate(**inputs, max_new_tokens=256, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The prefix `"correct: "` is required.
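
For more than one sentence at a time, the snippet above can be wrapped in a small batching helper. This is a minimal sketch, not part of the released code: `add_prefix` and `normalize_batch` are hypothetical names, `batch_size=16` is an arbitrary choice, and the generation settings simply mirror the single-sentence example.

```python
from typing import Iterable, List

PREFIX = "correct: "  # task prefix required by this model card


def add_prefix(texts: Iterable[str]) -> List[str]:
    """Prepend the task prefix to each (whitespace-stripped) input string."""
    return [PREFIX + t.strip() for t in texts]


def normalize_batch(texts: List[str], model, tokenizer, batch_size: int = 16) -> List[str]:
    """Normalize texts in mini-batches (hypothetical helper; batch_size is arbitrary)."""
    outputs: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = add_prefix(texts[start:start + batch_size])
        # Pad to the longest item in the batch; truncate at 256 tokens,
        # the maximum sequence length used for fine-tuning.
        enc = tokenizer(batch, return_tensors="pt", padding=True,
                        truncation=True, max_length=256)
        out = model.generate(**enc, max_new_tokens=256, num_beams=4)
        outputs.extend(tokenizer.batch_decode(out, skip_special_tokens=True))
    return outputs
```

With `model` and `tokenizer` loaded as in the Usage example, a call would look like `normalize_batch(sentences, model, tokenizer)`.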
## Training procedure

### Stage 1: Continual pre-training

- **Corpus:** 538 MB clean Kyrgyz text from news portals and books
- **Objective:** T5-style span corruption (mask rate 0.15, mean span length 3)
- **Epochs:** 3
- **Train/validation split:** 98 / 2, seed 42; best checkpoint by validation loss
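
To make the Stage 1 objective concrete, here is a simplified sketch of T5-style span corruption on a list of tokens. It is an illustration under stated assumptions, not the authors' preprocessing: the sentinel names `<extra_id_N>` follow the T5 tokenizer convention, and `sample_spans` is a hypothetical sampler that uses fixed-length spans, whereas the original T5 preprocessing draws variable span lengths around the mean.

```python
import random
from typing import List, Tuple


def sample_spans(n_tokens: int, mask_rate: float = 0.15, mean_span: int = 3,
                 seed: int = 0) -> List[Tuple[int, int]]:
    """Pick one fixed-length span per equal-sized chunk (simplified sampler)."""
    rng = random.Random(seed)
    n_spans = max(1, round(n_tokens * mask_rate / mean_span))
    chunk = n_tokens // n_spans
    spans = []
    for i in range(n_spans):
        lo = i * chunk
        # Latest start that keeps the span, plus one gap token, inside the chunk.
        hi = max(lo, lo + chunk - mean_span - 1)
        start = rng.randint(lo, hi)
        spans.append((start, min(start + mean_span, lo + chunk)))
    return spans


def span_corrupt(tokens: List[str],
                 spans: List[Tuple[int, int]]) -> Tuple[List[str], List[str]]:
    """Replace each span with a sentinel in the input; the target holds the dropped tokens."""
    inp, tgt, cursor = [], [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[cursor:start])
        inp.append(sentinel)
        tgt.append(sentinel)
        tgt.extend(tokens[start:end])
        cursor = end
    inp.extend(tokens[cursor:])
    tgt.append(f"<extra_id_{len(spans)}>")  # closing sentinel, as in T5
    return inp, tgt


tokens = "Thank you for inviting me to your party last week".split()
corrupted, target = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(corrupted))  # Thank you <extra_id_0> me to your party <extra_id_1> week
print(" ".join(target))     # <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The target consists only of the dropped spans, each copied verbatim from the original text, which is the behavior the copy-bias hypothesis under Failure analysis refers to.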
### Stage 2: Fine-tuning

Identical to the fine-tune-only variant:

- **Effective batch size:** 64 (4 × 16 gradient accumulation)
- **Learning rate:** 3e-4, cosine schedule, 500 warmup steps
- **Epochs:** 5
- **Max sequence length:** 256
- **Train/validation split:** 95 / 5, seed 42; best checkpoint by validation loss
- **Hardware:** 1× NVIDIA RTX 5080 (16 GB VRAM)
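
The learning-rate settings above can be sketched as a function of the optimizer step. This assumes linear warmup followed by a half-cosine decay to zero, which is the shape produced by `get_cosine_schedule_with_warmup` in Transformers with its default settings; whether the authors used exactly that scheduler is an assumption. The step count is also a rough estimate that assumes all 1.67M pairs enter the 95 / 5 split.

```python
import math

PEAK_LR = 3e-4       # learning rate from the list above
WARMUP_STEPS = 500   # warmup steps from the list above


def lr_at(step: int, total_steps: int, peak_lr: float = PEAK_LR,
          warmup_steps: int = WARMUP_STEPS) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Rough step budget (assumption: 1.67M pairs * 0.95 train share, batch 64, 5 epochs).
steps_per_epoch = math.ceil(1_670_000 * 0.95 / 64)
total_steps = steps_per_epoch * 5

for step in (0, 250, 500, total_steps // 2, total_steps):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

Under these assumptions the 500 warmup steps cover well under 1% of training, so the schedule is almost entirely cosine decay.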
## Evaluation

Automatic metrics on the held-out 1,000-example test set:

| Metric | Value |
|---|---|
| CER | 0.0825 ± 0.004 |
| WER | 0.2017 |
| Exact Match | 0.184 |
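
For readers who want to compute the same kind of metrics on their own outputs, the following is a minimal sketch based on Levenshtein distance. It assumes corpus-level (micro-averaged) CER and WER with whitespace tokenization; the card does not state which averaging or text preprocessing the authors used, so the numbers it produces may not be directly comparable with the table.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance; insertions, deletions, and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(refs: List[str], hyps: List[str]) -> float:
    """Total character edits divided by total reference characters."""
    edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return edits / sum(len(r) for r in refs)


def wer(refs: List[str], hyps: List[str]) -> float:
    """Total word edits divided by total reference words (whitespace split)."""
    edits = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return edits / sum(len(r.split()) for r in refs)


def exact_match(refs: List[str], hyps: List[str]) -> float:
    """Share of hypotheses that equal their reference exactly."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```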

Compared with the fine-tune-only variant (CER 0.0796), the paired bootstrap test gives a two-sided p = 0.06. We treat this as **insufficient evidence to reject the null** of no difference, **not** as evidence of equivalence: n = 1,000 is underpowered for detecting small effects in either direction.
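
A paired bootstrap test of this kind can be sketched as follows. This is one common formulation (resampling the centered per-example differences to simulate the null hypothesis); the exact procedure used in the paper is not described in this card, so `paired_bootstrap_p`, the number of resamples, and the add-one smoothing are assumptions.

```python
import random
from typing import Sequence


def paired_bootstrap_p(a: Sequence[float], b: Sequence[float],
                       n_resamples: int = 10_000, seed: int = 0) -> float:
    """Two-sided p-value for equal mean scores on paired per-example data."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Center the differences so that resampling simulates "no difference".
    centered = [d - observed for d in diffs]
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_resamples):
        mean = sum(rng.choices(centered, k=n)) / n
        if abs(mean) >= abs(observed):
            extreme += 1
    # Add-one smoothing avoids reporting an impossible p = 0.
    return (extreme + 1) / (n_resamples + 1)
```

Here `a` and `b` would hold the per-example CER of the two systems on the same test items, in the same order.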

Human evaluation (200 examples, 2 native annotators): **99.8%** rated correct (Wilson 95% CI [0.986, 0.9996]); PABAK = 0.990, Gwet's AC1 = 0.995. These figures are identical to those of the fine-tune-only variant, with both models at the ceiling.
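
The interval and agreement figures can be reproduced from simple counts. The sketch below assumes, as an inference from the reported numbers rather than something the card states, that the 200 examples × 2 annotators give 400 judgments, of which 399 are "correct", so the annotators agree on 199 of 200 examples. The function names are illustrative.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def pabak(observed_agreement: float, n_categories: int = 2) -> float:
    """Prevalence- and bias-adjusted kappa."""
    return (n_categories * observed_agreement - 1) / (n_categories - 1)


def gwet_ac1_binary(observed_agreement: float, prevalence: float) -> float:
    """Gwet's AC1 for two raters and two categories.

    prevalence is the mean share of "correct" ratings across both raters.
    """
    chance = 2 * prevalence * (1 - prevalence)
    return (observed_agreement - chance) / (1 - chance)


# ASSUMPTION (inferred, not stated in the card): 399 of 400 judgments are
# "correct" and the two annotators agree on 199 of 200 examples.
lo, hi = wilson_interval(399, 400)
print(round(lo, 3), round(hi, 4))                       # 0.986 0.9996
print(round(pabak(199 / 200), 3))                       # 0.99
print(round(gwet_ac1_binary(199 / 200, 399 / 400), 3))  # 0.995
```

The same `wilson_interval` function applied to 35 successes out of 40 gives roughly [0.74, 0.95], matching the interval quoted under Failure analysis.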
### Per-category CER

| Category | N | FT-only | **PT+FT** |
|---|---|---|---|
| Punctuation | 849 | **0.078** | 0.081 |
| Capitalization | 62 | **0.084** | 0.085 |
| All-caps | 39 | 0.084 | **0.083** |
| Digit–Word | 41 | 0.076 | **0.067** |

PT+FT is numerically slightly better on Digit–Word compounds; with N = 41 we do not treat this as a robust advantage.
### Failure analysis

In 40 examples where FT-only outperforms PT+FT by more than 0.05 CER, **hallucination (input repetition) is the dominant error mode (35/40 = 87.5%, 95% Wilson CI [74%, 95%])**. We consider two non-exclusive hypotheses (see paper §6.1):

1. **Copy bias from span corruption:** T5-style span corruption trains the decoder to reconstruct spans of the input verbatim, which may reinforce copying behavior that is harmful for normalization (where the target is usually not a superset of the input).
2. **Register mismatch:** continual pre-training used clean, formal text (news/books), while fine-tuning targets the normalization of noisy, informal social-media text. The register gap may push the model toward fluent formal continuations that read as hallucinations.
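
The card does not say how input repetition was identified. As one hypothetical way to flag candidate cases for manual review, the sketch below marks an output as suspicious when it is much longer than its input or repeats a word n-gram that the input does not repeat. The names `has_repeated_ngram` and `flag_repetition` and the thresholds `n=3` and `max_len_ratio=1.5` are illustrative assumptions, not the authors' criteria.

```python
from collections import Counter


def has_repeated_ngram(text: str, n: int = 3) -> bool:
    """True if any word n-gram occurs more than once in text."""
    words = text.split()
    grams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(count > 1 for count in grams.values())


def flag_repetition(source: str, output: str, n: int = 3,
                    max_len_ratio: float = 1.5) -> bool:
    """Heuristic flag for outputs that may repeat the input (assumed thresholds)."""
    too_long = len(output) > max_len_ratio * max(1, len(source))
    # Only count repetition that the source itself does not already contain.
    new_repeat = has_repeated_ngram(output, n) and not has_repeated_ngram(source, n)
    return too_long or new_repeat
```

A flag from this heuristic is only a prompt for inspection; legitimate normalizations that expand digits into words can also lengthen the output.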
## Limitations

Same as the fine-tune-only variant, plus:

- **Higher hallucination rate** in failure cases. If you need maximum robustness, use the FT-only variant.
- **No measurable benefit from the additional pre-training** at this scale and corpus composition; results suggest a more targeted continual objective (in-domain noisy text, denoising closer to the normalization target) would be needed.
## Citation

```bibtex
@inproceedings{uvalieva2026kyrgyz,
  title={Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches},
  author={Uvalieva, Zarina and Kumarbai uulu, Bektemir and Metinov, Adilet and Tashbaltaev, Tynchtykbek and Alibekov, Nurtilek},
  booktitle={Proceedings of the MeLLM Workshop at ACL 2026},
  year={2026}
}
```
## License

MIT. Code: [github.com/Zarina33/Kyrgyz-Text-Normalization-Conference](https://github.com/Zarina33/Kyrgyz-Text-Normalization-Conference).