Zarinaaa committed 2 days ago

Commit cd5f316 (verified) · 1 parent: 18b6e8d

Add model card

Files changed (1)

README.md +108 -0 (added)

	@@ -0,0 +1,108 @@

+---
+language:
+- ky
+license: mit
+library_name: transformers
+pipeline_tag: text2text-generation
+base_model: google/mt5-small
+tags:
+- mt5
+- text-normalization
+- kyrgyz
+- low-resource
+- turkic
+datasets:
+- Zarinaaa/kyrgyz-text-normalization
+metrics:
+- cer
+- wer
+- exact_match
+---
+# mT5-small fine-tuned for Kyrgyz text normalization
+Fine-tuned `google/mt5-small` for normalizing noisy Kyrgyz social-media text (YouTube comments, Instagram posts, Telegram messages) into a standardized form, covering punctuation, capitalization, dialectal spelling, and digit–word compounds.
+This is the **fine-tuned only** variant from the camera-ready paper *"Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches"* (MeLLM Workshop @ ACL 2026). For the continual pre-training + fine-tuning variant, see [Zarinaaa/mt5-small-kyrgyz-normalization-ptft](https://huggingface.co/Zarinaaa/mt5-small-kyrgyz-normalization-ptft).
+## Usage
+```python
+from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+model_id = "Zarinaaa/mt5-small-kyrgyz-normalization"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
+
+# The "correct: " task prefix must be prepended to every input.
+noisy = "барды жакшы болсун коркунучту жерлерди тазалаш керек"
+inputs = tokenizer("correct: " + noisy, return_tensors="pt", truncation=True, max_length=256)
+
+# Beam search with 4 beams; generation is capped at 256 new tokens.
+out = model.generate(**inputs, max_new_tokens=256, num_beams=4)
+print(tokenizer.decode(out[0], skip_special_tokens=True))
+# Барды жакшы болсун. Коркунучтуу жерлерди тазалаш керек.
+```
+The prefix `"correct: "` is required — the model was fine-tuned with this exact prompt.
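Since the prefix is easy to forget when normalizing many strings at once, a small wrapper may help. The following is a minimal sketch, not part of the committed README: `build_prompts` and `normalize_batch` are hypothetical helper names, and `padding=True` is an assumption added so that inputs of different lengths can share one batch. The generation settings mirror the usage snippet above.

```python
from typing import List

PREFIX = "correct: "  # task prefix stated in the model card


def build_prompts(texts: List[str]) -> List[str]:
    """Prepend the required task prefix to each whitespace-stripped input."""
    return [PREFIX + t.strip() for t in texts]


def normalize_batch(
    texts: List[str],
    model_id: str = "Zarinaaa/mt5-small-kyrgyz-normalization",
) -> List[str]:
    """Normalize several strings with one generate() call.

    Downloads the model on first use. Hypothetical helper, not from the README.
    """
    # Imported lazily so build_prompts works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    inputs = tokenizer(
        build_prompts(texts),
        return_tensors="pt",
        padding=True,  # assumption: pad so unequal-length inputs can be batched
        truncation=True,
        max_length=256,
    )
    out = model.generate(**inputs, max_new_tokens=256, num_beams=4)
    return tokenizer.batch_decode(out, skip_special_tokens=True)
```

Reloading the model on every call is wasteful; in real use the tokenizer and model would be loaded once and passed in.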
+## Training data
+1.67M noisy–clean Kyrgyz text pairs from YouTube (45%), Instagram (25%), and Telegram (30%), automatically annotated with Gemini 3 Pro and spot-checked on a 400-example sample (84% acceptance rate, 95% Wilson CI [80%, 87%]). The 1,000-example test set was fully verified by two native Kyrgyz speakers with adjudication.
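The spot-check interval can be reproduced with the standard Wilson score formula. This is a sketch for readers who want to check the arithmetic: the count of 336 accepted examples is inferred from 84% of 400 and is not stated in the card, and `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumption: 84% acceptance on 400 examples corresponds to 336 accepted.
low, high = wilson_interval(336, 400)
print(f"[{low:.3f}, {high:.3f}]")
```

Rounded to whole percentages, the bounds agree with the [80%, 87%] interval reported above.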
+A 20,000-pair subset of the training data and the full test set are released at [Zarinaaa/kyrgyz-text-normalization](https://huggingface.co/datasets/Zarinaaa/kyrgyz-text-normalization).
+## Training procedure
+- **Base model:** `google/mt5-small` (300M parameters)
+- **Effective batch size:** 64 (physical batch 4 × gradient accumulation 16)
+- **Learning rate:** 3e-4, cosine schedule, 500 warmup steps
+- **Epochs:** 5
+- **Max sequence length:** 256
+- **Train/validation split:** 95 / 5, seed 42; best checkpoint by validation loss
+- **Hardware:** 1× NVIDIA RTX 5080 (16 GB VRAM)
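To make the batch and schedule settings concrete, here is a rough sketch of the implied learning-rate curve. Several things are assumptions rather than facts from the card: the warmup is taken to be linear, the cosine is taken to decay to zero, and the step count is estimated from the rounded "1.67M" figure, so it is approximate. `lr_at` is a hypothetical name.

```python
import math

PEAK_LR = 3e-4             # from the model card
WARMUP_STEPS = 500         # from the model card
EFFECTIVE_BATCH = 4 * 16   # physical batch x gradient accumulation = 64


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at an optimizer step: linear warmup (assumed), then cosine decay to 0 (assumed)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Rough step count: ~1.67M pairs, 95% used for training, 5 epochs.
approx_train_pairs = 1_670_000 * 95 // 100
steps_per_epoch = math.ceil(approx_train_pairs / EFFECTIVE_BATCH)
total_steps = 5 * steps_per_epoch

for step in (0, 250, 500, total_steps // 2, total_steps):
    print(step, lr_at(step, total_steps))
```

Under these assumptions warmup covers well under 1% of training, so almost the whole run follows the cosine decay.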
+The 1,000 test inputs are disjoint from the 1.67M training set (verified 0/1,000 exact-match overlap and 0/1,000 case-insensitive overlap).
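A disjointness check of this kind can be done with two set lookups. The sketch below is one plausible way to do it, not the authors' script; `count_overlap` is a hypothetical name, and `str.casefold()` is used as the case-insensitive comparison.

```python
from typing import Iterable, Tuple


def count_overlap(test_inputs: Iterable[str], train_inputs: Iterable[str]) -> Tuple[int, int]:
    """Return (exact, case-insensitive) counts of test inputs that also occur in training."""
    test = list(test_inputs)
    train = list(train_inputs)
    exact = set(train)
    folded = {t.casefold() for t in train}
    n_exact = sum(1 for t in test if t in exact)
    n_folded = sum(1 for t in test if t.casefold() in folded)
    return n_exact, n_folded


# Toy data: "Салам" matches only after case folding, "кеч" matches exactly.
print(count_overlap(["Салам", "кеч", "таң"], ["салам", "кеч"]))
```

A result of `(0, 0)` on the real data would correspond to the 0/1,000 figures quoted above. Near-duplicates that differ by whitespace or punctuation would not be caught by this check.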
+## Evaluation
+Automatic metrics on the held-out 1,000-example test set:
+| Metric | Value |
+|---|---|
+| **CER** | **0.0796** ± 0.003 |
+| WER | 0.1978 |
+| Exact Match | 0.186 |
+For comparison, the rule-based baseline reaches 0.2029 CER and zero-shot Gemma 4 (9.6B parameters, 32× larger) reaches 0.1620 CER.
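For readers unfamiliar with the metrics, the sketch below shows the standard per-example definitions: CER and WER are Levenshtein distance divided by reference length (in characters and whitespace-separated words respectively), and Exact Match is string equality. The card does not say whether scores are averaged per example or pooled over the corpus, or whether any text normalization is applied first, so this is an illustration of the definitions and not the paper's evaluation code.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance (insertions, deletions, substitutions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of one hypothesis against one reference."""
    return edit_distance(ref, hyp) / max(1, len(ref))


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(1, len(ref_words))


def exact_match(ref: str, hyp: str) -> float:
    """1.0 if the strings are identical, else 0.0."""
    return float(ref == hyp)
```

A single missing full stop already makes Exact Match 0 while barely moving CER, which is one reason the two metrics can diverge as much as they do in the table.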
+Human evaluation by two native Kyrgyz speakers on 200 examples: **99.8%** rated correct (Wilson 95% CI [98.6%, 99.96%]). Reliability under prevalence skew: PABAK = 0.990, Gwet's AC1 = 0.995. Of the 199 outputs both annotators rated correct, 162 (81.4%) differ from the Gemini reference at the character level: surface-form variability that EM penalizes but native speakers accept.
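The two agreement coefficients follow from a 2×2 agreement table. The sketch below assumes the table is 199 examples rated correct by both annotators, 1 example on which they disagreed, and 0 rated incorrect by both. Only the 199 is stated in the card; the other two cells are inferred from 99.8% of 400 judgments being "correct". `agreement_stats` is a hypothetical name.

```python
from typing import Tuple


def agreement_stats(both_yes: int, both_no: int, split: int) -> Tuple[float, float, float]:
    """Observed agreement, PABAK, and Gwet's AC1 for two raters and two categories."""
    n = both_yes + both_no + split
    p_o = (both_yes + both_no) / n               # observed agreement
    pabak = 2 * p_o - 1                          # prevalence- and bias-adjusted kappa
    pi_yes = (2 * both_yes + split) / (2 * n)    # mean share of "correct" ratings
    p_e = 2 * pi_yes * (1 - pi_yes)              # AC1 chance agreement (2 categories)
    ac1 = (p_o - p_e) / (1 - p_e)
    return p_o, pabak, ac1


# Assumed table: 199 both "correct", 0 both "incorrect", 1 disagreement.
p_o, pabak, ac1 = agreement_stats(both_yes=199, both_no=0, split=1)
print(f"p_o={p_o:.3f} PABAK={pabak:.3f} AC1={ac1:.3f}")
```

Under that assumed table the results round to the PABAK and AC1 values reported above. Both coefficients are used here because Cohen's kappa becomes unstable when nearly every item falls in one category.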
+## Per-category CER
+| Category | N | CER |
+|---|---|---|
+| Punctuation restoration | 849 | 0.078 |
+| Capitalization | 62 | 0.084 |
+| All-caps segments | 39 | 0.084 |
+| Digit–word compounds | 41 | 0.076 |
+## Limitations
+- **Domain:** trained and evaluated on social-media text. Performance on news, speech transcripts, or formal government text is not guaranteed.
+- **Reference bias:** training references were produced by Gemini 3 Pro; a probe with an independent annotator shows the model has learned a general normalization function (CER changes by only 0.012 against an independent reference), but residual stylistic bias is possible.
+- **Label noise:** ~16% of training pairs may contain minor issues per the 400-example spot-check.
+- **Model size:** larger variants (mT5-base/large, ByT5) and fine-tuned LLMs were not evaluated due to compute constraints.
+- **Rule-based comparison:** the baseline in the paper is intentionally minimal; a stronger Kyrgyz FST-based pipeline would likely close part of the gap.
+## Citation
+```bibtex
+@inproceedings{uvalieva2026kyrgyz,
+  title={Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches},
+  author={Uvalieva, Zarina and Kumarbai uulu, Bektemir and Metinov, Adilet and Tashbaltaev, Tynchtykbek and Alibekov, Nurtilek},
+  booktitle={Proceedings of the MeLLM Workshop at ACL 2026},
+  year={2026}
+}
+```
+## License
+MIT. Code: [github.com/Zarina33/Kyrgyz-Text-Normalization-Conference](https://github.com/Zarina33/Kyrgyz-Text-Normalization-Conference).