# XLM-RoBERTa Fine-tuned for Kyrgyz Punctuation Restoration

This model restores punctuation in Kyrgyz text. It is xlm-roberta-base fine-tuned for token classification on a Kyrgyz punctuation dataset.
## Labels
| Label | Description |
|---|---|
| O | No punctuation |
| COMMA | Comma (,) |
| PERIOD | Period (.) |
| QUESTION | Question mark (?) |
## Evaluation Results
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| O | 0.969 | 0.971 | 0.970 | 16344 |
| COMMA | 0.784 | 0.766 | 0.775 | 2169 |
| PERIOD | 0.989 | 0.988 | 0.989 | 1984 |
| QUESTION | 0.662 | 0.752 | 0.704 | 125 |
| Weighted avg | 0.949 | 0.950 | 0.949 | 20622 |
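As a sanity check, the weighted-average row can be recomputed from the per-class scores and supports. Small deviations from the reported values are expected because the per-class numbers are themselves rounded to three decimals:

```python
# Recompute the weighted averages from the rounded per-class scores.
rows = {                    # class: (precision, recall, f1, support)
    'O':        (0.969, 0.971, 0.970, 16344),
    'COMMA':    (0.784, 0.766, 0.775, 2169),
    'PERIOD':   (0.989, 0.988, 0.989, 1984),
    'QUESTION': (0.662, 0.752, 0.704, 125),
}
total = sum(v[3] for v in rows.values())          # 20622
weighted = [sum(v[i] * v[3] for v in rows.values()) / total for i in range(3)]
print([round(w, 3) for w in weighted])            # close to 0.949 / 0.950 / 0.949
```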
## Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "YOUR_USERNAME/xlmr-kyrgyz-punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

ID2LABEL = {0: 'O', 1: 'COMMA', 2: 'PERIOD', 3: 'QUESTION'}
PUNCT_MAP = {'O': '', 'COMMA': ',', 'PERIOD': '.', 'QUESTION': '?'}

def restore_punctuation(text):
    words = text.split()
    encoding = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors='pt',
        truncation=True,
        max_length=256,
    )
    with torch.no_grad():
        outputs = model(**encoding)
    preds = torch.argmax(outputs.logits, dim=-1)[0].tolist()
    word_ids = encoding.word_ids(batch_index=0)

    # For each word, keep the prediction of its last subword token.
    word_puncts = {}
    for i in range(len(word_ids) - 1, -1, -1):
        wid = word_ids[i]
        if wid is None:  # special tokens (<s>, </s>)
            continue
        if i == len(word_ids) - 1 or word_ids[i + 1] != wid:
            word_puncts[wid] = PUNCT_MAP[ID2LABEL[preds[i]]]

    return ' '.join(word + word_puncts.get(idx, '') for idx, word in enumerate(words))

text = "бүгүн аба ырайы жакшы болду биз сейилдөөгө чыктык"
print(restore_punctuation(text))
```
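The snippet above truncates input at 256 subword tokens, so very long documents are silently cut off. One simple workaround (a sketch, not part of the released model code; `restore_long_text` and `chunk_words=180` are illustrative) is to process the text in fixed-size word chunks and concatenate the results:

```python
def restore_long_text(text, restore_fn, chunk_words=180):
    """Apply a punctuation-restoration function to a long text in
    fixed-size word chunks, keeping each chunk under the model's
    256-subword limit. `restore_fn` is assumed to behave like
    restore_punctuation() above."""
    words = text.split()
    pieces = []
    for start in range(0, len(words), chunk_words):
        chunk = ' '.join(words[start:start + chunk_words])
        pieces.append(restore_fn(chunk))
    return ' '.join(pieces)
```

Chunk boundaries may occasionally split a sentence; overlapping windows would reduce that at the cost of extra forward passes.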
## Training Details
- Base model: xlm-roberta-base
- Epochs: 5
- Batch size: 16
- Learning rate: 5e-5
- Max sequence length: 256
- Optimizer: AdamW (weight decay 0.01, warmup ratio 0.1)
- FP16: enabled
- Hardware: NVIDIA RTX 5080
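The optimization setup above (AdamW with weight decay 0.01 and a 10% warmup ratio) can be sketched in plain PyTorch. This is a hypothetical reconstruction, not the actual training script: the real run likely used the `transformers` Trainer, and `total_steps` below is a placeholder that would depend on the dataset size:

```python
import torch

model = torch.nn.Linear(8, 4)           # stand-in for the fine-tuned model
total_steps = 1000                      # placeholder; real value depends on data size
warmup_steps = int(0.1 * total_steps)   # warmup ratio 0.1

# AdamW with the listed learning rate and weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

def linear_warmup_decay(step):
    # Linear ramp to the peak LR over warmup_steps, then linear decay to 0.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, linear_warmup_decay)
```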
## Citation
```bibtex
@article{uvalieva2025kyrgyz,
  title   = {Punctuation Restoration for Kyrgyz Language: A Comparative Study of Multilingual Transformer Models},
  author  = {Uvalieva, Zarina},
  journal = {ACM Transactions on Asian and Low-Resource Language Information Processing},
  year    = {2025}
}
```
GitHub: https://github.com/Zarina33/kyrgyz-punctuation-restoration