XLM-RoBERTa Fine-tuned for Kyrgyz Punctuation Restoration

This model restores punctuation in Kyrgyz text. It is xlm-roberta-base fine-tuned on a Kyrgyz punctuation dataset as a token classification task: for each word, the model predicts which punctuation mark, if any, should follow it.

Labels

| Label    | Description       |
|----------|-------------------|
| O        | No punctuation    |
| COMMA    | Comma (,)         |
| PERIOD   | Period (.)        |
| QUESTION | Question mark (?) |
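Each word's label names the punctuation mark to append directly after that word. A minimal illustration of this decoding convention (the labels here are hand-picked for the example, not real model output):

```python
# Decoding convention: a word's label names the mark appended after it.
PUNCT_MAP = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}

words = ["бүгүн", "аба", "ырайы", "жакшы", "болду"]
labels = ["O", "O", "O", "O", "PERIOD"]  # toy labels, not model predictions

restored = " ".join(w + PUNCT_MAP[l] for w, l in zip(words, labels))
print(restored)  # бүгүн аба ырайы жакшы болду.
```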

Evaluation Results

| Class        | Precision | Recall | F1-score | Support |
|--------------|-----------|--------|----------|---------|
| O            | 0.969     | 0.971  | 0.970    | 16344   |
| COMMA        | 0.784     | 0.766  | 0.775    | 2169    |
| PERIOD       | 0.989     | 0.988  | 0.989    | 1984    |
| QUESTION     | 0.662     | 0.752  | 0.704    | 125     |
| Weighted avg | 0.949     | 0.950  | 0.949    | 20622   |
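The weighted average is the support-weighted mean of the per-class scores. A quick sanity check from the F1 column (recomputing from rounded per-class values gives 0.950, agreeing with the reported 0.949 up to rounding):

```python
# Recompute the support-weighted F1 from the per-class values above.
f1      = {"O": 0.970, "COMMA": 0.775, "PERIOD": 0.989, "QUESTION": 0.704}
support = {"O": 16344, "COMMA": 2169, "PERIOD": 1984, "QUESTION": 125}

total = sum(support.values())  # 20622
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
print(f"{weighted_f1:.3f}")  # 0.950 (the reported 0.949 uses unrounded per-class scores)
```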

Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "Zarinaaa/xlmr-kyrgyz-punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

ID2LABEL = {0: "O", 1: "COMMA", 2: "PERIOD", 3: "QUESTION"}
PUNCT_MAP = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}

def restore_punctuation(text):
    """Insert the predicted punctuation mark after each word of `text`."""
    words = text.split()
    encoding = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors="pt",
        truncation=True,
        max_length=256,
    )
    with torch.no_grad():
        outputs = model(**encoding)
    preds = torch.argmax(outputs.logits, dim=-1)[0].tolist()
    word_ids = encoding.word_ids(batch_index=0)

    # Keep the prediction of the last sub-token of each word.
    word_puncts = {}
    for i, wid in enumerate(word_ids):
        if wid is None:  # special tokens such as <s> and </s>
            continue
        if i + 1 == len(word_ids) or word_ids[i + 1] != wid:
            word_puncts[wid] = PUNCT_MAP[ID2LABEL[preds[i]]]

    return " ".join(word + word_puncts.get(idx, "") for idx, word in enumerate(words))

text = "бүгүн аба ырайы жакшы болду биз сейилдөөгө чыктык"
print(restore_punctuation(text))
```
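With `max_length=256`, longer inputs are silently truncated. One workaround (a sketch, not part of the released code) is to split the word list into fixed-size chunks and restore each chunk independently; words near chunk boundaries lose right-hand context, so an overlapping-window scheme would be more robust:

```python
def chunk_words(words, chunk_size=200):
    """Split a word list into consecutive chunks of at most chunk_size words."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

# Hypothetical wiring with the restore_punctuation() function above:
# restored = " ".join(
#     restore_punctuation(" ".join(c)) for c in chunk_words(long_text.split())
# )
```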

Training Details

  • Base model: xlm-roberta-base
  • Epochs: 5
  • Batch size: 16
  • Learning rate: 5e-5
  • Max sequence length: 256
  • Optimizer: AdamW (weight decay 0.01, warmup ratio 0.1)
  • FP16: enabled
  • Hardware: NVIDIA RTX 5080
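The hyperparameters above map onto a `transformers` `TrainingArguments` roughly as follows. This is a sketch, not the authors' released training script; `output_dir` is an assumption, and the max sequence length of 256 is applied at tokenization time rather than here:

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed in the card.
training_args = TrainingArguments(
    output_dir="xlmr-kyrgyz-punctuation",  # assumption, not from the card
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,  # requires a CUDA device
)
```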

Citation

```bibtex
@article{uvalieva2025kyrgyz,
  title   = {Punctuation Restoration for Kyrgyz Language: A Comparative Study of Multilingual Transformer Models},
  author  = {Uvalieva, Zarina},
  journal = {ACM Transactions on Asian and Low-Resource Language Information Processing},
  year    = {2025}
}
```

GitHub: https://github.com/Zarina33/kyrgyz-punctuation-restoration
