XLM-RoBERTa Fine-tuned for Kyrgyz Punctuation Restoration

This model restores punctuation in Kyrgyz text. It is xlm-roberta-base fine-tuned on a Kyrgyz punctuation dataset as a token classification task: for each word, the model predicts which punctuation mark, if any, should follow it.

Labels

| Label    | Description       |
|----------|-------------------|
| O        | No punctuation    |
| COMMA    | Comma (,)         |
| PERIOD   | Period (.)        |
| QUESTION | Question mark (?) |
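Each word's label names the punctuation mark to append directly after that word. A minimal illustration of this decoding convention (the labels here are hand-picked for the example, not real model output):

```python
# Decoding convention: a word's label names the mark appended after it.
PUNCT_MAP = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}

words = ["бүгүн", "аба", "ырайы", "жакшы", "болду"]
labels = ["O", "O", "O", "O", "PERIOD"]  # toy labels, not model predictions

restored = " ".join(w + PUNCT_MAP[l] for w, l in zip(words, labels))
print(restored)  # бүгүн аба ырайы жакшы болду.
```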

Evaluation Results

| Class        | Precision | Recall | F1-score | Support |
|--------------|-----------|--------|----------|---------|
| O            | 0.969     | 0.971  | 0.970    | 16344   |
| COMMA        | 0.784     | 0.766  | 0.775    | 2169    |
| PERIOD       | 0.989     | 0.988  | 0.989    | 1984    |
| QUESTION     | 0.662     | 0.752  | 0.704    | 125     |
| Weighted avg | 0.949     | 0.950  | 0.949    | 20622   |
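The weighted average is the support-weighted mean of the per-class scores. A quick sanity check from the F1 column (recomputing from rounded per-class values gives 0.950, agreeing with the reported 0.949 up to rounding):

```python
# Recompute the support-weighted F1 from the per-class values above.
f1      = {"O": 0.970, "COMMA": 0.775, "PERIOD": 0.989, "QUESTION": 0.704}
support = {"O": 16344, "COMMA": 2169, "PERIOD": 1984, "QUESTION": 125}

total = sum(support.values())  # 20622
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total
print(f"{weighted_f1:.3f}")  # 0.950 (the reported 0.949 uses unrounded per-class scores)
```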

Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

model_name = "Zarinaaa/xlmr-kyrgyz-punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)
model.eval()

ID2LABEL = {0: "O", 1: "COMMA", 2: "PERIOD", 3: "QUESTION"}
PUNCT_MAP = {"O": "", "COMMA": ",", "PERIOD": ".", "QUESTION": "?"}

def restore_punctuation(text):
    """Insert the predicted punctuation mark after each word of `text`."""
    words = text.split()
    encoding = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors="pt",
        truncation=True,
        max_length=256,
    )
    with torch.no_grad():
        outputs = model(**encoding)
    preds = torch.argmax(outputs.logits, dim=-1)[0].tolist()
    word_ids = encoding.word_ids(batch_index=0)

    # Keep the prediction of the last sub-token of each word.
    word_puncts = {}
    for i, wid in enumerate(word_ids):
        if wid is None:  # special tokens such as <s> and </s>
            continue
        if i + 1 == len(word_ids) or word_ids[i + 1] != wid:
            word_puncts[wid] = PUNCT_MAP[ID2LABEL[preds[i]]]

    return " ".join(word + word_puncts.get(idx, "") for idx, word in enumerate(words))

text = "бүгүн аба ырайы жакшы болду биз сейилдөөгө чыктык"
print(restore_punctuation(text))
```
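With `max_length=256`, longer inputs are silently truncated. One workaround (a sketch, not part of the released code) is to split the word list into fixed-size chunks and restore each chunk independently; words near chunk boundaries lose right-hand context, so an overlapping-window scheme would be more robust:

```python
def chunk_words(words, chunk_size=200):
    """Split a word list into consecutive chunks of at most chunk_size words."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

# Hypothetical wiring with the restore_punctuation() function above:
# restored = " ".join(
#     restore_punctuation(" ".join(c)) for c in chunk_words(long_text.split())
# )
```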

Training Details

  • Base model: xlm-roberta-base
  • Epochs: 5
  • Batch size: 16
  • Learning rate: 5e-5
  • Max sequence length: 256
  • Optimizer: AdamW (weight decay 0.01, warmup ratio 0.1)
  • FP16: enabled
  • Hardware: NVIDIA RTX 5080
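The hyperparameters above map onto a `transformers` `TrainingArguments` roughly as follows. This is a sketch, not the authors' released training script; `output_dir` is an assumption, and the max sequence length of 256 is applied at tokenization time rather than here:

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed in the card.
training_args = TrainingArguments(
    output_dir="xlmr-kyrgyz-punctuation",  # assumption, not from the card
    num_train_epochs=5,
    per_device_train_batch_size=16,
    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,
    fp16=True,  # requires a CUDA device
)
```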

Citation

```bibtex
@article{uvalieva2025kyrgyz,
  title   = {Punctuation Restoration for Kyrgyz Language: A Comparative Study of Multilingual Transformer Models},
  author  = {Uvalieva, Zarina},
  journal = {ACM Transactions on Asian and Low-Resource Language Information Processing},
  year    = {2025}
}
```

GitHub: https://github.com/Zarina33/kyrgyz-punctuation-restoration
