🏆 XLM-RoBERTa Large for Tatar Toponyms QA

📖 Model Description

XLM-RoBERTa large fine-tuned for question answering on Tatarstan toponyms. This is the best-performing model in the Tatar Toponyms QA collection, reaching 99.4% F1 on the test set.

This model is fine-tuned from deepset/xlm-roberta-large-squad2 on a synthetic dataset of 38,696 QA pairs about Tatarstan geographical names.

✨ Key Features

  • 🎯 99.4% F1 score on test set
  • 🌍 Multilingual (Russian & Tatar)
  • 🗺️ Covers six toponym question types (coordinates, location, etymology, type, region, sources)
  • ⚡ Production-ready with no post-processing needed

📊 Performance Metrics

| Metric | Score | 95% CI |
|---|---|---|
| Exact Match | 0.992 | [0.984, 0.998] |
| F1 Score | 0.994 | [0.986, 0.999] |
| ROUGE-L | 0.496 | - |

📈 Performance by Question Type

| Question Type | F1 Score | Example |
|---|---|---|
| Coordinates | 0.980 | "Какие координаты у Казани?" ("What are the coordinates of Kazan?") |
| Location | 1.000 | "Где находится Рантамак?" ("Where is Rantamak located?") |
| Etymology | 1.000 | "Что означает название Чистополь?" ("What does the name Chistopol mean?") |
| Type | 1.000 | "Что такое Казань?" ("What is Kazan?") |
| Region | 1.000 | "В каком регионе находится?" ("In which region is it located?") |
| Sources | 0.990 | "Какие источники упоминают?" ("Which sources mention it?") |

🚀 Quick Start

Installation

pip install transformers torch

With Pipeline (recommended)

from transformers import pipeline

# Load model (automatically downloads from Hub)
qa_pipeline = pipeline(
    "question-answering",
    model="TatarNLPWorld/xlm-roberta-large-tatar-toponyms-qa"
)

# Example with context from dataset
context = """
Название (рус): Рантамак | Название (тат): Рантамак | Объект: Село | 
Расположение: на р. Мелля, в 21 км к востоку от с. Сарманово | 
Этимология: Топоним произошел от ойконима «Рангазар-Тамак» | 
Координаты: 55.205461, 52.881862
"""

questions = [
    "Где находится Рантамак?",
    "Что означает название Рантамак?",
    "Какие координаты у Рантамак?",
    "Какой тип объекта у Рантамак?"
]

for question in questions:
    result = qa_pipeline(question=question, context=context)
    print(f"Q: {question}")
    print(f"A: {result['answer']}")
    print(f"Confidence: {result['score']:.3f}\n")
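The pipe-delimited context string above can be assembled from structured record fields. A hypothetical helper (the Russian field labels follow the dataset format; the function itself is not part of the released tooling):

```python
def build_context(record: dict) -> str:
    """Join dataset fields into the pipe-delimited context format the model was trained on."""
    # Labels mirror the original dataset prefixes; missing fields are simply skipped.
    field_order = [
        ("name_ru", "Название (рус)"),
        ("name_tt", "Название (тат)"),
        ("object_type", "Объект"),
        ("location", "Расположение"),
        ("etymology", "Этимология"),
        ("coordinates", "Координаты"),
    ]
    parts = [f"{label}: {record[key]}" for key, label in field_order if record.get(key)]
    return " | ".join(parts)

record = {
    "name_ru": "Рантамак",
    "name_tt": "Рантамак",
    "object_type": "Село",
    "coordinates": "55.205461, 52.881862",
}
print(build_context(record))
```

Keeping the field order and prefixes consistent with training matters, since the model learned to locate answers relative to these labels.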

With PyTorch

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("TatarNLPWorld/xlm-roberta-large-tatar-toponyms-qa")
model = AutoModelForQuestionAnswering.from_pretrained("TatarNLPWorld/xlm-roberta-large-tatar-toponyms-qa")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Prepare inputs (define the question and context first)
question = "Где находится Рантамак?"
context = "Название (рус): Рантамак | Объект: Село | Расположение: на р. Мелля, в 21 км к востоку от с. Сарманово"
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=512).to(device)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Decode answer
start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits)
answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx+1], skip_special_tokens=True)
print(f"Answer: {answer}")
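Note that taking the two argmaxes independently can occasionally produce an end index before the start index. A defensive variant scores all valid spans and keeps the best one; a sketch on plain lists so it is framework-agnostic:

```python
def best_valid_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e] subject to s <= e."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within a length cap.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Toy logits where independent argmaxes would give start=3, end=1 (an invalid span).
start_logits = [0.1, 0.2, 0.1, 5.0]
end_logits = [0.1, 4.0, 0.3, 0.2]
print(best_valid_span(start_logits, end_logits))  # → (3, 3)
```

This is the same idea the `question-answering` pipeline applies internally, which is one reason the pipeline route is recommended above.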

📚 Training Details

Dataset

  • Source: Tatarstan Toponyms Dataset
  • QA pairs: 38,696 synthetic examples
  • Train/Validation/Test split: 80%/10%/10%
  • Question types: coordinates, location, etymology, type, region, sources
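An 80%/10%/10% split like the one above can be reproduced deterministically. A minimal sketch (the actual split script is not published, so the seed and shuffling method here are assumptions):

```python
import random

def split_dataset(examples, seed=42, train_frac=0.8, val_frac=0.1):
    """Shuffle deterministically, then cut into train/validation/test parts."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed makes the split reproducible
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

# 38,696 QA pairs, as in the dataset above.
train, val, test = split_dataset(range(38_696))
print(len(train), len(val), len(test))  # → 30956 3869 3871
```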

Training Parameters

| Parameter | Value |
|---|---|
| Base model | deepset/xlm-roberta-large-squad2 |
| Epochs | 3 |
| Learning rate | 3e-5 |
| Batch size | 4 |
| Max sequence length | 384 |
| Optimizer | AdamW |
| Warmup steps | 500 |
| Weight decay | 0.01 |
| Hardware | NVIDIA GPU |
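These hyperparameters map directly onto a Hugging Face `TrainingArguments` configuration. A sketch (argument names are the standard `transformers` ones; `output_dir` is an assumption, and the exact training script is not published):

```python
from transformers import TrainingArguments

# Values from the table above; max sequence length (384) is applied at
# tokenization time via max_length, not here. AdamW is the default optimizer.
training_args = TrainingArguments(
    output_dir="xlm-roberta-large-tatar-toponyms-qa",  # hypothetical path
    num_train_epochs=3,
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
)
```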

💡 Usage Tips

  1. Context format: The model expects context in the format from the original dataset (with prefixes like "Название (рус):", "Объект:", etc.)
  2. No normalization needed: Unlike RuBERT models, XLM-RoBERTa doesn't require post-processing
  3. Question types: Works best with the 6 question types it was trained on
  4. Confidence scores: Can be used for answer filtering
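Tip 4 can be implemented as a simple threshold on the pipeline's `score` field. A sketch (the 0.5 threshold is an assumption to tune on your own data):

```python
def filter_answers(results, threshold=0.5):
    """Keep only pipeline answers whose confidence clears the threshold."""
    return [r for r in results if r["score"] >= threshold]

# Dicts shaped like the question-answering pipeline output.
results = [
    {"answer": "на р. Мелля, в 21 км к востоку от с. Сарманово", "score": 0.97},
    {"answer": "Село", "score": 0.31},
]
confident = filter_answers(results)
print([r["answer"] for r in confident])
```

Low-confidence answers can then be routed to a fallback (e.g. returning "no answer" or a human review queue) rather than shown to users.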

🔗 Related Resources

Models in Collection

| Model | Description | F1 Score |
|---|---|---|
| xlm-roberta-large-tatar-toponyms-qa | Best performing (this model) | 0.994 |
| rubert-base-tatar-toponyms-qa | Balanced model | 0.684 |
| rubert-large-tatar-toponyms-qa | Large version | 0.679 |

Datasets

  • Tatarstan Toponyms QA Dataset: https://huggingface.co/datasets/TatarNLPWorld/tatarstan-toponyms-qa
📝 Citation

If you use this model in your research, please cite:

@misc{xlm_roberta_large_tatar_toponyms_qa,
    author = {Arabov, Mullosharaf Kurbonvoich},
    title = {XLM-RoBERTa Large for Tatar Toponyms QA},
    year = {2026},
    publisher = {Hugging Face},
    journal = {Hugging Face Hub},
    howpublished = {\url{https://huggingface.co/TatarNLPWorld/xlm-roberta-large-tatar-toponyms-qa}}
}

For the dataset:

@dataset{tatarstan_toponyms_qa_2026,
    title = {Tatarstan Toponyms QA Dataset},
    author = {Arabov, Mullosharaf Kurbonvoich},
    year = {2026},
    publisher = {Hugging Face},
    url = {https://huggingface.co/datasets/TatarNLPWorld/tatarstan-toponyms-qa}
}

🎯 Intended Uses

  • Geographic information extraction from texts
  • Question answering systems for Tatarstan geography
  • Educational tools for learning about toponyms
  • NLP research on Tatar and Russian languages
  • Geocoding applications (with coordinate questions)

⚠️ Limitations

  • Trained only on Tatarstan region toponyms
  • Questions must be answerable from the provided context
  • Best performance on the 6 trained question types
  • Requires context in the specific format from the original dataset

🤝 Contributing

Contributions are welcome! Please:

  1. Open an issue for bugs or feature requests
  2. Submit PRs for improvements
  3. Share your use cases and results

📄 License

This model is released under the CC BY-SA 4.0 license.



📅 Version: 1.0.0 | 📅 Published: 2026-03-10 | 🏆 F1 Score: 0.994
