# 🏆 XLM-RoBERTa Large for Tatar Toponyms QA

## 📖 Model Description
XLM-RoBERTa Large fine-tuned for question answering on Tatarstan toponyms. This is the best-performing model in the Tatar Toponyms QA collection, with a 99.4% F1 score.
The model was fine-tuned from deepset/xlm-roberta-large-squad2 on a synthetic dataset of 38,696 QA pairs about Tatarstan geographical names.
## ✨ Key Features
- 🎯 99.4% F1 score on test set
- 🌍 Multilingual (Russian & Tatar)
- 🗺️ Handles all toponym-related questions
- ⚡ Production-ready with no post-processing needed
## 📊 Performance Metrics
| Metric | Score | 95% CI |
|---|---|---|
| Exact Match | 0.992 | [0.984, 0.998] |
| F1 Score | 0.994 | [0.986, 0.999] |
| ROUGE-L | 0.496 | - |
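For reference, the Exact Match and F1 numbers above follow the standard SQuAD-style definitions. A minimal sketch (with simplified normalization, assuming plain whitespace tokenization; the official SQuAD script also strips punctuation and articles):

```python
# SQuAD-style QA metrics, simplified: Exact Match is string equality
# after lowercasing; F1 is token-level overlap between prediction and gold.
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    pred_toks = pred.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# Partial overlap with the gold span yields a fractional F1
print(token_f1("на р. Мелля", "на р. Мелля, в 21 км"))
```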
### 📈 Performance by Question Type
| Question Type | F1 Score | Example |
|---|---|---|
| Coordinates | 0.980 | "Какие координаты у Казани?" ("What are the coordinates of Kazan?") |
| Location | 1.000 | "Где находится Рантамак?" ("Where is Rantamak located?") |
| Etymology | 1.000 | "Что означает название Чистополь?" ("What does the name Chistopol mean?") |
| Type | 1.000 | "Что такое Казань?" ("What is Kazan?") |
| Region | 1.000 | "В каком регионе находится?" ("In which region is it located?") |
| Sources | 0.990 | "Какие источники упоминают?" ("Which sources mention it?") |
## 🚀 Quick Start

### Installation

```bash
pip install transformers torch
```
### With Pipeline (recommended)

```python
from transformers import pipeline

# Load the model (automatically downloads from the Hub)
qa_pipeline = pipeline(
    "question-answering",
    model="TatarNLPWorld/xlm-roberta-large-tatar-toponyms-qa",
)

# Example with context in the dataset's field format
context = """
Название (рус): Рантамак | Название (тат): Рантамак | Объект: Село |
Расположение: на р. Мелля, в 21 км к востоку от с. Сарманово |
Этимология: Топоним произошел от ойконима «Рангазар-Тамак» |
Координаты: 55.205461, 52.881862
"""

questions = [
    "Где находится Рантамак?",
    "Что означает название Рантамак?",
    "Какие координаты у Рантамак?",
    "Какой тип объекта у Рантамак?",
]

for question in questions:
    result = qa_pipeline(question=question, context=context)
    print(f"Q: {question}")
    print(f"A: {result['answer']}")
    print(f"Confidence: {result['score']:.3f}\n")
```
### With PyTorch

```python
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("TatarNLPWorld/xlm-roberta-large-tatar-toponyms-qa")
model = AutoModelForQuestionAnswering.from_pretrained("TatarNLPWorld/xlm-roberta-large-tatar-toponyms-qa")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Question and context (same field format as the pipeline example above)
question = "Где находится Рантамак?"
context = "Название (рус): Рантамак | Расположение: на р. Мелля, в 21 км к востоку от с. Сарманово | Координаты: 55.205461, 52.881862"

# Prepare inputs
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=512).to(device)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Decode the highest-scoring answer span
start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits)
answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx + 1], skip_special_tokens=True)
print(f"Answer: {answer}")
```
## 📚 Training Details

### Dataset
- Source: Tatarstan Toponyms Dataset
- QA pairs: 38,696 synthetic examples
- Train/Validation/Test split: 80%/10%/10%
- Question types: coordinates, location, etymology, type, region, sources
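The card does not state how the 80/10/10 split was produced; a minimal index-based sketch (the shuffle seed of 42 is an assumption for illustration):

```python
# Deterministic 80/10/10 split over example indices.
import random

def split_indices(n: int, seed: int = 42):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(n * 0.8)
    n_val = int(n * 0.1)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

# Split the 38,696 QA pairs into train / validation / test
train_idx, val_idx, test_idx = split_indices(38_696)
print(len(train_idx), len(val_idx), len(test_idx))
```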
### Training Parameters
| Parameter | Value |
|---|---|
| Base model | deepset/xlm-roberta-large-squad2 |
| Epochs | 3 |
| Learning rate | 3e-5 |
| Batch size | 4 |
| Max sequence length | 384 |
| Optimizer | AdamW |
| Warmup steps | 500 |
| Weight decay | 0.01 |
| Hardware | NVIDIA GPU |
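The hyperparameters in the table map onto a Hugging Face `TrainingArguments` configuration roughly as follows (a sketch, not the exact training script; `output_dir` is a placeholder, and the max sequence length of 384 is applied at tokenization time rather than here):

```python
# Training configuration mirroring the table above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-large-tatar-toponyms-qa",  # placeholder path
    num_train_epochs=3,
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,  # AdamW is the default optimizer
)
```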
## 💡 Usage Tips
- Context format: The model expects context in the format from the original dataset (with prefixes like "Название (рус):", "Объект:", etc.)
- No normalization needed: Unlike RuBERT models, XLM-RoBERTa doesn't require post-processing
- Question types: Works best with the 6 question types it was trained on
- Confidence scores: Can be used for answer filtering
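The last tip can be sketched as a small filter over the pipeline's result dict; the 0.5 threshold and the `filter_answer` helper are illustrative assumptions, not tuned or shipped values:

```python
# Keep an answer only when the pipeline's confidence clears a threshold.
def filter_answer(result: dict, threshold: float = 0.5):
    """Return the answer span if the model is confident enough, else None."""
    if result["score"] >= threshold:
        return result["answer"]
    return None

# `result` has the shape returned by the question-answering pipeline
print(filter_answer({"answer": "на р. Мелля", "score": 0.97}))
print(filter_answer({"answer": "Казань", "score": 0.12}))
```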
## 🔗 Related Resources

### Models in Collection
| Model | Description | F1 Score |
|---|---|---|
| xlm-roberta-large-tatar-toponyms-qa | Best performing (this model) | 0.994 |
| rubert-base-tatar-toponyms-qa | Balanced model | 0.684 |
| rubert-large-tatar-toponyms-qa | Large version | 0.679 |
### Datasets
- Tatarstan Toponyms QA Dataset - Training data (38,696 QA pairs)
- Tatarstan Toponyms Dataset - Original data (9,688 toponyms)
## 📝 Citation
If you use this model in your research, please cite:
```bibtex
@misc{xlm_roberta_large_tatar_toponyms_qa,
  author       = {Arabov, Mullosharaf Kurbonvoich},
  title        = {XLM-RoBERTa Large for Tatar Toponyms QA},
  year         = {2026},
  publisher    = {Hugging Face},
  note         = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/TatarNLPWorld/xlm-roberta-large-tatar-toponyms-qa}},
}
```
For the dataset:

```bibtex
@dataset{tatarstan_toponyms_qa_2026,
  title     = {Tatarstan Toponyms QA Dataset},
  author    = {Arabov, Mullosharaf Kurbonvoich},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/TatarNLPWorld/tatarstan-toponyms-qa},
}
```
## 🎯 Intended Uses
- Geographic information extraction from texts
- Question answering systems for Tatarstan geography
- Educational tools for learning about toponyms
- NLP research on Tatar and Russian languages
- Geocoding applications (with coordinate questions)
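For geocoding, the model's coordinate answers can be parsed into floats. A small sketch assuming the "lat, lon" decimal format shown in the dataset's Координаты field (the `parse_coordinates` helper is illustrative, not part of the model):

```python
# Extract a (latitude, longitude) float pair from a coordinate answer string.
import re

def parse_coordinates(answer: str):
    match = re.search(r"(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)", answer)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

print(parse_coordinates("55.205461, 52.881862"))
```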
## ⚠️ Limitations
- Trained only on Tatarstan region toponyms
- Questions must be answerable from the provided context
- Best performance on the 6 trained question types
- Requires context in the specific format from the original dataset
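The last limitation means contexts should be assembled in the dataset's field format before querying the model. A minimal sketch (the field names follow the Quick Start example; the `build_context` helper and the record keys are hypothetical):

```python
# Assemble a context string in the dataset's "field: value | ..." format.
def build_context(record: dict) -> str:
    parts = [
        f"Название (рус): {record['name_ru']}",
        f"Название (тат): {record['name_tt']}",
        f"Объект: {record['object_type']}",
        f"Расположение: {record['location']}",
        f"Этимология: {record['etymology']}",
        f"Координаты: {record['coordinates']}",
    ]
    return " | ".join(parts)

record = {
    "name_ru": "Рантамак",
    "name_tt": "Рантамак",
    "object_type": "Село",
    "location": "на р. Мелля, в 21 км к востоку от с. Сарманово",
    "etymology": "Топоним произошел от ойконима «Рангазар-Тамак»",
    "coordinates": "55.205461, 52.881862",
}
print(build_context(record))
```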
## 👥 Team and Maintenance
- Developer: Mullosharaf Kurbonvoich Arabov
- Organization: TatarNLPWorld
- Project: Tat2Vec - Advancing Tatar Language Processing
## 🤝 Contributing
Contributions are welcome! Please:
- Open an issue for bugs or feature requests
- Submit PRs for improvements
- Share your use cases and results
## 📬 Contact
- Issues: GitHub Issues
- Email: TatarNLPWorld community
- Website: Tat2Vec Project
## 📄 License
This model is released under the CC BY-SA 4.0 license.
## 🌟 Acknowledgments
- Original data from Tatarstan Toponyms Dataset
- Base model: deepset/xlm-roberta-large-squad2
- Data sources: Ф.Г. Гарипова, Г.Ф. Саттаров, Р.Г. Әхмәтьянов
- Resource: «Топонимы Татарстана» (Toponyms of Tatarstan)
📅 Version: 1.0.0 | 📅 Published: 2026-03-10 | 🏆 F1 Score: 0.994