🏆 XLM-RoBERTa Large for Tatar Toponyms QA

📖 Model Description

XLM-RoBERTa large fine-tuned for question answering on Tatarstan toponyms. This is the best-performing model in the Tatar Toponyms QA collection, reaching 99.4% F1 on the test set.

This model is fine-tuned from deepset/xlm-roberta-large-squad2 on a synthetic dataset of 38,696 QA pairs about Tatarstan geographical names.

✨ Key Features

  • 🎯 99.4% F1 score on test set
  • 🌍 Multilingual (Russian & Tatar)
  • 🗺️ Covers six toponym question types (coordinates, location, etymology, type, region, sources)
  • ⚡ Production-ready with no post-processing needed

📊 Performance Metrics

| Metric | Score | 95% CI |
|---|---|---|
| Exact Match | 0.992 | [0.984, 0.998] |
| F1 Score | 0.994 | [0.986, 0.999] |
| ROUGE-L | 0.496 | - |

📈 Performance by Question Type

| Question Type | F1 Score | Example |
|---|---|---|
| Coordinates | 0.980 | "Какие координаты у Казани?" ("What are the coordinates of Kazan?") |
| Location | 1.000 | "Где находится Рантамак?" ("Where is Rantamak located?") |
| Etymology | 1.000 | "Что означает название Чистополь?" ("What does the name Chistopol mean?") |
| Type | 1.000 | "Что такое Казань?" ("What is Kazan?") |
| Region | 1.000 | "В каком регионе находится?" ("In which region is it located?") |
| Sources | 0.990 | "Какие источники упоминают?" ("Which sources mention it?") |

🚀 Quick Start

Installation

pip install transformers torch

With Pipeline (recommended)

from transformers import pipeline

# Load model (automatically downloads from Hub)
qa_pipeline = pipeline(
    "question-answering",
    model="TatarNLPWorld/xlm-roberta-large-tatar-toponyms-qa"
)

# Example with context from dataset
context = """
Название (рус): Рантамак | Название (тат): Рантамак | Объект: Село | 
Расположение: на р. Мелля, в 21 км к востоку от с. Сарманово | 
Этимология: Топоним произошел от ойконима «Рангазар-Тамак» | 
Координаты: 55.205461, 52.881862
"""

questions = [
    "Где находится Рантамак?",
    "Что означает название Рантамак?",
    "Какие координаты у Рантамак?",
    "Какой тип объекта у Рантамак?"
]

for question in questions:
    result = qa_pipeline(question=question, context=context)
    print(f"Q: {question}")
    print(f"A: {result['answer']}")
    print(f"Confidence: {result['score']:.3f}\n")
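The pipe-delimited context string above can be assembled from structured record fields. A hypothetical helper (the Russian field labels follow the dataset format; the function itself is not part of the released tooling):

```python
def build_context(record: dict) -> str:
    """Join dataset fields into the pipe-delimited context format the model was trained on."""
    # Labels mirror the original dataset prefixes; missing fields are simply skipped.
    field_order = [
        ("name_ru", "Название (рус)"),
        ("name_tt", "Название (тат)"),
        ("object_type", "Объект"),
        ("location", "Расположение"),
        ("etymology", "Этимология"),
        ("coordinates", "Координаты"),
    ]
    parts = [f"{label}: {record[key]}" for key, label in field_order if record.get(key)]
    return " | ".join(parts)

record = {
    "name_ru": "Рантамак",
    "name_tt": "Рантамак",
    "object_type": "Село",
    "coordinates": "55.205461, 52.881862",
}
print(build_context(record))
```

Keeping the field order and prefixes consistent with training matters, since the model learned to locate answers relative to these labels.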

With PyTorch

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("TatarNLPWorld/xlm-roberta-large-tatar-toponyms-qa")
model = AutoModelForQuestionAnswering.from_pretrained("TatarNLPWorld/xlm-roberta-large-tatar-toponyms-qa")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# Prepare inputs (define the question and context first)
question = "Где находится Рантамак?"
context = "Название (рус): Рантамак | Объект: Село | Расположение: на р. Мелля, в 21 км к востоку от с. Сарманово"
inputs = tokenizer(question, context, return_tensors="pt", truncation=True, max_length=512).to(device)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)

# Decode answer
start_idx = torch.argmax(outputs.start_logits)
end_idx = torch.argmax(outputs.end_logits)
answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx+1], skip_special_tokens=True)
print(f"Answer: {answer}")
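Note that taking the two argmaxes independently can occasionally produce an end index before the start index. A defensive variant scores all valid spans and keeps the best one; a sketch on plain lists so it is framework-agnostic:

```python
def best_valid_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end) maximizing start_logits[s] + end_logits[e] subject to s <= e."""
    best = (0, 0)
    best_score = float("-inf")
    for s, s_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within a length cap.
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_score + end_logits[e]
            if score > best_score:
                best_score = score
                best = (s, e)
    return best

# Toy logits where independent argmaxes would give start=3, end=1 (an invalid span).
start_logits = [0.1, 0.2, 0.1, 5.0]
end_logits = [0.1, 4.0, 0.3, 0.2]
print(best_valid_span(start_logits, end_logits))  # → (3, 3)
```

This is the same idea the `question-answering` pipeline applies internally, which is one reason the pipeline route is recommended above.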

📚 Training Details

Dataset

  • Source: Tatarstan Toponyms Dataset
  • QA pairs: 38,696 synthetic examples
  • Train/Validation/Test split: 80%/10%/10%
  • Question types: coordinates, location, etymology, type, region, sources
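An 80%/10%/10% split like the one above can be reproduced deterministically. A minimal sketch (the actual split script is not published, so the seed and shuffling method here are assumptions):

```python
import random

def split_dataset(examples, seed=42, train_frac=0.8, val_frac=0.1):
    """Shuffle deterministically, then cut into train/validation/test parts."""
    items = list(examples)
    random.Random(seed).shuffle(items)  # fixed seed makes the split reproducible
    n = len(items)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]

# 38,696 QA pairs, as in the dataset above.
train, val, test = split_dataset(range(38_696))
print(len(train), len(val), len(test))  # → 30956 3869 3871
```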

Training Parameters

| Parameter | Value |
|---|---|
| Base model | deepset/xlm-roberta-large-squad2 |
| Epochs | 3 |
| Learning rate | 3e-5 |
| Batch size | 4 |
| Max sequence length | 384 |
| Optimizer | AdamW |
| Warmup steps | 500 |
| Weight decay | 0.01 |
| Hardware | NVIDIA GPU |
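These hyperparameters map directly onto a Hugging Face `TrainingArguments` configuration. A sketch (argument names are the standard `transformers` ones; `output_dir` is an assumption, and the exact training script is not published):

```python
from transformers import TrainingArguments

# Values from the table above; max sequence length (384) is applied at
# tokenization time via max_length, not here. AdamW is the default optimizer.
training_args = TrainingArguments(
    output_dir="xlm-roberta-large-tatar-toponyms-qa",  # hypothetical path
    num_train_epochs=3,
    learning_rate=3e-5,
    per_device_train_batch_size=4,
    warmup_steps=500,
    weight_decay=0.01,
)
```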

💡 Usage Tips

  1. Context format: The model expects context in the format from the original dataset (with prefixes like "Название (рус):", "Объект:", etc.)
  2. No normalization needed: Unlike RuBERT models, XLM-RoBERTa doesn't require post-processing
  3. Question types: Works best with the 6 question types it was trained on
  4. Confidence scores: Can be used for answer filtering
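Tip 4 can be implemented as a simple threshold on the pipeline's `score` field. A sketch (the 0.5 threshold is an assumption to tune on your own data):

```python
def filter_answers(results, threshold=0.5):
    """Keep only pipeline answers whose confidence clears the threshold."""
    return [r for r in results if r["score"] >= threshold]

# Dicts shaped like the question-answering pipeline output.
results = [
    {"answer": "на р. Мелля, в 21 км к востоку от с. Сарманово", "score": 0.97},
    {"answer": "Село", "score": 0.31},
]
confident = filter_answers(results)
print([r["answer"] for r in confident])
```

Low-confidence answers can then be routed to a fallback (e.g. returning "no answer" or a human review queue) rather than shown to users.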

🔗 Related Resources

Models in Collection

| Model | Description | F1 Score |
|---|---|---|
| xlm-roberta-large-tatar-toponyms-qa | Best performing (this model) | 0.994 |
| rubert-base-tatar-toponyms-qa | Balanced model | 0.684 |
| rubert-large-tatar-toponyms-qa | Large version | 0.679 |

Datasets

  • Tatarstan Toponyms QA Dataset: https://huggingface.co/datasets/TatarNLPWorld/tatarstan-toponyms-qa
📝 Citation

If you use this model in your research, please cite:

@misc{xlm_roberta_large_tatar_toponyms_qa,
    author = {Arabov, Mullosharaf Kurbonvoich},
    title = {XLM-RoBERTa Large for Tatar Toponyms QA},
    year = {2026},
    publisher = {Hugging Face},
    journal = {Hugging Face Hub},
    howpublished = {\url{https://huggingface.co/TatarNLPWorld/xlm-roberta-large-tatar-toponyms-qa}}
}

For the dataset:

@dataset{tatarstan_toponyms_qa_2026,
    title = {Tatarstan Toponyms QA Dataset},
    author = {Arabov, Mullosharaf Kurbonvoich},
    year = {2026},
    publisher = {Hugging Face},
    url = {https://huggingface.co/datasets/TatarNLPWorld/tatarstan-toponyms-qa}
}

🎯 Intended Uses

  • Geographic information extraction from texts
  • Question answering systems for Tatarstan geography
  • Educational tools for learning about toponyms
  • NLP research on Tatar and Russian languages
  • Geocoding applications (with coordinate questions)

⚠️ Limitations

  • Trained only on Tatarstan region toponyms
  • Questions must be answerable from the provided context
  • Best performance on the 6 trained question types
  • Requires context in the specific format from the original dataset

🤝 Contributing

Contributions are welcome! Please:

  1. Open an issue for bugs or feature requests
  2. Submit PRs for improvements
  3. Share your use cases and results

📄 License

This model is released under the CC BY-SA 4.0 license.



📅 Version: 1.0.0 | 📅 Published: 2026-03-10 | 🏆 F1 Score: 0.994
