TatarNLPWorld
/

Tatar2Vec

ArabovMK committed on Mar 8

Commit 1f9b93f (verified) · 1 Parent(s): 56a98c5

Create README.md

Files changed (1)

README.md +145 -0

README.md ADDED

	@@ -0,0 +1,145 @@

+---
+license: mit
+language:
+- tt
+metrics:
+- accuracy
+base_model:
+- facebook/fasttext-language-identification
+---
+# Tatar2Vec: Word Embeddings for the Tatar Language
+This repository contains a collection of pre-trained word embedding models for the Tatar language. The models are trained on a large Tatar corpus using two popular algorithms: **Word2Vec** and **FastText**, with different architectures and vector sizes.
+All models are ready to use with the `gensim` library and can be easily downloaded via the Hugging Face Hub.
+## 📦 Available Models
+The following models are included:
+| Model Name          | Type      | Architecture | Vector Size | #Vectors | Notes |
+|---------------------|-----------|--------------|-------------|----------|-------|
+| `w2v_cbow_100`      | Word2Vec  | CBOW         | 100         | 1.29M    | Best overall for semantic analogy tasks |
+| `w2v_cbow_200`      | Word2Vec  | CBOW         | 200         | 1.29M    | Higher dimensionality, more expressive |
+| `w2v_sg_100`        | Word2Vec  | Skip-gram    | 100         | 1.29M    | Often better for rare words |
+| `ft_cbow_100`       | FastText  | CBOW         | 100         | 1.29M    | Handles subword information, good for morphology |
+| `ft_cbow_200`       | FastText  | CBOW         | 200         | 1.29M    | Larger FastText model |
+All models share the same vocabulary of **1,293,992** unique tokens, achieving **100% coverage** on the training corpus.
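The "handles subword information" note in the table refers to FastText building word vectors from character n-grams. The sketch below only illustrates how such n-grams are extracted. The `min_n=3` and `max_n=6` values are an assumption taken from the `gensim` defaults (the README does not state which settings were used), and `char_ngrams` is a hypothetical helper, not part of any library.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes from suffixes
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# A longer form of a word shares many n-grams with the base word, which is
# why a subword model can still produce a vector for an unseen form.
base = set(char_ngrams("татар"))
longer = set(char_ngrams("татарча"))
shared = base & longer
```

Because the two forms overlap in n-grams such as `<та` and `тат`, a subword model can place an unseen form near related known words, whereas the Word2Vec models have no vector for a token outside the vocabulary.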
+## 📁 Repository Structure
+The files are organised in subdirectories for easy access:
+```
+Tatar2Vec/
+├── word2vec/
+│   ├── cbow100/          # w2v_cbow_100 model files
+│   ├── cbow200/          # w2v_cbow_200 model files
+│   └── sg100/            # w2v_sg_100 model files
+└── fasttext/
+    ├── cbow100/          # ft_cbow_100 model files
+    └── cbow200/          # ft_cbow_200 model files
+```
+Each model folder contains the files saved by `gensim` (`.model`, `.npy` vectors, etc.).
+## 🚀 Usage
+### Installation
+First, install the required libraries:
+```bash
+pip install huggingface_hub gensim
+```
+### Download a Model
+Use `snapshot_download` to download all files of a specific model to a local directory:
+```python
+from huggingface_hub import snapshot_download
+import gensim
+import os
+# Download the best Word2Vec CBOW 100 model
+model_path = snapshot_download(
+    repo_id="TatarNLPWorld/Tatar2Vec",
+    allow_patterns="word2vec/cbow100/*",   # only download this model
+    local_dir="./tatar2vec_cbow100"        # optional local folder
+)
+# Load the model with gensim
+model_file = os.path.join(model_path, "word2vec/cbow100/w2v_cbow_100.model")
+model = gensim.models.Word2Vec.load(model_file)
+# Test it
+print(model.wv.most_similar("татар"))
+```
+Alternatively, you can download the whole repository with `snapshot_download` (omit `allow_patterns`) or fetch individual files with `hf_hub_download`.
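A minimal sketch of the per-file route with `hf_hub_download`, assuming each model folder holds a `.model` file plus companion `.npy` arrays (as the Repository Structure section describes), so every file under the folder prefix is fetched. `files_under` and `download_model_files` are hypothetical helper names introduced here for illustration.

```python
REPO_ID = "TatarNLPWorld/Tatar2Vec"


def files_under(all_files, prefix):
    """Keep only the repository paths that live under `prefix`."""
    return sorted(path for path in all_files if path.startswith(prefix))


def download_model_files(prefix, local_dir, repo_id=REPO_ID):
    """Download every file under `prefix` one by one; return the local paths."""
    # Imported lazily so files_under() works without the library installed.
    from huggingface_hub import hf_hub_download, list_repo_files

    wanted = files_under(list_repo_files(repo_id), prefix)
    return [
        hf_hub_download(repo_id=repo_id, filename=name, local_dir=local_dir)
        for name in wanted
    ]


# Example (needs network access):
# paths = download_model_files("word2vec/cbow100/", "./tatar2vec_cbow100")
```

Fetching only the `.model` file may not be enough to load a large `gensim` model, which is why the sketch collects the whole folder rather than a single filename.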
+## 📊 Model Comparison
+We evaluated all models on a set of intrinsic tasks:
+- **Word analogies** (e.g., `Мәскәү:Россия = Казан:?`)
+- **Semantic similarity** (cosine similarity of related word pairs)
+- **Out-of-vocabulary (OOV)** handling (for FastText)
+- **Nearest neighbours inspection**
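The first two tasks can be illustrated with a small, self-contained sketch. It uses invented 2-dimensional toy vectors (not values from the released models) and plain vector offsets. The README does not describe the evaluation script, so this shows the general technique rather than the authors' exact procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(vectors, a, b, c):
    """Answer `a : b = c : ?` by ranking words against the offset b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))  # skip query words
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy vectors invented for illustration only.
toy = {
    "Мәскәү": [1.0, 0.0],
    "Россия": [1.0, 1.0],
    "Казан": [2.0, 0.0],
    "Татарстан": [2.0, 1.0],
    "китап": [-1.0, 0.5],
}
answer = solve_analogy(toy, "Мәскәү", "Россия", "Казан")
```

With a loaded `gensim` model, the comparable query would be `model.wv.most_similar(positive=["Россия", "Казан"], negative=["Мәскәү"])`, which normalises the vectors before combining them.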
+The **Word2Vec CBOW (100-dim)** model performed best overall, especially on analogy tasks (60% accuracy vs. 0% for FastText). Below is a summary of the key metrics:
+| Metric                   | Word2Vec (cbow100) | FastText (cbow100) |
+|--------------------------|--------------------|--------------------|
+| Analogy accuracy         | 60.0%              | 0.0%               |
+| Avg. semantic similarity | 0.568              | 0.582              |
+| OOV handling             | N/A                | Good (subword)     |
+| Vocabulary coverage      | 100%               | 100%               |
+| Training time            | 1760 s             | 3323 s             |
+**Why Word2Vec?** It produces cleaner nearest neighbours (actual words without punctuation artifacts) and far higher analogy accuracy (60.0% vs. 0.0%). FastText scores slightly higher on average semantic similarity (0.582 vs. 0.568) and can handle OOV words, but it tends to return noisy forms with attached punctuation.
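If the FastText models are used anyway, one possible workaround for the noisy neighbours is to filter them after the query. The sketch below is not part of the repository: `is_clean` and `clean_neighbours` are hypothetical helpers, and the sample list with its scores is made up to mimic the `(word, similarity)` pairs that `most_similar` returns.

```python
import unicodedata


def is_clean(token):
    """True if the token neither starts nor ends with a punctuation character."""
    def is_punct(ch):
        # Unicode general categories starting with "P" are punctuation.
        return unicodedata.category(ch).startswith("P")

    return bool(token) and not is_punct(token[0]) and not is_punct(token[-1])


def clean_neighbours(neighbours):
    """Drop (word, score) pairs whose word carries attached punctuation."""
    return [(word, score) for word, score in neighbours if is_clean(word)]


# Made-up neighbour list for illustration only.
sample = [("татарлар", 0.81), ("татар,", 0.80), ("«татар", 0.78), ("татарча", 0.75)]
filtered = clean_neighbours(sample)
```

Only the first and last characters are checked, so hyphenated words are kept.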
+For a detailed report, see the [model comparison results](model_comparison_report.md) (included in the repository).
+## 📝 License
+All models are released under the **MIT License**. You are free to use, modify, and distribute them for any purpose, provided the copyright and license notice are retained.
+## 📜 Certificate
+This software (Tatar2Vec) is registered with the Federal Service for Intellectual Property (Rospatent) under the following certificate:
+- **Certificate number**: 2026610619
+- **Title**: Tatar2Vec
+- **Filing date**: December 23, 2025
+- **Publication date**: January 14, 2026
+- **Author**: Mullosharaf K. Arabov
+- **Applicant**: Kazan Federal University
+*Certificate of state registration of a computer program No. 2026610619, Russian Federation. Tatar2Vec: filed 23.12.2025; published 14.01.2026 / M. K. Arabov; applicant: Federal State Autonomous Educational Institution of Higher Education "Kazan Federal University".*
+## 🤝 Citation
+If you use these models in your research, please cite the software registration:
+```bibtex
+@software{tatar2vec_2026,
+    title = {Tatar2Vec},
+    author = {Arabov, Mullosharaf Kurbonvoich},
+    year = {2026},
+    publisher = {Kazan Federal University},
+    note = {Registered software, Certificate No. 2026610619},
+    url = {https://huggingface.co/TatarNLPWorld/Tatar2Vec}
+}
+```
+## 🌐 Language
+The models are trained on Tatar text and are intended for use with the Tatar language (language code `tt`).
+## 🙌 Acknowledgements
+These models were trained by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld) as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.