---
license: mit
language:
- tt
metrics:
- accuracy
base_model:
- facebook/fasttext-language-identification
---

# Tatar2Vec: Word Embeddings for the Tatar Language

This repository contains a collection of pre-trained word embedding models for the Tatar language. The models are trained on a large Tatar corpus using two popular algorithms: **Word2Vec** and **FastText**, with different architectures and vector sizes.

All models are ready to use with the `gensim` library and can be easily downloaded via the Hugging Face Hub.

## 📦 Available Models

The following models are included:

| Model Name | Type | Architecture | Vector Size | #Vectors | Notes |
|------------|------|--------------|-------------|----------|-------|
| `w2v_cbow_100` | Word2Vec | CBOW | 100 | 1.29M | Best overall for semantic analogy tasks |
| `w2v_cbow_200` | Word2Vec | CBOW | 200 | 1.29M | Higher dimensionality, more expressive |
| `w2v_sg_100` | Word2Vec | Skip-gram | 100 | 1.29M | Often better for rare words |
| `ft_cbow_100` | FastText | CBOW | 100 | 1.29M | Handles subword information, good for morphology |
| `ft_cbow_200` | FastText | CBOW | 200 | 1.29M | Larger FastText model |

All models share the same vocabulary of **1,293,992** unique tokens, achieving **100% coverage** on the training corpus.
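
Coverage here means that every token of the training corpus has a vector in the shared vocabulary. As a rough illustration of how such a figure is computed (the function name and toy data below are ours, not part of the repository), it is simply the fraction of corpus tokens found in the vocabulary:

```python
def vocabulary_coverage(corpus_tokens, vocab):
    """Fraction of corpus tokens that have a vector in the model vocabulary."""
    if not corpus_tokens:
        return 0.0
    covered = sum(1 for token in corpus_tokens if token in vocab)
    return covered / len(corpus_tokens)

# Toy example: every corpus token is in the vocabulary -> 100% coverage
tokens = ["татар", "тел", "татар", "сүз"]
vocab = {"татар", "тел", "сүз"}
print(vocabulary_coverage(tokens, vocab))  # → 1.0
```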

## 📁 Repository Structure

The files are organised in subdirectories for easy access:

```
Tatar2Vec/
├── word2vec/
│   ├── cbow100/   # w2v_cbow_100 model files
│   ├── cbow200/   # w2v_cbow_200 model files
│   └── sg100/     # w2v_sg_100 model files
└── fasttext/
    ├── cbow100/   # ft_cbow_100 model files
    └── cbow200/   # ft_cbow_200 model files
```

Each model folder contains the files saved by `gensim` (`.model`, `.npy` vectors, etc.).

## 🚀 Usage

### Installation

First, install the required libraries:

```bash
pip install huggingface_hub gensim
```

### Download a Model

Use `snapshot_download` to download all files of a specific model to a local directory:

```python
from huggingface_hub import snapshot_download
import gensim
import os

# Download the best Word2Vec CBOW 100 model
model_path = snapshot_download(
    repo_id="TatarNLPWorld/Tatar2Vec",
    allow_patterns="word2vec/cbow100/*",  # only download this model
    local_dir="./tatar2vec_cbow100",      # optional local folder
)

# Load the model with gensim
model_file = os.path.join(model_path, "word2vec/cbow100/w2v_cbow_100.model")
model = gensim.models.Word2Vec.load(model_file)

# Test it
print(model.wv.most_similar("татар"))
```

Alternatively, you can download the whole repository or individual files using `hf_hub_download`.

## 📊 Model Comparison

We evaluated all models on a set of intrinsic tasks:

- **Word analogies** (e.g., `Мәскәү:Россия = Казан:?`)
- **Semantic similarity** (cosine similarity of related word pairs)
- **Out-of-vocabulary (OOV)** handling (for FastText)
- **Nearest-neighbour inspection**
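
Analogy tasks like the one above are typically scored with the vector-offset (3CosAdd) method: the answer to `a:b = c:?` is the vocabulary word whose vector is closest to `b - a + c`. A minimal numpy sketch with hand-built toy vectors (the vectors and vocabulary are illustrative only, not taken from the released models):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def solve_analogy(vectors, a, b, c):
    """Answer a:b = c:? via the 3CosAdd vector-offset method."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Toy 2-D vectors built so that capital -> country is a shared offset
vectors = {
    "Мәскәү":    np.array([1.0, 0.0]),
    "Россия":    np.array([1.0, 1.0]),
    "Казан":     np.array([2.0, 0.0]),
    "Татарстан": np.array([2.0, 1.0]),
    "тел":       np.array([0.0, 2.0]),  # distractor word
}
print(solve_analogy(vectors, "Мәскәү", "Россия", "Казан"))  # → Татарстан
```

Real evaluations run this over a benchmark of analogy quadruples and report the fraction answered correctly, which is what the accuracy figures below measure.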

The **Word2Vec CBOW (100-dim)** model performed best overall, especially on analogy tasks (60% accuracy vs. 0% for FastText). Below is a summary of the key metrics:

| Metric | Word2Vec (cbow100) | FastText (cbow100) |
|--------|--------------------|--------------------|
| Analogy accuracy | 60.0% | 0.0% |
| Avg. semantic similarity | 0.568 | 0.582 |
| OOV handling | N/A | Good (subword) |
| Vocabulary coverage | 100% | 100% |
| Training time | 1760 s | 3323 s |

**Why Word2Vec?** It produces cleaner nearest neighbours (actual words without punctuation artifacts) and captures semantic relationships more accurately. FastText, while slightly better on raw similarity, tends to return noisy forms with attached punctuation.
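
Punctuation artifacts like these usually stem from imperfect tokenisation of the training corpus. One pragmatic mitigation on the query side (our suggestion, not something shipped with the models) is to strip surrounding punctuation from tokens before looking them up:

```python
import string

# Punctuation to strip; extend with corpus-specific symbols as needed
PUNCT = string.punctuation + "«»…"

def normalize_token(token: str) -> str:
    """Lower-case a token and strip surrounding punctuation."""
    return token.strip(PUNCT).lower()

print(normalize_token("Татар,"))  # → татар
print(normalize_token("«тел»"))   # → тел
```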

For a detailed report, see the [model comparison results](model_comparison_report.md) (included in the repository).

## 📝 License

All models are released under the **MIT License**. You are free to use, modify, and distribute them for any purpose, with proper attribution.

## 📜 Certificate

This software (Tatar2Vec) is registered with the Federal Service for Intellectual Property (Rospatent) under the following certificate:

- **Certificate number**: 2026610619
- **Title**: Tatar2Vec
- **Filing date**: December 23, 2025
- **Publication date**: January 14, 2026
- **Author**: Mullosharaf K. Arabov
- **Applicant**: Kazan Federal University

*Certificate of state registration of a computer program No. 2026610619, Russian Federation. Tatar2Vec : filed 23.12.2025 : published 14.01.2026 / M. K. Arabov ; applicant: Federal State Autonomous Educational Institution of Higher Education "Kazan Federal University".*

## 🤝 Citation

If you use these models in your research, please cite the software registration:

```bibtex
@software{tatar2vec_2026,
  title     = {Tatar2Vec},
  author    = {Arabov, Mullosharaf Kurbonvoich},
  year      = {2026},
  publisher = {Kazan Federal University},
  note      = {Registered software, Certificate No. 2026610619},
  url       = {https://huggingface.co/TatarNLPWorld/Tatar2Vec}
}
```

## 🌐 Language

The models are trained on Tatar text and are intended for use with the Tatar language (language code `tt`).

## 🙌 Acknowledgements

These models were trained by [TatarNLPWorld](https://huggingface.co/TatarNLPWorld) as part of an effort to advance NLP resources for the Tatar language. We thank the open-source community for the tools and libraries that made this work possible.