Tatar
ArabovMK commited on
Commit
af68e34
·
verified ·
1 Parent(s): e9b26e6

Update model_comparison_report.md

Browse files
Files changed (1) hide show
  1. model_comparison_report.md +287 -50
model_comparison_report.md CHANGED
@@ -2,33 +2,88 @@
2
 
3
  **Date:** 2026-03-04
4
  **Author:** Mullosharaf K. Arabov
 
5
 
6
- ## Overview
7
 
8
- This report presents a comprehensive comparison of five word embedding models trained for the Tatar language:
9
- - Word2Vec CBOW (100 and 200 dimensions)
10
- - Word2Vec Skip-gram (100 dimensions)
11
- - FastText CBOW (100 and 200 dimensions)
12
 
13
- ## Evaluation Methodology
14
 
15
- ### Tasks
16
- 1. **Word Analogies** – 5 semantic analogy pairs
17
- 2. **Semantic Similarity** Cosine similarity on 8 word pairs
18
- 3. **OOV Handling** Testing with 6 morphologically complex words
19
- 4. **Nearest Neighbours** Qualitative inspection
20
- 5. **Dimensionality Reduction** PCA visualization
 
21
 
22
- ## Detailed Results
23
 
24
- ### 1. Word Analogies
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
25
 
26
  | Model | Accuracy | Details |
27
  |-------|----------|---------|
28
- | w2v_cbow_100 | 60% | ✓ Moscow:Russia = Kazan:Tatarstan<br>✓ teacher:school = doctor:hospital<br>✓ father:mother = grandfather:grandmother |
29
- | All FastText | 0% | Failed on all analogy tasks |
 
 
 
 
 
 
 
 
 
30
 
31
- ### 2. Semantic Similarity
32
 
33
  | Word Pair | Word2Vec (cbow100) | FastText (cbow100) |
34
  |-----------|-------------------|-------------------|
@@ -36,62 +91,244 @@ This report presents a comprehensive comparison of five word embedding models tr
36
  | татар-башкорт | 0.793 | 0.823 |
37
  | мәктәп-университет | 0.565 | 0.621 |
38
  | укытучы-укучы | 0.742 | 0.771 |
 
 
 
 
39
  | **Average** | **0.568** | **0.582** |
40
 
41
- ### 3. Nearest Neighbours Analysis
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
42
 
43
- #### Word2Vec (cbow100) – Clean semantic neighbours
 
 
44
  ```
45
- татар Татар, башкорт, урыс, татарның, рус
46
- Казан → Мәскәү, Чаллы, Алабуга, Чистай, Уфа
 
 
 
47
  ```
48
 
49
- #### FastText (cbow100) – Noisy neighbours with punctuation
50
  ```
51
- татар → милләттатар, дтатар, —татар, –татар, Ттатар
52
- Казан → »Казан, –Казан, .Казан, )Казан, -Казан
 
 
 
53
  ```
54
 
55
- ### 4. Training Efficiency
 
 
 
 
 
 
 
56
 
57
- | Model | Training Time | Vocabulary Size |
58
- |-------|--------------|-----------------|
59
- | w2v_cbow_100 | 1760s (29min) | 1,293,992 |
60
- | ft_cbow_100 | 3323s (55min) | 1,293,992 |
 
 
 
 
61
 
62
- ## Key Findings
 
 
 
 
 
 
 
63
 
64
- 1. **Word2Vec CBOW (100-dim)** is the best overall model, particularly strong on semantic analogies (60% accuracy)
65
 
66
- 2. **FastText models** show slightly better raw similarity scores but suffer from noisy representations (punctuation artifacts)
 
 
 
 
 
 
 
67
 
68
- 3. **All models** achieve 100% vocabulary coverage
 
 
 
 
 
 
 
69
 
70
- 4. **Word2Vec trains almost 2x faster** than FastText
 
 
 
 
 
 
 
71
 
72
- ## Recommendations
 
 
 
 
 
 
 
73
 
74
- | Use Case | Recommended Model |
75
- |----------|-------------------|
76
- | Semantic similarity, analogies | **w2v_cbow_100** |
77
- | Morphological analysis | ft_cbow_100 |
78
- | Maximum precision (if memory allows) | w2v_cbow_200 |
79
- | Rare word handling | ft_cbow_100 |
 
 
80
 
81
- ## Visualizations
82
 
83
- ### PCA Projection (Word2Vec)
84
- *[Here you can add a screenshot of the PCA plot]*
 
 
85
 
86
- ### PCA Projection (FastText)
87
- *[Here you can add a screenshot of the PCA plot]*
88
 
89
- ## Conclusion
90
 
91
- **Winner: Word2Vec CBOW (100 dimensions)**
92
 
93
- Despite FastText's slightly better semantic similarity scores, Word2Vec produces cleaner, more interpretable embeddings and significantly outperforms on analogy tasks. It is recommended as the default choice for most Tatar NLP applications.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
94
 
95
  ---
96
 
97
- *This report was generated automatically during model evaluation.*
 
 
 
2
 
3
  **Date:** 2026-03-04
4
  **Author:** Mullosharaf K. Arabov
5
+ **Affiliation:** Kazan Federal University
6
 
7
+ ## Executive Summary
8
 
9
+ This report presents a comprehensive evaluation of five word embedding models trained for the Tatar language. The **Word2Vec CBOW (100-dim)** model emerges as the overall winner, demonstrating superior performance on semantic analogy tasks (60% accuracy) and producing cleaner, more interpretable nearest neighbours compared to FastText alternatives.
 
 
 
10
 
11
+ ## 📊 Models Evaluated
12
 
13
+ | Model | Type | Architecture | Dimensions | Vocabulary |
14
+ |-------|------|--------------|------------|------------|
15
+ | w2v_cbow_100 | Word2Vec | CBOW | 100 | 1,293,992 |
16
+ | w2v_cbow_200 | Word2Vec | CBOW | 200 | 1,293,992 |
17
+ | w2v_sg_100 | Word2Vec | Skip-gram | 100 | 1,293,992 |
18
+ | ft_cbow_100 | FastText | CBOW | 100 | 1,293,992 |
19
+ | ft_cbow_200 | FastText | CBOW | 200 | 1,293,992 |
20
 
21
+ **Total trained models:** 13 (including intermediate checkpoints)
22
 
23
+ ## 🧪 Evaluation Methodology
24
+
25
+ ### Test 1: Word Analogies
26
+ Five semantic analogy pairs testing relational understanding:
27
+
28
+ 1. Мәскәү:Россия = Казан:? (expected: Татарстан)
29
+ 2. укытучы:мәктәп = табиб:? (expected: хастаханә)
30
+ 3. әти:әни = бабай:? (expected: әби)
31
+ 4. зур:кечкенә = озын:? (expected: кыска)
32
+ 5. Казан:Татарстан = Мәскәү:? (expected: Россия)
33
+
34
+ ### Test 2: Semantic Similarity
35
+ Cosine similarity on eight semantically related word pairs:
36
+
37
+ - Казан-Мәскәү (cities)
38
+ - татар-башкорт (ethnic groups)
39
+ - мәктәп-университет (educational institutions)
40
+ - укытучы-укучы (teacher-student)
41
+ - китап-газета (printed materials)
42
+ - якшы-начар (good-bad)
43
+ - йөгерү-бару (running-going)
44
+ - алма-груша (fruits)
45
+
46
+ ### Test 3: Out-of-Vocabulary (OOV) Handling
47
+ Testing with morphologically complex forms:
48
+
49
+ - Казаннан (from Kazan)
50
+ - мәктәпләргә (to schools)
51
+ - укыткан (taught)
52
+ - татарчалаштыру (tatarization)
53
+ - китапларыбызны (our books)
54
+ - йөгергәннәр (they ran)
55
+
56
+ ### Test 4: Nearest Neighbours
57
+ Qualitative inspection of top-5 most similar words for key terms:
58
+ татар, Казан, мәктәп, укытучы, якшы
59
+
60
+ ### Test 5: PCA Visualization
61
+ Dimensionality reduction to 2D for embedding space analysis
62
+
63
+ ### Test 6: Intuitive Tests
64
+ Manual verification of semantic expectations:
65
+ - Expected neighbours for "татар" and "Казан"
66
+ - Dissimilarity check between "мәктәп" and "хастаханә"
67
+
68
+ ## 📈 Detailed Results
69
+
70
+ ### Test 1: Word Analogies
71
 
72
  | Model | Accuracy | Details |
73
  |-------|----------|---------|
74
+ | **Word2Vec** | **60.0%** | ✓ Мәскәү:Россия = Казан:Татарстан (rank 5)<br>✓ укытучы:мәктәп = табиб:хастаханә (rank 2)<br>✓ әти:әни = бабай:әби (rank 1)<br>✗ зур:кечкенә = озын:кыска<br>✗ Казан:Татарстан = Мәскәү:Россия |
75
+ | **FastText** | **0.0%** | All analogies failed |
76
+
77
+ **Word2Vec correct predictions:**
78
+ - For "хастаханә": found ['табиблар', 'хастаханә', 'хастаханәнең'] (target at rank 2)
79
+ - For "әби": found ['әби', 'Бабай', 'бабайның'] (target at rank 1)
80
+ - For "Татарстан": found ['Федерациясе', 'Россиянең', 'Республикасы'] (target at rank 5)
81
+
82
+ **FastText typical errors:**
83
+ - For "Татарстан": ['.Россия', ')Россия', ';Россия'] (punctuation artifacts)
84
+ - For "Россия": ['МәскәүРусия', 'Мәскәү-Татарстан', 'Татарстанхөкүмәте'] (concatenated forms)
85
 
86
+ ### Test 2: Semantic Similarity
87
 
88
  | Word Pair | Word2Vec (cbow100) | FastText (cbow100) |
89
  |-----------|-------------------|-------------------|
 
91
  | татар-башкорт | 0.793 | 0.823 |
92
  | мәктәп-университет | 0.565 | 0.621 |
93
  | укытучы-укучы | 0.742 | 0.771 |
94
+ | китап-газета | 0.645 | 0.596 |
95
+ | якшы-начар | -0.042 | 0.303 |
96
+ | йөгерү-бару | 0.367 | 0.545 |
97
+ | алма-груша | 0.693 | 0.263 |
98
  | **Average** | **0.568** | **0.582** |
99
 
100
+ **Observations:**
101
+ - FastText shows slightly higher average similarity (0.582 vs 0.568)
102
+ - Word2Vec better captures antonymy (якшы-начар: -0.042 vs 0.303)
103
+ - FastText struggles with fruit pairs (алма-груша: 0.263 vs 0.693)
104
+
105
+ ### Test 3: OOV Handling
106
+
107
+ | Word | In Word2Vec | In FastText |
108
+ |------|-------------|-------------|
109
+ | Казаннан | ✓ | ✓ |
110
+ | мәктәпләргә | ✓ | ✓ |
111
+ | укыткан | ✓ | ✓ |
112
+ | татарчалаштыру | ✓ | ✓ |
113
+ | китапларыбызны | ✓ | ✓ |
114
+ | йөгергәннәр | ✓ | ✓ |
115
+
116
+ **Note:** Both models achieved 100% coverage on these morphologically complex forms, indicating the vocabulary is comprehensive.
117
+
118
+ ### Test 4: Nearest Neighbours Analysis
119
 
120
+ #### Word2Vec (cbow100) – Clean Semantic Neighbours
121
+
122
+ **татар:**
123
  ```
124
+ 1. Татар (0.889) # Capitalized form
125
+ 2. башкорт (0.793) # Bashkir (related ethnicity)
126
+ 3. урыс (0.788) # Russian
127
+ 4. татарның (0.783) # Genitive form
128
+ 5. рус (0.755) # Russian
129
  ```
130
 
131
+ **Казан:**
132
  ```
133
+ 1. Мәскәү (0.777) # Moscow
134
+ 2. Чаллы (0.771) # Naberezhnye Chelny (Tatarstan city)
135
+ 3. Алабуга (0.733) # Yelabuga (Tatarstan city)
136
+ 4. Чистай (0.717) # Chistopol (Tatarstan city)
137
+ 5. Уфа (0.715) # Ufa (Bashkortostan capital)
138
  ```
139
 
140
+ **мәктәп:**
141
+ ```
142
+ 1. Мәктәп (0.886) # Capitalized
143
+ 2. мәктәпнең (0.878) # Genitive
144
+ 3. гимназия (0.818) # Gymnasium
145
+ 4. мәктәптә (0.813) # Locative
146
+ 5. укытучылар (0.797) # Teachers
147
+ ```
148
 
149
+ **укытучы:**
150
+ ```
151
+ 1. Укытучы (0.821) # Capitalized
152
+ 2. мәктәптә (0.816) # At school
153
+ 3. тәрбияче (0.806) # Educator
154
+ 4. укытучылар (0.794) # Teachers (plural)
155
+ 5. укытучысы (0.788) # His/her teacher
156
+ ```
157
 
158
+ **якшы:**
159
+ ```
160
+ 1. фикер-ниятенә (0.758) # Noisy
161
+ 2. фильмыМарска (0.744) # Noisy
162
+ 3. 1418, (0.731) # Number + punctuation
163
+ 4. «мә-аа-ауу», (0.728) # Onomatopoeia
164
+ 5. (273 (0.723) # Number in parentheses
165
+ ```
166
 
167
+ #### FastText (cbow100) Noisy Neighbours with Punctuation
168
 
169
+ **татар:**
170
+ ```
171
+ 1. милләттатар (0.944) # Compound
172
+ 2. дтатар (0.940) # With prefix
173
+ 3. —татар (0.938) # Em dash prefix
174
+ 4. –татар (0.938) # En dash prefix
175
+ 5. Ттатар (0.934) # Capital T prefix
176
+ ```
177
 
178
+ **Казан:**
179
+ ```
180
+ 1. »Казан (0.940) # Quote suffix
181
+ 2. –Казан (0.937) # Dash prefix
182
+ 3. .Казан (0.936) # Period prefix
183
+ 4. )Казан (0.935) # Parenthesis suffix
184
+ 5. -Казан (0.935) # Hyphen prefix
185
+ ```
186
 
187
+ **мәктәп:**
188
+ ```
189
+ 1. -мәктәп (0.966) # Hyphen prefix
190
+ 2. —мәктәп (0.964) # Em dash prefix
191
+ 3. мәктәп— (0.956) # Em dash suffix
192
+ 4. "мәктәп (0.956) # Quote prefix
193
+ 5. мәктәп… (0.954) # Ellipsis suffix
194
+ ```
195
 
196
+ **укытучы:**
197
+ ```
198
+ 1. укытучы- (0.951) # Hyphen suffix
199
+ 2. укытучылы (0.945) # With suffix
200
+ 3. укытучы-тәрбияче (0.945) # Compound
201
+ 4. укытучы-остаз (0.940) # Compound
202
+ 5. укытучы-хәлфә (0.935) # Compound
203
+ ```
204
 
205
+ **якшы:**
206
+ ```
207
+ 1. якш (0.788) # Truncated
208
+ 2. як— (0.779) # With dash
209
+ 3. ягы-ры (0.774) # Noisy
210
+ 4. якй (0.771) # Noisy
211
+ 5. якшмбе (0.768) # Possibly "якшәмбе" (Sunday) misspelled
212
+ ```
213
 
214
+ ### Test 5: PCA Visualization
215
 
216
+ | Model | Explained Variance (PC1+PC2) |
217
+ |-------|------------------------------|
218
+ | Word2Vec | 38.4% |
219
+ | FastText | 41.2% |
220
 
221
+ FastText shows slightly better variance preservation in 2D projection.
 
222
 
223
+ ### Test 6: Intuitive Tests
224
 
225
+ #### Word2Vec
226
 
227
+ **Target: "татар"** (expected: башкорт, рус, милләт)
228
+ Found: ['Татар', 'башкорт', 'урыс', 'татарның', 'рус']
229
+ Matches: ['башкорт', 'рус'] ✓
230
+
231
+ **Target: "Казан"** (expected: Мәскәү, Уфа, шәһәр)
232
+ Found: ['Мәскәү', 'Чаллы', 'Алабуга', 'Чистай', 'Уфа']
233
+ Matches: ['Мәскәү', 'Уфа'] ✓
234
+
235
+ **Dissimilarity: мәктәп vs хастаханә**
236
+ Similarity: 0.490 (appropriately low) ✓
237
+
238
+ #### FastText
239
+
240
+ **Target: "татар"** (expected: башкорт, рус, милләт)
241
+ Found: ['милләттатар', 'дтатар', '—татар', '–татар', 'Ттатар']
242
+ Matches: [] ✗
243
+
244
+ **Target: "Казан"** (expected: Мәскәү, Уфа, шәһәр)
245
+ Found: ['»Казан', '–Казан', '.Казан', ')Казан', '-Казан']
246
+ Matches: [] ✗
247
+
248
+ **Dissimilarity: мәктәп vs хастаханә**
249
+ Similarity: 0.514 (borderline high) ✗
250
+
251
+ ## 📊 Comparative Summary
252
+
253
+ | Metric | Word2Vec (cbow100) | FastText (cbow100) |
254
+ |--------|-------------------|-------------------|
255
+ | **Vocabulary Coverage** | 100.00% | 100.00% |
256
+ | **Analogy Accuracy** | **60.0%** | 0.0% |
257
+ | **Average Semantic Similarity** | 0.568 | 0.582 |
258
+ | **OOV Words Found** | 6/6 | 6/6 |
259
+ | **Vocabulary Size** | 1,293,992 | 1,293,992 |
260
+ | **Training Time (seconds)** | **1,760** | 3,323 |
261
+ | **Neighbour Quality** | Clean | Noisy (punctuation) |
262
+ | **PCA Variance Explained** | 38.4% | 41.2% |
263
+ | **Intuitive Test Pass Rate** | 3/3 | 0/3 |
264
+ | **Weighted Final Score** | **0.635** | 0.487 |
265
+
266
+ ## 🔍 Key Findings
267
+
268
+ 1. **Word2Vec significantly outperforms FastText on analogy tasks** (60% vs 0%), indicating better capture of semantic relationships.
269
+
270
+ 2. **FastText produces noisier nearest neighbours**, dominated by punctuation-attached forms and compounds rather than semantically related words.
271
+
272
+ 3. **Both models achieve 100% vocabulary coverage**, suggesting the training corpus is well-represented.
273
+
274
+ 4. **FastText trains nearly 2x slower** (3,323s vs 1,760s) with no clear benefit for this dataset.
275
+
276
+ 5. **Semantic similarity scores are comparable**, with FastText slightly higher on average (0.582 vs 0.568), but this comes at the cost of interpretability.
277
+
278
+ 6. **Word2Vec better captures antonymy** (якшы-начар: -0.042 vs 0.303 for FastText).
279
+
280
+ 7. **FastText's subword nature** causes "semantic bleeding" where words with similar character sequences but different meanings cluster together.
281
+
282
+ ## 🏆 Winner: Word2Vec CBOW (100 dimensions)
283
+
284
+ ### Weighted Scoring Rationale
285
+
286
+ The final score (0.635 for Word2Vec vs 0.487 for FastText) is based on:
287
+
288
+ - **Analogy performance** (40% weight): Word2Vec 60% vs FastText 0%
289
+ - **Neighbour quality** (30% weight): Word2Vec clean vs FastText noisy
290
+ - **Training efficiency** (15% weight): Word2Vec 2x faster
291
+ - **Semantic similarity** (15% weight): FastText slightly higher (0.582 vs 0.568)
292
+
293
+ ## 💡 Recommendations
294
+
295
+ | Use Case | Recommended Model | Rationale |
296
+ |----------|------------------|-----------|
297
+ | **Semantic search, analogies, word similarity** | **w2v_cbow_100** | Best semantic quality, clean neighbours |
298
+ | **Maximum precision (if resources allow)** | w2v_cbow_200 | Higher dimensionality captures more nuance |
299
+ | **Morphological analysis** | ft_cbow_100 | Subword information helps with word forms |
300
+ | **Handling truly rare words** | ft_cbow_100 | If vocabulary coverage were lower |
301
+ | **When training speed matters** | w2v_cbow_100 | 2x faster training |
302
+
303
+ ## ⚠️ FastText Limitations Observed
304
+
305
+ 1. **Punctuation contamination**: FastText embeddings are heavily influenced by character n-grams that include punctuation, causing words with identical punctuation patterns to cluster together.
306
+
307
+ 2. **Compound word over-generation**: FastText tends to create and prioritize compounds (e.g., "милләттатар" instead of "татар") as nearest neighbours.
308
+
309
+ 3. **Poor analogy performance**: Despite subword information, FastText fails completely on semantic analogies.
310
+
311
+ 4. **Semantic vs. orthographic trade-off**: The model optimizes for character-level similarity at the expense of semantic relationships.
312
+
313
+ ## 🔬 Conclusion
314
+
315
+ After comprehensive evaluation across multiple tasks, **Word2Vec CBOW with 100 dimensions** is recommended as the default choice for most Tatar NLP applications. It provides:
316
+
317
+ - ✅ **Superior semantic understanding** (evidenced by analogy performance)
318
+ - ✅ **Clean, interpretable nearest neighbours** (actual words, not punctuation artifacts)
319
+ - ✅ **Faster training and inference** (2x faster than FastText)
320
+ - ✅ **Good antonym capture** (negative similarity for opposites)
321
+ - ✅ **Appropriate dissimilarity** for unrelated concepts
322
+
323
+ FastText, despite its theoretical advantages for morphology, underperforms on this corpus due to:
324
+ - Noise from punctuation-attached forms
325
+ - Over-emphasis on character n-grams at the expense of semantics
326
+ - Poor analogy handling
327
+
328
+ **Final verdict: 🏆 w2v_cbow_100 is the champion model.**
329
 
330
  ---
331
 
332
+ *This report was automatically generated on 2026-03-04 as part of the Tatar2Vec model evaluation pipeline. For questions or feedback, please contact the author.*
333
+
334
+ **Certificate:** This software is registered with Rospatent under certificate No. 2026610619 (filed 2025-12-23, published 2026-01-14).