---
language:
- ug
license: mit
library_name: kenlm
tags:
- uyghur
- input-method
- nlp
- character-level
datasets:
- custom
metrics:
- perplexity
---
# Uyghur-Character-Level-KenLM-Input-Method (سىناق نۇسخىسى)
An intelligent input prediction engine specifically designed for the **Uyghur language**. It combines traditional corpus-based prefix searching with high-performance **KenLM (N-gram)** language models to achieve real-time mapping from Latin characters to Uyghur script with probabilistic ranking.
<p dir="rtl" align="right">ئۇيغۇرتىلى ئۈچۈن مەخسۇس لايىھەلەنگەن ئەقلىي كىرگۈزۈش سىناق تۈرى</p>
---
## 🌟 Why Character-Level?
Uyghur is a highly **agglutinative language**: a single root can produce dozens of different forms through the addition of suffixes.
- **Example**:
  - `مەك-تەپ` (school)
  - `مەك-تەپ-لى-رى-مىز` (our schools)
Traditional word-level n-gram models often suffer from data sparsity and poor generalization in Uyghur, because every inflected form counts as a separate vocabulary item. This project instead uses a **character-level n-gram** approach:
- **Root & Suffix Learning**: Automatically learns the relationships between word roots and various suffixes.
- **Superior Generalization**: Handles "Out-of-Vocabulary" (OOV) words more effectively than word-level models.
- **Stability**: Provides more reliable completion results for the unique phonetic and morphological structure of Uyghur.
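To make the idea concrete, here is a minimal sketch of how a root and one of its inflected forms share character n-grams. It assumes the model sees each word as a sequence of single characters separated by spaces (a common way to feed characters to KenLM, but not confirmed by this repository); the helper names `to_char_tokens` and `char_ngrams` are illustrative only.

```python
def to_char_tokens(word: str) -> str:
    """Return the word as space-separated characters, one token per character."""
    return " ".join(word)


def char_ngrams(word: str, n: int = 3) -> list:
    """List the overlapping character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


# "school" and "our schools", taken from the example above (hyphens removed).
root = "مەكتەپ"
inflected = "مەكتەپلىرىمىز"

print(to_char_tokens(root))

# The inflected form starts with the root, so every trigram of the root
# also occurs in the inflected form. Statistics learned from one form
# therefore carry over to the other.
shared = set(char_ngrams(root)) & set(char_ngrams(inflected))
print(sorted(shared))
```

A word-level model would treat these two strings as unrelated vocabulary items, whereas the character-level view exposes their common prefix.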
---
## 🧠 System Architecture
The project functions as a **ranking/scoring engine**: each candidate word is scored by its estimated probability given the typed context:
$$P(\text{candidate} \mid \text{context})$$
1. **Candidate Generation**: The system retrieves possible words from the 14M+ word dictionary (`dict.txt`) based on the user's input prefix.
2. **KenLM Scoring**: The Character-level KenLM (`char.bin`) acts as the "Brain," scoring each candidate based on linguistic probability.
3. **Sorting**: The most probable candidates are delivered to the user interface.
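The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `server.py`: the function names are hypothetical, the dictionary is assumed to be loaded as a sorted list of words, and a toy scorer stands in for the KenLM model so the snippet runs without `char.bin`.

```python
import bisect
from typing import Callable, List, Tuple


def candidates_for_prefix(sorted_words: List[str], prefix: str, limit: int = 50) -> List[str]:
    """Step 1: collect up to `limit` words starting with `prefix` from a sorted list."""
    start = bisect.bisect_left(sorted_words, prefix)
    out = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break  # words sharing a prefix are contiguous in sorted order
        out.append(word)
        if len(out) == limit:
            break
    return out


def rank(words: List[str], score: Callable[[str], float], top_k: int = 3) -> List[Tuple[str, float]]:
    """Steps 2 and 3: score every candidate, then sort from highest to lowest score."""
    scored = [(w, score(w)) for w in words]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_score(word: str) -> float:
    """Stand-in scorer (hypothetical): shorter words get higher scores."""
    return -float(len(word))

# With the real model, the scorer might look like this (requires the `kenlm`
# package and assumes the model was trained on space-separated characters):
#   import kenlm
#   model = kenlm.Model("char.bin")
#   score = lambda w: model.score(" ".join(w), bos=True, eos=True)

words = sorted([
    "مەكتەپتىكى", "مەدەنىيلىكنىڭ", "مەنپەئەتى",
    "مەردان", "مەرھابانىڭ", "مەركىزىدىكى",
])
for word, s in rank(candidates_for_prefix(words, "مەر"), toy_score):
    print(word, s)
```

Binary search keeps the prefix lookup cheap even for a dictionary with millions of entries; a trie would be another reasonable choice.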
---
## 📂 Repository Contents
| File | Description |
| :---------- | :----------------------------------------------------------- |
| `char.bin`  | **Core Model**: 407 MB binary KenLM model.                   |
| `dict.txt`  | **Dictionary**: Word list containing 14,416,068 Uyghur entries. |
| `server.py` | **Linux Server**: Flask API for remote scoring and prediction. |
| `main.py` | **Windows Client**: Desktop overlay for real-time typing. |
| `test.py` | **Testing Script**: CLI script to verify candidate scoring. |
---
## 📊 Performance & Case Studies
### CLI Prediction Test
When a user types a prefix, the engine returns scored candidates. Candidates with higher scores (closer to zero) are ranked first:
**Input: `مە`**
- مەكتەپتىكى (Score: -6.69)
- مەدەنىيلىكنىڭ (Score: -7.67)
- مەنپەئەتى (Score: -7.70)
**Input: `مەر`**
- مەردان (Score: -6.27)
- مەرھابانىڭ (Score: -7.18)
- مەركىزىدىكى (Score: -8.05)
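For intuition about what the score gaps mean, the sketch below converts the scores from the `مە` example into relative weights. It assumes the scores are base-10 log probabilities, which is what KenLM's Python `score` method returns; the README does not state whether any extra normalization is applied, so treat the result as illustrative. The helper name `relative_weights` is hypothetical.

```python
# Scores copied from the first example above.
scored = [
    ("مەكتەپتىكى", -6.69),
    ("مەدەنىيلىكنىڭ", -7.67),
    ("مەنپەئەتى", -7.70),
]


def relative_weights(pairs):
    """Turn log10 scores into weights that sum to 1 across the candidate list."""
    best = max(s for _, s in pairs)
    # Subtract the best score before exponentiating to keep the numbers well scaled.
    raw = [(w, 10.0 ** (s - best)) for w, s in pairs]
    total = sum(v for _, v in raw)
    return [(w, v / total) for w, v in raw]


for word, weight in relative_weights(scored):
    print(f"{word}\t{weight:.3f}")
```

Under this assumption, a difference of 1.0 between two scores corresponds to a tenfold difference in estimated probability.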
---
## 🖼️ Input Method Preview (Screenshots)
The screenshots below show test cases for the input method interface, which renders right-to-left (RTL) Uyghur script.
<div align="center">
<img src="img/test-1.jpg" height="100px" alt="Test 1">
<img src="img/test-2.jpg" height="100px" alt="Test 2">
<img src="img/test-3.jpg" height="100px" alt="Test 3">
</div>
---
## 🚀 Deployment
### 1. Backend (Linux Server)
Scoring, the computationally heavy step, runs on the Linux server:
```bash
# Install Flask and the KenLM Python bindings (built from the upstream source archive)
pip install flask https://github.com/kpu/kenlm/archive/master.zip
python3 server.py
```
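Once the server is running, a client could query it over HTTP. The sketch below shows one way to build such a request with the Uyghur prefix safely percent-encoded. The route `/predict`, the `prefix` parameter, and the JSON response are all assumptions made for illustration; check `server.py` for the actual interface. Port 5000 is Flask's default development port.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical values: the real host, port, and route are defined in server.py.
BASE_URL = "http://127.0.0.1:5000"
ROUTE = "/predict"


def build_query_url(prefix: str, base_url: str = BASE_URL) -> str:
    """Build a GET URL with the prefix percent-encoded as UTF-8."""
    query = urllib.parse.urlencode({"prefix": prefix})
    return f"{base_url}{ROUTE}?{query}"


def fetch_candidates(prefix: str, timeout: float = 2.0):
    """Send the request and decode a JSON body (the response shape is assumed)."""
    with urllib.request.urlopen(build_query_url(prefix), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Only builds the URL; call fetch_candidates() once the server is reachable.
print(build_query_url("مە"))
```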