| --- |
| language: |
| - ug |
| license: mit |
| library_name: kenlm |
| tags: |
| - uyghur |
| - input-method |
| - nlp |
| - character-level |
| datasets: |
| - custom |
| metrics: |
| - perplexity |
| --- |
| # Uyghur-Character-Level-KenLM-Input-Method (Test Version) |
|
| This project is an input prediction engine for the **Uyghur language**. It combines dictionary-based prefix search with a **KenLM (n-gram)** language model to map Latin-character input to Uyghur script in real time and to rank the resulting candidates by probability. |
|
|
| <p dir="rtl" align="right">ئۇيغۇر تىلى ئۈچۈن مەخسۇس لايىھەلەنگەن ئەقلىي كىرگۈزۈش سىناق تۈرى</p> |
|
|
| --- |
|
|
| ## 🌟 Why Character-Level? |
|
|
| Uyghur is a highly **agglutinative language**: a single word root can produce dozens of different forms through the addition of suffixes. |
|
|
| - **Example**: |
|   - `مەك-تەپ` (school) |
|   - `مەك-تەپ-لى-رى-مىز` (our schools) |
|
|
| Word-level n-gram models often suffer from data sparsity and poor generalization in Uyghur, because every inflected form counts as a separate vocabulary item. This project uses a **character-level n-gram** approach instead: |
| - **Root and suffix learning**: The model learns which character sequences link word roots to their suffixes. |
| - **Better generalization**: It handles out-of-vocabulary (OOV) words more effectively than word-level models. |
| - **Stability**: It gives more reliable completions given the phonetic and morphological structure of Uyghur. |
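
The points above can be illustrated with a small sketch. It assumes that "character-level" means one token per Unicode code point, written as space-separated characters (a common way to feed characters to KenLM); the repository does not state the exact tokenization, and the helper names `to_char_tokens` and `char_ngrams` are hypothetical.

```python
from typing import List, Tuple


def to_char_tokens(word: str) -> str:
    """Split a word into space-separated characters (one token per code point)."""
    return " ".join(word)


def char_ngrams(word: str, n: int) -> List[Tuple[str, ...]]:
    """Return every run of n consecutive characters in the word."""
    chars = list(word)
    return [tuple(chars[i:i + n]) for i in range(len(chars) - n + 1)]


# Root meaning "school", taken from the example above.
root = "مەكتەپ"
# Inflected form meaning "our schools": the same root plus suffixes.
inflected = root + "لىرىمىز"

# Every trigram of the root also occurs in the inflected form, so statistics
# learned from one form carry over to the other.
shared = set(char_ngrams(root, 3)) & set(char_ngrams(inflected, 3))
print(to_char_tokens(root))
print(len(shared), "shared trigrams")
```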
| --- |
|
|
| ## 🧠 System Architecture |
|
|
| The project works as a **ranking/scoring engine** that orders candidate words by their conditional probability given the typed context: |
| $$P(\text{candidate} \mid \text{context})$$ |
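
For a character-level n-gram model, a probability of this kind is normally factored with the chain rule. The expression below is the standard n-gram approximation, not a formula taken from this repository: $c_1 \dots c_m$ stand for the characters of a candidate, and $n$ is the model order, which is not stated here.

$$\log_{10} P(c_1 \dots c_m) \approx \sum_{i=1}^{m} \log_{10} P(c_i \mid c_{i-n+1}, \dots, c_{i-1})$$

Each term depends only on the previous $n-1$ characters, which is why a suffix seen after one root can still be scored sensibly after another.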
|
|
| 1. **Candidate Generation**: The system retrieves possible words from the 14M+ entry dictionary (`dict.txt`) based on the user's input prefix. |
| 2. **KenLM Scoring**: The character-level KenLM model (`char.bin`) acts as the "brain," scoring each candidate by its linguistic probability. |
| 3. **Sorting**: Candidates are sorted by score, and the most probable ones are delivered to the user interface. |
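
The three steps above can be sketched as follows. This is a minimal illustration, not the code in `server.py`: the function names are hypothetical, the prefix lookup assumes the dictionary is held as a sorted in-memory list, and `make_kenlm_scorer` assumes `char.bin` was trained on space-separated characters. The scorer is passed in as a function so the pipeline can be tried with a toy scorer when the model file is not available.

```python
import bisect
from typing import Callable, List, Tuple


def prefix_candidates(sorted_words: List[str], prefix: str, limit: int = 1000) -> List[str]:
    """Step 1: collect up to `limit` words that start with `prefix`."""
    # In a sorted list, all words sharing a prefix form one contiguous run.
    i = bisect.bisect_left(sorted_words, prefix)
    found: List[str] = []
    while i < len(sorted_words) and len(found) < limit and sorted_words[i].startswith(prefix):
        found.append(sorted_words[i])
        i += 1
    return found


def rank_candidates(candidates: List[str], score_fn: Callable[[str], float],
                    top_k: int = 5) -> List[Tuple[str, float]]:
    """Steps 2 and 3: score every candidate, then keep the top_k highest scores."""
    scored = [(word, score_fn(word)) for word in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def make_kenlm_scorer(model_path: str) -> Callable[[str], float]:
    """Build a scorer backed by KenLM (needs the `kenlm` Python package)."""
    import kenlm  # imported lazily so the rest of the sketch runs without it

    model = kenlm.Model(model_path)
    # Assumption: the model expects one space-separated token per character.
    return lambda word: model.score(" ".join(word), bos=True, eos=True)


# Toy run with a stand-in scorer that simply prefers shorter words.
words = sorted(["abc", "abd", "abcde", "b"])
toy_score = lambda word: -float(len(word))
print(rank_candidates(prefix_candidates(words, "ab"), toy_score, top_k=2))
```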
| --- |
|
|
| ## 📂 Repository Contents |
|
|
| | File | Description | |
| | :---------- | :----------------------------------------------------------- | |
| `char.bin`  | **Core Model**: 407 MB binary KenLM model (character-level).  |
| `dict.txt`  | **Dictionary**: Word list containing 14,416,068 Uyghur entries. |
| | `server.py` | **Linux Server**: Flask API for remote scoring and prediction. | |
| | `main.py` | **Windows Client**: Desktop overlay for real-time typing. | |
| | `test.py` | **Testing Script**: CLI script to verify candidate scoring. | |
|
|
| --- |
|
|
| ## 📊 Performance & Case Studies |
|
|
| ### CLI Prediction Test |
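| When a user types a prefix, the engine returns candidates ranked by score. The scores look like KenLM log-probabilities (KenLM reports base-10 logarithms), so values closer to zero indicate more likely candidates: |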
|
|
| **Input: `مە`** |
| - مەكتەپتىكى (Score: -6.69) |
| - مەدەنىيلىكنىڭ (Score: -7.67) |
| - مەنپەئەتى (Score: -7.70) |
|
|
| **Input: `مەر`** |
| - مەردان (Score: -6.27) |
| - مەرھابانىڭ (Score: -7.18) |
| - مەركىزىدىكى (Score: -8.05) |
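
To get a feel for what the score gaps mean, the scores can be converted into relative shares. The sketch below assumes the values are raw base-10 log-probabilities as KenLM reports them, with no length normalization applied by the test script; that is an assumption, and `relative_shares` is a hypothetical helper.

```python
from typing import Dict, List, Tuple


def relative_shares(scored: List[Tuple[str, float]]) -> Dict[str, float]:
    """Turn log10 scores into shares that sum to 1 over the listed candidates."""
    best = max(score for _, score in scored)
    # Subtracting the best score first keeps the powers of ten well scaled.
    weights = {word: 10 ** (score - best) for word, score in scored}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Scores copied from the first example above.
candidates = [("مەكتەپتىكى", -6.69), ("مەدەنىيلىكنىڭ", -7.67), ("مەنپەئەتى", -7.70)]
for word, share in relative_shares(candidates).items():
    print(word, round(share, 3))
```

Under this reading, a gap of about one point between the first and second candidate means the model considers the first roughly ten times as likely.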
|
|
| --- |
|
|
| ## 🖼️ Input Method Preview (Screenshots) |
|
|
| The screenshots below show test runs of the input method interface, which renders right-to-left (RTL) Uyghur script. |
|
|
| <div align="center"> |
| <img src="img/test-1.jpg" height="100px" alt="Test 1"> |
| <img src="img/test-2.jpg" height="100px" alt="Test 2"> |
| <img src="img/test-3.jpg" height="100px" alt="Test 3"> |
| </div> |
|
| --- |
|
|
| ## 🚀 Deployment |
|
|
| ### 1. Backend (Linux Server) |
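| The computationally heavy scoring runs on a Linux server: |
| ```bash |
| # Install Flask and the KenLM Python bindings (built from the source archive) |
| pip install flask https://github.com/kpu/kenlm/archive/master.zip |
| # Start the scoring API |
| python3 server.py |
| ``` |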
|