---
language:
- ug
license: mit
library_name: kenlm
tags:
- uyghur
- input-method
- nlp
- character-level
datasets:
- custom
metrics:
- perplexity
---

# Uyghur-Character-Level-KenLM-Input-Method (Test Version)

An experimental intelligent input prediction engine designed specifically for the **Uyghur language**. It combines traditional corpus-based prefix search with a high-performance **KenLM (n-gram)** language model to map Latin characters to Uyghur script in real time, with candidates ranked by probability.

---

## 🌟 Why Character-Level?

Uyghur is a highly **agglutinative language**. A single word root can produce dozens of different forms through the addition of suffixes.

- **Example**:
  - `مەك-تەپ` (School)
  - `مەك-تەپ-لى-رى-مىز` (Our schools)

Traditional word-level n-grams often suffer from data sparsity and poor generalization in Uyghur. This project uses a **character-level n-gram** approach:

- **Root & Suffix Learning**: Automatically learns the relationships between word roots and their suffixes.
- **Superior Generalization**: Handles out-of-vocabulary (OOV) words more effectively than word-level models.
- **Stability**: Provides more reliable completion results for the unique phonetic and morphological structure of Uyghur.

---

## 🧠 System Architecture

The project functions as a **ranking/scoring engine** based on the formula:

$$P(\text{candidate} \mid \text{context})$$

1. **Candidate Generation**: The system retrieves possible words from the 14M+ word dictionary (`dict.txt`) based on the user's input prefix.
2. **KenLM Scoring**: The character-level KenLM model (`char.bin`) acts as the "brain," scoring each candidate by its linguistic probability.
3. **Sorting**: The most probable candidates are delivered to the user interface.

---

## 📂 Repository Contents

| File        | Description                                                          |
| :---------- | :------------------------------------------------------------------- |
| `char.bin`  | **Core Model**: 407MB binary KenLM model.                            |
| `dict.txt`  | **Dictionary**: Massive corpus containing 14,416,068 Uyghur entries. |
| `server.py` | **Linux Server**: Flask API for remote scoring and prediction.       |
| `main.py`   | **Windows Client**: Desktop overlay for real-time typing.            |
| `test.py`   | **Testing Script**: CLI script to verify candidate scoring.          |

---

## 📊 Performance & Case Studies

### CLI Prediction Test

When a user types a prefix, the engine generates scored candidates instantly. Candidates are listed from most to least probable, so scores closer to zero rank higher:

**Input: `مە`**

- مەكتەپتىكى (Score: -6.69)
- مەدەنىيلىكنىڭ (Score: -7.67)
- مەنپەئەتى (Score: -7.70)

**Input: `مەر`**

- مەردان (Score: -6.27)
- مەرھابانىڭ (Score: -7.18)
- مەركىزىدىكى (Score: -8.05)

---

## 🖼️ Input Method Preview (Screenshots)

Below are the test cases for the input method interface. It supports seamless rendering of RTL (right-to-left) Uyghur script.
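
### Sketch: Prefix Lookup and Ranking

The three steps listed under System Architecture (candidate generation, scoring, sorting) can be illustrated with a minimal Python sketch. It is not the code in `server.py` or `test.py`. The function names `generate_candidates`, `rank_candidates`, and `make_kenlm_char_scorer` are hypothetical, the prefix is assumed to already be in Uyghur script, and each candidate is scored in isolation rather than against preceding context. The scorer also assumes that `char.bin` was trained on space-separated characters, which is a common convention for character-level KenLM models but is not confirmed here.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def generate_candidates(prefix: str, words: Iterable[str]) -> List[str]:
    """Step 1: keep every dictionary word that starts with the typed prefix."""
    return [w for w in words if w.startswith(prefix)]


def rank_candidates(
    prefix: str,
    words: Iterable[str],
    score_fn: Callable[[str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Steps 2 and 3: score each candidate, then return the top_k best.

    score_fn returns a log-probability, so larger (closer to zero) is better.
    """
    scored = [(w, score_fn(w)) for w in generate_candidates(prefix, words)]
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


def make_kenlm_char_scorer(model_path: str) -> Callable[[str], float]:
    """Build a scorer backed by a KenLM binary model (hypothetical helper).

    Assumption: the model expects characters separated by spaces.
    """
    import kenlm  # imported lazily so the ranking logic runs without KenLM

    model = kenlm.Model(model_path)

    def score(word: str) -> float:
        # kenlm returns the log10 probability of the whole sequence
        return model.score(" ".join(word), bos=True, eos=True)

    return score


if __name__ == "__main__":
    # Toy data and a toy scorer (shorter words score higher); swap in
    # make_kenlm_char_scorer("char.bin") to use the real model.
    toy_words = ["مەكتەپ", "مەردان", "مەركىزىدىكى", "كىتاب"]
    for word, score in rank_candidates("مەر", toy_words, lambda w: -float(len(w))):
        print(word, score)
```

A linear scan over 14M+ entries would likely be too slow for interactive typing, so a real implementation would probably hold the dictionary in a sorted list or trie. The scan is used here only to keep the sketch short.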