---
language:
- ug
license: mit
library_name: kenlm
tags:
- uyghur
- input-method
- nlp
- character-level
datasets:
- custom
metrics:
- perplexity
---
# Uyghur-Character-Level-KenLM-Input-Method (سىناق نۇسخىسى)


An input prediction engine designed for the **Uyghur language**. It combines corpus-based prefix search with a **KenLM (n-gram)** language model to map Latin characters to Uyghur script in real time and rank the candidates by probability.

<p dir="rtl" align="right">ئۇيغۇرتىلى ئۈچۈن مەخسۇس لايىھەلەنگەن ئەقلىي كىرگۈزۈش سىناق تۈرى</p>

---

## 🌟 Why Character-Level?

Uyghur is a highly **agglutinative language**: a single word root can produce dozens of different forms through the addition of suffixes.

- **Example**: 
  - `مەك-تەپ` (School)
  - `مەك-تەپ-لى-رى-مىز` (Our schools)

Because so many inflected forms appear rarely or never in a training corpus, word-level n-gram models often suffer from data sparsity and generalize poorly in Uyghur. This project instead uses a **character-level n-gram** approach:
- **Root & Suffix Learning**: Automatically learns the relationships between word roots and various suffixes.
- **Superior Generalization**: Handles "Out-of-Vocabulary" (OOV) words more effectively than word-level models.
- **Stability**: Provides more reliable completion results for the unique phonetic and morphological structure of Uyghur.
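
As a rough illustration of the point above, the sketch below splits words into characters, the form a character-level n-gram model sees as its "tokens", and counts how many character trigrams an inflected form shares with its root. It is not taken from this repository: the helper names `to_char_tokens` and `char_ngrams` and the `<s>`/`</s>` boundary markers are assumptions made for the example.

```python
def to_char_tokens(word: str) -> str:
    """Render a word as space-separated characters (one token per character)."""
    return " ".join(word)


def char_ngrams(word: str, n: int) -> list[str]:
    """List the character n-grams of a word, padded with boundary markers."""
    padded = ["<s>"] + list(word) + ["</s>"]
    return [" ".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]


root = "مەكتەپ"              # "school"
inflected = root + "لىرىمىز"  # "our schools": the same root plus suffixes

# Trigrams seen while learning the root are reused when scoring the inflected form.
shared = set(char_ngrams(root, 3)) & set(char_ngrams(inflected, 3))

print(to_char_tokens(root))
# Every root trigram is shared except the one that contains the end marker.
print(len(shared), "of", len(char_ngrams(root, 3)), "root trigrams are shared")
```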
---

## 🧠 System Architecture

The project functions as a **Ranking/Scoring Engine**: each candidate is scored by its conditional probability given the typed context:  
$$P(\text{candidate} \mid \text{context})$$

1. **Candidate Generation**: The system retrieves possible words from the 14M+ word dictionary (`dict.txt`) based on the user's input prefix.
2. **KenLM Scoring**: The character-level KenLM model (`char.bin`) acts as the "Brain," scoring each candidate by its probability under the language model.
3. **Sorting**: Candidates are sorted by score, and the most probable ones are delivered to the user interface.
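
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `server.py`: the function names `prefix_candidates` and `rank`, the `limit` and `top_k` parameters, and the toy scorer are hypothetical, and the sketch assumes `char.bin` was trained on space-separated characters. With the `kenlm` Python module installed, `kenlm.Model("char.bin").score` (which returns a log10 probability) could be passed as `score_fn` instead of the toy scorer.

```python
import bisect
from typing import Callable


def prefix_candidates(sorted_words: list[str], prefix: str, limit: int = 1000) -> list[str]:
    """Step 1: collect up to `limit` words that start with `prefix`.

    `sorted_words` must be sorted; matching words then form one contiguous run.
    """
    i = bisect.bisect_left(sorted_words, prefix)
    found = []
    while i < len(sorted_words) and len(found) < limit and sorted_words[i].startswith(prefix):
        found.append(sorted_words[i])
        i += 1
    return found


def rank(prefix: str, sorted_words: list[str],
         score_fn: Callable[[str], float], top_k: int = 3) -> list[tuple[str, float]]:
    """Steps 2 and 3: score each candidate as space-separated characters, best first."""
    scored = [(w, score_fn(" ".join(w))) for w in prefix_candidates(sorted_words, prefix)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-in scorer (assumption): shorter character sequences score higher.
# With KenLM installed, one could instead use:
#   import kenlm
#   score_fn = kenlm.Model("char.bin").score
def toy_score(chars: str) -> float:
    return -0.5 * len(chars.split())


# A tiny stand-in for dict.txt, using words that appear in this README.
words = sorted(["مەكتەپتىكى", "مەدەنىيلىكنىڭ", "مەنپەئەتى",
                "مەردان", "مەرھابانىڭ", "مەركىزىدىكى"])
for word, score in rank("مەر", words, toy_score):
    print(word, score)
```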
---

## 📂 Repository Contents

| File        | Description                                                  |
| :---------- | :----------------------------------------------------------- |
| `char.bin`  | **Core Model**: 407 MB binary KenLM model.                   |
| `dict.txt`  | **Dictionary**: Massive corpus containing 14,416,068 Uyghur entries. |
| `server.py` | **Linux Server**: Flask API for remote scoring and prediction. |
| `main.py`   | **Windows Client**: Desktop overlay for real-time typing.    |
| `test.py`   | **Testing Script**: CLI script to verify candidate scoring.  |

---

## 📊 Performance & Case Studies

### CLI Prediction Test
When a user types a prefix, the engine returns a list of scored candidates (higher scores rank first):

**Input: `مە`**
- مەكتەپتىكى (Score: -6.69)
- مەدەنىيلىكنىڭ (Score: -7.67)
- مەنپەئەتى (Score: -7.70)

**Input: `مەر`**
- مەردان (Score: -6.27)
- مەرھابانىڭ (Score: -7.18)
- مەركىزىدىكى (Score: -8.05)
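
To get a feel for what the score gaps mean, the sketch below converts the scores listed for `مەر` into relative probabilities among those three candidates. It assumes the scores are raw log10 probabilities, which is what KenLM's Python `score` method returns; the README does not state this explicitly, and the helper name `relative_probabilities` is hypothetical.

```python
def relative_probabilities(log10_scores: dict[str, float]) -> dict[str, float]:
    """Normalize log10 scores into probabilities that sum to 1 over the given candidates."""
    best = max(log10_scores.values())
    # Subtract the best score first so the powers of ten stay in a safe numeric range.
    weights = {word: 10 ** (score - best) for word, score in log10_scores.items()}
    total = sum(weights.values())
    return {word: weight / total for word, weight in weights.items()}


# Scores copied from the `مەر` example above (assumed to be log10 probabilities).
scores = {"مەردان": -6.27, "مەرھابانىڭ": -7.18, "مەركىزىدىكى": -8.05}

for word, p in relative_probabilities(scores).items():
    print(f"{word}\t{p:.3f}")
```

Under that assumption, a gap of 0.91 between the first and second candidate means the first is roughly eight times more probable.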

---

## 🖼️ Input Method Preview (Screenshots)

The screenshots below show test runs of the input method interface, which renders right-to-left (RTL) Uyghur script.

<div align="center">
  <img src="img/test-1.jpg" height="100px" alt="Test 1">
  <img src="img/test-2.jpg" height="100px" alt="Test 2">
  <img src="img/test-3.jpg" height="100px" alt="Test 3">
</div>

---

## 🚀 Deployment

### 1. Backend (Linux Server)
The computationally heavy scoring runs on a Linux server. Install the dependencies and start it:
```bash
# Install Flask from PyPI and the KenLM Python module from its GitHub source archive
pip install flask https://github.com/kpu/kenlm/archive/master.zip
python3 server.py
```