Uyghur-Character-Level-KenLM-Input-Method (سىناق نۇسخىسى)
An intelligent input prediction engine specifically designed for the Uyghur language. It combines traditional corpus-based prefix searching with high-performance KenLM (N-gram) language models to achieve real-time mapping from Latin characters to Uyghur script with probabilistic ranking.
An experimental intelligent input project designed specifically for the Uyghur language.
🌟 Why Character-Level?
Uyghur is a highly agglutinative language: a single word root can produce dozens of different forms through the addition of suffixes.
- Example:
  - مەك-تەپ (School)
  - مەك-تەپ-لى-رى-مىز (Our schools)
Traditional word-level N-grams often suffer from data sparsity and poor generalization in Uyghur. This project uses a character-level n-gram approach:
- Root & Suffix Learning: Automatically learns the relationships between word roots and various suffixes.
- Superior Generalization: Handles "Out-of-Vocabulary" (OOV) words more effectively than word-level models.
- Stability: Provides more reliable completion results for the unique phonetic and morphological structure of Uyghur.
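The OOV point can be illustrated with a small sketch. This is not the project's code: the helper names `char_ngrams` and `ngram_coverage` and the three-word toy corpus are assumptions made for illustration. It shows that a word absent from the vocabulary as a whole can still consist almost entirely of character trigrams seen in other words.

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams, with boundary markers."""
    padded = "<" + word + ">"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def ngram_coverage(word, seen_ngrams, n=3):
    """Fraction of the word's character n-grams that were seen in training."""
    grams = char_ngrams(word, n)
    if not grams:
        return 0.0
    return sum(g in seen_ngrams for g in grams) / len(grams)


# Toy "training corpus" (illustrative only): school, schools, our books.
train_words = ["مەكتەپ", "مەكتەپلەر", "كىتابلىرىمىز"]
seen = {g for w in train_words for g in char_ngrams(w)}

# "Our schools" never appears in the toy corpus as a whole word.
new_word = "مەكتەپلىرىمىز"
known_to_word_model = new_word in train_words      # a word-level model has no entry
char_coverage = ngram_coverage(new_word, seen)     # most trigrams are already known

print(known_to_word_model)
print(round(char_coverage, 2))
```

A word-level model would have to treat `new_word` as unknown, while a character-level model can still assign it a meaningful score from the root and suffix fragments it has already seen.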
🧠 System Architecture
The project functions as a ranking/scoring engine that processes each input prefix in three steps:
- Candidate Generation: The system retrieves possible words from the 14M+ word dictionary (dict.txt) based on the user's input prefix.
- KenLM Scoring: The character-level KenLM model (char.bin) acts as the "brain," scoring each candidate by its linguistic probability.
- Sorting: The most probable candidates are delivered to the user interface.
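The generate, score, sort flow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: the function names, the in-memory sorted list, and the `toy_score` stand-in are all hypothetical. It also assumes the character-level model expects words as space-separated characters, which the source does not confirm.

```python
from bisect import bisect_left


def prefix_candidates(sorted_words, prefix, limit=50):
    """Collect up to `limit` words starting with `prefix` from a sorted list."""
    i = bisect_left(sorted_words, prefix)  # first position where prefix could occur
    found = []
    while (i < len(sorted_words) and len(found) < limit
           and sorted_words[i].startswith(prefix)):
        found.append(sorted_words[i])
        i += 1
    return found


def to_char_tokens(word):
    """Turn a word into space-separated characters (assumed model input format)."""
    return " ".join(word)


def rank_candidates(prefix, sorted_words, score_fn, top_k=3):
    """Score every candidate and return the top_k, best score first."""
    scored = [(w, score_fn(to_char_tokens(w)))
              for w in prefix_candidates(sorted_words, prefix)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


def toy_score(char_sentence):
    """Stand-in scorer so the sketch runs without the model: shorter is better."""
    return -float(len(char_sentence.split()))


# With the kenlm Python module installed, the real scorer could be plugged in as:
#   import kenlm
#   model = kenlm.Model("char.bin")
#   score_fn = model.score   # returns a log10 probability for a sentence string
words = sorted(["مەكتەپ", "مەردان", "مەنپەئەتى", "مەركىزىدىكى", "كىتاب"])
prefix = "مەر"
top = rank_candidates(prefix, words, toy_score)
for word, score in top:
    print(word, score)
```

Keeping the scorer as a plain function argument means the same ranking code works with the stub here and with a loaded KenLM model on the server.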
📂 Repository Contents
| File | Description |
|---|---|
| `char.bin` | Core Model: 407 MB binary KenLM model. |
| `dict.txt` | Dictionary: Massive corpus containing 14,416,068 Uyghur entries. |
| `server.py` | Linux Server: Flask API for remote scoring and prediction. |
| `main.py` | Windows Client: Desktop overlay for real-time typing. |
| `test.py` | Testing Script: CLI script to verify candidate scoring. |
📊 Performance & Case Studies
CLI Prediction Test
When a user types a prefix, the engine returns candidates ranked by score (higher, meaning closer to zero, is better):
Input: مە
- مەكتەپتىكى (Score: -6.69)
- مەدەنىيلىكنىڭ (Score: -7.67)
- مەنپەئەتى (Score: -7.70)
Input: مەر
- مەردان (Score: -6.27)
- مەرھابانىڭ (Score: -7.18)
- مەركىزىدىكى (Score: -8.05)
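If these scores are raw KenLM outputs, they are base-10 log probabilities. The README does not state this explicitly, so treat it as an assumption. Under that assumption, the gap between two scores translates into a likelihood ratio, as this small sketch shows (the helper name `relative_likelihood` is illustrative):

```python
def relative_likelihood(score_a, score_b):
    """How many times more probable candidate A is than B, given log10 scores."""
    return 10 ** (score_a - score_b)


# First two candidates for the first prefix in the test above.
ratio = relative_likelihood(-6.69, -7.67)
print(round(ratio, 1))
```

A gap of roughly one point therefore corresponds to about an order of magnitude in model probability.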
🖼️ Input Method Preview (Screenshots)
Below are the test cases for the input method interface. It supports seamless rendering of RTL (Right-to-Left) Uyghur script.
🚀 Deployment
1. Backend (Linux Server)
The computationally heavy scoring runs on the Linux server:
pip install flask https://github.com/kpu/kenlm/archive/master.zip
python3 server.py