metadata
language:
  - ug
license: mit
library_name: kenlm
tags:
  - uyghur
  - input-method
  - nlp
  - character-level
datasets:
  - custom
metrics:
  - perplexity

Uyghur-Character-Level-KenLM-Input-Method (Trial Version)

An intelligent input prediction engine specifically designed for the Uyghur language. It combines traditional corpus-based prefix searching with high-performance KenLM (N-gram) language models to achieve real-time mapping from Latin characters to Uyghur script with probabilistic ranking.

An experimental intelligent input project designed specifically for the Uyghur language.


🌟 Why Character-Level?

Uyghur is a highly agglutinative language: a single word root can produce dozens of different forms through the addition of suffixes.

  • Example:
    • مەك-تەپ (School)
    • مەك-تەپ-لى-رى-مىز (Our schools)

Traditional word-level N-grams often suffer from data sparsity and poor generalization in Uyghur. This project utilizes a Character-level n-gram approach:

  • Root & Suffix Learning: Automatically learns the relationships between word roots and various suffixes.
  • Superior Generalization: Handles "Out-of-Vocabulary" (OOV) words more effectively than word-level models.
  • Stability: Provides more reliable completion results for the unique phonetic and morphological structure of Uyghur.
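To illustrate why character-level units help with suffixation, here is a minimal sketch of how a word might be split into character tokens and character n-grams. The helper names (`to_char_tokens`, `char_ngrams`), the trigram order, and the `<s>`/`</s>` boundary markers are assumptions for illustration; the README does not describe the actual preprocessing used to train `char.bin`.

```python
def to_char_tokens(word: str) -> str:
    """Split a word into space-separated characters (one token per character)."""
    return " ".join(word.strip())


def char_ngrams(word: str, n: int = 3) -> list[str]:
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = ["<s>"] + list(word) + ["</s>"]
    return [" ".join(padded[i:i + n]) for i in range(len(padded) - n + 1)]


root = "مەكتەپ"              # "school"
inflected = "مەكتەپلىرىمىز"   # "our schools"

# N-grams learned from the root are reused when scoring the inflected form,
# even if the inflected form itself never appeared in the training corpus.
shared = set(char_ngrams(root)) & set(char_ngrams(inflected))

print(to_char_tokens(root))
print(f"{len(shared)} of {len(char_ngrams(root))} root trigrams also occur in the inflected form")
```

A word-level model would treat the two forms as unrelated vocabulary items, whereas the character n-grams above overlap wherever the inflected form contains the root.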

🧠 System Architecture

The project functions as a Ranking/Scoring Engine based on the formula:

P(candidate | context)
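For a character-level n-gram model, this probability is typically factored over the characters of the candidate. The following is a sketch under stated assumptions: the candidate is a character sequence c_1 … c_m, the model order n is not given in this README, and c_{m+1} stands for the end-of-sequence marker. KenLM reports scores as base-10 log probabilities, so the product becomes a sum:

```latex
\log_{10} P(\text{candidate})
  = \sum_{i=1}^{m+1} \log_{10} P\left(c_i \mid c_{i-n+1}, \dots, c_{i-1}\right),
\qquad c_{m+1} = \texttt{</s>}
```

Positions before the first character are filled with the start-of-sequence marker. Because every term is a log probability (at most zero), longer candidates tend to receive lower totals.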

  1. Candidate Generation: The system retrieves possible words from the 14M+ word dictionary (dict.txt) based on the user's input prefix.
  2. KenLM Scoring: The Character-level KenLM (char.bin) acts as the "Brain," scoring each candidate based on linguistic probability.
  3. Sorting: The most probable candidates are delivered to the user interface.
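The three steps above can be sketched as follows. This is not the code in `server.py`; it is a minimal illustration that assumes the dictionary is held in memory as a sorted list and that the scorer is any function returning a log-probability. The names `find_candidates`, `rank`, `make_kenlm_scorer`, and `toy_score` are hypothetical. The KenLM scorer additionally assumes `char.bin` was trained on space-separated characters.

```python
import bisect
from typing import Callable


def find_candidates(sorted_words: list[str], prefix: str, limit: int = 1000) -> list[str]:
    """Step 1: return up to `limit` words that start with `prefix`.

    In a sorted list, all words sharing a prefix form one contiguous run
    that begins at the insertion point of the prefix itself.
    """
    start = bisect.bisect_left(sorted_words, prefix)
    matches = []
    for word in sorted_words[start:start + limit]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches


def rank(prefix: str, sorted_words: list[str],
         score: Callable[[str], float], top_k: int = 3) -> list[tuple[str, float]]:
    """Steps 2 and 3: score every candidate, then return the best `top_k`."""
    scored = [(word, score(word)) for word in find_candidates(sorted_words, prefix)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # higher log-prob first
    return scored[:top_k]


def make_kenlm_scorer(model_path: str) -> Callable[[str], float]:
    """Build a scorer backed by KenLM (requires the third-party `kenlm` package)."""
    import kenlm  # imported lazily so the rest of the sketch runs without it
    model = kenlm.Model(model_path)
    # Assumption: the model expects one character per token.
    return lambda word: model.score(" ".join(word), bos=True, eos=True)


def toy_score(word: str) -> float:
    """Stand-in scorer for demonstration only: shorter words score higher."""
    return -float(len(word))


# Latin placeholder words keep the demo readable; real entries come from dict.txt.
words = sorted(["mektep", "men", "merdan", "merhaba", "su"])
print(rank("me", words, toy_score))
```

With the real model, `toy_score` would be replaced by `make_kenlm_scorer("char.bin")`. Binary search keeps the prefix lookup fast even for a dictionary with millions of entries, though a trie would be a reasonable alternative.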

📂 Repository Contents

| File | Description |
| --- | --- |
| char.bin | Core Model: 407MB Binary KenLM model. |
| dict.txt | Dictionary: Massive corpus containing 14,416,068 Uyghur entries. |
| server.py | Linux Server: Flask API for remote scoring and prediction. |
| main.py | Windows Client: Desktop overlay for real-time typing. |
| test.py | Testing Script: CLI script to verify candidate scoring. |

📊 Performance & Case Studies

CLI Prediction Test

When a user types a prefix, the engine generates scored candidates instantly:

Input: مە

  • مەكتەپتىكى (Score: -6.69)
  • مەدەنىيلىكنىڭ (Score: -7.67)
  • مەنپەئەتى (Score: -7.70)

Input: مەر

  • مەردان (Score: -6.27)
  • مەرھابانىڭ (Score: -7.18)
  • مەركىزىدىكى (Score: -8.05)
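The scores above are easier to compare once converted out of log space. The sketch below assumes they are raw KenLM outputs, which are base-10 log probabilities; the README does not say whether any normalization is applied, so treat the interpretation as tentative. The function names are hypothetical.

```python
def to_probability(log10_score: float) -> float:
    """Convert a base-10 log probability into a plain probability."""
    return 10 ** log10_score


def relative_likelihood(score_a: float, score_b: float) -> float:
    """How many times more probable candidate A is than candidate B."""
    return 10 ** (score_a - score_b)


# The top two candidates for the first prefix in the test above.
first, second = -6.69, -7.67

print(to_probability(first))
print(relative_likelihood(first, second))
```

Under that assumption, a gap of about one point between two scores means the higher-ranked candidate is roughly ten times more probable according to the model.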

🖼️ Input Method Preview (Screenshots)

Below are the test cases for the input method interface. It supports seamless rendering of RTL (Right-to-Left) Uyghur script.

[Screenshots: Test 1, Test 2, Test 3]
---

🚀 Deployment

1. Backend (Linux Server)

Scoring, the computationally heavy step, runs on the Linux server:

pip install flask https://github.com/kpu/kenlm/archive/master.zip
python3 server.py