language:
  - ug
license: mit
library_name: kenlm
tags:
  - uyghur
  - input-method
  - nlp
  - character-level
datasets:
  - custom
metrics:
  - perplexity

Uyghur-Character-Level-KenLM-Input-Method (Test Version)

An intelligent input prediction engine specifically designed for the Uyghur language. It combines traditional corpus-based prefix searching with high-performance KenLM (N-gram) language models to achieve real-time mapping from Latin characters to Uyghur script with probabilistic ranking.

An experimental intelligent input project designed specifically for the Uyghur language

🌟 Why Character-Level?

Uyghur is a highly agglutinative language: a single word root can produce dozens of different forms through the addition of suffixes.

Example:
- مەك-تەپ (School)
- مەك-تەپ-لى-رى-مىز (Our schools)

Traditional word-level N-grams often suffer from data sparsity and poor generalization in Uyghur. This project uses a character-level n-gram approach:
- Root & Suffix Learning: Automatically learns the relationships between word roots and various suffixes.
- Superior Generalization: Handles "Out-of-Vocabulary" (OOV) words more effectively than word-level models.
- Stability: Provides more reliable completion results for the unique phonetic and morphological structure of Uyghur.
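
To make the character-level idea concrete, the sketch below shows one common way to prepare text for a character-level n-gram model: each word is split into space-separated characters so that every character becomes a token. The function name `to_char_tokens` and the `<sp>` word-boundary symbol are illustrative assumptions, not taken from this repository, which may preprocess its corpus differently.

```python
def to_char_tokens(text: str, boundary: str = "<sp>") -> str:
    """Turn text into space-separated character tokens.

    Spaces between words are replaced by a boundary symbol so that word
    boundaries survive tokenization. Both this function and the boundary
    symbol are hypothetical, chosen only for illustration.
    """
    tokens = []
    for word_index, word in enumerate(text.split()):
        if word_index > 0:
            tokens.append(boundary)
        tokens.extend(word)  # iterating a str yields one character at a time
    return " ".join(tokens)


if __name__ == "__main__":
    # The root and its suffixed form share the same leading character tokens,
    # which is what lets a character model reuse statistics across word forms.
    print(to_char_tokens("مەكتەپ"))
    print(to_char_tokens("مەكتەپلىرىمىز"))
```

Because both forms start with the same character sequence, n-gram counts learned from one form also inform the other, which a word-level model cannot do.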

🧠 System Architecture

The project functions as a ranking/scoring engine based on the formula:
$P(\text{candidate} \mid \text{context})$
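
Assuming the character model is a standard order-$n$ n-gram model (the order used to train `char.bin` is not stated here), the score of a candidate word $w = c_1 c_2 \dots c_m$ can be read as a sum of per-character conditional log-probabilities. The symbols $c_i$, $m$, and $n$ are notation introduced for this illustration only:

$$\log P(w) = \sum_{i=1}^{m} \log P(c_i \mid c_{i-n+1}, \dots, c_{i-1})$$

Under this reading, every term is non-positive, so a total closer to zero indicates a more probable character sequence.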

1. Candidate Generation: The system retrieves possible words from the 14M+ word dictionary (`dict.txt`) based on the user's input prefix.
2. KenLM Scoring: The character-level KenLM model (`char.bin`) acts as the "brain," scoring each candidate based on linguistic probability.
3. Sorting: The most probable candidates are delivered to the user interface.
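
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `server.py` or `test.py`: the helper names (`prefix_candidates`, `rank`, `make_kenlm_scorer`) are hypothetical, the dictionary is assumed to be loaded into a sorted Python list, and the model is assumed to expect space-separated characters. The only KenLM calls used are `kenlm.Model(path)` and `Model.score(sentence, bos=..., eos=...)`.

```python
from bisect import bisect_left
from typing import Callable, List, Tuple


def prefix_candidates(sorted_words: List[str], prefix: str, limit: int = 1000) -> List[str]:
    """Step 1: collect up to `limit` words that start with `prefix`.

    In a sorted list, all words sharing a prefix are contiguous, so a binary
    search finds the first one and a short scan collects the rest.
    """
    matches = []
    i = bisect_left(sorted_words, prefix)
    while (i < len(sorted_words) and len(matches) < limit
           and sorted_words[i].startswith(prefix)):
        matches.append(sorted_words[i])
        i += 1
    return matches


def rank(candidates: List[str], score: Callable[[str], float],
         top_k: int = 3) -> List[Tuple[str, float]]:
    """Steps 2 and 3: score every candidate and return the best `top_k`."""
    scored = [(word, score(word)) for word in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest score first
    return scored[:top_k]


def make_kenlm_scorer(model_path: str) -> Callable[[str], float]:
    """Build a scorer backed by a KenLM model (requires the kenlm package).

    Assumption: the model was trained on space-separated characters, so the
    word is split the same way before scoring.
    """
    import kenlm  # imported lazily so the rest of the sketch runs without it

    model = kenlm.Model(model_path)
    return lambda word: model.score(" ".join(word), bos=True, eos=True)


if __name__ == "__main__":
    # Toy data and a toy scorer (shorter words score higher) stand in for
    # dict.txt and char.bin so the sketch runs anywhere.
    words = sorted(["ab", "abc", "abd", "b", "a"])
    print(rank(prefix_candidates(words, "ab"), lambda w: -float(len(w)), top_k=2))
```

With the real files, the toy scorer would be replaced by `make_kenlm_scorer("char.bin")` and the toy list by the sorted contents of `dict.txt`.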

📂 Repository Contents

| File | Description |
| --- | --- |
| `char.bin` | Core Model: 407 MB binary KenLM model. |
| `dict.txt` | Dictionary: Massive corpus containing 14,416,068 Uyghur entries. |
| `server.py` | Linux Server: Flask API for remote scoring and prediction. |
| `main.py` | Windows Client: Desktop overlay for real-time typing. |
| `test.py` | Testing Script: CLI script to verify candidate scoring. |

📊 Performance & Case Studies

CLI Prediction Test

When a user types a prefix, the engine returns candidates ranked by score (values closer to zero rank higher):

Input: مە

مەكتەپتىكى (Score: -6.69)
مەدەنىيلىكنىڭ (Score: -7.67)
مەنپەئەتى (Score: -7.70)

Input: مەر

مەردان (Score: -6.27)
مەرھابانىڭ (Score: -7.18)
مەركىزىدىكى (Score: -8.05)

🖼️ Input Method Preview (Screenshots)

Below are the test cases for the input method interface. It supports seamless rendering of RTL (Right-to-Left) Uyghur script.

---

🚀 Deployment

1. Backend (Linux Server)

The computationally heavy scoring runs on the Linux server:

pip install flask https://github.com/kpu/kenlm/archive/master.zip
python3 server.py
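
For orientation, here is a minimal sketch of what a Flask scoring endpoint of this kind could look like. It is not the contents of `server.py`: the `/predict` route, the `prefix` query parameter, the response shape, and the helper names `predict_payload` and `create_app` are all hypothetical, and the dictionary is scanned linearly for brevity rather than indexed.

```python
from typing import Callable, Dict, Iterable, List


def predict_payload(prefix: str, words: Iterable[str],
                    score: Callable[[str], float], top_k: int = 5) -> Dict:
    """Build a JSON-serializable response for one prefix query."""
    if not prefix:
        # An empty prefix would match the whole dictionary, so return nothing.
        return {"prefix": prefix, "candidates": []}
    matches = [w for w in words if w.startswith(prefix)]
    ranked = sorted(matches, key=score, reverse=True)[:top_k]  # best first
    return {
        "prefix": prefix,
        "candidates": [{"word": w, "score": score(w)} for w in ranked],
    }


def create_app(words: List[str], score: Callable[[str], float]):
    """Expose predict_payload over HTTP (route and parameter are assumptions)."""
    from flask import Flask, jsonify, request  # lazy import: needs flask installed

    app = Flask(__name__)

    @app.route("/predict")
    def predict():
        prefix = request.args.get("prefix", "")
        return jsonify(predict_payload(prefix, words, score))

    return app


if __name__ == "__main__":
    # Toy scorer and word list; a real deployment would plug in a
    # KenLM-backed scorer and the entries of dict.txt instead.
    print(predict_payload("ab", ["ab", "abc", "x"], lambda w: -float(len(w))))
```

Keeping the ranking logic in a plain function separate from the Flask wrapper lets it be exercised from a CLI script without starting a server.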