Rekipjan / Uyghur-Char-KenLM-Input-Method


	---
	language:
	- ug
	license: mit
	library_name: kenlm
	tags:
	- uyghur
	- input-method
	- nlp
	- character-level
	datasets:
	- custom
	metrics:
	- perplexity
	---
	# Uyghur-Character-Level-KenLM-Input-Method (سىناق نۇسخىسى)



	An input prediction engine designed for the Uyghur language. It combines dictionary-based prefix search with a KenLM n-gram language model to map Latin-character input to Uyghur script in real time and rank the candidates by probability.

	<p dir="rtl" align="right">ئۇيغۇرتىلى ئۈچۈن مەخسۇس لايىھەلەنگەن ئەقلىي كىرگۈزۈش سىناق تۈرى</p>

	---

	## 🌟 Why Character-Level?

	Uyghur is a highly agglutinative language: a single root can produce dozens of different forms through the addition of suffixes.

	Example:
	- `مەك-تەپ` (school)
	- `مەك-تەپ-لى-رى-مىز` (our schools)

	Word-level n-gram models often suffer from data sparsity and poor generalization in Uyghur, because every inflected form counts as a separate token. This project uses a character-level n-gram approach instead:
	- Root & Suffix Learning: Automatically learns the relationships between word roots and various suffixes.
	- Superior Generalization: Handles "Out-of-Vocabulary" (OOV) words more effectively than word-level models.
	- Stability: Provides more reliable completion results for the unique phonetic and morphological structure of Uyghur.
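To make the idea concrete, here is a minimal sketch of how text might be split into character tokens before it reaches a character-level n-gram model. The helper name `to_char_tokens` and the `<sp>` word-boundary token are assumptions made for illustration; the project's actual preprocessing may differ.

```python
def to_char_tokens(text: str, boundary: str = "<sp>") -> str:
    """Turn text into space-separated character tokens.

    Spaces between words become `boundary` (a hypothetical marker)
    so that word breaks survive tokenization.
    """
    tokens = []
    for word_index, word in enumerate(text.split()):
        if word_index > 0:
            tokens.append(boundary)
        tokens.extend(word)  # iterating a str yields one Unicode code point at a time
    return " ".join(tokens)


root = "مەكتەپ"  # "school"
inflected = "مەكتەپلىرىمىز"  # "our schools"

print(to_char_tokens(root))
print(to_char_tokens(inflected))

# The inflected form begins with exactly the root's tokens, so the n-gram
# statistics learned for the root are reused when the suffixes are scored.
print(to_char_tokens(inflected).startswith(to_char_tokens(root)))
```

Because the root and its inflected forms share a token prefix, an unseen inflection can still be scored from character sequences the model has seen elsewhere.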
	---

	## 🧠 System Architecture

	The project functions as a ranking engine that scores each candidate by its conditional probability given the preceding context:
	$$P(\text{candidate} \mid \text{context})$$
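For a character-level n-gram model, this probability is usually approximated with the chain rule over the candidate's characters. The notation below is introduced here for illustration only and is not taken from the project: $w = c_1 \dots c_m$ is the candidate, $h$ is the context, and $n$ is the model order, which the README does not state.

$$\log P(w \mid h) \approx \sum_{i=1}^{m} \log P(c_i \mid c_{i-n+1}, \dots, c_{i-1})$$

Characters with an index below 1 are taken from the end of the context $h$. Each term depends only on the previous $n-1$ characters, which is what lets the model score word forms it never saw as whole words.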

	1. Candidate Generation: The system retrieves possible words from the 14M+ entry dictionary (`dict.txt`) that match the user's input prefix.
	2. KenLM Scoring: The character-level KenLM model (`char.bin`) acts as the "Brain," scoring each candidate by its probability under the language model.
	3. Sorting: The candidates are sorted by score, and the most probable ones are delivered to the user interface.
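The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the function names, the `limit` and `top_k` parameters, and the toy scorer are hypothetical, and the dictionary is assumed to be loaded as a sorted list so that prefix lookup can use binary search.

```python
from bisect import bisect_left
from typing import Callable, List, Tuple


def prefix_candidates(sorted_words: List[str], prefix: str, limit: int = 50) -> List[str]:
    """Step 1: collect up to `limit` words that start with `prefix`."""
    matches = []
    i = bisect_left(sorted_words, prefix)  # first word >= prefix
    while (
        i < len(sorted_words)
        and len(matches) < limit
        and sorted_words[i].startswith(prefix)
    ):
        matches.append(sorted_words[i])
        i += 1
    return matches


def rank(
    prefix: str,
    sorted_words: List[str],
    score: Callable[[str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Steps 2 and 3: score every candidate, then return the best `top_k`."""
    scored = [(word, score(word)) for word in prefix_candidates(sorted_words, prefix)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # higher score = more probable
    return scored[:top_k]


# Tiny stand-in dictionary built from words shown in this README.
words = sorted(["مەكتەپتىكى", "مەنپەئەتى", "مەردان", "مەرھابانىڭ", "مەركىزىدىكى"])


def toy_score(word: str) -> float:
    """Placeholder scorer (shorter words score higher); NOT a language model."""
    return -float(len(word))


for word, value in rank("مەر", words, toy_score):
    print(word, value)
```

With the `kenlm` Python module installed, the scorer could plausibly be `lambda w: model.score(" ".join(w))` with `model = kenlm.Model("char.bin")`, assuming the model was trained on space-separated characters; that has not been checked against `server.py`. Note that capping candidates at `limit` before scoring can drop good candidates for very short prefixes.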
	---

	## 📂 Repository Contents

	| File | Description |
	| :---------- | :----------------------------------------------------------- |
	| `char.bin` | Core Model: 407 MB binary KenLM model. |
	| `dict.txt` | Dictionary: Massive corpus containing 14,416,068 Uyghur entries. |
	| `server.py` | Linux Server: Flask API for remote scoring and prediction. |
	| `main.py` | Windows Client: Desktop overlay for real-time typing. |
	| `test.py` | Testing Script: CLI script to verify candidate scoring. |

	---

	## 📊 Performance & Case Studies

	### CLI Prediction Test
	When a user types a prefix, the engine returns scored candidates, highest score first:

	Input: `مە`
	- مەكتەپتىكى (Score: -6.69)
	- مەدەنىيلىكنىڭ (Score: -7.67)
	- مەنپەئەتى (Score: -7.70)

	Input: `مەر`
	- مەردان (Score: -6.27)
	- مەرھابانىڭ (Score: -7.18)
	- مەركىزىدىكى (Score: -8.05)
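The scores are negative because they are log probabilities, so a value closer to zero means a more probable candidate. As a rough reading aid, the sketch below converts a score gap into a likelihood ratio. It assumes the values are base-10 logs, which is what KenLM's Python `score` method returns; the README does not say whether `test.py` prints raw KenLM scores, so treat the base as an assumption. The helper name `likelihood_ratio` is hypothetical.

```python
def likelihood_ratio(score_a: float, score_b: float, base: float = 10.0) -> float:
    """Return how many times more probable A is than B, given log-probability scores."""
    return base ** (score_a - score_b)


# Scores reported above for the first input, in ranked order.
scores = [-6.69, -7.67, -7.70]

top = scores[0]
for position, score in enumerate(scores, start=1):
    ratio = likelihood_ratio(top, score)
    print(f"rank {position}: score {score:.2f}, top candidate is {ratio:.1f}x as likely")
```

Under that assumption, a gap of 1.0 between two scores corresponds to a factor of ten in probability.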

	---

	## 🖼️ Input Method Preview (Screenshots)

	The screenshots below show test runs of the input method interface, which renders right-to-left (RTL) Uyghur script.

	<div align="center">
	<img src="img/test-1.jpg" height="100px" alt="Test 1">
	<img src="img/test-2.jpg" height="100px" alt="Test 2">
	<img src="img/test-3.jpg" height="100px" alt="Test 3">
	</div>

	---

	## 🚀 Deployment

	### 1. Backend (Linux Server)
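	Scoring, the computationally heavy part, runs on the Linux server:
	```bash
	# Install Flask and the KenLM Python bindings (built from the upstream source archive)
	pip install flask https://github.com/kpu/kenlm/archive/master.zip
	# Start the scoring API
	python3 server.py
	```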