Rekipjan committed
Commit 06bc836 · verified · 1 Parent(s): ef0d095

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ dict.txt filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,95 @@
- ---
- license: apache-2.0
- ---
+ # Uyghur-Character-Level-KenLM-Input-Method (سىناق نۇسخىسى)
+
+ An input prediction engine designed for the **Uyghur language**. It combines corpus-based prefix search with a **KenLM (n-gram)** language model to map Latin keystrokes to Uyghur script in real time and rank the resulting candidates by probability.
+
+ ```text-html
+ <p dir="rtl" align="right">ئۇيغۇر تىلى ئۈچۈن مەخسۇس لايىھەلەنگەن ئەقلىي كىرگۈزۈش سىناق تۈرى</p>
+ ```
+
+ ---
+
+ ## 🌟 Why Character-Level?
+
+ Uyghur is a highly **agglutinative language**: a single word root can produce dozens of surface forms as suffixes are attached.
+
+ - **Example**:
+   - `مەك-تەپ` (School)
+   - `مەك-تەپ-لى-رى-مىز` (Our schools)
+
+ Word-level n-gram models tend to suffer from data sparsity and poor generalization in Uyghur, because every inflected form counts as a separate token. This project uses a **character-level n-gram** approach instead:
+ - **Root & Suffix Learning**: The model learns the relationships between word roots and their suffixes from character sequences.
+ - **Better Generalization**: It handles out-of-vocabulary (OOV) words more gracefully than a word-level model.
+ - **Stability**: It gives more reliable completions for the phonetic and morphological structure of Uyghur.
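To make the character-level idea concrete, the sketch below counts the character trigrams that a root shares with one of its inflected forms. It is an illustration only: `char_ngrams` is a hypothetical helper that does not exist in this repository, and the README does not say how `char.bin` was tokenized during training.

```python
from collections import Counter


def char_ngrams(word: str, n: int) -> list:
    """Return all contiguous character n-grams of `word` (hypothetical helper)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


root = "مەكتەپ"              # "school"
inflected = root + "لىرىمىز"  # "our schools": the same root plus suffixes

root_trigrams = Counter(char_ngrams(root, 3))
shared = root_trigrams & Counter(char_ngrams(inflected, 3))

# Every trigram of the root also occurs in the inflected form, so statistics
# gathered from one form carry over to the other.
print(len(shared), "of", len(root_trigrams), "root trigrams are shared")
```

A word-level model would treat the two forms as unrelated tokens, which is the sparsity problem described above.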
+
+ ---
+
+ ## 🧠 System Architecture
+
+ The project functions as a **ranking/scoring engine** that orders candidates by:
+ $$P(\text{candidate} \mid \text{context})$$
+
+ 1. **Candidate Generation**: The system retrieves the words in the dictionary (`dict.txt`) that start with the user's input prefix.
+ 2. **KenLM Scoring**: The character-level KenLM model (`char.bin`) acts as the "brain" and assigns each candidate a log-probability score. In the scripts in this commit, each candidate is scored on its own, without preceding context.
+ 3. **Sorting**: The candidates are sorted by score and the most probable ones are delivered to the user interface.
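The three steps above can be sketched as a single function with a pluggable scorer. This is a minimal sketch, not the repository's implementation: `rank_candidates`, `toy_score`, and the Latin-script toy vocabulary are made up for illustration, and the scorer is a stand-in for the KenLM model.

```python
import heapq
from typing import Callable, Iterable, List, Tuple


def rank_candidates(
    prefix: str,
    vocabulary: Iterable[str],
    score: Callable[[str], float],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Generate, score, and sort candidates for `prefix` (hypothetical helper)."""
    if not prefix:
        return []
    # 1. Candidate generation: distinct words that start with the prefix.
    candidates = {w for w in vocabulary if w.startswith(prefix)}
    # 2. Scoring: a higher score means a more probable candidate.
    scored = ((w, score(w)) for w in candidates)
    # 3. Sorting: keep only the top_k best-scoring candidates.
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


def toy_score(word: str) -> float:
    """Stand-in for a language model: shorter words score higher."""
    return -float(len(word))


vocab = ["salam", "sala", "sal", "kitab", "salamet"]
print(rank_candidates("sal", vocab, toy_score, top_k=2))
# prints [('sal', -3.0), ('sala', -4.0)]
```

With the real model, the `score` method of a loaded `kenlm.Model` could be passed as `score` in place of `toy_score`.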
+
+ ---
+
+ ## 📂 Repository Contents
+
+ | File | Description |
+ | :---------- | :----------------------------------------------------------- |
+ | `char.bin` | **Core Model**: Binary KenLM model (406,955,411 bytes, about 388 MiB). |
+ | `dict.txt` | **Dictionary**: Uyghur word list (14,416,068 bytes, about 14 MB). |
+ | `linux-server/server.py` | **Linux Server**: FastAPI service for remote scoring and prediction. |
+ | `windows-client/main.py` | **Windows Client**: Desktop overlay for real-time typing. |
+ | `test.py` | **Testing Script**: CLI script to verify candidate scoring. |
+
+ ---
+
+ ## 📊 Performance & Case Studies
+
+ ### CLI Prediction Test
+ When a user types a prefix, the engine returns scored candidates. Scores are log probabilities, so the least negative score ranks first:
+
+ **Input: `مە`**
+ - مەكتەپتىكى (Score: -6.69)
+ - مەدەنىيلىكنىڭ (Score: -7.67)
+ - مەنپەئەتى (Score: -7.70)
+
+ **Input: `مەر`**
+ - مەردان (Score: -6.27)
+ - مەرھابانىڭ (Score: -7.18)
+ - مەركىزىدىكى (Score: -8.05)
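To read these numbers: the KenLM Python wrapper reports scores as base-10 log probabilities, so a gap of 1.0 between two candidates corresponds to a factor of 10 in probability. The sketch below applies that conversion to the scores listed above. The helper names are made up for illustration.

```python
def to_probability(log10_score: float) -> float:
    """Convert a base-10 log probability into a plain probability."""
    return 10.0 ** log10_score


def likelihood_ratio(better: float, worse: float) -> float:
    """How many times more probable the first score is than the second."""
    return 10.0 ** (better - worse)


# Scores copied from the `مەر` example above (first and last candidate).
first, last = -6.27, -8.05

print(f"P(first) is about {to_probability(first):.2e}")
print(f"first is about {likelihood_ratio(first, last):.0f}x more probable than last")
```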
+
+ ---
+
+ ## 🖼️ Input Method Preview (Screenshots)
+
+ The screenshots below show the input method interface in use. It renders RTL (right-to-left) Uyghur script correctly.
+
+ | General Input | General Input | General Input |
+ | :-----------------------: | :-----------------------: | :-----------------------: |
+ | ![Test 1](img/test-1.jpg) | ![Test 2](img/test-2.jpg) | ![Test 3](img/test-3.jpg) |
+
+ ---
+
+ ## 🚀 Deployment
+
+ ### 1. Backend (Linux Server)
+ Scoring is the computationally heavy part and runs on Linux:
+ ```bash
+ pip install fastapi uvicorn https://github.com/kpu/kenlm/archive/master.zip
+ uvicorn server:app --host 0.0.0.0 --port 5000
+ ```
char.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f7044c70da54582afac8e5357a3d175a6c9acc9fa1dfa00a4f274399cfd06f84
+ size 406955411
dict.txt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bbc5a5085c4fedc2b35ca00176194c3f004f0049b680a03482fb31938662a370
+ size 14416068
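The `size` field of a Git LFS pointer is the size of the real file in bytes, which makes it easy to check the file sizes quoted elsewhere in this commit. A small sketch, with `parse_lfs_pointer` as a hypothetical helper, using the `dict.txt` pointer above:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the `key value` lines of a Git LFS pointer file (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:bbc5a5085c4fedc2b35ca00176194c3f004f0049b680a03482fb31938662a370
size 14416068
"""

info = parse_lfs_pointer(pointer)
size_bytes = int(info["size"])
print(f"{size_bytes:,} bytes = {size_bytes / 1024 ** 2:.1f} MiB")
```

Applied to the `char.bin` pointer, the same calculation gives roughly 388 MiB.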
img/test-1.jpg ADDED
img/test-2.jpg ADDED
img/test-3.jpg ADDED
linux-server/run.txt ADDED
@@ -0,0 +1 @@
+ uvicorn server:app --host 0.0.0.0 --port 5000
linux-server/server.py ADDED
@@ -0,0 +1,53 @@
+ from fastapi import FastAPI
+ from pydantic import BaseModel
+ import kenlm
+
+ app = FastAPI()
+
+ # Load the binary KenLM model once at startup.
+ lm = kenlm.Model("char.bin")
+
+ # Load the corpus once at startup, one entry per line.
+ # NOTE: "1.txt" is not part of this commit; test.py and the README use dict.txt.
+ with open("1.txt", encoding="utf-8") as f:
+     CORPUS = f.read().splitlines()
+
+
+ class Query(BaseModel):
+     text: str
+
+
+ def generate_candidates(prefix, max_n=100):
+     """Return up to max_n distinct corpus words that start with prefix."""
+     cands = []
+
+     for line in CORPUS:
+         words = line.split()
+
+         for w in words:
+             if w.startswith(prefix):
+                 cands.append(w)
+
+     # set() removes duplicates but does not preserve order, so which
+     # max_n words survive the cut is arbitrary.
+     return list(set(cands))[:max_n]
+
+
+ @app.post("/predict")
+ def predict(q: Query):
+     prefix = q.text.strip()
+
+     # An empty prefix would match every word in the corpus.
+     if not prefix:
+         return {"candidates": []}
+
+     candidates = generate_candidates(prefix)
+
+     scored = []
+
+     for c in candidates:
+         # kenlm.Model.score returns a log10 probability.
+         score = lm.score(c)
+
+         scored.append({
+             "word": c,
+             "score": score
+         })
+
+     # Highest (least negative) score first.
+     scored.sort(key=lambda x: x["score"], reverse=True)
+
+     return {
+         "candidates": scored[:5]
+     }
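`generate_candidates` scans the whole corpus on every request. One possible alternative, sketched here under the assumption that the word list fits in memory and rarely changes, is to sort the distinct words once and use binary search to jump to the first match. `build_index` and `prefix_range` are hypothetical names, not part of `server.py`.

```python
from bisect import bisect_left
from typing import Iterable, List


def build_index(lines: Iterable[str]) -> List[str]:
    """Sorted list of the distinct words in whitespace-separated lines."""
    return sorted({w for line in lines for w in line.split()})


def prefix_range(index: List[str], prefix: str, max_n: int = 100) -> List[str]:
    """Return up to max_n words from the sorted `index` that start with `prefix`."""
    if not prefix:
        return []
    # In a sorted list, all words sharing a prefix are contiguous and
    # begin at the first position that is >= the prefix itself.
    i = bisect_left(index, prefix)
    out = []
    while i < len(index) and len(out) < max_n and index[i].startswith(prefix):
        out.append(index[i])
        i += 1
    return out


index = build_index(["salam sala", "kitab sal salamet", "sal"])
print(prefix_range(index, "sala"))
# prints ['sala', 'salam', 'salamet']
```

Unlike `list(set(cands))[:max_n]`, this returns the same candidates in the same order on every call. The cut at `max_n` is still alphabetical rather than by score.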
test.py ADDED
@@ -0,0 +1,36 @@
+ import kenlm
+
+ lm = kenlm.Model("char.bin")
+
+ with open("dict.txt", encoding="utf-8") as f:
+     CORPUS = f.read().splitlines()
+
+ def generate_candidates(prefix, max_n=50):
+     """Return up to max_n distinct words from dict.txt that start with prefix."""
+     cands = []
+
+     for line in CORPUS:
+         if prefix in line:  # or use startswith as an optimization
+             words = line.split()
+             for w in words:
+                 if w.startswith(prefix):
+                     cands.append(w)
+
+     return list(set(cands))[:max_n]
+
+
+ def predict(prefix):
+     """Score every candidate and return the five best (word, score) pairs."""
+     candidates = generate_candidates(prefix)
+
+     scored = []
+     for c in candidates:
+         scored.append((c, lm.score(c)))
+
+     return sorted(scored, key=lambda x: x[1], reverse=True)[:5]
+
+
+ while True:
+     p = input("prefix: ")
+     res = predict(p)
+
+     print("\nCandidates:")
+     for w, s in res:
+         print(w, s)
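Both scripts pass the candidate string to `lm.score` unchanged. KenLM splits its input on whitespace, so a word without spaces is scored as a single token. If `char.bin` was trained on text with a space between characters (an assumption; the commit does not say how the model was built), each candidate would need the same treatment before scoring. A sketch with hypothetical helpers and a stand-in scorer:

```python
from typing import Callable


def to_char_tokens(word: str) -> str:
    """Insert a space between characters so each one becomes its own token."""
    return " ".join(word)


def score_word(score: Callable[[str], float], word: str, char_level: bool) -> float:
    """Score `word` as one token, or as a sequence of character tokens."""
    return score(to_char_tokens(word) if char_level else word)


def count_tokens(sentence: str) -> float:
    """Stand-in scorer: reports how many whitespace-separated tokens it saw."""
    return float(len(sentence.split()))


print(score_word(count_tokens, "salam", char_level=False))  # scored as one token
print(score_word(count_tokens, "salam", char_level=True))   # one token per character
```

With the real model, the `score` method of the loaded `kenlm.Model` would take the place of `count_tokens`.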
windows-client/api.py ADDED
@@ -0,0 +1,23 @@
+ # api.py
+ import requests
+
+ SERVER = "http://192.168.10.3:5000/predict"
+
+ def fetch_candidates(text):
+     """Ask the scoring server for candidates; return a list of words ([] on failure)."""
+     print(f"API request text: {text}")
+     if not text.strip():
+         return []
+     try:
+         # Timeout raised to 2 seconds. If there is still no result after
+         # 2 seconds, backend inference is too slow.
+         r = requests.post(
+             SERVER,
+             json={"text": text},
+             timeout=2.0
+         )
+         if r.status_code == 200:
+             data = r.json()
+             print(f"API response data: {data}")
+             return [x["word"] for x in data.get("candidates", [])]
+     except Exception as e:
+         print(f"API Error: {e}")
+     return []
windows-client/main.py ADDED
@@ -0,0 +1,132 @@
+ import sys
+ import keyboard
+ from PySide6.QtWidgets import QApplication
+ from PySide6.QtCore import QThread, Signal, QTimer, QObject
+
+ import state
+ from api import fetch_candidates
+ from ui import CandidateWindow
+
+ # --- Core fix: signal bridge class ---
+ class EventBridge(QObject):
+     """Safely forwards keyboard events from the hook thread to the main thread."""
+     request_refresh_sig = Signal()
+     reset_ui_sig = Signal()
+
+ # --- Asynchronous request thread ---
+ class FetchThread(QThread):
+     result_ready = Signal(list)
+
+     def __init__(self, text):
+         super().__init__()
+         self.text = text
+
+     def run(self):
+         # Perform the slow network request
+         res = fetch_candidates(self.text)
+         self.result_ready.emit(res)
+
+ class IMEController:
+     def __init__(self):
+         self.app = QApplication(sys.argv)
+         self.window = CandidateWindow()
+         self.bridge = EventBridge()
+         self.worker = None
+
+         # Debounce timer: must be created in the main thread
+         self.debounce_timer = QTimer()
+         self.debounce_timer.setSingleShot(True)
+
+         # Signal wiring: ensures all UI/timer operations run in the main thread
+         self.bridge.request_refresh_sig.connect(self._handle_refresh_request)
+         self.bridge.reset_ui_sig.connect(self.reset_ime)
+         self.debounce_timer.timeout.connect(self._start_fetch)
+
+     def on_key_event(self, event):
+         """Keyboard hook callback (runs outside the main thread)."""
+         if event.event_type == "up":
+             return True
+
+         # event.name can be None for keys the library cannot name
+         name = event.name or ""
+
+         # 1. Letter input
+         if len(name) == 1 and name.isalpha():
+
+             state.buffer += name
+             self.bridge.request_refresh_sig.emit()  # safe hand-off to the main thread
+             return False
+
+         # 2. Backspace
+         elif name == "backspace":
+             if state.buffer:
+                 state.buffer = state.buffer[:-1]
+                 self.bridge.request_refresh_sig.emit()
+                 return False
+             return True
+
+         # 3. Select a candidate (a tuple avoids substring matches such as "12")
+         elif name == "space" or (name in ("1", "2", "3", "4", "5") and state.buffer):
+             if state.candidates:
+                 idx = 0 if name == "space" else int(name) - 1
+                 if idx < len(state.candidates):
+                     keyboard.write(state.candidates[idx])
+                     self.bridge.reset_ui_sig.emit()
+                     return False
+             return True
+
+         # 4. Confirm / cancel
+         elif name in ["enter", "esc"]:
+             if state.buffer:
+                 if name == "enter":
+                     keyboard.write(state.buffer)
+                 self.bridge.reset_ui_sig.emit()
+                 return False
+             return True
+
+         return True
+
+
+     def _handle_refresh_request(self):
+         """Main-thread logic: control the debounce timer."""
+         if not state.buffer:
+             self.reset_ime()
+             return
+         # QTimer.start() only works without errors when called from the
+         # main thread; wait 70 ms after the last keystroke before fetching.
+         self.debounce_timer.start(70)
+
+     def _start_fetch(self):
+         """Main-thread logic: start fetching candidates asynchronously."""
+         if self.worker and self.worker.isRunning():
+             self.worker.terminate()
+             self.worker.wait()
+
+         self.worker = FetchThread(state.buffer)
+         self.worker.result_ready.connect(self._on_data_received)
+         self.worker.start()
+
+     def _on_data_received(self, candidates):
+         """Main-thread logic: update the UI contents."""
+         state.candidates = candidates
+         if candidates:
+             self.window.update_candidates(candidates)
+         else:
+             self.window.hide()
+
+     def reset_ime(self):
+         """Main-thread logic: reset the state and hide the UI."""
+         state.buffer = ""
+         state.candidates = []
+         self.window.hide()
+         if self.debounce_timer.isActive():
+             self.debounce_timer.stop()
+
+     def run(self):
+         # Install the global hook. Note: the keyboard library only honors the
+         # callback's return value (to block a key) when hook() is called
+         # with suppress=True.
+         keyboard.hook(self.on_key_event)
+         # Enter the Qt event loop
+         sys.exit(self.app.exec())
+
+ if __name__ == "__main__":
+     # Note: running this script usually requires administrator/root privileges
+     ime = IMEController()
+     ime.run()
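The key-handling rules in `on_key_event` are easier to test once they are separated from the global hook and the Qt signals. The sketch below restates them as a pure function over an explicit state object. It is a simplified illustration, not a drop-in replacement: `ImeState` and `handle_key` are hypothetical names, and committing text is represented by a return value instead of `keyboard.write`.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ImeState:
    buffer: str = ""
    candidates: List[str] = field(default_factory=list)


def handle_key(state: ImeState, name: str) -> Tuple[bool, Optional[str]]:
    """Process one key-down event.

    Returns (pass_through, committed_text): pass_through is True when the key
    should reach the application; committed_text is the text to type, if any.
    """
    if len(name) == 1 and name.isalpha():
        state.buffer += name                      # 1. letter: extend the buffer
        return False, None
    if name == "backspace" and state.buffer:
        state.buffer = state.buffer[:-1]          # 2. backspace: shrink the buffer
        return False, None
    if state.buffer and (name == "space" or name in ("1", "2", "3", "4", "5")):
        idx = 0 if name == "space" else int(name) - 1
        if idx < len(state.candidates):           # 3. commit the chosen candidate
            text = state.candidates[idx]
            state.buffer, state.candidates = "", []
            return False, text
        return True, None
    if state.buffer and name in ("enter", "esc"):
        text = state.buffer if name == "enter" else None
        state.buffer, state.candidates = "", []   # 4. confirm raw buffer / cancel
        return False, text
    return True, None


s = ImeState()
handle_key(s, "m")
s.candidates = ["first", "second"]
print(handle_key(s, "2"))
# prints (False, 'second')
```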
windows-client/state.py ADDED
@@ -0,0 +1,3 @@
+ buffer = ""
+ candidates = []
+ #ime_enabled = True
windows-client/ui.py ADDED
@@ -0,0 +1,66 @@
+ # ui.py
+ from PySide6.QtWidgets import QWidget, QLabel, QVBoxLayout
+ from PySide6.QtCore import Qt
+ import win32gui
+
+ class CandidateWindow(QWidget):
+     def __init__(self):
+         super().__init__()
+
+         # Window attributes: always on top, frameless, does not steal focus
+         self.setWindowFlags(
+             Qt.Tool |
+             Qt.FramelessWindowHint |
+             Qt.WindowStaysOnTopHint |
+             Qt.WindowDoesNotAcceptFocus
+         )
+         self.setAttribute(Qt.WA_TranslucentBackground)
+
+         self.layout = QVBoxLayout(self)
+         self.layout.setContentsMargins(0, 0, 0, 0)
+
+         self.label = QLabel("")
+         self.label.setStyleSheet("""
+             QLabel {
+                 background-color: rgba(40, 44, 52, 240);
+                 color: #abb2bf;
+                 font-family: "Microsoft YaHei UI";
+                 font-size: 18px;
+                 padding: 8px 15px;
+                 border: 1px solid #61afef;
+                 border-radius: 6px;
+             }
+         """)
+         self.layout.addWidget(self.label)
+
+     def get_caret_pos(self):
+         """Try to get the caret position in screen coordinates."""
+         try:
+             gui_info = win32gui.GetGUIThreadInfo(0)
+             hwnd = gui_info['hwndFocus']
+             rect = gui_info['rcCaret']
+             # Convert client coordinates to screen coordinates
+             p = win32gui.ClientToScreen(hwnd, (rect[0], rect[1]))
+             return p[0], p[1]
+         except Exception:
+             return None
+
+     def update_candidates(self, candidates):
+         if not candidates:
+             self.hide()
+             return
+
+         # Display format: 1.word A  2.word B
+         items = [f"<span style='color:#61afef;'>{i+1}.</span>{c}" for i, c in enumerate(candidates)]
+         self.label.setText(f"<html>{'&nbsp;&nbsp;'.join(items)}</html>")
+         self.adjustSize()
+
+         pos = self.get_caret_pos()
+         if pos:
+             self.move(pos[0], pos[1] + 28)  # move below the caret
+         else:
+             # Fallback: show near the mouse or at the screen center
+             pass
+
+         self.show()
+         self.raise_()
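The fallback branch in `update_candidates` is left as `pass`. One way to fill it in, sketched here as a pure function so it can be checked without Qt, is to center the window when no caret position is available and to clamp the result so the window stays on screen. `place_window` is a hypothetical helper. The 28-pixel offset matches the value used in `update_candidates`, and the window and screen sizes in the example are assumptions.

```python
from typing import Optional, Tuple


def place_window(
    caret: Optional[Tuple[int, int]],
    win_size: Tuple[int, int],
    screen_size: Tuple[int, int],
    offset_y: int = 28,
) -> Tuple[int, int]:
    """Return the top-left (x, y) for the candidate window (hypothetical helper)."""
    win_w, win_h = win_size
    screen_w, screen_h = screen_size
    if caret is None:
        # Fallback: center the window on the screen.
        x, y = (screen_w - win_w) // 2, (screen_h - win_h) // 2
    else:
        # Normal case: just below the caret.
        x, y = caret[0], caret[1] + offset_y
    # Clamp so the window never extends past the screen edges.
    x = max(0, min(x, screen_w - win_w))
    y = max(0, min(y, screen_h - win_h))
    return x, y


print(place_window(None, (200, 40), (1920, 1080)))
# prints (860, 520)
```

Inside `update_candidates`, the result could be passed to `self.move(x, y)`, with the window size taken after `adjustSize()`.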