Upload folder using huggingface_hub

Files changed (14)

.gitattributes +1 -0
README.md +95 -3
char.bin +3 -0
dict.txt +3 -0
img/test-1.jpg +0 -0
img/test-2.jpg +0 -0
img/test-3.jpg +0 -0
linux-server/run.txt +1 -0
linux-server/server.py +53 -0
test.py +36 -0
windows-client/api.py +23 -0
windows-client/main.py +132 -0
windows-client/state.py +3 -0
windows-client/ui.py +66 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text

+dict.txt filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,3 +1,95 @@
----
-license: apache-2.0
----

+# Uyghur-Character-Level-KenLM-Input-Method (سىناق نۇسخىسى)
+An intelligent input prediction engine specifically designed for the **Uyghur language**. It combines traditional corpus-based prefix searching with high-performance **KenLM (N-gram)** language models to achieve real-time mapping from Latin characters to Uyghur script with probabilistic ranking.
+```text-html
+<p dir="rtl" align="right">ئۇيغۇر تىلى ئۈچۈن مەخسۇس لايىھەلەنگەن ئەقلىي كىرگۈزۈش سىناق تۈرى</p>
+```
+---
+## 🌟 Why Character-Level?
+Uyghur is a highly **agglutinative language**: a single root can produce dozens of surface forms through the addition of suffixes.
+- **Example**:
+  - `مەك-تەپ` (School)
+  - `مەك-تەپ-لى-رى-مىز` (Our schools)
+Word-level n-gram models tend to suffer from data sparsity and poor generalization on Uyghur, because each inflected form counts as a separate vocabulary item. This project instead uses a **character-level n-gram** model:
+- **Root & Suffix Learning**: Automatically learns the relationships between word roots and various suffixes.
+- **Superior Generalization**: Handles "Out-of-Vocabulary" (OOV) words more effectively than word-level models.
+- **Stability**: Provides more reliable completion results for the unique phonetic and morphological structure of Uyghur.
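The overlap between a root and its inflected forms can be made concrete with character n-grams. The following is a minimal sketch, not code from the repository: the helper `char_ngrams` is a hypothetical name, and the trigram order is chosen only for illustration.

```python
def char_ngrams(word: str, n: int) -> list[str]:
    """Return every run of n consecutive characters in word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


if __name__ == "__main__":
    # Forms from the README example, with the syllable hyphens removed.
    root = "مەك-تەپ".replace("-", "")
    inflected = "مەك-تەپ-لى-رى-مىز".replace("-", "")

    root_trigrams = set(char_ngrams(root, 3))
    inflected_trigrams = set(char_ngrams(inflected, 3))

    # Every trigram of the root also occurs in the inflected form, so counts
    # learned from one form carry over to the other.
    print(root_trigrams <= inflected_trigrams)
    # Number of trigrams contributed only by the suffixes.
    print(len(inflected_trigrams - root_trigrams))
```

A word-level model would treat the two forms as unrelated vocabulary items, while at the character level the longer form reuses all of the root's statistics.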
+---
+## 🧠 System Architecture
+The project functions as a **Ranking/Scoring Engine** based on the formula:
+$$P(\text{candidate} \mid \text{context})$$
+1. **Candidate Generation**: The system retrieves words from the dictionary (`dict.txt`, about 14 MB) that start with the user's input prefix.
+2. **KenLM Scoring**: The Character-level KenLM (`char.bin`) acts as the "Brain," scoring each candidate based on linguistic probability.
+3. **Sorting**: The most probable candidates are delivered to the user interface.
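The three steps can be sketched as one small function. This is an illustration under stated assumptions rather than the repository's implementation: `rank_candidates` is a hypothetical name, the scorer is passed in as a plain callable so the sketch runs without KenLM, and the toy vocabulary uses Latin placeholders.

```python
from typing import Callable, Iterable


def rank_candidates(
    prefix: str,
    vocabulary: Iterable[str],
    score: Callable[[str], float],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Generate candidates by prefix, score them, and return the best top_k."""
    # Step 1: candidate generation (deduplicated prefix match).
    candidates = {w for w in vocabulary if w.startswith(prefix)}
    # Step 2: scoring; a higher (less negative) log-probability is better.
    scored = [(w, score(w)) for w in candidates]
    # Step 3: sort by score, breaking ties alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


if __name__ == "__main__":
    # Toy stand-in for the language model: shorter words score higher.
    def toy_score(word: str) -> float:
        return -float(len(word))

    vocab = ["mektep", "mekteplirimiz", "merdan", "kitab", "mektep"]
    print(rank_candidates("mek", vocab, toy_score))
```

Keeping the scorer as a parameter makes it easy to swap the toy function for a real model call later.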
+---
+## 📂 Repository Contents
+| File        | Description                                                  |
+| :---------- | :----------------------------------------------------------- |
+| `char.bin`  | **Core Model**: 389MB Binary KenLM model.                    |
+| `dict.txt`  | **Dictionary**: Uyghur word list, 14,416,068 bytes (about 14 MB). |
+| `linux-server/server.py` | **Linux Server**: FastAPI service for remote scoring and prediction. |
+| `windows-client/main.py` | **Windows Client**: Desktop overlay for real-time typing.    |
+| `test.py`   | **Testing Script**: CLI script to verify candidate scoring.  |
+---
+## 📊 Performance & Case Studies
+### CLI Prediction Test
+When a user types a prefix, the engine returns the highest-scoring candidates. Scores are KenLM log10 probabilities, so values closer to zero rank higher:
+**Input: `مە`**
+- مەكتەپتىكى (Score: -6.69)
+- مەدەنىيلىكنىڭ (Score: -7.67)
+- مەنپەئەتى (Score: -7.70)
+**Input: `مەر`**
+- مەردان (Score: -6.27)
+- مەرھابانىڭ (Score: -7.18)
+- مەركىزىدىكى (Score: -8.05)
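Because the scores are base-10 logarithms, a difference between two scores corresponds to a probability ratio. The helpers below are a small worked example; `log10_to_probability` and `odds_ratio` are hypothetical names, and the two scores are copied from the `مەر` case above.

```python
def log10_to_probability(score: float) -> float:
    """Convert a log10 probability back to a plain probability."""
    return 10.0 ** score


def odds_ratio(score_a: float, score_b: float) -> float:
    """How many times more probable candidate a is than candidate b."""
    return 10.0 ** (score_a - score_b)


if __name__ == "__main__":
    first, second = -6.27, -7.18  # top two scores for the second prefix
    print(log10_to_probability(first))
    print(odds_ratio(first, second))
```

The absolute probabilities are tiny, which is expected for whole-word sequences; only the relative order matters for ranking.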
+---
+## 🖼️ Input Method Preview (Screenshots)
+Below are the test cases for the input method interface. It supports seamless rendering of RTL (Right-to-Left) Uyghur script.
+|       General Input       |       General Input       |       General Input       |
+| :-----------------------: | :-----------------------: | :-----------------------: |
+| ![Test 1](img/test-1.jpg) | ![Test 2](img/test-2.jpg) | ![Test 3](img/test-3.jpg) |
+|                           |                           |                           |
+---
+## 🚀 Deployment
+### 1. Backend (Linux Server)
+Scoring runs on the Linux server, which is a FastAPI app served by uvicorn:
+```bash
+pip install fastapi uvicorn https://github.com/kpu/kenlm/archive/master.zip
+uvicorn server:app --host 0.0.0.0 --port 5000
+```

char.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f7044c70da54582afac8e5357a3d175a6c9acc9fa1dfa00a4f274399cfd06f84
+size 406955411

dict.txt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bbc5a5085c4fedc2b35ca00176194c3f004f0049b680a03482fb31938662a370
+size 14416068

img/test-1.jpg ADDED

img/test-2.jpg ADDED

img/test-3.jpg ADDED

linux-server/run.txt ADDED

	@@ -0,0 +1 @@


+uvicorn server:app --host 0.0.0.0 --port 5000

linux-server/server.py ADDED

	@@ -0,0 +1,53 @@

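+from fastapi import FastAPI
+from pydantic import BaseModel
+import kenlm
+app = FastAPI()
+# Character-level KenLM model used to rank candidates.
+lm = kenlm.Model("char.bin")
+# Word list for prefix lookup; the repository ships it as dict.txt.
+with open("dict.txt", encoding="utf-8") as f:
+    CORPUS = f.read().splitlines()
+class Query(BaseModel):
+    text: str
+def generate_candidates(prefix, max_n=100):
+    """Return up to max_n distinct corpus words that start with prefix."""
+    cands = set()
+    for line in CORPUS:
+        for w in line.split():
+            if w.startswith(prefix):
+                cands.add(w)
+    # Sort before truncating so the same prefix always yields the same candidates.
+    return sorted(cands)[:max_n]
+@app.post("/predict")
+def predict(q: Query):
+    prefix = q.text.strip()
+    # An empty prefix would match every word in the corpus.
+    if not prefix:
+        return {"candidates": []}
+    scored = [
+        {"word": c, "score": lm.score(c)}  # log10 probability from KenLM
+        for c in generate_candidates(prefix)
+    ]
+    # Higher (less negative) scores are more probable.
+    scored.sort(key=lambda x: x["score"], reverse=True)
+    return {"candidates": scored[:5]}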
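The linear scan in `generate_candidates` touches every word on every request. One possible alternative, sketched here as a suggestion and not as the author's method, is to sort the distinct words once at startup and use binary search to jump to the first match. The names `build_index` and `prefix_candidates` are hypothetical.

```python
from bisect import bisect_left


def build_index(lines):
    """Collect the distinct words from whitespace-separated lines, sorted."""
    return sorted({w for line in lines for w in line.split()})


def prefix_candidates(index, prefix, max_n=100):
    """Return up to max_n words from the sorted index that start with prefix."""
    if not prefix:
        return []
    # In a sorted list, all words sharing a prefix sit in one contiguous run
    # that begins at the insertion point of the prefix itself.
    start = bisect_left(index, prefix)
    out = []
    for w in index[start:start + max_n]:
        if not w.startswith(prefix):
            break
        out.append(w)
    return out


if __name__ == "__main__":
    idx = build_index(["mektep merdan", "mekteplirimiz kitab mektep"])
    print(prefix_candidates(idx, "mek"))
```

The lookup cost then depends on the number of matches returned instead of the size of the word list, at the price of holding a sorted copy in memory.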

test.py ADDED

	@@ -0,0 +1,36 @@

+import kenlm
+lm = kenlm.Model("char.bin")
+with open("dict.txt", encoding="utf-8") as f:
+    CORPUS = f.read().splitlines()
+def generate_candidates(prefix, max_n=50):
+    """Return up to max_n distinct corpus words that start with prefix."""
+    cands = set()
+    for line in CORPUS:
+        if prefix in line:  # cheap substring pre-filter before splitting the line
+            for w in line.split():
+                if w.startswith(prefix):
+                    cands.add(w)
+    # Sort before truncating so results are reproducible across runs.
+    return sorted(cands)[:max_n]
+def predict(prefix):
+    """Score each candidate with KenLM and return the five best (word, score) pairs."""
+    scored = [(c, lm.score(c)) for c in generate_candidates(prefix)]
+    return sorted(scored, key=lambda x: x[1], reverse=True)[:5]
+if __name__ == "__main__":
+    while True:
+        p = input("prefix: ").strip()
+        if not p:  # an empty line ends the session
+            break
+        print("\nCandidates:")
+        for w, s in predict(p):
+            print(w, s)
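One point worth checking when reading these scripts: KenLM's Python `score` splits its input on whitespace, so a word passed without spaces is scored as a single token. If `char.bin` was trained on space-separated characters (an assumption, since the training setup is not part of this commit), each candidate would need to be spaced out first. The sketch below shows the idea with a stand-in model; `FakeLM` and `score_as_characters` are hypothetical names.

```python
class FakeLM:
    """Stand-in for kenlm.Model so the sketch runs without the model file."""

    def score(self, sentence: str, bos: bool = True, eos: bool = True) -> float:
        # Pretend each whitespace-separated token costs one log10 unit.
        return -float(len(sentence.split()))


def score_as_characters(lm, word: str) -> float:
    """Score word with one token per character instead of one token per word."""
    return lm.score(" ".join(word), bos=True, eos=True)


if __name__ == "__main__":
    lm = FakeLM()
    print(lm.score("mektep"))                 # the whole word is a single token
    print(score_as_characters(lm, "mektep"))  # six tokens, one per character
```

With a real model, `lm` would be the `kenlm.Model` instance; whether the spaced form is the right input depends on how the training corpus was tokenized.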

windows-client/api.py ADDED

	@@ -0,0 +1,23 @@

+# api.py
+import requests
+SERVER = "http://192.168.10.3:5000/predict"
+def fetch_candidates(text):
+    print(f"API request text: {text}")
+    if not text.strip():
+        return []
+    try:
+        # Timeout raised to 2 seconds; if no result arrives by then, backend inference is too slow.
+        r = requests.post(
+            SERVER,
+            json={"text": text},
+            timeout=2.0
+        )
+        if r.status_code == 200:
+            data = r.json()
+            print(f"API response data: {data}")
+            return [x["word"] for x in data.get("candidates", [])]
+    except Exception as e:
+        print(f"API Error: {e}")
+    return []
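The client depends on the response shape produced by the server's `/predict` route: a `candidates` list of objects with `word` and `score` fields. A small parsing helper makes that contract explicit and can be exercised without a running server. This is a sketch; `extract_words` is a hypothetical name and the sample body is made up.

```python
import json


def extract_words(payload: dict, limit: int = 5) -> list[str]:
    """Pull the candidate words out of a /predict response body."""
    items = payload.get("candidates", [])
    # Skip malformed entries instead of raising KeyError.
    return [item["word"] for item in items[:limit] if "word" in item]


if __name__ == "__main__":
    # Hypothetical response body; the words and scores are placeholders.
    body = (
        '{"candidates": ['
        '{"word": "mektep", "score": -6.5}, '
        '{"word": "merdan", "score": -7.0}]}'
    )
    print(extract_words(json.loads(body)))
```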

windows-client/main.py ADDED

	@@ -0,0 +1,132 @@

+import sys
+import keyboard
+from PySide6.QtWidgets import QApplication
+from PySide6.QtCore import QThread, Signal, QTimer, QObject
+import state
+from api import fetch_candidates
+from ui import CandidateWindow
+# --- Core fix: signal bridge class ---
+class EventBridge(QObject):
+    """Safely forwards keyboard events from the hook thread to the main thread."""
+    request_refresh_sig = Signal()
+    reset_ui_sig = Signal()
+# --- Asynchronous request thread ---
+class FetchThread(QThread):
+    result_ready = Signal(list)
+    def __init__(self, text):
+        super().__init__()
+        self.text = text
+    def run(self):
+        # Perform the slow network request off the main thread.
+        res = fetch_candidates(self.text)
+        self.result_ready.emit(res)
+class IMEController:
+    def __init__(self):
+        self.app = QApplication(sys.argv)
+        self.window = CandidateWindow()
+        self.bridge = EventBridge()
+        self.worker = None
+        # Debounce timer: must be created on the main thread
+        self.debounce_timer = QTimer()
+        self.debounce_timer.setSingleShot(True)
+        # Signal connections: ensure all UI/timer operations run on the main thread
+        self.bridge.request_refresh_sig.connect(self._handle_refresh_request)
+        self.bridge.reset_ui_sig.connect(self.reset_ime)
+        self.debounce_timer.timeout.connect(self._start_fetch)
+    def on_key_event(self, event):
+        """Keyboard hook callback (runs outside the main thread)."""
+        if event.event_type == "up":
+            return True
+        name = event.name or ""  # guard against events without a key name
+        # 1. Letter input
+        if len(name) == 1 and name.isalpha():
+            state.buffer += name
+            self.bridge.request_refresh_sig.emit()  # forward safely to the main thread
+            return False
+        # 2. Backspace
+        elif name == "backspace":
+            if state.buffer:
+                state.buffer = state.buffer[:-1]
+                self.bridge.request_refresh_sig.emit()
+                return False
+            return True
+        # 3. Candidate selection (tuple membership avoids substring matches such as "12")
+        elif name == "space" or (name in ("1", "2", "3", "4", "5") and state.buffer):
+            if state.candidates:
+                idx = 0 if name == "space" else int(name) - 1
+                if idx < len(state.candidates):
+                    keyboard.write(state.candidates[idx])
+                    self.bridge.reset_ui_sig.emit()
+                    return False
+            return True
+        # 4. Confirm / cancel
+        elif name in ["enter", "esc"]:
+            if state.buffer:
+                if name == "enter":
+                    keyboard.write(state.buffer)
+                self.bridge.reset_ui_sig.emit()
+                return False
+            return True
+        return True
+    def _handle_refresh_request(self):
+        """Main-thread logic: control the debounce timer."""
+        if not state.buffer:
+            self.reset_ime()
+            return
+        # start() must be called from the main thread; 70 ms debounce window
+        self.debounce_timer.start(70)
+    def _start_fetch(self):
+        """Main-thread logic: start the asynchronous candidate fetch."""
+        if self.worker and self.worker.isRunning():
+            self.worker.terminate()
+            self.worker.wait()
+        self.worker = FetchThread(state.buffer)
+        self.worker.result_ready.connect(self._on_data_received)
+        self.worker.start()
+    def _on_data_received(self, candidates):
+        """Main-thread logic: update the UI contents."""
+        state.candidates = candidates
+        if candidates:
+            self.window.update_candidates(candidates)
+        else:
+            self.window.hide()
+    def reset_ime(self):
+        """Main-thread logic: reset state and hide the UI."""
+        state.buffer = ""
+        state.candidates = []
+        self.window.hide()
+        if self.debounce_timer.isActive():
+            self.debounce_timer.stop()
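+    def run(self):
+        # Install the global hook; suppress=True is needed for the callback's
+        # False return value to actually block the key.
+        keyboard.hook(self.on_key_event, suppress=True)
+        # Enter the Qt event loop
+        sys.exit(self.app.exec())
+if __name__ == "__main__":
+    # Note: running this script usually requires administrator/root privileges
+    ime = IMEController()
+    ime.run()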
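The branching in `on_key_event` is easier to test when it is separated from the keyboard hook and the Qt signals. The sketch below restates the same four branches as a pure function over a small state object. It is an illustrative refactoring, not part of the commit: `ImeState` and `handle_key` are hypothetical names, and the return value uses `True` for "key consumed", which is the opposite polarity of the hook callback.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ImeState:
    buffer: str = ""
    candidates: list = field(default_factory=list)

    def reset(self) -> None:
        self.buffer = ""
        self.candidates = []


def handle_key(state: ImeState, name: str) -> tuple[bool, Optional[str]]:
    """Return (consumed, committed_text) for one key-down event."""
    if len(name) == 1 and name.isalpha():  # 1. letter input
        state.buffer += name
        return True, None
    if name == "backspace":  # 2. backspace edits the buffer if it is non-empty
        if state.buffer:
            state.buffer = state.buffer[:-1]
            return True, None
        return False, None
    if name == "space" or (name in ("1", "2", "3", "4", "5") and state.buffer):
        if state.candidates:  # 3. candidate selection
            idx = 0 if name == "space" else int(name) - 1
            if idx < len(state.candidates):
                word = state.candidates[idx]
                state.reset()
                return True, word
        return False, None
    if name in ("enter", "esc") and state.buffer:  # 4. confirm / cancel
        text = state.buffer if name == "enter" else None
        state.reset()
        return True, text
    return False, None


if __name__ == "__main__":
    s = ImeState()
    for key in "me":
        handle_key(s, key)
    s.candidates = ["first", "second"]
    print(handle_key(s, "2"))
```

The caller would then write the committed text and emit the refresh or reset signal, keeping thread-sensitive work out of the decision logic.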

windows-client/state.py ADDED

	@@ -0,0 +1,3 @@

+buffer = ""
+candidates = []
+#ime_enabled = True

windows-client/ui.py ADDED

	@@ -0,0 +1,66 @@

+# ui.py
+from PySide6.QtWidgets import QWidget, QLabel, QVBoxLayout
+from PySide6.QtCore import Qt
+import win32gui
+class CandidateWindow(QWidget):
+    def __init__(self):
+        super().__init__()
+        # Window flags: always on top, frameless, never takes focus
+        self.setWindowFlags(
+            Qt.Tool |
+            Qt.FramelessWindowHint |
+            Qt.WindowStaysOnTopHint |
+            Qt.WindowDoesNotAcceptFocus
+        )
+        self.setAttribute(Qt.WA_TranslucentBackground)
+        self.layout = QVBoxLayout(self)
+        self.layout.setContentsMargins(0, 0, 0, 0)
+        self.label = QLabel("")
+        self.label.setStyleSheet("""
+            QLabel {
+                background-color: rgba(40, 44, 52, 240);
+                color: #abb2bf;
+                font-family: "Microsoft YaHei UI";
+                font-size: 18px;
+                padding: 8px 15px;
+                border: 1px solid #61afef;
+                border-radius: 6px;
+            }
+        """)
+        self.layout.addWidget(self.label)
+    def get_caret_pos(self):
+        """Try to get the caret position; return None on failure."""
+        try:
+            gui_info = win32gui.GetGUIThreadInfo(0)
+            hwnd = gui_info['hwndFocus']
+            rect = gui_info['rcCaret']
+            # Convert client coordinates to screen coordinates
+            p = win32gui.ClientToScreen(hwnd, (rect[0], rect[1]))
+            return p[0], p[1]
+        except Exception:
+            return None
+    def update_candidates(self, candidates):
+        if not candidates:
+            self.hide()
+            return
+        # Display format: 1.word A  2.word B
+        items = [f"<span style='color:#61afef;'>{i+1}.</span>{c}" for i, c in enumerate(candidates)]
+        self.label.setText(f"<html>{'&nbsp;&nbsp;'.join(items)}</html>")
+        self.adjustSize()
+        pos = self.get_caret_pos()
+        if pos:
+            self.move(pos[0], pos[1] + 28) # move below the caret
+        else:
+            # Fallback (not implemented): show near the mouse or at the screen center
+            pass
+        self.show()
+        self.raise_()
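`update_candidates` builds rich text by interpolating candidate strings directly into HTML. Dictionary words are unlikely to contain markup, but escaping them keeps the label safe if the word list ever does. A minimal sketch of the string-building step on its own, with `format_candidates` as a hypothetical helper name:

```python
from html import escape


def format_candidates(candidates, accent="#61afef"):
    """Build the numbered rich-text label, escaping each candidate."""
    items = [
        f"<span style='color:{accent};'>{i}.</span>{escape(c)}"
        for i, c in enumerate(candidates, start=1)
    ]
    return "<html>" + "&nbsp;&nbsp;".join(items) + "</html>"


if __name__ == "__main__":
    print(format_candidates(["mektep", "a<b"]))
```

Separating the formatting from the widget also lets it be tested without a display or a Qt application instance.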