Turkish Diacritic Restoration — Bidirectional CRF
Character-level linear-chain CRF that restores diacritics stripped from Turkish text (ç→c, ğ→g, ı→i, ö→o, ş→s, ü→u and circumflex variants).
Trained on wikimedia/wikipedia 20231101.tr (200k sentences, 1-to-1 aligned pairs only).
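The stripping step that produces 1-to-1 aligned training pairs can be sketched as below. This is an illustration, not the project's actual preprocessing code: the names `STRIP_MAP`, `strip_diacritics` and `make_pair` are hypothetical, and the circumflex entries (â, î, û) are inferred from the label vocabulary listed in this README.

```python
# Hypothetical helper names; the mapping follows the list in the description.
# The circumflex entries (â, î, û) are an assumption of this sketch.
STRIP_MAP = str.maketrans({
    "ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u",
    "â": "a", "î": "i", "û": "u",
})


def strip_diacritics(text: str) -> str:
    """Replace each diacritic with its ASCII base; the length is preserved."""
    return text.translate(STRIP_MAP)


def make_pair(sentence: str) -> tuple[str, str]:
    """Return a (stripped, original) pair aligned character by character."""
    stripped = strip_diacritics(sentence)
    # Every substitution is one character for one character,
    # so the pair stays 1-to-1 aligned.
    assert len(stripped) == len(sentence)
    return stripped, sentence
```

Because every replacement is one character for one character, the input and label sequences have equal length, which is what a character-level linear-chain CRF needs.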
Performance (medium corruption: p_strip=0.70, p_drop=0.15, p_typo=0.05)
| Model | char_acc | word_acc | exact | diac_acc |
|---|---|---|---|---|
| CRF-bidir (this) | 0.939 | 0.685 | nan | nan |
| CRF-forward | — | — | — | — |
| GRU 50M (BitNet) | 0.024 | 0.000 | 0.000 | 0.004 |
| Transformer 51.8M | 0.731 | 0.830 | 0.268 | 0.736 |
HMM/CRF metrics are computed on 1-to-1 aligned pairs only, so sentences with dropped vowels are excluded.
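The README does not define the four metric columns, so the following is only one plausible reading, written as a minimal sketch: `char_acc` as per-position accuracy, `word_acc` as accuracy over space-separated words, `exact` as the share of sentences restored perfectly, and `diac_acc` as accuracy on positions whose reference character carries a diacritic. The function name `evaluate` and these definitions are assumptions, not the project's evaluation script.

```python
DIACRITICS = set("çğıöşüâîû")


def evaluate(preds, golds):
    """Sketch versions of char_acc, word_acc, exact and diac_acc (assumed definitions)."""
    char_hit = char_tot = word_hit = word_tot = 0
    diac_hit = diac_tot = exact_hit = n_sent = 0
    for pred, gold in zip(preds, golds):
        n_sent += 1
        exact_hit += pred == gold
        # Position-by-position comparison assumes a 1-to-1 aligned pair.
        for p, g in zip(pred, gold):
            char_tot += 1
            char_hit += p == g
            if g in DIACRITICS:
                diac_tot += 1
                diac_hit += p == g
        for pw, gw in zip(pred.split(" "), gold.split(" ")):
            word_tot += 1
            word_hit += pw == gw

    def ratio(hit, tot):
        # An empty denominator yields nan instead of raising.
        return hit / tot if tot else float("nan")

    return {
        "char_acc": ratio(char_hit, char_tot),
        "word_acc": ratio(word_hit, word_tot),
        "exact": ratio(exact_hit, n_sent),
        "diac_acc": ratio(diac_hit, diac_tot),
    }
```

For example, `evaluate(["turkçe cok güzel"], ["türkçe çok güzel"])` scores 14 of 16 characters, 1 of 3 words and 2 of 4 diacritic positions as correct.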
Files
| File | Description |
|---|---|
| crf_gpu.safetensors | Model weights + labels metadata (GPU inference) |
| crf_bidir.pkl | Trained sklearn-crfsuite model (CPU inference) |
| crf_forward.pkl | Causal (left-context only) variant |
| crf_gpu.py | GPU inference module (CRFGPUModel) |
| ldgc/vocab.py | Character vocabulary (required by crf_gpu.py) |
Quick start — GPU inference
from crf_gpu import CRFGPUModel
model = CRFGPUModel.from_pretrained("crf_gpu.safetensors", device="cuda")
preds = model.predict(["turkce cok guzel", "bugun hava guzel"], batch_size=512)
# → ["türkçe çok güzel", "bugün hava güzel"]
Quick start — CPU inference (sklearn-crfsuite)
import pickle

# Only unpickle files from a source you trust: pickle can execute arbitrary code.
with open("crf_bidir.pkl", "rb") as fh:
    crf = pickle.load(fh)

sent = "turkce cok guzel"
feats = [...]  # see crf_gpu.py encode_batch for feature extraction
pred = "".join(crf.predict([feats])[0])
Streaming billions of sentences (GPU)
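from crf_gpu import predict_stream

# `model` is the CRFGPUModel loaded in the GPU quick start above;
# `process` stands for your own downstream handler.
with open("large_corpus.txt", encoding="utf-8") as f:
    for restored in predict_stream(model, f, batch_size=512):
        process(restored)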
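The idea behind streaming, reading lines lazily and sending them to the model in fixed-size batches so the whole corpus never sits in memory, can be sketched as follows. This is not the implementation of `predict_stream`; the function `stream_in_batches` is hypothetical and only assumes a model object exposing `predict(list_of_str, batch_size=...)` as in the GPU quick start.

```python
from itertools import islice
from typing import Iterable, Iterator


def stream_in_batches(model, lines: Iterable[str], batch_size: int = 512) -> Iterator[str]:
    """Yield restored sentences while holding at most one batch in memory."""
    # Strip the trailing newline that file iteration leaves on each line.
    cleaned = (line.rstrip("\n") for line in lines)
    while True:
        batch = list(islice(cleaned, batch_size))
        if not batch:
            return
        # Assumes model.predict returns one output string per input string.
        yield from model.predict(batch, batch_size=batch_size)
```

Because the function is a generator, downstream code can start consuming results after the first batch instead of waiting for the full file.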
Label vocabulary
37 output labels: ' ' '!' ',' '.' '?' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'r' 's' 't' 'u' 'v' 'y' 'z' 'â' 'ç' 'î' 'ö' 'û' 'ü' 'ğ' 'ı' 'ş'
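Under the stripping map from the description, only a handful of input characters are ambiguous; the rest map to a single label. The sketch below derives the candidate labels for each input character. It is illustrative only: the names `LABELS`, `STRIP` and `candidates` are hypothetical, and nothing here implies that the trained model restricts its output to these candidates.

```python
# The 37 labels listed above.
LABELS = list(" !,.?abcdefghijklmnoprstuvyz") + list("âçîöûüğış")

# Diacritic -> ASCII base, as in the description; the circumflex
# entries are an assumption of this sketch.
STRIP = {"ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u",
         "â": "a", "î": "i", "û": "u"}


def candidates(ch: str) -> list[str]:
    """Labels whose stripped form equals the input character."""
    return sorted(label for label in LABELS if STRIP.get(label, label) == ch)


# Input characters with more than one possible restoration.
AMBIGUOUS = [ch for ch in LABELS if len(candidates(ch)) > 1]
```

With this map, `AMBIGUOUS` contains seven characters (a, c, g, i, o, s, u), so most positions in a sentence are trivial under pure stripping and the model's work concentrates on those seven.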
Feature set (bidirectional)
- Unigram: current char, ±1 char, ±2 char
- Bigram: (−1,0), (−2,−1), (0,+1), (+1,+2) character pairs
- Boolean: is_vowel, is_diac_src, −1/+1:is_vowel, BOS, EOS
- Categorical: word_pos ∈ {start, mid, end}
- Bias
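The feature list above can be sketched as a per-character feature dictionary in the format sklearn-crfsuite accepts. The feature key names, the padding token, the `DIAC_SRC` set and the `word_pos` rule are all assumptions of this sketch; a trained model only works with the exact keys used at training time, so use `encode_batch` in crf_gpu.py for real inference.

```python
VOWELS = set("aeıioöuüâîû")
DIAC_SRC = set("acgiosu")  # assumed: input chars that may need a diacritic
PAD = "<pad>"              # assumed padding token outside the sentence


def char_features(sent: str, i: int) -> dict:
    """Features for position i, using both left and right context."""
    def at(offset: int) -> str:
        j = i + offset
        return sent[j] if 0 <= j < len(sent) else PAD

    ch = sent[i]
    feats = {"bias": 1.0, "+0:char": ch,
             "is_vowel": ch in VOWELS, "is_diac_src": ch in DIAC_SRC}
    for k in (-2, -1, 1, 2):                      # unigram context
        feats[f"{k:+d}:char"] = at(k)
    for a, b in ((-1, 0), (-2, -1), (0, 1), (1, 2)):  # character bigrams
        feats[f"{a:+d}/{b:+d}:bigram"] = at(a) + at(b)
    feats["-1:is_vowel"] = at(-1) in VOWELS
    feats["+1:is_vowel"] = at(1) in VOWELS
    feats["BOS"] = i == 0
    feats["EOS"] = i == len(sent) - 1
    # Assumed rule: a position is 'start' after a space or at BOS,
    # 'end' before a space or at EOS, and 'mid' otherwise.
    if at(-1) in (" ", PAD):
        feats["word_pos"] = "start"
    elif at(1) in (" ", PAD):
        feats["word_pos"] = "end"
    else:
        feats["word_pos"] = "mid"
    return feats


def sent_features(sent: str) -> list[dict]:
    """One feature dict per character of the sentence."""
    return [char_features(sent, i) for i in range(len(sent))]
```

A forward (causal) variant would presumably drop the `+1` and `+2` entries, the right-hand bigrams and `EOS`, since those look at characters that have not been seen yet.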
Citation
@misc{ldgc2025,
title = {LDGC: Latent Dynamics Grammar Corrector for Turkish},
author = {Emircan EROL},
year = {2025},
url = {https://github.com/emircan-erol/tr-grammar},
}