Turkish Diacritic Restoration — Bidirectional CRF

Character-level linear-chain CRF that restores diacritics stripped from Turkish text (ç→c, ğ→g, ı→i, ö→o, ş→s, ü→u and circumflex variants).
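
As a small illustration of the stripping direction listed above, the sketch below turns clean text into the diacritic-free input the model expects. The helper name strip_diacritics and the lowercase-only table are assumptions made for this example; the circumflex entries (â→a, î→i, û→u) are inferred from the label vocabulary, and the corruption code actually used for training may differ.

```python
# Hypothetical helper for illustration; not part of crf_gpu.py.
# Assumes precomposed (NFC) lowercase characters, matching the label vocabulary.
_STRIP_TABLE = str.maketrans({
    "ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u",
    "â": "a", "î": "i", "û": "u",
})


def strip_diacritics(text: str) -> str:
    """Replace each Turkish diacritic character with its ASCII base letter."""
    return text.translate(_STRIP_TABLE)


print(strip_diacritics("türkçe çok güzel"))  # turkce cok guzel
```

Each replacement is one character for one character, so the stripped string keeps the length of the original and the pair stays 1-to-1 aligned.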

Trained on wikimedia/wikipedia 20231101.tr (200k sentences, 1-to-1 aligned pairs only).

Performance (medium corruption: p_strip=0.70, p_drop=0.15, p_typo=0.05)

Model               char_acc  word_acc  exact  diac_acc
CRF-bidir (this)    0.939     0.685     nan    nan
CRF-forward
GRU 50M (BitNet)    0.024     0.000     0.000  0.004
Transformer 51.8M   0.731     0.830     0.268  0.736

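HMM/CRF metrics are computed on 1-to-1 aligned pairs only, i.e. sentences without vowel drops, because a character-level tagger emits exactly one label per input character and cannot reinsert a dropped vowel.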

Files

File                  Description
crf_gpu.safetensors   Model weights + labels metadata (GPU inference)
crf_bidir.pkl         Trained sklearn-crfsuite model (CPU inference)
crf_forward.pkl       Causal (left-context only) variant
crf_gpu.py            GPU inference module (CRFGPUModel)
ldgc/vocab.py         Character vocabulary (required by crf_gpu.py)

Quick start — GPU inference

from crf_gpu import CRFGPUModel

model = CRFGPUModel.from_pretrained("crf_gpu.safetensors", device="cuda")
preds = model.predict(["turkce cok guzel", "bugun hava guzel"], batch_size=512)
# → ["türkçe çok güzel", "bugün hava güzel"]

Quick start — CPU inference (sklearn-crfsuite)

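import pickle

# Only unpickle files you trust: pickle can execute arbitrary code on load.
with open("crf_bidir.pkl", "rb") as f:
    crf = pickle.load(f)

sent = "turkce cok guzel"
feats = [...]   # one feature dict per character of sent; see crf_gpu.py encode_batch for feature extraction
pred = "".join(crf.predict([feats])[0])   # predict takes a list of sequences; [0] is the label sequence for sent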

Streaming billions of sentences (GPU)

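from crf_gpu import predict_stream

# model is the CRFGPUModel loaded in the GPU quick start; process is your own callback.
with open("large_corpus.txt", encoding="utf-8") as f:
    for restored in predict_stream(model, f, batch_size=512):
        process(restored)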
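
For readers wondering how a streaming helper can keep memory flat on a corpus of this size, here is a minimal sketch of lazy batching. It is not the implementation of predict_stream in crf_gpu.py; iter_batches, stream_predictions, and the stand-in fake_predict are hypothetical names used only for illustration.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield lists of up to batch_size non-empty lines without reading all input."""
    stripped = (line.rstrip("\n") for line in lines)
    non_empty = (line for line in stripped if line)
    while True:
        batch = list(islice(non_empty, batch_size))
        if not batch:
            return
        yield batch


def stream_predictions(
    predict_fn: Callable[[List[str]], List[str]],
    lines: Iterable[str],
    batch_size: int = 512,
) -> Iterator[str]:
    """Run predict_fn batch by batch and yield restored sentences one at a time."""
    for batch in iter_batches(lines, batch_size):
        yield from predict_fn(batch)


def fake_predict(batch: List[str]) -> List[str]:
    """Stand-in for a real model's predict so the sketch runs without a GPU."""
    return [s.upper() for s in batch]


for out in stream_predictions(fake_predict, ["turkce\n", "\n", "guzel\n"], batch_size=2):
    print(out)
```

Because both the input and the output are iterators, at most one batch of sentences is held in memory at any time, regardless of corpus size.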

Label vocabulary

37 output labels: ' ' '!' ',' '.' '?' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'r' 's' 't' 'u' 'v' 'y' 'z' 'â' 'ç' 'î' 'ö' 'û' 'ü' 'ğ' 'ı' 'ş'

Feature set (bidirectional)

  • Unigram: current char, ±1 char, ±2 char
  • Bigram: (−1,0), (−2,−1), (0,+1), (+1,+2) character pairs
  • Boolean: is_vowel, is_diac_src, −1/+1:is_vowel, BOS, EOS
  • Categorical: word_pos ∈ {start, mid, end}
  • Bias
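
To make the feature list concrete, the sketch below builds one feature dict per character, in the list-of-dicts form that sklearn-crfsuite accepts. The feature key names, the vowel set, the reading of is_diac_src as "this character could be a stripped diacritic", and the word_pos rule are all assumptions; encode_batch in crf_gpu.py remains the reference, and features built this way will only work with the trained model if the names match exactly.

```python
VOWELS = set("aeiouıöüâîû")      # assumed vowel inventory
DIAC_SOURCES = set("cgiosua")    # assumed: base letters a diacritic may strip to


def char_features(sent: str, i: int) -> dict:
    """Build an illustrative bidirectional feature dict for character i of sent."""
    n, ch = len(sent), sent[i]
    feats = {
        "bias": 1.0,
        "c0": ch,
        "is_vowel": ch in VOWELS,
        "is_diac_src": ch in DIAC_SOURCES,
    }
    # Unigram context at offsets ±1 and ±2
    for off in (-2, -1, 1, 2):
        if 0 <= i + off < n:
            feats[f"c{off:+d}"] = sent[i + off]
    # Bigram context: (-1,0), (-2,-1), (0,+1), (+1,+2)
    for a, b in ((-1, 0), (-2, -1), (0, 1), (1, 2)):
        if i + a >= 0 and i + b < n:
            feats[f"bg{a:+d}{b:+d}"] = sent[i + a] + sent[i + b]
    # Neighbour vowel flags, or sequence boundary markers
    if i > 0:
        feats["-1:is_vowel"] = sent[i - 1] in VOWELS
    else:
        feats["BOS"] = True
    if i < n - 1:
        feats["+1:is_vowel"] = sent[i + 1] in VOWELS
    else:
        feats["EOS"] = True
    # Position of the character inside its whitespace-delimited word
    if i == 0 or sent[i - 1] == " ":
        feats["word_pos"] = "start"
    elif i == n - 1 or sent[i + 1] == " ":
        feats["word_pos"] = "end"
    else:
        feats["word_pos"] = "mid"
    return feats


def sent_features(sent: str) -> list:
    """Return one feature dict per character of sent."""
    return [char_features(sent, i) for i in range(len(sent))]


print(sent_features("cok")[1])
```

A causal (left-context only) variant would, under the same assumptions, drop every feature that looks at offsets +1 and +2.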

Citation

@misc{ldgc2025,
  title  = {LDGC: Latent Dynamics Grammar Corrector for Turkish},
  author = {Emircan EROL},
  year   = {2025},
  url    = {https://github.com/emircan-erol/tr-grammar},
}