Turkish Diacritic Restoration — Bidirectional CRF

Character-level linear-chain CRF that restores diacritics stripped from Turkish text (ç→c, ğ→g, ı→i, ö→o, ş→s, ü→u and circumflex variants).

Trained on 200k sentences from the wikimedia/wikipedia dataset (subset 20231101.tr), restricted to 1-to-1 aligned pairs, where each input character maps to exactly one output character.
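
As a rough illustration of what a 1-to-1 aligned pair looks like, the sketch below strips diacritics with a plain character map, so the input and the target always have the same length. `STRIP_MAP` and `make_pair` are illustrative names, not part of this repository, and the actual training pipeline may differ.

```python
# Hypothetical helper: builds a (stripped, original) pair of equal length.
STRIP_MAP = str.maketrans({
    "ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u",
    "â": "a", "î": "i", "û": "u",  # circumflex variants
})

def make_pair(sentence: str) -> tuple[str, str]:
    """Return (model input, target labels) for one lowercase sentence."""
    stripped = sentence.translate(STRIP_MAP)
    # Every mapping is one character to one character, so alignment is 1-to-1.
    assert len(stripped) == len(sentence)
    return stripped, sentence

src, tgt = make_pair("türkçe çok güzel")
print(src)  # turkce cok guzel
```

Each character of `tgt` then serves as the label for the character at the same position in `src`.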

Performance (medium corruption: p_strip=0.70, p_drop=0.15, p_typo=0.05)

Model	char_acc	word_acc	exact	diac_acc
CRF-bidir (this)	0.939	0.685	nan	nan
CRF-forward	—	—	—	—
GRU 50M (BitNet)	0.024	0.000	0.000	0.004
Transformer 51.8M	0.731	0.830	0.268	0.736

HMM/CRF metrics are computed on 1-to-1 aligned pairs only (sentences with dropped vowels are excluded).
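
This card does not spell out how the four metrics are computed. The sketch below shows one plausible reading for 1-to-1 pairs: `char_acc` over all characters, `word_acc` over space-separated words, `exact` over whole sentences, and `diac_acc` over positions whose reference character can carry or lose a diacritic. The `score` function and the `AMBIGUOUS` set are assumptions made for illustration, so the numbers they produce may not match the table.

```python
# Assumed metric definitions; the evaluation script may define them differently.
AMBIGUOUS = set("acgiosuâçğıîöşûü")  # characters that can gain or lose a diacritic

def score(preds: list[str], golds: list[str]) -> dict[str, float]:
    """Score equal-length (1-to-1) prediction/reference pairs."""
    char_ok = char_n = word_ok = word_n = diac_ok = diac_n = exact = 0
    for pred, gold in zip(preds, golds):
        if len(pred) != len(gold):
            raise ValueError("pairs must be 1-to-1 aligned")
        for p, g in zip(pred, gold):
            char_n += 1
            char_ok += p == g
            if g in AMBIGUOUS:
                diac_n += 1
                diac_ok += p == g
        pred_words, gold_words = pred.split(" "), gold.split(" ")
        word_n += len(gold_words)
        word_ok += sum(a == b for a, b in zip(pred_words, gold_words))
        exact += pred == gold
    # max(..., 1) avoids division by zero on empty input.
    return {
        "char_acc": char_ok / max(char_n, 1),
        "word_acc": word_ok / max(word_n, 1),
        "exact": exact / max(len(golds), 1),
        "diac_acc": diac_ok / max(diac_n, 1),
    }

result = score(["türkce çok güzel"], ["türkçe çok güzel"])
```

In this example one of 16 characters is wrong, so `result["char_acc"]` is 15/16 while `result["exact"]` is 0.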

Files

File	Description
`crf_gpu.safetensors`	Model weights + labels metadata (GPU inference)
`crf_bidir.pkl`	Trained sklearn-crfsuite model (CPU inference)
`crf_forward.pkl`	Causal (left-context only) variant
`crf_gpu.py`	GPU inference module (`CRFGPUModel`)
`ldgc/vocab.py`	Character vocabulary (required by crf_gpu.py)

Quick start — GPU inference

from crf_gpu import CRFGPUModel

model = CRFGPUModel.from_pretrained("crf_gpu.safetensors", device="cuda")
preds = model.predict(["turkce cok guzel", "bugun hava guzel"], batch_size=512)
# → ["türkçe çok güzel", "bugün hava güzel"]

Quick start — CPU inference (sklearn-crfsuite)

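import pickle

# Load the trained sklearn-crfsuite model (only unpickle files you trust).
with open("crf_bidir.pkl", "rb") as f:
    crf = pickle.load(f)

sent = "turkce cok guzel"
feats = [...]  # one feature dict per character of sent; see crf_gpu.py encode_batch for feature extraction
pred = "".join(crf.predict([feats])[0])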

Streaming billions of sentences (GPU)

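from crf_gpu import CRFGPUModel, predict_stream

model = CRFGPUModel.from_pretrained("crf_gpu.safetensors", device="cuda")

with open("large_corpus.txt", encoding="utf-8") as f:
    for restored in predict_stream(model, f, batch_size=512):
        process(restored)  # placeholder: replace with your own handling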
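
When `predict_stream` is not an option, similar behaviour can be approximated by grouping lines into batches by hand. The generator below is a generic sketch. `batched_lines` is a hypothetical helper and says nothing about how `predict_stream` is implemented.

```python
from itertools import islice
from typing import Iterable, Iterator

def batched_lines(lines: Iterable[str], batch_size: int = 512) -> Iterator[list[str]]:
    """Yield lists of up to batch_size non-empty lines with newlines removed."""
    cleaned = (line.rstrip("\n") for line in lines)
    non_empty = (line for line in cleaned if line)
    while True:
        batch = list(islice(non_empty, batch_size))
        if not batch:
            return
        yield batch
```

Each batch can then be passed to `model.predict(batch, batch_size=len(batch))`, so only one batch is held in memory at a time.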

Label vocabulary

37 output labels: ' ' '!' ',' '.' '?' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'r' 's' 't' 'u' 'v' 'y' 'z' 'â' 'ç' 'î' 'ö' 'û' 'ü' 'ğ' 'ı' 'ş'
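
Because the label set is closed, characters outside it (uppercase letters, digits, q, w, x and other punctuation) cannot appear in the model's output. How `crf_gpu.py` treats such input is not described here, so a quick pre-check along the lines of this sketch may help decide whether text needs normalising first. `LABELS` and `unsupported_chars` are hypothetical names.

```python
# The 37 labels listed above, written as one string for brevity.
LABELS = list(" !,.?abcdefghijklmnoprstuvyzâçîöûüğış")
LABEL_SET = set(LABELS)

def unsupported_chars(text: str) -> set[str]:
    """Characters in text that are not among the model's output labels."""
    return {ch for ch in text if ch not in LABEL_SET}

print(len(LABELS))                           # 37
print(sorted(unsupported_chars("Quiz 42")))  # ['2', '4', 'Q']
```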

Feature set (bidirectional)

Unigram: current char, ±1 char, ±2 char
Bigram: (−1,0), (−2,−1), (0,+1), (+1,+2) character pairs
Boolean: is_vowel, is_diac_src, −1/+1:is_vowel, BOS, EOS
Categorical: word_pos ∈ {start, mid, end}
Bias
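
The feature list above can be sketched as a per-character dictionary in the format sklearn-crfsuite accepts. The key names, the `<pad>` token, and the `VOWELS` and `DIAC_SRC` sets are assumptions made for illustration; the trained models only respond to the names used in `crf_gpu.py`, which may differ.

```python
# Illustrative only: feature names and padding are assumptions, not the repo's.
VOWELS = set("aeiouâîûıöü")
DIAC_SRC = set("acgiosu")  # plain letters that may hide a stripped diacritic
PAD = "<pad>"

def char_features(sent: str, i: int) -> dict:
    """Features for the character at position i, using left and right context."""
    def at(k: int) -> str:
        j = i + k
        return sent[j] if 0 <= j < len(sent) else PAD

    c = at(0)
    feats = {
        "bias": 1.0,
        # Unigrams: current character and its neighbours at distance 1 and 2.
        "c0": c, "c-1": at(-1), "c+1": at(1), "c-2": at(-2), "c+2": at(2),
        # Bigrams: adjacent character pairs around the current position.
        "c-1|c0": at(-1) + "|" + c,
        "c-2|c-1": at(-2) + "|" + at(-1),
        "c0|c+1": c + "|" + at(1),
        "c+1|c+2": at(1) + "|" + at(2),
        # Boolean indicators.
        "is_vowel": c in VOWELS,
        "is_diac_src": c in DIAC_SRC,
        "-1:is_vowel": at(-1) in VOWELS,
        "+1:is_vowel": at(1) in VOWELS,
        "BOS": i == 0,
        "EOS": i == len(sent) - 1,
    }
    # Position of the character inside its space-delimited word.
    if at(-1) in (" ", PAD):
        feats["word_pos"] = "start"
    elif at(1) in (" ", PAD):
        feats["word_pos"] = "end"
    else:
        feats["word_pos"] = "mid"
    return feats

def sent_features(sent: str) -> list[dict]:
    """One feature dict per character, the shape crf.predict expects per sentence."""
    return [char_features(sent, i) for i in range(len(sent))]
```

The right-context features (`c+1`, `c+2`, `+1:is_vowel`) are what make the set bidirectional; a causal variant would presumably drop them.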

Citation

@misc{ldgc2025,
  title  = {LDGC: Latent Dynamics Grammar Corrector for Turkish},
  author = {Emircan EROL},
  year   = {2025},
  url    = {https://github.com/emircan-erol/tr-grammar},
}
