# NedoTurkishTokenizer

Self-contained Turkish morphological tokenizer. **Zero external dependencies** — tokenizes Turkish text into morphologically meaningful units using a candidate-based segmentation engine, a bundled TDK dictionary, and 260+ suffix patterns.

> This is a standalone tokenizer. It does not wrap `turkish-tokenizer`, `zemberek-python`, `requests`, or `transformers`. There are no hidden fallbacks or optional dependency paths. Install and use immediately.

## Installation

```bash
pip install .
```

No additional packages required. Everything is bundled.

## Quick Start

```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer

tok = NedoTurkishTokenizer()
tokens = tok.tokenize("İstanbul'da toplantıya katılamadım")
for t in tokens:
    print(f"{t['token']:15s} {t['token_type']:10s} pos={t['morph_pos']}")
```

Output:

```
İstanbul        ROOT       pos=0
'               PUNCT      pos=0
da              SUFFIX     pos=1
toplantı        ROOT       pos=0
ya              SUFFIX     pos=1
katıl           ROOT       pos=0
a               SUFFIX     pos=1
ma              SUFFIX     pos=2
dım             SUFFIX     pos=3
```

## API

### `NedoTurkishTokenizer()`

```python
tok = NedoTurkishTokenizer()

# Single text
tokens = tok.tokenize("Merhaba dünya")

# Callable shorthand
tokens = tok("Merhaba dünya")

# Batch (parallel, uses multiprocessing)
results = tok.batch_tokenize(["text1", "text2", "text3"])

# Statistics
stats = tok.stats(tokens)
```

### Token Output Format

Each token is a `dict` with these guaranteed fields:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | **Clean token text** — no leading/trailing whitespace. |
| `token_type` | `str` | One of the types below. |
| `morph_pos` | `int` | 0 = root/word-initial, 1+ = suffix position. |

**Token text does not encode spacing.** The `token` field contains only the clean surface form. Whether a token starts a new word is indicated by `morph_pos == 0`, not by whitespace in the string.
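Because `token` never encodes spacing, word boundaries can be reconstructed from `morph_pos` alone. The `detokenize` helper below is a minimal sketch, not part of the package; it assumes (as in the Quick Start output, where the apostrophe is a `PUNCT` token with `pos=0`) that `PUNCT` tokens attach to the preceding word:

```python
def detokenize(tokens):
    """Rebuild whitespace-separated words from a token stream.

    A new word starts at morph_pos == 0, except for PUNCT tokens
    (e.g. the apostrophe in "İstanbul'da"), which attach to the
    word before them.
    """
    words = []
    for t in tokens:
        starts_word = t["morph_pos"] == 0 and t["token_type"] != "PUNCT"
        if starts_word or not words:
            words.append(t["token"])
        else:
            words[-1] += t["token"]
    return " ".join(words)

# The token stream from the Quick Start example above
tokens = [
    {"token": "İstanbul", "token_type": "ROOT",   "morph_pos": 0},
    {"token": "'",        "token_type": "PUNCT",  "morph_pos": 0},
    {"token": "da",       "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "toplantı", "token_type": "ROOT",   "morph_pos": 0},
    {"token": "ya",       "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "katıl",    "token_type": "ROOT",   "morph_pos": 0},
    {"token": "a",        "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "ma",       "token_type": "SUFFIX", "morph_pos": 2},
    {"token": "dım",      "token_type": "SUFFIX", "morph_pos": 3},
]
print(detokenize(tokens))  # İstanbul'da toplantıya katılamadım
```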
### Token Types

| Type | Description | Example |
|---|---|---|
| `ROOT` | Turkish word root | `ev`, `gel`, `kitap` |
| `SUFFIX` | Morphological suffix | `de`, `ler`, `yor` |
| `FOREIGN` | Non-Turkish word | `meeting`, `cloud` |
| `PUNCT` | Punctuation | `.`, `,`, `'` |
| `NUM` | Number | `42`, `%85`, `3.14` |
| `DATE` | Date | `14.03.2026` |
| `UNIT` | Unit | `kg`, `km`, `TL` |
| `URL` | URL | `https://...` |
| `MENTION` | Social mention | `@user` |
| `HASHTAG` | Hashtag | `#topic` |
| `EMOJI` | Emoji | 😀, `:)` |
| `ACRONYM` | Acronym | `NATO`, `TBMM` |

### Optional Metadata Fields

Tokens may include `_`-prefixed metadata fields:

| Field | Type | Description |
|---|---|---|
| `_suffix_label` | `str` | Morphological label (e.g. `-LOC`, `-PL`, `-PST`) |
| `_canonical` | `str` | Canonical morpheme (e.g. `LOC`, `PL`, `PAST`) |
| `_caps` | `bool` | Word was originally ALL CAPS |
| `_foreign` | `bool` | Word detected as foreign |
| `_acronym` | `bool` | Token is an acronym |
| `_expansion` | `str` | Acronym expansion (e.g. `NATO` → `Kuzey Atlantik...`) |
| `_compound` | `bool` | Root is a compound word |
| `_parts` | `list[str]` | Compound decomposition |
| `_apo_suffix` | `bool` | Suffix follows an apostrophe |
| `_domain` | `bool` | Root from domain vocabulary |

## Architecture

The tokenizer uses a **candidate-based segmentation** pipeline:

```
Text → Normalize → Special Spans → Word Split → Per-Word Segmentation → Annotate → Strip
                                                         │
                                                         ├─ Generate Candidates
                                                         └─ Score & Select Best
```

For each word, the engine:

1. Generates 2–5 segmentation candidates (whole ROOT, suffix chains, foreign)
2. Scores each candidate deterministically (TDK validation, root length, suffix recognition)
3. Selects the highest-scoring segmentation
4. Strips internal whitespace markers from the output

### Scoring Rules

| Factor | Score |
|---|---|
| Root in TDK dictionary | +10 |
| Whole word in TDK (unsplit) | +5 bonus |
| Root in domain vocabulary | +8 |
| Root length | +2 per character |
| Each recognised suffix | +2 |
| Short root penalty (≤2 chars) | −4 |
| Foreign root (fallback) | +3 base |
| Unknown root | +1 base |

### Known-Intact Words

A curated set of common Turkish words (inflected forms of `demek`, `yemek`, and discourse particles) bypasses candidate generation entirely and is always kept whole. This prevents false splits such as `dedi` → `de` + `di`, where the root `de` is a valid TDK conjunction.

### Bundled Resources

| Resource | Size | Purpose |
|---|---|---|
| `tdk_words.txt` | ~746 KB | TDK dictionary (64K+ lemmas + derived verb stems) |
| `turkish_proper_nouns.txt` | ~1 KB | Proper nouns (cities, regions, names) |
| Suffix table | 260+ entries | Turkish suffix patterns with morphological labels |
| Acronym table | 80+ entries | Acronym → Turkish expansion mappings |
| Domain vocabulary | 200+ entries | Medical, sports, tourism terms |

## Known Limitations

- **Not a full morphological analyzer.** This is a heuristic segmenter, not a Zemberek/TRMorph replacement. Morphological labels may be incorrect for ambiguous suffixes (e.g. `-GEN` vs. the 2nd-person possessive for the surface form `-in`).
- **No disambiguation.** The tokenizer does not use sentence-level context to resolve ambiguous words (e.g. "gelir" = income vs. the aorist of "gelmek").
- **Verb stem derivation is simple.** Only `-mak`/`-mek` infinitive stripping (for stems ≥3 characters) is used; vowel harmony alternations in stems are not modelled.
- **TDK dictionary coverage.** The bundled TDK list has ~64K entries. Words absent from TDK default to ROOT (unknown) or FOREIGN.
- **Not backward-compatible with v1.x.** The old wrapper relied on `turkish-tokenizer` BPE and `zemberek-python`. Output format and token boundaries will differ.
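The scoring rules above can be sketched as a plain function. This is an illustration of the published table, not the package's actual implementation; the function name, parameter names, and toy resource sets are invented for the example, and the "+5 bonus" is read here as applying only when the word is kept unsplit:

```python
def score_candidate(root, suffixes, *, tdk, domain_vocab, known_suffixes,
                    foreign=False):
    """Deterministic score for one (root, suffixes) segmentation candidate,
    following the Scoring Rules table."""
    if root in tdk:
        score = 10                 # root validated by TDK
        if not suffixes:
            score += 5             # whole word in TDK, kept unsplit
    elif foreign:
        score = 3                  # foreign-root fallback base
    else:
        score = 1                  # unknown-root base
    if root in domain_vocab:
        score += 8
    score += 2 * len(root)         # longer roots preferred
    score += 2 * sum(1 for s in suffixes if s in known_suffixes)
    if len(root) <= 2:
        score -= 4                 # short-root penalty
    return score

# Toy resources for illustration only
tdk = {"ev", "evler"}
known = {"ler", "de"}

# Two candidates for "evlerde": longer root vs. full suffix chain
print(score_candidate("evler", ["de"], tdk=tdk, domain_vocab=set(),
                      known_suffixes=known))          # 22
print(score_candidate("ev", ["ler", "de"], tdk=tdk, domain_vocab=set(),
                      known_suffixes=known))          # 14
```

With these toy inputs the longer root `evler` outscores the `ev` + `ler` + `de` split (22 vs. 14), showing how the per-character root bonus and the short-root penalty steer candidate selection.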
## Running Tests

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

## License

MIT