# NedoTurkishTokenizer

Self-contained Turkish morphological tokenizer. **Zero external dependencies** — tokenizes Turkish text into morphologically meaningful units using a candidate-based segmentation engine, a bundled TDK dictionary, and 260+ suffix patterns.

> This is a standalone tokenizer. It does not wrap `turkish-tokenizer`, `zemberek-python`, `requests`, or `transformers`. There are no hidden fallbacks or optional dependency paths. Install and use immediately.

## Installation

```bash
pip install .
```

No additional packages required. Everything is bundled.

## Quick Start

```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer

tok = NedoTurkishTokenizer()
tokens = tok.tokenize("İstanbul'da toplantıya katılamadım")
for t in tokens:
    print(f"{t['token']:15s} {t['token_type']:10s} pos={t['morph_pos']}")
```

Output:

```
İstanbul        ROOT       pos=0
'               PUNCT      pos=0
da              SUFFIX     pos=1
toplantı        ROOT       pos=0
ya              SUFFIX     pos=1
katıl           ROOT       pos=0
a               SUFFIX     pos=1
ma              SUFFIX     pos=2
dım             SUFFIX     pos=3
```

## API

### `NedoTurkishTokenizer()`

```python
tok = NedoTurkishTokenizer()

# Single text
tokens = tok.tokenize("Merhaba dünya")

# Callable shorthand
tokens = tok("Merhaba dünya")

# Batch (parallel, uses multiprocessing)
results = tok.batch_tokenize(["text1", "text2", "text3"])

# Statistics
stats = tok.stats(tokens)
```

### Token Output Format

Each token is a `dict` with these guaranteed fields:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | **Clean token text** — no leading/trailing whitespace. |
| `token_type` | `str` | One of the types below. |
| `morph_pos` | `int` | 0 = root/word-initial, 1+ = suffix position. |

**Token text does not encode spacing.** The `token` field contains only the clean surface form. Whether a token starts a new word is indicated by `morph_pos == 0`, not by whitespace in the string.
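Because `token` never encodes spacing, word boundaries can be reconstructed from `morph_pos` alone. The `detokenize` helper below is a minimal sketch, not part of the package; it assumes (as in the Quick Start output, where the apostrophe is a `PUNCT` token with `pos=0`) that `PUNCT` tokens attach to the preceding word:

```python
def detokenize(tokens):
    """Rebuild whitespace-separated words from a token stream.

    A new word starts at morph_pos == 0, except for PUNCT tokens
    (e.g. the apostrophe in "İstanbul'da"), which attach to the
    word before them.
    """
    words = []
    for t in tokens:
        starts_word = t["morph_pos"] == 0 and t["token_type"] != "PUNCT"
        if starts_word or not words:
            words.append(t["token"])
        else:
            words[-1] += t["token"]
    return " ".join(words)

# The token stream from the Quick Start example above
tokens = [
    {"token": "İstanbul", "token_type": "ROOT",   "morph_pos": 0},
    {"token": "'",        "token_type": "PUNCT",  "morph_pos": 0},
    {"token": "da",       "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "toplantı", "token_type": "ROOT",   "morph_pos": 0},
    {"token": "ya",       "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "katıl",    "token_type": "ROOT",   "morph_pos": 0},
    {"token": "a",        "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "ma",       "token_type": "SUFFIX", "morph_pos": 2},
    {"token": "dım",      "token_type": "SUFFIX", "morph_pos": 3},
]
print(detokenize(tokens))  # İstanbul'da toplantıya katılamadım
```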
### Token Types

| Type | Description | Example |
|---|---|---|
| `ROOT` | Turkish word root | `ev`, `gel`, `kitap` |
| `SUFFIX` | Morphological suffix | `de`, `ler`, `yor` |
| `FOREIGN` | Non-Turkish word | `meeting`, `cloud` |
| `PUNCT` | Punctuation | `.`, `,`, `'` |
| `NUM` | Number | `42`, `%85`, `3.14` |
| `DATE` | Date | `14.03.2026` |
| `UNIT` | Unit | `kg`, `km`, `TL` |
| `URL` | URL | `https://...` |
| `MENTION` | Social mention | `@user` |
| `HASHTAG` | Hashtag | `#topic` |
| `EMOJI` | Emoji | 😀, `:)` |
| `ACRONYM` | Acronym | `NATO`, `TBMM` |

### Optional Metadata Fields

Tokens may include `_`-prefixed metadata fields:

| Field | Type | Description |
|---|---|---|
| `_suffix_label` | `str` | Morphological label (e.g. `-LOC`, `-PL`, `-PST`) |
| `_canonical` | `str` | Canonical morpheme (e.g. `LOC`, `PL`, `PAST`) |
| `_caps` | `bool` | Word was originally ALL CAPS |
| `_foreign` | `bool` | Word detected as foreign |
| `_acronym` | `bool` | Token is an acronym |
| `_expansion` | `str` | Acronym expansion (e.g. `NATO` → `Kuzey Atlantik...`) |
| `_compound` | `bool` | Root is a compound word |
| `_parts` | `list[str]` | Compound decomposition |
| `_apo_suffix` | `bool` | Suffix follows an apostrophe |
| `_domain` | `bool` | Root from domain vocabulary |

## Architecture

The tokenizer uses a **candidate-based segmentation** pipeline:

```
Text → Normalize → Special Spans → Word Split → Per-Word Segmentation → Annotate → Strip
                                                         │
                                                         ├─ Generate Candidates
                                                         └─ Score & Select Best
```

For each word, the engine:

1. Generates 2–5 segmentation candidates (whole ROOT, suffix chains, foreign)
2. Scores each candidate deterministically (TDK validation, root length, suffix recognition)
3. Selects the highest-scoring segmentation
4. Strips internal whitespace markers from the output

### Scoring Rules

| Factor | Score |
|---|---|
| Root in TDK dictionary | +10 |
| Whole word in TDK (unsplit) | +5 bonus |
| Root in domain vocabulary | +8 |
| Root length | +2 per character |
| Each recognised suffix | +2 |
| Short root penalty (≤2 chars) | −4 |
| Foreign root (fallback) | +3 base |
| Unknown root | +1 base |

### Known-Intact Words

A curated set of common Turkish words (inflected forms of `demek`, `yemek`, and discourse particles) bypasses candidate generation entirely and is always kept whole. This prevents false splits such as `dedi` → `de` + `di`, where the root `de` is a valid TDK conjunction.

### Bundled Resources

| Resource | Size | Purpose |
|---|---|---|
| `tdk_words.txt` | ~746 KB | TDK dictionary (64K+ lemmas + derived verb stems) |
| `turkish_proper_nouns.txt` | ~1 KB | Proper nouns (cities, regions, names) |
| Suffix table | 260+ entries | Turkish suffix patterns with morphological labels |
| Acronym table | 80+ entries | Acronym → Turkish expansion mappings |
| Domain vocabulary | 200+ entries | Medical, sports, tourism terms |

## Known Limitations

- **Not a full morphological analyzer.** This is a heuristic segmenter, not a Zemberek/TRMorph replacement. Morphological labels may be incorrect for ambiguous suffixes (e.g. `-GEN` vs. the 2nd-person possessive for the surface form `-in`).
- **No disambiguation.** The tokenizer does not use sentence-level context to resolve ambiguous words (e.g. "gelir" = income vs. the aorist of "gelmek").
- **Verb stem derivation is simple.** Only `-mak`/`-mek` infinitive stripping (for stems ≥3 characters) is used; vowel harmony alternations in stems are not modelled.
- **TDK dictionary coverage.** The bundled TDK list has ~64K entries. Words absent from TDK default to ROOT (unknown) or FOREIGN.
- **Not backward-compatible with v1.x.** The old wrapper relied on `turkish-tokenizer` BPE and `zemberek-python`. Output format and token boundaries will differ.
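The scoring rules above can be sketched as a plain function. This is an illustration of the published table, not the package's actual implementation; the function name, parameter names, and toy resource sets are invented for the example, and the "+5 bonus" is read here as applying only when the word is kept unsplit:

```python
def score_candidate(root, suffixes, *, tdk, domain_vocab, known_suffixes,
                    foreign=False):
    """Deterministic score for one (root, suffixes) segmentation candidate,
    following the Scoring Rules table."""
    if root in tdk:
        score = 10                 # root validated by TDK
        if not suffixes:
            score += 5             # whole word in TDK, kept unsplit
    elif foreign:
        score = 3                  # foreign-root fallback base
    else:
        score = 1                  # unknown-root base
    if root in domain_vocab:
        score += 8
    score += 2 * len(root)         # longer roots preferred
    score += 2 * sum(1 for s in suffixes if s in known_suffixes)
    if len(root) <= 2:
        score -= 4                 # short-root penalty
    return score

# Toy resources for illustration only
tdk = {"ev", "evler"}
known = {"ler", "de"}

# Two candidates for "evlerde": longer root vs. full suffix chain
print(score_candidate("evler", ["de"], tdk=tdk, domain_vocab=set(),
                      known_suffixes=known))          # 22
print(score_candidate("ev", ["ler", "de"], tdk=tdk, domain_vocab=set(),
                      known_suffixes=known))          # 14
```

With these toy inputs the longer root `evler` outscores the `ev` + `ler` + `de` split (22 vs. 14), showing how the per-character root bonus and the short-root penalty steer candidate selection.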
## Running Tests

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

## License

MIT