---
license: apache-2.0
language:
  - en
  - fr
  - de
  - es
  - pt
  - tr
  - vi
  - zh
tags:
  - word-segmentation
  - domain-names
  - bilstm-crf
  - onnx
  - sequence-labeling
library_name: onnxruntime
pipeline_tag: token-classification
datasets:
  - ABTdomain/dksplit-benchmark
---

# DKSplit v0.3.1

BiLSTM-CRF model for splitting concatenated strings into words. Trained on millions of domain names, brand names, personal names, and multilingual phrases.

**85% accuracy** on real-world newly registered domains, outperforming WordSegment (54%) and WordNinja (46%).

## Quick Start

```bash
pip install dksplit
```

```python
import dksplit

dksplit.split("chatgptlogin")    # ['chatgpt', 'login']
dksplit.split("spotifywrapped")  # ['spotify', 'wrapped']
dksplit.split("mercibeaucoup")   # ['merci', 'beaucoup']

dksplit.split_batch(["openaikey", "microsoftoffice", "bitcoinprice"])
# [['openai', 'key'], ['microsoft', 'office'], ['bitcoin', 'price']]
```

## Model Details

| Property | Value |
|---|---|
| Architecture | BiLSTM-CRF |
| Parameters | 9.47M |
| Embedding | 384 |
| Hidden | 768 |
| Layers | 3 |
| Vocab | a-z, 0-9 (38 tokens) |
| Max length | 64 characters |
| Format | ONNX INT8 quantized |
| Size | 9 MB |
| Inference | CPU only, no GPU required |

## Training

- **Infrastructure**: Leonardo Booster supercomputer at CINECA, Italy (NVIDIA A100)
- **Compute**: EuroHPC Joint Undertaking, project AIFAC_P02_281
- **Data**: millions of labeled samples covering domain names, brand names, tech terms, personal names, and multilingual phrases
- **Labels**: character-level B/I tags (B = word boundary, I = continuation)
- **Optimizer**: Adam, cosine LR schedule with warmup
- **Epochs**: 15

## Benchmark

1,000 randomly sampled domains from the [Newly Registered Domains Database (NRDS)](https://domainkits.com/download/nrds) (April 2026 .com feed), with human-audited ground truth:

| Model | Accuracy |
|---|---|
| **DKSplit v0.3.1** | **85.0%** |
| DKSplit v0.2.x | 82.8% |
| WordSegment | 54.0% |
| WordNinja | 46.1% |

> ~5%
of test samples have multiple valid segmentations. Accounting for these, effective accuracy is closer to 90%.

### Examples

| Input | DKSplit | WordSegment | WordNinja |
|---|---|---|---|
| `chatgptprompts` | **chatgpt prompts** | chat gpt prompts | chat gp t prompts |
| `spotifywrapped` | **spotify wrapped** | spot if y wrapped | spot if y wrapped |
| `ethereumwallet` | **ethereum wallet** | e there um wallet | e there um wallet |
| `whatsappstatus` | **whatsapp status** | what sapp status | what s app status |
| `escribirenvozalta` | **escribir en voz alta** | escribir env oz alta | es crib ire nv oz alta |
| `candidiasenuncamais` | **candidiase nunca mais** | candid iase nunca mais | can didi as e nun cama is |

## Using the ONNX Model Directly

The model outputs per-character emission scores; CRF decoding is done separately using the parameters stored in `dksplit.npz`.

```python
import numpy as np
import onnxruntime as ort

# Load the emissions model and the CRF parameters
sess = ort.InferenceSession("dksplit-int8.onnx")
crf = np.load("dksplit.npz")

# Encode input (indices 0 and 1 are reserved; unknown characters map to 1)
CHAR_MAP = {c: i + 2 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789")}
text = "chatgptlogin"
ids = np.array([[CHAR_MAP.get(c, 1) for c in text]], dtype=np.int64)

# Run the BiLSTM to get emission scores, shape (1, seq_len, num_tags)
emissions = sess.run(["emissions"], {"chars": ids})[0]

# CRF Viterbi decode
trans = crf["transitions"]
start_t = crf["start_transitions"]
end_t = crf["end_transitions"]

score = start_t + emissions[0, 0]
history = []
for t in range(1, emissions.shape[1]):
    ns = score[:, None] + trans + emissions[0, t, None, :]
    history.append(np.argmax(ns, axis=0))
    score = np.max(ns, axis=0)

# Backtrack from the best final tag
best = [int(np.argmax(score + end_t))]
for h in reversed(history):
    best.append(int(h[best[-1]]))
best.reverse()

# Decode B/I tags to words (tag 1 = B starts a new word)
words, cur = [], []
for ch, lb in zip(text, best):
    if lb == 1 and cur:
        words.append("".join(cur))
        cur = [ch]
    else:
        cur.append(ch)
if cur:
    words.append("".join(cur))

print(words)  # ['chatgpt', 'login']
```

## Files

- `dksplit-int8.onnx` - BiLSTM emissions model (INT8 quantized, 9 MB)
- `dksplit.npz` - CRF parameters (transitions, start_transitions, end_transitions)

## Intended Use

- Domain name analysis and segmentation
- Hashtag splitting
- URL component extraction
- Compound string decomposition
- Any concatenated text without spaces

## Limitations

- Latin script only (a-z, 0-9)
- Max 64 characters
- Accuracy is highest on English and major European languages
- Some inputs are genuinely ambiguous

## Links

- PyPI: [pypi.org/project/dksplit](https://pypi.org/project/dksplit)
- GitHub: [github.com/ABTdomain/dksplit](https://github.com/ABTdomain/dksplit)
- Go version: [github.com/ABTdomain/dksplit-go](https://github.com/ABTdomain/dksplit-go)
- Website: [ABTdomain.com](https://abtdomain.com), [DomainKits.com](https://domainkits.com)

## Acknowledgements

The v0.3.1 model was trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the [EuroHPC Joint Undertaking](https://eurohpc-ju.europa.eu/) through the Playground Access program (project AIFAC_P02_281). We thank EuroHPC JU for enabling SMEs to explore new possibilities with world-class HPC infrastructure.

## License

Apache 2.0

**Please attribute as:** DKSplit by [ABTdomain](https://abtdomain.com)
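
## Appendix: the B/I labeling scheme

For reference, the character-level B/I tagging described under Training can be round-tripped in a few lines. A minimal sketch (the helper names are illustrative, not part of the dksplit API; tag 1 = B and 0 = I, matching the decode loop in the ONNX example above):

```python
def words_to_labels(words):
    """Emit one label per character: B (1) at each word start, I (0) inside."""
    labels = []
    for word in words:
        labels.append(1)
        labels.extend([0] * (len(word) - 1))
    return labels

def labels_to_words(text, labels):
    """Invert the labeling: start a new word at every B tag."""
    words, cur = [], []
    for ch, lb in zip(text, labels):
        if lb == 1 and cur:
            words.append("".join(cur))
            cur = [ch]
        else:
            cur.append(ch)
    if cur:
        words.append("".join(cur))
    return words

labels = words_to_labels(["chatgpt", "login"])
print(labels)                                   # [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(labels_to_words("chatgptlogin", labels))  # ['chatgpt', 'login']
```

The first character is always a B tag, so a string of length n has exactly n labels and at least one word.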