---
license: apache-2.0
language:
  - en
  - fr
  - de
  - es
  - pt
  - tr
  - vi
  - zh
tags:
  - word-segmentation
  - domain-names
  - bilstm-crf
  - onnx
  - sequence-labeling
library_name: onnxruntime
pipeline_tag: token-classification
datasets:
  - ABTdomain/dksplit-benchmark
---

# DKSplit v0.3.1

BiLSTM-CRF model for splitting concatenated strings into words. Trained on millions of domain names, brand names, personal names, and multilingual phrases.

**85% accuracy** on real-world newly registered domains, outperforming WordSegment (54%) and WordNinja (46%).

## Quick Start

```bash
pip install dksplit
```

```python
import dksplit

dksplit.split("chatgptlogin")    # ['chatgpt', 'login']
dksplit.split("spotifywrapped")  # ['spotify', 'wrapped']
dksplit.split("mercibeaucoup")   # ['merci', 'beaucoup']

dksplit.split_batch(["openaikey", "microsoftoffice", "bitcoinprice"])
# [['openai', 'key'], ['microsoft', 'office'], ['bitcoin', 'price']]
```

## Model Details

| Property | Value |
|---|---|
| Architecture | BiLSTM-CRF |
| Parameters | 9.47M |
| Embedding | 384 |
| Hidden | 768 |
| Layers | 3 |
| Vocab | a-z, 0-9 (38 tokens) |
| Max length | 64 characters |
| Format | ONNX INT8 quantized |
| Size | 9 MB |
| Inference | CPU only, no GPU required |

## Training

- **Infrastructure**: Leonardo Booster supercomputer at CINECA, Italy (NVIDIA A100)
- **Compute**: EuroHPC Joint Undertaking, project AIFAC_P02_281
- **Data**: millions of labeled samples covering domain names, brand names, tech terms, personal names, and multilingual phrases
- **Labels**: character-level B/I tags (B = word boundary, I = continuation)
- **Optimizer**: Adam, cosine LR schedule with warmup
- **Epochs**: 15

## Benchmark

1,000 randomly sampled domains from the [Newly Registered Domains Database (NRDS)](https://domainkits.com/download/nrds) (April 2026 .com feed), with human-audited ground truth:

| Model | Accuracy |
|---|---|
| **DKSplit v0.3.1** | **85.0%** |
| DKSplit v0.2.x | 82.8% |
| WordSegment | 54.0% |
| WordNinja | 46.1% |

> ~5%
of test samples have multiple valid segmentations. Accounting for these, effective accuracy is closer to 90%.

### Examples

| Input | DKSplit | WordSegment | WordNinja |
|---|---|---|---|
| `chatgptprompts` | **chatgpt prompts** | chat gpt prompts | chat gp t prompts |
| `spotifywrapped` | **spotify wrapped** | spot if y wrapped | spot if y wrapped |
| `ethereumwallet` | **ethereum wallet** | e there um wallet | e there um wallet |
| `whatsappstatus` | **whatsapp status** | what sapp status | what s app status |
| `escribirenvozalta` | **escribir en voz alta** | escribir env oz alta | es crib ire nv oz alta |
| `candidiasenuncamais` | **candidiase nunca mais** | candid iase nunca mais | can didi as e nun cama is |

## Using the ONNX Model Directly

The model outputs per-character emission scores; CRF decoding is done separately using the parameters stored in `dksplit.npz`.

```python
import numpy as np
import onnxruntime as ort

# Load the emissions model and the CRF parameters
sess = ort.InferenceSession("dksplit-int8.onnx")
crf = np.load("dksplit.npz")

# Encode input (indices 0 and 1 are reserved; unknown characters map to 1)
CHAR_MAP = {c: i + 2 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789")}
text = "chatgptlogin"
ids = np.array([[CHAR_MAP.get(c, 1) for c in text]], dtype=np.int64)

# Run the BiLSTM to get emission scores, shape (1, seq_len, num_tags)
emissions = sess.run(["emissions"], {"chars": ids})[0]

# CRF Viterbi decode
trans = crf["transitions"]
start_t = crf["start_transitions"]
end_t = crf["end_transitions"]

score = start_t + emissions[0, 0]
history = []
for t in range(1, emissions.shape[1]):
    ns = score[:, None] + trans + emissions[0, t, None, :]
    history.append(np.argmax(ns, axis=0))
    score = np.max(ns, axis=0)

# Backtrack from the best final tag
best = [int(np.argmax(score + end_t))]
for h in reversed(history):
    best.append(int(h[best[-1]]))
best.reverse()

# Decode B/I tags to words (tag 1 = B starts a new word)
words, cur = [], []
for ch, lb in zip(text, best):
    if lb == 1 and cur:
        words.append("".join(cur))
        cur = [ch]
    else:
        cur.append(ch)
if cur:
    words.append("".join(cur))

print(words)  # ['chatgpt', 'login']
```

## Files

- `dksplit-int8.onnx` - BiLSTM emissions model (INT8 quantized, 9 MB)
- `dksplit.npz` - CRF parameters (transitions, start_transitions, end_transitions)

## Intended Use

- Domain name analysis and segmentation
- Hashtag splitting
- URL component extraction
- Compound string decomposition
- Any concatenated text without spaces

## Limitations

- Latin script only (a-z, 0-9)
- Max 64 characters
- Accuracy is highest on English and major European languages
- Some inputs are genuinely ambiguous

## Links

- PyPI: [pypi.org/project/dksplit](https://pypi.org/project/dksplit)
- GitHub: [github.com/ABTdomain/dksplit](https://github.com/ABTdomain/dksplit)
- Go version: [github.com/ABTdomain/dksplit-go](https://github.com/ABTdomain/dksplit-go)
- Website: [ABTdomain.com](https://abtdomain.com), [DomainKits.com](https://domainkits.com)

## Acknowledgements

The v0.3.1 model was trained on the Leonardo Booster supercomputer at CINECA, Italy, with computing resources provided by the [EuroHPC Joint Undertaking](https://eurohpc-ju.europa.eu/) through the Playground Access program (project AIFAC_P02_281). We thank EuroHPC JU for enabling SMEs to explore new possibilities with world-class HPC infrastructure.

## License

Apache 2.0

**Please attribute as:** DKSplit by [ABTdomain](https://abtdomain.com)
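
## Appendix: the B/I labeling scheme

For reference, the character-level B/I tagging described under Training can be round-tripped in a few lines. A minimal sketch (the helper names are illustrative, not part of the dksplit API; tag 1 = B and 0 = I, matching the decode loop in the ONNX example above):

```python
def words_to_labels(words):
    """Emit one label per character: B (1) at each word start, I (0) inside."""
    labels = []
    for word in words:
        labels.append(1)
        labels.extend([0] * (len(word) - 1))
    return labels

def labels_to_words(text, labels):
    """Invert the labeling: start a new word at every B tag."""
    words, cur = [], []
    for ch, lb in zip(text, labels):
        if lb == 1 and cur:
            words.append("".join(cur))
            cur = [ch]
        else:
            cur.append(ch)
    if cur:
        words.append("".join(cur))
    return words

labels = words_to_labels(["chatgpt", "login"])
print(labels)                                   # [1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
print(labels_to_words("chatgptlogin", labels))  # ['chatgpt', 'login']
```

The first character is always a B tag, so a string of length n has exactly n labels and at least one word.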