# domainTokenizer
|
|
**Building small models that understand domain tokens, not just words.**
|
|
---
|
|
## The Idea
|
|
LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.
|
|
But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.
|
|
**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?
|
|
```
Text LLM:        "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word

domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```
|
|
## Why This Matters
|
|
| | Problem | Text Tokenizer | Domain Tokenizer | |
| |---------|---------------|-----------------| |
| | Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning | |
| | Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") | |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
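
To make the right-hand column concrete, here is a minimal sketch of per-field tokenization for a single purchase event. Everything in it (the field names, the bin count, the `magnitude_token` / `temporal_tokens` helpers) is illustrative, not the project's actual API:

```python
from datetime import date

import numpy as np

# A raw purchase event (hypothetical schema).
event = {"sku": "SKU-8847291", "price": 79.99, "timestamp": "2025-03-15"}

def magnitude_token(value, bin_edges):
    """Map a numeric value to a bin token such as 'price_bin_8'. In practice
    the edges would be quantiles fit on training data, so the bin index
    carries relative-magnitude meaning ('mid-range', etc.)."""
    return f"price_bin_{int(np.searchsorted(bin_edges, value))}"

def temporal_tokens(ts):
    """Map a date to calendar tokens (weekday only here; a real tokenizer
    would also emit time-of-day and relative deltas like '2_days_later')."""
    return [f"weekday_{date.fromisoformat(ts).strftime('%A')}"]

edges = np.linspace(0, 500, 50)  # stand-in for learned quantile edges
tokens = [magnitude_token(event["price"], edges)] + temporal_tokens(event["timestamp"])
print(tokens)  # ['price_bin_8', 'weekday_Saturday']
```

Each token carries domain meaning on its own, which is exactly what the character fragments in the left-hand column lack.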
|
|
## Research Foundation
|
|
This project is grounded in 30+ papers from Google, Google DeepMind, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm**; the challenge is *how* to tokenize.
|
|
Five paradigms have emerged:
|
|
| | Paradigm | Method | Key Paper | |
| |----------|--------|-----------| |
| | **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) | |
| | **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) | |
| | **Transaction Tokenization** | Composite (date + amount + text) encoding | [Banking TF](https://arxiv.org/abs/2410.08243) (2024) | |
| | **Tabular Tokenization** | Relative magnitude encoding for numbers | [TP-BERTa](https://arxiv.org/abs/2403.01841) (2024) | |
| **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |
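
The Semantic IDs row is worth unpacking. In TIGER, an RQ-VAE learns the codebooks jointly with an encoder-decoder; the sketch below shows only the quantization step, with random vectors standing in for learned embeddings and codebooks:

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Assign a hierarchical Semantic ID: at each level, pick the nearest
    codeword, then pass the remaining residual down to the next level."""
    codes, residual = [], embedding.copy()
    for codebook in codebooks:                       # one codebook per level
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))                  # nearest codeword
        codes.append(idx)
        residual -= codebook[idx]                    # quantize what's left
    return codes

rng = np.random.default_rng(0)
item_embedding = rng.normal(size=64)                 # e.g. from a pretrained encoder
codebooks = [rng.normal(size=(256, 64)) for _ in range(4)]  # 4 levels x 256 codes
print(residual_quantize(item_embedding, codebooks))  # a 4-level code, e.g. [212, 5, 190, 37]
```

Because each level refines the previous one, items with similar embeddings share code prefixes, which is what gives Semantic IDs their hierarchical meaning.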
|
|
**Full research report:** [`docs/research_report.md`](docs/research_report.md)
|
|
## Project Vision
|
|
### Phase 1: Research & Survey (✅ Current)
- Literature survey of domain tokenization methods
- Analysis of tokenization strategies across recommendation, finance, tabular, and universal domains
- Blueprint for a general-purpose domain tokenizer
|
|
### Phase 2: Core Tokenizer Library
- Implement per-field tokenizers:
  - `SemanticIDTokenizer` – RQ-VAE for entity encoding
  - `MagnitudeTokenizer` – relative magnitude binning for numerical values
  - `TemporalTokenizer` – calendar + relative delta encoding
  - `CompositeTokenizer` – BPE-like merging of multi-field patterns (ActionPiece-style; see the sketch after this list)
- Schema-driven automatic tokenizer selection
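
As referenced above, here is a toy version of the BPE-like merging behind `CompositeTokenizer`: count adjacent field-token pairs across event sequences and fuse the most frequent pair into a composite token. This shows the general technique only; ActionPiece itself operates on unordered feature sets with additional machinery:

```python
from collections import Counter

def apply_merge(seq, a, b, merged):
    """Replace every adjacent (a, b) pair in seq with the merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_composite_tokens(sequences, num_merges):
    """BPE-style loop: repeatedly fuse the most frequent adjacent pair."""
    for _ in range(num_merges):
        pairs = Counter(p for seq in sequences for p in zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        sequences = [apply_merge(seq, a, b, f"{{{a}+{b}}}") for seq in sequences]
    return sequences

events = [["Electronics", "$50-100", "Weekday"],
          ["Electronics", "$50-100", "Weekend"],
          ["Clothing", "$0-50", "Weekday"]]
print(learn_composite_tokens(events, num_merges=1))
# [['{Electronics+$50-100}', 'Weekday'], ['{Electronics+$50-100}', 'Weekend'],
#  ['Clothing', '$0-50', 'Weekday']]
```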
|
|
### Phase 3: Pre-training Framework
- Self-supervised objectives: Masked Event Prediction (sketched below), Next Event Prediction
- Small Transformer backbone (10M–350M parameters)
- Domain-agnostic training loop that works with any tokenizer configuration
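
A minimal sketch of the Masked Event Prediction objective (BERT-style masking over event-token ids; the mask rate, the reserved mask id, and the -100 "ignore" label convention are assumptions borrowed from standard masked-LM setups):

```python
import numpy as np

MASK_ID = 0  # hypothetical reserved id for a [MASK] event token

def mask_events(token_ids, mask_prob=0.15, rng=None):
    """Hide a random subset of event tokens; the model is trained to
    reconstruct the originals. Unmasked positions get label -100, the
    usual 'ignore index' for cross-entropy losses."""
    rng = rng or np.random.default_rng()
    token_ids = np.asarray(token_ids)
    mask = rng.random(token_ids.shape) < mask_prob
    inputs = np.where(mask, MASK_ID, token_ids)
    labels = np.where(mask, token_ids, -100)
    return inputs, labels

inputs, labels = mask_events([17, 204, 93, 55, 8], mask_prob=0.4,
                             rng=np.random.default_rng(42))
print(inputs)  # masked positions replaced by MASK_ID
print(labels)  # original ids at masked positions, -100 elsewhere
```

Next Event Prediction is the autoregressive counterpart: shift the token sequence by one position and predict each event from its prefix, exactly as LLMs predict the next word.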
|
|
### Phase 4: Domain Demos
- E-commerce: next purchase prediction, customer segmentation
- Finance: fraud detection, credit scoring
- Healthcare: clinical event prediction
|
|
## Repo Structure
|
|
```
domainTokenizer/
├── docs/
│   └── research_report.md   # Detailed research findings (30+ papers)
├── src/                     # (coming) Core library
│   ├── tokenizers/          # Per-field tokenizer implementations
│   ├── models/              # Small Transformer backbones
│   └── training/            # Pre-training and fine-tuning
├── examples/                # (coming) Domain-specific demos
└── README.md
```
|
|
## Key References
|
|
| | Paper | Year | What It Does | Link | |
| |-------|------|-------------|------| |
| | TIGER | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) | |
| | ActionPiece | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) | |
| | Banking TF | 2024 | Tokenizer for financial transactions | [arXiv](https://arxiv.org/abs/2410.08243) | |
| | LETTER | 2024 | Learnable item tokenization | [arXiv](https://arxiv.org/abs/2405.07314) | |
| | TP-BERTa | 2024 | Numerical value tokenization | [arXiv](https://arxiv.org/abs/2403.01841) | |
| | Meta-Transformer | 2023 | 12 modalities, one token space | [arXiv](https://arxiv.org/abs/2307.10802) | |
| | NTP Survey | 2024 | Comprehensive multimodal NTP taxonomy | [arXiv](https://arxiv.org/abs/2412.18619) | |
| | Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) | |
|
|
See the [full reference table](docs/research_report.md#10-complete-paper-reference-table) with all 31 papers in the research report.
|
|
## License
|
|
MIT
|
|
---
|
|
*domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.*