domainTokenizer
Building small models that understand domain tokens, not just words.
The Idea
LLMs process text by breaking it into tokens (subword units like ing, the, un-). This tokenization is what enables Transformers to model sequential patterns.
But words are just one type of sequential data. Businesses generate massive amounts of non-textual sequential data every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.
domainTokenizer explores a different approach: what if we built tokenizers that encode domain-specific entities (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?
Text LLM: "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word
domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
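
To make the contrast concrete, here is a minimal, self-contained sketch of how one purchase event could be mapped to domain tokens. The field names, bin widths, and token strings are illustrative assumptions, not this library's actual API.

```python
# Illustrative only: a hand-rolled mapping from one purchase record to domain tokens.
# Field names, bin edges, and token strings are hypothetical, not this library's API.
from datetime import datetime

def tokenize_purchase(event: dict) -> list[str]:
    """Turn a single purchase event into a short sequence of domain tokens."""
    tokens = [f"category_{event['category']}"]        # entity identity as one token
    bin_idx = min(int(event["price"] // 25), 9)       # bucket price so nearby amounts share a token
    tokens.append(f"price_bin_{bin_idx}")
    ts = datetime.fromisoformat(event["timestamp"])   # calendar-aware time token
    tokens.append(f"dow_{ts.strftime('%A')}")
    return tokens

print(tokenize_purchase(
    {"category": "Electronics", "price": 79.99, "timestamp": "2025-03-15T14:30:00"}
))
# ['category_Electronics', 'price_bin_3', 'dow_Saturday']
```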
Why This Matters
| Problem | Text Tokenizer | Domain Tokenizer |
|---|---|---|
| Product SKU-8847291 | Split into meaningless fragments: SK, U-, 884... | Encoded as a Semantic ID: [Electronics, 23, 7, 41] with hierarchical meaning |
| Price $79.99 | Fragmented: $, 79, ., 99 | Tokenized as price_bin_37 (73rd percentile = "mid-range") |
| Timestamp 2025-03-15 | Calendar-unaware text fragments | [Saturday, Afternoon, 2_days_later] |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: {Electronics + $50-100} → composite token |
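
As a rough sketch of the price row above, relative magnitude binning can be done by learning percentile-based bin edges from training data and mapping each raw value to its bin token. The class name, bin count, and synthetic price distribution below are assumptions for illustration, not the project's implementation.

```python
# Sketch of percentile-based magnitude binning; names and parameters are illustrative.
import numpy as np

class PercentileBinner:
    def __init__(self, n_bins: int = 50):
        self.n_bins = n_bins
        self.edges = None

    def fit(self, values):
        # Learn bin edges from the empirical distribution of training values.
        quantiles = np.linspace(0.0, 1.0, self.n_bins + 1)
        self.edges = np.quantile(values, quantiles)
        return self

    def tokenize(self, value: float) -> str:
        # Map a raw value to the token of the percentile bin it falls into.
        idx = int(np.searchsorted(self.edges, value, side="right")) - 1
        idx = max(0, min(idx, self.n_bins - 1))
        return f"price_bin_{idx}"

binner = PercentileBinner().fit(np.random.lognormal(mean=3.5, sigma=1.0, size=10_000))
print(binner.tokenize(79.99))   # a mid-range bin for this synthetic price distribution
```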
Research Foundation
This project is grounded in 30+ papers from Google, Google DeepMind, and the broader research community. The key finding: any sequential domain data can be tokenized and modeled with the LLM paradigm; the challenge is how to tokenize.
Five paradigms have emerged:
| Paradigm | Method | Key Paper |
|---|---|---|
| Semantic IDs | RQ-VAE quantization of item embeddings | TIGER (Google, 2023) |
| Action Tokenization | BPE-like merging of feature patterns | ActionPiece (DeepMind, 2025) |
| Transaction Tokenization | Composite (date + amount + text) encoding | Banking TF (2024) |
| Tabular Tokenization | Relative magnitude encoding for numbers | TP-BERTa (2024) |
| Universal Tokenization | All modalities → shared discrete space | Meta-Transformer (2023) |
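
The ActionPiece row refers to BPE-like merging of feature patterns. The toy function below shows the core idea, repeatedly fusing the most frequent adjacent token pair into a composite token; it is a simplification for intuition, not a reimplementation of the paper.

```python
# Toy BPE-like merging over feature-token sequences (simplified for intuition).
from collections import Counter

def merge_frequent_pairs(sequences: list[list[str]], n_merges: int = 1) -> list[list[str]]:
    for _ in range(n_merges):
        pair_counts = Counter(
            (seq[i], seq[i + 1]) for seq in sequences for i in range(len(seq) - 1)
        )
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]     # most frequent adjacent pair
        merged = f"{{{a}+{b}}}"                       # e.g. '{Electronics+$50-100}'
        new_sequences = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                    out.append(merged); i += 2        # replace the pair with one composite token
                else:
                    out.append(seq[i]); i += 1
            new_sequences.append(out)
        sequences = new_sequences
    return sequences

print(merge_frequent_pairs(
    [["Electronics", "$50-100", "Weekday"], ["Electronics", "$50-100", "Weekend"]]
))
# [['{Electronics+$50-100}', 'Weekday'], ['{Electronics+$50-100}', 'Weekend']]
```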
Full research report: docs/research_report.md
Project Vision
Phase 1: Research & Survey (current)
- Literature survey of domain tokenization methods
- Analysis of tokenization strategies across recommendation, finance, tabular, and universal domains
- Blueprint for a general-purpose domain tokenizer
Phase 2: Core Tokenizer Library
- Implement per-field tokenizers:
  - `SemanticIDTokenizer`: RQ-VAE for entity encoding
  - `MagnitudeTokenizer`: relative magnitude binning for numerical values
  - `TemporalTokenizer`: calendar + relative delta encoding
  - `CompositeTokenizer`: BPE-like merging of multi-field patterns (ActionPiece-style)
- Schema-driven automatic tokenizer selection (see the sketch after this list)
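
One way schema-driven selection could work is a simple mapping from declared field types to the per-field tokenizers listed above. The field type names and helper function below are a hypothetical sketch, not the final interface.

```python
# Hypothetical sketch of schema-driven tokenizer selection; not the final API.
FIELD_TOKENIZERS = {
    "entity":    "SemanticIDTokenizer",    # products, merchants, medical codes
    "numeric":   "MagnitudeTokenizer",     # prices, amounts, lab values
    "timestamp": "TemporalTokenizer",      # event times
}

def plan_tokenizers(schema: dict[str, str]) -> dict[str, str]:
    """Map each field in a dataset schema to the name of a per-field tokenizer."""
    return {field: FIELD_TOKENIZERS[field_type] for field, field_type in schema.items()}

print(plan_tokenizers({"sku": "entity", "price": "numeric", "event_time": "timestamp"}))
# {'sku': 'SemanticIDTokenizer', 'price': 'MagnitudeTokenizer', 'event_time': 'TemporalTokenizer'}
```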
Phase 3: Pre-training Framework
- Self-supervised objectives: Masked Event Prediction, Next Event Prediction (sketched after this list)
- Small Transformer backbone (10M–350M parameters)
- Domain-agnostic training loop that works with any tokenizer configuration
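
For the masked-event-prediction objective, the corruption step might look like the PyTorch sketch below. It assumes event sequences have already been converted to token IDs; the masking probability and special-token handling are illustrative.

```python
# Minimal masked-event-prediction corruption step (PyTorch); hyperparameters are illustrative.
import torch

def mask_events(token_ids: torch.Tensor, mask_id: int, mask_prob: float = 0.15):
    """Randomly mask event tokens; labels are kept only at masked positions (-100 elsewhere)."""
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels = torch.where(mask, token_ids, torch.full_like(token_ids, -100))
    inputs = torch.where(mask, torch.full_like(token_ids, mask_id), token_ids)
    return inputs, labels

# Usage with any encoder that returns per-position logits over the domain vocabulary:
#   inputs, labels = mask_events(batch, mask_id=MASK_TOKEN_ID)
#   loss = torch.nn.functional.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
```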
Phase 4: Domain Demos
- E-commerce: next purchase prediction, customer segmentation
- Finance: fraud detection, credit scoring
- Healthcare: clinical event prediction
Repo Structure
```
domainTokenizer/
├── docs/
│   └── research_report.md    # Detailed research findings (30+ papers)
├── src/                      # (coming) Core library
│   ├── tokenizers/           # Per-field tokenizer implementations
│   ├── models/               # Small Transformer backbones
│   └── training/             # Pre-training and fine-tuning
├── examples/                 # (coming) Domain-specific demos
└── README.md
```
Key References
| Paper | Year | What It Does | Link |
|---|---|---|---|
| TIGER | 2023 | Semantic IDs for products via RQ-VAE | arXiv |
| ActionPiece | 2025 | BPE for user action sequences | arXiv |
| Banking TF | 2024 | Tokenizer for financial transactions | arXiv |
| LETTER | 2024 | Learnable item tokenization | arXiv |
| TP-BERTa | 2024 | Numerical value tokenization | arXiv |
| Meta-Transformer | 2023 | 12 modalities, one token space | arXiv |
| NTP Survey | 2024 | Comprehensive multimodal NTP taxonomy | arXiv |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | arXiv |
See the full reference table with 31 papers in the research report.
License
MIT
domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.