# domainTokenizer

**Building small models that understand domain tokens, not just words.**

---

## The Idea

LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.

But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.

**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?

```
Text LLM: "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word

domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```

## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
|---------|---------------|-----------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |

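To make the "Domain Tokenizer" column concrete, here is a minimal Python sketch of the kind of field-level encoding it describes. The helper names (`price_token`, `temporal_tokens`), the bin edges, and the token spellings are illustrative assumptions, not part of this repo yet.

```python
# Hypothetical field-level tokenizers, illustrating the table above.
from bisect import bisect_right
from datetime import date

def price_token(price: float, bin_edges: list[float]) -> str:
    """Map a price onto a quantile-bin token, e.g. 'price_bin_3'."""
    return f"price_bin_{bisect_right(bin_edges, price)}"

def temporal_tokens(day: date, prev_day: date) -> list[str]:
    """Encode calendar position plus the gap since the previous event."""
    return [day.strftime("%A"), f"{(day - prev_day).days}_days_later"]

# Toy usage; real bin edges would be fit on a training corpus of prices.
edges = [5.0, 20.0, 50.0, 100.0, 500.0]
print(price_token(79.99, edges))                        # -> price_bin_3
print(temporal_tokens(date(2025, 3, 15), date(2025, 3, 13)))
# -> ['Saturday', '2_days_later']
```
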
## Research Foundation

This project is grounded in 30+ papers from Google, Google DeepMind, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm**; the challenge is *how* to tokenize.

Five paradigms have emerged:

| Paradigm | Method | Key Paper |
|----------|--------|-----------|
| **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
| **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
| **Transaction Tokenization** | Composite (date + amount + text) encoding | [Banking TF](https://arxiv.org/abs/2410.08243) (2024) |
| **Tabular Tokenization** | Relative magnitude encoding for numbers | [TP-BERTa](https://arxiv.org/abs/2403.01841) (2024) |
| **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |

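As an example of the first paradigm, the sketch below shows only the residual-quantization step that turns a continuous item embedding into a short Semantic ID (one discrete code per level). A real RQ-VAE, as in TIGER, learns the codebooks jointly with an encoder and decoder; the random codebooks and sizes here are placeholders to show the mechanics.

```python
# Toy residual quantization: a continuous item embedding -> a short Semantic ID.
# Codebooks are random placeholders; an RQ-VAE would learn them end to end.
import numpy as np

rng = np.random.default_rng(0)
num_levels, codebook_size, dim = 3, 256, 64
codebooks = rng.normal(size=(num_levels, codebook_size, dim))

def semantic_id(embedding: np.ndarray) -> list[int]:
    residual = embedding.copy()
    codes = []
    for level in range(num_levels):
        # choose the codeword closest to what is still unexplained
        distances = np.linalg.norm(codebooks[level] - residual, axis=1)
        index = int(distances.argmin())
        codes.append(index)
        residual = residual - codebooks[level][index]
    return codes

item_embedding = rng.normal(size=dim)   # e.g. from a content encoder
print(semantic_id(item_embedding))      # e.g. [212, 41, 7], a hierarchical item token
```
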
**Full research report:** [`docs/research_report.md`](docs/research_report.md)

## Project Vision

### Phase 1: Research & Survey (current)

- Literature survey of domain tokenization methods
- Analysis of tokenization strategies across recommendation, finance, tabular, and universal domains
- Blueprint for a general-purpose domain tokenizer

### Phase 2: Core Tokenizer Library

- Implement per-field tokenizers:
  - `SemanticIDTokenizer` – RQ-VAE for entity encoding
  - `MagnitudeTokenizer` – relative magnitude binning for numerical values
  - `TemporalTokenizer` – calendar + relative delta encoding
  - `CompositeTokenizer` – BPE-like merging of multi-field patterns (ActionPiece-style; sketched below)
- Schema-driven automatic tokenizer selection

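As a rough sketch of the ActionPiece-style merging mentioned above: treat each event as a set of field tokens, count which pairs co-occur most often, and promote the winning pair to a single composite token. The data, token names, and single-merge loop are illustrative assumptions; the actual algorithm adds vocabulary construction and ordering details beyond this.

```python
# Toy BPE-like merge over events, where each event is a set of field tokens.
# Illustrative only; the planned CompositeTokenizer would be richer than this.
from collections import Counter
from itertools import combinations

events = [
    {"Electronics", "$50-100", "Weekday"},
    {"Electronics", "$50-100", "Weekend"},
    {"Groceries", "$0-20", "Weekday"},
    {"Electronics", "$50-100", "Weekday"},
]

def merge_once(events):
    """Find the most frequent co-occurring token pair and fuse it into one token."""
    pair_counts = Counter(
        pair for ev in events for pair in combinations(sorted(ev), 2)
    )
    (a, b), _ = pair_counts.most_common(1)[0]
    composite = f"{{{a} + {b}}}"
    merged = [(ev - {a, b}) | {composite} if {a, b} <= ev else ev for ev in events]
    return merged, composite

events, new_token = merge_once(events)
print(new_token)    # -> {$50-100 + Electronics}
print(events[0])    # e.g. {'Weekday', '{$50-100 + Electronics}'}
```
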
### Phase 3: Pre-training Framework

- Self-supervised objectives: Masked Event Prediction, Next Event Prediction (sketched below)
- Small Transformer backbone (10M–350M parameters)
- Domain-agnostic training loop that works with any tokenizer configuration

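For the Next Event Prediction objective, a minimal training step could look like the sketch below: embed a window of domain-token IDs, run a small causal Transformer encoder, and score the token at each next position. PyTorch, the layer sizes, and the random stand-in data are all assumptions for illustration, not the project's settled design.

```python
# Minimal Next Event Prediction step: predict token t+1 from tokens <= t.
# PyTorch, sizes, and random data are placeholder assumptions for this sketch.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 64, 16, 8

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
head = nn.Linear(d_model, vocab_size)
params = list(embed.parameters()) + list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=3e-4)

tokens = torch.randint(0, vocab_size, (batch, seq_len))    # stand-in for tokenized event streams
causal_mask = torch.triu(                                  # block attention to future events
    torch.full((seq_len - 1, seq_len - 1), float("-inf")), diagonal=1
)

hidden = encoder(embed(tokens[:, :-1]), mask=causal_mask)  # (batch, seq_len-1, d_model)
logits = head(hidden)                                      # (batch, seq_len-1, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
print(float(loss))
```
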
### Phase 4: Domain Demos

- E-commerce: next purchase prediction, customer segmentation
- Finance: fraud detection, credit scoring
- Healthcare: clinical event prediction

## Repo Structure

```
domainTokenizer/
├── docs/
│   └── research_report.md   # Detailed research findings (30+ papers)
├── src/                     # (coming) Core library
│   ├── tokenizers/          # Per-field tokenizer implementations
│   ├── models/              # Small Transformer backbones
│   └── training/            # Pre-training and fine-tuning
├── examples/                # (coming) Domain-specific demos
└── README.md
```

## Key References

| Paper | Year | What It Does | Link |
|-------|------|-------------|------|
| TIGER | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| ActionPiece | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| Banking TF | 2024 | Tokenizer for financial transactions | [arXiv](https://arxiv.org/abs/2410.08243) |
| LETTER | 2024 | Learnable item tokenization | [arXiv](https://arxiv.org/abs/2405.07314) |
| TP-BERTa | 2024 | Numerical value tokenization | [arXiv](https://arxiv.org/abs/2403.01841) |
| Meta-Transformer | 2023 | 12 modalities, one token space | [arXiv](https://arxiv.org/abs/2307.10802) |
| NTP Survey | 2024 | Comprehensive multimodal NTP taxonomy | [arXiv](https://arxiv.org/abs/2412.18619) |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |

See the [full reference table](docs/research_report.md#10-complete-paper-reference-table) with 31 papers in the research report.

## License

MIT

---

*domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.*