# πŸ”‘ domainTokenizer

**Building small models that understand domain tokens β€” not just words.**

---

## The Idea

LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.

But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day β€” purchase transactions, banking flows, medical events, logistics chains, ad funnels. These streams carry rich structure that text tokenizers cannot capture.

**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** β€” products, transactions, medical codes, user actions β€” as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?

```
Text LLM:        "The cat sat on the mat"  β†’ [The] [cat] [sat] [on] [the] [mat]                      β†’ Transformer β†’ next word

domainTokenizer: Customer purchase history β†’ [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] β†’ Transformer β†’ next purchase
```

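To make the contrast concrete, here is a hypothetical sketch of the domain side of that mapping. The event schema, bin widths, token names, and the `tokenize_event` helper are all illustrative assumptions, not this project's API.

```python
# Hypothetical sketch: mapping raw purchase events to discrete domain tokens.
# Field names, bin widths, and token formats are illustrative assumptions.
from datetime import datetime

def tokenize_event(event: dict) -> list[str]:
    """Turn one raw purchase event into a handful of domain tokens."""
    tokens = [f"cat_{event['category']}"]                             # entity token
    tokens.append(f"price_bin_{min(int(event['price'] // 25), 9)}")   # coarse magnitude bin
    ts = datetime.fromisoformat(event["timestamp"])
    tokens.append(ts.strftime("%A"))                                  # calendar token
    return tokens

history = [
    {"category": "electronics", "price": 799.00, "timestamp": "2025-03-12T14:05:00"},
    {"category": "accessory",   "price":  19.99, "timestamp": "2025-03-12T14:20:00"},
]
sequence = [tok for event in history for tok in tokenize_event(event)]
print(sequence)
# ['cat_electronics', 'price_bin_9', 'Wednesday', 'cat_accessory', 'price_bin_0', 'Wednesday']
```

A Transformer then consumes `sequence` exactly as a text LLM consumes word tokens.
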
## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
|---------|----------------|------------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as a Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in the flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` β†’ composite token |

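The price and timestamp rows reduce to a few lines of code. Below is a minimal sketch, assuming percentile bin edges fitted on historical prices plus simple calendar and delta rules; the bin count, daypart scheme, and helper names are assumptions for illustration.

```python
# Minimal sketch of the price and timestamp encodings above, assuming
# percentile bin edges fitted on historical prices. Bin count, daypart
# scheme, and token names are illustrative assumptions.
from datetime import datetime

import numpy as np

def fit_price_bins(prices: np.ndarray, n_bins: int = 50) -> np.ndarray:
    """Learn bin edges from the empirical price distribution."""
    return np.percentile(prices, np.linspace(0, 100, n_bins + 1)[1:-1])

def price_token(price: float, edges: np.ndarray) -> str:
    return f"price_bin_{int(np.searchsorted(edges, price))}"

def time_tokens(ts: str, prev_ts: str | None = None) -> list[str]:
    t = datetime.fromisoformat(ts)
    toks = [t.strftime("%A"), ["Night", "Morning", "Afternoon", "Evening"][t.hour // 6]]
    if prev_ts is not None:
        toks.append(f"{(t - datetime.fromisoformat(prev_ts)).days}_days_later")
    return toks

rng = np.random.default_rng(0)
edges = fit_price_bins(rng.lognormal(3.5, 1.0, size=10_000))   # fake price history
print(price_token(79.99, edges))                               # bin index depends on the fit
print(time_tokens("2025-03-15T15:30:00", "2025-03-13T09:00:00"))
# ['Saturday', 'Afternoon', '2_days_later']
```
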
## Research Foundation

This project is grounded in 30+ papers from Google, Google DeepMind, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm** β€” the challenge is *how* to tokenize.

Five paradigms have emerged:

| Paradigm | Method | Key Paper |
|----------|--------|-----------|
| **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
| **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
| **Transaction Tokenization** | Composite (date + amount + text) encoding | [Banking TF](https://arxiv.org/abs/2410.08243) (2024) |
| **Tabular Tokenization** | Relative magnitude encoding for numbers | [TP-BERTa](https://arxiv.org/abs/2403.01841) (2024) |
| **Universal Tokenization** | All modalities β†’ shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |

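To make the Semantic-ID paradigm concrete: TIGER quantizes each item embedding into a short tuple of codebook indices with an RQ-VAE. The sketch below shows only the greedy residual-quantization step, with random codebooks standing in for learned ones; a real RQ-VAE trains the codebooks jointly with an encoder and decoder.

```python
# Sketch of greedy residual quantization, the core of a Semantic ID.
# Codebooks are random here for shape only; TIGER's RQ-VAE learns them
# jointly with an encoder/decoder.
import numpy as np

def residual_quantize(x: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Each level encodes whatever the previous levels missed."""
    ids, residual = [], x.copy()
    for cb in codebooks:                                   # cb: (codebook_size, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        ids.append(idx)
        residual = residual - cb[idx]
    return ids

rng = np.random.default_rng(0)
dim, levels, size = 32, 4, 256
codebooks = [rng.normal(size=(size, dim)) for _ in range(levels)]

item_embedding = rng.normal(size=dim)                      # e.g. from a content encoder
semantic_id = residual_quantize(item_embedding, codebooks)
print(semantic_id)                                         # four codebook indices, one per level
```

Similar items land on shared ID prefixes, which is what gives Semantic IDs their hierarchical meaning.
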
πŸ“„ **Full research report:** [`docs/research_report.md`](docs/research_report.md)

## Project Vision

### Phase 1: Research & Survey (βœ… Current)
- Literature survey of domain tokenization methods
- Analysis of tokenization strategies across recommendation, finance, tabular, and universal domains
- Blueprint for a general-purpose domain tokenizer

### Phase 2: Core Tokenizer Library
- Implement per-field tokenizers:
  - `SemanticIDTokenizer` β€” RQ-VAE for entity encoding
  - `MagnitudeTokenizer` β€” relative magnitude binning for numerical values
  - `TemporalTokenizer` β€” calendar + relative delta encoding
  - `CompositeTokenizer` β€” BPE-like merging of multi-field patterns (ActionPiece-style; see the sketch after this list)
- Schema-driven automatic tokenizer selection

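A minimal sketch of that BPE-like merge loop: count the most frequent co-occurring pair of tokens across sequences and fuse it into a composite token. ActionPiece operates on unordered feature sets; this simplified version merges adjacent tokens, and all names are illustrative.

```python
# Simplified BPE-like merging over domain-token sequences. ActionPiece
# merges within unordered feature sets; this adjacency-based version is
# an illustrative approximation.
from collections import Counter

def merge_step(sequences: list[list[str]]) -> list[list[str]]:
    """One merge: fuse the most frequent adjacent token pair into a composite."""
    pairs = Counter((a, b) for seq in sequences for a, b in zip(seq, seq[1:]))
    if not pairs:
        return sequences
    (a, b), _ = pairs.most_common(1)[0]
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                out.append(f"{{{a}+{b}}}")    # new composite token
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged

seqs = [
    ["Electronics", "$50-100", "Weekday"],
    ["Electronics", "$50-100", "Weekend"],
    ["Grocery", "$0-25", "Weekday"],
]
print(merge_step(seqs)[0])   # ['{Electronics+$50-100}', 'Weekday']
```

Repeating `merge_step` grows a vocabulary of cross-field patterns, mirroring how BPE grows subwords.
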
### Phase 3: Pre-training Framework
- Self-supervised objectives: Masked Event Prediction, Next Event Prediction (a minimal sketch follows this list)
- Small Transformer backbone (10M–350M parameters)
- Domain-agnostic training loop that works with any tokenizer configuration

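As a sketch of the first objective, Masked Event Prediction can be set up like BERT-style masked language modeling over event-token ids. Everything below (vocabulary size, model shape, mask rate) is a placeholder, not the project's actual configuration.

```python
# Placeholder sketch of Masked Event Prediction over event-token ids,
# in the style of masked language modeling. Vocab size, model shape,
# and mask rate are assumptions, not the project's configuration.
import torch
import torch.nn as nn

vocab_size, mask_id, mask_rate = 1024, 0, 0.15
embed = nn.Embedding(vocab_size, 128)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(128, vocab_size)

tokens = torch.randint(1, vocab_size, (8, 64))             # batch of event-token sequences
mask = torch.rand_like(tokens, dtype=torch.float) < mask_rate
inputs = tokens.masked_fill(mask, mask_id)                 # hide ~15% of events

logits = head(encoder(embed(inputs)))                      # (batch, seq, vocab)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])   # score only masked positions
loss.backward()
```

Next Event Prediction is the same setup with a causal mask and targets shifted by one position.
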
### Phase 4: Domain Demos
- E-commerce: next purchase prediction, customer segmentation
- Finance: fraud detection, credit scoring
- Healthcare: clinical event prediction

## Repo Structure

```
domainTokenizer/
β”œβ”€β”€ docs/
β”‚   └── research_report.md   # Detailed research findings (30+ papers)
β”œβ”€β”€ src/                     # (coming) Core library
β”‚   β”œβ”€β”€ tokenizers/          # Per-field tokenizer implementations
β”‚   β”œβ”€β”€ models/              # Small Transformer backbones
β”‚   └── training/            # Pre-training and fine-tuning
β”œβ”€β”€ examples/                # (coming) Domain-specific demos
└── README.md
```

## Key References

| Paper | Year | What It Does | Link |
|-------|------|--------------|------|
| TIGER | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| ActionPiece | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| Banking TF | 2024 | Tokenizer for financial transactions | [arXiv](https://arxiv.org/abs/2410.08243) |
| LETTER | 2024 | Learnable item tokenization | [arXiv](https://arxiv.org/abs/2405.07314) |
| TP-BERTa | 2024 | Numerical value tokenization | [arXiv](https://arxiv.org/abs/2403.01841) |
| Meta-Transformer | 2023 | 12 modalities, one token space | [arXiv](https://arxiv.org/abs/2307.10802) |
| NTP Survey | 2024 | Comprehensive multimodal NTP taxonomy | [arXiv](https://arxiv.org/abs/2412.18619) |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |

See the [full reference table](docs/research_report.md#10-complete-paper-reference-table) in the research report for all 31 papers.

## License

MIT

---

*domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.*