rtferraz committed
Commit be86e60 · verified · 1 Parent(s): 356a72e

Add comprehensive research report on domain-specific tokenization

Files changed (1)
  1. docs/research_report.md +862 -0
docs/research_report.md ADDED
@@ -0,0 +1,862 @@
1
+ # Domain Tokenization: Beyond Words — A Research Report
2
+
3
+ > **Building small models that understand domain tokens, not just words.**
4
+ >
5
+ > *Last updated: April 2026*
6
+
7
+ ---
8
+
9
+ ## Table of Contents
10
+
11
+ 1. [Executive Summary](#1-executive-summary)
12
+ 2. [The Problem: Why Words Are Not Enough](#2-the-problem-why-words-are-not-enough)
13
+ 3. [The Core Insight: Anything Can Be a Token](#3-the-core-insight-anything-can-be-a-token)
14
+ 4. [Research Landscape: Five Paradigms of Domain Tokenization](#4-research-landscape-five-paradigms-of-domain-tokenization)
15
+ - 4.1 [Semantic ID Tokenization (Products & Items)](#41-semantic-id-tokenization-products--items)
16
+ - 4.2 [Action Sequence Tokenization (User Behaviors)](#42-action-sequence-tokenization-user-behaviors)
17
+ - 4.3 [Financial Transaction Tokenization](#43-financial-transaction-tokenization)
18
+ - 4.4 [Tabular Feature Tokenization](#44-tabular-feature-tokenization)
19
+ - 4.5 [Universal Modality Tokenization](#45-universal-modality-tokenization)
20
+ 5. [Key Papers: Detailed Analysis](#5-key-papers-detailed-analysis)
21
+ - 5.1 [TIGER — Semantic IDs for Generative Retrieval](#51-tiger--semantic-ids-for-generative-retrieval)
22
+ - 5.2 [ActionPiece — BPE for User Actions](#52-actionpiece--bpe-for-user-actions)
23
+ - 5.3 [Banking Transaction Flow — Transactions as Tokens](#53-banking-transaction-flow--transactions-as-tokens)
24
+ - 5.4 [LETTER — Learnable Item Tokenization](#54-letter--learnable-item-tokenization)
25
+ - 5.5 [TP-BERTa — Numerical Value Tokenization](#55-tp-berta--numerical-value-tokenization)
26
+ - 5.6 [Meta-Transformer — 12 Modalities, One Token Space](#56-meta-transformer--12-modalities-one-token-space)
27
+ 6. [Tokenization Methods: A Technical Taxonomy](#6-tokenization-methods-a-technical-taxonomy)
28
+ - 6.1 [Quantization-Based (RQ-VAE, VQ-VAE)](#61-quantization-based-rq-vae-vq-vae)
29
+ - 6.2 [BPE-Inspired Merging](#62-bpe-inspired-merging)
30
+ - 6.3 [Magnitude & Binning Approaches](#63-magnitude--binning-approaches)
31
+ - 6.4 [Learnable End-to-End Tokenizers](#64-learnable-end-to-end-tokenizers)
32
+ - 6.5 [Serialization-Based (Text Templates)](#65-serialization-based-text-templates)
33
+ 7. [The domainTokenizer Blueprint: How to Build It](#7-the-domaintokenizer-blueprint-how-to-build-it)
34
+ - 7.1 [Architecture Design](#71-architecture-design)
35
+ - 7.2 [Tokenizer Construction Pipeline](#72-tokenizer-construction-pipeline)
36
+ - 7.3 [Pre-training Objectives](#73-pre-training-objectives)
37
+ - 7.4 [Downstream Task Adaptation](#74-downstream-task-adaptation)
38
+ 8. [Use Case Walkthrough: E-Commerce Transaction Model](#8-use-case-walkthrough-e-commerce-transaction-model)
39
+ 9. [Open Challenges and Research Gaps](#9-open-challenges-and-research-gaps)
40
+ 10. [Complete Paper Reference Table](#10-complete-paper-reference-table)
41
+ 11. [Related Concepts: Nested Learning & Continual Adaptation](#11-related-concepts-nested-learning--continual-adaptation)
42
+
43
+ ---
44
+
45
+ ## 1. Executive Summary
46
+
47
+ Large Language Models (LLMs) process text by breaking it into **tokens** β€” subword units learned via algorithms like BPE (Byte-Pair Encoding). This tokenization is the foundation that allows Transformers to model sequential patterns via next-token prediction.
48
+
49
+ But words are just one type of sequential data. Businesses generate vast amounts of **non-textual sequential data** every day:
50
+
51
+ - **E-commerce:** millions of purchase transactions, each with product IDs, amounts, timestamps, categories
52
+ - **Banking:** transaction flows with dates, amounts, merchant codes, and descriptions
53
+ - **Healthcare:** sequences of diagnoses, procedures, lab results, medications
54
+ - **Advertising:** impression β†’ click β†’ conversion funnels with bid amounts and user features
55
+ - **Logistics:** shipping events, warehouse movements, delivery status sequences
56
+
57
+ **The central question this project explores:** Can we build tokenizers that encode these domain-specific entities β€” products, transactions, medical codes, user actions β€” as first-class tokens, and then train small, efficient Transformer models that understand domain patterns the way LLMs understand language?
58
+
59
+ **The answer from recent research is a resounding yes.** This report surveys 25+ papers spanning 2021–2026 that collectively establish a new paradigm: **domain tokenization**. The key findings are:
60
+
61
+ 1. **Semantic IDs** (Google, 2023): Products can be encoded as tuples of discrete tokens derived from their content embeddings via quantization (RQ-VAE). A Transformer trained on sequences of these Semantic IDs outperforms traditional recommendation systems and generalizes to unseen items.
62
+
63
+ 2. **Action tokenization** (Google DeepMind, 2025): User action sequences can be tokenized using a BPE-like algorithm that merges frequently co-occurring features β€” the same algorithm that powers text tokenization, applied to business events instead of characters.
64
+
65
+ 3. **Transaction tokenization** (2024): Banking transactions β€” multimodal events of (date, amount, text) β€” can be encoded as composite tokens and modeled with self-supervised pre-training, achieving state-of-the-art on fraud detection and credit scoring.
66
+
67
+ 4. **Tabular tokenization** (2024–2025): Individual feature values (numerical, categorical) can be tokenized via relative magnitude encoding or serialization, enabling foundation models that transfer across different tabular datasets.
68
+
69
+ 5. **Universal tokenization** (2023–2024): Frameworks like Meta-Transformer demonstrate that 12+ modalities including time series and tabular data can be projected into a shared token space and processed by a single frozen Transformer.
70
+
71
+ This report details each paradigm, provides technical depth on the tokenization methods, and lays out a concrete blueprint for building domainTokenizer.
72
+
73
+ ---
74
+
75
+ ## 2. The Problem: Why Words Are Not Enough
76
+
77
+ ### 2.1 The Mismatch Between Business Data and Text Tokens
78
+
79
+ When an e-commerce platform processes a customer's purchase history, the raw data looks like:
80
+
81
+ ```
82
+ customer_42 | 2025-03-15 | SKU-8847291 | Electronics > Headphones | $79.99 | Credit Card | qty: 1
83
+ customer_42 | 2025-03-15 | SKU-3321098 | Electronics > Cables | $12.49 | Credit Card | qty: 2
84
+ customer_42 | 2025-04-01 | SKU-5519273 | Books > Technical | $44.95 | Debit Card | qty: 1
85
+ ```
86
+
87
+ If you feed this to a standard LLM tokenizer (e.g., GPT-4's `cl100k_base`), you get:
88
+
89
+ - `SKU-8847291` → split into meaningless subword fragments like `SK`, `U-`, `884`, `72`, `91`
90
+ - `$79.99` → tokenized as `$`, `79`, `.`, `99` — losing the semantic meaning of "a mid-range purchase"
91
+ - `2025-03-15` → fragmented into date components with no temporal understanding
92
+ - The **relationships** between fields (this amount goes with this product in this category) are lost in a flat token stream
93
+
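+ To see the fragmentation concretely, the short snippet below pushes one of the transaction lines above through the `cl100k_base` vocabulary via the `tiktoken` package. The exact splits depend on the tokenizer, so the printed fragments are illustrative rather than guaranteed.
+
+ ```python
+ import tiktoken  # pip install tiktoken
+
+ record = "customer_42 | 2025-03-15 | SKU-8847291 | Electronics > Headphones | $79.99 | Credit Card | qty: 1"
+
+ enc = tiktoken.get_encoding("cl100k_base")
+ token_ids = enc.encode(record)
+
+ # One structured transaction shatters into dozens of subword fragments
+ pieces = [enc.decode([t]) for t in token_ids]
+ print(len(token_ids), "tokens for a single transaction")
+ print(pieces)  # fragments like 'SK', 'U', '-', '884' rather than one SKU token
+ ```
+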
94
+ **The fundamental problem:** text tokenizers are optimized for the statistical structure of natural language. They know that `ing` and `tion` are common suffixes, that `the` is frequent, that `un-` is a prefix. They know nothing about:
95
+
96
+ - Product similarity (headphones and earbuds are related)
97
+ - Price ranges ($79.99 is "mid-range electronics" vs. $2,499 is "premium")
98
+ - Temporal patterns (weekly vs. monthly purchase cadence)
99
+ - Cross-field interactions (buying a cable right after headphones = accessory purchase)
100
+
101
+ ### 2.2 The Opportunity: Domain Structure is Richer Than Language
102
+
103
+ Business domains have structure that goes beyond what text captures:
104
+
105
+ | Dimension | Language | Business Domain |
106
+ |-----------|----------|-----------------|
107
+ | **Vocabulary** | ~50K–256K subwords | Millions of SKUs, thousands of categories |
108
+ | **Sequence meaning** | Word order determines syntax | Temporal order determines behavioral patterns |
109
+ | **Similarity** | Semantic (synonyms, paraphrases) | Collaborative (users who buy X also buy Y) |
110
+ | **Numerical values** | Rare, incidental | Central (prices, quantities, timestamps) |
111
+ | **Compositionality** | Words compose into sentences | Features compose into events/transactions |
112
+ | **Temporal dynamics** | Mostly static semantics | Evolving trends, seasonal patterns |
113
+
114
+ A domain tokenizer should exploit all of this structure.
115
+
116
+ ### 2.3 Why Small Models?
117
+
118
+ This project focuses on **small** models (tens of millions to low billions of parameters) because:
119
+
120
+ 1. **Domain data is structured** β€” you don't need 70B parameters to learn that "users who buy phones often buy cases." The pattern space is narrower than open-domain language.
121
+ 2. **Latency matters** β€” production systems need real-time inference (fraud detection, recommendations, pricing).
122
+ 3. **Data efficiency** β€” most businesses have millions, not trillions, of training examples.
123
+ 4. **Cost** β€” training and serving small models is orders of magnitude cheaper.
124
+ 5. **Interpretability** β€” smaller models with domain-specific tokens are more auditable than black-box LLMs.
125
+
126
+ ---
127
+
128
+ ## 3. The Core Insight: Anything Can Be a Token
129
+
130
+ The survey **"Next Token Prediction Towards Multimodal Intelligence"** ([arXiv: 2412.18619](https://arxiv.org/abs/2412.18619), 59 upvotes) formalizes this principle:
131
+
132
+ > Next-Token Prediction (NTP) is a **universal training objective** that works across modalities. The bottleneck is not the model architecture β€” it's **tokenization**: how you map domain entities into discrete token spaces.
133
+
134
+ This means the entire LLM machinery β€” attention, scaling laws, in-context learning, transfer learning β€” becomes available for any domain once you solve the tokenization problem.
135
+
136
+ The precedent is clear across modalities:
137
+
138
+ | Modality | How It's Tokenized | Key Paper |
139
+ |----------|--------------------|-----------|
140
+ | **Text** | BPE / WordPiece / SentencePiece | GPT, BERT, Llama |
141
+ | **Images** | VQ-VAE, patch embeddings | DALL-E, ViT |
142
+ | **Audio** | Spectral codecs (EnCodec) | AudioLM, Whisper |
143
+ | **Video** | 3D causal VAE | HiTVideo, Emu3 |
144
+ | **Robotics actions** | Discrete Cosine Transform | FAST (2501.09747) |
145
+ | **Products/Items** | **Semantic IDs via RQ-VAE** | **TIGER** |
146
+ | **User actions** | **BPE on feature sets** | **ActionPiece** |
147
+ | **Transactions** | **Composite (date+amount+text)** | **Banking TF** |
148
+ | **Tabular features** | **Magnitude binning, serialization** | **TP-BERTa, TabuLa** |
149
+ | **Time series** | Scalar quantization, symbolic discretization | TokenCast, LLMTime |
150
+
151
+ The bottom half of this table β€” the business-domain entries β€” is where domainTokenizer operates.
152
+
153
+ ---
154
+
155
+ ## 4. Research Landscape: Five Paradigms of Domain Tokenization
156
+
157
+ ### 4.1 Semantic ID Tokenization (Products & Items)
158
+
159
+ **Core idea:** Encode each item (product, video, song, article) as a **sequence of discrete semantic tokens** derived from its content features.
160
+
161
+ **How it works:**
162
+ 1. Extract a dense embedding from item features (e.g., product title + description → SentenceT5 → 768-dim vector)
163
+ 2. Apply **Residual Quantization (RQ-VAE)**: iteratively quantize the embedding into a sequence of codebook indices
164
+ 3. The resulting tuple `(c1, c2, c3, ...)` is the item's **Semantic ID** — its "word" in the domain language
165
+ 4. Train a Transformer to predict sequences of these Semantic IDs
166
+
167
+ **Key property:** Items with similar content share token prefixes, creating a hierarchical semantic structure:
168
+ ```
169
+ Headphones A: [Audio, 23, 7, 41]
170
+ Headphones B: [Audio, 23, 7, 55]     ← shares 3/4 prefix tokens
171
+ Laptop C:     [Computing, 8, 31, 12] ← completely different tokens
172
+ ```
173
+
174
+ **Papers:**
175
+ - **TIGER** (Google, 2023) β€” [arXiv: 2305.05065](https://arxiv.org/abs/2305.05065) β€” The landmark paper introducing Semantic IDs for recommendation. [GitHub 781⭐](https://github.com/EdoardoBotta/RQ-VAE-Recommender)
176
+ - **Semantic IDs at YouTube** (Google, 2023) β€” [arXiv: 2306.08121](https://arxiv.org/abs/2306.08121) β€” Deployed at industry scale, replacing random IDs
177
+ - **PRISM** (2025) β€” [arXiv: 2601.16556](https://arxiv.org/abs/2601.16556) β€” Purified quantization for better semantic tokenization
178
+ - **MMGRec** (2024) β€” [arXiv: 2404.16555](https://arxiv.org/abs/2404.16555) β€” Graph RQ-VAE incorporating multimodal item features
179
+ - **Semantic IDs for Joint Search & Rec** (2025) β€” [arXiv: 2508.10478](https://arxiv.org/abs/2508.10478) β€” Unified Semantic IDs across search and recommendation
180
+
181
+ ### 4.2 Action Sequence Tokenization (User Behaviors)
182
+
183
+ **Core idea:** Don't just tokenize individual items β€” tokenize the **entire action sequence**, where each action is a composite event with multiple features.
184
+
185
+ **How it works:**
186
+ 1. Represent each user action as an **unordered set of features**: `{category: Electronics, price_bin: $50-100, brand: Sony, payment: Credit}`
187
+ 2. Apply a **BPE-like vocabulary construction** algorithm that merges frequently co-occurring feature patterns:
188
+ - Count co-occurrence of feature pairs both within actions and across adjacent actions
189
+ - Merge the most frequent pair into a new token
190
+ - Repeat until desired vocabulary size is reached
191
+ 3. The same action can be tokenized differently depending on surrounding context
192
+
193
+ **Key insight (from ActionPiece):** Just as BPE discovers that `t` + `h` + `e` should be merged into a single `the` token in English, the action tokenizer discovers that `{Electronics, $50-100}` should be merged into a single composite token because they co-occur frequently in purchase sequences.
194
+
195
+ **Papers:**
196
+ - **ActionPiece** (Google DeepMind, 2025) β€” [arXiv: 2502.13581](https://arxiv.org/abs/2502.13581) β€” First context-aware action sequence tokenizer. [GitHub 53⭐](https://github.com/google-deepmind/action_piece)
197
+ - **MBGen** (2024) β€” [arXiv: 2405.16871](https://arxiv.org/abs/2405.16871) β€” Multi-behavior generative recommendation (view, click, purchase as different token types). [GitHub 57⭐](https://github.com/anananan116/MBGen)
198
+ - **SETRec** (2025) β€” [arXiv: 2502.10833](https://arxiv.org/abs/2502.10833) β€” Order-agnostic set identifiers integrating collaborative + semantic signals
199
+ - **ContRec** (2025) β€” [arXiv: 2504.12007](https://arxiv.org/abs/2504.12007) β€” Continuous tokens via sigma-VAE + diffusion
200
+
201
+ ### 4.3 Financial Transaction Tokenization
202
+
203
+ **Core idea:** Banking/financial transactions are **multimodal sequential events** (date + amount + description). Design a composite tokenizer that encodes all three modalities jointly.
204
+
205
+ **How it works (from Banking Transaction Flow paper):**
206
+ 1. **Date tokenization:** Convert to day-of-week + relative time since last transaction
207
+ 2. **Amount tokenization:** Quantize into logarithmic bins (captures the difference between $5 and $500 better than linear bins)
208
+ 3. **Wording tokenization:** Standard BPE on the transaction description text (e.g., "AMAZON MARKETPLACE" β†’ subword tokens)
209
+ 4. **Composite token:** Combine date + amount + wording tokens into a single transaction representation
210
+ 5. **Sequence ordering:** Within each day, sort transactions by ascending amount; across days, chronological order
211
+ 6. **Pre-train** with masked transaction prediction (mask entire transactions, not just subwords)
212
+
213
+ **Papers:**
214
+ - **Banking Transaction Flow** (2024) β€” [arXiv: 2410.08243](https://arxiv.org/abs/2410.08243) β€” Custom tokenizer for banking transactions; pre-trained models outperform prior art on transaction categorization (31 classes) and credit risk scoring
215
+ - **LBSF** (2024) β€” [arXiv: 2411.15056](https://arxiv.org/abs/2411.15056) β€” Long-term payment behavior sequence folding by merchant, with multi-field behavior encoding
216
+ - **Temporal Tokenization Strategies** (2025) β€” [arXiv: 2512.13618](https://arxiv.org/abs/2512.13618) β€” Systematic comparison of how to tokenize timestamps for event sequences. Key finding: log-based encoding works best for skewed financial data
217
+ - **FinTRec** (2025) β€” [arXiv: 2511.14865](https://arxiv.org/abs/2511.14865) β€” Transformer for long-range financial product recommendation with temporally heterogeneous context
218
+ - **TIMeSynC** (2024) β€” [arXiv: 2410.12825](https://arxiv.org/abs/2410.12825) β€” Encoder-decoder transformer for sequential intent prediction in financial services
219
+
220
+ ### 4.4 Tabular Feature Tokenization
221
+
222
+ **Core idea:** Each row in a table can be serialized as a sequence of tokens, and each feature value can be encoded meaningfully (not just as a text fragment).
223
+
224
+ **Key methods:**
225
+ - **Relative Magnitude Tokenization (RMT):** Instead of tokenizing "$79.99" as text fragments, discretize it relative to the feature's distribution → "percentile_75" or "bin_high". This preserves ordinal relationships.
226
+ - **Intra-Feature Attention:** Bind each value token to its column name via attention, so the model knows "$79.99" means "price is $79.99", not just a number.
227
+ - **Serialization:** Convert rows to natural language: `"price: $79.99, category: Electronics, brand: Sony"` β€” surprisingly effective with large enough models.
228
+
229
+ **Papers:**
230
+ - **TP-BERTa** (2024) β€” [arXiv: 2403.01841](https://arxiv.org/abs/2403.01841) β€” Relative Magnitude Tokenization + intra-feature attention. Competitive with XGBoost/LightGBM.
231
+ - **TabuLa-8B** (2024) β€” [arXiv: 2406.12031](https://arxiv.org/abs/2406.12031) β€” Llama 3-8B fine-tuned on serialized tabular data. Strong zero/few-shot. [GitHub 71⭐](https://github.com/mlfoundations/rtfm)
232
+ - **TabSTAR** (2025) β€” [arXiv: 2505.18125](https://arxiv.org/abs/2505.18125) β€” Foundation tabular model with semantically target-aware representations. [GitHub 83⭐](https://github.com/alanarazi7/TabSTAR). 112 upvotes.
233
+ - **UniTabE** (2023) β€” [arXiv: 2307.09249](https://arxiv.org/abs/2307.09249) β€” Universal pretraining protocol for tabular foundation models
234
+ - **TARTE** (2025) β€” [arXiv: 2505.14415](https://arxiv.org/abs/2505.14415) β€” Knowledge-enhanced tabular representations via pre-training on column names + table entries
235
+ - **TabICL** (2025) β€” [arXiv: 2502.05564](https://arxiv.org/abs/2502.05564) β€” Column-then-row attention, scales to 500K samples
236
+ - **Language Modeling on Tabular Data: A Survey** (2024) β€” [arXiv: 2408.10548](https://arxiv.org/abs/2408.10548) β€” Comprehensive survey. [GitHub 33⭐](https://github.com/lanxiang1017/language-modeling-on-tabular-data-survey)
237
+
238
+ ### 4.5 Universal Modality Tokenization
239
+
240
+ **Core idea:** Project all modalities β€” including time series, tabular data, graphs β€” into a **shared discrete token space** and process them with a single Transformer.
241
+
242
+ **Papers:**
243
+ - **Meta-Transformer** (2023) β€” [arXiv: 2307.10802](https://arxiv.org/abs/2307.10802) β€” 12 modalities (text, image, audio, video, point cloud, **time series**, **tabular**, IMU, graph, etc.) via a unified tokenizer + frozen encoder. [GitHub 1652⭐](https://github.com/invictus717/MetaTransformer). 45 upvotes.
244
+ - **Emu3** (2024) β€” [arXiv: 2409.18869](https://arxiv.org/abs/2409.18869) β€” Next-token prediction is all you need across modalities. [GitHub 2400⭐](https://github.com/baaivision/emu3). 99 upvotes.
245
+ - **Unified-IO 2** (2023) β€” [arXiv: 2312.17172](https://arxiv.org/abs/2312.17172) β€” Images, text, audio, and actions in one autoregressive model. [GitHub 647⭐](https://github.com/allenai/unified-io-2). 30 upvotes.
246
+ - **NTP Multimodal Survey** (2024) β€” [arXiv: 2412.18619](https://arxiv.org/abs/2412.18619) β€” Comprehensive taxonomy of multimodal tokenization + NTP. [GitHub 478⭐](https://github.com/lmm101/awesome-multimodal-next-token-prediction). 59 upvotes.
247
+ - **LongCat-Next** (2025) β€” [arXiv: 2603.27538](https://arxiv.org/abs/2603.27538) β€” Lexicalizing modalities as discrete tokens. [GitHub 409⭐](https://github.com/meituan-longcat/LongCat-Next). 145 upvotes.
248
+
249
+ ---
250
+
251
+ ## 5. Key Papers: Detailed Analysis
252
+
253
+ ### 5.1 TIGER — Semantic IDs for Generative Retrieval
254
+
255
+ **Full title:** "Recommender Systems with Generative Retrieval"
256
+ **Authors:** Shashank Rajput, Nikhil Mehta, Anima Singh, et al. (Google Research)
257
+ **Link:** [arXiv: 2305.05065](https://arxiv.org/abs/2305.05065) | [GitHub 781⭐](https://github.com/EdoardoBotta/RQ-VAE-Recommender)
258
+
259
+ **What it does:**
260
+ TIGER (Transformer Index for GEnerative Recommenders) replaces the traditional two-stage retrieve-and-rank pipeline with a single generative model. Each item is assigned a Semantic ID β€” a tuple of discrete codewords β€” and the model autoregressively generates the Semantic ID of the next item a user will interact with.
261
+
262
+ **Semantic ID generation process:**
263
+ ```
264
+ Item features (title, description, ...)
265
+   → Pre-trained text encoder (SentenceT5)
266
+   → Dense embedding (768-dim)
267
+   → Residual Quantization (RQ-VAE)
268
+   → Semantic ID: (c1, c2, c3, ..., cK)   # K codewords from K codebooks
269
+ ```
270
+
271
+ **Residual Quantization (RQ):**
272
+ 1. Quantize the embedding to the nearest codebook entry → c1
273
+ 2. Compute the **residual** (difference between original and quantized)
274
+ 3. Quantize the residual → c2
275
+ 4. Repeat K times
276
+
277
+ This creates a **hierarchical** representation: c1 captures coarse semantics (category-level), c2 refines it, c3 further, etc.
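+
+ A minimal sketch of this encode loop, assuming the K codebooks have already been learned (for example by k-means or by training the full RQ-VAE); `rq_encode` and the toy shapes below are illustrative, not the paper's implementation:
+
+ ```python
+ import numpy as np
+
+ def rq_encode(embedding: np.ndarray, codebooks: list) -> list:
+     """Encode one item embedding into a Semantic ID (one index per codebook level)."""
+     semantic_id = []
+     residual = embedding.copy()
+     for codebook in codebooks:               # codebook: (codebook_size, dim)
+         dists = np.linalg.norm(codebook - residual, axis=1)
+         idx = int(np.argmin(dists))          # nearest codebook entry
+         semantic_id.append(idx)
+         residual = residual - codebook[idx]  # the next level quantizes what is left over
+     return semantic_id
+
+ # Toy usage: 4 levels of 256 codes over a 768-dim embedding space
+ rng = np.random.default_rng(0)
+ codebooks = [rng.normal(size=(256, 768)) for _ in range(4)]
+ item_embedding = rng.normal(size=768)
+ print(rq_encode(item_embedding, codebooks))  # a 4-token Semantic ID (c1, c2, c3, c4)
+ ```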
278
+
279
+ **Training:**
280
+ - Input: sequence of Semantic IDs representing a user's past interactions
281
+ - Target: Semantic ID of the next item
282
+ - Loss: cross-entropy at each code position
283
+ - Architecture: standard Transformer encoder-decoder
284
+
285
+ **Key results:**
286
+ - Outperforms SASRec, BERT4Rec, and dual-encoder baselines on Amazon datasets
287
+ - **Cold-start capability:** can recommend items never seen in training (because Semantic IDs generalize via shared prefixes)
288
+ - **Diversity:** beam search with temperature naturally produces diverse recommendations
289
+
290
+ **Relevance to domainTokenizer:** TIGER's Semantic ID is the canonical example of how to create a "word" for a non-textual entity. The RQ-VAE approach is directly applicable to any item-based domain.
291
+
292
+ ---
293
+
294
+ ### 5.2 ActionPiece — BPE for User Actions
295
+
296
+ **Full title:** "ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation"
297
+ **Authors:** Yupeng Hou, Jianmo Ni, Zhankui He, et al. (Google DeepMind)
298
+ **Link:** [arXiv: 2502.13581](https://arxiv.org/abs/2502.13581) | [GitHub 53⭐](https://github.com/google-deepmind/action_piece)
299
+
300
+ **What it does:**
301
+ ActionPiece is the first **context-aware** tokenizer for user action sequences. It applies the BPE principle β€” merging frequently co-occurring pairs β€” but on **sets of item features** rather than characters.
302
+
303
+ **Key innovation β€” actions as unordered feature sets:**
304
+ Instead of treating each item as an atomic ID, ActionPiece represents each user action as a set of features:
305
+ ```
306
+ Action = {category: "Electronics", brand: "Sony", price_range: "$50-100", ...}
307
+ ```
308
+
309
+ **Vocabulary construction (BPE-like):**
310
+ 1. Start with base vocabulary = all individual features
311
+ 2. Count co-occurrence of feature pairs:
312
+ - **Intra-action:** features within the same action (e.g., "Electronics" + "$50-100")
313
+ - **Inter-action:** features across adjacent actions (e.g., "Phone" in action t, "PhoneCase" in action t+1)
314
+ 3. Merge the most frequent pair into a new composite token
315
+ 4. Repeat until the desired vocabulary size is reached (see the sketch below)
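+
+ A simplified version of this merge loop, counting only intra-action pairs for brevity; the actual ActionPiece algorithm also counts inter-action pairs and handles set permutations, so treat it as an illustration rather than the authors' code:
+
+ ```python
+ from collections import Counter
+ from itertools import combinations
+
+ def build_vocab(corpus, target_size):
+     """corpus: list of user sequences; each action is a frozenset of feature tokens."""
+     vocab = {feat for seq in corpus for action in seq for feat in action}
+     merges = []
+     while len(vocab) + len(merges) < target_size:
+         pair_counts = Counter()
+         for seq in corpus:
+             for action in seq:
+                 for pair in combinations(sorted(action), 2):  # intra-action co-occurrence
+                     pair_counts[pair] += 1
+         if not pair_counts:
+             break
+         (a, b), _ = pair_counts.most_common(1)[0]
+         merged = f"{a}+{b}"                                   # new composite token
+         merges.append((a, b, merged))
+         # Re-tokenize: wherever both features appear in an action, use the composite token
+         corpus = [[frozenset({merged} | (action - {a, b})) if {a, b} <= action else action
+                    for action in seq] for seq in corpus]
+     return vocab, merges
+
+ toy_corpus = [[frozenset({"Electronics", "price_50_100", "CreditCard"}),
+                frozenset({"Electronics", "price_0_25", "CreditCard"})]]
+ print(build_vocab(toy_corpus, target_size=8)[1])  # merge rules found on the toy corpus
+ ```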
316
+
317
+ **Set Permutation Regularization (SPR):**
318
+ Because feature sets are unordered, the same action can be tokenized with different internal orderings. SPR produces multiple segmentations of the same sequence, acting as data augmentation and preventing the model from overfitting to arbitrary feature orderings.
319
+
320
+ **Key results:**
321
+ - Outperforms TIGER, SASRec, BERT4Rec on Amazon Sports, Beauty, and CDs datasets
322
+ - NDCG@10 improvements of 5–15% over TIGER
323
+ - The context-aware tokenization means the same item gets different tokens in different behavioral contexts
324
+
325
+ **Relevance to domainTokenizer:** ActionPiece is the most directly applicable template for building a domain tokenizer. Its BPE-like algorithm can be generalized to any domain where events are composed of multiple features.
326
+
327
+ ---
328
+
329
+ ### 5.3 Banking Transaction Flow — Transactions as Tokens
330
+
331
+ **Full title:** "Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow"
332
+ **Authors:** Cyrile Delestre, Yoann Sola
333
+ **Link:** [arXiv: 2410.08243](https://arxiv.org/abs/2410.08243)
334
+
335
+ **What it does:**
336
+ Designs a custom tokenizer for banking transactions β€” multimodal events consisting of (date, numerical amount, text wording) β€” and pre-trains Transformer and RNN models on large-scale transaction data.
337
+
338
+ **Tokenization scheme:**
339
+ 1. **Date modality:** Converted to relative temporal features (days since last transaction, day of week)
340
+ 2. **Amount modality:** Quantized into bins. The paper doesn't specify the exact binning, but refers to discretization that preserves order and magnitude.
341
+ 3. **Wording modality:** Standard BPE tokenization on the text description (e.g., merchant names, transaction descriptions) after normalization (removing account numbers, dates from text, standardizing merchant names)
342
+ 4. **Composite embedding:** Each modality's tokens are independently embedded, then combined via concatenation or learned projection into a single transaction-level representation
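+
+ A rough sketch of steps 1–3, using hand-picked log-spaced amount bins and simple calendar features; the helper names and bin edges are assumptions for illustration, not the paper's exact scheme:
+
+ ```python
+ from datetime import date
+
+ AMOUNT_EDGES = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 5000]  # log-spaced bin edges (assumed)
+
+ def amount_token(amount: float) -> str:
+     """Map an amount to a logarithmic bin token, e.g. 79.99 -> 'amt_bin_6' (50-100)."""
+     for i, edge in enumerate(AMOUNT_EDGES):
+         if abs(amount) < edge:
+             return f"amt_bin_{i}"
+     return f"amt_bin_{len(AMOUNT_EDGES)}"
+
+ def date_tokens(day: date, prev_day: date) -> list:
+     """Day-of-week plus time elapsed since the previous transaction."""
+     return [f"dow_{day.strftime('%a').lower()}", f"delta_{(day - prev_day).days}d"]
+
+ def wording_tokens(description: str) -> list:
+     return description.lower().split()  # stand-in for BPE on the normalized description
+
+ def tokenize_transaction(day, prev_day, amount, description):
+     return date_tokens(day, prev_day) + [amount_token(amount)] + wording_tokens(description)
+
+ print(tokenize_transaction(date(2025, 3, 15), date(2025, 3, 13), 79.99, "AMAZON MARKETPLACE"))
+ # ['dow_sat', 'delta_2d', 'amt_bin_6', 'amazon', 'marketplace']
+ ```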
343
+
344
+ **Sequence construction:**
345
+ - Within each day: transactions sorted by ascending amount
346
+ - Across days: chronological order
347
+ - Special separator tokens between days
348
+
349
+ **Pre-training (self-supervised):**
350
+ - **Masked Transaction Prediction (MTP):** Mask entire transactions (not just subword tokens within a description), predict the masked transaction. This forces the model to learn cross-transaction patterns.
351
+ - Both RNN (BiLSTM-based, ELMo-style) and Transformer (BERT-style) pre-training explored
352
+
353
+ **Downstream tasks:**
354
+ - **Transaction categorization:** 31 classes (income, shopping, subscription, transport, savings, etc.). Fine-tuned pre-trained models beat all baselines.
355
+ - **Credit risk scoring:** Binary classification of default risk. Pre-trained models significantly outperform non-pre-trained approaches.
356
+
357
+ **Relevance to domainTokenizer:** This is the closest existing work to an e-commerce transaction tokenizer. The multimodal composite tokenization approach (date + amount + text) is directly applicable.
358
+
359
+ ---
360
+
361
+ ### 5.4 LETTER — Learnable Item Tokenization
362
+
363
+ **Full title:** "Learnable Item Tokenization for Generative Recommendation"
364
+ **Authors:** Wenjie Wang, Honghui Bao, et al.
365
+ **Link:** [arXiv: 2405.07314](https://arxiv.org/abs/2405.07314) | [GitHub 153⭐](https://github.com/honghuibao2000/letter)
366
+
367
+ **What it does:**
368
+ LETTER addresses three limitations of prior item tokenization methods:
369
+ 1. **ID-based:** No semantic information, can't generalize to new items
370
+ 2. **Text-based:** Lose collaborative signals (who bought what with what)
371
+ 3. **Codebook-based (RQ-VAE):** Suffer from code assignment bias (popular items get all the good codes)
372
+
373
+ **LETTER's solution β€” a learnable tokenizer with three objectives:**
374
+ 1. **Semantic regularization:** Tokenizer's codebook should respect semantic similarity (similar items β†’ similar codes)
375
+ 2. **Contrastive alignment:** Tokens should capture collaborative filtering signals (items bought together β†’ nearby in token space)
376
+ 3. **Diversity loss:** Prevent codebook collapse β€” ensure all codes are used, not just a few popular ones
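+
+ As a toy illustration of the third objective, one common way to discourage codebook collapse (not necessarily LETTER's exact formulation) is to maximize the entropy of code usage within a batch; the sketch below uses PyTorch:
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def codebook_usage_entropy_loss(logits: torch.Tensor) -> torch.Tensor:
+     """logits: (batch, codebook_size) scores of each item against each code.
+
+     Low when all codes are used roughly equally across the batch,
+     high when a few popular codes absorb almost everything.
+     """
+     probs = F.softmax(logits, dim=-1)      # soft code assignment per item
+     usage = probs.mean(dim=0)              # average usage of each code over the batch
+     entropy = -(usage * (usage + 1e-9).log()).sum()
+     return -entropy                        # maximizing entropy == minimizing this loss
+
+ # Sanity check: collapsed assignments are penalized more than uniform ones
+ collapsed = torch.zeros(32, 256); collapsed[:, 0] = 10.0
+ uniform = torch.zeros(32, 256)
+ print(codebook_usage_entropy_loss(collapsed) > codebook_usage_entropy_loss(uniform))  # True
+ ```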
377
+
378
+ **Architecture:**
379
+ - Uses Residual Quantized VAE (like TIGER) as the base tokenizer
380
+ - Adds the three losses above during tokenizer training
381
+ - The tokenizer is trained jointly with (or alternately with) the generative recommendation model
382
+
383
+ **Key results:**
384
+ - Outperforms TIGER, P5, and other generative recommendation baselines
385
+ - Particularly strong on long-tail items (items with few interactions) due to the diversity loss
386
+
387
+ **Relevance to domainTokenizer:** LETTER shows that **the tokenizer itself should be a learnable model** trained with domain-specific objectives, not just a fixed preprocessing step.
388
+
389
+ ---
390
+
391
+ ### 5.5 TP-BERTa — Numerical Value Tokenization
392
+
393
+ **Full title:** "Making Pre-trained Language Models Great on Tabular Prediction"
394
+ **Authors:** Jiahuan Yan, et al.
395
+ **Link:** [arXiv: 2403.01841](https://arxiv.org/abs/2403.01841)
396
+
397
+ **What it does:**
398
+ Solves the fundamental problem of representing **numerical feature values** as tokens. Standard text tokenizers fragment numbers meaninglessly. TP-BERTa introduces **Relative Magnitude Tokenization (RMT)**.
399
+
400
+ **Relative Magnitude Tokenization:**
401
+ Instead of tokenizing the raw number "$79.99" as text:
402
+ 1. Compute the feature's distribution across the dataset
403
+ 2. Express each value as its **relative position** in that distribution
404
+ 3. Discretize into bins: "very_low", "low", "medium", "high", "very_high" (or finer)
405
+ 4. The token is the bin label, which preserves ordinal relationships
406
+
407
+ Example:
408
+ ```
409
+ price = $79.99
410
+   → Within the "price" feature distribution, $79.99 is at the 73rd percentile
411
+   → Token: "price_bin_73" or "price_high"
412
+ ```
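+
+ A minimal sketch of the binning idea, assuming the whole training column is available up front; in TP-BERTa the magnitude tokens are integrated into the language model itself, so this only shows the discretization step with hypothetical names:
+
+ ```python
+ import numpy as np
+
+ def fit_quantile_edges(values: np.ndarray, n_bins: int = 10) -> np.ndarray:
+     """Bin edges such that each bin holds roughly the same fraction of the data."""
+     return np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
+
+ def rmt_token(value: float, feature_name: str, edges: np.ndarray) -> str:
+     bin_idx = int(np.searchsorted(edges, value))  # relative position in the distribution
+     return f"{feature_name}_bin_{bin_idx}"
+
+ prices = np.random.default_rng(0).lognormal(mean=3.5, sigma=1.0, size=10_000)
+ edges = fit_quantile_edges(prices, n_bins=10)
+ print(rmt_token(79.99, "price", edges))  # e.g. "price_bin_8" for a value near the 80th percentile
+ ```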
413
+
414
+ **Intra-Feature Attention:**
415
+ Each feature value is paired with its feature name:
416
+ ```
417
+ "price" β†’ [price_name_embedding]
418
+ "$79.99" β†’ [price_value_embedding via RMT]
419
+ ```
420
+ Intra-feature attention binds them, so the model knows this number means "price" not "quantity" or "weight".
421
+
422
+ **Key results:**
423
+ - TP-BERTa is competitive with XGBoost and LightGBM on standard tabular benchmarks
424
+ - Significantly outperforms other deep learning approaches on tabular data
425
+ - The pre-trained model transfers across different tables
426
+
427
+ **Relevance to domainTokenizer:** RMT solves the critical problem of numerical tokenization. Every domain tokenizer will need to handle numbers (prices, amounts, quantities, durations), and RMT is currently the best approach.
428
+
429
+ ---
430
+
431
+ ### 5.6 Meta-Transformer — 12 Modalities, One Token Space
432
+
433
+ **Full title:** "Meta-Transformer: A Unified Framework for Multimodal Learning"
434
+ **Authors:** Yiyuan Zhang, Kaixiong Gong, et al.
435
+ **Link:** [arXiv: 2307.10802](https://arxiv.org/abs/2307.10802) | [GitHub 1652⭐](https://github.com/invictus717/MetaTransformer)
436
+
437
+ **What it does:**
438
+ Demonstrates that a single frozen Transformer encoder can process 12 different modalities β€” including **time series** and **tabular data** β€” by projecting each modality into a shared token space via modality-specific tokenizers.
439
+
440
+ **Modality-specific tokenizers:**
441
+ - **Text:** standard embedding
442
+ - **Image:** patch embedding (ViT-style)
443
+ - **Audio:** spectrogram patches
444
+ - **Time series:** segment embedding (chop the series into fixed-length segments and project each to a token; see the sketch after this list)
445
+ - **Tabular:** feature-wise embedding (each column value becomes a token)
446
+ - **Graph:** node feature embedding
447
+ - **Point cloud:** point group embedding
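+
+ The time-series and tabular entries above follow the same recipe: cut the input into small pieces and pass each piece through a lightweight learnable projection. A schematic sketch for time series (not Meta-Transformer's actual code, and the sizes are assumptions):
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class SegmentTokenizer(nn.Module):
+     """Chop a univariate series into fixed-length segments and project each to a token."""
+     def __init__(self, segment_len: int = 16, d_model: int = 768):
+         super().__init__()
+         self.segment_len = segment_len
+         self.proj = nn.Linear(segment_len, d_model)  # lightweight, learnable projection
+
+     def forward(self, series: torch.Tensor) -> torch.Tensor:
+         # series: (batch, length), with length divisible by segment_len
+         batch, length = series.shape
+         segments = series.reshape(batch, length // self.segment_len, self.segment_len)
+         return self.proj(segments)                   # (batch, n_segments, d_model) tokens
+
+ tokens = SegmentTokenizer()(torch.randn(4, 128))
+ print(tokens.shape)  # torch.Size([4, 8, 768]) -- 8 tokens go to the shared frozen encoder
+ ```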
448
+
449
+ **Key insight:** The tokenizers are lightweight (small learnable projections), and the Transformer encoder is **frozen** β€” trained once and shared across all modalities. This means the bulk of the computation is modality-agnostic.
450
+
451
+ **Relevance to domainTokenizer:** Meta-Transformer proves the viability of the unified approach. A domain tokenizer could use a similar architecture: lightweight domain-specific tokenizers feeding into a shared Transformer backbone.
452
+
453
+ ---
454
+
455
+ ## 6. Tokenization Methods: A Technical Taxonomy
456
+
457
+ ### 6.1 Quantization-Based (RQ-VAE, VQ-VAE)
458
+
459
+ **How it works:**
460
+ - Train a Vector Quantized Variational Autoencoder on item embeddings
461
+ - The encoder maps items to a continuous latent space
462
+ - The quantization layer maps each embedding to the nearest entry in a learned codebook
463
+ - **Residual Quantization (RQ):** apply quantization iteratively on residuals for multi-token representations
464
+ - The decoder reconstructs the original embedding from the quantized codes
465
+
466
+ **Strengths:**
467
+ - Produces hierarchically structured tokens (coarse-to-fine)
468
+ - Items with similar content naturally share token prefixes
469
+ - Controllable vocabulary size (codebook size × number of levels)
470
+
471
+ **Weaknesses:**
472
+ - Codebook collapse (some codes rarely used)
473
+ - Training instability (requires commitment loss, EMA updates, etc.)
474
+ - No collaborative signal unless explicitly added (see LETTER)
475
+
476
+ **Used by:** TIGER, LETTER, PRISM, MMGRec, MiniOneRec, GenRec
477
+
478
+ ### 6.2 BPE-Inspired Merging
479
+
480
+ **How it works:**
481
+ - Start with atomic features as the base vocabulary
482
+ - Count co-occurrence frequencies of feature pairs in the corpus
483
+ - Merge the most frequent pair into a new composite token
484
+ - Repeat until desired vocabulary size
485
+
486
+ **Strengths:**
487
+ - Naturally discovers meaningful composite patterns
488
+ - Context-aware (merges depend on surrounding actions)
489
+ - Directly analogous to text BPE β€” well-understood properties
490
+ - No neural network training required for vocabulary construction
491
+
492
+ **Weaknesses:**
493
+ - Greedy algorithm β€” may not find globally optimal vocabulary
494
+ - Requires careful handling of unordered feature sets (set permutation regularization)
495
+ - Vocabulary depends on corpus statistics β€” may not generalize to distribution shifts
496
+
497
+ **Used by:** ActionPiece
498
+
499
+ ### 6.3 Magnitude & Binning Approaches
500
+
501
+ **How it works:**
502
+ - For numerical values: compute distribution statistics, discretize into bins
503
+ - Options: uniform bins, quantile bins, logarithmic bins, adaptive bins
504
+ - For timestamps: calendar tokens (day-of-week, month, etc.) or relative encodings
505
+
506
+ **Strengths:**
507
+ - Simple, interpretable, no training required
508
+ - Preserves ordinal relationships
509
+ - Handles numerical data natively (no text conversion)
510
+
511
+ **Weaknesses:**
512
+ - Fixed granularity (bin resolution)
513
+ - Information loss at bin boundaries
514
+ - Requires domain knowledge to choose binning strategy
515
+
516
+ **Used by:** TP-BERTa, Banking Transaction Flow, Temporal Tokenization Strategies
517
+
518
+ ### 6.4 Learnable End-to-End Tokenizers
519
+
520
+ **How it works:**
521
+ - A neural network (encoder) maps raw domain data to discrete tokens
522
+ - The tokenizer is trained end-to-end with the downstream model
523
+ - Uses techniques like Gumbel-Softmax for differentiable discretization
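+
+ A minimal sketch of the Gumbel-Softmax trick mentioned above, using PyTorch's built-in `gumbel_softmax`; the surrounding encoder, decoder, and losses are omitted, so this only shows how a discrete token choice can stay differentiable:
+
+ ```python
+ import torch
+ import torch.nn as nn
+ import torch.nn.functional as F
+
+ class DiscreteBottleneck(nn.Module):
+     """Map an event embedding to a one-hot token choice that still passes gradients."""
+     def __init__(self, d_in: int, vocab_size: int):
+         super().__init__()
+         self.to_logits = nn.Linear(d_in, vocab_size)
+
+     def forward(self, x: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
+         logits = self.to_logits(x)
+         # hard=True emits one-hot tokens in the forward pass, while gradients
+         # flow through the soft sample (straight-through estimator)
+         return F.gumbel_softmax(logits, tau=tau, hard=True)
+
+ bottleneck = DiscreteBottleneck(d_in=64, vocab_size=1024)
+ one_hot_tokens = bottleneck(torch.randn(8, 64))
+ print(one_hot_tokens.shape, one_hot_tokens.sum(dim=-1))  # (8, 1024); each row sums to 1
+ ```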
524
+
525
+ **Strengths:**
526
+ - Tokenizer adapts to the downstream task
527
+ - Can incorporate multiple objectives (semantic, collaborative, diversity)
528
+ - No manual design of tokenization rules
529
+
530
+ **Weaknesses:**
531
+ - More complex training (joint optimization)
532
+ - Risk of tokenizer-model co-adaptation (poor generalization)
533
+ - Harder to interpret what tokens mean
534
+
535
+ **Used by:** LETTER, UniGRec, ContRec, MANTa
536
+
537
+ ### 6.5 Serialization-Based (Text Templates)
538
+
539
+ **How it works:**
540
+ - Convert each data record to a natural language string:
541
+ `"The customer bought Sony WH-1000XM5 headphones for $349.99 using a credit card on March 15, 2025."`
542
+ - Use a standard text tokenizer (BPE) on the serialized string
543
+ - Feed to a pre-trained LLM
544
+
545
+ **Strengths:**
546
+ - Zero engineering β€” use off-the-shelf LLMs
547
+ - Benefits from LLM's pre-trained world knowledge
548
+ - Handles heterogeneous schemas easily
549
+
550
+ **Weaknesses:**
551
+ - Extremely token-inefficient (one row might become 100+ tokens)
552
+ - Numerical values still poorly handled by text tokenizers
553
+ - Requires large models to work well (no "small model" possibility)
554
+ - No exploitation of domain structure
555
+
556
+ **Used by:** TabuLa-8B, TabSTAR (partially), various LLM-for-tabular approaches
557
+
558
+ ---
559
+
560
+ ## 7. The domainTokenizer Blueprint: How to Build It
561
+
562
+ ### 7.1 Architecture Design
563
+
564
+ Based on the research, domainTokenizer should have three components:
565
+
566
+ ```
567
+ ┌─────────────────────────────────────────────────┐
568
+ │                 domainTokenizer                 │
569
+ │                                                 │
570
+ │  ┌──────────────┐  ┌──────────────┐  ┌────────┐ │
571
+ │  │    Domain    │  │  Transformer │  │  Task  │ │
572
+ │  │  Tokenizer   │──│   Backbone   │──│  Heads │ │
573
+ │  │  (learnable) │  │    (small)   │  │        │ │
574
+ │  └──────────────┘  └──────────────┘  └────────┘ │
575
+ │                                                 │
576
+ │  Tokenizer: Domain events → discrete tokens     │
577
+ │  Backbone:  Sequence modeling via attention     │
578
+ │  Heads:     Task-specific outputs               │
579
+ └─────────────────────────────────────────────────┘
580
+ ```
581
+
582
+ **Domain Tokenizer (per-domain, learnable):**
583
+ - Handles the conversion of raw domain events into discrete tokens
584
+ - Combines multiple strategies: RQ-VAE for items, magnitude binning for numbers, BPE-like merging for feature compositions, calendar encoding for timestamps
585
+ - Small and fast (a few million parameters at most)
586
+
587
+ **Transformer Backbone (shared, small):**
588
+ - Standard causal or bidirectional Transformer
589
+ - Target sizes: 10M, 50M, 150M, 350M parameters
590
+ - Pre-trained on domain sequences with self-supervised objectives
591
+ - Potentially shareable across related domains
592
+
593
+ **Task Heads (per-task):**
594
+ - Classification head for fraud detection, churn prediction, etc.
595
+ - Generation head for next-event prediction, recommendation
596
+ - Regression head for value prediction (LTV, credit score, etc.)
597
+
598
+ ### 7.2 Tokenizer Construction Pipeline
599
+
600
+ For a given domain (e.g., e-commerce), the tokenizer construction follows:
601
+
602
+ **Step 1: Schema Analysis**
603
+ ```python
604
+ # Identify field types in the domain data
605
+ schema = {
606
+ "product_id": "categorical_entity", # β†’ Semantic ID via RQ-VAE
607
+ "category": "categorical_fixed", # β†’ direct vocabulary mapping
608
+ "price": "numerical_continuous", # β†’ magnitude binning (RMT)
609
+ "quantity": "numerical_discrete", # β†’ small fixed vocabulary
610
+ "timestamp": "temporal", # β†’ calendar + relative encoding
611
+ "description": "text", # β†’ standard BPE (subword)
612
+ "payment_method": "categorical_small", # β†’ direct mapping
613
+ "customer_id": "entity_id", # β†’ learned embedding or behavioral cluster
614
+ }
615
+ ```
616
+
617
+ **Step 2: Per-Field Tokenization**
618
+
619
+ | Field Type | Method | Output |
620
+ |------------|--------|--------|
621
+ | Categorical entity (products) | RQ-VAE Semantic IDs | Tuple of K codebook indices |
622
+ | Categorical fixed (categories) | Direct vocab mapping | Single token index |
623
+ | Numerical continuous (prices) | Relative Magnitude Tokenization | Bin token |
624
+ | Temporal (timestamps) | Calendar tokens + relative delta | 2–3 tokens (day-of-week, time-of-day, delta) |
625
+ | Text (descriptions) | Standard BPE | Variable-length subword tokens |
626
+ | Entity ID (customers) | Behavioral clustering or learned embedding | Single token or short sequence |
627
+
628
+ **Step 3: Composite Token Construction (BPE-like)**
629
+ Following ActionPiece, apply a BPE-like merge algorithm on the composite per-field tokens to discover meaningful multi-field patterns:
630
+ ```
631
+ Initial: [Electronics] [price_high] [CreditCard] [Weekday]
632
+ After merging: [Electronics+price_high] [CreditCard+Weekday]
633
+ Further: [HighEndElectronicsPurchase] [WeekdayCreditCard]
634
+ ```
635
+
636
+ **Step 4: Special Tokens**
637
+ ```
638
+ [SEP] - separates transactions in a sequence
639
+ [DAY_SEP] - separates days
640
+ [PAD] - padding
641
+ [MASK] - for masked pre-training
642
+ [CLS] - sequence-level representation
643
+ [UNK] - unknown/out-of-vocabulary events
644
+ ```
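+
+ Putting Steps 1–4 together, the sketch below turns a small purchase history into one token sequence. The per-field helpers and vocabularies are hypothetical stand-ins for the methods in the table above, not a fixed API:
+
+ ```python
+ # Hypothetical per-field vocabularies / helpers standing in for Step 2
+ CATEGORY_VOCAB = {"Electronics > Headphones": "cat_12", "Books > Technical": "cat_77"}
+ PAYMENT_VOCAB = {"Credit Card": "pay_credit", "Debit Card": "pay_debit"}
+
+ def price_bin(price: float) -> str:
+     return f"price_bin_{min(int(price // 20), 49)}"  # stand-in for RMT quantile bins
+
+ def tokenize_event(event: dict, prev_event) -> list:
+     """One transaction -> a flat list of domain tokens."""
+     days_since = 0 if prev_event is None else event["day"] - prev_event["day"]
+     return ([f"sem_{c}" for c in event["semantic_id"]]   # product Semantic ID from the RQ-VAE
+             + [CATEGORY_VOCAB[event["category"]]]
+             + [price_bin(event["price"])]
+             + [f"delta_{days_since}d", PAYMENT_VOCAB[event["payment"]]])
+
+ def tokenize_sequence(events: list) -> list:
+     tokens = ["[CLS]"]
+     for prev, event in zip([None] + events[:-1], events):
+         tokens += tokenize_event(event, prev) + ["[SEP]"]
+     return tokens
+
+ history = [
+     {"semantic_id": [42, 187, 23, 91], "category": "Electronics > Headphones",
+      "price": 79.99, "day": 0, "payment": "Credit Card"},
+     {"semantic_id": [8, 31, 12, 5], "category": "Books > Technical",
+      "price": 44.95, "day": 17, "payment": "Debit Card"},
+ ]
+ print(tokenize_sequence(history))
+ ```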
645
+
646
+ ### 7.3 Pre-training Objectives
647
+
648
+ Based on the literature, the following self-supervised objectives are most effective:
649
+
650
+ **1. Masked Event Prediction (MEP) β€” BERT-style**
651
+ - Mask 15% of complete events (not just individual tokens within an event)
652
+ - Predict all tokens of the masked event
653
+ - Forces the model to learn cross-event patterns
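+
+ A sketch of this whole-event masking, assuming each customer sequence is stored as a list of events and each event is itself a list of token IDs; the `MASK_ID` constant is an assumed vocabulary entry, while the 15% rate comes from the description above:
+
+ ```python
+ import random
+
+ MASK_ID = 3  # assumed ID of the [MASK] token in the vocabulary
+
+ def mask_events(sequence, mask_prob=0.15, seed=None):
+     """sequence: list of events, each event a list of token IDs."""
+     rng = random.Random(seed)
+     masked, targets = [], []
+     for event in sequence:
+         if rng.random() < mask_prob:
+             masked.append([MASK_ID] * len(event))  # hide the entire event
+             targets.append(list(event))            # the model must reconstruct all its tokens
+         else:
+             masked.append(list(event))
+             targets.append(None)                   # no loss on unmasked events
+     return masked, targets
+
+ seq = [[101, 7, 42], [102, 9, 17], [103, 7, 55], [104, 12, 80]]
+ print(mask_events(seq, mask_prob=0.5, seed=0))
+ ```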
654
+
655
+ **2. Next Event Prediction (NEP) β€” GPT-style**
656
+ - Given a sequence of events, predict the next event autoregressively
657
+ - Generate the event's token sequence (e.g., Semantic ID) token by token
658
+ - The primary objective for generative recommendation
659
+
660
+ **3. Contrastive Sequence Learning**
661
+ - Similar customer sequences should have similar representations
662
+ - Push apart sequences from different behavioral clusters
663
+ - Helps with customer segmentation and transfer learning
664
+
665
+ **4. Temporal Ordering**
666
+ - Given a shuffled sequence, predict the correct temporal order
667
+ - Forces the model to learn temporal patterns (seasonality, cadence, trends)
668
+
669
+ ### 7.4 Downstream Task Adaptation
670
+
671
+ Once pre-trained, the model can be fine-tuned for specific tasks:
672
+
673
+ | Task | Adaptation Method | Head |
674
+ |------|-------------------|------|
675
+ | **Next purchase prediction** | Continue NEP, decode Semantic IDs | Generative (autoregressive) |
676
+ | **Fraud detection** | Fine-tune on labeled transactions | Binary classifier on [CLS] |
677
+ | **Customer segmentation** | Extract [CLS] embeddings, cluster | No head (use embeddings) |
678
+ | **Churn prediction** | Fine-tune on labeled sequences | Binary classifier on [CLS] |
679
+ | **Credit scoring** | Fine-tune on labeled customer histories | Regression or classification |
680
+ | **Demand forecasting** | Adapt temporal patterns | Regression on quantity tokens |
681
+ | **Product recommendation** | NEP with Semantic ID decoding | Generative (beam search) |
682
+
683
+ ---
684
+
685
+ ## 8. Use Case Walkthrough: E-Commerce Transaction Model
686
+
687
+ ### The Scenario
688
+ An e-commerce platform with:
689
+ - 2M customers
690
+ - 500K products
691
+ - 100M transactions over 2 years
692
+ - Each transaction: `(customer_id, product_id, category, price, quantity, timestamp, payment_method, device)`
693
+
694
+ ### Step 1: Build the Tokenizer
695
+
696
+ **Product Semantic IDs:**
697
+ ```python
698
+ # 1. Generate product embeddings from title + description
699
+ product_embeddings = sentence_encoder([t + " " + d for t, d in zip(product_titles, product_descriptions)])  # 500K × 768
700
+
701
+ # 2. Train RQ-VAE with 4 codebooks of 256 entries each
702
+ rq_vae = ResidualQuantizedVAE(n_codebooks=4, codebook_size=256)
703
+ rq_vae.fit(product_embeddings)
704
+
705
+ # 3. Each product gets a 4-token Semantic ID
706
+ product_semantic_ids = rq_vae.encode(product_embeddings) # 500K Γ— 4
707
+ # e.g., Headphones → [42, 187, 23, 91]
708
+ ```
709
+
710
+ **Price Tokenization (RMT):**
711
+ ```python
712
+ # Compute percentile bins
713
+ price_bins = compute_quantile_bins(all_prices, n_bins=50)
714
+ # $79.99 → "price_bin_37" (the 37th of 50 quantile bins)
715
+ ```
716
+
717
+ **Timestamp Tokenization:**
718
+ ```python
719
+ # Calendar features + relative delta
720
+ def tokenize_timestamp(ts, prev_ts):
721
+     return [
722
+         day_of_week_token(ts),      # "wednesday"
723
+         time_of_day_token(ts),      # "afternoon"
724
+         delta_token(ts - prev_ts),  # "2_days_later"
725
+     ]
726
+ ```
727
+
728
+ **Composite vocabulary construction (BPE-like):**
729
+ ```python
730
+ # Run ActionPiece-style merging on the corpus of tokenized transaction sequences
731
+ vocabulary = actionpiece_vocab_construction(
732
+ corpus=all_tokenized_transactions,
733
+ target_vocab_size=8192,
734
+ consider_intra_event=True, # merge features within a transaction
735
+ consider_inter_event=True, # merge features across adjacent transactions
736
+ )
737
+ ```
738
+
739
+ ### Step 2: Pre-train
740
+
741
+ ```python
742
+ # Tokenize all 100M transactions
743
+ tokenized_corpus = tokenize_all_transactions(transactions, tokenizer)
744
+
745
+ # Pre-train a small Transformer (150M params)
746
+ model = TransformerLM(
747
+ vocab_size=8192 + special_tokens,
748
+ d_model=768,
749
+ n_heads=12,
750
+ n_layers=12,
751
+ max_seq_len=256, # ~256 transactions per customer
752
+ )
753
+
754
+ # Self-supervised pre-training with MEP + NEP
755
+ train(model, tokenized_corpus, objectives=["masked_event", "next_event"])
756
+ ```
757
+
758
+ ### Step 3: Fine-tune & Deploy
759
+
760
+ ```python
761
+ # Example: Fraud detection
762
+ fraud_model = add_classification_head(model, n_classes=2)
763
+ fine_tune(fraud_model, labeled_fraud_data)
764
+
765
+ # Example: Next purchase recommendation
766
+ rec_model = model # Use generative mode directly
767
+ next_item_semantic_id = rec_model.generate(customer_transaction_sequence)
768
+ next_item = rq_vae.decode(next_item_semantic_id) # Map back to product
769
+ ```
770
+
771
+ ---
772
+
773
+ ## 9. Open Challenges and Research Gaps
774
+
775
+ ### 9.1 Vocabulary Evolution
776
+ Products are added and removed constantly. Semantic IDs need to be recomputed, which may invalidate the model's learned associations. **Partial solutions:** periodic re-indexing (TIGER), using content features that are stable even when the catalog changes.
777
+
778
+ ### 9.2 Cross-Domain Transfer
779
+ Can a tokenizer trained on e-commerce data transfer to banking? The field-level tokenizers (RMT for numbers, calendar for dates) should transfer, but composite vocabularies are domain-specific. **Open question:** is there a "universal domain tokenizer" or will each domain need its own?
780
+
781
+ ### 9.3 Numerical Precision
782
+ All current methods lose some numerical precision through discretization. For applications where exact values matter (financial auditing, pricing optimization), this is a limitation. **Potential solution:** hybrid approaches that combine discrete tokens with continuous residuals.
783
+
784
+ ### 9.4 Handling Missing Data
785
+ Real business data is full of missing values. Text tokenizers never face this issue. Domain tokenizers need explicit strategies: [MISSING] tokens, imputation, or learning to model missingness as a signal.
786
+
787
+ ### 9.5 Privacy & Fairness
788
+ Tokenizing customer behavior raises privacy concerns. Semantic IDs could encode sensitive attributes (demographic patterns, financial status) in ways that are hard to audit. Domain tokenizers should be designed with fairness constraints.
789
+
790
+ ### 9.6 Scalability of BPE-Like Merging
791
+ ActionPiece's vocabulary construction is O(N × V) per merge step. For very large corpora (billions of events) and feature spaces (thousands of features), this may become prohibitively expensive. **Potential solution:** approximate counting, hierarchical merging, or neural vocabulary construction.
792
+
793
+ ### 9.7 Evaluation Standards
794
+ There are no standard benchmarks for "domain tokenization quality." Text tokenizers can be evaluated by compression ratio and downstream perplexity. Domain tokenizers need domain-specific metrics: recommendation quality, prediction accuracy, calibration, etc.
795
+
796
+ ### 9.8 Connection to Continual Learning
797
+ The HOPE / Nested Learning paradigm (see Section 11) suggests that models should continuously learn from new data. Domain tokenizers that can incrementally update their vocabularies β€” adding new product tokens, retiring obsolete ones β€” without full retraining would be highly valuable.
798
+
799
+ ---
800
+
801
+ ## 10. Complete Paper Reference Table
802
+
803
+ | # | Paper | Year | ArXiv | Domain | Key Contribution | GitHub |
804
+ |---|-------|------|-------|--------|-----------------|--------|
805
+ | 1 | **TIGER** | 2023 | [2305.05065](https://arxiv.org/abs/2305.05065) | Recommendation | Semantic IDs via RQ-VAE for generative retrieval | [781⭐](https://github.com/EdoardoBotta/RQ-VAE-Recommender) |
806
+ | 2 | **Semantic IDs (YouTube)** | 2023 | [2306.08121](https://arxiv.org/abs/2306.08121) | Recommendation | Content-derived IDs at industry scale | β€” |
807
+ | 3 | **ActionPiece** | 2025 | [2502.13581](https://arxiv.org/abs/2502.13581) | Recommendation | BPE-like context-aware action tokenization | [53⭐](https://github.com/google-deepmind/action_piece) |
808
+ | 4 | **LETTER** | 2024 | [2405.07314](https://arxiv.org/abs/2405.07314) | Recommendation | Learnable tokenizer with semantic+collaborative+diversity | [153⭐](https://github.com/honghuibao2000/letter) |
809
+ | 5 | **SETRec** | 2025 | [2502.10833](https://arxiv.org/abs/2502.10833) | Recommendation | Order-agnostic set identifiers | β€” |
810
+ | 6 | **ContRec** | 2025 | [2504.12007](https://arxiv.org/abs/2504.12007) | Recommendation | Continuous tokens via sigma-VAE + diffusion | β€” |
811
+ | 7 | **GenRec** | 2026 | [2604.14878](https://arxiv.org/abs/2604.14878) | Recommendation | Page-wise NTP for large-scale recommendation | β€” |
812
+ | 8 | **MBGen** | 2024 | [2405.16871](https://arxiv.org/abs/2405.16871) | Recommendation | Multi-behavior (view/click/buy) as token types | [57⭐](https://github.com/anananan116/MBGen) |
813
+ | 9 | **RSLLM** | 2024 | [2412.16933](https://arxiv.org/abs/2412.16933) | Recommendation | Recommendation as a new language in LLMs | β€” |
814
+ | 10 | **PRISM** | 2025 | [2601.16556](https://arxiv.org/abs/2601.16556) | Recommendation | Purified quantization for semantic tokenization | β€” |
815
+ | 11 | **MMGRec** | 2024 | [2404.16555](https://arxiv.org/abs/2404.16555) | Recommendation | Graph RQ-VAE for multimodal items | β€” |
816
+ | 12 | **UniGRec** | 2025 | [2601.17438](https://arxiv.org/abs/2601.17438) | Recommendation | Soft item identifiers for end-to-end optimization | β€” |
817
+ | 13 | **Semantic IDs for Search+Rec** | 2025 | [2508.10478](https://arxiv.org/abs/2508.10478) | Recommendation | Joint search and recommendation Semantic IDs | β€” |
818
+ | 14 | **Banking Transaction Flow** | 2024 | [2410.08243](https://arxiv.org/abs/2410.08243) | Finance | Composite tokenizer for (date, amount, text) transactions | β€” |
819
+ | 15 | **LBSF** | 2024 | [2411.15056](https://arxiv.org/abs/2411.15056) | Finance | Long-term payment behavior folding by merchant | β€” |
820
+ | 16 | **Temporal Tokenization** | 2025 | [2512.13618](https://arxiv.org/abs/2512.13618) | Events | Systematic comparison of temporal tokenization strategies | β€” |
821
+ | 17 | **FinTRec** | 2025 | [2511.14865](https://arxiv.org/abs/2511.14865) | Finance | Transformer for long-range financial recommendation | β€” |
822
+ | 18 | **TIMeSynC** | 2024 | [2410.12825](https://arxiv.org/abs/2410.12825) | Finance | Temporal intent prediction in financial services | β€” |
823
+ | 19 | **TP-BERTa** | 2024 | [2403.01841](https://arxiv.org/abs/2403.01841) | Tabular | Relative Magnitude Tokenization for numbers | β€” |
824
+ | 20 | **TabuLa-8B** | 2024 | [2406.12031](https://arxiv.org/abs/2406.12031) | Tabular | Llama 3 fine-tuned on serialized tables | [71⭐](https://github.com/mlfoundations/rtfm) |
825
+ | 21 | **TabSTAR** | 2025 | [2505.18125](https://arxiv.org/abs/2505.18125) | Tabular | Semantically target-aware tabular foundation model | [83⭐](https://github.com/alanarazi7/TabSTAR) |
826
+ | 22 | **UniTabE** | 2023 | [2307.09249](https://arxiv.org/abs/2307.09249) | Tabular | Universal tabular pretraining protocol | β€” |
827
+ | 23 | **TARTE** | 2025 | [2505.14415](https://arxiv.org/abs/2505.14415) | Tabular | Knowledge-enhanced tabular representations | β€” |
828
+ | 24 | **TabICL** | 2025 | [2502.05564](https://arxiv.org/abs/2502.05564) | Tabular | Column-then-row attention, scales to 500K samples | β€” |
829
+ | 25 | **Meta-Transformer** | 2023 | [2307.10802](https://arxiv.org/abs/2307.10802) | Universal | 12 modalities in one token space | [1652⭐](https://github.com/invictus717/MetaTransformer) |
830
+ | 26 | **Emu3** | 2024 | [2409.18869](https://arxiv.org/abs/2409.18869) | Universal | NTP is all you need across modalities | [2400⭐](https://github.com/baaivision/emu3) |
831
+ | 27 | **Unified-IO 2** | 2023 | [2312.17172](https://arxiv.org/abs/2312.17172) | Universal | Image+text+audio+action in one model | [647⭐](https://github.com/allenai/unified-io-2) |
832
+ | 28 | **NTP Multimodal Survey** | 2024 | [2412.18619](https://arxiv.org/abs/2412.18619) | Survey | Taxonomy of multimodal tokenization + NTP | [478⭐](https://github.com/lmm101/awesome-multimodal-next-token-prediction) |
833
+ | 29 | **LongCat-Next** | 2025 | [2603.27538](https://arxiv.org/abs/2603.27538) | Universal | Lexicalizing modalities as discrete tokens | [409⭐](https://github.com/meituan-longcat/LongCat-Next) |
834
+ | 30 | **Tabular Data Survey** | 2024 | [2408.10548](https://arxiv.org/abs/2408.10548) | Survey | Comprehensive survey of LMs for tabular data | [33⭐](https://github.com/lanxiang1017/language-modeling-on-tabular-data-survey) |
835
+ | 31 | **KL3M Tokenizers** | 2025 | [2503.17247](https://arxiv.org/abs/2503.17247) | Legal/Finance | Domain-specific BPE for professional text | [GitHub](https://github.com/alea-institute/kl3m-tokenizer-paper) |
836
+
837
+ ---
838
+
839
+ ## 11. Related Concepts: Nested Learning & Continual Adaptation
840
+
841
+ An important related development is the **Nested Learning** paradigm introduced by Google Research ([arXiv: 2512.24695](https://arxiv.org/abs/2512.24695), by Ali Behrouz et al.), which presents the **HOPE** architecture.
842
+
843
+ ### Why Nested Learning Matters for Domain Tokenization
844
+
845
+ Current Transformer-based models are "frozen" after pre-training β€” they cannot incorporate new knowledge without retraining. For domain tokenization, this means:
846
+ - A recommendation model can't learn about new products added after training
847
+ - A fraud detection model can't adapt to new fraud patterns in real-time
848
+ - A customer model can't update its understanding of a customer's evolving preferences
849
+
850
+ The HOPE architecture addresses this via:
851
+ 1. **Continuum Memory System (CMS):** Multiple MLP blocks updating at different frequencies β€” some update every few tokens (catching immediate patterns), others update only after millions of tokens (storing persistent knowledge). This prevents catastrophic forgetting.
852
+ 2. **Self-Modifying Titans:** The model's projection layers update themselves in real-time based on incoming data, enabling continuous adaptation.
853
+
854
+ **For domainTokenizer, the implication is:** a domain model built with Nested Learning principles could continuously learn from new transactions, adapting its understanding of products, customer preferences, and behavioral patterns without retraining from scratch.
855
+
856
+ This is an area of active exploration for future versions of domainTokenizer.
857
+
858
+ For the full research report on Nested Learning, see the [HOPE / Nested Learning discussion on HF Papers](https://huggingface.co/papers/2512.24695).
859
+
860
+ ---
861
+
862
+ *This report is a living document and will be updated as the domainTokenizer project evolves.*