# Domain Tokenization: Beyond Words — A Research Report
> **Building small models that understand domain tokens, not just words.**
>
> *Last updated: April 2026*
---
## Table of Contents
1. [Executive Summary](#1-executive-summary)
2. [The Problem: Why Words Are Not Enough](#2-the-problem-why-words-are-not-enough)
3. [The Core Insight: Anything Can Be a Token](#3-the-core-insight-anything-can-be-a-token)
4. [Research Landscape: Five Paradigms of Domain Tokenization](#4-research-landscape-five-paradigms-of-domain-tokenization)
- 4.1 [Semantic ID Tokenization (Products & Items)](#41-semantic-id-tokenization-products--items)
- 4.2 [Action Sequence Tokenization (User Behaviors)](#42-action-sequence-tokenization-user-behaviors)
- 4.3 [Financial Transaction Tokenization](#43-financial-transaction-tokenization)
- 4.4 [Tabular Feature Tokenization](#44-tabular-feature-tokenization)
- 4.5 [Universal Modality Tokenization](#45-universal-modality-tokenization)
5. [Key Papers: Detailed Analysis](#5-key-papers-detailed-analysis)
- 5.1 [TIGER — Semantic IDs for Generative Retrieval](#51-tiger--semantic-ids-for-generative-retrieval)
- 5.2 [ActionPiece — BPE for User Actions](#52-actionpiece--bpe-for-user-actions)
- 5.3 [Banking Transaction Flow — Transactions as Tokens](#53-banking-transaction-flow--transactions-as-tokens)
- 5.4 [LETTER — Learnable Item Tokenization](#54-letter--learnable-item-tokenization)
- 5.5 [TP-BERTa — Numerical Value Tokenization](#55-tp-berta--numerical-value-tokenization)
- 5.6 [Meta-Transformer — 12 Modalities, One Token Space](#56-meta-transformer--12-modalities-one-token-space)
6. [Tokenization Methods: A Technical Taxonomy](#6-tokenization-methods-a-technical-taxonomy)
- 6.1 [Quantization-Based (RQ-VAE, VQ-VAE)](#61-quantization-based-rq-vae-vq-vae)
- 6.2 [BPE-Inspired Merging](#62-bpe-inspired-merging)
- 6.3 [Magnitude & Binning Approaches](#63-magnitude--binning-approaches)
- 6.4 [Learnable End-to-End Tokenizers](#64-learnable-end-to-end-tokenizers)
- 6.5 [Serialization-Based (Text Templates)](#65-serialization-based-text-templates)
7. [The domainTokenizer Blueprint: How to Build It](#7-the-domaintokenizer-blueprint-how-to-build-it)
- 7.1 [Architecture Design](#71-architecture-design)
- 7.2 [Tokenizer Construction Pipeline](#72-tokenizer-construction-pipeline)
- 7.3 [Pre-training Objectives](#73-pre-training-objectives)
- 7.4 [Downstream Task Adaptation](#74-downstream-task-adaptation)
8. [Use Case Walkthrough: E-Commerce Transaction Model](#8-use-case-walkthrough-e-commerce-transaction-model)
9. [Open Challenges and Research Gaps](#9-open-challenges-and-research-gaps)
10. [Complete Paper Reference Table](#10-complete-paper-reference-table)
11. [Related Concepts: Nested Learning & Continual Adaptation](#11-related-concepts-nested-learning--continual-adaptation)
---
## 1. Executive Summary
Large Language Models (LLMs) process text by breaking it into **tokens** — subword units learned via algorithms like BPE (Byte-Pair Encoding). This tokenization is the foundation that allows Transformers to model sequential patterns via next-token prediction.
But words are just one type of sequential data. Businesses generate vast amounts of **non-textual sequential data** every day:
- **E-commerce:** millions of purchase transactions, each with product IDs, amounts, timestamps, categories
- **Banking:** transaction flows with dates, amounts, merchant codes, and descriptions
- **Healthcare:** sequences of diagnoses, procedures, lab results, medications
- **Advertising:** impression → click → conversion funnels with bid amounts and user features
- **Logistics:** shipping events, warehouse movements, delivery status sequences
**The central question this project explores:** Can we build tokenizers that encode these domain-specific entities — products, transactions, medical codes, user actions — as first-class tokens, and then train small, efficient Transformer models that understand domain patterns the way LLMs understand language?
**The answer from recent research is a resounding yes.** This report surveys 25+ papers spanning 2021–2026 that collectively establish a new paradigm: **domain tokenization**. The key findings are:
1. **Semantic IDs** (Google, 2023): Products can be encoded as tuples of discrete tokens derived from their content embeddings via quantization (RQ-VAE). A Transformer trained on sequences of these Semantic IDs outperforms traditional recommendation systems and generalizes to unseen items.
2. **Action tokenization** (Google DeepMind, 2025): User action sequences can be tokenized using a BPE-like algorithm that merges frequently co-occurring features — the same algorithm that powers text tokenization, applied to business events instead of characters.
3. **Transaction tokenization** (2024): Banking transactions — multimodal events of (date, amount, text) — can be encoded as composite tokens and modeled with self-supervised pre-training, achieving state-of-the-art results on transaction categorization and credit scoring.
4. **Tabular tokenization** (2024–2025): Individual feature values (numerical, categorical) can be tokenized via relative magnitude encoding or serialization, enabling foundation models that transfer across different tabular datasets.
5. **Universal tokenization** (2023–2024): Frameworks like Meta-Transformer demonstrate that 12+ modalities including time series and tabular data can be projected into a shared token space and processed by a single frozen Transformer.
This report details each paradigm, provides technical depth on the tokenization methods, and lays out a concrete blueprint for building domainTokenizer.
---
## 2. The Problem: Why Words Are Not Enough
### 2.1 The Mismatch Between Business Data and Text Tokens
When an e-commerce platform processes a customer's purchase history, the raw data looks like:
```
customer_42 | 2025-03-15 | SKU-8847291 | Electronics > Headphones | $79.99 | Credit Card | qty: 1
customer_42 | 2025-03-15 | SKU-3321098 | Electronics > Cables | $12.49 | Credit Card | qty: 2
customer_42 | 2025-04-01 | SKU-5519273 | Books > Technical | $44.95 | Debit Card | qty: 1
```
If you feed this to a standard LLM tokenizer (e.g., GPT-4's `cl100k_base`), you get:
- `SKU-8847291` → split into meaningless subword fragments like `SK`, `U-`, `884`, `72`, `91`
- `$79.99` → tokenized as `$`, `79`, `.`, `99` — losing the semantic meaning of "a mid-range purchase"
- `2025-03-15` → fragmented into date components with no temporal understanding
- The **relationships** between fields (this amount goes with this product in this category) are lost in a flat token stream
**The fundamental problem:** text tokenizers are optimized for the statistical structure of natural language. They know that `ing` and `tion` are common suffixes, that `the` is frequent, that `un-` is a prefix. They know nothing about:
- Product similarity (headphones and earbuds are related)
- Price ranges ($79.99 is "mid-range electronics" vs. $2,499 is "premium")
- Temporal patterns (weekly vs. monthly purchase cadence)
- Cross-field interactions (buying a cable right after headphones = accessory purchase)
### 2.2 The Opportunity: Domain Structure is Richer Than Language
Business domains have structure that goes beyond what text captures:
| Dimension | Language | Business Domain |
|-----------|----------|-----------------|
| **Vocabulary** | ~50K–256K subwords | Millions of SKUs, thousands of categories |
| **Sequence meaning** | Word order determines syntax | Temporal order determines behavioral patterns |
| **Similarity** | Semantic (synonyms, paraphrases) | Collaborative (users who buy X also buy Y) |
| **Numerical values** | Rare, incidental | Central (prices, quantities, timestamps) |
| **Compositionality** | Words compose into sentences | Features compose into events/transactions |
| **Temporal dynamics** | Mostly static semantics | Evolving trends, seasonal patterns |
A domain tokenizer should exploit all of this structure.
### 2.3 Why Small Models?
This project focuses on **small** models (tens of millions to low billions of parameters) because:
1. **Domain data is structured** — you don't need 70B parameters to learn that "users who buy phones often buy cases." The pattern space is narrower than open-domain language.
2. **Latency matters** — production systems need real-time inference (fraud detection, recommendations, pricing).
3. **Data efficiency** — most businesses have millions, not trillions, of training examples.
4. **Cost** — training and serving small models is orders of magnitude cheaper.
5. **Interpretability** — smaller models with domain-specific tokens are more auditable than black-box LLMs.
---
## 3. The Core Insight: Anything Can Be a Token
The survey **"Next Token Prediction Towards Multimodal Intelligence"** ([arXiv: 2412.18619](https://arxiv.org/abs/2412.18619), 59 upvotes) formalizes this principle:
> Next-Token Prediction (NTP) is a **universal training objective** that works across modalities. The bottleneck is not the model architecture — it's **tokenization**: how you map domain entities into discrete token spaces.
This means the entire LLM machinery — attention, scaling laws, in-context learning, transfer learning — becomes available for any domain once you solve the tokenization problem.
The precedent is clear across modalities:
| Modality | How It's Tokenized | Key Paper |
|----------|--------------------|-----------|
| **Text** | BPE / WordPiece / SentencePiece | GPT, BERT, Llama |
| **Images** | VQ-VAE, patch embeddings | DALL-E, ViT |
| **Audio** | Spectral codecs (EnCodec) | AudioLM, Whisper |
| **Video** | 3D causal VAE | HiTVideo, Emu3 |
| **Robotics actions** | Discrete Cosine Transform | FAST (2501.09747) |
| **Products/Items** | **Semantic IDs via RQ-VAE** | **TIGER** |
| **User actions** | **BPE on feature sets** | **ActionPiece** |
| **Transactions** | **Composite (date+amount+text)** | **Banking TF** |
| **Tabular features** | **Magnitude binning, serialization** | **TP-BERTa, TabuLa** |
| **Time series** | Scalar quantization, symbolic discretization | TokenCast, LLMTime |
The bottom half of this table — the business-domain entries — is where domainTokenizer operates.
---
## 4. Research Landscape: Five Paradigms of Domain Tokenization
### 4.1 Semantic ID Tokenization (Products & Items)
**Core idea:** Encode each item (product, video, song, article) as a **sequence of discrete semantic tokens** derived from its content features.
**How it works:**
1. Extract a dense embedding from item features (e.g., product title + description → SentenceT5 → 768-dim vector)
2. Apply **Residual Quantization (RQ-VAE)**: iteratively quantize the embedding into a sequence of codebook indices
3. The resulting tuple `(c1, c2, c3, ...)` is the item's **Semantic ID** — its "word" in the domain language
4. Train a Transformer to predict sequences of these Semantic IDs
**Key property:** Items with similar content share token prefixes, creating a hierarchical semantic structure:
```
Headphones A: [Audio, 23, 7, 41]
Headphones B: [Audio, 23, 7, 55] ← shares 3/4 prefix tokens
Laptop C: [Computing, 8, 31, 12] ← completely different tokens
```
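In code, the prefix-sharing property amounts to comparing leading codewords. A minimal sketch using the hypothetical IDs from the example above:

```python
def shared_prefix_len(a, b):
    """Number of leading codewords two Semantic IDs have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

# Hypothetical Semantic IDs mirroring the example above
headphones_a = ("Audio", 23, 7, 41)
headphones_b = ("Audio", 23, 7, 55)
laptop_c = ("Computing", 8, 31, 12)

assert shared_prefix_len(headphones_a, headphones_b) == 3  # near-duplicates
assert shared_prefix_len(headphones_a, laptop_c) == 0      # unrelated items
```

Shared-prefix length acts as a coarse content-similarity signal, which is what lets a generative model back off to sibling items for cold-start recommendations.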
**Papers:**
- **TIGER** (Google, 2023) — [arXiv: 2305.05065](https://arxiv.org/abs/2305.05065) — The landmark paper introducing Semantic IDs for recommendation. [GitHub 781⭐](https://github.com/EdoardoBotta/RQ-VAE-Recommender)
- **Semantic IDs at YouTube** (Google, 2023) — [arXiv: 2306.08121](https://arxiv.org/abs/2306.08121) — Deployed at industry scale, replacing random IDs
- **PRISM** (2025) — [arXiv: 2601.16556](https://arxiv.org/abs/2601.16556) — Purified quantization for better semantic tokenization
- **MMGRec** (2024) — [arXiv: 2404.16555](https://arxiv.org/abs/2404.16555) — Graph RQ-VAE incorporating multimodal item features
- **Semantic IDs for Joint Search & Rec** (2025) — [arXiv: 2508.10478](https://arxiv.org/abs/2508.10478) — Unified Semantic IDs across search and recommendation
### 4.2 Action Sequence Tokenization (User Behaviors)
**Core idea:** Don't just tokenize individual items — tokenize the **entire action sequence**, where each action is a composite event with multiple features.
**How it works:**
1. Represent each user action as an **unordered set of features**: `{category: Electronics, price_bin: $50-100, brand: Sony, payment: Credit}`
2. Apply a **BPE-like vocabulary construction** algorithm that merges frequently co-occurring feature patterns:
- Count co-occurrence of feature pairs both within actions and across adjacent actions
- Merge the most frequent pair into a new token
- Repeat until desired vocabulary size is reached
3. The same action can be tokenized differently depending on surrounding context
**Key insight (from ActionPiece):** Just as BPE discovers that `t` + `h` + `e` should be merged into a single `the` token in English, the action tokenizer discovers that `{Electronics, $50-100}` should be merged into a single composite token because they co-occur frequently in purchase sequences.
**Papers:**
- **ActionPiece** (Google DeepMind, 2025) — [arXiv: 2502.13581](https://arxiv.org/abs/2502.13581) — First context-aware action sequence tokenizer. [GitHub 53⭐](https://github.com/google-deepmind/action_piece)
- **MBGen** (2024) — [arXiv: 2405.16871](https://arxiv.org/abs/2405.16871) — Multi-behavior generative recommendation (view, click, purchase as different token types). [GitHub 57⭐](https://github.com/anananan116/MBGen)
- **SETRec** (2025) — [arXiv: 2502.10833](https://arxiv.org/abs/2502.10833) — Order-agnostic set identifiers integrating collaborative + semantic signals
- **ContRec** (2025) — [arXiv: 2504.12007](https://arxiv.org/abs/2504.12007) — Continuous tokens via sigma-VAE + diffusion
### 4.3 Financial Transaction Tokenization
**Core idea:** Banking/financial transactions are **multimodal sequential events** (date + amount + description). Design a composite tokenizer that encodes all three modalities jointly.
**How it works (from Banking Transaction Flow paper):**
1. **Date tokenization:** Convert to day-of-week + relative time since last transaction
2. **Amount tokenization:** Quantize into order- and magnitude-preserving bins (e.g., logarithmic bins, which separate $5 from $500 better than linear bins)
3. **Wording tokenization:** Standard BPE on the transaction description text (e.g., "AMAZON MARKETPLACE" → subword tokens)
4. **Composite token:** Combine date + amount + wording tokens into a single transaction representation
5. **Sequence ordering:** Within each day, sort transactions by ascending amount; across days, chronological order
6. **Pre-train** with masked transaction prediction (mask entire transactions, not just subwords)
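The amount step above can be sketched as logarithmic binning. This is a minimal sketch: the bin count, cap, and function name are illustrative assumptions, not values from the paper (which does not fully specify its binning):

```python
import math

def amount_token(amount_cents, n_bins=32, max_cents=10_000_000):
    """Map a transaction amount to a logarithmic bin index.

    Log bins give small everyday amounts (which dominate transaction
    data) finer resolution than a linear grid would; bin 0 is reserved
    for non-positive amounts. Parameters here are assumptions.
    """
    if amount_cents <= 0:
        return 0
    # Position of this amount on a log scale, normalized to [0, 1]
    scaled = math.log1p(amount_cents) / math.log1p(max_cents)
    return min(n_bins - 1, 1 + int(scaled * (n_bins - 1)))

# $5.00 and $500.00 land in clearly separated bins:
assert amount_token(500) < amount_token(50_000)
```

The bin index would then be embedded like any other token and combined with the date and wording tokens into the composite transaction representation.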
**Papers:**
- **Banking Transaction Flow** (2024) — [arXiv: 2410.08243](https://arxiv.org/abs/2410.08243) — Custom tokenizer for banking transactions; pre-trained models outperform prior art on transaction categorization (31 classes) and credit risk scoring
- **LBSF** (2024) — [arXiv: 2411.15056](https://arxiv.org/abs/2411.15056) — Long-term payment behavior sequence folding by merchant, with multi-field behavior encoding
- **Temporal Tokenization Strategies** (2025) — [arXiv: 2512.13618](https://arxiv.org/abs/2512.13618) — Systematic comparison of how to tokenize timestamps for event sequences. Key finding: log-based encoding works best for skewed financial data
- **FinTRec** (2025) — [arXiv: 2511.14865](https://arxiv.org/abs/2511.14865) — Transformer for long-range financial product recommendation with temporally heterogeneous context
- **TIMeSynC** (2024) — [arXiv: 2410.12825](https://arxiv.org/abs/2410.12825) — Encoder-decoder transformer for sequential intent prediction in financial services
### 4.4 Tabular Feature Tokenization
**Core idea:** Each row in a table can be serialized as a sequence of tokens, and each feature value can be encoded meaningfully (not just as a text fragment).
**Key methods:**
- **Relative Magnitude Tokenization (RMT):** Instead of tokenizing "$79.99" as text fragments, discretize it relative to the feature's distribution → "percentile_75" or "bin_high". This preserves ordinal relationships.
- **Intra-Feature Attention:** Bind each value token to its column name via attention, so the model knows "$79.99" means "price is $79.99", not just a number.
- **Serialization:** Convert rows to natural language: `"price: $79.99, category: Electronics, brand: Sony"` — surprisingly effective with large enough models.
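The RMT idea can be sketched with quantile bins over a feature's empirical distribution. The bin count and token naming (`rmt_tokens`, `price_bin_*`) are illustrative assumptions, not TP-BERTa's exact scheme:

```python
import numpy as np

def rmt_tokens(values, feature_name, n_bins=10):
    """Map each value to a quantile bin of its own feature's
    distribution, then emit a bin-label token. Preserves ordinal
    relationships that text fragments like '$', '79', '99' lose."""
    values = np.asarray(values, dtype=float)
    # Interior bin edges at the feature's empirical quantiles
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, values, side="right")
    return [f"{feature_name}_bin_{b}" for b in bins]

prices = [12.49, 44.95, 79.99, 349.99, 2499.00]
tokens = rmt_tokens(prices, "price")
# Cheapest price falls in the lowest bin, the most expensive in the highest
assert tokens[0] == "price_bin_0" and tokens[-1] == "price_bin_9"
```

Note that the bin vocabulary is per-feature: `price_bin_9` and a hypothetical `quantity_bin_9` are distinct tokens, which is exactly the binding the intra-feature attention mechanism reinforces.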
**Papers:**
- **TP-BERTa** (2024) — [arXiv: 2403.01841](https://arxiv.org/abs/2403.01841) — Relative Magnitude Tokenization + intra-feature attention. Competitive with XGBoost/LightGBM.
- **TabuLa-8B** (2024) — [arXiv: 2406.12031](https://arxiv.org/abs/2406.12031) — Llama 3-8B fine-tuned on serialized tabular data. Strong zero/few-shot. [GitHub 71⭐](https://github.com/mlfoundations/rtfm)
- **TabSTAR** (2025) — [arXiv: 2505.18125](https://arxiv.org/abs/2505.18125) — Foundation tabular model with semantically target-aware representations. [GitHub 83⭐](https://github.com/alanarazi7/TabSTAR). 112 upvotes.
- **UniTabE** (2023) — [arXiv: 2307.09249](https://arxiv.org/abs/2307.09249) — Universal pretraining protocol for tabular foundation models
- **TARTE** (2025) — [arXiv: 2505.14415](https://arxiv.org/abs/2505.14415) — Knowledge-enhanced tabular representations via pre-training on column names + table entries
- **TabICL** (2025) — [arXiv: 2502.05564](https://arxiv.org/abs/2502.05564) — Column-then-row attention, scales to 500K samples
- **Language Modeling on Tabular Data: A Survey** (2024) — [arXiv: 2408.10548](https://arxiv.org/abs/2408.10548) — Comprehensive survey. [GitHub 33⭐](https://github.com/lanxiang1017/language-modeling-on-tabular-data-survey)
### 4.5 Universal Modality Tokenization
**Core idea:** Project all modalities — including time series, tabular data, graphs — into a **shared discrete token space** and process them with a single Transformer.
**Papers:**
- **Meta-Transformer** (2023) — [arXiv: 2307.10802](https://arxiv.org/abs/2307.10802) — 12 modalities (text, image, audio, video, point cloud, **time series**, **tabular**, IMU, graph, etc.) via a unified tokenizer + frozen encoder. [GitHub 1652⭐](https://github.com/invictus717/MetaTransformer). 45 upvotes.
- **Emu3** (2024) — [arXiv: 2409.18869](https://arxiv.org/abs/2409.18869) — Next-token prediction is all you need across modalities. [GitHub 2400⭐](https://github.com/baaivision/emu3). 99 upvotes.
- **Unified-IO 2** (2023) — [arXiv: 2312.17172](https://arxiv.org/abs/2312.17172) — Images, text, audio, and actions in one autoregressive model. [GitHub 647⭐](https://github.com/allenai/unified-io-2). 30 upvotes.
- **NTP Multimodal Survey** (2024) — [arXiv: 2412.18619](https://arxiv.org/abs/2412.18619) — Comprehensive taxonomy of multimodal tokenization + NTP. [GitHub 478⭐](https://github.com/lmm101/awesome-multimodal-next-token-prediction). 59 upvotes.
- **LongCat-Next** (2025) — [arXiv: 2603.27538](https://arxiv.org/abs/2603.27538) — Lexicalizing modalities as discrete tokens. [GitHub 409⭐](https://github.com/meituan-longcat/LongCat-Next). 145 upvotes.
---
## 5. Key Papers: Detailed Analysis
### 5.1 TIGER — Semantic IDs for Generative Retrieval
**Full title:** "Recommender Systems with Generative Retrieval"
**Authors:** Shashank Rajput, Nikhil Mehta, Anima Singh, et al. (Google Research)
**Link:** [arXiv: 2305.05065](https://arxiv.org/abs/2305.05065) | [GitHub 781⭐](https://github.com/EdoardoBotta/RQ-VAE-Recommender)
**What it does:**
TIGER (Transformer Index for GEnerative Recommenders) replaces the traditional two-stage retrieve-and-rank pipeline with a single generative model. Each item is assigned a Semantic ID — a tuple of discrete codewords — and the model autoregressively generates the Semantic ID of the next item a user will interact with.
**Semantic ID generation process:**
```
Item features (title, description, ...)
→ Pre-trained text encoder (SentenceT5)
→ Dense embedding (768-dim)
→ Residual Quantization (RQ-VAE)
→ Semantic ID: (c1, c2, c3, ..., cK) # K codewords from K codebooks
```
**Residual Quantization (RQ):**
1. Quantize the embedding to the nearest codebook entry → c1
2. Compute the **residual** (difference between original and quantized)
3. Quantize the residual → c2
4. Repeat K times
This creates a **hierarchical** representation: c1 captures coarse semantics (category-level), c2 refines it, c3 further, etc.
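The K-step loop above can be sketched in a few lines of NumPy. Real RQ-VAE codebooks are learned jointly with the encoder; the random codebooks here only illustrate shapes and control flow, and the function name is an assumption:

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Residual quantization: at each level, pick the nearest codebook
    entry, record its index, and quantize the leftover residual next.
    Returns the code tuple, i.e. the Semantic ID."""
    residual = embedding.copy()
    codes = []
    for cb in codebooks:                       # cb: (num_codes, dim)
        idx = int(np.linalg.norm(cb - residual, axis=1).argmin())
        codes.append(idx)
        residual = residual - cb[idx]          # what level k+1 must explain
    return tuple(codes)

# Random codebooks for shape illustration only; TIGER learns them
# end-to-end inside the RQ-VAE.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 768)) for _ in range(3)]   # K = 3 levels
semantic_id = residual_quantize(rng.normal(size=768), codebooks)
```

With a 256-entry codebook at each of K=3 levels, the token space covers 256³ ≈ 16.7M distinct IDs while the model only ever sees a 256-way vocabulary per position.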
**Training:**
- Input: sequence of Semantic IDs representing a user's past interactions
- Target: Semantic ID of the next item
- Loss: cross-entropy at each code position
- Architecture: standard Transformer encoder-decoder
**Key results:**
- Outperforms SASRec, BERT4Rec, and dual-encoder baselines on Amazon datasets
- **Cold-start capability:** can recommend items never seen in training (because Semantic IDs generalize via shared prefixes)
- **Diversity:** beam search with temperature naturally produces diverse recommendations
**Relevance to domainTokenizer:** TIGER's Semantic ID is the canonical example of how to create a "word" for a non-textual entity. The RQ-VAE approach is directly applicable to any item-based domain.
---
### 5.2 ActionPiece — BPE for User Actions
**Full title:** "ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation"
**Authors:** Yupeng Hou, Jianmo Ni, Zhankui He, et al. (Google DeepMind)
**Link:** [arXiv: 2502.13581](https://arxiv.org/abs/2502.13581) | [GitHub 53⭐](https://github.com/google-deepmind/action_piece)
**What it does:**
ActionPiece is the first **context-aware** tokenizer for user action sequences. It applies the BPE principle — merging frequently co-occurring pairs — but on **sets of item features** rather than characters.
**Key innovation β€” actions as unordered feature sets:**
Instead of treating each item as an atomic ID, ActionPiece represents each user action as a set of features:
```
Action = {category: "Electronics", brand: "Sony", price_range: "$50-100", ...}
```
**Vocabulary construction (BPE-like):**
1. Start with base vocabulary = all individual features
2. Count co-occurrence of feature pairs:
- **Intra-action:** features within the same action (e.g., "Electronics" + "$50-100")
- **Inter-action:** features across adjacent actions (e.g., "Phone" in action t, "PhoneCase" in action t+1)
3. Merge the most frequent pair into a new composite token
4. Repeat until desired vocabulary size
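The construction loop above can be sketched as follows. This is a deliberate simplification: it counts intra-action pairs only, whereas ActionPiece additionally counts inter-action co-occurrence and handles set permutations (see SPR below); function and token names are assumptions:

```python
from collections import Counter
from itertools import combinations

def build_vocab(sequences, num_merges):
    """BPE-style vocabulary construction over feature sets: repeatedly
    merge the most frequent co-occurring feature pair into a composite
    token. `sequences` is a list of user histories; each action is a
    set of feature strings, mutated in place as merges apply."""
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq in sequences:
            for action in seq:
                for pair in combinations(sorted(action), 2):
                    pair_counts[pair] += 1
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]
        merged = f"{a}+{b}"
        merges.append((a, b, merged))
        # Replace the pair with the composite token wherever both occur
        for seq in sequences:
            for action in seq:
                if a in action and b in action:
                    action -= {a, b}
                    action.add(merged)
    return merges

seqs = [[{"Electronics", "$50-100", "Sony"}, {"Electronics", "$0-25"}],
        [{"Electronics", "$50-100"}, {"Books", "$25-50"}]]
merges = build_vocab(seqs, num_merges=1)
# The most frequent pair ("$50-100" with "Electronics") becomes one token.
```

Just as in text BPE, the merge table is the vocabulary: at inference time the same merges are replayed greedily over new action sequences.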
**Set Permutation Regularization (SPR):**
Because feature sets are unordered, the same action can be tokenized with different internal orderings. SPR produces multiple segmentations of the same sequence, acting as data augmentation and preventing the model from overfitting to arbitrary feature orderings.
**Key results:**
- Outperforms TIGER, SASRec, BERT4Rec on Amazon Sports, Beauty, and CDs datasets
- NDCG@10 improvements of 5–15% over TIGER
- The context-aware tokenization means the same item gets different tokens in different behavioral contexts
**Relevance to domainTokenizer:** ActionPiece is the most directly applicable template for building a domain tokenizer. Its BPE-like algorithm can be generalized to any domain where events are composed of multiple features.
---
### 5.3 Banking Transaction Flow — Transactions as Tokens
**Full title:** "Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow"
**Authors:** Cyrile Delestre, Yoann Sola
**Link:** [arXiv: 2410.08243](https://arxiv.org/abs/2410.08243)
**What it does:**
Designs a custom tokenizer for banking transactions — multimodal events consisting of (date, numerical amount, text wording) — and pre-trains Transformer and RNN models on large-scale transaction data.
**Tokenization scheme:**
1. **Date modality:** Converted to relative temporal features (days since last transaction, day of week)
2. **Amount modality:** Quantized into bins. The paper doesn't specify the exact binning, but refers to discretization that preserves order and magnitude.
3. **Wording modality:** Standard BPE tokenization on the text description (e.g., merchant names, transaction descriptions) after normalization (removing account numbers, dates from text, standardizing merchant names)
4. **Composite embedding:** Each modality's tokens are independently embedded, then combined via concatenation or learned projection into a single transaction-level representation
**Sequence construction:**
- Within each day: transactions sorted by ascending amount
- Across days: chronological order
- Special separator tokens between days
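The ordering rules above can be sketched as a small sorting-and-grouping pass. Plain `(date, amount)` tuples stand in for the full composite tokens, and the separator token name is an assumption:

```python
from itertools import groupby

def order_transactions(txns):
    """Order transactions as described above: chronological across days,
    ascending amount within each day, with a separator token between
    days. Each transaction here is a (date, amount) tuple."""
    txns = sorted(txns, key=lambda t: (t[0], t[1]))
    out = []
    for i, (day, group) in enumerate(groupby(txns, key=lambda t: t[0])):
        if i > 0:
            out.append("<day_sep>")   # hypothetical day-boundary token
        out.extend(group)
    return out

seq = order_transactions([("2025-03-15", 79.99), ("2025-04-01", 44.95),
                          ("2025-03-15", 12.49)])
# → day one in ascending-amount order, then the separator, then day two
```

Sorting within a day by amount gives the model a canonical order for same-day events, which would otherwise be an arbitrary tie-break.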
**Pre-training (self-supervised):**
- **Masked Transaction Prediction (MTP):** Mask entire transactions (not just subword tokens within a description), predict the masked transaction. This forces the model to learn cross-transaction patterns.
- Both RNN (BiLSTM-based, ELMo-style) and Transformer (BERT-style) pre-training explored
**Downstream tasks:**
- **Transaction categorization:** 31 classes (income, shopping, subscription, transport, savings, etc.). Fine-tuned pre-trained models beat all baselines.
- **Credit risk scoring:** Binary classification of default risk. Pre-trained models significantly outperform non-pre-trained approaches.
**Relevance to domainTokenizer:** This is the closest existing work to an e-commerce transaction tokenizer. The multimodal composite tokenization approach (date + amount + text) is directly applicable.
---
### 5.4 LETTER — Learnable Item Tokenization
**Full title:** "Learnable Item Tokenization for Generative Recommendation"
**Authors:** Wenjie Wang, Honghui Bao, et al.
**Link:** [arXiv: 2405.07314](https://arxiv.org/abs/2405.07314) | [GitHub 153⭐](https://github.com/honghuibao2000/letter)
**What it does:**
LETTER addresses three limitations of prior item tokenization methods:
1. **ID-based:** No semantic information, can't generalize to new items
2. **Text-based:** Lose collaborative signals (who bought what with what)
3. **Codebook-based (RQ-VAE):** Suffer from code assignment bias (popular items get all the good codes)
**LETTER's solution — a learnable tokenizer with three objectives:**
1. **Semantic regularization:** Tokenizer's codebook should respect semantic similarity (similar items → similar codes)
2. **Contrastive alignment:** Tokens should capture collaborative filtering signals (items bought together → nearby in token space)
3. **Diversity loss:** Prevent codebook collapse — ensure all codes are used, not just a few popular ones
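One way to operationalize a diversity objective is to penalize peaked codebook usage via the entropy of the empirical code distribution. This sketch is illustrative only; LETTER's exact formulation differs:

```python
import numpy as np

def diversity_penalty(code_ids, codebook_size):
    """KL divergence from the empirical code-usage distribution to the
    uniform distribution: log(codebook_size) minus the usage entropy.
    Zero when every code is used equally; maximal under collapse."""
    counts = np.bincount(code_ids, minlength=codebook_size).astype(float)
    probs = counts / counts.sum()
    nz = probs[probs > 0]
    entropy = -np.sum(nz * np.log(nz))
    return np.log(codebook_size) - entropy

uniform = diversity_penalty(np.arange(256), 256)        # every code used once
collapsed = diversity_penalty(np.zeros(256, int), 256)  # one code takes all
assert uniform < collapsed
```

Minimizing such a penalty pushes the tokenizer to spread items across the whole codebook, which is why long-tail items benefit most.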
**Architecture:**
- Uses Residual Quantized VAE (like TIGER) as the base tokenizer
- Adds the three losses above during tokenizer training
- The tokenizer is trained jointly with (or alternately with) the generative recommendation model
**Key results:**
- Outperforms TIGER, P5, and other generative recommendation baselines
- Particularly strong on long-tail items (items with few interactions) due to the diversity loss
**Relevance to domainTokenizer:** LETTER shows that **the tokenizer itself should be a learnable model** trained with domain-specific objectives, not just a fixed preprocessing step.
---
### 5.5 TP-BERTa — Numerical Value Tokenization
**Full title:** "Making Pre-trained Language Models Great on Tabular Prediction"
**Authors:** Jiahuan Yan, et al.
**Link:** [arXiv: 2403.01841](https://arxiv.org/abs/2403.01841)
**What it does:**
Solves the fundamental problem of representing **numerical feature values** as tokens. Standard text tokenizers fragment numbers meaninglessly. TP-BERTa introduces **Relative Magnitude Tokenization (RMT)**.
**Relative Magnitude Tokenization:**
Instead of tokenizing the raw number "$79.99" as text:
1. Compute the feature's distribution across the dataset
2. Express each value as its **relative position** in that distribution
3. Discretize into bins: "very_low", "low", "medium", "high", "very_high" (or finer)
4. The token is the bin label, which preserves ordinal relationships
Example:
```
price = $79.99
→ Within the "price" feature distribution, $79.99 is at the 73rd percentile
→ Token: "price_bin_73" or "price_high"
```
**Intra-Feature Attention:**
Each feature value is paired with its feature name:
```
"price" β†’ [price_name_embedding]
"$79.99" β†’ [price_value_embedding via RMT]
```
Intra-feature attention binds them, so the model knows this number means "price" not "quantity" or "weight".
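A toy stand-in for this binding fuses a feature-name embedding with a magnitude-bin embedding via a single attention step. This is a hypothetical simplification, not TP-BERTa's exact mechanism, and all names here are illustrative:

```python
import numpy as np

def intra_feature_fuse(name_vec, value_vec):
    """Let the value embedding attend over the [name, value] pair, so
    the fused vector depends on which feature the magnitude belongs to."""
    d = len(value_vec)
    keys = np.stack([name_vec, value_vec])       # (2, d)
    scores = keys @ value_vec / np.sqrt(d)       # scaled dot-product
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ keys                              # fused (d,) vector

rng = np.random.default_rng(0)
price_name, qty_name = rng.normal(size=16), rng.normal(size=16)
bin_73 = rng.normal(size=16)   # embedding of a shared magnitude bin
# The same magnitude bin yields different fused tokens under different names:
a = intra_feature_fuse(price_name, bin_73)
b = intra_feature_fuse(qty_name, bin_73)
assert not np.allclose(a, b)
```

The point of the sketch: bin tokens are reusable across features, and it is the name binding that disambiguates "73rd percentile of price" from "73rd percentile of quantity".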
**Key results:**
- TP-BERTa is competitive with XGBoost and LightGBM on standard tabular benchmarks
- Significantly outperforms other deep learning approaches on tabular data
- The pre-trained model transfers across different tables
**Relevance to domainTokenizer:** RMT solves the critical problem of numerical tokenization. Every domain tokenizer will need to handle numbers (prices, amounts, quantities, durations), and RMT is currently the best approach.
---
### 5.6 Meta-Transformer — 12 Modalities, One Token Space
**Full title:** "Meta-Transformer: A Unified Framework for Multimodal Learning"
**Authors:** Yiyuan Zhang, Kaixiong Gong, et al.
**Link:** [arXiv: 2307.10802](https://arxiv.org/abs/2307.10802) | [GitHub 1652⭐](https://github.com/invictus717/MetaTransformer)
**What it does:**
Demonstrates that a single frozen Transformer encoder can process 12 different modalities — including **time series** and **tabular data** — by projecting each modality into a shared token space via modality-specific tokenizers.
**Modality-specific tokenizers:**
- **Text:** standard embedding
- **Image:** patch embedding (ViT-style)
- **Audio:** spectrogram patches
- **Time series:** segment embedding (chop time series into fixed-length segments, project each to a token)
- **Tabular:** feature-wise embedding (each column value becomes a token)
- **Graph:** node feature embedding
- **Point cloud:** point group embedding
**Key insight:** The tokenizers are lightweight (small learnable projections), and the Transformer encoder is **frozen** — trained once and shared across all modalities. This means the bulk of the computation is modality-agnostic.
**Relevance to domainTokenizer:** Meta-Transformer proves the viability of the unified approach. A domain tokenizer could use a similar architecture: lightweight domain-specific tokenizers feeding into a shared Transformer backbone.
---
## 6. Tokenization Methods: A Technical Taxonomy
### 6.1 Quantization-Based (RQ-VAE, VQ-VAE)
**How it works:**
- Train a Vector Quantized Variational Autoencoder on item embeddings
- The encoder maps items to a continuous latent space
- The quantization layer maps each embedding to the nearest entry in a learned codebook
- **Residual Quantization (RQ):** apply quantization iteratively on residuals for multi-token representations
- The decoder reconstructs the original embedding from the quantized codes
**Strengths:**
- Produces hierarchically structured tokens (coarse-to-fine)
- Items with similar content naturally share token prefixes
- Controllable vocabulary size (codebook size × number of levels)
**Weaknesses:**
- Codebook collapse (some codes rarely used)
- Training instability (requires commitment loss, EMA updates, etc.)
- No collaborative signal unless explicitly added (see LETTER)
**Used by:** TIGER, LETTER, PRISM, MMGRec, MiniOneRec, GenRec
### 6.2 BPE-Inspired Merging
**How it works:**
- Start with atomic features as the base vocabulary
- Count co-occurrence frequencies of feature pairs in the corpus
- Merge the most frequent pair into a new composite token
- Repeat until desired vocabulary size
**Strengths:**
- Naturally discovers meaningful composite patterns
- Context-aware (merges depend on surrounding actions)
- Directly analogous to text BPE — well-understood properties
- No neural network training required for vocabulary construction
**Weaknesses:**
- Greedy algorithm: may not find a globally optimal vocabulary
- Requires careful handling of unordered feature sets (set permutation regularization)
- Vocabulary depends on corpus statistics, so it may not generalize under distribution shift
**Used by:** ActionPiece
### 6.3 Magnitude & Binning Approaches
**How it works:**
- For numerical values: compute distribution statistics, discretize into bins
- Options: uniform bins, quantile bins, logarithmic bins, adaptive bins
- For timestamps: calendar tokens (day-of-week, month, etc.) or relative encodings
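For quantile binning (one of the options listed), a minimal sketch; `fit_quantile_bins` and `to_bin_token` are hypothetical helper names:

```python
import numpy as np

def fit_quantile_bins(values, n_bins):
    """Interior bin edges so each bin holds roughly the same number of values."""
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(values, quantiles)

def to_bin_token(value, edges, prefix="price_bin"):
    """Map a raw value to a discrete bin token; values beyond the fitted
    range fall into the first or last bin."""
    return f"{prefix}_{int(np.searchsorted(edges, value, side='right'))}"
```

Because bin indices are monotone in the underlying value, the tokens preserve ordinal relationships; what is lost is the exact value within a bin.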
**Strengths:**
- Simple, interpretable, no training required
- Preserves ordinal relationships
- Handles numerical data natively (no text conversion)
**Weaknesses:**
- Fixed granularity (bin resolution)
- Information loss at bin boundaries
- Requires domain knowledge to choose binning strategy
**Used by:** TP-BERTa, Banking Transaction Flow, Temporal Tokenization Strategies
### 6.4 Learnable End-to-End Tokenizers
**How it works:**
- A neural network (encoder) maps raw domain data to discrete tokens
- The tokenizer is trained end-to-end with the downstream model
- Uses techniques like Gumbel-Softmax for differentiable discretization
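The Gumbel-Softmax relaxation mentioned above can be sketched framework-free; real learnable tokenizers typically use the straight-through variant inside an autodiff framework such as PyTorch:

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Sample a relaxed one-hot over codebook entries: add Gumbel noise,
    then apply a temperature-scaled softmax. Low tau pushes the output
    toward a hard one-hot choice (illustrative sketch)."""
    rng = rng or np.random.default_rng(0)
    gumbel = -np.log(-np.log(rng.uniform(size=np.shape(logits))))
    y = (np.asarray(logits) + gumbel) / tau
    y = y - y.max()  # numerical stability
    probs = np.exp(y)
    return probs / probs.sum()
```

During training the soft probabilities carry gradients back into the encoder; at inference the argmax gives the discrete token index.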
**Strengths:**
- Tokenizer adapts to the downstream task
- Can incorporate multiple objectives (semantic, collaborative, diversity)
- No manual design of tokenization rules
**Weaknesses:**
- More complex training (joint optimization)
- Risk of tokenizer-model co-adaptation (poor generalization)
- Harder to interpret what tokens mean
**Used by:** LETTER, UniGRec, ContRec, MANTa
### 6.5 Serialization-Based (Text Templates)
**How it works:**
- Convert each data record to a natural language string:
`"The customer bought Sony WH-1000XM5 headphones for $349.99 using a credit card on March 15, 2025."`
- Use a standard text tokenizer (BPE) on the serialized string
- Feed to a pre-trained LLM
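The serialization step is a simple template render; the template and field names below are illustrative, not taken from any particular paper:

```python
def serialize_transaction(row):
    """Flatten one structured record into a sentence for a text tokenizer."""
    return (
        f"The customer bought {row['product']} for ${row['price']:.2f} "
        f"using a {row['payment_method']} on {row['date']}."
    )
```

A BPE tokenizer then turns this single record into dozens of subword tokens, which is the token inefficiency flagged under Weaknesses.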
**Strengths:**
- Zero engineering: use off-the-shelf LLMs
- Benefits from LLM's pre-trained world knowledge
- Handles heterogeneous schemas easily
**Weaknesses:**
- Extremely token-inefficient (one row might become 100+ tokens)
- Numerical values still poorly handled by text tokenizers
- Requires large models to work well (no "small model" possibility)
- No exploitation of domain structure
**Used by:** TabuLa-8B, TabSTAR (partially), various LLM-for-tabular approaches
---
## 7. The domainTokenizer Blueprint: How to Build It
### 7.1 Architecture Design
Based on the research, domainTokenizer should have three components:
```
┌─────────────────────────────────────────────────┐
│                 domainTokenizer                 │
│                                                 │
│  ┌──────────────┐  ┌──────────────┐  ┌────────┐ │
│  │   Domain     │  │ Transformer  │  │  Task  │ │
│  │  Tokenizer   │──│   Backbone   │──│ Heads  │ │
│  │ (learnable)  │  │   (small)    │  │        │ │
│  └──────────────┘  └──────────────┘  └────────┘ │
│                                                 │
│  Tokenizer: Domain events → discrete tokens     │
│  Backbone:  Sequence modeling via attention     │
│  Heads:     Task-specific outputs               │
└─────────────────────────────────────────────────┘
```
**Domain Tokenizer (per-domain, learnable):**
- Handles the conversion of raw domain events into discrete tokens
- Combines multiple strategies: RQ-VAE for items, magnitude binning for numbers, BPE-like merging for feature compositions, calendar encoding for timestamps
- Small and fast (a few million parameters at most)
**Transformer Backbone (shared, small):**
- Standard causal or bidirectional Transformer
- Target sizes: 10M, 50M, 150M, 350M parameters
- Pre-trained on domain sequences with self-supervised objectives
- Potentially shareable across related domains
**Task Heads (per-task):**
- Classification head for fraud detection, churn prediction, etc.
- Generation head for next-event prediction, recommendation
- Regression head for value prediction (LTV, credit score, etc.)
### 7.2 Tokenizer Construction Pipeline
For a given domain (e.g., e-commerce), the tokenizer construction follows:
**Step 1: Schema Analysis**
```python
# Identify field types in the domain data
schema = {
"product_id": "categorical_entity", # β†’ Semantic ID via RQ-VAE
"category": "categorical_fixed", # β†’ direct vocabulary mapping
"price": "numerical_continuous", # β†’ magnitude binning (RMT)
"quantity": "numerical_discrete", # β†’ small fixed vocabulary
"timestamp": "temporal", # β†’ calendar + relative encoding
"description": "text", # β†’ standard BPE (subword)
"payment_method": "categorical_small", # β†’ direct mapping
"customer_id": "entity_id", # β†’ learned embedding or behavioral cluster
}
```
**Step 2: Per-Field Tokenization**
| Field Type | Method | Output |
|------------|--------|--------|
| Categorical entity (products) | RQ-VAE Semantic IDs | Tuple of K codebook indices |
| Categorical fixed (categories) | Direct vocab mapping | Single token index |
| Numerical continuous (prices) | Relative Magnitude Tokenization | Bin token |
| Temporal (timestamps) | Calendar tokens + relative delta | 2-3 tokens (day-of-week, time-of-day, delta) |
| Text (descriptions) | Standard BPE | Variable-length subword tokens |
| Entity ID (customers) | Behavioral clustering or learned embedding | Single token or short sequence |
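Steps 1 and 2 together suggest a simple dispatcher that routes each field through the tokenizer registered for its schema type; the registry keys and function names here are hypothetical:

```python
def tokenize_record(record, schema, tokenizers):
    """Tokenize one event by routing each field through the tokenizer
    registered for its schema type; returns a flat token list."""
    tokens = []
    for field, field_type in schema.items():
        if field in record:
            tokens.extend(tokenizers[field_type](record[field]))
    return tokens
```

Each per-field tokenizer returns a (possibly multi-token) list, so Semantic IDs and calendar encodings slot in without special-casing.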
**Step 3: Composite Token Construction (BPE-like)**
Following ActionPiece, apply a BPE-like merge algorithm on the composite per-field tokens to discover meaningful multi-field patterns:
```
Initial: [Electronics] [price_high] [CreditCard] [Weekday]
After merging: [Electronics+price_high] [CreditCard+Weekday]
Further: [HighEndElectronicsPurchase] [WeekdayCreditCard]
```
**Step 4: Special Tokens**
```
[SEP] - separates transactions in a sequence
[DAY_SEP] - separates days
[PAD] - padding
[MASK] - for masked pre-training
[CLS] - sequence-level representation
[UNK] - unknown/out-of-vocabulary events
```
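Assembled, a customer's history becomes one long sequence with the special tokens above delimiting events and days (sketch; per-event tokenization is assumed to have happened already):

```python
def assemble_sequence(days):
    """days: list of days, each a list of already-tokenized events.
    Joins events with [SEP] and days with [DAY_SEP], prepending [CLS]."""
    sequence = ["[CLS]"]
    for d, events in enumerate(days):
        if d > 0:
            sequence.append("[DAY_SEP]")
        for e, event_tokens in enumerate(events):
            if e > 0:
                sequence.append("[SEP]")
            sequence.extend(event_tokens)
    return sequence
```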
### 7.3 Pre-training Objectives
Based on the literature, the following self-supervised objectives are most effective:
**1. Masked Event Prediction (MEP), BERT-style**
- Mask 15% of complete events (not just individual tokens within an event)
- Predict all tokens of the masked event
- Forces the model to learn cross-event patterns
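Masking whole events, rather than individual tokens within an event, can be sketched as:

```python
import random

def mask_events(events, mask_token="[MASK]", p=0.15, rng=None):
    """Replace every token of a masked event; targets keep the originals
    (None for unmasked positions), so the loss covers whole events."""
    rng = rng or random.Random(0)
    masked, targets = [], []
    for event_tokens in events:
        if rng.random() < p:
            masked.append([mask_token] * len(event_tokens))
            targets.append(event_tokens)
        else:
            masked.append(event_tokens)
            targets.append(None)
    return masked, targets
```

Because no token of a masked event survives, the model cannot reconstruct the event from its own fields and must rely on the surrounding events.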
**2. Next Event Prediction (NEP), GPT-style**
- Given a sequence of events, predict the next event autoregressively
- Generate the event's token sequence (e.g., Semantic ID) token by token
- The primary objective for generative recommendation
**3. Contrastive Sequence Learning**
- Similar customer sequences should have similar representations
- Push apart sequences from different behavioral clusters
- Helps with customer segmentation and transfer learning
**4. Temporal Ordering**
- Given a shuffled sequence, predict the correct temporal order
- Forces the model to learn temporal patterns (seasonality, cadence, trends)
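Training examples for the ordering objective can be generated by shuffling a sequence and keeping the permutation as the label (sketch):

```python
import random

def make_ordering_example(events, rng=None):
    """Shuffle a sequence; the label perm satisfies shuffled[j] == events[perm[j]],
    so recovering the original temporal order means inverting perm."""
    rng = rng or random.Random(0)
    perm = list(range(len(events)))
    rng.shuffle(perm)
    shuffled = [events[i] for i in perm]
    return shuffled, perm
```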
### 7.4 Downstream Task Adaptation
Once pre-trained, the model can be fine-tuned for specific tasks:
| Task | Adaptation Method | Head |
|------|-------------------|------|
| **Next purchase prediction** | Continue NEP, decode Semantic IDs | Generative (autoregressive) |
| **Fraud detection** | Fine-tune on labeled transactions | Binary classifier on [CLS] |
| **Customer segmentation** | Extract [CLS] embeddings, cluster | No head (use embeddings) |
| **Churn prediction** | Fine-tune on labeled sequences | Binary classifier on [CLS] |
| **Credit scoring** | Fine-tune on labeled customer histories | Regression or classification |
| **Demand forecasting** | Adapt temporal patterns | Regression on quantity tokens |
| **Product recommendation** | NEP with Semantic ID decoding | Generative (beam search) |
---
## 8. Use Case Walkthrough: E-Commerce Transaction Model
### The Scenario
An e-commerce platform with:
- 2M customers
- 500K products
- 100M transactions over 2 years
- Each transaction: `(customer_id, product_id, category, price, quantity, timestamp, payment_method, device)`
### Step 1: Build the Tokenizer
**Product Semantic IDs:**
```python
# 1. Generate product embeddings from title + description
product_embeddings = sentence_encoder(product_titles + product_descriptions)  # 500K x 768
# 2. Train RQ-VAE with 4 codebooks of 256 entries each
rq_vae = ResidualQuantizedVAE(n_codebooks=4, codebook_size=256)
rq_vae.fit(product_embeddings)
# 3. Each product gets a 4-token Semantic ID
product_semantic_ids = rq_vae.encode(product_embeddings)  # 500K x 4
# e.g., Headphones -> [42, 187, 23, 91]
```
**Price Tokenization (RMT):**
```python
# Compute percentile bins
price_bins = compute_quantile_bins(all_prices, n_bins=50)
# $79.99 -> "price_bin_37" (37th percentile bin)
```
**Timestamp Tokenization:**
```python
from datetime import datetime

DAY_NAMES = ["monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday"]

def tokenize_timestamp(ts, prev_ts):
    # Calendar features + relative delta since the previous transaction
    time_of_day = ("morning" if ts.hour < 12
                   else "afternoon" if ts.hour < 18 else "evening")
    return [
        DAY_NAMES[ts.weekday()],               # e.g., "wednesday"
        time_of_day,                           # e.g., "afternoon"
        f"{(ts - prev_ts).days}_days_later",   # e.g., "2_days_later"
    ]
```
**Composite vocabulary construction (BPE-like):**
```python
# Run ActionPiece-style merging on the corpus of tokenized transaction sequences
vocabulary = actionpiece_vocab_construction(
corpus=all_tokenized_transactions,
target_vocab_size=8192,
consider_intra_event=True, # merge features within a transaction
consider_inter_event=True, # merge features across adjacent transactions
)
```
### Step 2: Pre-train
```python
# Tokenize all 100M transactions
tokenized_corpus = tokenize_all_transactions(transactions, tokenizer)
# Pre-train a small Transformer (150M params)
model = TransformerLM(
vocab_size=8192 + special_tokens,
d_model=768,
n_heads=12,
n_layers=12,
    max_seq_len=256,  # context window in tokens, covering a customer's recent transactions
)
# Self-supervised pre-training with MEP + NEP
train(model, tokenized_corpus, objectives=["masked_event", "next_event"])
```
### Step 3: Fine-tune & Deploy
```python
# Example: Fraud detection
fraud_model = add_classification_head(model, n_classes=2)
fine_tune(fraud_model, labeled_fraud_data)
# Example: Next purchase recommendation
rec_model = model # Use generative mode directly
next_item_semantic_id = rec_model.generate(customer_transaction_sequence)
next_item = rq_vae.decode(next_item_semantic_id) # Map back to product
```
---
## 9. Open Challenges and Research Gaps
### 9.1 Vocabulary Evolution
Products are added and removed constantly. Semantic IDs need to be recomputed, which may invalidate the model's learned associations. **Partial solutions:** periodic re-indexing (TIGER) and relying on content features that remain stable as the catalog changes.
### 9.2 Cross-Domain Transfer
Can a tokenizer trained on e-commerce data transfer to banking? The field-level tokenizers (RMT for numbers, calendar for dates) should transfer, but composite vocabularies are domain-specific. **Open question:** is there a "universal domain tokenizer" or will each domain need its own?
### 9.3 Numerical Precision
All current methods lose some numerical precision through discretization. For applications where exact values matter (financial auditing, pricing optimization), this is a limitation. **Potential solution:** hybrid approaches that combine discrete tokens with continuous residuals.
### 9.4 Handling Missing Data
Real business data is full of missing values. Text tokenizers never face this issue. Domain tokenizers need explicit strategies: [MISSING] tokens, imputation, or learning to model missingness as a signal.
### 9.5 Privacy & Fairness
Tokenizing customer behavior raises privacy concerns. Semantic IDs could encode sensitive attributes (demographic patterns, financial status) in ways that are hard to audit. Domain tokenizers should be designed with fairness constraints.
### 9.6 Scalability of BPE-Like Merging
ActionPiece's vocabulary construction is O(N × V) per merge step. For very large corpora (billions of events) and feature spaces (thousands of features), this may become prohibitively expensive. **Potential solution:** approximate counting, hierarchical merging, or neural vocabulary construction.
### 9.7 Evaluation Standards
There are no standard benchmarks for "domain tokenization quality." Text tokenizers can be evaluated by compression ratio and downstream perplexity. Domain tokenizers need domain-specific metrics: recommendation quality, prediction accuracy, calibration, etc.
### 9.8 Connection to Continual Learning
The HOPE / Nested Learning paradigm (see Section 11) suggests that models should continuously learn from new data. Domain tokenizers that can incrementally update their vocabularies (adding new product tokens, retiring obsolete ones) without full retraining would be highly valuable.
---
## 10. Complete Paper Reference Table
| # | Paper | Year | ArXiv | Domain | Key Contribution | GitHub |
|---|-------|------|-------|--------|-----------------|--------|
| 1 | **TIGER** | 2023 | [2305.05065](https://arxiv.org/abs/2305.05065) | Recommendation | Semantic IDs via RQ-VAE for generative retrieval | [781⭐](https://github.com/EdoardoBotta/RQ-VAE-Recommender) |
| 2 | **Semantic IDs (YouTube)** | 2023 | [2306.08121](https://arxiv.org/abs/2306.08121) | Recommendation | Content-derived IDs at industry scale | — |
| 3 | **ActionPiece** | 2025 | [2502.13581](https://arxiv.org/abs/2502.13581) | Recommendation | BPE-like context-aware action tokenization | [53⭐](https://github.com/google-deepmind/action_piece) |
| 4 | **LETTER** | 2024 | [2405.07314](https://arxiv.org/abs/2405.07314) | Recommendation | Learnable tokenizer with semantic+collaborative+diversity | [153⭐](https://github.com/honghuibao2000/letter) |
| 5 | **SETRec** | 2025 | [2502.10833](https://arxiv.org/abs/2502.10833) | Recommendation | Order-agnostic set identifiers | — |
| 6 | **ContRec** | 2025 | [2504.12007](https://arxiv.org/abs/2504.12007) | Recommendation | Continuous tokens via sigma-VAE + diffusion | — |
| 7 | **GenRec** | 2026 | [2604.14878](https://arxiv.org/abs/2604.14878) | Recommendation | Page-wise NTP for large-scale recommendation | — |
| 8 | **MBGen** | 2024 | [2405.16871](https://arxiv.org/abs/2405.16871) | Recommendation | Multi-behavior (view/click/buy) as token types | [57⭐](https://github.com/anananan116/MBGen) |
| 9 | **RSLLM** | 2024 | [2412.16933](https://arxiv.org/abs/2412.16933) | Recommendation | Recommendation as a new language in LLMs | — |
| 10 | **PRISM** | 2025 | [2601.16556](https://arxiv.org/abs/2601.16556) | Recommendation | Purified quantization for semantic tokenization | — |
| 11 | **MMGRec** | 2024 | [2404.16555](https://arxiv.org/abs/2404.16555) | Recommendation | Graph RQ-VAE for multimodal items | — |
| 12 | **UniGRec** | 2025 | [2601.17438](https://arxiv.org/abs/2601.17438) | Recommendation | Soft item identifiers for end-to-end optimization | — |
| 13 | **Semantic IDs for Search+Rec** | 2025 | [2508.10478](https://arxiv.org/abs/2508.10478) | Recommendation | Joint search and recommendation Semantic IDs | — |
| 14 | **Banking Transaction Flow** | 2024 | [2410.08243](https://arxiv.org/abs/2410.08243) | Finance | Composite tokenizer for (date, amount, text) transactions | — |
| 15 | **LBSF** | 2024 | [2411.15056](https://arxiv.org/abs/2411.15056) | Finance | Long-term payment behavior folding by merchant | — |
| 16 | **Temporal Tokenization** | 2025 | [2512.13618](https://arxiv.org/abs/2512.13618) | Events | Systematic comparison of temporal tokenization strategies | — |
| 17 | **FinTRec** | 2025 | [2511.14865](https://arxiv.org/abs/2511.14865) | Finance | Transformer for long-range financial recommendation | — |
| 18 | **TIMeSynC** | 2024 | [2410.12825](https://arxiv.org/abs/2410.12825) | Finance | Temporal intent prediction in financial services | — |
| 19 | **TP-BERTa** | 2024 | [2403.01841](https://arxiv.org/abs/2403.01841) | Tabular | Relative Magnitude Tokenization for numbers | — |
| 20 | **TabuLa-8B** | 2024 | [2406.12031](https://arxiv.org/abs/2406.12031) | Tabular | Llama 3 fine-tuned on serialized tables | [71⭐](https://github.com/mlfoundations/rtfm) |
| 21 | **TabSTAR** | 2025 | [2505.18125](https://arxiv.org/abs/2505.18125) | Tabular | Semantically target-aware tabular foundation model | [83⭐](https://github.com/alanarazi7/TabSTAR) |
| 22 | **UniTabE** | 2023 | [2307.09249](https://arxiv.org/abs/2307.09249) | Tabular | Universal tabular pretraining protocol | — |
| 23 | **TARTE** | 2025 | [2505.14415](https://arxiv.org/abs/2505.14415) | Tabular | Knowledge-enhanced tabular representations | — |
| 24 | **TabICL** | 2025 | [2502.05564](https://arxiv.org/abs/2502.05564) | Tabular | Column-then-row attention, scales to 500K samples | — |
| 25 | **Meta-Transformer** | 2023 | [2307.10802](https://arxiv.org/abs/2307.10802) | Universal | 12 modalities in one token space | [1652⭐](https://github.com/invictus717/MetaTransformer) |
| 26 | **Emu3** | 2024 | [2409.18869](https://arxiv.org/abs/2409.18869) | Universal | NTP is all you need across modalities | [2400⭐](https://github.com/baaivision/emu3) |
| 27 | **Unified-IO 2** | 2023 | [2312.17172](https://arxiv.org/abs/2312.17172) | Universal | Image+text+audio+action in one model | [647⭐](https://github.com/allenai/unified-io-2) |
| 28 | **NTP Multimodal Survey** | 2024 | [2412.18619](https://arxiv.org/abs/2412.18619) | Survey | Taxonomy of multimodal tokenization + NTP | [478⭐](https://github.com/lmm101/awesome-multimodal-next-token-prediction) |
| 29 | **LongCat-Next** | 2025 | [2603.27538](https://arxiv.org/abs/2603.27538) | Universal | Lexicalizing modalities as discrete tokens | [409⭐](https://github.com/meituan-longcat/LongCat-Next) |
| 30 | **Tabular Data Survey** | 2024 | [2408.10548](https://arxiv.org/abs/2408.10548) | Survey | Comprehensive survey of LMs for tabular data | [33⭐](https://github.com/lanxiang1017/language-modeling-on-tabular-data-survey) |
| 31 | **KL3M Tokenizers** | 2025 | [2503.17247](https://arxiv.org/abs/2503.17247) | Legal/Finance | Domain-specific BPE for professional text | [GitHub](https://github.com/alea-institute/kl3m-tokenizer-paper) |
---
## 11. Related Concepts: Nested Learning & Continual Adaptation
An important related development is the **Nested Learning** paradigm introduced by Google Research ([arXiv: 2512.24695](https://arxiv.org/abs/2512.24695), by Ali Behrouz et al.), which presents the **HOPE** architecture.
### Why Nested Learning Matters for Domain Tokenization
Current Transformer-based models are "frozen" after pre-training: they cannot incorporate new knowledge without retraining. For domain tokenization, this means:
- A recommendation model can't learn about new products added after training
- A fraud detection model can't adapt to new fraud patterns in real-time
- A customer model can't update its understanding of a customer's evolving preferences
The HOPE architecture addresses this via:
1. **Continuum Memory System (CMS):** Multiple MLP blocks updating at different frequencies. Some update every few tokens (catching immediate patterns); others update only after millions of tokens (storing persistent knowledge). This prevents catastrophic forgetting.
2. **Self-Modifying Titans:** The model's projection layers update themselves in real-time based on incoming data, enabling continuous adaptation.
**For domainTokenizer, the implication is:** a domain model built with Nested Learning principles could continuously learn from new transactions, adapting its understanding of products, customer preferences, and behavioral patterns without retraining from scratch.
This is an area of active exploration for future versions of domainTokenizer.
For the full research report on Nested Learning, see the [HOPE / Nested Learning discussion on HF Papers](https://huggingface.co/papers/2512.24695).
---
*This report is a living document and will be updated as the domainTokenizer project evolves.*