
Domain Tokenization: Beyond Words - A Research Report

Building small models that understand domain tokens, not just words.

Last updated: April 2026


Table of Contents

  1. Executive Summary
  2. The Problem: Why Words Are Not Enough
  3. The Core Insight: Anything Can Be a Token
  4. Research Landscape: Five Paradigms of Domain Tokenization
  5. Key Papers: Detailed Analysis
  6. Tokenization Methods: A Technical Taxonomy
  7. The domainTokenizer Blueprint: How to Build It
  8. Use Case Walkthrough: E-Commerce Transaction Model
  9. Open Challenges and Research Gaps
  10. Complete Paper Reference Table
  11. Related Concepts: Nested Learning & Continual Adaptation

1. Executive Summary

Large Language Models (LLMs) process text by breaking it into tokens: subword units learned via algorithms like BPE (Byte-Pair Encoding). This tokenization is the foundation that allows Transformers to model sequential patterns via next-token prediction.

But words are just one type of sequential data. Businesses generate vast amounts of non-textual sequential data every day:

  • E-commerce: millions of purchase transactions, each with product IDs, amounts, timestamps, categories
  • Banking: transaction flows with dates, amounts, merchant codes, and descriptions
  • Healthcare: sequences of diagnoses, procedures, lab results, medications
  • Advertising: impression → click → conversion funnels with bid amounts and user features
  • Logistics: shipping events, warehouse movements, delivery status sequences

The central question this project explores: Can we build tokenizers that encode these domain-specific entities (products, transactions, medical codes, user actions) as first-class tokens, and then train small, efficient Transformer models that understand domain patterns the way LLMs understand language?

The answer from recent research is a resounding yes. This report surveys 25+ papers spanning 2021-2026 that collectively establish a new paradigm: domain tokenization. The key findings are:

  1. Semantic IDs (Google, 2023): Products can be encoded as tuples of discrete tokens derived from their content embeddings via quantization (RQ-VAE). A Transformer trained on sequences of these Semantic IDs outperforms traditional recommendation systems and generalizes to unseen items.

  2. Action tokenization (Google DeepMind, 2025): User action sequences can be tokenized using a BPE-like algorithm that merges frequently co-occurring features, applying the same algorithm that powers text tokenization to business events instead of characters.

  3. Transaction tokenization (2024): Banking transactions, which are multimodal events of (date, amount, text), can be encoded as composite tokens and modeled with self-supervised pre-training, achieving state-of-the-art results on transaction categorization and credit risk scoring.

  4. Tabular tokenization (2024–2025): Individual feature values (numerical, categorical) can be tokenized via relative magnitude encoding or serialization, enabling foundation models that transfer across different tabular datasets.

  5. Universal tokenization (2023–2024): Frameworks like Meta-Transformer demonstrate that 12+ modalities including time series and tabular data can be projected into a shared token space and processed by a single frozen Transformer.

This report details each paradigm, provides technical depth on the tokenization methods, and lays out a concrete blueprint for building domainTokenizer.


2. The Problem: Why Words Are Not Enough

2.1 The Mismatch Between Business Data and Text Tokens

When an e-commerce platform processes a customer's purchase history, the raw data looks like:

customer_42 | 2025-03-15 | SKU-8847291 | Electronics > Headphones | $79.99 | Credit Card | qty: 1
customer_42 | 2025-03-15 | SKU-3321098 | Electronics > Cables    | $12.49 | Credit Card | qty: 2
customer_42 | 2025-04-01 | SKU-5519273 | Books > Technical       | $44.95 | Debit Card  | qty: 1

If you feed this to a standard LLM tokenizer (e.g., OpenAI's cl100k_base, the encoding used by GPT-4), you get the following (the short sketch after this list reproduces it):

  • SKU-8847291 → split into meaningless subword fragments like SK, U-, 884, 72, 91
  • $79.99 → tokenized as $, 79, ., 99, losing the semantic meaning of "a mid-range purchase"
  • 2025-03-15 → fragmented into date components with no temporal understanding
  • The relationships between fields (this amount goes with this product in this category) are lost in a flat token stream
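
A quick way to see the fragmentation for yourself is the minimal sketch below; it assumes the tiktoken package is installed, and the exact token splits will vary with the encoding version:

# Minimal sketch: inspect how a general-purpose text tokenizer fragments one transaction row.
# Assumes the tiktoken package; exact splits depend on the encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
row = "customer_42 | 2025-03-15 | SKU-8847291 | Electronics > Headphones | $79.99 | Credit Card | qty: 1"

token_ids = enc.encode(row)
pieces = [enc.decode([t]) for t in token_ids]
print(len(token_ids), "tokens:", pieces)  # one business event becomes dozens of text fragments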

The fundamental problem: text tokenizers are optimized for the statistical structure of natural language. They know that ing and tion are common suffixes, that the is frequent, that un- is a prefix. They know nothing about:

  • Product similarity (headphones and earbuds are related)
  • Price ranges ($79.99 is "mid-range electronics" vs. $2,499 is "premium")
  • Temporal patterns (weekly vs. monthly purchase cadence)
  • Cross-field interactions (buying a cable right after headphones = accessory purchase)

2.2 The Opportunity: Domain Structure is Richer Than Language

Business domains have structure that goes beyond what text captures:

| Dimension | Language | Business Domain |
|---|---|---|
| Vocabulary | ~50K-256K subwords | Millions of SKUs, thousands of categories |
| Sequence meaning | Word order determines syntax | Temporal order determines behavioral patterns |
| Similarity | Semantic (synonyms, paraphrases) | Collaborative (users who buy X also buy Y) |
| Numerical values | Rare, incidental | Central (prices, quantities, timestamps) |
| Compositionality | Words compose into sentences | Features compose into events/transactions |
| Temporal dynamics | Mostly static semantics | Evolving trends, seasonal patterns |

A domain tokenizer should exploit all of this structure.

2.3 Why Small Models?

This project focuses on small models (tens of millions to low billions of parameters) because:

  1. Domain data is structured β€” you don't need 70B parameters to learn that "users who buy phones often buy cases." The pattern space is narrower than open-domain language.
  2. Latency matters β€” production systems need real-time inference (fraud detection, recommendations, pricing).
  3. Data efficiency β€” most businesses have millions, not trillions, of training examples.
  4. Cost β€” training and serving small models is orders of magnitude cheaper.
  5. Interpretability β€” smaller models with domain-specific tokens are more auditable than black-box LLMs.

3. The Core Insight: Anything Can Be a Token

The survey "Next Token Prediction Towards Multimodal Intelligence" (arXiv: 2412.18619, 59 upvotes) formalizes this principle:

Next-Token Prediction (NTP) is a universal training objective that works across modalities. The bottleneck is not the model architecture; it is tokenization: how you map domain entities into discrete token spaces.

This means the entire LLM machinery (attention, scaling laws, in-context learning, transfer learning) becomes available for any domain once you solve the tokenization problem.

The precedent is clear across modalities:

| Modality | How It's Tokenized | Key Paper |
|---|---|---|
| Text | BPE / WordPiece / SentencePiece | GPT, BERT, Llama |
| Images | VQ-VAE, patch embeddings | DALL-E, ViT |
| Audio | Spectral codecs (EnCodec) | AudioLM, Whisper |
| Video | 3D causal VAE | HiTVideo, Emu3 |
| Robotics actions | Discrete Cosine Transform | FAST (2501.09747) |
| Products/Items | Semantic IDs via RQ-VAE | TIGER |
| User actions | BPE on feature sets | ActionPiece |
| Transactions | Composite (date+amount+text) | Banking TF |
| Tabular features | Magnitude binning, serialization | TP-BERTa, TabuLa |
| Time series | Scalar quantization, symbolic discretization | TokenCast, LLMTime |

The bottom half of this table (the business-domain entries) is where domainTokenizer operates.


4. Research Landscape: Five Paradigms of Domain Tokenization

4.1 Semantic ID Tokenization (Products & Items)

Core idea: Encode each item (product, video, song, article) as a sequence of discrete semantic tokens derived from its content features.

How it works:

  1. Extract a dense embedding from item features (e.g., product title + description → SentenceT5 → 768-dim vector)
  2. Apply Residual Quantization (RQ-VAE): iteratively quantize the embedding into a sequence of codebook indices
  3. The resulting tuple (c1, c2, c3, ...) is the item's Semantic ID, its "word" in the domain language
  4. Train a Transformer to predict sequences of these Semantic IDs

Key property: Items with similar content share token prefixes, creating a hierarchical semantic structure:

Headphones A:  [Audio, 23, 7, 41]
Headphones B:  [Audio, 23, 7, 55]    ← shares 3/4 prefix tokens
Laptop C:      [Computing, 8, 31, 12] ← completely different tokens

Papers:

  • TIGER (Google, 2023) - arXiv: 2305.05065 - The landmark paper introducing Semantic IDs for recommendation. GitHub 781⭐
  • Semantic IDs at YouTube (Google, 2023) - arXiv: 2306.08121 - Deployed at industry scale, replacing random IDs
  • PRISM (2025) - arXiv: 2601.16556 - Purified quantization for better semantic tokenization
  • MMGRec (2024) - arXiv: 2404.16555 - Graph RQ-VAE incorporating multimodal item features
  • Semantic IDs for Joint Search & Rec (2025) - arXiv: 2508.10478 - Unified Semantic IDs across search and recommendation

4.2 Action Sequence Tokenization (User Behaviors)

Core idea: Don't just tokenize individual items; tokenize the entire action sequence, where each action is a composite event with multiple features.

How it works:

  1. Represent each user action as an unordered set of features: {category: Electronics, price_bin: $50-100, brand: Sony, payment: Credit}
  2. Apply a BPE-like vocabulary construction algorithm that merges frequently co-occurring feature patterns:
    • Count co-occurrence of feature pairs both within actions and across adjacent actions
    • Merge the most frequent pair into a new token
    • Repeat until desired vocabulary size is reached
  3. The same action can be tokenized differently depending on surrounding context

Key insight (from ActionPiece): Just as BPE discovers that t + h + e should be merged into a single the token in English, the action tokenizer discovers that {Electronics, $50-100} should be merged into a single composite token because they co-occur frequently in purchase sequences.

Papers:

  • ActionPiece (Google DeepMind, 2025) - arXiv: 2502.13581 - First context-aware action sequence tokenizer. GitHub 53⭐
  • MBGen (2024) - arXiv: 2405.16871 - Multi-behavior generative recommendation (view, click, purchase as different token types). GitHub 57⭐
  • SETRec (2025) - arXiv: 2502.10833 - Order-agnostic set identifiers integrating collaborative + semantic signals
  • ContRec (2025) - arXiv: 2504.12007 - Continuous tokens via sigma-VAE + diffusion

4.3 Financial Transaction Tokenization

Core idea: Banking/financial transactions are multimodal sequential events (date + amount + description). Design a composite tokenizer that encodes all three modalities jointly.

How it works (from Banking Transaction Flow paper):

  1. Date tokenization: Convert to day-of-week + relative time since last transaction
  2. Amount tokenization: Quantize into logarithmic bins, which capture the difference between $5 and $500 better than linear bins (see the sketch after this list)
  3. Wording tokenization: Standard BPE on the transaction description text (e.g., "AMAZON MARKETPLACE" → subword tokens)
  4. Composite token: Combine date + amount + wording tokens into a single transaction representation
  5. Sequence ordering: Within each day, sort transactions by ascending amount; across days, chronological order
  6. Pre-train with masked transaction prediction (mask entire transactions, not just subwords)
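
One plausible implementation of the logarithmic amount bins in step 2 is sketched below; the paper does not publish exact bin settings, so the bin count and amount range here are assumptions:

import numpy as np

def log_amount_token(amount: float, n_bins: int = 32,
                     min_amount: float = 1.0, max_amount: float = 100_000.0) -> str:
    """Map a transaction amount to a log-spaced bin token (assumed bin settings)."""
    edges = np.logspace(np.log10(min_amount), np.log10(max_amount), n_bins + 1)
    idx = int(np.clip(np.digitize(amount, edges) - 1, 0, n_bins - 1))
    return f"amount_bin_{idx}"

print(log_amount_token(5.00))    # a small purchase lands in a low bin
print(log_amount_token(500.00))  # two orders of magnitude higher lands in a clearly separated bin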

Papers:

  • Banking Transaction Flow (2024) - arXiv: 2410.08243 - Custom tokenizer for banking transactions; pre-trained models outperform prior art on transaction categorization (31 classes) and credit risk scoring
  • LBSF (2024) - arXiv: 2411.15056 - Long-term payment behavior sequence folding by merchant, with multi-field behavior encoding
  • Temporal Tokenization Strategies (2025) - arXiv: 2512.13618 - Systematic comparison of how to tokenize timestamps for event sequences. Key finding: log-based encoding works best for skewed financial data
  • FinTRec (2025) - arXiv: 2511.14865 - Transformer for long-range financial product recommendation with temporally heterogeneous context
  • TIMeSynC (2024) - arXiv: 2410.12825 - Encoder-decoder transformer for sequential intent prediction in financial services

4.4 Tabular Feature Tokenization

Core idea: Each row in a table can be serialized as a sequence of tokens, and each feature value can be encoded meaningfully (not just as a text fragment).

Key methods:

  • Relative Magnitude Tokenization (RMT): Instead of tokenizing "$79.99" as text fragments, discretize it relative to the feature's distribution → "percentile_75" or "bin_high". This preserves ordinal relationships.
  • Intra-Feature Attention: Bind each value token to its column name via attention, so the model knows "$79.99" means "price is $79.99", not just a number.
  • Serialization: Convert rows to natural language: "price: $79.99, category: Electronics, brand: Sony", which is surprisingly effective with large enough models.

Papers:

  • TP-BERTa (2024) - arXiv: 2403.01841 - Relative Magnitude Tokenization + intra-feature attention. Competitive with XGBoost/LightGBM.
  • TabuLa-8B (2024) - arXiv: 2406.12031 - Llama 3-8B fine-tuned on serialized tabular data. Strong zero/few-shot. GitHub 71⭐
  • TabSTAR (2025) - arXiv: 2505.18125 - Foundation tabular model with semantically target-aware representations. GitHub 83⭐. 112 upvotes.
  • UniTabE (2023) - arXiv: 2307.09249 - Universal pretraining protocol for tabular foundation models
  • TARTE (2025) - arXiv: 2505.14415 - Knowledge-enhanced tabular representations via pre-training on column names + table entries
  • TabICL (2025) - arXiv: 2502.05564 - Column-then-row attention, scales to 500K samples
  • Language Modeling on Tabular Data: A Survey (2024) - arXiv: 2408.10548 - Comprehensive survey. GitHub 33⭐

4.5 Universal Modality Tokenization

Core idea: Project all modalities, including time series, tabular data, and graphs, into a shared discrete token space and process them with a single Transformer.

Papers:

  • Meta-Transformer (2023) - arXiv: 2307.10802 - 12 modalities projected into one shared token space. GitHub 1652⭐
  • Emu3 (2024) - arXiv: 2409.18869 - Next-token prediction across modalities. GitHub 2400⭐
  • Unified-IO 2 (2023) - arXiv: 2312.17172 - Image, text, audio, and action in one model. GitHub 647⭐
  • Next Token Prediction Towards Multimodal Intelligence (2024) - arXiv: 2412.18619 - Survey and taxonomy of multimodal tokenization. GitHub 478⭐
  • LongCat-Next (2025) - arXiv: 2603.27538 - Lexicalizing modalities as discrete tokens. GitHub 409⭐

5. Key Papers: Detailed Analysis

5.1 TIGER: Semantic IDs for Generative Retrieval

Full title: "Recommender Systems with Generative Retrieval" Authors: Shashank Rajput, Nikhil Mehta, Anima Singh, et al. (Google Research) Link: arXiv: 2305.05065 | GitHub 781⭐

What it does: TIGER (Transformer Index for GEnerative Recommenders) replaces the traditional two-stage retrieve-and-rank pipeline with a single generative model. Each item is assigned a Semantic ID, a tuple of discrete codewords, and the model autoregressively generates the Semantic ID of the next item a user will interact with.

Semantic ID generation process:

Item features (title, description, ...) 
    → Pre-trained text encoder (SentenceT5)
    → Dense embedding (768-dim)
    → Residual Quantization (RQ-VAE)
    → Semantic ID: (c1, c2, c3, ..., cK)    # K codewords from K codebooks

Residual Quantization (RQ):

  1. Quantize the embedding to the nearest codebook entry → c1
  2. Compute the residual (difference between original and quantized)
  3. Quantize the residual → c2
  4. Repeat K times

This creates a hierarchical representation: c1 captures coarse semantics (category-level), c2 refines it, c3 further, etc.
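
A minimal sketch of the encode path is shown below, assuming the K codebooks have already been learned (e.g., by RQ-VAE training); it only illustrates how one embedding becomes a Semantic ID:

import numpy as np

def rq_encode(embedding: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Encode one item embedding into a Semantic ID (one code index per codebook level)."""
    residual = embedding.copy()
    semantic_id = []
    for codebook in codebooks:                    # codebook shape: (codebook_size, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        code = int(np.argmin(dists))              # nearest codebook entry
        semantic_id.append(code)
        residual = residual - codebook[code]      # quantize what remains at the next level
    return semantic_id

# Toy usage with random codebooks; in practice they come from RQ-VAE training
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 768)) for _ in range(4)]
print(rq_encode(rng.normal(size=768), codebooks))  # a 4-code Semantic ID, e.g. [17, 203, 88, 54]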

Training:

  • Input: sequence of Semantic IDs representing a user's past interactions
  • Target: Semantic ID of the next item
  • Loss: cross-entropy at each code position
  • Architecture: standard Transformer encoder-decoder

Key results:

  • Outperforms SASRec, BERT4Rec, and dual-encoder baselines on Amazon datasets
  • Cold-start capability: can recommend items never seen in training (because Semantic IDs generalize via shared prefixes)
  • Diversity: beam search with temperature naturally produces diverse recommendations

Relevance to domainTokenizer: TIGER's Semantic ID is the canonical example of how to create a "word" for a non-textual entity. The RQ-VAE approach is directly applicable to any item-based domain.


5.2 ActionPiece: BPE for User Actions

Full title: "ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation" Authors: Yupeng Hou, Jianmo Ni, Zhankui He, et al. (Google DeepMind) Link: arXiv: 2502.13581 | GitHub 53⭐

What it does: ActionPiece is the first context-aware tokenizer for user action sequences. It applies the BPE principle of merging frequently co-occurring pairs, but on sets of item features rather than characters.

Key innovation (actions as unordered feature sets): Instead of treating each item as an atomic ID, ActionPiece represents each user action as a set of features:

Action = {category: "Electronics", brand: "Sony", price_range: "$50-100", ...}

Vocabulary construction (BPE-like):

  1. Start with base vocabulary = all individual features
  2. Count co-occurrence of feature pairs:
    • Intra-action: features within the same action (e.g., "Electronics" + "$50-100")
    • Inter-action: features across adjacent actions (e.g., "Phone" in action t, "PhoneCase" in action t+1)
  3. Merge the most frequent pair into a new composite token
  4. Repeat until desired vocabulary size
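
A simplified sketch of one merge round is shown below, counting intra-action and adjacent-action feature pairs; the actual ActionPiece algorithm is more involved (weighted counts, set permutations), so this only illustrates the core loop:

from collections import Counter
from itertools import combinations

def one_merge_round(sequences: list[list[set[str]]]) -> tuple[str, str]:
    """Return the most frequent feature pair in a corpus of action sequences (candidate for merging).
    Each sequence is a list of actions; each action is a set of feature tokens."""
    counts = Counter()
    for seq in sequences:
        for action in seq:                                    # intra-action pairs
            counts.update(tuple(sorted(p)) for p in combinations(action, 2))
        for prev, nxt in zip(seq, seq[1:]):                   # inter-action pairs (adjacent actions)
            counts.update(tuple(sorted((a, b))) for a in prev for b in nxt)
    return counts.most_common(1)[0][0]

corpus = [
    [{"Electronics", "$50-100", "CreditCard"}, {"Cables", "$0-25", "CreditCard"}],
    [{"Electronics", "$50-100", "DebitCard"}, {"Books", "$25-50", "DebitCard"}],
]
print(one_merge_round(corpus))  # e.g., ('$50-100', 'Electronics'), which would become a new composite token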

Set Permutation Regularization (SPR): Because feature sets are unordered, the same action can be tokenized with different internal orderings. SPR produces multiple segmentations of the same sequence, acting as data augmentation and preventing the model from overfitting to arbitrary feature orderings.

Key results:

  • Outperforms TIGER, SASRec, BERT4Rec on Amazon Sports, Beauty, and CDs datasets
  • NDCG@10 improvements of 5-15% over TIGER
  • The context-aware tokenization means the same item gets different tokens in different behavioral contexts

Relevance to domainTokenizer: ActionPiece is the most directly applicable template for building a domain tokenizer. Its BPE-like algorithm can be generalized to any domain where events are composed of multiple features.


5.3 Banking Transaction Flow: Transactions as Tokens

Full title: "Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow" Authors: Cyrile Delestre, Yoann Sola Link: arXiv: 2410.08243

What it does: Designs a custom tokenizer for banking transactions, which are multimodal events consisting of (date, numerical amount, text wording), and pre-trains Transformer and RNN models on large-scale transaction data.

Tokenization scheme:

  1. Date modality: Converted to relative temporal features (days since last transaction, day of week)
  2. Amount modality: Quantized into bins. The paper doesn't specify the exact binning, but refers to discretization that preserves order and magnitude.
  3. Wording modality: Standard BPE tokenization on the text description (e.g., merchant names, transaction descriptions) after normalization (removing account numbers, dates from text, standardizing merchant names)
  4. Composite embedding: Each modality's tokens are independently embedded, then combined via concatenation or learned projection into a single transaction-level representation

Sequence construction:

  • Within each day: transactions sorted by ascending amount
  • Across days: chronological order
  • Special separator tokens between days

Pre-training (self-supervised):

  • Masked Transaction Prediction (MTP): Mask entire transactions (not just subword tokens within a description), predict the masked transaction. This forces the model to learn cross-transaction patterns.
  • Both RNN (BiLSTM-based, ELMo-style) and Transformer (BERT-style) pre-training explored

Downstream tasks:

  • Transaction categorization: 31 classes (income, shopping, subscription, transport, savings, etc.). Fine-tuned pre-trained models beat all baselines.
  • Credit risk scoring: Binary classification of default risk. Pre-trained models significantly outperform non-pre-trained approaches.

Relevance to domainTokenizer: This is the closest existing work to an e-commerce transaction tokenizer. The multimodal composite tokenization approach (date + amount + text) is directly applicable.


5.4 LETTER: Learnable Item Tokenization

Full title: "Learnable Item Tokenization for Generative Recommendation" Authors: Wenjie Wang, Honghui Bao, et al. Link: arXiv: 2405.07314 | GitHub 153⭐

What it does: LETTER addresses three limitations of prior item tokenization methods:

  1. ID-based: No semantic information, can't generalize to new items
  2. Text-based: Lose collaborative signals (who bought what with what)
  3. Codebook-based (RQ-VAE): Suffer from code assignment bias (popular items get all the good codes)

LETTER's solution β€” a learnable tokenizer with three objectives:

  1. Semantic regularization: Tokenizer's codebook should respect semantic similarity (similar items → similar codes)
  2. Contrastive alignment: Tokens should capture collaborative filtering signals (items bought together → nearby in token space)
  3. Diversity loss: Prevent codebook collapse by ensuring all codes are used, not just a few popular ones

Architecture:

  • Uses Residual Quantized VAE (like TIGER) as the base tokenizer
  • Adds the three losses above during tokenizer training
  • The tokenizer is trained jointly with (or alternately with) the generative recommendation model

Key results:

  • Outperforms TIGER, P5, and other generative recommendation baselines
  • Particularly strong on long-tail items (items with few interactions) due to the diversity loss

Relevance to domainTokenizer: LETTER shows that the tokenizer itself should be a learnable model trained with domain-specific objectives, not just a fixed preprocessing step.


5.5 TP-BERTa: Numerical Value Tokenization

Full title: "Making Pre-trained Language Models Great on Tabular Prediction" Authors: Jiahuan Yan, et al. Link: arXiv: 2403.01841

What it does: Solves the fundamental problem of representing numerical feature values as tokens. Standard text tokenizers fragment numbers meaninglessly. TP-BERTa introduces Relative Magnitude Tokenization (RMT).

Relative Magnitude Tokenization: Instead of tokenizing the raw number "$79.99" as text:

  1. Compute the feature's distribution across the dataset
  2. Express each value as its relative position in that distribution
  3. Discretize into bins: "very_low", "low", "medium", "high", "very_high" (or finer)
  4. The token is the bin label, which preserves ordinal relationships

Example:

price = $79.99
→ Within the "price" feature distribution, $79.99 is at the 73rd percentile
→ Token: "price_bin_73" or "price_high"
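
A sketch of how RMT-style binning could be implemented with per-feature quantile edges; the bin count and the token naming here are assumptions rather than TP-BERTa's exact scheme:

import numpy as np

def fit_quantile_edges(values: np.ndarray, n_bins: int = 100) -> np.ndarray:
    """Compute per-feature quantile boundaries from the training distribution."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

def rmt_token(feature_name: str, value: float, edges: np.ndarray) -> str:
    """Map a raw value to its relative-magnitude bin token for one feature."""
    idx = int(np.clip(np.searchsorted(edges, value) - 1, 0, len(edges) - 2))
    return f"{feature_name}_bin_{idx}"

train_prices = np.random.default_rng(0).lognormal(mean=3.5, sigma=1.0, size=10_000)
edges = fit_quantile_edges(train_prices)
print(rmt_token("price", 79.99, edges))  # e.g., "price_bin_73"; the bin depends on the fitted distribution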

Intra-Feature Attention: Each feature value is paired with its feature name:

"price" β†’ [price_name_embedding]
"$79.99" β†’ [price_value_embedding via RMT]

Intra-feature attention binds them, so the model knows this number means "price" not "quantity" or "weight".

Key results:

  • TP-BERTa is competitive with XGBoost and LightGBM on standard tabular benchmarks
  • Significantly outperforms other deep learning approaches on tabular data
  • The pre-trained model transfers across different tables

Relevance to domainTokenizer: RMT solves the critical problem of numerical tokenization. Every domain tokenizer will need to handle numbers (prices, amounts, quantities, durations), and RMT is currently the best approach.


5.6 Meta-Transformer: 12 Modalities, One Token Space

Full title: "Meta-Transformer: A Unified Framework for Multimodal Learning" Authors: Yiyuan Zhang, Kaixiong Gong, et al. Link: arXiv: 2307.10802 | GitHub 1652⭐

What it does: Demonstrates that a single frozen Transformer encoder can process 12 different modalities, including time series and tabular data, by projecting each modality into a shared token space via modality-specific tokenizers.

Modality-specific tokenizers:

  • Text: standard embedding
  • Image: patch embedding (ViT-style)
  • Audio: spectrogram patches
  • Time series: segment embedding (chop time series into fixed-length segments, project each to a token)
  • Tabular: feature-wise embedding (each column value becomes a token)
  • Graph: node feature embedding
  • Point cloud: point group embedding

Key insight: The tokenizers are lightweight (small learnable projections), and the Transformer encoder is frozen, trained once and shared across all modalities. This means the bulk of the computation is modality-agnostic.
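
A minimal sketch of one such lightweight tokenizer, here for time series: chop the series into fixed-length segments and project each segment into the shared token dimension (segment length and model width below are assumptions):

import torch
import torch.nn as nn

class TimeSeriesTokenizer(nn.Module):
    """Segment a univariate time series into fixed-length patches and project each to a token."""
    def __init__(self, segment_len: int = 16, d_model: int = 768):
        super().__init__()
        self.segment_len = segment_len
        self.proj = nn.Linear(segment_len, d_model)  # lightweight, learnable projection

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, length); length is assumed to be a multiple of segment_len in this sketch
        batch, length = series.shape
        segments = series.view(batch, length // self.segment_len, self.segment_len)
        return self.proj(segments)                   # (batch, n_segments, d_model) token sequence

tokens = TimeSeriesTokenizer()(torch.randn(4, 128))
print(tokens.shape)  # torch.Size([4, 8, 768]); these tokens feed the shared (frozen) Transformer encoder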

Relevance to domainTokenizer: Meta-Transformer proves the viability of the unified approach. A domain tokenizer could use a similar architecture: lightweight domain-specific tokenizers feeding into a shared Transformer backbone.


6. Tokenization Methods: A Technical Taxonomy

6.1 Quantization-Based (RQ-VAE, VQ-VAE)

How it works:

  • Train a Vector Quantized Variational Autoencoder on item embeddings
  • The encoder maps items to a continuous latent space
  • The quantization layer maps each embedding to the nearest entry in a learned codebook
  • Residual Quantization (RQ): apply quantization iteratively on residuals for multi-token representations
  • The decoder reconstructs the original embedding from the quantized codes

Strengths:

  • Produces hierarchically structured tokens (coarse-to-fine)
  • Items with similar content naturally share token prefixes
  • Controllable vocabulary size (codebook size × number of levels)

Weaknesses:

  • Codebook collapse (some codes rarely used)
  • Training instability (requires commitment loss, EMA updates, etc.)
  • No collaborative signal unless explicitly added (see LETTER)

Used by: TIGER, LETTER, PRISM, MMGRec, MiniOneRec, GenRec

6.2 BPE-Inspired Merging

How it works:

  • Start with atomic features as the base vocabulary
  • Count co-occurrence frequencies of feature pairs in the corpus
  • Merge the most frequent pair into a new composite token
  • Repeat until desired vocabulary size

Strengths:

  • Naturally discovers meaningful composite patterns
  • Context-aware (merges depend on surrounding actions)
  • Directly analogous to text BPE, with well-understood properties
  • No neural network training required for vocabulary construction

Weaknesses:

  • Greedy algorithm; may not find a globally optimal vocabulary
  • Requires careful handling of unordered feature sets (set permutation regularization)
  • Vocabulary depends on corpus statistics and may not generalize under distribution shift

Used by: ActionPiece

6.3 Magnitude & Binning Approaches

How it works:

  • For numerical values: compute distribution statistics, discretize into bins
  • Options: uniform bins, quantile bins, logarithmic bins, adaptive bins
  • For timestamps: calendar tokens (day-of-week, month, etc.) or relative encodings

Strengths:

  • Simple, interpretable, no training required
  • Preserves ordinal relationships
  • Handles numerical data natively (no text conversion)

Weaknesses:

  • Fixed granularity (bin resolution)
  • Information loss at bin boundaries
  • Requires domain knowledge to choose binning strategy

Used by: TP-BERTa, Banking Transaction Flow, Temporal Tokenization Strategies

6.4 Learnable End-to-End Tokenizers

How it works:

  • A neural network (encoder) maps raw domain data to discrete tokens
  • The tokenizer is trained end-to-end with the downstream model
  • Uses techniques like Gumbel-Softmax for differentiable discretization
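
A minimal sketch of the differentiable-discretization idea using Gumbel-Softmax: an encoder produces logits over a code vocabulary, a hard one-hot choice is sampled with straight-through gradients, and the selected code embedding is passed downstream; dimensions and vocabulary size are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableTokenizer(nn.Module):
    """Map a raw event feature vector to one discrete code, trainable end to end (sketch)."""
    def __init__(self, in_dim: int = 32, n_codes: int = 512, d_model: int = 128):
        super().__init__()
        self.to_logits = nn.Linear(in_dim, n_codes)
        self.code_embeddings = nn.Embedding(n_codes, d_model)

    def forward(self, event_features: torch.Tensor, tau: float = 1.0):
        logits = self.to_logits(event_features)                 # (batch, n_codes)
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)  # discrete choice, differentiable via straight-through
        token_ids = one_hot.argmax(dim=-1)                      # the discrete token id
        token_embs = one_hot @ self.code_embeddings.weight      # gradients flow into the codebook
        return token_ids, token_embs

ids, embs = LearnableTokenizer()(torch.randn(8, 32))
print(ids.shape, embs.shape)  # torch.Size([8]) torch.Size([8, 128])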

Strengths:

  • Tokenizer adapts to the downstream task
  • Can incorporate multiple objectives (semantic, collaborative, diversity)
  • No manual design of tokenization rules

Weaknesses:

  • More complex training (joint optimization)
  • Risk of tokenizer-model co-adaptation (poor generalization)
  • Harder to interpret what tokens mean

Used by: LETTER, UniGRec, ContRec, MANTa

6.5 Serialization-Based (Text Templates)

How it works:

  • Convert each data record to a natural language string: "The customer bought Sony WH-1000XM5 headphones for $349.99 using a credit card on March 15, 2025."
  • Use a standard text tokenizer (BPE) on the serialized string
  • Feed to a pre-trained LLM
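
A sketch of the serialization step for one transaction record; the field names and template wording are illustrative:

def serialize_transaction(txn: dict) -> str:
    """Turn one transaction record into a natural-language string for an off-the-shelf LLM."""
    return (f"The customer bought {txn['product']} for ${txn['price']:.2f} "
            f"using a {txn['payment_method']} on {txn['date']}.")

print(serialize_transaction({
    "product": "Sony WH-1000XM5 headphones",
    "price": 349.99,
    "payment_method": "credit card",
    "date": "March 15, 2025",
}))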

Strengths:

  • Zero engineering: use off-the-shelf LLMs
  • Benefits from LLM's pre-trained world knowledge
  • Handles heterogeneous schemas easily

Weaknesses:

  • Extremely token-inefficient (one row might become 100+ tokens)
  • Numerical values still poorly handled by text tokenizers
  • Requires large models to work well (no "small model" possibility)
  • No exploitation of domain structure

Used by: TabuLa-8B, TabSTAR (partially), various LLM-for-tabular approaches


7. The domainTokenizer Blueprint: How to Build It

7.1 Architecture Design

Based on the research, domainTokenizer should have three components:

+--------------------------------------------------------+
|                    domainTokenizer                     |
|                                                        |
|  +---------------+   +--------------+   +-----------+  |
|  |  Domain       |   |  Transformer |   |  Task     |  |
|  |  Tokenizer    |-->|  Backbone    |-->|  Heads    |  |
|  |  (learnable)  |   |  (small)     |   |           |  |
|  +---------------+   +--------------+   +-----------+  |
|                                                        |
|  Tokenizer: domain events -> discrete tokens           |
|  Backbone:  sequence modeling via attention            |
|  Heads:     task-specific outputs                      |
+--------------------------------------------------------+

Domain Tokenizer (per-domain, learnable):

  • Handles the conversion of raw domain events into discrete tokens
  • Combines multiple strategies: RQ-VAE for items, magnitude binning for numbers, BPE-like merging for feature compositions, calendar encoding for timestamps
  • Small and fast (a few million parameters at most)

Transformer Backbone (shared, small):

  • Standard causal or bidirectional Transformer
  • Target sizes: 10M, 50M, 150M, 350M parameters
  • Pre-trained on domain sequences with self-supervised objectives
  • Potentially shareable across related domains

Task Heads (per-task):

  • Classification head for fraud detection, churn prediction, etc.
  • Generation head for next-event prediction, recommendation
  • Regression head for value prediction (LTV, credit score, etc.)

7.2 Tokenizer Construction Pipeline

For a given domain (e.g., e-commerce), the tokenizer construction follows:

Step 1: Schema Analysis

# Identify field types in the domain data
schema = {
    "product_id": "categorical_entity",    # β†’ Semantic ID via RQ-VAE
    "category": "categorical_fixed",       # β†’ direct vocabulary mapping
    "price": "numerical_continuous",       # β†’ magnitude binning (RMT)
    "quantity": "numerical_discrete",      # β†’ small fixed vocabulary
    "timestamp": "temporal",               # β†’ calendar + relative encoding
    "description": "text",                 # β†’ standard BPE (subword)
    "payment_method": "categorical_small", # β†’ direct mapping
    "customer_id": "entity_id",            # β†’ learned embedding or behavioral cluster
}

Step 2: Per-Field Tokenization

| Field Type | Method | Output |
|---|---|---|
| Categorical entity (products) | RQ-VAE Semantic IDs | Tuple of K codebook indices |
| Categorical fixed (categories) | Direct vocab mapping | Single token index |
| Numerical continuous (prices) | Relative Magnitude Tokenization | Bin token |
| Temporal (timestamps) | Calendar tokens + relative delta | 2-3 tokens (day-of-week, time-of-day, delta) |
| Text (descriptions) | Standard BPE | Variable-length subword tokens |
| Entity ID (customers) | Behavioral clustering or learned embedding | Single token or short sequence |
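
To tie Steps 1 and 2 together, a sketch of a per-field dispatcher is shown below; the toy per-field tokenizers are hypothetical stand-ins for the real methods in the table above:

from datetime import datetime

def tokenize_event(event: dict, field_tokenizers: dict) -> list[str]:
    """Apply the per-field tokenizers from Step 2 to one event and concatenate the results."""
    tokens = []
    for field, tokenize in field_tokenizers.items():
        value = event.get(field)
        tokens.extend(tokenize(value) if value is not None else [f"{field}_[MISSING]"])
    return tokens

# Toy per-field tokenizers standing in for the real methods (Semantic IDs, RMT, calendar encoding)
field_tokenizers = {
    "product_id": lambda v: [f"sem_{c}" for c in (42, 187, 23, 91)],  # stand-in for an RQ-VAE Semantic ID lookup
    "category":   lambda v: [f"cat_{v}"],                             # direct vocabulary mapping
    "price":      lambda v: [f"price_bin_{min(int(v // 10), 49)}"],   # stand-in for RMT binning
    "timestamp":  lambda v: [f"dow_{v.weekday()}", f"hour_{v.hour}"], # calendar tokens
}

event = {"product_id": "SKU-8847291", "category": "Headphones",
         "price": 79.99, "timestamp": datetime(2025, 3, 15, 14, 30)}
print(tokenize_event(event, field_tokenizers))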

Step 3: Composite Token Construction (BPE-like)

Following ActionPiece, apply a BPE-like merge algorithm on the composite per-field tokens to discover meaningful multi-field patterns:

Initial: [Electronics] [price_high] [CreditCard] [Weekday]
After merging: [Electronics+price_high] [CreditCard+Weekday]
Further: [HighEndElectronicsPurchase] [WeekdayCreditCard]

Step 4: Special Tokens

[SEP]       - separates transactions in a sequence
[DAY_SEP]   - separates days
[PAD]       - padding
[MASK]      - for masked pre-training
[CLS]       - sequence-level representation
[UNK]       - unknown/out-of-vocabulary events

7.3 Pre-training Objectives

Based on the literature, the following self-supervised objectives are most effective:

1. Masked Event Prediction (MEP) - BERT-style (a masking sketch follows the fourth objective below)

  • Mask 15% of complete events (not just individual tokens within an event)
  • Predict all tokens of the masked event
  • Forces the model to learn cross-event patterns

2. Next Event Prediction (NEP) - GPT-style

  • Given a sequence of events, predict the next event autoregressively
  • Generate the event's token sequence (e.g., Semantic ID) token by token
  • The primary objective for generative recommendation

3. Contrastive Sequence Learning

  • Similar customer sequences should have similar representations
  • Push apart sequences from different behavioral clusters
  • Helps with customer segmentation and transfer learning

4. Temporal Ordering

  • Given a shuffled sequence, predict the correct temporal order
  • Forces the model to learn temporal patterns (seasonality, cadence, trends)
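
A sketch of what objective 1 (MEP) means at the data level: masking is applied to whole events, i.e., every token belonging to a sampled event, rather than to individual tokens; the 15% rate follows the description above and everything else is illustrative:

import random

def mask_whole_events(event_token_spans: list[list[str]], mask_rate: float = 0.15):
    """Mask complete events (all of their tokens), BERT-style, for masked event prediction."""
    inputs, targets = [], []
    for event_tokens in event_token_spans:
        if random.random() < mask_rate:
            inputs.extend(["[MASK]"] * len(event_tokens))  # hide the entire event
            targets.extend(event_tokens)                   # the model must reconstruct all of its tokens
        else:
            inputs.extend(event_tokens)
            targets.extend([None] * len(event_tokens))     # no loss on unmasked positions
        inputs.append("[SEP]")
        targets.append(None)
    return inputs, targets

sequence = [["sem_42", "sem_187", "price_bin_37", "dow_3"],
            ["sem_8", "sem_31", "price_bin_12", "dow_5"]]
print(mask_whole_events(sequence))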

7.4 Downstream Task Adaptation

Once pre-trained, the model can be fine-tuned for specific tasks:

| Task | Adaptation Method | Head |
|---|---|---|
| Next purchase prediction | Continue NEP, decode Semantic IDs | Generative (autoregressive) |
| Fraud detection | Fine-tune on labeled transactions | Binary classifier on [CLS] |
| Customer segmentation | Extract [CLS] embeddings, cluster | No head (use embeddings) |
| Churn prediction | Fine-tune on labeled sequences | Binary classifier on [CLS] |
| Credit scoring | Fine-tune on labeled customer histories | Regression or classification |
| Demand forecasting | Adapt temporal patterns | Regression on quantity tokens |
| Product recommendation | NEP with Semantic ID decoding | Generative (beam search) |

8. Use Case Walkthrough: E-Commerce Transaction Model

The Scenario

An e-commerce platform with:

  • 2M customers
  • 500K products
  • 100M transactions over 2 years
  • Each transaction: (customer_id, product_id, category, price, quantity, timestamp, payment_method, device)

Step 1: Build the Tokenizer

Product Semantic IDs:

# 1. Generate product embeddings from title + description
product_embeddings = sentence_encoder(product_titles + product_descriptions)  # 500K × 768

# 2. Train RQ-VAE with 4 codebooks of 256 entries each
rq_vae = ResidualQuantizedVAE(n_codebooks=4, codebook_size=256)
rq_vae.fit(product_embeddings)

# 3. Each product gets a 4-token Semantic ID
product_semantic_ids = rq_vae.encode(product_embeddings)  # 500K × 4
# e.g., Headphones → [42, 187, 23, 91]

Price Tokenization (RMT):

# Compute percentile bins
price_bins = compute_quantile_bins(all_prices, n_bins=50)
# $79.99 → "price_bin_37" (37th percentile bin)

Timestamp Tokenization:

# Calendar features + relative delta
def tokenize_timestamp(ts, prev_ts):
    return [
        day_of_week_token(ts),      # "wednesday"
        time_of_day_token(ts),       # "afternoon"  
        delta_token(ts - prev_ts),   # "2_days_later"
    ]

Composite vocabulary construction (BPE-like):

# Run ActionPiece-style merging on the corpus of tokenized transaction sequences
vocabulary = actionpiece_vocab_construction(
    corpus=all_tokenized_transactions,
    target_vocab_size=8192,
    consider_intra_event=True,   # merge features within a transaction
    consider_inter_event=True,   # merge features across adjacent transactions
)

Step 2: Pre-train

# Tokenize all 100M transactions
tokenized_corpus = tokenize_all_transactions(transactions, tokenizer)

# Pre-train a small Transformer (150M params)
model = TransformerLM(
    vocab_size=8192 + special_tokens,
    d_model=768,
    n_heads=12,
    n_layers=12,
    max_seq_len=256,  # ~256 transactions per customer
)

# Self-supervised pre-training with MEP + NEP
train(model, tokenized_corpus, objectives=["masked_event", "next_event"])

Step 3: Fine-tune & Deploy

# Example: Fraud detection
fraud_model = add_classification_head(model, n_classes=2)
fine_tune(fraud_model, labeled_fraud_data)

# Example: Next purchase recommendation
rec_model = model  # Use generative mode directly
next_item_semantic_id = rec_model.generate(customer_transaction_sequence)
next_item = semantic_id_to_product[tuple(next_item_semantic_id)]  # Look up the product whose Semantic ID matches (rq_vae.decode returns an embedding, not a product)

9. Open Challenges and Research Gaps

9.1 Vocabulary Evolution

Products are added and removed constantly. Semantic IDs need to be recomputed, which may invalidate the model's learned associations. Partial solutions: periodic re-indexing (TIGER), using content features that are stable even when the catalog changes.

9.2 Cross-Domain Transfer

Can a tokenizer trained on e-commerce data transfer to banking? The field-level tokenizers (RMT for numbers, calendar for dates) should transfer, but composite vocabularies are domain-specific. Open question: is there a "universal domain tokenizer" or will each domain need its own?

9.3 Numerical Precision

All current methods lose some numerical precision through discretization. For applications where exact values matter (financial auditing, pricing optimization), this is a limitation. Potential solution: hybrid approaches that combine discrete tokens with continuous residuals.

9.4 Handling Missing Data

Real business data is full of missing values. Text tokenizers never face this issue. Domain tokenizers need explicit strategies: [MISSING] tokens, imputation, or learning to model missingness as a signal.

9.5 Privacy & Fairness

Tokenizing customer behavior raises privacy concerns. Semantic IDs could encode sensitive attributes (demographic patterns, financial status) in ways that are hard to audit. Domain tokenizers should be designed with fairness constraints.

9.6 Scalability of BPE-Like Merging

ActionPiece's vocabulary construction is O(N × V) per merge step. For very large corpora (billions of events) and feature spaces (thousands of features), this may become prohibitively expensive. Potential solution: approximate counting, hierarchical merging, or neural vocabulary construction.

9.7 Evaluation Standards

There are no standard benchmarks for "domain tokenization quality." Text tokenizers can be evaluated by compression ratio and downstream perplexity. Domain tokenizers need domain-specific metrics: recommendation quality, prediction accuracy, calibration, etc.

9.8 Connection to Continual Learning

The HOPE / Nested Learning paradigm (see Section 11) suggests that models should continuously learn from new data. Domain tokenizers that can incrementally update their vocabularies (adding new product tokens, retiring obsolete ones) without full retraining would be highly valuable.


10. Complete Paper Reference Table

| # | Paper | Year | ArXiv | Domain | Key Contribution | GitHub |
|---|---|---|---|---|---|---|
| 1 | TIGER | 2023 | 2305.05065 | Recommendation | Semantic IDs via RQ-VAE for generative retrieval | 781⭐ |
| 2 | Semantic IDs (YouTube) | 2023 | 2306.08121 | Recommendation | Content-derived IDs at industry scale | - |
| 3 | ActionPiece | 2025 | 2502.13581 | Recommendation | BPE-like context-aware action tokenization | 53⭐ |
| 4 | LETTER | 2024 | 2405.07314 | Recommendation | Learnable tokenizer with semantic+collaborative+diversity | 153⭐ |
| 5 | SETRec | 2025 | 2502.10833 | Recommendation | Order-agnostic set identifiers | - |
| 6 | ContRec | 2025 | 2504.12007 | Recommendation | Continuous tokens via sigma-VAE + diffusion | - |
| 7 | GenRec | 2026 | 2604.14878 | Recommendation | Page-wise NTP for large-scale recommendation | - |
| 8 | MBGen | 2024 | 2405.16871 | Recommendation | Multi-behavior (view/click/buy) as token types | 57⭐ |
| 9 | RSLLM | 2024 | 2412.16933 | Recommendation | Recommendation as a new language in LLMs | - |
| 10 | PRISM | 2025 | 2601.16556 | Recommendation | Purified quantization for semantic tokenization | - |
| 11 | MMGRec | 2024 | 2404.16555 | Recommendation | Graph RQ-VAE for multimodal items | - |
| 12 | UniGRec | 2025 | 2601.17438 | Recommendation | Soft item identifiers for end-to-end optimization | - |
| 13 | Semantic IDs for Search+Rec | 2025 | 2508.10478 | Recommendation | Joint search and recommendation Semantic IDs | - |
| 14 | Banking Transaction Flow | 2024 | 2410.08243 | Finance | Composite tokenizer for (date, amount, text) transactions | - |
| 15 | LBSF | 2024 | 2411.15056 | Finance | Long-term payment behavior folding by merchant | - |
| 16 | Temporal Tokenization | 2025 | 2512.13618 | Events | Systematic comparison of temporal tokenization strategies | - |
| 17 | FinTRec | 2025 | 2511.14865 | Finance | Transformer for long-range financial recommendation | - |
| 18 | TIMeSynC | 2024 | 2410.12825 | Finance | Temporal intent prediction in financial services | - |
| 19 | TP-BERTa | 2024 | 2403.01841 | Tabular | Relative Magnitude Tokenization for numbers | - |
| 20 | TabuLa-8B | 2024 | 2406.12031 | Tabular | Llama 3 fine-tuned on serialized tables | 71⭐ |
| 21 | TabSTAR | 2025 | 2505.18125 | Tabular | Semantically target-aware tabular foundation model | 83⭐ |
| 22 | UniTabE | 2023 | 2307.09249 | Tabular | Universal tabular pretraining protocol | - |
| 23 | TARTE | 2025 | 2505.14415 | Tabular | Knowledge-enhanced tabular representations | - |
| 24 | TabICL | 2025 | 2502.05564 | Tabular | Column-then-row attention, scales to 500K samples | - |
| 25 | Meta-Transformer | 2023 | 2307.10802 | Universal | 12 modalities in one token space | 1652⭐ |
| 26 | Emu3 | 2024 | 2409.18869 | Universal | NTP is all you need across modalities | 2400⭐ |
| 27 | Unified-IO 2 | 2023 | 2312.17172 | Universal | Image+text+audio+action in one model | 647⭐ |
| 28 | NTP Multimodal Survey | 2024 | 2412.18619 | Survey | Taxonomy of multimodal tokenization + NTP | 478⭐ |
| 29 | LongCat-Next | 2025 | 2603.27538 | Universal | Lexicalizing modalities as discrete tokens | 409⭐ |
| 30 | Tabular Data Survey | 2024 | 2408.10548 | Survey | Comprehensive survey of LMs for tabular data | 33⭐ |
| 31 | KL3M Tokenizers | 2025 | 2503.17247 | Legal/Finance | Domain-specific BPE for professional text | GitHub |

11. Related Concepts: Nested Learning & Continual Adaptation

An important related development is the Nested Learning paradigm introduced by Google Research (arXiv: 2512.24695, by Ali Behrouz et al.), which presents the HOPE architecture.

Why Nested Learning Matters for Domain Tokenization

Current Transformer-based models are "frozen" after pre-training; they cannot incorporate new knowledge without retraining. For domain tokenization, this means:

  • A recommendation model can't learn about new products added after training
  • A fraud detection model can't adapt to new fraud patterns in real-time
  • A customer model can't update its understanding of a customer's evolving preferences

The HOPE architecture addresses this via:

  1. Continuum Memory System (CMS): Multiple MLP blocks updating at different frequencies; some update every few tokens (catching immediate patterns), while others update only after millions of tokens (storing persistent knowledge). This prevents catastrophic forgetting.
  2. Self-Modifying Titans: The model's projection layers update themselves in real-time based on incoming data, enabling continuous adaptation.

For domainTokenizer, the implication is: a domain model built with Nested Learning principles could continuously learn from new transactions, adapting its understanding of products, customer preferences, and behavioral patterns without retraining from scratch.

This is an area of active exploration for future versions of domainTokenizer.

For the full research report on Nested Learning, see the HOPE / Nested Learning discussion on HF Papers.


This report is a living document and will be updated as the domainTokenizer project evolves.