
Domain Tokenization: Beyond Words - A Research Report

Building small models that understand domain tokens, not just words.

Last updated: April 2026


Table of Contents

  1. Executive Summary
  2. The Problem: Why Words Are Not Enough
  3. The Core Insight: Anything Can Be a Token
  4. Research Landscape: Five Paradigms of Domain Tokenization
  5. Key Papers: Detailed Analysis
  6. Tokenization Methods: A Technical Taxonomy
  7. The domainTokenizer Blueprint: How to Build It
  8. Use Case Walkthrough: E-Commerce Transaction Model
  9. Open Challenges and Research Gaps
  10. Complete Paper Reference Table
  11. Related Concepts: Nested Learning & Continual Adaptation

1. Executive Summary

Large Language Models (LLMs) process text by breaking it into tokens: subword units learned via algorithms like BPE (Byte-Pair Encoding). This tokenization is the foundation that allows Transformers to model sequential patterns via next-token prediction.

But words are just one type of sequential data. Businesses generate vast amounts of non-textual sequential data every day:

  • E-commerce: millions of purchase transactions, each with product IDs, amounts, timestamps, categories
  • Banking: transaction flows with dates, amounts, merchant codes, and descriptions
  • Healthcare: sequences of diagnoses, procedures, lab results, medications
  • Advertising: impression → click → conversion funnels with bid amounts and user features
  • Logistics: shipping events, warehouse movements, delivery status sequences

The central question this project explores: Can we build tokenizers that encode these domain-specific entities (products, transactions, medical codes, user actions) as first-class tokens, and then train small, efficient Transformer models that understand domain patterns the way LLMs understand language?

The answer from recent research is a resounding yes. This report surveys 25+ papers spanning 2021-2026 that collectively establish a new paradigm: domain tokenization. The key findings are:

  1. Semantic IDs (Google, 2023): Products can be encoded as tuples of discrete tokens derived from their content embeddings via quantization (RQ-VAE). A Transformer trained on sequences of these Semantic IDs outperforms traditional recommendation systems and generalizes to unseen items.

  2. Action tokenization (Google DeepMind, 2025): User action sequences can be tokenized using a BPE-like algorithm that merges frequently co-occurring features, applying the same algorithm that powers text tokenization to business events instead of characters.

  3. Transaction tokenization (2024): Banking transactions, which are multimodal events of (date, amount, text), can be encoded as composite tokens and modeled with self-supervised pre-training, achieving state-of-the-art results on transaction categorization and credit risk scoring.

  4. Tabular tokenization (2024–2025): Individual feature values (numerical, categorical) can be tokenized via relative magnitude encoding or serialization, enabling foundation models that transfer across different tabular datasets.

  5. Universal tokenization (2023–2024): Frameworks like Meta-Transformer demonstrate that 12+ modalities including time series and tabular data can be projected into a shared token space and processed by a single frozen Transformer.

This report details each paradigm, provides technical depth on the tokenization methods, and lays out a concrete blueprint for building domainTokenizer.


2. The Problem: Why Words Are Not Enough

2.1 The Mismatch Between Business Data and Text Tokens

When an e-commerce platform processes a customer's purchase history, the raw data looks like:

customer_42 | 2025-03-15 | SKU-8847291 | Electronics > Headphones | $79.99 | Credit Card | qty: 1
customer_42 | 2025-03-15 | SKU-3321098 | Electronics > Cables    | $12.49 | Credit Card | qty: 2
customer_42 | 2025-04-01 | SKU-5519273 | Books > Technical       | $44.95 | Debit Card  | qty: 1

If you feed this to a standard LLM tokenizer (e.g., OpenAI's cl100k_base, the encoding used by GPT-4), you get the following (the short sketch after this list reproduces it):

  • SKU-8847291 → split into meaningless subword fragments like SK, U-, 884, 72, 91
  • $79.99 → tokenized as $, 79, ., 99, losing the semantic meaning of "a mid-range purchase"
  • 2025-03-15 → fragmented into date components with no temporal understanding
  • The relationships between fields (this amount goes with this product in this category) are lost in a flat token stream
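
A quick way to see the fragmentation for yourself is the minimal sketch below; it assumes the tiktoken package is installed, and the exact token splits will vary with the encoding version:

# Minimal sketch: inspect how a general-purpose text tokenizer fragments one transaction row.
# Assumes the tiktoken package; exact splits depend on the encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
row = "customer_42 | 2025-03-15 | SKU-8847291 | Electronics > Headphones | $79.99 | Credit Card | qty: 1"

token_ids = enc.encode(row)
pieces = [enc.decode([t]) for t in token_ids]
print(len(token_ids), "tokens:", pieces)  # one business event becomes dozens of text fragments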

The fundamental problem: text tokenizers are optimized for the statistical structure of natural language. They know that ing and tion are common suffixes, that the is frequent, that un- is a prefix. They know nothing about:

  • Product similarity (headphones and earbuds are related)
  • Price ranges ($79.99 is "mid-range electronics" vs. $2,499 is "premium")
  • Temporal patterns (weekly vs. monthly purchase cadence)
  • Cross-field interactions (buying a cable right after headphones = accessory purchase)

2.2 The Opportunity: Domain Structure is Richer Than Language

Business domains have structure that goes beyond what text captures:

| Dimension | Language | Business Domain |
|---|---|---|
| Vocabulary | ~50K-256K subwords | Millions of SKUs, thousands of categories |
| Sequence meaning | Word order determines syntax | Temporal order determines behavioral patterns |
| Similarity | Semantic (synonyms, paraphrases) | Collaborative (users who buy X also buy Y) |
| Numerical values | Rare, incidental | Central (prices, quantities, timestamps) |
| Compositionality | Words compose into sentences | Features compose into events/transactions |
| Temporal dynamics | Mostly static semantics | Evolving trends, seasonal patterns |

A domain tokenizer should exploit all of this structure.

2.3 Why Small Models?

This project focuses on small models (tens of millions to low billions of parameters) because:

  1. Domain data is structured β€” you don't need 70B parameters to learn that "users who buy phones often buy cases." The pattern space is narrower than open-domain language.
  2. Latency matters β€” production systems need real-time inference (fraud detection, recommendations, pricing).
  3. Data efficiency β€” most businesses have millions, not trillions, of training examples.
  4. Cost β€” training and serving small models is orders of magnitude cheaper.
  5. Interpretability β€” smaller models with domain-specific tokens are more auditable than black-box LLMs.

3. The Core Insight: Anything Can Be a Token

The survey "Next Token Prediction Towards Multimodal Intelligence" (arXiv: 2412.18619, 59 upvotes) formalizes this principle:

Next-Token Prediction (NTP) is a universal training objective that works across modalities. The bottleneck is not the model architecture; it is tokenization: how you map domain entities into discrete token spaces.

This means the entire LLM machinery (attention, scaling laws, in-context learning, transfer learning) becomes available for any domain once you solve the tokenization problem.

The precedent is clear across modalities:

| Modality | How It's Tokenized | Key Paper |
|---|---|---|
| Text | BPE / WordPiece / SentencePiece | GPT, BERT, Llama |
| Images | VQ-VAE, patch embeddings | DALL-E, ViT |
| Audio | Spectral codecs (EnCodec) | AudioLM, Whisper |
| Video | 3D causal VAE | HiTVideo, Emu3 |
| Robotics actions | Discrete Cosine Transform | FAST (2501.09747) |
| Products/Items | Semantic IDs via RQ-VAE | TIGER |
| User actions | BPE on feature sets | ActionPiece |
| Transactions | Composite (date+amount+text) | Banking TF |
| Tabular features | Magnitude binning, serialization | TP-BERTa, TabuLa |
| Time series | Scalar quantization, symbolic discretization | TokenCast, LLMTime |

The bottom half of this table (the business-domain entries) is where domainTokenizer operates.


4. Research Landscape: Five Paradigms of Domain Tokenization

4.1 Semantic ID Tokenization (Products & Items)

Core idea: Encode each item (product, video, song, article) as a sequence of discrete semantic tokens derived from its content features.

How it works:

  1. Extract a dense embedding from item features (e.g., product title + description → SentenceT5 → 768-dim vector)
  2. Apply Residual Quantization (RQ-VAE): iteratively quantize the embedding into a sequence of codebook indices
  3. The resulting tuple (c1, c2, c3, ...) is the item's Semantic ID, its "word" in the domain language
  4. Train a Transformer to predict sequences of these Semantic IDs

Key property: Items with similar content share token prefixes, creating a hierarchical semantic structure:

Headphones A:  [Audio, 23, 7, 41]
Headphones B:  [Audio, 23, 7, 55]    ← shares 3/4 prefix tokens
Laptop C:      [Computing, 8, 31, 12] ← completely different tokens

Papers:

  • TIGER (Google, 2023) - arXiv: 2305.05065 - The landmark paper introducing Semantic IDs for recommendation. GitHub 781⭐
  • Semantic IDs at YouTube (Google, 2023) - arXiv: 2306.08121 - Deployed at industry scale, replacing random IDs
  • PRISM (2025) - arXiv: 2601.16556 - Purified quantization for better semantic tokenization
  • MMGRec (2024) - arXiv: 2404.16555 - Graph RQ-VAE incorporating multimodal item features
  • Semantic IDs for Joint Search & Rec (2025) - arXiv: 2508.10478 - Unified Semantic IDs across search and recommendation

4.2 Action Sequence Tokenization (User Behaviors)

Core idea: Don't just tokenize individual items; tokenize the entire action sequence, where each action is a composite event with multiple features.

How it works:

  1. Represent each user action as an unordered set of features: {category: Electronics, price_bin: $50-100, brand: Sony, payment: Credit}
  2. Apply a BPE-like vocabulary construction algorithm that merges frequently co-occurring feature patterns:
    • Count co-occurrence of feature pairs both within actions and across adjacent actions
    • Merge the most frequent pair into a new token
    • Repeat until desired vocabulary size is reached
  3. The same action can be tokenized differently depending on surrounding context

Key insight (from ActionPiece): Just as BPE discovers that t + h + e should be merged into a single the token in English, the action tokenizer discovers that {Electronics, $50-100} should be merged into a single composite token because they co-occur frequently in purchase sequences.

Papers:

  • ActionPiece (Google DeepMind, 2025) - arXiv: 2502.13581 - First context-aware action sequence tokenizer. GitHub 53⭐
  • MBGen (2024) - arXiv: 2405.16871 - Multi-behavior generative recommendation (view, click, purchase as different token types). GitHub 57⭐
  • SETRec (2025) - arXiv: 2502.10833 - Order-agnostic set identifiers integrating collaborative + semantic signals
  • ContRec (2025) - arXiv: 2504.12007 - Continuous tokens via sigma-VAE + diffusion

4.3 Financial Transaction Tokenization

Core idea: Banking/financial transactions are multimodal sequential events (date + amount + description). Design a composite tokenizer that encodes all three modalities jointly.

How it works (from Banking Transaction Flow paper):

  1. Date tokenization: Convert to day-of-week + relative time since last transaction
  2. Amount tokenization: Quantize into logarithmic bins, which capture the difference between $5 and $500 better than linear bins (see the sketch after this list)
  3. Wording tokenization: Standard BPE on the transaction description text (e.g., "AMAZON MARKETPLACE" → subword tokens)
  4. Composite token: Combine date + amount + wording tokens into a single transaction representation
  5. Sequence ordering: Within each day, sort transactions by ascending amount; across days, chronological order
  6. Pre-train with masked transaction prediction (mask entire transactions, not just subwords)
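
One plausible implementation of the logarithmic amount bins in step 2 is sketched below; the paper does not publish exact bin settings, so the bin count and amount range here are assumptions:

import numpy as np

def log_amount_token(amount: float, n_bins: int = 32,
                     min_amount: float = 1.0, max_amount: float = 100_000.0) -> str:
    """Map a transaction amount to a log-spaced bin token (assumed bin settings)."""
    edges = np.logspace(np.log10(min_amount), np.log10(max_amount), n_bins + 1)
    idx = int(np.clip(np.digitize(amount, edges) - 1, 0, n_bins - 1))
    return f"amount_bin_{idx}"

print(log_amount_token(5.00))    # a small purchase lands in a low bin
print(log_amount_token(500.00))  # two orders of magnitude higher lands in a clearly separated bin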

Papers:

  • Banking Transaction Flow (2024) - arXiv: 2410.08243 - Custom tokenizer for banking transactions; pre-trained models outperform prior art on transaction categorization (31 classes) and credit risk scoring
  • LBSF (2024) - arXiv: 2411.15056 - Long-term payment behavior sequence folding by merchant, with multi-field behavior encoding
  • Temporal Tokenization Strategies (2025) - arXiv: 2512.13618 - Systematic comparison of how to tokenize timestamps for event sequences. Key finding: log-based encoding works best for skewed financial data
  • FinTRec (2025) - arXiv: 2511.14865 - Transformer for long-range financial product recommendation with temporally heterogeneous context
  • TIMeSynC (2024) - arXiv: 2410.12825 - Encoder-decoder transformer for sequential intent prediction in financial services

4.4 Tabular Feature Tokenization

Core idea: Each row in a table can be serialized as a sequence of tokens, and each feature value can be encoded meaningfully (not just as a text fragment).

Key methods:

  • Relative Magnitude Tokenization (RMT): Instead of tokenizing "$79.99" as text fragments, discretize it relative to the feature's distribution → "percentile_75" or "bin_high". This preserves ordinal relationships.
  • Intra-Feature Attention: Bind each value token to its column name via attention, so the model knows "$79.99" means "price is $79.99", not just a number.
  • Serialization: Convert rows to natural language: "price: $79.99, category: Electronics, brand: Sony", which is surprisingly effective with large enough models.

Papers:

  • TP-BERTa (2024) - arXiv: 2403.01841 - Relative Magnitude Tokenization + intra-feature attention. Competitive with XGBoost/LightGBM.
  • TabuLa-8B (2024) - arXiv: 2406.12031 - Llama 3-8B fine-tuned on serialized tabular data. Strong zero/few-shot. GitHub 71⭐
  • TabSTAR (2025) - arXiv: 2505.18125 - Foundation tabular model with semantically target-aware representations. GitHub 83⭐. 112 upvotes.
  • UniTabE (2023) - arXiv: 2307.09249 - Universal pretraining protocol for tabular foundation models
  • TARTE (2025) - arXiv: 2505.14415 - Knowledge-enhanced tabular representations via pre-training on column names + table entries
  • TabICL (2025) - arXiv: 2502.05564 - Column-then-row attention, scales to 500K samples
  • Language Modeling on Tabular Data: A Survey (2024) - arXiv: 2408.10548 - Comprehensive survey. GitHub 33⭐

4.5 Universal Modality Tokenization

Core idea: Project all modalities, including time series, tabular data, and graphs, into a shared discrete token space and process them with a single Transformer.

Papers:

  • Meta-Transformer (2023) - arXiv: 2307.10802 - 12 modalities projected into one shared token space. GitHub 1652⭐
  • Emu3 (2024) - arXiv: 2409.18869 - Next-token prediction across modalities. GitHub 2400⭐
  • Unified-IO 2 (2023) - arXiv: 2312.17172 - Image, text, audio, and action in one model. GitHub 647⭐
  • Next Token Prediction Towards Multimodal Intelligence (2024) - arXiv: 2412.18619 - Survey and taxonomy of multimodal tokenization. GitHub 478⭐
  • LongCat-Next (2025) - arXiv: 2603.27538 - Lexicalizing modalities as discrete tokens. GitHub 409⭐

5. Key Papers: Detailed Analysis

5.1 TIGER: Semantic IDs for Generative Retrieval

Full title: "Recommender Systems with Generative Retrieval" Authors: Shashank Rajput, Nikhil Mehta, Anima Singh, et al. (Google Research) Link: arXiv: 2305.05065 | GitHub 781⭐

What it does: TIGER (Transformer Index for GEnerative Recommenders) replaces the traditional two-stage retrieve-and-rank pipeline with a single generative model. Each item is assigned a Semantic ID, a tuple of discrete codewords, and the model autoregressively generates the Semantic ID of the next item a user will interact with.

Semantic ID generation process:

Item features (title, description, ...) 
    → Pre-trained text encoder (SentenceT5)
    → Dense embedding (768-dim)
    → Residual Quantization (RQ-VAE)
    → Semantic ID: (c1, c2, c3, ..., cK)    # K codewords from K codebooks

Residual Quantization (RQ):

  1. Quantize the embedding to the nearest codebook entry → c1
  2. Compute the residual (difference between original and quantized)
  3. Quantize the residual → c2
  4. Repeat K times

This creates a hierarchical representation: c1 captures coarse semantics (category-level), c2 refines it, c3 further, etc.
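
A minimal sketch of the encode path is shown below, assuming the K codebooks have already been learned (e.g., by RQ-VAE training); it only illustrates how one embedding becomes a Semantic ID:

import numpy as np

def rq_encode(embedding: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Encode one item embedding into a Semantic ID (one code index per codebook level)."""
    residual = embedding.copy()
    semantic_id = []
    for codebook in codebooks:                    # codebook shape: (codebook_size, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        code = int(np.argmin(dists))              # nearest codebook entry
        semantic_id.append(code)
        residual = residual - codebook[code]      # quantize what remains at the next level
    return semantic_id

# Toy usage with random codebooks; in practice they come from RQ-VAE training
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 768)) for _ in range(4)]
print(rq_encode(rng.normal(size=768), codebooks))  # a 4-code Semantic ID, e.g. [17, 203, 88, 54]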

Training:

  • Input: sequence of Semantic IDs representing a user's past interactions
  • Target: Semantic ID of the next item
  • Loss: cross-entropy at each code position
  • Architecture: standard Transformer encoder-decoder

Key results:

  • Outperforms SASRec, BERT4Rec, and dual-encoder baselines on Amazon datasets
  • Cold-start capability: can recommend items never seen in training (because Semantic IDs generalize via shared prefixes)
  • Diversity: beam search with temperature naturally produces diverse recommendations

Relevance to domainTokenizer: TIGER's Semantic ID is the canonical example of how to create a "word" for a non-textual entity. The RQ-VAE approach is directly applicable to any item-based domain.


5.2 ActionPiece: BPE for User Actions

Full title: "ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation" Authors: Yupeng Hou, Jianmo Ni, Zhankui He, et al. (Google DeepMind) Link: arXiv: 2502.13581 | GitHub 53⭐

What it does: ActionPiece is the first context-aware tokenizer for user action sequences. It applies the BPE principle of merging frequently co-occurring pairs, but on sets of item features rather than characters.

Key innovation (actions as unordered feature sets): Instead of treating each item as an atomic ID, ActionPiece represents each user action as a set of features:

Action = {category: "Electronics", brand: "Sony", price_range: "$50-100", ...}

Vocabulary construction (BPE-like):

  1. Start with base vocabulary = all individual features
  2. Count co-occurrence of feature pairs:
    • Intra-action: features within the same action (e.g., "Electronics" + "$50-100")
    • Inter-action: features across adjacent actions (e.g., "Phone" in action t, "PhoneCase" in action t+1)
  3. Merge the most frequent pair into a new composite token
  4. Repeat until desired vocabulary size
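
A simplified sketch of one merge round is shown below, counting intra-action and adjacent-action feature pairs; the actual ActionPiece algorithm is more involved (weighted counts, set permutations), so this only illustrates the core loop:

from collections import Counter
from itertools import combinations

def one_merge_round(sequences: list[list[set[str]]]) -> tuple[str, str]:
    """Return the most frequent feature pair in a corpus of action sequences (candidate for merging).
    Each sequence is a list of actions; each action is a set of feature tokens."""
    counts = Counter()
    for seq in sequences:
        for action in seq:                                    # intra-action pairs
            counts.update(tuple(sorted(p)) for p in combinations(action, 2))
        for prev, nxt in zip(seq, seq[1:]):                   # inter-action pairs (adjacent actions)
            counts.update(tuple(sorted((a, b))) for a in prev for b in nxt)
    return counts.most_common(1)[0][0]

corpus = [
    [{"Electronics", "$50-100", "CreditCard"}, {"Cables", "$0-25", "CreditCard"}],
    [{"Electronics", "$50-100", "DebitCard"}, {"Books", "$25-50", "DebitCard"}],
]
print(one_merge_round(corpus))  # e.g., ('$50-100', 'Electronics'), which would become a new composite token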

Set Permutation Regularization (SPR): Because feature sets are unordered, the same action can be tokenized with different internal orderings. SPR produces multiple segmentations of the same sequence, acting as data augmentation and preventing the model from overfitting to arbitrary feature orderings.

Key results:

  • Outperforms TIGER, SASRec, BERT4Rec on Amazon Sports, Beauty, and CDs datasets
  • NDCG@10 improvements of 5-15% over TIGER
  • The context-aware tokenization means the same item gets different tokens in different behavioral contexts

Relevance to domainTokenizer: ActionPiece is the most directly applicable template for building a domain tokenizer. Its BPE-like algorithm can be generalized to any domain where events are composed of multiple features.


5.3 Banking Transaction Flow: Transactions as Tokens

Full title: "Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow" Authors: Cyrile Delestre, Yoann Sola Link: arXiv: 2410.08243

What it does: Designs a custom tokenizer for banking transactions, which are multimodal events consisting of (date, numerical amount, text wording), and pre-trains Transformer and RNN models on large-scale transaction data.

Tokenization scheme:

  1. Date modality: Converted to relative temporal features (days since last transaction, day of week)
  2. Amount modality: Quantized into bins. The paper doesn't specify the exact binning, but refers to discretization that preserves order and magnitude.
  3. Wording modality: Standard BPE tokenization on the text description (e.g., merchant names, transaction descriptions) after normalization (removing account numbers, dates from text, standardizing merchant names)
  4. Composite embedding: Each modality's tokens are independently embedded, then combined via concatenation or learned projection into a single transaction-level representation

Sequence construction:

  • Within each day: transactions sorted by ascending amount
  • Across days: chronological order
  • Special separator tokens between days

Pre-training (self-supervised):

  • Masked Transaction Prediction (MTP): Mask entire transactions (not just subword tokens within a description), predict the masked transaction. This forces the model to learn cross-transaction patterns.
  • Both RNN (BiLSTM-based, ELMo-style) and Transformer (BERT-style) pre-training explored

Downstream tasks:

  • Transaction categorization: 31 classes (income, shopping, subscription, transport, savings, etc.). Fine-tuned pre-trained models beat all baselines.
  • Credit risk scoring: Binary classification of default risk. Pre-trained models significantly outperform non-pre-trained approaches.

Relevance to domainTokenizer: This is the closest existing work to an e-commerce transaction tokenizer. The multimodal composite tokenization approach (date + amount + text) is directly applicable.


5.4 LETTER: Learnable Item Tokenization

Full title: "Learnable Item Tokenization for Generative Recommendation" Authors: Wenjie Wang, Honghui Bao, et al. Link: arXiv: 2405.07314 | GitHub 153⭐

What it does: LETTER addresses three limitations of prior item tokenization methods:

  1. ID-based: No semantic information, can't generalize to new items
  2. Text-based: Lose collaborative signals (who bought what with what)
  3. Codebook-based (RQ-VAE): Suffer from code assignment bias (popular items get all the good codes)

LETTER's solution β€” a learnable tokenizer with three objectives:

  1. Semantic regularization: Tokenizer's codebook should respect semantic similarity (similar items → similar codes)
  2. Contrastive alignment: Tokens should capture collaborative filtering signals (items bought together → nearby in token space)
  3. Diversity loss: Prevent codebook collapse by ensuring all codes are used, not just a few popular ones

Architecture:

  • Uses Residual Quantized VAE (like TIGER) as the base tokenizer
  • Adds the three losses above during tokenizer training
  • The tokenizer is trained jointly with (or alternately with) the generative recommendation model

Key results:

  • Outperforms TIGER, P5, and other generative recommendation baselines
  • Particularly strong on long-tail items (items with few interactions) due to the diversity loss

Relevance to domainTokenizer: LETTER shows that the tokenizer itself should be a learnable model trained with domain-specific objectives, not just a fixed preprocessing step.


5.5 TP-BERTa: Numerical Value Tokenization

Full title: "Making Pre-trained Language Models Great on Tabular Prediction" Authors: Jiahuan Yan, et al. Link: arXiv: 2403.01841

What it does: Solves the fundamental problem of representing numerical feature values as tokens. Standard text tokenizers fragment numbers meaninglessly. TP-BERTa introduces Relative Magnitude Tokenization (RMT).

Relative Magnitude Tokenization: Instead of tokenizing the raw number "$79.99" as text:

  1. Compute the feature's distribution across the dataset
  2. Express each value as its relative position in that distribution
  3. Discretize into bins: "very_low", "low", "medium", "high", "very_high" (or finer)
  4. The token is the bin label, which preserves ordinal relationships

Example:

price = $79.99
→ Within the "price" feature distribution, $79.99 is at the 73rd percentile
→ Token: "price_bin_73" or "price_high"
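
A sketch of how RMT-style binning could be implemented with per-feature quantile edges; the bin count and the token naming here are assumptions rather than TP-BERTa's exact scheme:

import numpy as np

def fit_quantile_edges(values: np.ndarray, n_bins: int = 100) -> np.ndarray:
    """Compute per-feature quantile boundaries from the training distribution."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

def rmt_token(feature_name: str, value: float, edges: np.ndarray) -> str:
    """Map a raw value to its relative-magnitude bin token for one feature."""
    idx = int(np.clip(np.searchsorted(edges, value) - 1, 0, len(edges) - 2))
    return f"{feature_name}_bin_{idx}"

train_prices = np.random.default_rng(0).lognormal(mean=3.5, sigma=1.0, size=10_000)
edges = fit_quantile_edges(train_prices)
print(rmt_token("price", 79.99, edges))  # e.g., "price_bin_73"; the bin depends on the fitted distribution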

Intra-Feature Attention: Each feature value is paired with its feature name:

"price" β†’ [price_name_embedding]
"$79.99" β†’ [price_value_embedding via RMT]

Intra-feature attention binds them, so the model knows this number means "price" not "quantity" or "weight".

Key results:

  • TP-BERTa is competitive with XGBoost and LightGBM on standard tabular benchmarks
  • Significantly outperforms other deep learning approaches on tabular data
  • The pre-trained model transfers across different tables

Relevance to domainTokenizer: RMT solves the critical problem of numerical tokenization. Every domain tokenizer will need to handle numbers (prices, amounts, quantities, durations), and RMT is currently the best approach.


5.6 Meta-Transformer: 12 Modalities, One Token Space

Full title: "Meta-Transformer: A Unified Framework for Multimodal Learning" Authors: Yiyuan Zhang, Kaixiong Gong, et al. Link: arXiv: 2307.10802 | GitHub 1652⭐

What it does: Demonstrates that a single frozen Transformer encoder can process 12 different modalities, including time series and tabular data, by projecting each modality into a shared token space via modality-specific tokenizers.

Modality-specific tokenizers:

  • Text: standard embedding
  • Image: patch embedding (ViT-style)
  • Audio: spectrogram patches
  • Time series: segment embedding (chop time series into fixed-length segments, project each to a token)
  • Tabular: feature-wise embedding (each column value becomes a token)
  • Graph: node feature embedding
  • Point cloud: point group embedding

Key insight: The tokenizers are lightweight (small learnable projections), and the Transformer encoder is frozen, trained once and shared across all modalities. This means the bulk of the computation is modality-agnostic.
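
A minimal sketch of one such lightweight tokenizer, here for time series: chop the series into fixed-length segments and project each segment into the shared token dimension (segment length and model width below are assumptions):

import torch
import torch.nn as nn

class TimeSeriesTokenizer(nn.Module):
    """Segment a univariate time series into fixed-length patches and project each to a token."""
    def __init__(self, segment_len: int = 16, d_model: int = 768):
        super().__init__()
        self.segment_len = segment_len
        self.proj = nn.Linear(segment_len, d_model)  # lightweight, learnable projection

    def forward(self, series: torch.Tensor) -> torch.Tensor:
        # series: (batch, length); length is assumed to be a multiple of segment_len in this sketch
        batch, length = series.shape
        segments = series.view(batch, length // self.segment_len, self.segment_len)
        return self.proj(segments)                   # (batch, n_segments, d_model) token sequence

tokens = TimeSeriesTokenizer()(torch.randn(4, 128))
print(tokens.shape)  # torch.Size([4, 8, 768]); these tokens feed the shared (frozen) Transformer encoder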

Relevance to domainTokenizer: Meta-Transformer proves the viability of the unified approach. A domain tokenizer could use a similar architecture: lightweight domain-specific tokenizers feeding into a shared Transformer backbone.


6. Tokenization Methods: A Technical Taxonomy

6.1 Quantization-Based (RQ-VAE, VQ-VAE)

How it works:

  • Train a Vector Quantized Variational Autoencoder on item embeddings
  • The encoder maps items to a continuous latent space
  • The quantization layer maps each embedding to the nearest entry in a learned codebook
  • Residual Quantization (RQ): apply quantization iteratively on residuals for multi-token representations
  • The decoder reconstructs the original embedding from the quantized codes

Strengths:

  • Produces hierarchically structured tokens (coarse-to-fine)
  • Items with similar content naturally share token prefixes
  • Controllable vocabulary size (codebook size × number of levels)

Weaknesses:

  • Codebook collapse (some codes rarely used)
  • Training instability (requires commitment loss, EMA updates, etc.)
  • No collaborative signal unless explicitly added (see LETTER)

Used by: TIGER, LETTER, PRISM, MMGRec, MiniOneRec, GenRec

6.2 BPE-Inspired Merging

How it works:

  • Start with atomic features as the base vocabulary
  • Count co-occurrence frequencies of feature pairs in the corpus
  • Merge the most frequent pair into a new composite token
  • Repeat until desired vocabulary size

Strengths:

  • Naturally discovers meaningful composite patterns
  • Context-aware (merges depend on surrounding actions)
  • Directly analogous to text BPE, with well-understood properties
  • No neural network training required for vocabulary construction

Weaknesses:

  • Greedy algorithm; may not find a globally optimal vocabulary
  • Requires careful handling of unordered feature sets (set permutation regularization)
  • Vocabulary depends on corpus statistics and may not generalize under distribution shift

Used by: ActionPiece

6.3 Magnitude & Binning Approaches

How it works:

  • For numerical values: compute distribution statistics, discretize into bins
  • Options: uniform bins, quantile bins, logarithmic bins, adaptive bins
  • For timestamps: calendar tokens (day-of-week, month, etc.) or relative encodings

Strengths:

  • Simple, interpretable, no training required
  • Preserves ordinal relationships
  • Handles numerical data natively (no text conversion)

Weaknesses:

  • Fixed granularity (bin resolution)
  • Information loss at bin boundaries
  • Requires domain knowledge to choose binning strategy

Used by: TP-BERTa, Banking Transaction Flow, Temporal Tokenization Strategies

6.4 Learnable End-to-End Tokenizers

How it works:

  • A neural network (encoder) maps raw domain data to discrete tokens
  • The tokenizer is trained end-to-end with the downstream model
  • Uses techniques like Gumbel-Softmax for differentiable discretization
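
A minimal sketch of the differentiable-discretization idea using Gumbel-Softmax: an encoder produces logits over a code vocabulary, a hard one-hot choice is sampled with straight-through gradients, and the selected code embedding is passed downstream; dimensions and vocabulary size are illustrative:

import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableTokenizer(nn.Module):
    """Map a raw event feature vector to one discrete code, trainable end to end (sketch)."""
    def __init__(self, in_dim: int = 32, n_codes: int = 512, d_model: int = 128):
        super().__init__()
        self.to_logits = nn.Linear(in_dim, n_codes)
        self.code_embeddings = nn.Embedding(n_codes, d_model)

    def forward(self, event_features: torch.Tensor, tau: float = 1.0):
        logits = self.to_logits(event_features)                 # (batch, n_codes)
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)  # discrete choice, differentiable via straight-through
        token_ids = one_hot.argmax(dim=-1)                      # the discrete token id
        token_embs = one_hot @ self.code_embeddings.weight      # gradients flow into the codebook
        return token_ids, token_embs

ids, embs = LearnableTokenizer()(torch.randn(8, 32))
print(ids.shape, embs.shape)  # torch.Size([8]) torch.Size([8, 128])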

Strengths:

  • Tokenizer adapts to the downstream task
  • Can incorporate multiple objectives (semantic, collaborative, diversity)
  • No manual design of tokenization rules

Weaknesses:

  • More complex training (joint optimization)
  • Risk of tokenizer-model co-adaptation (poor generalization)
  • Harder to interpret what tokens mean

Used by: LETTER, UniGRec, ContRec, MANTa

6.5 Serialization-Based (Text Templates)

How it works:

  • Convert each data record to a natural language string: "The customer bought Sony WH-1000XM5 headphones for $349.99 using a credit card on March 15, 2025."
  • Use a standard text tokenizer (BPE) on the serialized string
  • Feed to a pre-trained LLM
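
A sketch of the serialization step for one transaction record; the field names and template wording are illustrative:

def serialize_transaction(txn: dict) -> str:
    """Turn one transaction record into a natural-language string for an off-the-shelf LLM."""
    return (f"The customer bought {txn['product']} for ${txn['price']:.2f} "
            f"using a {txn['payment_method']} on {txn['date']}.")

print(serialize_transaction({
    "product": "Sony WH-1000XM5 headphones",
    "price": 349.99,
    "payment_method": "credit card",
    "date": "March 15, 2025",
}))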

Strengths:

  • Zero engineering: use off-the-shelf LLMs
  • Benefits from LLM's pre-trained world knowledge
  • Handles heterogeneous schemas easily

Weaknesses:

  • Extremely token-inefficient (one row might become 100+ tokens)
  • Numerical values still poorly handled by text tokenizers
  • Requires large models to work well (no "small model" possibility)
  • No exploitation of domain structure

Used by: TabuLa-8B, TabSTAR (partially), various LLM-for-tabular approaches


7. The domainTokenizer Blueprint: How to Build It

7.1 Architecture Design

Based on the research, domainTokenizer should have three components:

+--------------------------------------------------------+
|                    domainTokenizer                     |
|                                                        |
|  +---------------+   +--------------+   +-----------+  |
|  |  Domain       |   |  Transformer |   |  Task     |  |
|  |  Tokenizer    |-->|  Backbone    |-->|  Heads    |  |
|  |  (learnable)  |   |  (small)     |   |           |  |
|  +---------------+   +--------------+   +-----------+  |
|                                                        |
|  Tokenizer: domain events -> discrete tokens           |
|  Backbone:  sequence modeling via attention            |
|  Heads:     task-specific outputs                      |
+--------------------------------------------------------+

Domain Tokenizer (per-domain, learnable):

  • Handles the conversion of raw domain events into discrete tokens
  • Combines multiple strategies: RQ-VAE for items, magnitude binning for numbers, BPE-like merging for feature compositions, calendar encoding for timestamps
  • Small and fast (a few million parameters at most)

Transformer Backbone (shared, small):

  • Standard causal or bidirectional Transformer
  • Target sizes: 10M, 50M, 150M, 350M parameters
  • Pre-trained on domain sequences with self-supervised objectives
  • Potentially shareable across related domains

Task Heads (per-task):

  • Classification head for fraud detection, churn prediction, etc.
  • Generation head for next-event prediction, recommendation
  • Regression head for value prediction (LTV, credit score, etc.)

7.2 Tokenizer Construction Pipeline

For a given domain (e.g., e-commerce), the tokenizer construction follows:

Step 1: Schema Analysis

# Identify field types in the domain data
schema = {
    "product_id": "categorical_entity",    # β†’ Semantic ID via RQ-VAE
    "category": "categorical_fixed",       # β†’ direct vocabulary mapping
    "price": "numerical_continuous",       # β†’ magnitude binning (RMT)
    "quantity": "numerical_discrete",      # β†’ small fixed vocabulary
    "timestamp": "temporal",               # β†’ calendar + relative encoding
    "description": "text",                 # β†’ standard BPE (subword)
    "payment_method": "categorical_small", # β†’ direct mapping
    "customer_id": "entity_id",            # β†’ learned embedding or behavioral cluster
}

Step 2: Per-Field Tokenization

| Field Type | Method | Output |
|---|---|---|
| Categorical entity (products) | RQ-VAE Semantic IDs | Tuple of K codebook indices |
| Categorical fixed (categories) | Direct vocab mapping | Single token index |
| Numerical continuous (prices) | Relative Magnitude Tokenization | Bin token |
| Temporal (timestamps) | Calendar tokens + relative delta | 2-3 tokens (day-of-week, time-of-day, delta) |
| Text (descriptions) | Standard BPE | Variable-length subword tokens |
| Entity ID (customers) | Behavioral clustering or learned embedding | Single token or short sequence |
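
To tie Steps 1 and 2 together, a sketch of a per-field dispatcher is shown below; the toy per-field tokenizers are hypothetical stand-ins for the real methods in the table above:

from datetime import datetime

def tokenize_event(event: dict, field_tokenizers: dict) -> list[str]:
    """Apply the per-field tokenizers from Step 2 to one event and concatenate the results."""
    tokens = []
    for field, tokenize in field_tokenizers.items():
        value = event.get(field)
        tokens.extend(tokenize(value) if value is not None else [f"{field}_[MISSING]"])
    return tokens

# Toy per-field tokenizers standing in for the real methods (Semantic IDs, RMT, calendar encoding)
field_tokenizers = {
    "product_id": lambda v: [f"sem_{c}" for c in (42, 187, 23, 91)],  # stand-in for an RQ-VAE Semantic ID lookup
    "category":   lambda v: [f"cat_{v}"],                             # direct vocabulary mapping
    "price":      lambda v: [f"price_bin_{min(int(v // 10), 49)}"],   # stand-in for RMT binning
    "timestamp":  lambda v: [f"dow_{v.weekday()}", f"hour_{v.hour}"], # calendar tokens
}

event = {"product_id": "SKU-8847291", "category": "Headphones",
         "price": 79.99, "timestamp": datetime(2025, 3, 15, 14, 30)}
print(tokenize_event(event, field_tokenizers))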

Step 3: Composite Token Construction (BPE-like)

Following ActionPiece, apply a BPE-like merge algorithm on the composite per-field tokens to discover meaningful multi-field patterns:

Initial: [Electronics] [price_high] [CreditCard] [Weekday]
After merging: [Electronics+price_high] [CreditCard+Weekday]
Further: [HighEndElectronicsPurchase] [WeekdayCreditCard]

Step 4: Special Tokens

[SEP]       - separates transactions in a sequence
[DAY_SEP]   - separates days
[PAD]       - padding
[MASK]      - for masked pre-training
[CLS]       - sequence-level representation
[UNK]       - unknown/out-of-vocabulary events

7.3 Pre-training Objectives

Based on the literature, the following self-supervised objectives are most effective:

1. Masked Event Prediction (MEP) - BERT-style (a masking sketch follows the fourth objective below)

  • Mask 15% of complete events (not just individual tokens within an event)
  • Predict all tokens of the masked event
  • Forces the model to learn cross-event patterns

2. Next Event Prediction (NEP) - GPT-style

  • Given a sequence of events, predict the next event autoregressively
  • Generate the event's token sequence (e.g., Semantic ID) token by token
  • The primary objective for generative recommendation

3. Contrastive Sequence Learning

  • Similar customer sequences should have similar representations
  • Push apart sequences from different behavioral clusters
  • Helps with customer segmentation and transfer learning

4. Temporal Ordering

  • Given a shuffled sequence, predict the correct temporal order
  • Forces the model to learn temporal patterns (seasonality, cadence, trends)
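
A sketch of what objective 1 (MEP) means at the data level: masking is applied to whole events, i.e., every token belonging to a sampled event, rather than to individual tokens; the 15% rate follows the description above and everything else is illustrative:

import random

def mask_whole_events(event_token_spans: list[list[str]], mask_rate: float = 0.15):
    """Mask complete events (all of their tokens), BERT-style, for masked event prediction."""
    inputs, targets = [], []
    for event_tokens in event_token_spans:
        if random.random() < mask_rate:
            inputs.extend(["[MASK]"] * len(event_tokens))  # hide the entire event
            targets.extend(event_tokens)                   # the model must reconstruct all of its tokens
        else:
            inputs.extend(event_tokens)
            targets.extend([None] * len(event_tokens))     # no loss on unmasked positions
        inputs.append("[SEP]")
        targets.append(None)
    return inputs, targets

sequence = [["sem_42", "sem_187", "price_bin_37", "dow_3"],
            ["sem_8", "sem_31", "price_bin_12", "dow_5"]]
print(mask_whole_events(sequence))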

7.4 Downstream Task Adaptation

Once pre-trained, the model can be fine-tuned for specific tasks:

| Task | Adaptation Method | Head |
|---|---|---|
| Next purchase prediction | Continue NEP, decode Semantic IDs | Generative (autoregressive) |
| Fraud detection | Fine-tune on labeled transactions | Binary classifier on [CLS] |
| Customer segmentation | Extract [CLS] embeddings, cluster | No head (use embeddings) |
| Churn prediction | Fine-tune on labeled sequences | Binary classifier on [CLS] |
| Credit scoring | Fine-tune on labeled customer histories | Regression or classification |
| Demand forecasting | Adapt temporal patterns | Regression on quantity tokens |
| Product recommendation | NEP with Semantic ID decoding | Generative (beam search) |

8. Use Case Walkthrough: E-Commerce Transaction Model

The Scenario

An e-commerce platform with:

  • 2M customers
  • 500K products
  • 100M transactions over 2 years
  • Each transaction: (customer_id, product_id, category, price, quantity, timestamp, payment_method, device)

Step 1: Build the Tokenizer

Product Semantic IDs:

# 1. Generate product embeddings from title + description
product_embeddings = sentence_encoder(product_titles + product_descriptions)  # 500K × 768

# 2. Train RQ-VAE with 4 codebooks of 256 entries each
rq_vae = ResidualQuantizedVAE(n_codebooks=4, codebook_size=256)
rq_vae.fit(product_embeddings)

# 3. Each product gets a 4-token Semantic ID
product_semantic_ids = rq_vae.encode(product_embeddings)  # 500K × 4
# e.g., Headphones → [42, 187, 23, 91]

Price Tokenization (RMT):

# Compute percentile bins
price_bins = compute_quantile_bins(all_prices, n_bins=50)
# $79.99 → "price_bin_37" (37th percentile bin)

Timestamp Tokenization:

# Calendar features + relative delta
def tokenize_timestamp(ts, prev_ts):
    return [
        day_of_week_token(ts),      # "wednesday"
        time_of_day_token(ts),       # "afternoon"  
        delta_token(ts - prev_ts),   # "2_days_later"
    ]

Composite vocabulary construction (BPE-like):

# Run ActionPiece-style merging on the corpus of tokenized transaction sequences
vocabulary = actionpiece_vocab_construction(
    corpus=all_tokenized_transactions,
    target_vocab_size=8192,
    consider_intra_event=True,   # merge features within a transaction
    consider_inter_event=True,   # merge features across adjacent transactions
)

Step 2: Pre-train

# Tokenize all 100M transactions
tokenized_corpus = tokenize_all_transactions(transactions, tokenizer)

# Pre-train a small Transformer (150M params)
model = TransformerLM(
    vocab_size=8192 + special_tokens,
    d_model=768,
    n_heads=12,
    n_layers=12,
    max_seq_len=256,  # ~256 transactions per customer
)

# Self-supervised pre-training with MEP + NEP
train(model, tokenized_corpus, objectives=["masked_event", "next_event"])

Step 3: Fine-tune & Deploy

# Example: Fraud detection
fraud_model = add_classification_head(model, n_classes=2)
fine_tune(fraud_model, labeled_fraud_data)

# Example: Next purchase recommendation
rec_model = model  # Use generative mode directly
next_item_semantic_id = rec_model.generate(customer_transaction_sequence)
next_item = semantic_id_to_product[tuple(next_item_semantic_id)]  # Look up the product whose Semantic ID matches (rq_vae.decode returns an embedding, not a product)

9. Open Challenges and Research Gaps

9.1 Vocabulary Evolution

Products are added and removed constantly. Semantic IDs need to be recomputed, which may invalidate the model's learned associations. Partial solutions: periodic re-indexing (TIGER), using content features that are stable even when the catalog changes.

9.2 Cross-Domain Transfer

Can a tokenizer trained on e-commerce data transfer to banking? The field-level tokenizers (RMT for numbers, calendar for dates) should transfer, but composite vocabularies are domain-specific. Open question: is there a "universal domain tokenizer" or will each domain need its own?

9.3 Numerical Precision

All current methods lose some numerical precision through discretization. For applications where exact values matter (financial auditing, pricing optimization), this is a limitation. Potential solution: hybrid approaches that combine discrete tokens with continuous residuals.

9.4 Handling Missing Data

Real business data is full of missing values. Text tokenizers never face this issue. Domain tokenizers need explicit strategies: [MISSING] tokens, imputation, or learning to model missingness as a signal.

9.5 Privacy & Fairness

Tokenizing customer behavior raises privacy concerns. Semantic IDs could encode sensitive attributes (demographic patterns, financial status) in ways that are hard to audit. Domain tokenizers should be designed with fairness constraints.

9.6 Scalability of BPE-Like Merging

ActionPiece's vocabulary construction is O(N × V) per merge step. For very large corpora (billions of events) and feature spaces (thousands of features), this may become prohibitively expensive. Potential solution: approximate counting, hierarchical merging, or neural vocabulary construction.

9.7 Evaluation Standards

There are no standard benchmarks for "domain tokenization quality." Text tokenizers can be evaluated by compression ratio and downstream perplexity. Domain tokenizers need domain-specific metrics: recommendation quality, prediction accuracy, calibration, etc.

9.8 Connection to Continual Learning

The HOPE / Nested Learning paradigm (see Section 11) suggests that models should continuously learn from new data. Domain tokenizers that can incrementally update their vocabularies (adding new product tokens, retiring obsolete ones) without full retraining would be highly valuable.


10. Complete Paper Reference Table

| # | Paper | Year | ArXiv | Domain | Key Contribution | GitHub |
|---|---|---|---|---|---|---|
| 1 | TIGER | 2023 | 2305.05065 | Recommendation | Semantic IDs via RQ-VAE for generative retrieval | 781⭐ |
| 2 | Semantic IDs (YouTube) | 2023 | 2306.08121 | Recommendation | Content-derived IDs at industry scale | - |
| 3 | ActionPiece | 2025 | 2502.13581 | Recommendation | BPE-like context-aware action tokenization | 53⭐ |
| 4 | LETTER | 2024 | 2405.07314 | Recommendation | Learnable tokenizer with semantic+collaborative+diversity | 153⭐ |
| 5 | SETRec | 2025 | 2502.10833 | Recommendation | Order-agnostic set identifiers | - |
| 6 | ContRec | 2025 | 2504.12007 | Recommendation | Continuous tokens via sigma-VAE + diffusion | - |
| 7 | GenRec | 2026 | 2604.14878 | Recommendation | Page-wise NTP for large-scale recommendation | - |
| 8 | MBGen | 2024 | 2405.16871 | Recommendation | Multi-behavior (view/click/buy) as token types | 57⭐ |
| 9 | RSLLM | 2024 | 2412.16933 | Recommendation | Recommendation as a new language in LLMs | - |
| 10 | PRISM | 2025 | 2601.16556 | Recommendation | Purified quantization for semantic tokenization | - |
| 11 | MMGRec | 2024 | 2404.16555 | Recommendation | Graph RQ-VAE for multimodal items | - |
| 12 | UniGRec | 2025 | 2601.17438 | Recommendation | Soft item identifiers for end-to-end optimization | - |
| 13 | Semantic IDs for Search+Rec | 2025 | 2508.10478 | Recommendation | Joint search and recommendation Semantic IDs | - |
| 14 | Banking Transaction Flow | 2024 | 2410.08243 | Finance | Composite tokenizer for (date, amount, text) transactions | - |
| 15 | LBSF | 2024 | 2411.15056 | Finance | Long-term payment behavior folding by merchant | - |
| 16 | Temporal Tokenization | 2025 | 2512.13618 | Events | Systematic comparison of temporal tokenization strategies | - |
| 17 | FinTRec | 2025 | 2511.14865 | Finance | Transformer for long-range financial recommendation | - |
| 18 | TIMeSynC | 2024 | 2410.12825 | Finance | Temporal intent prediction in financial services | - |
| 19 | TP-BERTa | 2024 | 2403.01841 | Tabular | Relative Magnitude Tokenization for numbers | - |
| 20 | TabuLa-8B | 2024 | 2406.12031 | Tabular | Llama 3 fine-tuned on serialized tables | 71⭐ |
| 21 | TabSTAR | 2025 | 2505.18125 | Tabular | Semantically target-aware tabular foundation model | 83⭐ |
| 22 | UniTabE | 2023 | 2307.09249 | Tabular | Universal tabular pretraining protocol | - |
| 23 | TARTE | 2025 | 2505.14415 | Tabular | Knowledge-enhanced tabular representations | - |
| 24 | TabICL | 2025 | 2502.05564 | Tabular | Column-then-row attention, scales to 500K samples | - |
| 25 | Meta-Transformer | 2023 | 2307.10802 | Universal | 12 modalities in one token space | 1652⭐ |
| 26 | Emu3 | 2024 | 2409.18869 | Universal | NTP is all you need across modalities | 2400⭐ |
| 27 | Unified-IO 2 | 2023 | 2312.17172 | Universal | Image+text+audio+action in one model | 647⭐ |
| 28 | NTP Multimodal Survey | 2024 | 2412.18619 | Survey | Taxonomy of multimodal tokenization + NTP | 478⭐ |
| 29 | LongCat-Next | 2025 | 2603.27538 | Universal | Lexicalizing modalities as discrete tokens | 409⭐ |
| 30 | Tabular Data Survey | 2024 | 2408.10548 | Survey | Comprehensive survey of LMs for tabular data | 33⭐ |
| 31 | KL3M Tokenizers | 2025 | 2503.17247 | Legal/Finance | Domain-specific BPE for professional text | GitHub |

11. Related Concepts: Nested Learning & Continual Adaptation

An important related development is the Nested Learning paradigm introduced by Google Research (arXiv: 2512.24695, by Ali Behrouz et al.), which presents the HOPE architecture.

Why Nested Learning Matters for Domain Tokenization

Current Transformer-based models are "frozen" after pre-training; they cannot incorporate new knowledge without retraining. For domain tokenization, this means:

  • A recommendation model can't learn about new products added after training
  • A fraud detection model can't adapt to new fraud patterns in real-time
  • A customer model can't update its understanding of a customer's evolving preferences

The HOPE architecture addresses this via:

  1. Continuum Memory System (CMS): Multiple MLP blocks updating at different frequencies; some update every few tokens (catching immediate patterns), while others update only after millions of tokens (storing persistent knowledge). This prevents catastrophic forgetting.
  2. Self-Modifying Titans: The model's projection layers update themselves in real-time based on incoming data, enabling continuous adaptation.

For domainTokenizer, the implication is: a domain model built with Nested Learning principles could continuously learn from new transactions, adapting its understanding of products, customer preferences, and behavioral patterns without retraining from scratch.

This is an area of active exploration for future versions of domainTokenizer.

For the full research report on Nested Learning, see the HOPE / Nested Learning discussion on HF Papers.


This report is a living document and will be updated as the domainTokenizer project evolves.