# Domain Tokenization: Beyond Words — A Research Report

> **Building small models that understand domain tokens, not just words.**
>
> *Last updated: April 2026*

---

## Table of Contents

1. [Executive Summary](#1-executive-summary)
2. [The Problem: Why Words Are Not Enough](#2-the-problem-why-words-are-not-enough)
3. [The Core Insight: Anything Can Be a Token](#3-the-core-insight-anything-can-be-a-token)
4. [Research Landscape: Five Paradigms of Domain Tokenization](#4-research-landscape-five-paradigms-of-domain-tokenization)
   - 4.1 [Semantic ID Tokenization (Products & Items)](#41-semantic-id-tokenization-products--items)
   - 4.2 [Action Sequence Tokenization (User Behaviors)](#42-action-sequence-tokenization-user-behaviors)
   - 4.3 [Financial Transaction Tokenization](#43-financial-transaction-tokenization)
   - 4.4 [Tabular Feature Tokenization](#44-tabular-feature-tokenization)
   - 4.5 [Universal Modality Tokenization](#45-universal-modality-tokenization)
5. [Key Papers: Detailed Analysis](#5-key-papers-detailed-analysis)
   - 5.1 [TIGER — Semantic IDs for Generative Retrieval](#51-tiger--semantic-ids-for-generative-retrieval)
   - 5.2 [ActionPiece — BPE for User Actions](#52-actionpiece--bpe-for-user-actions)
   - 5.3 [Banking Transaction Flow — Transactions as Tokens](#53-banking-transaction-flow--transactions-as-tokens)
   - 5.4 [LETTER — Learnable Item Tokenization](#54-letter--learnable-item-tokenization)
   - 5.5 [TP-BERTa — Numerical Value Tokenization](#55-tp-berta--numerical-value-tokenization)
   - 5.6 [Meta-Transformer — 12 Modalities, One Token Space](#56-meta-transformer--12-modalities-one-token-space)
6. [Tokenization Methods: A Technical Taxonomy](#6-tokenization-methods-a-technical-taxonomy)
   - 6.1 [Quantization-Based (RQ-VAE, VQ-VAE)](#61-quantization-based-rq-vae-vq-vae)
   - 6.2 [BPE-Inspired Merging](#62-bpe-inspired-merging)
   - 6.3 [Magnitude & Binning Approaches](#63-magnitude--binning-approaches)
   - 6.4 [Learnable End-to-End Tokenizers](#64-learnable-end-to-end-tokenizers)
   - 6.5 [Serialization-Based (Text Templates)](#65-serialization-based-text-templates)
7. [The domainTokenizer Blueprint: How to Build It](#7-the-domaintokenizer-blueprint-how-to-build-it)
   - 7.1 [Architecture Design](#71-architecture-design)
   - 7.2 [Tokenizer Construction Pipeline](#72-tokenizer-construction-pipeline)
   - 7.3 [Pre-training Objectives](#73-pre-training-objectives)
   - 7.4 [Downstream Task Adaptation](#74-downstream-task-adaptation)
8. [Use Case Walkthrough: E-Commerce Transaction Model](#8-use-case-walkthrough-e-commerce-transaction-model)
9. [Open Challenges and Research Gaps](#9-open-challenges-and-research-gaps)
10. [Complete Paper Reference Table](#10-complete-paper-reference-table)
11. [Related Concepts: Nested Learning & Continual Adaptation](#11-related-concepts-nested-learning--continual-adaptation)

---

## 1. Executive Summary

Large Language Models (LLMs) process text by breaking it into **tokens** — subword units learned via algorithms like BPE (Byte-Pair Encoding). This tokenization is the foundation that allows Transformers to model sequential patterns via next-token prediction.

But words are just one type of sequential data. Businesses generate vast amounts of **non-textual sequential data** every day:

- **E-commerce:** millions of purchase transactions, each with product IDs, amounts, timestamps, categories
- **Banking:** transaction flows with dates, amounts, merchant codes, and descriptions
- **Healthcare:** sequences of diagnoses, procedures, lab results, medications
- **Advertising:** impression → click → conversion funnels with bid amounts and user features
- **Logistics:** shipping events, warehouse movements, delivery status sequences

**The central question this project explores:** Can we build tokenizers that encode these domain-specific entities — products, transactions, medical codes, user actions — as first-class tokens, and then train small, efficient Transformer models that understand domain patterns the way LLMs understand language?

**The answer from recent research is a resounding yes.** This report surveys 25+ papers spanning 2021–2026 that collectively establish a new paradigm: **domain tokenization**. The key findings are:

1. **Semantic IDs** (Google, 2023): Products can be encoded as tuples of discrete tokens derived from their content embeddings via quantization (RQ-VAE). A Transformer trained on sequences of these Semantic IDs outperforms traditional recommendation systems and generalizes to unseen items.

2. **Action tokenization** (Google DeepMind, 2025): User action sequences can be tokenized using a BPE-like algorithm that merges frequently co-occurring features — the same algorithm that powers text tokenization, applied to business events instead of characters.

3. **Transaction tokenization** (2024): Banking transactions — multimodal events of (date, amount, text) — can be encoded as composite tokens and modeled with self-supervised pre-training, achieving state-of-the-art on fraud detection and credit scoring.

4. **Tabular tokenization** (2024–2025): Individual feature values (numerical, categorical) can be tokenized via relative magnitude encoding or serialization, enabling foundation models that transfer across different tabular datasets.

5. **Universal tokenization** (2023–2024): Frameworks like Meta-Transformer demonstrate that 12+ modalities including time series and tabular data can be projected into a shared token space and processed by a single frozen Transformer.

This report details each paradigm, provides technical depth on the tokenization methods, and lays out a concrete blueprint for building domainTokenizer.

---

## 2. The Problem: Why Words Are Not Enough

### 2.1 The Mismatch Between Business Data and Text Tokens

When an e-commerce platform processes a customer's purchase history, the raw data looks like:

```
customer_42 | 2025-03-15 | SKU-8847291 | Electronics > Headphones | $79.99 | Credit Card | qty: 1
customer_42 | 2025-03-15 | SKU-3321098 | Electronics > Cables    | $12.49 | Credit Card | qty: 2
customer_42 | 2025-04-01 | SKU-5519273 | Books > Technical       | $44.95 | Debit Card  | qty: 1
```

If you feed this to a standard LLM tokenizer (e.g., OpenAI's `cl100k_base`), you get something like the following (a short check after this list demonstrates the effect):

- `SKU-8847291` → split into meaningless subword fragments like `SK`, `U-`, `884`, `72`, `91`
- `$79.99` → tokenized as `$`, `79`, `.`, `99` — losing the semantic meaning of "a mid-range purchase"
- `2025-03-15` → fragmented into date components with no temporal understanding
- The **relationships** between fields (this amount goes with this product in this category) are lost in a flat token stream
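
The fragmentation is easy to verify. The minimal check below assumes the `tiktoken` package is installed; the exact splits depend on the vocabulary, but none of the pieces carries domain meaning.

```python
# Quick check of how a general-purpose BPE vocabulary fragments domain fields.
# Assumes `pip install tiktoken`; exact splits vary by encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for field in ["SKU-8847291", "$79.99", "2025-03-15"]:
    pieces = [enc.decode([token_id]) for token_id in enc.encode(field)]
    print(f"{field!r} -> {pieces}")
```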

**The fundamental problem:** text tokenizers are optimized for the statistical structure of natural language. They know that `ing` and `tion` are common suffixes, that `the` is frequent, that `un-` is a prefix. They know nothing about:

- Product similarity (headphones and earbuds are related)
- Price ranges ($79.99 is "mid-range electronics" vs. $2,499 is "premium")
- Temporal patterns (weekly vs. monthly purchase cadence)
- Cross-field interactions (buying a cable right after headphones = accessory purchase)

### 2.2 The Opportunity: Domain Structure is Richer Than Language

Business domains have structure that goes beyond what text captures:

| Dimension | Language | Business Domain |
|-----------|----------|-----------------|
| **Vocabulary** | ~50K–256K subwords | Millions of SKUs, thousands of categories |
| **Sequence meaning** | Word order determines syntax | Temporal order determines behavioral patterns |
| **Similarity** | Semantic (synonyms, paraphrases) | Collaborative (users who buy X also buy Y) |
| **Numerical values** | Rare, incidental | Central (prices, quantities, timestamps) |
| **Compositionality** | Words compose into sentences | Features compose into events/transactions |
| **Temporal dynamics** | Mostly static semantics | Evolving trends, seasonal patterns |

A domain tokenizer should exploit all of this structure.

### 2.3 Why Small Models?

This project focuses on **small** models (tens of millions to low billions of parameters) because:

1. **Domain data is structured** — you don't need 70B parameters to learn that "users who buy phones often buy cases." The pattern space is narrower than open-domain language.
2. **Latency matters** — production systems need real-time inference (fraud detection, recommendations, pricing).
3. **Data efficiency** — most businesses have millions, not trillions, of training examples.
4. **Cost** — training and serving small models is orders of magnitude cheaper.
5. **Interpretability** — smaller models with domain-specific tokens are more auditable than black-box LLMs.

---

## 3. The Core Insight: Anything Can Be a Token

The survey **"Next Token Prediction Towards Multimodal Intelligence"** ([arXiv: 2412.18619](https://arxiv.org/abs/2412.18619), 59 upvotes) formalizes this principle:

> Next-Token Prediction (NTP) is a **universal training objective** that works across modalities. The bottleneck is not the model architecture — it's **tokenization**: how you map domain entities into discrete token spaces.

This means the entire LLM machinery — attention, scaling laws, in-context learning, transfer learning — becomes available for any domain once you solve the tokenization problem.

The precedent is clear across modalities:

| Modality | How It's Tokenized | Key Paper |
|----------|--------------------|-----------|
| **Text** | BPE / WordPiece / SentencePiece | GPT, BERT, Llama |
| **Images** | VQ-VAE, patch embeddings | DALL-E, ViT |
| **Audio** | Spectral codecs (EnCodec) | AudioLM, Whisper |
| **Video** | 3D causal VAE | HiTVideo, Emu3 |
| **Robotics actions** | Discrete Cosine Transform | FAST (2501.09747) |
| **Products/Items** | **Semantic IDs via RQ-VAE** | **TIGER** |
| **User actions** | **BPE on feature sets** | **ActionPiece** |
| **Transactions** | **Composite (date+amount+text)** | **Banking TF** |
| **Tabular features** | **Magnitude binning, serialization** | **TP-BERTa, TabuLa** |
| **Time series** | Scalar quantization, symbolic discretization | TokenCast, LLMTime |

The bottom half of this table — the business-domain entries — is where domainTokenizer operates.

---

## 4. Research Landscape: Five Paradigms of Domain Tokenization

### 4.1 Semantic ID Tokenization (Products & Items)

**Core idea:** Encode each item (product, video, song, article) as a **sequence of discrete semantic tokens** derived from its content features.

**How it works:**
1. Extract a dense embedding from item features (e.g., product title + description → SentenceT5 → 768-dim vector)
2. Apply **Residual Quantization (RQ-VAE)**: iteratively quantize the embedding into a sequence of codebook indices
3. The resulting tuple `(c1, c2, c3, ...)` is the item's **Semantic ID** — its "word" in the domain language
4. Train a Transformer to predict sequences of these Semantic IDs

**Key property:** Items with similar content share token prefixes, creating a hierarchical semantic structure:
```
Headphones A:  [Audio, 23, 7, 41]
Headphones B:  [Audio, 23, 7, 55]    ← shares 3/4 prefix tokens
Laptop C:      [Computing, 8, 31, 12] ← completely different tokens
```

**Papers:**
- **TIGER** (Google, 2023) — [arXiv: 2305.05065](https://arxiv.org/abs/2305.05065) — The landmark paper introducing Semantic IDs for recommendation. [GitHub 781⭐](https://github.com/EdoardoBotta/RQ-VAE-Recommender)
- **Semantic IDs at YouTube** (Google, 2023) — [arXiv: 2306.08121](https://arxiv.org/abs/2306.08121) — Deployed at industry scale, replacing random IDs
- **PRISM** (2025) — [arXiv: 2601.16556](https://arxiv.org/abs/2601.16556) — Purified quantization for better semantic tokenization
- **MMGRec** (2024) — [arXiv: 2404.16555](https://arxiv.org/abs/2404.16555) — Graph RQ-VAE incorporating multimodal item features
- **Semantic IDs for Joint Search & Rec** (2025) — [arXiv: 2508.10478](https://arxiv.org/abs/2508.10478) — Unified Semantic IDs across search and recommendation

### 4.2 Action Sequence Tokenization (User Behaviors)

**Core idea:** Don't just tokenize individual items — tokenize the **entire action sequence**, where each action is a composite event with multiple features.

**How it works:**
1. Represent each user action as an **unordered set of features**: `{category: Electronics, price_bin: $50-100, brand: Sony, payment: Credit}`
2. Apply a **BPE-like vocabulary construction** algorithm that merges frequently co-occurring feature patterns:
   - Count co-occurrence of feature pairs both within actions and across adjacent actions
   - Merge the most frequent pair into a new token
   - Repeat until desired vocabulary size is reached
3. The same action can be tokenized differently depending on surrounding context

**Key insight (from ActionPiece):** Just as BPE discovers that `t` + `h` + `e` should be merged into a single `the` token in English, the action tokenizer discovers that `{Electronics, $50-100}` should be merged into a single composite token because they co-occur frequently in purchase sequences.

**Papers:**
- **ActionPiece** (Google DeepMind, 2025) — [arXiv: 2502.13581](https://arxiv.org/abs/2502.13581) — First context-aware action sequence tokenizer. [GitHub 53⭐](https://github.com/google-deepmind/action_piece)
- **MBGen** (2024) — [arXiv: 2405.16871](https://arxiv.org/abs/2405.16871) — Multi-behavior generative recommendation (view, click, purchase as different token types). [GitHub 57⭐](https://github.com/anananan116/MBGen)
- **SETRec** (2025) — [arXiv: 2502.10833](https://arxiv.org/abs/2502.10833) — Order-agnostic set identifiers integrating collaborative + semantic signals
- **ContRec** (2025) — [arXiv: 2504.12007](https://arxiv.org/abs/2504.12007) — Continuous tokens via sigma-VAE + diffusion

### 4.3 Financial Transaction Tokenization

**Core idea:** Banking/financial transactions are **multimodal sequential events** (date + amount + description). Design a composite tokenizer that encodes all three modalities jointly.

**How it works (from Banking Transaction Flow paper):**
1. **Date tokenization:** Convert to day-of-week + relative time since last transaction
2. **Amount tokenization:** Quantize into logarithmic bins (captures the difference between $5 and $500 better than linear bins)
3. **Wording tokenization:** Standard BPE on the transaction description text (e.g., "AMAZON MARKETPLACE" → subword tokens)
4. **Composite token:** Combine date + amount + wording tokens into a single transaction representation
5. **Sequence ordering:** Within each day, sort transactions by ascending amount; across days, chronological order
6. **Pre-train** with masked transaction prediction (mask entire transactions, not just subwords)

**Papers:**
- **Banking Transaction Flow** (2024) — [arXiv: 2410.08243](https://arxiv.org/abs/2410.08243) — Custom tokenizer for banking transactions; pre-trained models outperform prior art on transaction categorization (31 classes) and credit risk scoring
- **LBSF** (2024) — [arXiv: 2411.15056](https://arxiv.org/abs/2411.15056) — Long-term payment behavior sequence folding by merchant, with multi-field behavior encoding
- **Temporal Tokenization Strategies** (2025) — [arXiv: 2512.13618](https://arxiv.org/abs/2512.13618) — Systematic comparison of how to tokenize timestamps for event sequences. Key finding: log-based encoding works best for skewed financial data
- **FinTRec** (2025) — [arXiv: 2511.14865](https://arxiv.org/abs/2511.14865) — Transformer for long-range financial product recommendation with temporally heterogeneous context
- **TIMeSynC** (2024) — [arXiv: 2410.12825](https://arxiv.org/abs/2410.12825) — Encoder-decoder transformer for sequential intent prediction in financial services

### 4.4 Tabular Feature Tokenization

**Core idea:** Each row in a table can be serialized as a sequence of tokens, and each feature value can be encoded meaningfully (not just as a text fragment).

**Key methods:**
- **Relative Magnitude Tokenization (RMT):** Instead of tokenizing "$79.99" as text fragments, discretize it relative to the feature's distribution → "percentile_75" or "bin_high". This preserves ordinal relationships.
- **Intra-Feature Attention:** Bind each value token to its column name via attention, so the model knows "$79.99" means "price is $79.99", not just a number.
- **Serialization:** Convert rows to natural language: `"price: $79.99, category: Electronics, brand: Sony"` — surprisingly effective with large enough models.

**Papers:**
- **TP-BERTa** (2024) — [arXiv: 2403.01841](https://arxiv.org/abs/2403.01841) — Relative Magnitude Tokenization + intra-feature attention. Competitive with XGBoost/LightGBM.
- **TabuLa-8B** (2024) — [arXiv: 2406.12031](https://arxiv.org/abs/2406.12031) — Llama 3-8B fine-tuned on serialized tabular data. Strong zero/few-shot. [GitHub 71⭐](https://github.com/mlfoundations/rtfm)
- **TabSTAR** (2025) — [arXiv: 2505.18125](https://arxiv.org/abs/2505.18125) — Foundation tabular model with semantically target-aware representations. [GitHub 83⭐](https://github.com/alanarazi7/TabSTAR). 112 upvotes.
- **UniTabE** (2023) — [arXiv: 2307.09249](https://arxiv.org/abs/2307.09249) — Universal pretraining protocol for tabular foundation models
- **TARTE** (2025) — [arXiv: 2505.14415](https://arxiv.org/abs/2505.14415) — Knowledge-enhanced tabular representations via pre-training on column names + table entries
- **TabICL** (2025) — [arXiv: 2502.05564](https://arxiv.org/abs/2502.05564) — Column-then-row attention, scales to 500K samples
- **Language Modeling on Tabular Data: A Survey** (2024) — [arXiv: 2408.10548](https://arxiv.org/abs/2408.10548) — Comprehensive survey. [GitHub 33⭐](https://github.com/lanxiang1017/language-modeling-on-tabular-data-survey)

### 4.5 Universal Modality Tokenization

**Core idea:** Project all modalities — including time series, tabular data, graphs — into a **shared discrete token space** and process them with a single Transformer.

**Papers:**
- **Meta-Transformer** (2023) — [arXiv: 2307.10802](https://arxiv.org/abs/2307.10802) — 12 modalities (text, image, audio, video, point cloud, **time series**, **tabular**, IMU, graph, etc.) via a unified tokenizer + frozen encoder. [GitHub 1652⭐](https://github.com/invictus717/MetaTransformer). 45 upvotes.
- **Emu3** (2024) — [arXiv: 2409.18869](https://arxiv.org/abs/2409.18869) — Next-token prediction is all you need across modalities. [GitHub 2400⭐](https://github.com/baaivision/emu3). 99 upvotes.
- **Unified-IO 2** (2023) — [arXiv: 2312.17172](https://arxiv.org/abs/2312.17172) — Images, text, audio, and actions in one autoregressive model. [GitHub 647⭐](https://github.com/allenai/unified-io-2). 30 upvotes.
- **NTP Multimodal Survey** (2024) — [arXiv: 2412.18619](https://arxiv.org/abs/2412.18619) — Comprehensive taxonomy of multimodal tokenization + NTP. [GitHub 478⭐](https://github.com/lmm101/awesome-multimodal-next-token-prediction). 59 upvotes.
- **LongCat-Next** (2025) — [arXiv: 2603.27538](https://arxiv.org/abs/2603.27538) — Lexicalizing modalities as discrete tokens. [GitHub 409⭐](https://github.com/meituan-longcat/LongCat-Next). 145 upvotes.

---

## 5. Key Papers: Detailed Analysis

### 5.1 TIGER — Semantic IDs for Generative Retrieval

**Full title:** "Recommender Systems with Generative Retrieval"
**Authors:** Shashank Rajput, Nikhil Mehta, Anima Singh, et al. (Google Research)
**Link:** [arXiv: 2305.05065](https://arxiv.org/abs/2305.05065) | [GitHub 781⭐](https://github.com/EdoardoBotta/RQ-VAE-Recommender)

**What it does:**
TIGER (Transformer Index for GEnerative Recommenders) replaces the traditional two-stage retrieve-and-rank pipeline with a single generative model. Each item is assigned a Semantic ID — a tuple of discrete codewords — and the model autoregressively generates the Semantic ID of the next item a user will interact with.

**Semantic ID generation process:**
```
Item features (title, description, ...) 
    → Pre-trained text encoder (SentenceT5)
    → Dense embedding (768-dim)
    → Residual Quantization (RQ-VAE)
    → Semantic ID: (c1, c2, c3, ..., cK)    # K codewords from K codebooks
```

**Residual Quantization (RQ):**
1. Quantize the embedding to the nearest codebook entry → c1
2. Compute the **residual** (difference between original and quantized)
3. Quantize the residual → c2
4. Repeat K times

This creates a **hierarchical** representation: c1 captures coarse semantics (category-level), c2 refines it, c3 further, etc.
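
A minimal sketch of this encoding step (illustrative only: it assumes the codebooks have already been learned and uses random vectors in place of real item embeddings):

```python
# Residual quantization of one embedding against K pre-trained codebooks.
import numpy as np

def rq_encode(embedding: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Return one codeword index per level; each level quantizes the previous residual."""
    residual = embedding.astype(np.float64)
    codes = []
    for codebook in codebooks:                               # shape: (codebook_size, dim)
        distances = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(distances))                      # nearest codebook entry
        codes.append(idx)
        residual = residual - codebook[idx]                  # pass the remainder to the next level
    return codes

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 768)) for _ in range(4)]  # K = 4 levels of 256 codes each
item_embedding = rng.normal(size=768)                        # stand-in for a SentenceT5 embedding
print(rq_encode(item_embedding, codebooks))                  # a 4-token Semantic ID
```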

**Training:**
- Input: sequence of Semantic IDs representing a user's past interactions
- Target: Semantic ID of the next item
- Loss: cross-entropy at each code position
- Architecture: standard Transformer encoder-decoder

**Key results:**
- Outperforms SASRec, BERT4Rec, and dual-encoder baselines on Amazon datasets
- **Cold-start capability:** can recommend items never seen in training (because Semantic IDs generalize via shared prefixes)
- **Diversity:** beam search with temperature naturally produces diverse recommendations

**Relevance to domainTokenizer:** TIGER's Semantic ID is the canonical example of how to create a "word" for a non-textual entity. The RQ-VAE approach is directly applicable to any item-based domain.

---

### 5.2 ActionPiece — BPE for User Actions

**Full title:** "ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation"
**Authors:** Yupeng Hou, Jianmo Ni, Zhankui He, et al. (Google DeepMind)
**Link:** [arXiv: 2502.13581](https://arxiv.org/abs/2502.13581) | [GitHub 53⭐](https://github.com/google-deepmind/action_piece)

**What it does:**
ActionPiece is the first **context-aware** tokenizer for user action sequences. It applies the BPE principle — merging frequently co-occurring pairs — but on **sets of item features** rather than characters.

**Key innovation β€” actions as unordered feature sets:**
Instead of treating each item as an atomic ID, ActionPiece represents each user action as a set of features:
```
Action = {category: "Electronics", brand: "Sony", price_range: "$50-100", ...}
```

**Vocabulary construction (BPE-like; a toy sketch follows this list):**
1. Start with base vocabulary = all individual features
2. Count co-occurrence of feature pairs:
   - **Intra-action:** features within the same action (e.g., "Electronics" + "$50-100")
   - **Inter-action:** features across adjacent actions (e.g., "Phone" in action t, "PhoneCase" in action t+1)
3. Merge the most frequent pair into a new composite token
4. Repeat until desired vocabulary size
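
A heavily simplified sketch of that merge loop (my own illustration: it counts only intra-action pairs, ignores inter-action co-occurrence, and skips set-permutation handling, all of which the real algorithm includes):

```python
# Greedy pair-merging over actions represented as feature sets.
from collections import Counter
from itertools import combinations

def build_vocab(corpus: list[list[frozenset]], target_vocab_size: int) -> set[str]:
    """corpus: user sequences, each a list of actions, each action a frozenset of feature strings."""
    vocab = {feat for seq in corpus for action in seq for feat in action}
    while len(vocab) < target_vocab_size:
        pair_counts = Counter()
        for seq in corpus:
            for action in seq:
                for a, b in combinations(sorted(action), 2):   # intra-action pairs only
                    pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        (a, b), _ = pair_counts.most_common(1)[0]              # most frequent pair wins
        merged = f"{a}+{b}"
        vocab.add(merged)
        corpus = [[(action - {a, b}) | {merged} if {a, b} <= action else action
                   for action in seq] for seq in corpus]       # re-tokenize with the new token
    return vocab

corpus = [[frozenset({"Electronics", "$50-100", "CreditCard"}),
           frozenset({"Electronics", "$10-50", "CreditCard"})]]
print(sorted(build_vocab(corpus, target_vocab_size=7)))
```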

**Set Permutation Regularization (SPR):**
Because feature sets are unordered, the same action can be tokenized with different internal orderings. SPR produces multiple segmentations of the same sequence, acting as data augmentation and preventing the model from overfitting to arbitrary feature orderings.

**Key results:**
- Outperforms TIGER, SASRec, BERT4Rec on Amazon Sports, Beauty, and CDs datasets
- NDCG@10 improvements of 5–15% over TIGER
- The context-aware tokenization means the same item gets different tokens in different behavioral contexts

**Relevance to domainTokenizer:** ActionPiece is the most directly applicable template for building a domain tokenizer. Its BPE-like algorithm can be generalized to any domain where events are composed of multiple features.

---

### 5.3 Banking Transaction Flow — Transactions as Tokens

**Full title:** "Self-Attention Mechanism in Multimodal Context for Banking Transaction Flow"
**Authors:** Cyrile Delestre, Yoann Sola
**Link:** [arXiv: 2410.08243](https://arxiv.org/abs/2410.08243)

**What it does:**
Designs a custom tokenizer for banking transactions — multimodal events consisting of (date, numerical amount, text wording) — and pre-trains Transformer and RNN models on large-scale transaction data.

**Tokenization scheme** (a minimal date/amount sketch follows this list):
1. **Date modality:** Converted to relative temporal features (days since last transaction, day of week)
2. **Amount modality:** Quantized into bins. The paper doesn't specify the exact binning, but refers to discretization that preserves order and magnitude.
3. **Wording modality:** Standard BPE tokenization on the text description (e.g., merchant names, transaction descriptions) after normalization (removing account numbers, dates from text, standardizing merchant names)
4. **Composite embedding:** Each modality's tokens are independently embedded, then combined via concatenation or learned projection into a single transaction-level representation
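
A minimal sketch of the date and amount parts under stated assumptions (log-spaced amount bins and day-of-week plus relative-delta date tokens; the paper does not publish its exact parameters):

```python
# Illustrative date and amount tokenization for transaction events.
import math
from datetime import datetime

def amount_token(amount: float, n_bins: int = 32, max_amount: float = 1_000_000) -> str:
    """Logarithmic binning: fine resolution for small amounts, coarse for large ones."""
    scaled = math.log1p(abs(amount)) / math.log1p(max_amount)
    bin_idx = min(n_bins - 1, int(scaled * n_bins))
    direction = "debit" if amount < 0 else "credit"
    return f"amount_{direction}_bin_{bin_idx}"

def date_tokens(ts: datetime, prev_ts: datetime) -> list[str]:
    """Day-of-week plus a capped relative delta since the previous transaction."""
    delta_days = (ts - prev_ts).days
    return [f"dow_{ts.strftime('%a').lower()}", f"delta_{min(delta_days, 30)}d"]

print(amount_token(-79.99))                                        # e.g. amount_debit_bin_10
print(date_tokens(datetime(2025, 3, 15), datetime(2025, 3, 13)))   # ['dow_sat', 'delta_2d']
```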

**Sequence construction:**
- Within each day: transactions sorted by ascending amount
- Across days: chronological order
- Special separator tokens between days

**Pre-training (self-supervised):**
- **Masked Transaction Prediction (MTP):** Mask entire transactions (not just subword tokens within a description), predict the masked transaction. This forces the model to learn cross-transaction patterns.
- Both RNN (BiLSTM-based, ELMo-style) and Transformer (BERT-style) pre-training explored

**Downstream tasks:**
- **Transaction categorization:** 31 classes (income, shopping, subscription, transport, savings, etc.). Fine-tuned pre-trained models beat all baselines.
- **Credit risk scoring:** Binary classification of default risk. Pre-trained models significantly outperform non-pre-trained approaches.

**Relevance to domainTokenizer:** This is the closest existing work to an e-commerce transaction tokenizer. The multimodal composite tokenization approach (date + amount + text) is directly applicable.

---

### 5.4 LETTER — Learnable Item Tokenization

**Full title:** "Learnable Item Tokenization for Generative Recommendation"
**Authors:** Wenjie Wang, Honghui Bao, et al.
**Link:** [arXiv: 2405.07314](https://arxiv.org/abs/2405.07314) | [GitHub 153⭐](https://github.com/honghuibao2000/letter)

**What it does:**
LETTER addresses three limitations of prior item tokenization methods:
1. **ID-based:** No semantic information, can't generalize to new items
2. **Text-based:** Lose collaborative signals (who bought what with what)
3. **Codebook-based (RQ-VAE):** Suffer from code assignment bias (popular items get all the good codes)

**LETTER's solution — a learnable tokenizer with three objectives** (a toy code-usage diagnostic follows this list):
1. **Semantic regularization:** Tokenizer's codebook should respect semantic similarity (similar items → similar codes)
2. **Contrastive alignment:** Tokens should capture collaborative filtering signals (items bought together → nearby in token space)
3. **Diversity loss:** Prevent codebook collapse — ensure all codes are used, not just a few popular ones
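
The diversity objective targets exactly the codebook-collapse failure mode; a toy diagnostic for it (my own illustration, not the paper's loss) is the normalized entropy of code usage:

```python
# Normalized entropy of codebook usage: 1.0 means all codes are used equally,
# values near 0 mean the tokenizer has collapsed onto a handful of codes.
import numpy as np

def code_usage_entropy(code_assignments: np.ndarray, codebook_size: int) -> float:
    counts = np.bincount(code_assignments, minlength=codebook_size)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    return float(-(nonzero * np.log(nonzero)).sum() / np.log(codebook_size))

# A collapsed tokenizer: 10K items mapped onto only 32 of 256 available codes.
assignments = np.random.default_rng(0).integers(0, 32, size=10_000)
print(code_usage_entropy(assignments, codebook_size=256))   # ~0.62, well below 1.0
```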

**Architecture:**
- Uses Residual Quantized VAE (like TIGER) as the base tokenizer
- Adds the three losses above during tokenizer training
- The tokenizer is trained jointly with (or alternately with) the generative recommendation model

**Key results:**
- Outperforms TIGER, P5, and other generative recommendation baselines
- Particularly strong on long-tail items (items with few interactions) due to the diversity loss

**Relevance to domainTokenizer:** LETTER shows that **the tokenizer itself should be a learnable model** trained with domain-specific objectives, not just a fixed preprocessing step.

---

### 5.5 TP-BERTa — Numerical Value Tokenization

**Full title:** "Making Pre-trained Language Models Great on Tabular Prediction"
**Authors:** Jiahuan Yan, et al.
**Link:** [arXiv: 2403.01841](https://arxiv.org/abs/2403.01841)

**What it does:**
Solves the fundamental problem of representing **numerical feature values** as tokens. Standard text tokenizers fragment numbers meaninglessly. TP-BERTa introduces **Relative Magnitude Tokenization (RMT)**.

**Relative Magnitude Tokenization:**
Instead of tokenizing the raw number "$79.99" as text:
1. Compute the feature's distribution across the dataset
2. Express each value as its **relative position** in that distribution
3. Discretize into bins: "very_low", "low", "medium", "high", "very_high" (or finer)
4. The token is the bin label, which preserves ordinal relationships

Example:
```
price = $79.99
→ Within the "price" feature distribution, $79.99 is at the 73rd percentile
→ Token: "price_bin_73" or "price_high"
```
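
A minimal quantile-binning version of this idea (illustrative; TP-BERTa's exact binning and bin count may differ):

```python
# Relative Magnitude Tokenization sketch: bin a value by its position in the
# feature's empirical distribution rather than tokenizing its digits.
import numpy as np

def fit_rmt_edges(values: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Bin edges at evenly spaced quantiles of the training distribution."""
    return np.quantile(values, np.linspace(0, 1, n_bins + 1)[1:-1])

def rmt_token(value: float, edges: np.ndarray, feature_name: str) -> str:
    return f"{feature_name}_bin_{int(np.searchsorted(edges, value))}"

prices = np.random.default_rng(0).lognormal(mean=4.0, sigma=1.0, size=10_000)
edges = fit_rmt_edges(prices, n_bins=10)
print(rmt_token(79.99, edges, "price"))   # the bin that $79.99 falls into for this price distribution
```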

**Intra-Feature Attention:**
Each feature value is paired with its feature name:
```
"price" → [price_name_embedding]
"$79.99" → [price_value_embedding via RMT]
```
Intra-feature attention binds them, so the model knows this number means "price" not "quantity" or "weight".

**Key results:**
- TP-BERTa is competitive with XGBoost and LightGBM on standard tabular benchmarks
- Significantly outperforms other deep learning approaches on tabular data
- The pre-trained model transfers across different tables

**Relevance to domainTokenizer:** RMT solves the critical problem of numerical tokenization. Every domain tokenizer will need to handle numbers (prices, amounts, quantities, durations), and RMT is currently the best approach.

---

### 5.6 Meta-Transformer — 12 Modalities, One Token Space

**Full title:** "Meta-Transformer: A Unified Framework for Multimodal Learning"
**Authors:** Yiyuan Zhang, Kaixiong Gong, et al.
**Link:** [arXiv: 2307.10802](https://arxiv.org/abs/2307.10802) | [GitHub 1652⭐](https://github.com/invictus717/MetaTransformer)

**What it does:**
Demonstrates that a single frozen Transformer encoder can process 12 different modalities — including **time series** and **tabular data** — by projecting each modality into a shared token space via modality-specific tokenizers.

**Modality-specific tokenizers:**
- **Text:** standard embedding
- **Image:** patch embedding (ViT-style)
- **Audio:** spectrogram patches
- **Time series:** segment embedding (chop time series into fixed-length segments, project each to a token)
- **Tabular:** feature-wise embedding (each column value becomes a token)
- **Graph:** node feature embedding
- **Point cloud:** point group embedding

**Key insight:** The tokenizers are lightweight (small learnable projections), and the Transformer encoder is **frozen** — trained once and shared across all modalities. This means the bulk of the computation is modality-agnostic.

**Relevance to domainTokenizer:** Meta-Transformer proves the viability of the unified approach. A domain tokenizer could use a similar architecture: lightweight domain-specific tokenizers feeding into a shared Transformer backbone.

---

## 6. Tokenization Methods: A Technical Taxonomy

### 6.1 Quantization-Based (RQ-VAE, VQ-VAE)

**How it works:**
- Train a Vector Quantized Variational Autoencoder on item embeddings
- The encoder maps items to a continuous latent space
- The quantization layer maps each embedding to the nearest entry in a learned codebook
- **Residual Quantization (RQ):** apply quantization iteratively on residuals for multi-token representations
- The decoder reconstructs the original embedding from the quantized codes

**Strengths:**
- Produces hierarchically structured tokens (coarse-to-fine)
- Items with similar content naturally share token prefixes
- Controllable vocabulary size (codebook size × number of levels)

**Weaknesses:**
- Codebook collapse (some codes rarely used)
- Training instability (requires commitment loss, EMA updates, etc.)
- No collaborative signal unless explicitly added (see LETTER)

**Used by:** TIGER, LETTER, PRISM, MMGRec, MiniOneRec, GenRec

### 6.2 BPE-Inspired Merging

**How it works:**
- Start with atomic features as the base vocabulary
- Count co-occurrence frequencies of feature pairs in the corpus
- Merge the most frequent pair into a new composite token
- Repeat until desired vocabulary size

**Strengths:**
- Naturally discovers meaningful composite patterns
- Context-aware (merges depend on surrounding actions)
- Directly analogous to text BPE — well-understood properties
- No neural network training required for vocabulary construction

**Weaknesses:**
- Greedy algorithm — may not find globally optimal vocabulary
- Requires careful handling of unordered feature sets (set permutation regularization)
- Vocabulary depends on corpus statistics — may not generalize to distribution shifts

**Used by:** ActionPiece

### 6.3 Magnitude & Binning Approaches

**How it works:**
- For numerical values: compute distribution statistics, discretize into bins
- Options: uniform bins, quantile bins, logarithmic bins, adaptive bins
- For timestamps: calendar tokens (day-of-week, month, etc.) or relative encodings

**Strengths:**
- Simple, interpretable, no training required
- Preserves ordinal relationships
- Handles numerical data natively (no text conversion)

**Weaknesses:**
- Fixed granularity (bin resolution)
- Information loss at bin boundaries
- Requires domain knowledge to choose binning strategy

**Used by:** TP-BERTa, Banking Transaction Flow, Temporal Tokenization Strategies

### 6.4 Learnable End-to-End Tokenizers

**How it works:**
- A neural network (encoder) maps raw domain data to discrete tokens
- The tokenizer is trained end-to-end with the downstream model
- Uses techniques like Gumbel-Softmax for differentiable discretization

**Strengths:**
- Tokenizer adapts to the downstream task
- Can incorporate multiple objectives (semantic, collaborative, diversity)
- No manual design of tokenization rules

**Weaknesses:**
- More complex training (joint optimization)
- Risk of tokenizer-model co-adaptation (poor generalization)
- Harder to interpret what tokens mean

**Used by:** LETTER, UniGRec, ContRec, MANTa

### 6.5 Serialization-Based (Text Templates)

**How it works** (a minimal serialization helper is sketched after this list):
- Convert each data record to a natural language string:
  `"The customer bought Sony WH-1000XM5 headphones for $349.99 using a credit card on March 15, 2025."`
- Use a standard text tokenizer (BPE) on the serialized string
- Feed to a pre-trained LLM
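
The serialization step itself is trivial; a minimal helper (illustrative template only, since real systems tune templates per dataset):

```python
# Serialize a record into a text template that any standard BPE tokenizer can consume.
def serialize_row(row: dict) -> str:
    return ", ".join(f"{column}: {value}" for column, value in row.items())

print(serialize_row({"price": "$79.99", "category": "Electronics", "brand": "Sony"}))
# price: $79.99, category: Electronics, brand: Sony
```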

**Strengths:**
- Zero engineering — use off-the-shelf LLMs
- Benefits from LLM's pre-trained world knowledge
- Handles heterogeneous schemas easily

**Weaknesses:**
- Extremely token-inefficient (one row might become 100+ tokens)
- Numerical values still poorly handled by text tokenizers
- Requires large models to work well (no "small model" possibility)
- No exploitation of domain structure

**Used by:** TabuLa-8B, TabSTAR (partially), various LLM-for-tabular approaches

---

## 7. The domainTokenizer Blueprint: How to Build It

### 7.1 Architecture Design

Based on the research, domainTokenizer should have three components:

```
┌──────────────────────────────────────────────────┐
│                 domainTokenizer                  │
│                                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌────────┐  │
│  │  Domain      │  │  Transformer │  │  Task  │  │
│  │  Tokenizer   │──│  Backbone    │──│  Heads │  │
│  │  (learnable) │  │  (small)     │  │        │  │
│  └──────────────┘  └──────────────┘  └────────┘  │
│                                                  │
│  Tokenizer: Domain events → discrete tokens      │
│  Backbone:  Sequence modeling via attention      │
│  Heads:     Task-specific outputs                │
└──────────────────────────────────────────────────┘
```

**Domain Tokenizer (per-domain, learnable):**
- Handles the conversion of raw domain events into discrete tokens
- Combines multiple strategies: RQ-VAE for items, magnitude binning for numbers, BPE-like merging for feature compositions, calendar encoding for timestamps
- Small and fast (a few million parameters at most)

**Transformer Backbone (shared, small):**
- Standard causal or bidirectional Transformer
- Target sizes: 10M, 50M, 150M, 350M parameters
- Pre-trained on domain sequences with self-supervised objectives
- Potentially shareable across related domains

**Task Heads (per-task):**
- Classification head for fraud detection, churn prediction, etc.
- Generation head for next-event prediction, recommendation
- Regression head for value prediction (LTV, credit score, etc.)

### 7.2 Tokenizer Construction Pipeline

For a given domain (e.g., e-commerce), the tokenizer construction follows:

**Step 1: Schema Analysis**
```python
# Identify field types in the domain data
schema = {
    "product_id": "categorical_entity",    # → Semantic ID via RQ-VAE
    "category": "categorical_fixed",       # → direct vocabulary mapping
    "price": "numerical_continuous",       # → magnitude binning (RMT)
    "quantity": "numerical_discrete",      # → small fixed vocabulary
    "timestamp": "temporal",               # → calendar + relative encoding
    "description": "text",                 # → standard BPE (subword)
    "payment_method": "categorical_small", # → direct mapping
    "customer_id": "entity_id",            # → learned embedding or behavioral cluster
}
```

**Step 2: Per-Field Tokenization**

| Field Type | Method | Output |
|------------|--------|--------|
| Categorical entity (products) | RQ-VAE Semantic IDs | Tuple of K codebook indices |
| Categorical fixed (categories) | Direct vocab mapping | Single token index |
| Numerical continuous (prices) | Relative Magnitude Tokenization | Bin token |
| Temporal (timestamps) | Calendar tokens + relative delta | 2–3 tokens (day-of-week, time-of-day, delta) |
| Text (descriptions) | Standard BPE | Variable-length subword tokens |
| Entity ID (customers) | Behavioral clustering or learned embedding | Single token or short sequence |

**Step 3: Composite Token Construction (BPE-like)**
Following ActionPiece, apply a BPE-like merge algorithm on the composite per-field tokens to discover meaningful multi-field patterns:
```
Initial: [Electronics] [price_high] [CreditCard] [Weekday]
After merging: [Electronics+price_high] [CreditCard+Weekday]
Further: [HighEndElectronicsPurchase] [WeekdayCreditCard]
```

**Step 4: Special Tokens**
```
[SEP]       - separates transactions in a sequence
[DAY_SEP]   - separates days
[PAD]       - padding
[MASK]      - for masked pre-training
[CLS]       - sequence-level representation
[UNK]       - unknown/out-of-vocabulary events
```

### 7.3 Pre-training Objectives

Based on the literature, the following self-supervised objectives are most effective:

**1. Masked Event Prediction (MEP) — BERT-style** (a data-preparation sketch appears at the end of this subsection)
- Mask 15% of complete events (not just individual tokens within an event)
- Predict all tokens of the masked event
- Forces the model to learn cross-event patterns

**2. Next Event Prediction (NEP) — GPT-style**
- Given a sequence of events, predict the next event autoregressively
- Generate the event's token sequence (e.g., Semantic ID) token by token
- The primary objective for generative recommendation

**3. Contrastive Sequence Learning**
- Similar customer sequences should have similar representations
- Push apart sequences from different behavioral clusters
- Helps with customer segmentation and transfer learning

**4. Temporal Ordering**
- Given a shuffled sequence, predict the correct temporal order
- Forces the model to learn temporal patterns (seasonality, cadence, trends)
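
For MEP, the key implementation detail is that masking operates on whole events rather than individual tokens. A small illustrative sketch of the data preparation (my own, not taken from a specific paper):

```python
# Mask whole events (groups of tokens), not individual tokens, so the model must
# reconstruct complete transactions from their neighbors.
import random

MASK = "[MASK]"

def mask_events(event_token_groups: list[list[str]], mask_prob: float = 0.15, seed: int = 0):
    """event_token_groups: one token list per event, in temporal order."""
    rng = random.Random(seed)
    inputs, targets = [], []
    for tokens in event_token_groups:
        if rng.random() < mask_prob:
            inputs.extend([MASK] * len(tokens))    # hide the entire event
            targets.extend(tokens)                 # loss on every token of the hidden event
        else:
            inputs.extend(tokens)
            targets.extend([None] * len(tokens))   # no loss on visible positions
    return inputs, targets

events = [["Electronics", "price_bin_37", "dow_wed"],
          ["Books", "price_bin_22", "dow_sat"]]
print(mask_events(events, mask_prob=0.5, seed=1))
```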

### 7.4 Downstream Task Adaptation

Once pre-trained, the model can be fine-tuned for specific tasks:

| Task | Adaptation Method | Head |
|------|-------------------|------|
| **Next purchase prediction** | Continue NEP, decode Semantic IDs | Generative (autoregressive) |
| **Fraud detection** | Fine-tune on labeled transactions | Binary classifier on [CLS] |
| **Customer segmentation** | Extract [CLS] embeddings, cluster | No head (use embeddings) |
| **Churn prediction** | Fine-tune on labeled sequences | Binary classifier on [CLS] |
| **Credit scoring** | Fine-tune on labeled customer histories | Regression or classification |
| **Demand forecasting** | Adapt temporal patterns | Regression on quantity tokens |
| **Product recommendation** | NEP with Semantic ID decoding | Generative (beam search) |

---

## 8. Use Case Walkthrough: E-Commerce Transaction Model

### The Scenario
An e-commerce platform with:
- 2M customers
- 500K products
- 100M transactions over 2 years
- Each transaction: `(customer_id, product_id, category, price, quantity, timestamp, payment_method, device)`

### Step 1: Build the Tokenizer

**Product Semantic IDs:**
```python
# 1. Generate product embeddings from title + description
product_embeddings = sentence_encoder(product_titles + product_descriptions)  # 500K × 768

# 2. Train RQ-VAE with 4 codebooks of 256 entries each
rq_vae = ResidualQuantizedVAE(n_codebooks=4, codebook_size=256)
rq_vae.fit(product_embeddings)

# 3. Each product gets a 4-token Semantic ID
product_semantic_ids = rq_vae.encode(product_embeddings)  # 500K × 4
# e.g., Headphones → [42, 187, 23, 91]
```

**Price Tokenization (RMT):**
```python
# Compute percentile bins
price_bins = compute_quantile_bins(all_prices, n_bins=50)
# $79.99 → "price_bin_37" (37th percentile bin)
```

**Timestamp Tokenization:**
```python
# Calendar features + relative delta
def tokenize_timestamp(ts, prev_ts):
    return [
        day_of_week_token(ts),      # "wednesday"
        time_of_day_token(ts),       # "afternoon"  
        delta_token(ts - prev_ts),   # "2_days_later"
    ]
```

**Composite vocabulary construction (BPE-like):**
```python
# Run ActionPiece-style merging on the corpus of tokenized transaction sequences
vocabulary = actionpiece_vocab_construction(
    corpus=all_tokenized_transactions,
    target_vocab_size=8192,
    consider_intra_event=True,   # merge features within a transaction
    consider_inter_event=True,   # merge features across adjacent transactions
)
```

### Step 2: Pre-train

```python
# Tokenize all 100M transactions
tokenized_corpus = tokenize_all_transactions(transactions, tokenizer)

# Pre-train a small Transformer (150M params)
model = TransformerLM(
    vocab_size=8192 + special_tokens,
    d_model=768,
    n_heads=12,
    n_layers=12,
    max_seq_len=256,  # ~256 transactions per customer
)

# Self-supervised pre-training with MEP + NEP
train(model, tokenized_corpus, objectives=["masked_event", "next_event"])
```

### Step 3: Fine-tune & Deploy

```python
# Example: Fraud detection
fraud_model = add_classification_head(model, n_classes=2)
fine_tune(fraud_model, labeled_fraud_data)

# Example: Next purchase recommendation
rec_model = model  # Use generative mode directly
next_item_semantic_id = rec_model.generate(customer_transaction_sequence)
# Map the generated Semantic ID back to a catalog item via the inverse index built
# when products were encoded (the RQ-VAE decoder reconstructs an embedding, not a product)
next_item = semantic_id_to_product[tuple(next_item_semantic_id)]
```

---

## 9. Open Challenges and Research Gaps

### 9.1 Vocabulary Evolution
Products are added and removed constantly. Semantic IDs need to be recomputed, which may invalidate the model's learned associations. **Partial solutions:** periodic re-indexing (TIGER), using content features that are stable even when the catalog changes.

### 9.2 Cross-Domain Transfer
Can a tokenizer trained on e-commerce data transfer to banking? The field-level tokenizers (RMT for numbers, calendar for dates) should transfer, but composite vocabularies are domain-specific. **Open question:** is there a "universal domain tokenizer" or will each domain need its own?

### 9.3 Numerical Precision
All current methods lose some numerical precision through discretization. For applications where exact values matter (financial auditing, pricing optimization), this is a limitation. **Potential solution:** hybrid approaches that combine discrete tokens with continuous residuals.

### 9.4 Handling Missing Data
Real business data is full of missing values. Text tokenizers never face this issue. Domain tokenizers need explicit strategies: [MISSING] tokens, imputation, or learning to model missingness as a signal.

### 9.5 Privacy & Fairness
Tokenizing customer behavior raises privacy concerns. Semantic IDs could encode sensitive attributes (demographic patterns, financial status) in ways that are hard to audit. Domain tokenizers should be designed with fairness constraints.

### 9.6 Scalability of BPE-Like Merging
ActionPiece's vocabulary construction is O(N × V) per merge step. For very large corpora (billions of events) and feature spaces (thousands of features), this may become prohibitively expensive. **Potential solution:** approximate counting, hierarchical merging, or neural vocabulary construction.

### 9.7 Evaluation Standards
There are no standard benchmarks for "domain tokenization quality." Text tokenizers can be evaluated by compression ratio and downstream perplexity. Domain tokenizers need domain-specific metrics: recommendation quality, prediction accuracy, calibration, etc.

### 9.8 Connection to Continual Learning
The HOPE / Nested Learning paradigm (see Section 11) suggests that models should continuously learn from new data. Domain tokenizers that can incrementally update their vocabularies — adding new product tokens, retiring obsolete ones — without full retraining would be highly valuable.

---

## 10. Complete Paper Reference Table

| # | Paper | Year | ArXiv | Domain | Key Contribution | GitHub |
|---|-------|------|-------|--------|-----------------|--------|
| 1 | **TIGER** | 2023 | [2305.05065](https://arxiv.org/abs/2305.05065) | Recommendation | Semantic IDs via RQ-VAE for generative retrieval | [781⭐](https://github.com/EdoardoBotta/RQ-VAE-Recommender) |
| 2 | **Semantic IDs (YouTube)** | 2023 | [2306.08121](https://arxiv.org/abs/2306.08121) | Recommendation | Content-derived IDs at industry scale | — |
| 3 | **ActionPiece** | 2025 | [2502.13581](https://arxiv.org/abs/2502.13581) | Recommendation | BPE-like context-aware action tokenization | [53⭐](https://github.com/google-deepmind/action_piece) |
| 4 | **LETTER** | 2024 | [2405.07314](https://arxiv.org/abs/2405.07314) | Recommendation | Learnable tokenizer with semantic+collaborative+diversity | [153⭐](https://github.com/honghuibao2000/letter) |
| 5 | **SETRec** | 2025 | [2502.10833](https://arxiv.org/abs/2502.10833) | Recommendation | Order-agnostic set identifiers | — |
| 6 | **ContRec** | 2025 | [2504.12007](https://arxiv.org/abs/2504.12007) | Recommendation | Continuous tokens via sigma-VAE + diffusion | — |
| 7 | **GenRec** | 2026 | [2604.14878](https://arxiv.org/abs/2604.14878) | Recommendation | Page-wise NTP for large-scale recommendation | — |
| 8 | **MBGen** | 2024 | [2405.16871](https://arxiv.org/abs/2405.16871) | Recommendation | Multi-behavior (view/click/buy) as token types | [57⭐](https://github.com/anananan116/MBGen) |
| 9 | **RSLLM** | 2024 | [2412.16933](https://arxiv.org/abs/2412.16933) | Recommendation | Recommendation as a new language in LLMs | — |
| 10 | **PRISM** | 2025 | [2601.16556](https://arxiv.org/abs/2601.16556) | Recommendation | Purified quantization for semantic tokenization | — |
| 11 | **MMGRec** | 2024 | [2404.16555](https://arxiv.org/abs/2404.16555) | Recommendation | Graph RQ-VAE for multimodal items | — |
| 12 | **UniGRec** | 2025 | [2601.17438](https://arxiv.org/abs/2601.17438) | Recommendation | Soft item identifiers for end-to-end optimization | — |
| 13 | **Semantic IDs for Search+Rec** | 2025 | [2508.10478](https://arxiv.org/abs/2508.10478) | Recommendation | Joint search and recommendation Semantic IDs | — |
| 14 | **Banking Transaction Flow** | 2024 | [2410.08243](https://arxiv.org/abs/2410.08243) | Finance | Composite tokenizer for (date, amount, text) transactions | — |
| 15 | **LBSF** | 2024 | [2411.15056](https://arxiv.org/abs/2411.15056) | Finance | Long-term payment behavior folding by merchant | — |
| 16 | **Temporal Tokenization** | 2025 | [2512.13618](https://arxiv.org/abs/2512.13618) | Events | Systematic comparison of temporal tokenization strategies | — |
| 17 | **FinTRec** | 2025 | [2511.14865](https://arxiv.org/abs/2511.14865) | Finance | Transformer for long-range financial recommendation | — |
| 18 | **TIMeSynC** | 2024 | [2410.12825](https://arxiv.org/abs/2410.12825) | Finance | Temporal intent prediction in financial services | — |
| 19 | **TP-BERTa** | 2024 | [2403.01841](https://arxiv.org/abs/2403.01841) | Tabular | Relative Magnitude Tokenization for numbers | — |
| 20 | **TabuLa-8B** | 2024 | [2406.12031](https://arxiv.org/abs/2406.12031) | Tabular | Llama 3 fine-tuned on serialized tables | [71⭐](https://github.com/mlfoundations/rtfm) |
| 21 | **TabSTAR** | 2025 | [2505.18125](https://arxiv.org/abs/2505.18125) | Tabular | Semantically target-aware tabular foundation model | [83⭐](https://github.com/alanarazi7/TabSTAR) |
| 22 | **UniTabE** | 2023 | [2307.09249](https://arxiv.org/abs/2307.09249) | Tabular | Universal tabular pretraining protocol | — |
| 23 | **TARTE** | 2025 | [2505.14415](https://arxiv.org/abs/2505.14415) | Tabular | Knowledge-enhanced tabular representations | — |
| 24 | **TabICL** | 2025 | [2502.05564](https://arxiv.org/abs/2502.05564) | Tabular | Column-then-row attention, scales to 500K samples | — |
| 25 | **Meta-Transformer** | 2023 | [2307.10802](https://arxiv.org/abs/2307.10802) | Universal | 12 modalities in one token space | [1652⭐](https://github.com/invictus717/MetaTransformer) |
| 26 | **Emu3** | 2024 | [2409.18869](https://arxiv.org/abs/2409.18869) | Universal | NTP is all you need across modalities | [2400⭐](https://github.com/baaivision/emu3) |
| 27 | **Unified-IO 2** | 2023 | [2312.17172](https://arxiv.org/abs/2312.17172) | Universal | Image+text+audio+action in one model | [647⭐](https://github.com/allenai/unified-io-2) |
| 28 | **NTP Multimodal Survey** | 2024 | [2412.18619](https://arxiv.org/abs/2412.18619) | Survey | Taxonomy of multimodal tokenization + NTP | [478⭐](https://github.com/lmm101/awesome-multimodal-next-token-prediction) |
| 29 | **LongCat-Next** | 2025 | [2603.27538](https://arxiv.org/abs/2603.27538) | Universal | Lexicalizing modalities as discrete tokens | [409⭐](https://github.com/meituan-longcat/LongCat-Next) |
| 30 | **Tabular Data Survey** | 2024 | [2408.10548](https://arxiv.org/abs/2408.10548) | Survey | Comprehensive survey of LMs for tabular data | [33⭐](https://github.com/lanxiang1017/language-modeling-on-tabular-data-survey) |
| 31 | **KL3M Tokenizers** | 2025 | [2503.17247](https://arxiv.org/abs/2503.17247) | Legal/Finance | Domain-specific BPE for professional text | [GitHub](https://github.com/alea-institute/kl3m-tokenizer-paper) |

---

## 11. Related Concepts: Nested Learning & Continual Adaptation

An important related development is the **Nested Learning** paradigm introduced by Google Research ([arXiv: 2512.24695](https://arxiv.org/abs/2512.24695), by Ali Behrouz et al.), which presents the **HOPE** architecture.

### Why Nested Learning Matters for Domain Tokenization

Current Transformer-based models are "frozen" after pre-training — they cannot incorporate new knowledge without retraining. For domain tokenization, this means:
- A recommendation model can't learn about new products added after training
- A fraud detection model can't adapt to new fraud patterns in real-time
- A customer model can't update its understanding of a customer's evolving preferences

The HOPE architecture addresses this via:
1. **Continuum Memory System (CMS):** Multiple MLP blocks updating at different frequencies — some update every few tokens (catching immediate patterns), others update only after millions of tokens (storing persistent knowledge). This prevents catastrophic forgetting.
2. **Self-Modifying Titans:** The model's projection layers update themselves in real-time based on incoming data, enabling continuous adaptation.

**For domainTokenizer, the implication is:** a domain model built with Nested Learning principles could continuously learn from new transactions, adapting its understanding of products, customer preferences, and behavioral patterns without retraining from scratch.

This is an area of active exploration for future versions of domainTokenizer.

For the full research report on Nested Learning, see the [HOPE / Nested Learning discussion on HF Papers](https://huggingface.co/papers/2512.24695).

---

*This report is a living document and will be updated as the domainTokenizer project evolves.*