
# 🔑 domainTokenizer

Building small models that understand domain tokens, not just words.


## The Idea

LLMs process text by breaking it into tokens (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.

But words are just one type of sequential data. Businesses generate massive amounts of non-textual sequential data every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.

domainTokenizer explores a different approach: what if we built tokenizers that encode domain-specific entities (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?

```
Text LLM:        "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word

domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```
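
To make the idea concrete, here is a minimal, self-contained sketch of how a structured purchase event might become a short sequence of domain tokens. All class, field, and token names here are illustrative assumptions, not the project's API:

```python
# Hypothetical sketch: one purchase event becomes a few discrete domain
# tokens instead of text fragments. Names and rules are illustrative only.

from dataclasses import dataclass

@dataclass
class PurchaseEvent:
    category: str   # e.g. "Electronics"
    price: float    # e.g. 79.99
    weekday: str    # e.g. "Wednesday"

def tokenize_event(event: PurchaseEvent) -> list[str]:
    """Map one event to domain tokens (toy rules for illustration)."""
    price_band = "mid" if 50 <= event.price < 100 else "other"
    return [
        f"cat:{event.category}",
        f"price:{price_band}",
        f"day:{event.weekday}",
    ]

history = [PurchaseEvent("Electronics", 79.99, "Wednesday")]
tokens = [t for e in history for t in tokenize_event(e)]
print(tokens)  # ['cat:Electronics', 'price:mid', 'day:Wednesday']
```

A sequence of such tokens, rather than subwords, is what the small Transformer would model.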

## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
| --- | --- | --- |
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as a Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Wednesday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in a flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
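
The `price_bin_37` example above is quantile binning of numeric values. A minimal sketch of how such a magnitude tokenizer could be fit and applied, assuming 50 quantile bins and a synthetic price distribution (both assumptions for illustration):

```python
# Minimal sketch of percentile-based magnitude binning, as in the
# "price_bin_37" example above. Bin count and names are assumptions.

import numpy as np

def fit_price_bins(prices: np.ndarray, n_bins: int = 50) -> np.ndarray:
    """Learn bin edges from the empirical price distribution."""
    quantiles = np.linspace(0, 100, n_bins + 1)
    return np.percentile(prices, quantiles)

def price_to_token(price: float, edges: np.ndarray) -> str:
    """Map a raw price to its quantile-bin token."""
    bin_id = int(np.searchsorted(edges, price, side="right")) - 1
    bin_id = min(max(bin_id, 0), len(edges) - 2)  # clamp out-of-range prices
    return f"price_bin_{bin_id}"

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=3.5, sigma=1.0, size=10_000)
edges = fit_price_bins(prices, n_bins=50)
print(price_to_token(79.99, edges))  # prints something like 'price_bin_40'
```

The token carries *relative* magnitude (where the price sits in the distribution), which is exactly what a flat text tokenizer loses.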

## Research Foundation

This project is grounded in 30+ papers from Google, Google DeepMind, and the broader research community. The key finding: any sequential domain data can be tokenized and modeled with the LLM paradigm; the challenge is *how* to tokenize.

Five paradigms have emerged:

| Paradigm | Method | Key Paper |
| --- | --- | --- |
| Semantic IDs | RQ-VAE quantization of item embeddings | TIGER (Google, 2023) |
| Action Tokenization | BPE-like merging of feature patterns | ActionPiece (DeepMind, 2025) |
| Transaction Tokenization | Composite (date + amount + text) encoding | Banking TF (2024) |
| Tabular Tokenization | Relative magnitude encoding for numbers | TP-BERTa (2024) |
| Universal Tokenization | All modalities → shared discrete space | Meta-Transformer (2023) |
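
As a rough illustration of the Semantic ID paradigm, the sketch below shows the mechanics of residual quantization: each level picks the nearest codebook entry to the remaining residual, yielding a short hierarchical code like `[23, 7, 41]`. In a real RQ-VAE (as in TIGER) the codebooks are learned jointly with an encoder-decoder; here they are random, purely to show the mechanism:

```python
# Toy residual quantization: NOT a trained RQ-VAE, just the code-assignment
# step that turns an item embedding into a multi-level Semantic ID.

import numpy as np

def residual_quantize(x: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Return one code index per level; each level quantizes the residual."""
    semantic_id = []
    residual = x.copy()
    for codebook in codebooks:               # each codebook: (size, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(dists.argmin())            # nearest codeword at this level
        semantic_id.append(idx)
        residual = residual - codebook[idx]  # next level refines the remainder
    return semantic_id

rng = np.random.default_rng(42)
dim, levels, size = 8, 3, 64
codebooks = [rng.normal(size=(size, dim)) for _ in range(levels)]
item_embedding = rng.normal(size=dim)
print(residual_quantize(item_embedding, codebooks))  # e.g. a code like [23, 7, 41]
```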

📄 Full research report: `docs/research_report.md`

## Project Vision

### Phase 1: Research & Survey (✅ Current)

  • Literature survey of domain tokenization methods
  • Analysis of tokenization strategies across recommendation, finance, tabular, and universal domains
  • Blueprint for a general-purpose domain tokenizer

### Phase 2: Core Tokenizer Library

  • Implement per-field tokenizers:
    • SemanticIDTokenizer β€” RQ-VAE for entity encoding
    • MagnitudeTokenizer β€” relative magnitude binning for numerical values
    • TemporalTokenizer β€” calendar + relative delta encoding
    • CompositeTokenizer β€” BPE-like merging of multi-field patterns (ActionPiece-style)
  • Schema-driven automatic tokenizer selection

### Phase 3: Pre-training Framework

  • Self-supervised objectives: Masked Event Prediction, Next Event Prediction
  • Small Transformer backbone (10M–350M parameters)
  • Domain-agnostic training loop that works with any tokenizer configuration

### Phase 4: Domain Demos

  • E-commerce: next purchase prediction, customer segmentation
  • Finance: fraud detection, credit scoring
  • Healthcare: clinical event prediction

## Repo Structure

```
domainTokenizer/
├── docs/
│   └── research_report.md    # Detailed research findings (30+ papers)
├── src/                      # (coming) Core library
│   ├── tokenizers/           # Per-field tokenizer implementations
│   ├── models/               # Small Transformer backbones
│   └── training/             # Pre-training and fine-tuning
├── examples/                 # (coming) Domain-specific demos
└── README.md
```

## Key References

| Paper | Year | What It Does | Link |
| --- | --- | --- | --- |
| TIGER | 2023 | Semantic IDs for products via RQ-VAE | arXiv |
| ActionPiece | 2025 | BPE for user action sequences | arXiv |
| Banking TF | 2024 | Tokenizer for financial transactions | arXiv |
| LETTER | 2024 | Learnable item tokenization | arXiv |
| TP-BERTa | 2024 | Numerical value tokenization | arXiv |
| Meta-Transformer | 2023 | 12 modalities, one token space | arXiv |
| NTP Survey | 2024 | Comprehensive multimodal NTP taxonomy | arXiv |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | arXiv |

See the full reference table with 31 papers in the research report.

## License

MIT


domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.