# πŸ”‘ domainTokenizer
**Building small models that understand domain tokens β€” not just words.**
---
## The Idea
LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). Tokenization turns raw text into the discrete sequences that Transformers learn to model.
But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day β€” purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.
**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** β€” products, transactions, medical codes, user actions β€” as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?
```
Text LLM: "The cat sat on the mat" β†’ [The] [cat] [sat] [on] [the] [mat] β†’ Transformer β†’ next word
domainTokenizer: Customer purchase history β†’ [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] β†’ Transformer β†’ next purchase
```
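To make the contrast concrete, here is a tiny, purely illustrative sketch of the second pipeline. The `tokenize_purchase` helper, field rules, and token names are hypothetical stand-ins for the learned tokenizers described later in this README.

```python
# Illustrative only: the field rules, token names, and this helper are
# hypothetical stand-ins for the learned tokenizers described below.
def tokenize_purchase(event: dict) -> list[str]:
    """Map one purchase event to a few domain tokens (toy rules)."""
    tokens = [f"cat_{event['category']}"]          # entity token (stand-in for a Semantic ID)
    price_bin = min(int(event["price"] // 25), 9)  # magnitude token instead of raw digits
    tokens.append(f"price_bin_{price_bin}")
    tokens.append(f"dow_{event['weekday']}")       # temporal token
    return tokens

history = [
    {"category": "electronics", "price": 799.0, "weekday": "Tue"},
    {"category": "accessory", "price": 19.0, "weekday": "Tue"},
]
print([tok for event in history for tok in tokenize_purchase(event)])
# ['cat_electronics', 'price_bin_9', 'dow_Tue', 'cat_accessory', 'price_bin_0', 'dow_Tue']
```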
## Why This Matters
| Problem | Text Tokenizer | Domain Tokenizer |
|---------|---------------|-----------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` β†’ composite token |
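The numeric and temporal rows above reduce to two small transformations: quantile binning of magnitudes and calendar-plus-delta encoding of timestamps. The sketch below is one possible version, assuming NumPy, synthetic training prices, and a known previous-event date; the bin count and token names are illustrative, not a fixed scheme.

```python
from datetime import date

import numpy as np

def fit_price_bins(train_prices, n_bins=50):
    """Quantile edges so each bin covers roughly the same share of prices."""
    return np.quantile(train_prices, np.linspace(0, 1, n_bins + 1)[1:-1])

def price_token(price, edges):
    return f"price_bin_{int(np.searchsorted(edges, price))}"

def time_tokens(ts, prev_ts=None):
    tokens = [f"dow_{ts.strftime('%A')}"]          # calendar token
    if prev_ts is not None:
        tokens.append(f"delta_{(ts - prev_ts).days}_days")  # relative delta token
    return tokens

edges = fit_price_bins(np.random.lognormal(3.5, 1.0, size=10_000))
print(price_token(79.99, edges))                          # e.g. 'price_bin_37'
print(time_tokens(date(2025, 3, 15), date(2025, 3, 13)))  # ['dow_Saturday', 'delta_2_days']
```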
## Research Foundation
This project is grounded in 30+ papers from Google, Google DeepMind, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm** β€” the challenge is *how* to tokenize.
Five paradigms have emerged:
| Paradigm | Method | Key Paper |
|----------|--------|-----------|
| **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
| **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
| **Transaction Tokenization** | Composite (date + amount + text) encoding | [Banking TF](https://arxiv.org/abs/2410.08243) (2024) |
| **Tabular Tokenization** | Relative magnitude encoding for numbers | [TP-BERTa](https://arxiv.org/abs/2403.01841) (2024) |
| **Universal Tokenization** | All modalities β†’ shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |
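For the BPE-style paradigm, the core loop is easy to picture: count which per-field tokens co-occur within events and promote the most frequent pair to a composite vocabulary entry. The sketch below shows only that counting-and-merging step; it deliberately ignores the set-permutation and weighting machinery of the actual ActionPiece algorithm.

```python
from collections import Counter

# Each event is a set of per-field tokens; order within an event is arbitrary.
events = [
    {"cat_electronics", "price_50_100", "dow_weekday"},
    {"cat_electronics", "price_50_100", "dow_weekend"},
    {"cat_grocery", "price_0_25", "dow_weekday"},
    {"cat_electronics", "price_50_100", "dow_weekday"},
]

def most_frequent_pair(events):
    """Count co-occurring token pairs within each event."""
    counts = Counter()
    for event in events:
        toks = sorted(event)
        for i in range(len(toks)):
            for j in range(i + 1, len(toks)):
                counts[(toks[i], toks[j])] += 1
    return counts.most_common(1)[0]

pair, freq = most_frequent_pair(events)
print(pair, freq)       # ('cat_electronics', 'price_50_100') 3
print("+".join(pair))   # composite token added to the vocabulary
```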
πŸ“„ **Full research report:** [`docs/research_report.md`](docs/research_report.md)
## Project Vision
### Phase 1: Research & Survey (βœ… Current)
- Literature survey of domain tokenization methods
- Analysis of tokenization strategies across recommendation, finance, tabular, and universal domains
- Blueprint for a general-purpose domain tokenizer
### Phase 2: Core Tokenizer Library
- Implement per-field tokenizers:
- `SemanticIDTokenizer` β€” RQ-VAE for entity encoding
- `MagnitudeTokenizer` β€” relative magnitude binning for numerical values
- `TemporalTokenizer` β€” calendar + relative delta encoding
- `CompositeTokenizer` β€” BPE-like merging of multi-field patterns (ActionPiece-style)
- Schema-driven automatic tokenizer selection (see the sketch below)
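A rough sketch of what schema-driven selection might look like; the `FieldTokenizer` protocol, field-type names, and registry are placeholders for a Phase 2 API that does not exist yet.

```python
# Hypothetical Phase 2 API sketch: names, field types, and the registry are
# placeholders, not an interface that exists in this repository yet.
from typing import Protocol

class FieldTokenizer(Protocol):
    def fit(self, values: list) -> "FieldTokenizer": ...
    def encode(self, value) -> list[str]: ...

REGISTRY = {
    "entity": "SemanticIDTokenizer",     # RQ-VAE codes for items/users
    "number": "MagnitudeTokenizer",      # relative-magnitude bins
    "timestamp": "TemporalTokenizer",    # calendar + relative deltas
}

def select_tokenizers(schema: dict) -> dict:
    """Pick a per-field tokenizer class name from the declared field type."""
    return {field: REGISTRY[ftype] for field, ftype in schema.items()}

print(select_tokenizers({"sku": "entity", "price": "number", "ts": "timestamp"}))
# {'sku': 'SemanticIDTokenizer', 'price': 'MagnitudeTokenizer', 'ts': 'TemporalTokenizer'}
```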
### Phase 3: Pre-training Framework
- Self-supervised objectives: Masked Event Prediction, Next Event Prediction (sketched below)
- Small Transformer backbone (10M–350M parameters)
- Domain-agnostic training loop that works with any tokenizer configuration
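As a reference point for the Next Event Prediction objective, here is a compact sketch on already-tokenized sequences, assuming PyTorch; the toy model size, random token IDs, and single backward pass are purely illustrative.

```python
# Next Event Prediction sketch (assumes PyTorch is installed); sizes are toy values.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model, seq_len, batch = 1000, 64, 16, 8

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
head = nn.Linear(d_model, vocab_size)

tokens = torch.randint(0, vocab_size, (batch, seq_len))         # domain-token IDs
causal = nn.Transformer.generate_square_subsequent_mask(seq_len)

hidden = encoder(embed(tokens), mask=causal)                    # causal self-attention
logits = head(hidden[:, :-1])                                   # predict event t+1 from prefix up to t
loss = F.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
print(f"next-event loss: {loss.item():.3f}")
```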
### Phase 4: Domain Demos
- E-commerce: next purchase prediction, customer segmentation
- Finance: fraud detection, credit scoring
- Healthcare: clinical event prediction
## Repo Structure
```
domainTokenizer/
β”œβ”€β”€ docs/
β”‚ └── research_report.md # Detailed research findings (30+ papers)
β”œβ”€β”€ src/ # (coming) Core library
β”‚ β”œβ”€β”€ tokenizers/ # Per-field tokenizer implementations
β”‚ β”œβ”€β”€ models/ # Small Transformer backbones
β”‚ └── training/ # Pre-training and fine-tuning
β”œβ”€β”€ examples/ # (coming) Domain-specific demos
└── README.md
```
## Key References
| Paper | Year | What It Does | Link |
|-------|------|-------------|------|
| TIGER | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| ActionPiece | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| Banking TF | 2024 | Tokenizer for financial transactions | [arXiv](https://arxiv.org/abs/2410.08243) |
| LETTER | 2024 | Learnable item tokenization | [arXiv](https://arxiv.org/abs/2405.07314) |
| TP-BERTa | 2024 | Numerical value tokenization | [arXiv](https://arxiv.org/abs/2403.01841) |
| Meta-Transformer | 2023 | 12 modalities, one token space | [arXiv](https://arxiv.org/abs/2307.10802) |
| NTP Survey | 2024 | Comprehensive multimodal NTP taxonomy | [arXiv](https://arxiv.org/abs/2412.18619) |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
See the [full reference table](docs/research_report.md#10-complete-paper-reference-table) with 31 papers in the research report.
## License
MIT
---
*domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.*