# domainTokenizer

**Building small models that understand domain tokens, not just words.**

---

## The Idea

LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.

But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.

**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?

```
Text LLM: "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word

domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```

## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
|---------|---------------|-----------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |

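To make the "Domain Tokenizer" column concrete, here is a minimal Python sketch of the kind of field-level encoding it describes. The helper names (`price_token`, `temporal_tokens`), the bin edges, and the token spellings are illustrative assumptions, not part of this repo yet.

```python
# Hypothetical field-level tokenizers, illustrating the table above.
from bisect import bisect_right
from datetime import date

def price_token(price: float, bin_edges: list[float]) -> str:
    """Map a price onto a quantile-bin token, e.g. 'price_bin_3'."""
    return f"price_bin_{bisect_right(bin_edges, price)}"

def temporal_tokens(day: date, prev_day: date) -> list[str]:
    """Encode calendar position plus the gap since the previous event."""
    return [day.strftime("%A"), f"{(day - prev_day).days}_days_later"]

# Toy usage; real bin edges would be fit on a training corpus of prices.
edges = [5.0, 20.0, 50.0, 100.0, 500.0]
print(price_token(79.99, edges))                        # -> price_bin_3
print(temporal_tokens(date(2025, 3, 15), date(2025, 3, 13)))
# -> ['Saturday', '2_days_later']
```
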
## Research Foundation

This project is grounded in 30+ papers from Google, Google DeepMind, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm**; the challenge is *how* to tokenize.

Five paradigms have emerged:

| Paradigm | Method | Key Paper |
|----------|--------|-----------|
| **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
| **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
| **Transaction Tokenization** | Composite (date + amount + text) encoding | [Banking TF](https://arxiv.org/abs/2410.08243) (2024) |
| **Tabular Tokenization** | Relative magnitude encoding for numbers | [TP-BERTa](https://arxiv.org/abs/2403.01841) (2024) |
| **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |

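As an example of the first paradigm, the sketch below shows only the residual-quantization step that turns a continuous item embedding into a short Semantic ID (one discrete code per level). A real RQ-VAE, as in TIGER, learns the codebooks jointly with an encoder and decoder; the random codebooks and sizes here are placeholders to show the mechanics.

```python
# Toy residual quantization: a continuous item embedding -> a short Semantic ID.
# Codebooks are random placeholders; an RQ-VAE would learn them end to end.
import numpy as np

rng = np.random.default_rng(0)
num_levels, codebook_size, dim = 3, 256, 64
codebooks = rng.normal(size=(num_levels, codebook_size, dim))

def semantic_id(embedding: np.ndarray) -> list[int]:
    residual = embedding.copy()
    codes = []
    for level in range(num_levels):
        # choose the codeword closest to what is still unexplained
        distances = np.linalg.norm(codebooks[level] - residual, axis=1)
        index = int(distances.argmin())
        codes.append(index)
        residual = residual - codebooks[level][index]
    return codes

item_embedding = rng.normal(size=dim)   # e.g. from a content encoder
print(semantic_id(item_embedding))      # e.g. [212, 41, 7], a hierarchical item token
```
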
**Full research report:** [`docs/research_report.md`](docs/research_report.md)

## Project Vision

### Phase 1: Research & Survey (current)

- Literature survey of domain tokenization methods
- Analysis of tokenization strategies across recommendation, finance, tabular, and universal domains
- Blueprint for a general-purpose domain tokenizer

### Phase 2: Core Tokenizer Library

- Implement per-field tokenizers:
  - `SemanticIDTokenizer` – RQ-VAE for entity encoding
  - `MagnitudeTokenizer` – relative magnitude binning for numerical values
  - `TemporalTokenizer` – calendar + relative delta encoding
  - `CompositeTokenizer` – BPE-like merging of multi-field patterns (ActionPiece-style; sketched below)
- Schema-driven automatic tokenizer selection

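As a rough sketch of the ActionPiece-style merging mentioned above: treat each event as a set of field tokens, count which pairs co-occur most often, and promote the winning pair to a single composite token. The data, token names, and single-merge loop are illustrative assumptions; the actual algorithm adds vocabulary construction and ordering details beyond this.

```python
# Toy BPE-like merge over events, where each event is a set of field tokens.
# Illustrative only; the planned CompositeTokenizer would be richer than this.
from collections import Counter
from itertools import combinations

events = [
    {"Electronics", "$50-100", "Weekday"},
    {"Electronics", "$50-100", "Weekend"},
    {"Groceries", "$0-20", "Weekday"},
    {"Electronics", "$50-100", "Weekday"},
]

def merge_once(events):
    """Find the most frequent co-occurring token pair and fuse it into one token."""
    pair_counts = Counter(
        pair for ev in events for pair in combinations(sorted(ev), 2)
    )
    (a, b), _ = pair_counts.most_common(1)[0]
    composite = f"{{{a} + {b}}}"
    merged = [(ev - {a, b}) | {composite} if {a, b} <= ev else ev for ev in events]
    return merged, composite

events, new_token = merge_once(events)
print(new_token)    # -> {$50-100 + Electronics}
print(events[0])    # e.g. {'Weekday', '{$50-100 + Electronics}'}
```
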
### Phase 3: Pre-training Framework

- Self-supervised objectives: Masked Event Prediction, Next Event Prediction (sketched below)
- Small Transformer backbone (10M–350M parameters)
- Domain-agnostic training loop that works with any tokenizer configuration

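For the Next Event Prediction objective, a minimal training step could look like the sketch below: embed a window of domain-token IDs, run a small causal Transformer encoder, and score the token at each next position. PyTorch, the layer sizes, and the random stand-in data are all assumptions for illustration, not the project's settled design.

```python
# Minimal Next Event Prediction step: predict token t+1 from tokens <= t.
# PyTorch, sizes, and random data are placeholder assumptions for this sketch.
import torch
import torch.nn as nn

vocab_size, d_model, seq_len, batch = 1000, 64, 16, 8

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
head = nn.Linear(d_model, vocab_size)
params = list(embed.parameters()) + list(encoder.parameters()) + list(head.parameters())
optimizer = torch.optim.AdamW(params, lr=3e-4)

tokens = torch.randint(0, vocab_size, (batch, seq_len))    # stand-in for tokenized event streams
causal_mask = torch.triu(                                  # block attention to future events
    torch.full((seq_len - 1, seq_len - 1), float("-inf")), diagonal=1
)

hidden = encoder(embed(tokens[:, :-1]), mask=causal_mask)  # (batch, seq_len-1, d_model)
logits = head(hidden)                                      # (batch, seq_len-1, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
print(float(loss))
```
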
### Phase 4: Domain Demos

- E-commerce: next purchase prediction, customer segmentation
- Finance: fraud detection, credit scoring
- Healthcare: clinical event prediction

## Repo Structure

```
domainTokenizer/
├── docs/
│   └── research_report.md   # Detailed research findings (30+ papers)
├── src/                     # (coming) Core library
│   ├── tokenizers/          # Per-field tokenizer implementations
│   ├── models/              # Small Transformer backbones
│   └── training/            # Pre-training and fine-tuning
├── examples/                # (coming) Domain-specific demos
└── README.md
```

## Key References

| Paper | Year | What It Does | Link |
|-------|------|-------------|------|
| TIGER | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| ActionPiece | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| Banking TF | 2024 | Tokenizer for financial transactions | [arXiv](https://arxiv.org/abs/2410.08243) |
| LETTER | 2024 | Learnable item tokenization | [arXiv](https://arxiv.org/abs/2405.07314) |
| TP-BERTa | 2024 | Numerical value tokenization | [arXiv](https://arxiv.org/abs/2403.01841) |
| Meta-Transformer | 2023 | 12 modalities, one token space | [arXiv](https://arxiv.org/abs/2307.10802) |
| NTP Survey | 2024 | Comprehensive multimodal NTP taxonomy | [arXiv](https://arxiv.org/abs/2412.18619) |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |

See the [full reference table](docs/research_report.md#10-complete-paper-reference-table) with 31 papers in the research report.

## License

MIT

---

*domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.*