
# 🔑 domainTokenizer

Building small models that understand domain tokens, not just words.


## The Idea

LLMs process text by breaking it into tokens (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.

But words are just one type of sequential data. Businesses generate massive amounts of non-textual sequential data every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.

domainTokenizer explores a different approach: what if we built tokenizers that encode domain-specific entities (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?

```
Text LLM:        "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word

domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```
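
To make the idea concrete, here is a minimal, self-contained sketch of how a structured purchase event might become a short sequence of domain tokens. All class, field, and token names here are illustrative assumptions, not the project's API:

```python
# Hypothetical sketch: one purchase event becomes a few discrete domain
# tokens instead of text fragments. Names and rules are illustrative only.

from dataclasses import dataclass

@dataclass
class PurchaseEvent:
    category: str   # e.g. "Electronics"
    price: float    # e.g. 79.99
    weekday: str    # e.g. "Wednesday"

def tokenize_event(event: PurchaseEvent) -> list[str]:
    """Map one event to domain tokens (toy rules for illustration)."""
    price_band = "mid" if 50 <= event.price < 100 else "other"
    return [
        f"cat:{event.category}",
        f"price:{price_band}",
        f"day:{event.weekday}",
    ]

history = [PurchaseEvent("Electronics", 79.99, "Wednesday")]
tokens = [t for e in history for t in tokenize_event(e)]
print(tokens)  # ['cat:Electronics', 'price:mid', 'day:Wednesday']
```

A sequence of such tokens, rather than subwords, is what the small Transformer would model.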

## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
| --- | --- | --- |
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as a Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Wednesday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in a flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
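
The `price_bin_37` example above is quantile binning of numeric values. A minimal sketch of how such a magnitude tokenizer could be fit and applied, assuming 50 quantile bins and a synthetic price distribution (both assumptions for illustration):

```python
# Minimal sketch of percentile-based magnitude binning, as in the
# "price_bin_37" example above. Bin count and names are assumptions.

import numpy as np

def fit_price_bins(prices: np.ndarray, n_bins: int = 50) -> np.ndarray:
    """Learn bin edges from the empirical price distribution."""
    quantiles = np.linspace(0, 100, n_bins + 1)
    return np.percentile(prices, quantiles)

def price_to_token(price: float, edges: np.ndarray) -> str:
    """Map a raw price to its quantile-bin token."""
    bin_id = int(np.searchsorted(edges, price, side="right")) - 1
    bin_id = min(max(bin_id, 0), len(edges) - 2)  # clamp out-of-range prices
    return f"price_bin_{bin_id}"

rng = np.random.default_rng(0)
prices = rng.lognormal(mean=3.5, sigma=1.0, size=10_000)
edges = fit_price_bins(prices, n_bins=50)
print(price_to_token(79.99, edges))  # prints something like 'price_bin_40'
```

The token carries *relative* magnitude (where the price sits in the distribution), which is exactly what a flat text tokenizer loses.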

## Research Foundation

This project is grounded in 30+ papers from Google, Google DeepMind, and the broader research community. The key finding: any sequential domain data can be tokenized and modeled with the LLM paradigm; the challenge is *how* to tokenize.

Five paradigms have emerged:

| Paradigm | Method | Key Paper |
| --- | --- | --- |
| Semantic IDs | RQ-VAE quantization of item embeddings | TIGER (Google, 2023) |
| Action Tokenization | BPE-like merging of feature patterns | ActionPiece (DeepMind, 2025) |
| Transaction Tokenization | Composite (date + amount + text) encoding | Banking TF (2024) |
| Tabular Tokenization | Relative magnitude encoding for numbers | TP-BERTa (2024) |
| Universal Tokenization | All modalities → shared discrete space | Meta-Transformer (2023) |
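
As a rough illustration of the Semantic ID paradigm, the sketch below shows the mechanics of residual quantization: each level picks the nearest codebook entry to the remaining residual, yielding a short hierarchical code like `[23, 7, 41]`. In a real RQ-VAE (as in TIGER) the codebooks are learned jointly with an encoder-decoder; here they are random, purely to show the mechanism:

```python
# Toy residual quantization: NOT a trained RQ-VAE, just the code-assignment
# step that turns an item embedding into a multi-level Semantic ID.

import numpy as np

def residual_quantize(x: np.ndarray, codebooks: list[np.ndarray]) -> list[int]:
    """Return one code index per level; each level quantizes the residual."""
    semantic_id = []
    residual = x.copy()
    for codebook in codebooks:               # each codebook: (size, dim)
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(dists.argmin())            # nearest codeword at this level
        semantic_id.append(idx)
        residual = residual - codebook[idx]  # next level refines the remainder
    return semantic_id

rng = np.random.default_rng(42)
dim, levels, size = 8, 3, 64
codebooks = [rng.normal(size=(size, dim)) for _ in range(levels)]
item_embedding = rng.normal(size=dim)
print(residual_quantize(item_embedding, codebooks))  # e.g. a code like [23, 7, 41]
```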

📄 Full research report: `docs/research_report.md`

## Project Vision

### Phase 1: Research & Survey (✅ Current)

  • Literature survey of domain tokenization methods
  • Analysis of tokenization strategies across recommendation, finance, tabular, and universal domains
  • Blueprint for a general-purpose domain tokenizer

### Phase 2: Core Tokenizer Library

  • Implement per-field tokenizers:
    • SemanticIDTokenizer β€” RQ-VAE for entity encoding
    • MagnitudeTokenizer β€” relative magnitude binning for numerical values
    • TemporalTokenizer β€” calendar + relative delta encoding
    • CompositeTokenizer β€” BPE-like merging of multi-field patterns (ActionPiece-style)
  • Schema-driven automatic tokenizer selection

### Phase 3: Pre-training Framework

  • Self-supervised objectives: Masked Event Prediction, Next Event Prediction
  • Small Transformer backbone (10M–350M parameters)
  • Domain-agnostic training loop that works with any tokenizer configuration

### Phase 4: Domain Demos

  • E-commerce: next purchase prediction, customer segmentation
  • Finance: fraud detection, credit scoring
  • Healthcare: clinical event prediction

## Repo Structure

```
domainTokenizer/
├── docs/
│   └── research_report.md    # Detailed research findings (30+ papers)
├── src/                      # (coming) Core library
│   ├── tokenizers/           # Per-field tokenizer implementations
│   ├── models/               # Small Transformer backbones
│   └── training/             # Pre-training and fine-tuning
├── examples/                 # (coming) Domain-specific demos
└── README.md
```

## Key References

| Paper | Year | What It Does | Link |
| --- | --- | --- | --- |
| TIGER | 2023 | Semantic IDs for products via RQ-VAE | arXiv |
| ActionPiece | 2025 | BPE for user action sequences | arXiv |
| Banking TF | 2024 | Tokenizer for financial transactions | arXiv |
| LETTER | 2024 | Learnable item tokenization | arXiv |
| TP-BERTa | 2024 | Numerical value tokenization | arXiv |
| Meta-Transformer | 2023 | 12 modalities, one token space | arXiv |
| NTP Survey | 2024 | Comprehensive multimodal NTP taxonomy | arXiv |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | arXiv |

See the full reference table with 31 papers in the research report.

## License

MIT


domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.