
# 🔑 domainTokenizer

Building small models that understand domain tokens, not just words.


## The Idea

LLMs process text by breaking it into tokens (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.

But words are just one type of sequential data. Businesses generate massive amounts of non-textual sequential data every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.

domainTokenizer explores a different approach: what if we built tokenizers that encode domain-specific entities (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?

    Text LLM:        "The cat sat on the mat"   → [The] [cat] [sat] [on] [the] [mat]                       → Transformer → next word
    domainTokenizer: Customer purchase history  → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase

## 🏦 Industry Validation: Nubank's nuFormer

This isn't just theory. Nubank (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:

Their approach: 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → ~14 tokens per transaction → GPT-style Transformer (24M-330M params) → +1.25% relative AUC over LightGBM (3× their production launch threshold).
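
To make the recipe concrete, below is a rough, hypothetical sketch of the flattening step: structured fields become special tokens and the merchant string goes through BPE. The field names, bin edges, and token strings are placeholders, not Nubank's actual vocabulary.

```python
# Hypothetical sketch of nuFormer-style transaction tokenization (not Nubank's code):
# structured fields -> special tokens, merchant description -> BPE pieces.
import bisect

# Illustrative amount-bin edges; a real pipeline would fit percentile bins on data.
AMOUNT_BIN_EDGES = [5, 10, 20, 50, 100, 250, 500, 1000]

def fake_bpe(text):
    """Toy stand-in for a trained BPE tokenizer (illustration only)."""
    return text.split()

def transaction_to_tokens(amount, weekday, hour, is_credit, merchant, bpe_encode=fake_bpe):
    """Flatten one transaction into a short sequence of special tokens + merchant BPE."""
    tokens = ["[TX]"]
    tokens.append("[CREDIT]" if is_credit else "[DEBIT]")
    tokens.append(f"[AMOUNT_BIN_{bisect.bisect(AMOUNT_BIN_EDGES, abs(amount))}]")
    tokens.append(f"[WEEKDAY_{weekday}]")        # 0 = Monday ... 6 = Sunday
    tokens.append(f"[HOUR_BIN_{hour // 6}]")     # four coarse day-part bins
    tokens.extend(bpe_encode(merchant.lower()))  # merchant text via BPE
    tokens.append("[/TX]")
    return tokens

print(transaction_to_tokens(79.99, weekday=2, hour=15, is_credit=False,
                            merchant="ACME Electronics"))
# ['[TX]', '[DEBIT]', '[AMOUNT_BIN_4]', '[WEEKDAY_2]', '[HOUR_BIN_2]', 'acme', 'electronics', '[/TX]']
```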

📄 Full reverse-engineering: `docs/nubank_nuformer_analysis.md`

## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
|---|---|---|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Wednesday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
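
The last row is the least obvious, so here is a deliberately simplified sketch of the idea behind it: count which field-level tokens co-occur within the same event and promote the most frequent pair to a single composite token. The events and token names below are made up, and methods like ActionPiece handle ordering and weighting far more carefully.

```python
from collections import Counter
from itertools import combinations

# Each event is a set of field-level tokens (toy data, hypothetical token names).
events = [
    {"cat=Electronics", "price_bin=50-100", "day=Weekday"},
    {"cat=Electronics", "price_bin=50-100", "day=Weekend"},
    {"cat=Grocery",     "price_bin=0-20",   "day=Weekday"},
    {"cat=Electronics", "price_bin=50-100", "day=Weekday"},
]

# Count co-occurring token pairs within each event.
pair_counts = Counter()
for event in events:
    for a, b in combinations(sorted(event), 2):
        pair_counts[(a, b)] += 1

# One BPE-like step: merge the most frequent pair into a composite token.
(best_a, best_b), count = pair_counts.most_common(1)[0]
composite = f"{{{best_a} + {best_b}}}"
print(f"merge ({count}x): {composite}")
# merge (3x): {cat=Electronics + price_bin=50-100}

# Rewrite events with the merged token; repeating this grows a composite vocabulary.
merged_events = [
    (event - {best_a, best_b}) | {composite} if {best_a, best_b} <= event else event
    for event in events
]
```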

## Research Foundation

This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: any sequential domain data can be tokenized and modeled with the LLM paradigm; the challenge is how to tokenize.

| Paradigm | Method | Key Paper |
|---|---|---|
| Semantic IDs | RQ-VAE quantization of item embeddings | TIGER (Google, 2023) |
| Action Tokenization | BPE-like merging of feature patterns | ActionPiece (DeepMind, 2025) |
| Transaction Tokenization | Special tokens + BPE hybrid | nuFormer (Nubank, 2025) |
| Tabular Tokenization | Periodic embeddings for numbers | PLR (Yandex, 2022) |
| Universal Tokenization | All modalities → shared discrete space | Meta-Transformer (2023) |
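
For the Semantic ID row, the core mechanism is residual quantization: match an item embedding against a codebook, match the leftover residual against the next codebook, and so on, until the item is reduced to a short tuple of codebook indices. The sketch below uses random, untrained codebooks purely to show the shape of the computation; TIGER learns them jointly inside an RQ-VAE.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, LEVELS, CODEBOOK_SIZE = 64, 3, 256

# Untrained random codebooks, one per quantization level (illustration only).
codebooks = [rng.normal(size=(CODEBOOK_SIZE, DIM)) for _ in range(LEVELS)]

def semantic_id(item_embedding):
    """Residual-quantize an item embedding into a tuple of codebook indices."""
    residual = item_embedding
    indices = []
    for codebook in codebooks:
        # Nearest codeword at this level.
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        indices.append(idx)
        # Subtract the chosen codeword and carry the residual to the next level.
        residual = residual - codebook[idx]
    return tuple(indices)

item = rng.normal(size=DIM)  # stand-in for a learned item embedding
print(semantic_id(item))     # e.g. (204, 17, 151) -> tokens like [sid_0_204] [sid_1_17] [sid_2_151]
```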

## Documentation

| Document | Description |
|---|---|
| 📄 `docs/research_report.md` | Comprehensive research survey: 31 papers across 5 paradigms, technical taxonomy, full blueprint |
| 🏦 `docs/nubank_nuformer_analysis.md` | Nubank reverse-engineering: complete pipeline reconstruction, 4 academic pillars, adaptation playbooks |
| 🏗️ `docs/adr/ADR-001-implementation-framework.md` | Architecture Decision Record: framework choice (PyTorch + HF), trade-offs vs. JAX/Keras, detailed implementation roadmap with code |

## Implementation Decision

After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:

**Decision: PyTorch + HuggingFace Transformers** (with JAX as a future scaling path)

Key reasons:

- 5 of 6 reference papers use PyTorch (including Google DeepMind's ActionPiece)
- HuggingFace has the only complete custom tokenizer pipeline (`PreTrainedTokenizerFast` → `Trainer` → `push_to_hub`); see the sketch below
- Production deployment is direct: ONNX, TGI, and vLLM are all first-class
- JAX advantages (TPU, XLA) only matter at >1B params on 256+ accelerators, not at our 24M-330M scale

Full analysis: `docs/adr/ADR-001-implementation-framework.md`
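
A minimal sketch of that pipeline for a closed domain vocabulary (placeholder tokens, not the Phase 2 schema): build a word-level `tokenizers` model over the domain tokens and wrap it in `PreTrainedTokenizerFast`, so the usual `Trainer` / `push_to_hub` machinery applies unchanged.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast

# Placeholder domain vocabulary; Phase 2 would generate this from a field schema.
vocab = {"[PAD]": 0, "[UNK]": 1, "[TX]": 2, "[/TX]": 3,
         "[DEBIT]": 4, "[CREDIT]": 5, "[AMOUNT_BIN_4]": 6, "[WEEKDAY_2]": 7}

core = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
core.pre_tokenizer = WhitespaceSplit()  # domain tokens arrive as whitespace-separated strings

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=core,
    unk_token="[UNK]",
    pad_token="[PAD]",
)

encoded = tokenizer("[TX] [DEBIT] [AMOUNT_BIN_4] [WEEKDAY_2] [/TX]")
print(encoded["input_ids"])  # [2, 4, 6, 7, 3]
# From here, tokenizer.push_to_hub(...) and the HF Trainer work as with any text tokenizer.
```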

## Project Roadmap

### Phase 1: Research & Survey ✅

- Literature survey (35+ papers)
- Nubank nuFormer reverse-engineering
- Framework ADR with detailed implementation plan

### Phase 2: Core Library (next, ~9 weeks)

- Weeks 1–3: Domain tokenizer library (schema → per-field tokenizers → HF-compatible composite tokenizer)
- Weeks 3–5: GPT-style Transformer with NoPE + PLR embeddings + DCNv2 joint fusion (a rough PLR sketch follows this list)
- Weeks 5–7: Pre-training pipeline (CLM on domain sequences via the HF Trainer)
- Weeks 7–9: Fine-tuning pipeline (nuFormer-style joint fusion)
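
The PLR piece from Weeks 3–5 is small enough to sketch here. Following the general PLR recipe (periodic features of a scalar, then a linear layer, then ReLU), each raw numeric field gets a learned frequency bank; the sizes and initialization scale below are arbitrary placeholders rather than tuned values.

```python
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic + Linear + ReLU embedding for one scalar feature (rough sketch).

    x -> [cos(2*pi*c_i*x), sin(2*pi*c_i*x)] -> Linear -> ReLU, with learnable c_i.
    """

    def __init__(self, n_frequencies: int = 48, d_embedding: int = 64, sigma: float = 0.1):
        super().__init__()
        # Learnable frequencies, initialized from N(0, sigma^2).
        self.frequencies = nn.Parameter(torch.randn(n_frequencies) * sigma)
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) raw scalar values -> (batch, d_embedding)
        angles = 2 * torch.pi * self.frequencies * x.unsqueeze(-1)
        periodic = torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)
        return torch.relu(self.linear(periodic))

amounts = torch.tensor([3.50, 79.99, 1200.0])   # e.g. raw transaction amounts
print(PLREmbedding()(amounts).shape)            # torch.Size([3, 64])
```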

### Phase 3: Domain Demos (Weeks 9–12)

- Finance: fraud detection, credit scoring
- E-commerce: next-purchase prediction, customer segmentation

### Phase 4: Scale & Optimize (Weeks 12+)

- 330M-param models, `torch.compile()`, ONNX export, ActionPiece vocabulary

## Repo Structure

    domainTokenizer/
    ├── docs/
    │   ├── research_report.md              # 51KB - Full research survey
    │   ├── nubank_nuformer_analysis.md     # 29KB - Nubank pipeline analysis
    │   └── adr/
    │       └── ADR-001-implementation-framework.md  # Framework decision + roadmap
    ├── src/                                 # (Phase 2) Core library
    │   ├── tokenizers/                      # Schema, field tokenizers, composite builder
    │   ├── models/                          # DomainTransformer, PLR, DCNv2, JointFusion
    │   └── training/                        # Data pipeline, pre-training, fine-tuning
    ├── examples/                            # (Phase 3) Domain-specific demos
    └── README.md

## Key References

| Paper | Year | What It Does | Link |
|---|---|---|---|
| nuFormer (Nubank) | 2025 | Transaction foundation model at production scale | arXiv |
| TIGER (Google) | 2023 | Semantic IDs for products via RQ-VAE | arXiv |
| ActionPiece (DeepMind) | 2025 | BPE for user action sequences | arXiv |
| RecFormer | 2023 | Items as key-value text representations | arXiv |
| PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | arXiv |
| DCN V2 (Google) | 2021 | Feature crossing for tabular data | arXiv |
| NoPE | 2023 | No positional encoding beats RoPE/ALiBi | arXiv |
| KL3M Tokenizers | 2025 | Domain-specific BPE for finance/legal | arXiv |
| Banking TF | 2024 | Transaction tokenizer for French banking | arXiv |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | arXiv |

Full reference table (35+ papers): `docs/research_report.md`

## License

MIT


domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.