# domainTokenizer

Building small models that understand domain tokens, not just words.
## The Idea
LLMs process text by breaking it into tokens: subword units like `ing`, `the`, `un`. Tokenization is what turns raw text into the discrete sequences that Transformers model.

But words are just one kind of sequential data. Businesses generate massive amounts of non-textual sequential data every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.

domainTokenizer explores a different approach: what if we built tokenizers that encode domain-specific entities (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?
- Text LLM: `"The cat sat on the mat"` → `[The] [cat] [sat] [on] [the] [mat]` → Transformer → next word
- domainTokenizer: customer purchase history → `[HighEndElectronics] [WeekdayCredit] [Accessory+SameDay]` → Transformer → next purchase
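To make the encoding step concrete, here is a minimal, hypothetical sketch in Python: each field of a raw event is mapped to a discrete domain token, so a purchase history becomes a token sequence a standard Transformer can model. The function names and bin edges are illustrative, not the project's API.

```python
# Hypothetical sketch: turn raw purchase events into domain tokens.
# Bin edges and token names are illustrative, not the project's API.
from datetime import datetime

PRICE_BINS = [10, 50, 100, 500]  # USD boundaries -> 5 price buckets

def price_token(amount: float) -> str:
    """Map a raw amount to a coarse price-bucket token."""
    bucket = sum(amount >= edge for edge in PRICE_BINS)
    return f"[price_bin_{bucket}]"

def encode_event(category: str, amount: float, ts: str) -> list[str]:
    """One event -> a handful of discrete domain tokens."""
    dt = datetime.fromisoformat(ts)
    return [
        f"[{category}]",           # categorical field kept as-is
        price_token(amount),       # number -> quantized bucket
        f"[{dt.strftime('%A')}]",  # timestamp -> calendar token
    ]

history = [("Electronics", 79.99, "2025-03-15T14:30:00"),
           ("Accessory", 19.99, "2025-03-15T16:05:00")]
print([t for event in history for t in encode_event(*event)])
# ['[Electronics]', '[price_bin_2]', '[Saturday]',
#  '[Accessory]', '[price_bin_1]', '[Saturday]']
```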
## Industry Validation: Nubank's nuFormer
This isn't just theory. Nubank (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:
- Paper: "Your spending needs attention: Modeling financial habits with transformers" (arXiv: 2507.23267)
- Blog series: Building Nubank (Foundation Models)
Their approach: 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → ~14 tokens per transaction → GPT-style Transformer (24M–330M params) → +1.25% relative AUC over LightGBM (3× their production launch threshold).
Full reverse-engineering: docs/nubank_nuformer_analysis.md
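The recipe above translates naturally into the HuggingFace `tokenizers` library. Below is a rough sketch, not Nubank's actual vocabulary: the token names, bin counts, and three-line merchant corpus are all placeholders.

```python
# Rough sketch of the hybrid recipe: special tokens for structured
# fields + BPE for free-text merchant descriptions. All names and
# counts are placeholders. Requires: pip install tokenizers
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# 1. Special tokens for structured fields (amount bin, weekday, sign).
special = ([f"[amount_bin_{i}]" for i in range(50)]
           + [f"[{d}]" for d in ("mon", "tue", "wed", "thu", "fri", "sat", "sun")]
           + ["[credit]", "[debit]"])

# 2. BPE trained on merchant description strings only.
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=2000, special_tokens=["[UNK]"] + special)
tok.train_from_iterator(["PADARIA SAO JORGE", "UBER TRIP", "IFOOD"], trainer)

# 3. One transaction -> structured special tokens + BPE merchant pieces.
print(tok.encode("[debit] [amount_bin_37] [sat] UBER TRIP").tokens)
```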
## Why This Matters
| Problem | Text Tokenizer | Domain Tokenizer |
|---|---|---|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as a Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in a flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
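The last row is the least obvious, so here is a minimal sketch of one BPE-style merge step in the spirit of ActionPiece, assuming a simple most-frequent-adjacent-pair rule; the token strings and merge criterion are illustrative.

```python
# Minimal sketch of BPE-like cross-field merging: the most frequent
# adjacent token pair across histories becomes one composite token.
from collections import Counter

def merge_most_frequent(histories: list[list[str]]) -> list[list[str]]:
    """Apply one BPE-style merge step over tokenized event histories."""
    pairs = Counter((a, b) for h in histories for a, b in zip(h, h[1:]))
    (a, b), _ = pairs.most_common(1)[0]
    merged, out = f"{{{a}+{b}}}", []
    for h in histories:
        new, i = [], 0
        while i < len(h):
            if i + 1 < len(h) and (h[i], h[i + 1]) == (a, b):
                new.append(merged)  # replace the pair with the composite
                i += 2
            else:
                new.append(h[i])
                i += 1
        out.append(new)
    return out

histories = [["Electronics", "$50-100", "Weekday"],
             ["Electronics", "$50-100", "Weekend"],
             ["Grocery", "$0-50", "Weekday"]]
print(merge_most_frequent(histories)[0])
# ['{Electronics+$50-100}', 'Weekday']
```

Repeated to convergence, this grows a vocabulary of composite domain tokens, analogous to BPE's subword merges over characters.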
## Research Foundation
This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: any sequential domain data can be tokenized and modeled with the LLM paradigm; the challenge is how to tokenize.
| Paradigm | Method | Key Paper |
|---|---|---|
| Semantic IDs | RQ-VAE quantization of item embeddings | TIGER (Google, 2023) |
| Action Tokenization | BPE-like merging of feature patterns | ActionPiece (DeepMind, 2025) |
| Transaction Tokenization | Special tokens + BPE hybrid | nuFormer (Nubank, 2025) |
| Tabular Tokenization | Periodic embeddings for numbers | PLR (Yandex, 2022) |
| Universal Tokenization | All modalities → shared discrete space | Meta-Transformer (2023) |
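Of these, the PLR row translates most directly into code. Here is a minimal PyTorch sketch of periodic numeric embeddings in the spirit of the Yandex paper: each scalar is expanded with sin/cos at learned frequencies, then projected through a linear layer and ReLU. All dimensions are arbitrary.

```python
# Minimal sketch of PLR (Periodic-Linear-ReLU) numeric embeddings,
# after Gorishniy et al. 2022. All dimensions are arbitrary.
import math
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """x -> [sin(2*pi*c*x), cos(2*pi*c*x)] -> Linear -> ReLU."""

    def __init__(self, n_features: int, n_freq: int = 8, d_embed: int = 24):
        super().__init__()
        # One set of learnable frequencies per numeric feature.
        self.freq = nn.Parameter(torch.randn(n_features, n_freq))
        self.proj = nn.Linear(2 * n_freq, d_embed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features) raw scalars, e.g. amounts or ages.
        v = 2 * math.pi * self.freq[None] * x[..., None]            # (B, F, K)
        periodic = torch.cat([torch.sin(v), torch.cos(v)], dim=-1)  # (B, F, 2K)
        return torch.relu(self.proj(periodic))                      # (B, F, d)

emb = PLREmbedding(n_features=2)
print(emb(torch.tensor([[79.99, 3.0]])).shape)  # torch.Size([1, 2, 24])
```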
## Documentation
| Document | Description |
|---|---|
| `docs/research_report.md` | Comprehensive research survey: 31 papers across 5 paradigms, technical taxonomy, full blueprint |
| `docs/nubank_nuformer_analysis.md` | Nubank reverse-engineering: complete pipeline reconstruction, 4 academic pillars, adaptation playbooks |
| `docs/adr/ADR-001-implementation-framework.md` | Architecture Decision Record: framework choice (PyTorch + HF), trade-offs vs JAX/Keras, detailed implementation roadmap with code |
## Implementation Decision
After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:

**Decision: PyTorch + HuggingFace Transformers** (with JAX as a future scaling path)

Key reasons:

- 5 of 6 reference papers use PyTorch (including Google DeepMind's ActionPiece)
- HuggingFace has the only complete custom-tokenizer pipeline (`PreTrainedTokenizerFast` → `Trainer` → `push_to_hub`; sketched below)
- Production deployment is direct: ONNX, TGI, and vLLM are all first-class
- JAX's advantages (TPU, XLA) only matter at >1B params on 256+ accelerators, not at our 24M–330M scale
Full analysis: docs/adr/ADR-001-implementation-framework.md
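To confirm that pipeline composes end to end, here is a minimal sketch wrapping a trained `tokenizers.Tokenizer` (the `tok` from the nuFormer sketch above) for the HF `Trainer`. The model config and training arguments are toy placeholders, and the dataset is left out.

```python
# Minimal sketch of the pipeline: custom tokenizer ->
# PreTrainedTokenizerFast -> Trainer -> push_to_hub.
# Assumes `tok` is a trained tokenizers.Tokenizer (see earlier sketch);
# the model config and training arguments are toy placeholders.
from transformers import (GPT2Config, GPT2LMHeadModel,
                          PreTrainedTokenizerFast, Trainer, TrainingArguments)

hf_tok = PreTrainedTokenizerFast(tokenizer_object=tok,
                                 unk_token="[UNK]", pad_token="[UNK]")

config = GPT2Config(vocab_size=len(hf_tok), n_positions=512,
                    n_embd=256, n_layer=6, n_head=8)  # toy config
model = GPT2LMHeadModel(config)

trainer = Trainer(model=model,
                  args=TrainingArguments(output_dir="out"),
                  train_dataset=None)  # plug in a tokenized domain dataset
# trainer.train()
# hf_tok.push_to_hub("my-org/domain-tokenizer")  # hypothetical repo id
```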
## Project Roadmap
### Phase 1: Research & Survey (complete)
- Literature survey (35+ papers)
- Nubank nuFormer reverse-engineering
- Framework ADR with detailed implementation plan
### Phase 2: Core Library (next, ~9 weeks)
- Weeks 1–3: Domain tokenizer library (schema → per-field tokenizers → HF-compatible composite tokenizer); see the schema sketch after this list
- Weeks 3–5: GPT-style Transformer with NoPE + PLR embeddings + DCNv2 joint fusion
- Weeks 5–7: Pre-training pipeline (CLM on domain sequences via the HF Trainer)
- Weeks 7–9: Fine-tuning pipeline (nuFormer-style joint fusion)
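A schema-first design for the Weeks 1–3 deliverable could look like the following hypothetical sketch; every class and field name is illustrative, not the library's final API.

```python
# Hypothetical sketch of a schema-first tokenizer config: declare the
# fields once, then derive a per-field tokenization strategy for each.
# All class and field names are illustrative, not the final API.
from dataclasses import dataclass, field

@dataclass
class FieldSpec:
    name: str
    kind: str          # 'categorical' | 'numeric' | 'timestamp' | 'text'
    n_bins: int = 0    # numeric fields: quantile bins -> special tokens
    bpe: bool = False  # text fields: route through the BPE sub-tokenizer

@dataclass
class DomainSchema:
    fields: list[FieldSpec] = field(default_factory=list)

    def vocab_plan(self) -> dict[str, str]:
        """Report which tokenization strategy each field gets."""
        return {f.name: ("bpe" if f.bpe
                         else f"{f.n_bins}_bins" if f.kind == "numeric"
                         else f.kind)
                for f in self.fields}

schema = DomainSchema([
    FieldSpec("merchant", "text", bpe=True),
    FieldSpec("amount", "numeric", n_bins=50),
    FieldSpec("timestamp", "timestamp"),
])
print(schema.vocab_plan())
# {'merchant': 'bpe', 'amount': '50_bins', 'timestamp': 'timestamp'}
```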
### Phase 3: Domain Demos (Weeks 9–12)
- Finance: fraud detection, credit scoring
- E-commerce: next purchase prediction, customer segmentation
### Phase 4: Scale & Optimize (Weeks 12+)
- 330M-param models, `torch.compile()`, ONNX export, ActionPiece vocabulary (see the sketch below)
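Both optimizations are thin wrappers in PyTorch. A minimal sketch, assuming a trained `model` and a sample `input_ids` batch already exist:

```python
# Minimal sketch of the Phase 4 optimizations. Assumes a trained
# `model` (e.g. a GPT2LMHeadModel) and a sample `input_ids` batch.
import torch

compiled = torch.compile(model)    # kernel fusion via TorchInductor
_ = compiled(input_ids=input_ids)  # first call triggers compilation

model.config.return_dict = False   # torch.onnx.export needs tuple outputs
torch.onnx.export(
    model, (input_ids,), "domain_transformer.onnx",
    input_names=["input_ids"], output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"}},
)
```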
## Repo Structure
```
domainTokenizer/
├── docs/
│   ├── research_report.md              # 51KB: full research survey
│   ├── nubank_nuformer_analysis.md     # 29KB: Nubank pipeline analysis
│   └── adr/
│       └── ADR-001-implementation-framework.md  # Framework decision + roadmap
├── src/                  # (Phase 2) Core library
│   ├── tokenizers/       # Schema, field tokenizers, composite builder
│   ├── models/           # DomainTransformer, PLR, DCNv2, JointFusion
│   └── training/         # Data pipeline, pre-training, fine-tuning
├── examples/             # (Phase 3) Domain-specific demos
└── README.md
```
## Key References
| Paper | Year | What It Does | Link |
|---|---|---|---|
| nuFormer (Nubank) | 2025 | Transaction foundation model at production scale | arXiv |
| TIGER (Google) | 2023 | Semantic IDs for products via RQ-VAE | arXiv |
| ActionPiece (DeepMind) | 2025 | BPE for user action sequences | arXiv |
| RecFormer | 2023 | Items as key-value text representations | arXiv |
| PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | arXiv |
| DCN V2 (Google) | 2021 | Feature crossing for tabular data | arXiv |
| NoPE | 2023 | No positional encoding outperforms RoPE/ALiBi on length generalization | arXiv |
| KL3M Tokenizers | 2025 | Domain-specific BPE for finance/legal | arXiv |
| Banking TF | 2024 | Transaction tokenizer for French banking | arXiv |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | arXiv |
Full reference table (35+ papers): docs/research_report.md
## License
MIT
domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.