# domainTokenizer
**Building small models that understand domain tokens, not just words.**
---
## The Idea
LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.
But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.
**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?
```
Text LLM:        "The cat sat on the mat"  → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word
domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```
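A minimal sketch of what such a domain tokenizer could emit for a short purchase history; the field names, price buckets, and token strings are illustrative assumptions, not a finished vocabulary:

```python
# Toy rule-based domain tokenizer. Field names, price buckets, and token
# strings are illustrative assumptions, not a finished vocabulary.
from datetime import datetime

PRICE_BINS = [0, 20, 100, 500, float("inf")]        # toy price buckets
PRICE_LABELS = ["BUDGET", "MID", "HIGH_END", "LUXURY"]

def tokenize_event(event: dict) -> list[str]:
    """Map one purchase event to a short sequence of domain tokens."""
    # The product category becomes a first-class token instead of subword fragments.
    tokens = [f"[CAT_{event['category'].upper()}]"]

    # The price is bucketed instead of being spelled out digit by digit.
    for upper, label in zip(PRICE_BINS[1:], PRICE_LABELS):
        if event["price"] < upper:
            tokens.append(f"[PRICE_{label}]")
            break

    # Calendar structure is made explicit.
    weekday = datetime.fromisoformat(event["timestamp"]).strftime("%A").upper()
    tokens.append(f"[{weekday}]")
    return tokens

history = [
    {"category": "electronics", "price": 349.0, "timestamp": "2025-03-15T14:02:00"},
    {"category": "accessory",   "price": 19.9,  "timestamp": "2025-03-15T18:40:00"},
]
print([tok for event in history for tok in tokenize_event(event)])
# ['[CAT_ELECTRONICS]', '[PRICE_HIGH_END]', '[SATURDAY]',
#  '[CAT_ACCESSORY]', '[PRICE_BUDGET]', '[SATURDAY]']
```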
## Industry Validation: Nubank's nuFormer
This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:
- **Paper:** ["Your spending needs attention: Modeling financial habits with transformers"](https://arxiv.org/abs/2507.23267) (arXiv: 2507.23267)
- **Blog series:** [Building Nubank β Foundation Models](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/)
**Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M–330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).
**Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)
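To make the ~14-tokens-per-transaction budget concrete, here is a purely illustrative reconstruction of how a single transaction might be laid out under such a scheme. The token names and field set are assumptions made for this sketch, not Nubank's actual vocabulary:

```python
# Illustrative layout of one transaction under a nuFormer-style scheme:
# special tokens for structured fields plus BPE pieces for the merchant
# string. All token names here are assumptions, not Nubank's vocabulary.
transaction = {
    "amount": -42.70,                     # negative = debit
    "timestamp": "2025-03-15T14:02:00",
    "merchant": "PADARIA SAO JORGE",
}

tokens = [
    "[TX_START]",
    "[DEBIT]",                            # credit/debit sign as a special token
    "[AMOUNT_BIN_21]",                    # quantile bin for the amount
    "[MONTH_03]", "[DAY_15]", "[WEEKDAY_SAT]", "[HOUR_14]",   # calendar features
    "PAD", "ARIA", "SAO", "JOR", "GE",    # merchant description via a small BPE vocab
    "[TX_END]",
]
assert len(tokens) == 13                  # on the order of the ~14 tokens reported above
```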
## Why This Matters
| Problem | Text Tokenizer | Domain Tokenizer |
|---------|---------------|-----------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
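The last row is the least obvious, so here is a toy sketch of how composite tokens could be discovered in the spirit of BPE/ActionPiece: count co-occurring field-value pairs across a corpus and promote the most frequent pair to a single token. The pair fields, merge threshold, and token syntax are illustrative assumptions:

```python
# Toy BPE-style discovery of cross-field composite tokens: the most frequent
# (category, price_bin) pair is promoted to one composite token. Threshold
# and token syntax are illustrative assumptions.
from collections import Counter

events = [
    ("Electronics", "price_50_100"), ("Electronics", "price_50_100"),
    ("Grocery", "price_0_20"), ("Electronics", "price_50_100"),
    ("Grocery", "price_20_50"),
]

best_pair, count = Counter(events).most_common(1)[0]

MIN_MERGE_COUNT = 3
merges = {best_pair: "{" + "+".join(best_pair) + "}"} if count >= MIN_MERGE_COUNT else {}

def encode(event: tuple[str, str]) -> list[str]:
    """Emit one composite token if a merge exists, otherwise the raw field tokens."""
    return [merges[event]] if event in merges else list(event)

print([encode(e) for e in events])
# [['{Electronics+price_50_100}'], ['{Electronics+price_50_100}'],
#  ['Grocery', 'price_0_20'], ['{Electronics+price_50_100}'],
#  ['Grocery', 'price_20_50']]
```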
## Research Foundation
This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm**; the challenge is *how* to tokenize.
| Paradigm | Method | Key Paper |
|----------|--------|-----------|
| **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
| **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
| **Transaction Tokenization** | Special tokens + BPE hybrid | [nuFormer](https://arxiv.org/abs/2507.23267) (Nubank, 2025) |
| **Tabular Tokenization** | Periodic embeddings for numbers | [PLR](https://arxiv.org/abs/2203.05556) (Yandex, 2022) |
| **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |
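As a concrete example of the Semantic-ID paradigm (first row above), here is a minimal residual-quantization sketch: each level quantizes what is left of the item embedding against its own codebook, producing a short coarse-to-fine code. The codebooks below are random placeholders, whereas TIGER learns them inside an RQ-VAE:

```python
# Minimal residual quantization: an item embedding -> hierarchical Semantic ID.
# Codebooks are random placeholders; TIGER learns them inside an RQ-VAE.
import numpy as np

rng = np.random.default_rng(0)
num_levels, codebook_size, dim = 3, 64, 32
codebooks = rng.normal(size=(num_levels, codebook_size, dim))

def semantic_id(item_embedding: np.ndarray) -> list[int]:
    """Quantize one embedding into a coarse-to-fine code, one index per level."""
    residual, code = item_embedding, []
    for level in range(num_levels):
        distances = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(distances.argmin())
        code.append(idx)
        residual = residual - codebooks[level, idx]   # quantize what remains
    return code

item = rng.normal(size=dim)
print(semantic_id(item))   # e.g. [17, 42, 5] -> tokens like [sid_0_17][sid_1_42][sid_2_5]
```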
## Documentation
| Document | Description |
|----------|-------------|
| [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey**: 31 papers across 5 paradigms, technical taxonomy, full blueprint |
| [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering**: complete pipeline reconstruction, 4 academic pillars, adaptation playbooks |
| [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record**: framework choice (PyTorch+HF), trade-offs vs JAX/Keras, detailed implementation roadmap with code |
## Implementation Decision
After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:
**Decision: PyTorch + HuggingFace Transformers** (with JAX as future scaling path)
Key reasons:
- **5 of 6 reference papers use PyTorch** (including Google DeepMind's ActionPiece)
- **HuggingFace has the only complete custom tokenizer pipeline** (`PreTrainedTokenizerFast` → Trainer → push_to_hub; sketched below)
- **Production deployment is direct:** ONNX, TGI, vLLM all first-class
- JAX advantages (TPU, XLA) only matter at >1B params on 256+ accelerators, not at our 24M–330M scale
Full analysis: [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md)
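As a taste of that pipeline, here is a minimal sketch that wraps a fixed domain vocabulary in a fast tokenizer backend and exposes it as a `PreTrainedTokenizerFast`; the vocabulary and output directory are toy placeholders:

```python
# Wrap a toy domain vocabulary as a HuggingFace fast tokenizer.
# Requires the `tokenizers` and `transformers` packages.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast

vocab = {"[PAD]": 0, "[UNK]": 1, "[CAT_ELECTRONICS]": 2,
         "[PRICE_MID]": 3, "[WEEKDAY_SAT]": 4}

backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = WhitespaceSplit()        # domain tokens are whitespace-separated

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend,
    pad_token="[PAD]",
    unk_token="[UNK]",
)

print(tokenizer("[CAT_ELECTRONICS] [PRICE_MID] [WEEKDAY_SAT]")["input_ids"])  # [2, 3, 4]
tokenizer.save_pretrained("domain_tokenizer")    # later: tokenizer.push_to_hub(...)
```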
## Project Roadmap
### Phase 1: Research & Survey ✅
- Literature survey (35+ papers)
- Nubank nuFormer reverse-engineering
- Framework ADR with detailed implementation plan
### Phase 2: Core Library (Next, ~9 weeks)
- **Weeks 1–3:** Domain tokenizer library (schema → per-field tokenizers → HF-compatible composite tokenizer)
- **Weeks 3–5:** GPT-style Transformer with NoPE + PLR embeddings + DCNv2 joint fusion (PLR sketched after this phase)
- **Weeks 5–7:** Pre-training pipeline (CLM on domain sequences via HF Trainer; see the sketch after the roadmap)
- **Weeks 7–9:** Fine-tuning pipeline (nuFormer-style joint fusion)
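A sketch of the PLR ("periodic, linear, ReLU") numeric embedding planned for Weeks 3–5, following the recipe from the PLR paper (arXiv:2203.05556); the layer sizes and the frequency-init scale `sigma` are assumed hyperparameters, not values from the paper:

```python
# PLR embedding for one scalar feature: x -> [sin(2*pi*c*x), cos(2*pi*c*x)] -> Linear -> ReLU.
# Layer sizes and sigma are assumed hyperparameters.
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    def __init__(self, num_frequencies: int = 16, emb_dim: int = 32, sigma: float = 1.0):
        super().__init__()
        # Trainable frequencies, initialized from N(0, sigma^2) as in the PLR recipe.
        self.frequencies = nn.Parameter(torch.randn(num_frequencies) * sigma)
        self.proj = nn.Linear(2 * num_frequencies, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) scalar feature, e.g. a normalized transaction amount
        angles = 2 * torch.pi * self.frequencies * x.unsqueeze(-1)        # (batch, F)
        periodic = torch.cat([torch.sin(angles), torch.cos(angles)], -1)  # (batch, 2F)
        return torch.relu(self.proj(periodic))                            # (batch, emb_dim)

amounts = torch.tensor([0.05, 0.42, 0.97])        # normalized amounts
print(PLREmbedding()(amounts).shape)              # torch.Size([3, 32])
```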
### Phase 3: Domain Demos (Weeks 9–12)
- Finance: fraud detection, credit scoring
- E-commerce: next purchase prediction, customer segmentation
### Phase 4: Scale & Optimize (Weeks 12+)
- 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
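Across these phases the training loop itself is meant to stay standard HuggingFace. A minimal sketch of the Weeks 5–7 CLM pre-training step, where the model size, the toy corpus, and the `domain_tokenizer` directory (from the tokenizer sketch above) are all assumptions:

```python
# Causal-LM pre-training on domain token sequences with the HF Trainer.
# Model size, corpus, and tokenizer path are assumptions for this sketch.
from datasets import Dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("domain_tokenizer")       # saved earlier

corpus = ["[CAT_ELECTRONICS] [PRICE_MID] [WEEKDAY_SAT]"] * 64       # toy corpus
dataset = Dataset.from_dict({"text": corpus}).map(
    lambda batch: tokenizer(batch["text"]), batched=True, remove_columns=["text"]
)

config = GPT2Config(vocab_size=len(tokenizer), n_layer=4, n_head=4, n_embd=256)
model = GPT2LMHeadModel(config)                                     # trained from scratch

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt", per_device_train_batch_size=8,
                           num_train_epochs=1, report_to=[]),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```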
## Repo Structure
```
domainTokenizer/
├── docs/
│   ├── research_report.md                        # 51KB – Full research survey
│   ├── nubank_nuformer_analysis.md               # 29KB – Nubank pipeline analysis
│   └── adr/
│       └── ADR-001-implementation-framework.md   # Framework decision + roadmap
├── src/                                          # (Phase 2) Core library
│   ├── tokenizers/                               # Schema, field tokenizers, composite builder
│   ├── models/                                   # DomainTransformer, PLR, DCNv2, JointFusion
│   └── training/                                 # Data pipeline, pre-training, fine-tuning
├── examples/                                     # (Phase 3) Domain-specific demos
└── README.md
```
## Key References
| Paper | Year | What It Does | Link |
|-------|------|-------------|------|
| **nuFormer** (Nubank) | 2025 | Transaction foundation model at production scale | [arXiv](https://arxiv.org/abs/2507.23267) |
| TIGER (Google) | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| ActionPiece (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| RecFormer | 2023 | Items as key-value text representations | [arXiv](https://arxiv.org/abs/2305.13731) |
| PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | [arXiv](https://arxiv.org/abs/2203.05556) |
| DCN V2 (Google) | 2021 | Feature crossing for tabular data | [arXiv](https://arxiv.org/abs/2008.13535) |
| NoPE | 2023 | No positional encoding matches or beats RoPE/ALiBi on length generalization | [arXiv](https://arxiv.org/abs/2305.19466) |
| KL3M Tokenizers | 2025 | Domain-specific BPE for finance/legal | [arXiv](https://arxiv.org/abs/2503.17247) |
| Banking TF | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)
## License
MIT
---
*domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.* |