# domainTokenizer
|
|
**Building small models that understand domain tokens, not just words.**
|
|
| --- |
|
|
| ## The Idea |
|
|
LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). Tokenization is the step that turns raw text into the discrete sequences a Transformer can model.
|
|
But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.
|
|
**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?
|
|
| ``` |
Text LLM:        "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word

domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
| ``` |
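
To make the contrast concrete, below is a minimal Python sketch of the domainTokenizer side of that diagram. The field names, bin edges, and token strings are illustrative placeholders, not the project's actual schema:

```python
# Hedged sketch: encode one purchase event as domain tokens rather than
# text fragments. Field names, bin edges, and token formats are invented
# here for illustration only.
from datetime import datetime

AMOUNT_BINS = [10, 50, 100, 500]  # hypothetical bin edges (USD)

def tokenize_purchase(event: dict) -> list[str]:
    """Encode one transaction as a short sequence of domain tokens."""
    tokens = [f"category_{event['category']}"]
    # Bucket the raw amount into a coarse bin token.
    amount_bin = sum(event["amount"] >= edge for edge in AMOUNT_BINS)
    tokens.append(f"amount_bin_{amount_bin}")
    # Calendar context becomes tokens instead of raw text.
    ts = datetime.fromisoformat(event["timestamp"])
    tokens.append(f"dow_{ts.strftime('%A')}")
    tokens.append(f"channel_{event['channel']}")
    return tokens

print(tokenize_purchase({
    "category": "Electronics", "amount": 79.99,
    "timestamp": "2025-03-15T14:30:00", "channel": "credit",
}))
# ['category_Electronics', 'amount_bin_2', 'dow_Saturday', 'channel_credit']
```

Each token carries one unit of domain meaning (a category, an amount bucket, a weekday), so a small Transformer can learn patterns over them directly.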
|
|
## Industry Validation: Nubank's nuFormer
|
|
| This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe: |
|
|
| - **Paper:** ["Your spending needs attention: Modeling financial habits with transformers"](https://arxiv.org/abs/2507.23267) (arXiv: 2507.23267) |
| - **Blog series:** [Building Nubank β Foundation Models](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/) |
|
|
**Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M–330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).
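
As a rough illustration of what that hybrid encoding could look like in code, here is a sketch combining field-level special tokens with a BPE pass over the free-text merchant descriptor. The special token names, bin widths, and training strings are guesses, not Nubank's actual vocabulary:

```python
# Illustrative nuFormer-style encoding: special tokens for structured
# fields plus ordinary BPE over the merchant description. All token names
# and bin logic here are assumptions for demonstration.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def encode_transaction(amount_cents: int, weekday: int, is_credit: bool,
                       description: str, bpe: Tokenizer) -> list[str]:
    tokens = [
        "[TX_START]",
        f"[SIGN_{'CREDIT' if is_credit else 'DEBIT'}]",
        f"[AMOUNT_BIN_{min(amount_cents // 1000, 49)}]",  # coarse amount bin
        f"[WEEKDAY_{weekday}]",
    ]
    tokens += bpe.encode(description.lower()).tokens  # free text -> BPE
    tokens.append("[TX_END]")
    return tokens

# Train a toy BPE model on a few merchant strings just to make this runnable.
bpe = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe.pre_tokenizer = pre_tokenizers.Whitespace()
bpe.train_from_iterator(
    ["acme electronics", "corner coffee", "metro market"],
    trainers.BpeTrainer(vocab_size=200, special_tokens=["[UNK]"]))
print(encode_transaction(7999, 3, False, "acme electronics", bpe))
```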
|
|
**Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)
|
|
| ## Why This Matters |
|
|
| | Problem | Text Tokenizer | Domain Tokenizer | |
| |---------|---------------|-----------------| |
| | Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning | |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range"); binning sketched after this table |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
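
The price row hinges on quantile binning: fit bin edges on training amounts, then map any raw price to its percentile bucket. A minimal sketch, assuming 50 quantile bins and synthetic training prices:

```python
# Hedged sketch of quantile price binning (how $79.99 could become a token
# like `price_bin_37`). The 50-bin choice and the training data are invented.
import numpy as np

rng = np.random.default_rng(0)
train_prices = rng.lognormal(mean=3.5, sigma=1.0, size=100_000)  # fake prices

# Bin edges at the 2nd, 4th, ..., 98th percentiles -> 50 bins total.
edges = np.percentile(train_prices, np.linspace(2, 98, 49))

def price_token(price: float) -> str:
    """Map a raw price to its percentile-bin token."""
    return f"price_bin_{int(np.searchsorted(edges, price))}"

print(price_token(79.99))  # 'price_bin_40' here: ~81st percentile of the fake data
```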
|
|
| ## Research Foundation |
|
|
This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm**; the challenge is *how* to tokenize.
|
|
| | Paradigm | Method | Key Paper | |
| |----------|--------|-----------| |
| **Semantic IDs** | RQ-VAE quantization of item embeddings (sketched after this table) | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
| | **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) | |
| | **Transaction Tokenization** | Special tokens + BPE hybrid | [nuFormer](https://arxiv.org/abs/2507.23267) (Nubank, 2025) | |
| | **Tabular Tokenization** | Periodic embeddings for numbers | [PLR](https://arxiv.org/abs/2203.05556) (Yandex, 2022) | |
| **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |
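
The Semantic ID row deserves unpacking. Residual quantization, the mechanism behind TIGER's Semantic IDs, turns one item embedding into a short tuple of codebook indices, with each level quantizing the residual left by the previous one. A minimal sketch with random codebooks; in RQ-VAE the codebooks are learned jointly with an encoder/decoder:

```python
# Hedged sketch of residual quantization (the core of RQ-VAE Semantic IDs).
# Codebooks here are random, not trained; sizes are illustrative.
import torch

def semantic_id(embedding: torch.Tensor,
                codebooks: list[torch.Tensor]) -> list[int]:
    """Quantize one embedding into hierarchical codebook indices."""
    residual, codes = embedding, []
    for codebook in codebooks:                    # one level per codebook
        dists = torch.cdist(residual[None], codebook)[0]
        idx = int(dists.argmin())                 # nearest code at this level
        codes.append(idx)
        residual = residual - codebook[idx]       # pass the residual onward
    return codes

torch.manual_seed(0)
codebooks = [torch.randn(256, 64) for _ in range(4)]  # 4 levels x 256 codes
item_embedding = torch.randn(64)
print(semantic_id(item_embedding, codebooks))  # four hierarchical codes
```

Early levels capture coarse structure, later levels refine it, which is what gives Semantic IDs their hierarchical meaning.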
|
|
| ## Documentation |
|
|
| | Document | Description | |
| |----------|-------------| |
| [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey**: 31 papers across 5 paradigms, technical taxonomy, full blueprint |
| [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering**: complete pipeline reconstruction, 4 academic pillars, adaptation playbooks |
| [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record**: framework choice (PyTorch+HF), trade-offs vs JAX/Keras, detailed implementation roadmap with code |
|
|
| ## Implementation Decision |
|
|
| After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX: |
|
|
| **Decision: PyTorch + HuggingFace Transformers** (with JAX as future scaling path) |
|
|
| Key reasons: |
| - **5 of 6 reference papers use PyTorch** (including Google DeepMind's ActionPiece) |
- **HuggingFace has the only complete custom tokenizer pipeline** (`PreTrainedTokenizerFast` → Trainer → push_to_hub; sketched below)
| - **Production deployment is direct:** ONNX, TGI, vLLM all first-class |
- JAX advantages (TPU, XLA) only matter at >1B params on 256+ accelerators, not at our 24M–330M scale
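
For the tokenizer-pipeline point, a minimal sketch of that HF path: a fixed vocabulary of domain tokens wrapped as a `PreTrainedTokenizerFast`, which then plugs into `Trainer` and `push_to_hub` like any text tokenizer. The vocabulary contents are placeholders:

```python
# Hedged sketch: wrap a domain-token vocabulary as an HF fast tokenizer.
# Token names and vocabulary layout are placeholders, not a final schema.
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

vocab = {"[PAD]": 0, "[UNK]": 1, "[BOS]": 2, "[EOS]": 3}
vocab |= {tok: i + 4 for i, tok in enumerate(
    [f"amount_bin_{b}" for b in range(50)]
    + [f"dow_{d}" for d in range(7)]
    + ["category_Electronics", "channel_credit", "channel_debit"])}

core = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
core.pre_tokenizer = pre_tokenizers.WhitespaceSplit()  # tokens are space-separated

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=core, pad_token="[PAD]", unk_token="[UNK]",
    bos_token="[BOS]", eos_token="[EOS]")
print(tokenizer("amount_bin_2 dow_5 category_Electronics")["input_ids"])  # [6, 59, 61]
# From here, Trainer and tokenizer.push_to_hub(...) work as for any tokenizer.
```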
|
|
| Full analysis: [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) |
|
|
| ## Project Roadmap |
|
|
### Phase 1: Research & Survey ✅
|
| - Literature survey (35+ papers) |
| - Nubank nuFormer reverse-engineering |
| - Framework ADR with detailed implementation plan |
|
|
### Phase 2: Core Library (Next: ~9 weeks)
- **Weeks 1–3:** Domain tokenizer library (schema → per-field tokenizers → HF-compatible composite tokenizer)
- **Weeks 3–5:** GPT-style Transformer with NoPE + PLR embeddings + DCNv2 joint fusion (PLR sketch after this list)
- **Weeks 5–7:** Pre-training pipeline (CLM on domain sequences via HF Trainer)
- **Weeks 7–9:** Fine-tuning pipeline (nuFormer-style joint fusion)
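
The PLR embeddings in Weeks 3–5 come from the Yandex paper above: a scalar feature passes through trainable periodic frequencies, then a linear layer and a ReLU. A minimal module sketch, with illustrative sizes rather than the paper's tuned hyperparameters:

```python
# Hedged sketch of a PLR (Periodic-Linear-ReLU) embedding for scalar
# features, after Gorishniy et al. 2022. Sizes and sigma are illustrative.
import torch
from torch import nn

class PLREmbedding(nn.Module):
    def __init__(self, n_frequencies: int = 48, d_embedding: int = 64,
                 sigma: float = 0.01):
        super().__init__()
        # Trainable frequencies, initialized from N(0, sigma^2).
        self.freq = nn.Parameter(torch.randn(n_frequencies) * sigma)
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) raw scalar feature -> (batch, d_embedding)
        angles = 2 * torch.pi * self.freq * x[:, None]
        periodic = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return torch.relu(self.linear(periodic))

amounts = torch.tensor([79.99, 12.50, 1500.0])
print(PLREmbedding()(amounts).shape)  # torch.Size([3, 64])
```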
|
|
### Phase 3: Domain Demos (Weeks 9–12)
| - Finance: fraud detection, credit scoring |
| - E-commerce: next purchase prediction, customer segmentation |
|
|
| ### Phase 4: Scale & Optimize (Weeks 12+) |
- 330M-param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
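
A rough sketch of that optimization path on a toy config (not the 330M recipe): `torch.compile` wraps the eager model for speed, and ONNX export runs on the uncompiled module with the KV cache disabled so the traced graph has a single logits output. Details here are assumptions, not a tested recipe:

```python
# Hedged sketch: compile for speed, then export to ONNX for serving.
# Config numbers are toy placeholders.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(vocab_size=4096, n_embd=256, n_layer=4, n_head=4,
                    use_cache=False,    # single logits output for export
                    return_dict=False)  # tuple outputs trace cleanly
model = GPT2LMHeadModel(config).eval()

compiled = torch.compile(model)  # JIT-compiles on first call
dummy = torch.randint(0, 4096, (1, 32))
with torch.no_grad():
    logits = compiled(dummy)[0]  # warm-up call triggers compilation

# Export the *eager* module; compiled wrappers don't export directly.
torch.onnx.export(model, (dummy,), "domain_transformer.onnx",
                  input_names=["input_ids"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq"}})
```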
|
|
| ## Repo Structure |
|
|
```
domainTokenizer/
├── docs/
│   ├── research_report.md                   # 51KB: full research survey
│   ├── nubank_nuformer_analysis.md          # 29KB: Nubank pipeline analysis
│   └── adr/
│       └── ADR-001-implementation-framework.md  # Framework decision + roadmap
├── src/                    # (Phase 2) Core library
│   ├── tokenizers/         # Schema, field tokenizers, composite builder
│   ├── models/             # DomainTransformer, PLR, DCNv2, JointFusion
│   └── training/           # Data pipeline, pre-training, fine-tuning
├── examples/               # (Phase 3) Domain-specific demos
└── README.md
```
|
|
| ## Key References |
|
|
| | Paper | Year | What It Does | Link | |
| |-------|------|-------------|------| |
| | **nuFormer** (Nubank) | 2025 | Transaction foundation model at production scale | [arXiv](https://arxiv.org/abs/2507.23267) | |
| | TIGER (Google) | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) | |
| | ActionPiece (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) | |
| | RecFormer | 2023 | Items as key-value text representations | [arXiv](https://arxiv.org/abs/2305.13731) | |
| | PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | [arXiv](https://arxiv.org/abs/2203.05556) | |
| | DCN V2 (Google) | 2021 | Feature crossing for tabular data | [arXiv](https://arxiv.org/abs/2008.13535) | |
| | NoPE | 2023 | No positional encoding beats RoPE/ALiBi | [arXiv](https://arxiv.org/abs/2305.19466) | |
| | KL3M Tokenizers | 2025 | Domain-specific BPE for finance/legal | [arXiv](https://arxiv.org/abs/2503.17247) | |
| | Banking TF | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) | |
| | Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) | |
|
|
| Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table) |
|
|
| ## License |
|
|
| MIT |
|
|
| --- |
|
|
| *domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.* |