# 🔑 domainTokenizer

**Building small models that understand domain tokens — not just words.**

---

## The Idea

LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.

But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day — purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.

**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** — products, transactions, medical codes, user actions — as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?

```
Text LLM:      "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word

domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```
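
As a deliberately simplified illustration of what such domain tokens could look like, here is a minimal Python sketch that maps one raw purchase record to a short token sequence. The field names, bin edges, and token spellings are hypothetical, not the project's actual vocabulary.

```python
# Hypothetical sketch only: field names, bin edges, and token spellings are
# illustrative, not the project's actual tokenizer vocabulary.
from datetime import datetime

def encode_purchase(category: str, amount: float, ts: datetime) -> list[str]:
    """Map a raw purchase record to a short sequence of domain tokens."""
    # The category becomes a first-class token instead of subword fragments.
    category_token = f"[CAT_{category.upper()}]"
    # The amount is discretized into a coarse bin rather than spelled out as digits.
    bin_edges = [10, 50, 100, 500]
    bin_idx = sum(amount >= edge for edge in bin_edges)
    amount_token = f"[AMT_BIN_{bin_idx}]"
    # Calendar structure (weekday, part of day) is made explicit.
    time_token = f"[{ts.strftime('%A').upper()}_{'AM' if ts.hour < 12 else 'PM'}]"
    return [category_token, amount_token, time_token]

print(encode_purchase("electronics", 79.99, datetime(2025, 3, 15, 14, 30)))
# ['[CAT_ELECTRONICS]', '[AMT_BIN_2]', '[SATURDAY_PM]']
```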

## 🏦 Industry Validation: Nubank's nuFormer

This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:

- **Paper:** ["Your spending needs attention: Modeling financial habits with transformers"](https://arxiv.org/abs/2507.23267) (arXiv: 2507.23267)
- **Blog series:** [Building Nubank — Foundation Models](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/)

**Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M–330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).

📄 **Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)
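
To make that recipe more tangible, here is a hedged sketch of the special-token + BPE hybrid: a handful of tokens for the structured fields plus subword tokens for the free-text merchant description. The token names and the 10-bin amount scheme are assumptions, and GPT-2's BPE merely stands in for the domain-trained BPE that nuFormer actually uses.

```python
# Illustrative only: token names and the toy 10-bin amount scheme are assumptions,
# and GPT-2's BPE is a stand-in for a BPE trained on merchant descriptions.
from transformers import AutoTokenizer

bpe = AutoTokenizer.from_pretrained("gpt2")

def encode_transaction(amount: float, is_credit: bool, weekday: str, merchant: str) -> list[str]:
    # Structured fields become a few special tokens (amount bin, sign, calendar).
    amount_bin = min(int(amount // 25), 9)
    field_tokens = [
        f"[AMT_{amount_bin}]",
        "[CREDIT]" if is_credit else "[DEBIT]",
        f"[{weekday.upper()}]",
    ]
    # The free-text merchant description contributes a few BPE subword tokens.
    merchant_tokens = bpe.tokenize(merchant.lower())
    return field_tokens + merchant_tokens

print(encode_transaction(79.99, False, "Friday", "corner coffee shop"))
# ['[AMT_3]', '[DEBIT]', '[FRIDAY]', ...] plus a few merchant subword tokens
```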

## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
|---------|---------------|-----------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
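
For the price row, equal-frequency (quantile) binning is one simple way to turn a raw amount into a `price_bin_*` token. The sketch below uses a synthetic price distribution and a 50-bin scheme, both of which are illustrative assumptions.

```python
# Quantile binning sketch: the synthetic price distribution and the 50-bin
# scheme are illustrative assumptions, not the project's actual configuration.
import numpy as np

rng = np.random.default_rng(0)
train_prices = rng.lognormal(mean=3.5, sigma=1.0, size=10_000)  # stand-in price history

# Precompute equal-frequency bin edges once from training data.
edges = np.quantile(train_prices, np.linspace(0, 1, 51))

def price_token(price: float) -> str:
    bin_idx = int(np.searchsorted(edges, price, side="right")) - 1
    bin_idx = max(0, min(bin_idx, 49))  # clamp to the valid bin range
    return f"price_bin_{bin_idx}"

print(price_token(79.99))  # e.g. 'price_bin_40', depending on the price distribution
```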

## Research Foundation

This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm** — the challenge is *how* to tokenize.

| Paradigm | Method | Key Paper |
|----------|--------|-----------|
| **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
| **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
| **Transaction Tokenization** | Special tokens + BPE hybrid | [nuFormer](https://arxiv.org/abs/2507.23267) (Nubank, 2025) |
| **Tabular Tokenization** | Periodic embeddings for numbers | [PLR](https://arxiv.org/abs/2203.05556) (Yandex, 2022) |
| **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |
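
To give a flavor of the action-tokenization row, here is a toy sketch of the core BPE-like step: count which pair of field values co-occurs most often across events and promote it to a composite token. This shows only the counting-and-merging idea, not the full ActionPiece algorithm.

```python
# Toy sketch of BPE-like merging over feature sets; this shows only the
# pair-counting idea, not the full ActionPiece algorithm.
from collections import Counter
from itertools import combinations

# Each event is a set of field-level tokens.
events = [
    {"Electronics", "$50-100", "Weekend"},
    {"Electronics", "$50-100", "Weekday"},
    {"Grocery", "$0-10", "Weekday"},
    {"Electronics", "$50-100", "Weekend"},
]

def most_frequent_pair(seqs):
    counts = Counter()
    for s in seqs:
        for pair in combinations(sorted(s), 2):
            counts[pair] += 1
    return counts.most_common(1)[0]

pair, freq = most_frequent_pair(events)
print(pair, freq)                          # ('$50-100', 'Electronics') 3
composite = "{" + " + ".join(pair) + "}"   # '{$50-100 + Electronics}'
print(composite)                           # becomes a single new vocabulary token
```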

## Documentation

| Document | Description |
|----------|-------------|
| 📄 [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey** — 31 papers across 5 paradigms, technical taxonomy, full blueprint |
| 🏦 [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** — complete pipeline reconstruction, 4 academic pillars, adaptation playbooks |
| 🏗️ [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record** — framework choice (PyTorch+HF), trade-offs vs JAX/Keras, detailed implementation roadmap with code |

## Implementation Decision

After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:

**Decision: PyTorch + HuggingFace Transformers** (with JAX as a future scaling path)

Key reasons:
- **5 of 6 reference papers use PyTorch** (including Google DeepMind's ActionPiece)
- **HuggingFace has the only complete custom tokenizer pipeline** (`PreTrainedTokenizerFast` → Trainer → push_to_hub)
- **Production deployment is direct:** ONNX, TGI, vLLM all first-class
- JAX advantages (TPU, XLA) only matter at >1B params on 256+ accelerators — not at our 24M–330M scale

Full analysis: [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md)
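
As a minimal sketch of the HuggingFace path mentioned above, a fixed domain vocabulary can be wrapped in a `PreTrainedTokenizerFast` so it plugs directly into `Trainer` and `push_to_hub`. The vocabulary and special tokens below are placeholders; the Phase 2 schema-driven builder is expected to generate them programmatically.

```python
# Placeholder vocabulary and special tokens; the real builder will generate
# these from a domain schema (Phase 2).
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

vocab = {"[PAD]": 0, "[UNK]": 1, "[BOS]": 2,
         "[CAT_ELECTRONICS]": 3, "[AMT_BIN_2]": 4, "[SATURDAY_PM]": 5}

core = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
core.pre_tokenizer = pre_tokenizers.WhitespaceSplit()  # domain tokens are whitespace-separated

hf_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=core,
    pad_token="[PAD]", unk_token="[UNK]", bos_token="[BOS]",
)

ids = hf_tokenizer("[BOS] [CAT_ELECTRONICS] [AMT_BIN_2] [SATURDAY_PM]")["input_ids"]
print(ids)  # e.g. [2, 3, 4, 5] -> ready for the HF Trainer and push_to_hub
```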

## Project Roadmap

### Phase 1: Research & Survey ✅
- Literature survey (35+ papers)
- Nubank nuFormer reverse-engineering  
- Framework ADR with detailed implementation plan

### Phase 2: Core Library (Next — ~9 weeks)
- **Weeks 1–3:** Domain tokenizer library (schema → per-field tokenizers → HF-compatible composite tokenizer)
- **Weeks 3–5:** GPT-style Transformer with NoPE + PLR embeddings + DCNv2 joint fusion (see the PLR sketch after this list)
- **Weeks 5–7:** Pre-training pipeline (CLM on domain sequences via HF Trainer)
- **Weeks 7–9:** Fine-tuning pipeline (nuFormer-style joint fusion)
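
Since the Weeks 3–5 item leans on PLR embeddings for numeric fields, here is a minimal PyTorch sketch of the Periodic → Linear → ReLU construction from the Yandex paper; the layer sizes and frequency-initialization scale are illustrative choices, not settled hyperparameters.

```python
# Minimal PLR (Periodic -> Linear -> ReLU) sketch for a scalar feature such as
# a transaction amount; sizes and init scale here are illustrative choices.
import math
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    def __init__(self, n_frequencies: int = 48, d_embedding: int = 64, sigma: float = 0.01):
        super().__init__()
        # Trainable frequencies, initialized from N(0, sigma^2).
        self.frequencies = nn.Parameter(torch.randn(n_frequencies) * sigma)
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) scalar values -> (batch, d_embedding)
        v = 2 * math.pi * self.frequencies * x.unsqueeze(-1)        # (batch, n_frequencies)
        periodic = torch.cat([torch.sin(v), torch.cos(v)], dim=-1)  # (batch, 2 * n_frequencies)
        return torch.relu(self.linear(periodic))

amounts = torch.tensor([12.5, 79.99, 430.0])
print(PLREmbedding()(amounts).shape)  # torch.Size([3, 64])
```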

### Phase 3: Domain Demos (Weeks 9–12)
- Finance: fraud detection, credit scoring
- E-commerce: next purchase prediction, customer segmentation

### Phase 4: Scale & Optimize (Weeks 12+)
- 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary

## Repo Structure

```
domainTokenizer/
├── docs/
│   ├── research_report.md              # 51KB — Full research survey
│   ├── nubank_nuformer_analysis.md     # 29KB — Nubank pipeline analysis
│   └── adr/
│       └── ADR-001-implementation-framework.md  # Framework decision + roadmap
├── src/                                 # (Phase 2) Core library
│   ├── tokenizers/                      # Schema, field tokenizers, composite builder
│   ├── models/                          # DomainTransformer, PLR, DCNv2, JointFusion
│   └── training/                        # Data pipeline, pre-training, fine-tuning
├── examples/                            # (Phase 3) Domain-specific demos
└── README.md
```

## Key References

| Paper | Year | What It Does | Link |
|-------|------|-------------|------|
| **nuFormer** (Nubank) | 2025 | Transaction foundation model at production scale | [arXiv](https://arxiv.org/abs/2507.23267) |
| TIGER (Google) | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| ActionPiece (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| RecFormer | 2023 | Items as key-value text representations | [arXiv](https://arxiv.org/abs/2305.13731) |
| PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | [arXiv](https://arxiv.org/abs/2203.05556) |
| DCN V2 (Google) | 2021 | Feature crossing for tabular data | [arXiv](https://arxiv.org/abs/2008.13535) |
| NoPE | 2023 | No positional encoding outperforms RoPE/ALiBi on length generalization | [arXiv](https://arxiv.org/abs/2305.19466) |
| KL3M Tokenizers | 2025 | Domain-specific BPE for finance/legal | [arXiv](https://arxiv.org/abs/2503.17247) |
| Banking TF | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |

Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)

## License

MIT

---

*domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.*