# domainTokenizer
**Building small models that understand domain tokens, not just words.**
---
## The Idea
LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.
But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.
**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?
```
Text LLM:        "The cat sat on the mat"  → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word
domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```
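A minimal sketch of what such a domain tokenizer could emit for a short purchase history; the field names, price buckets, and token strings are illustrative assumptions, not a finished vocabulary:

```python
# Toy rule-based domain tokenizer. Field names, price buckets, and token
# strings are illustrative assumptions, not a finished vocabulary.
from datetime import datetime

PRICE_BINS = [0, 20, 100, 500, float("inf")]        # toy price buckets
PRICE_LABELS = ["BUDGET", "MID", "HIGH_END", "LUXURY"]

def tokenize_event(event: dict) -> list[str]:
    """Map one purchase event to a short sequence of domain tokens."""
    # The product category becomes a first-class token instead of subword fragments.
    tokens = [f"[CAT_{event['category'].upper()}]"]

    # The price is bucketed instead of being spelled out digit by digit.
    for upper, label in zip(PRICE_BINS[1:], PRICE_LABELS):
        if event["price"] < upper:
            tokens.append(f"[PRICE_{label}]")
            break

    # Calendar structure is made explicit.
    weekday = datetime.fromisoformat(event["timestamp"]).strftime("%A").upper()
    tokens.append(f"[{weekday}]")
    return tokens

history = [
    {"category": "electronics", "price": 349.0, "timestamp": "2025-03-15T14:02:00"},
    {"category": "accessory",   "price": 19.9,  "timestamp": "2025-03-15T18:40:00"},
]
print([tok for event in history for tok in tokenize_event(event)])
# ['[CAT_ELECTRONICS]', '[PRICE_HIGH_END]', '[SATURDAY]',
#  '[CAT_ACCESSORY]', '[PRICE_BUDGET]', '[SATURDAY]']
```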
## Industry Validation: Nubank's nuFormer
This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:
- **Paper:** ["Your spending needs attention: Modeling financial habits with transformers"](https://arxiv.org/abs/2507.23267) (arXiv: 2507.23267)
- **Blog series:** [Building Nubank β Foundation Models](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/)
**Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M–330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).
**Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)
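To make the ~14-tokens-per-transaction budget concrete, here is a purely illustrative reconstruction of how a single transaction might be laid out under such a scheme. The token names and field set are assumptions made for this sketch, not Nubank's actual vocabulary:

```python
# Illustrative layout of one transaction under a nuFormer-style scheme:
# special tokens for structured fields plus BPE pieces for the merchant
# string. All token names here are assumptions, not Nubank's vocabulary.
transaction = {
    "amount": -42.70,                     # negative = debit
    "timestamp": "2025-03-15T14:02:00",
    "merchant": "PADARIA SAO JORGE",
}

tokens = [
    "[TX_START]",
    "[DEBIT]",                            # credit/debit sign as a special token
    "[AMOUNT_BIN_21]",                    # quantile bin for the amount
    "[MONTH_03]", "[DAY_15]", "[WEEKDAY_SAT]", "[HOUR_14]",   # calendar features
    "PAD", "ARIA", "SAO", "JOR", "GE",    # merchant description via a small BPE vocab
    "[TX_END]",
]
assert len(tokens) == 13                  # on the order of the ~14 tokens reported above
```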
## Why This Matters
| Problem | Text Tokenizer | Domain Tokenizer |
|---------|---------------|-----------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
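The last row is the least obvious, so here is a toy sketch of how composite tokens could be discovered in the spirit of BPE/ActionPiece: count co-occurring field-value pairs across a corpus and promote the most frequent pair to a single token. The pair fields, merge threshold, and token syntax are illustrative assumptions:

```python
# Toy BPE-style discovery of cross-field composite tokens: the most frequent
# (category, price_bin) pair is promoted to one composite token. Threshold
# and token syntax are illustrative assumptions.
from collections import Counter

events = [
    ("Electronics", "price_50_100"), ("Electronics", "price_50_100"),
    ("Grocery", "price_0_20"), ("Electronics", "price_50_100"),
    ("Grocery", "price_20_50"),
]

best_pair, count = Counter(events).most_common(1)[0]

MIN_MERGE_COUNT = 3
merges = {best_pair: "{" + "+".join(best_pair) + "}"} if count >= MIN_MERGE_COUNT else {}

def encode(event: tuple[str, str]) -> list[str]:
    """Emit one composite token if a merge exists, otherwise the raw field tokens."""
    return [merges[event]] if event in merges else list(event)

print([encode(e) for e in events])
# [['{Electronics+price_50_100}'], ['{Electronics+price_50_100}'],
#  ['Grocery', 'price_0_20'], ['{Electronics+price_50_100}'],
#  ['Grocery', 'price_20_50']]
```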
## Research Foundation
This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm**; the challenge is *how* to tokenize.
| Paradigm | Method | Key Paper |
|----------|--------|-----------|
| **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
| **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
| **Transaction Tokenization** | Special tokens + BPE hybrid | [nuFormer](https://arxiv.org/abs/2507.23267) (Nubank, 2025) |
| **Tabular Tokenization** | Periodic embeddings for numbers | [PLR](https://arxiv.org/abs/2203.05556) (Yandex, 2022) |
| **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |
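As a concrete example of the Semantic-ID paradigm (first row above), here is a minimal residual-quantization sketch: each level quantizes what is left of the item embedding against its own codebook, producing a short coarse-to-fine code. The codebooks below are random placeholders, whereas TIGER learns them inside an RQ-VAE:

```python
# Minimal residual quantization: an item embedding -> hierarchical Semantic ID.
# Codebooks are random placeholders; TIGER learns them inside an RQ-VAE.
import numpy as np

rng = np.random.default_rng(0)
num_levels, codebook_size, dim = 3, 64, 32
codebooks = rng.normal(size=(num_levels, codebook_size, dim))

def semantic_id(item_embedding: np.ndarray) -> list[int]:
    """Quantize one embedding into a coarse-to-fine code, one index per level."""
    residual, code = item_embedding, []
    for level in range(num_levels):
        distances = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(distances.argmin())
        code.append(idx)
        residual = residual - codebooks[level, idx]   # quantize what remains
    return code

item = rng.normal(size=dim)
print(semantic_id(item))   # e.g. [17, 42, 5] -> tokens like [sid_0_17][sid_1_42][sid_2_5]
```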
## Documentation
| Document | Description |
|----------|-------------|
| [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey**: 31 papers across 5 paradigms, technical taxonomy, full blueprint |
| [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering**: complete pipeline reconstruction, 4 academic pillars, adaptation playbooks |
| [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record**: framework choice (PyTorch+HF), trade-offs vs JAX/Keras, detailed implementation roadmap with code |
## Implementation Decision
After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:
**Decision: PyTorch + HuggingFace Transformers** (with JAX as future scaling path)
Key reasons:
- **5 of 6 reference papers use PyTorch** (including Google DeepMind's ActionPiece)
- **HuggingFace has the only complete custom tokenizer pipeline** (`PreTrainedTokenizerFast` → Trainer → push_to_hub; sketched below)
- **Production deployment is direct:** ONNX, TGI, vLLM all first-class
- JAX advantages (TPU, XLA) only matter at >1B params on 256+ accelerators, not at our 24M–330M scale
Full analysis: [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md)
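As a taste of that pipeline, here is a minimal sketch that wraps a fixed domain vocabulary in a fast tokenizer backend and exposes it as a `PreTrainedTokenizerFast`; the vocabulary and output directory are toy placeholders:

```python
# Wrap a toy domain vocabulary as a HuggingFace fast tokenizer.
# Requires the `tokenizers` and `transformers` packages.
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast

vocab = {"[PAD]": 0, "[UNK]": 1, "[CAT_ELECTRONICS]": 2,
         "[PRICE_MID]": 3, "[WEEKDAY_SAT]": 4}

backend = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
backend.pre_tokenizer = WhitespaceSplit()        # domain tokens are whitespace-separated

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=backend,
    pad_token="[PAD]",
    unk_token="[UNK]",
)

print(tokenizer("[CAT_ELECTRONICS] [PRICE_MID] [WEEKDAY_SAT]")["input_ids"])  # [2, 3, 4]
tokenizer.save_pretrained("domain_tokenizer")    # later: tokenizer.push_to_hub(...)
```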
## Project Roadmap
### Phase 1: Research & Survey ✅
- Literature survey (35+ papers)
- Nubank nuFormer reverse-engineering
- Framework ADR with detailed implementation plan
### Phase 2: Core Library (Next, ~9 weeks)
- **Weeks 1–3:** Domain tokenizer library (schema → per-field tokenizers → HF-compatible composite tokenizer)
- **Weeks 3–5:** GPT-style Transformer with NoPE + PLR embeddings + DCNv2 joint fusion (PLR sketched after this phase)
- **Weeks 5–7:** Pre-training pipeline (CLM on domain sequences via HF Trainer; see the sketch after the roadmap)
- **Weeks 7–9:** Fine-tuning pipeline (nuFormer-style joint fusion)
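A sketch of the PLR ("periodic, linear, ReLU") numeric embedding planned for Weeks 3–5, following the recipe from the PLR paper (arXiv:2203.05556); the layer sizes and the frequency-init scale `sigma` are assumed hyperparameters, not values from the paper:

```python
# PLR embedding for one scalar feature: x -> [sin(2*pi*c*x), cos(2*pi*c*x)] -> Linear -> ReLU.
# Layer sizes and sigma are assumed hyperparameters.
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    def __init__(self, num_frequencies: int = 16, emb_dim: int = 32, sigma: float = 1.0):
        super().__init__()
        # Trainable frequencies, initialized from N(0, sigma^2) as in the PLR recipe.
        self.frequencies = nn.Parameter(torch.randn(num_frequencies) * sigma)
        self.proj = nn.Linear(2 * num_frequencies, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) scalar feature, e.g. a normalized transaction amount
        angles = 2 * torch.pi * self.frequencies * x.unsqueeze(-1)        # (batch, F)
        periodic = torch.cat([torch.sin(angles), torch.cos(angles)], -1)  # (batch, 2F)
        return torch.relu(self.proj(periodic))                            # (batch, emb_dim)

amounts = torch.tensor([0.05, 0.42, 0.97])        # normalized amounts
print(PLREmbedding()(amounts).shape)              # torch.Size([3, 32])
```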
### Phase 3: Domain Demos (Weeks 9–12)
- Finance: fraud detection, credit scoring
- E-commerce: next purchase prediction, customer segmentation
### Phase 4: Scale & Optimize (Weeks 12+)
- 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
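Across these phases the training loop itself is meant to stay standard HuggingFace. A minimal sketch of the Weeks 5–7 CLM pre-training step, where the model size, the toy corpus, and the `domain_tokenizer` directory (from the tokenizer sketch above) are all assumptions:

```python
# Causal-LM pre-training on domain token sequences with the HF Trainer.
# Model size, corpus, and tokenizer path are assumptions for this sketch.
from datasets import Dataset
from transformers import (AutoTokenizer, DataCollatorForLanguageModeling,
                          GPT2Config, GPT2LMHeadModel, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("domain_tokenizer")       # saved earlier

corpus = ["[CAT_ELECTRONICS] [PRICE_MID] [WEEKDAY_SAT]"] * 64       # toy corpus
dataset = Dataset.from_dict({"text": corpus}).map(
    lambda batch: tokenizer(batch["text"]), batched=True, remove_columns=["text"]
)

config = GPT2Config(vocab_size=len(tokenizer), n_layer=4, n_head=4, n_embd=256)
model = GPT2LMHeadModel(config)                                     # trained from scratch

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ckpt", per_device_train_batch_size=8,
                           num_train_epochs=1, report_to=[]),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```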
## Repo Structure
```
domainTokenizer/
├── docs/
│   ├── research_report.md                        # 51KB – Full research survey
│   ├── nubank_nuformer_analysis.md               # 29KB – Nubank pipeline analysis
│   └── adr/
│       └── ADR-001-implementation-framework.md   # Framework decision + roadmap
├── src/                                          # (Phase 2) Core library
│   ├── tokenizers/                               # Schema, field tokenizers, composite builder
│   ├── models/                                   # DomainTransformer, PLR, DCNv2, JointFusion
│   └── training/                                 # Data pipeline, pre-training, fine-tuning
├── examples/                                     # (Phase 3) Domain-specific demos
└── README.md
```
## Key References
| Paper | Year | What It Does | Link |
|-------|------|-------------|------|
| **nuFormer** (Nubank) | 2025 | Transaction foundation model at production scale | [arXiv](https://arxiv.org/abs/2507.23267) |
| TIGER (Google) | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| ActionPiece (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| RecFormer | 2023 | Items as key-value text representations | [arXiv](https://arxiv.org/abs/2305.13731) |
| PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | [arXiv](https://arxiv.org/abs/2203.05556) |
| DCN V2 (Google) | 2021 | Feature crossing for tabular data | [arXiv](https://arxiv.org/abs/2008.13535) |
| NoPE | 2023 | No positional encoding matches or beats RoPE/ALiBi on length generalization | [arXiv](https://arxiv.org/abs/2305.19466) |
| KL3M Tokenizers | 2025 | Domain-specific BPE for finance/legal | [arXiv](https://arxiv.org/abs/2503.17247) |
| Banking TF | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)
## License
MIT
---
*domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.* |