Update README with Nubank case study and expanded repo structure
README.md
```
Text LLM:        "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat]
domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```

## Industry Validation: Nubank's nuFormer

This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:

- **Paper:** ["Your spending needs attention: Modeling financial habits with transformers"](https://arxiv.org/abs/2507.23267) (arXiv: 2507.23267)
- **Blog series:** [Building Nubank – Foundation Models](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/)

**Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M–330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).

**Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) – complete pipeline reconstruction with code patterns, scaling laws, and adaptation guides for e-commerce and healthcare.

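To make the per-transaction token budget concrete, here is a minimal Python sketch of this style of tokenizer. The token spellings, bin edges, and the whitespace split standing in for a trained BPE model are illustrative assumptions, not Nubank's published vocabulary:

```python
from datetime import datetime

def tokenize_transaction(amount, timestamp, merchant):
    """Illustrative transaction tokenizer: special tokens for structured
    fields plus a naive word split standing in for BPE on the description."""
    tokens = ["[TXN]"]
    # credit/debit sign
    tokens.append("[CREDIT]" if amount >= 0 else "[DEBIT]")
    # magnitude bin (assumed powers-of-ten edges)
    bucket = sum(abs(amount) >= e for e in (1, 10, 100, 1000, 10000))
    tokens.append(f"[AMT_BIN_{bucket}]")
    # calendar features: month, day-of-week, day-of-month, hour
    tokens += [f"[MONTH_{timestamp.month}]", f"[DOW_{timestamp.weekday()}]",
               f"[DOM_{timestamp.day}]", f"[HOUR_{timestamp.hour}]"]
    # merchant description: the real pipeline applies a trained BPE model here
    tokens += merchant.lower().split()
    return tokens

toks = tokenize_transaction(-42.50, datetime(2025, 7, 30, 14, 0),
                            "UBER TRIP SAO PAULO BR")
# 12 tokens: 7 structured + 5 merchant pieces, in the ~14-per-transaction range
```

Swapping the naive `split()` for a real subword model is what keeps long merchant strings from blowing up the token budget.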
## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
|---------|----------------|------------------|
## Research Foundation

This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm** – the challenge is *how* to tokenize.

Five paradigms have emerged:

| Paradigm | Method | Key Paper |
|----------|--------|-----------|
| **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
| **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
| **Transaction Tokenization** | Special tokens + BPE hybrid | [nuFormer](https://arxiv.org/abs/2507.23267) (Nubank, 2025) |
| **Tabular Tokenization** | Periodic embeddings for numbers | [PLR](https://arxiv.org/abs/2203.05556) (Yandex, 2022) |
| **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |

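The tabular-tokenization row is easy to demystify: PLR-style periodic embeddings map a scalar to sin/cos features at several frequencies, which a linear layer then mixes. A minimal numpy sketch, with fixed frequencies standing in for the learned ones:

```python
import numpy as np

def plr_features(x, frequencies):
    """Periodic (sin/cos) features for scalar numeric inputs, in the spirit
    of PLR embeddings. `frequencies` would normally be learned per feature;
    here they are fixed for illustration."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)   # (batch, 1)
    angles = 2.0 * np.pi * frequencies * x          # broadcast to (batch, k)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

freqs = np.array([0.5, 1.0, 2.0])       # k = 3 assumed frequencies
emb = plr_features([0.0, 0.25], freqs)  # shape (2, 6): 3 sines + 3 cosines
```

The periodic basis lets a downstream linear layer represent sharp, non-monotonic functions of a raw number, which plain scalar inputs struggle with.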
## Documentation

| Document | Description |
|----------|-------------|
| [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey** – 31 papers across 5 paradigms, technical taxonomy of tokenization methods, full blueprint for building domain tokenizers |
| [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** – complete pipeline reconstruction (tokenizer → pre-training → joint fusion), all 4 academic pillars explained, adaptation playbooks for finance/e-commerce/healthcare |

## Project Vision

### Phase 1: Research & Survey ✅
- Literature survey of domain tokenization methods (31+ papers)
- Nubank nuFormer full reverse-engineering
- Blueprint for a general-purpose domain tokenizer

### Phase 2: Core Tokenizer Library (Next)
- Implement per-field tokenizers:
  - `SignBucketTokenizer` – sign + magnitude bins for monetary values (Nubank-style)
  - `CalendarTokenizer` – month/dow/dom/hour decomposition
  - `SemanticIDTokenizer` – RQ-VAE for entity encoding (TIGER-style)
  - `PLREmbedding` – periodic linear representation for numerical tabular features
  - `CompositeTokenizer` – BPE-like merging of multi-field patterns (ActionPiece-style)
- Schema-driven automatic tokenizer selection

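The `CompositeTokenizer` merging step can be sketched as a classic BPE-style loop over field-token sequences: count adjacent pairs, then fuse the most frequent pair into one token. This is an illustrative sketch of the idea, not the final API:

```python
from collections import Counter

def most_frequent_pair(sequences):
    """Count adjacent token pairs across all sequences; return the top one."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(seq, pair):
    """Replace every occurrence of `pair` with a single fused token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + "+" + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

corpus = [["Accessory", "SameDay", "Card"],
          ["Accessory", "SameDay", "Cash"],
          ["Electronics", "Accessory", "SameDay"]]
pair = most_frequent_pair(corpus)               # ('Accessory', 'SameDay')
merged = [merge_pair(s, pair) for s in corpus]  # fused wherever it occurs
```

Iterating this merge until a target vocabulary size is reached yields composite tokens like `[Accessory+SameDay]` from the opening example.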
### Phase 3: Pre-training Framework
- GPT-style causal LM with NoPE (no positional encoding)
- Small Transformer backbone (24M–350M parameters, following Nubank's sizes)
- Joint fusion with DCNv2 + PLR for tabular features
- Domain-agnostic training loop

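The NoPE bullet is easy to demonstrate: in a causal LM the attention mask alone carries order information, so positional encodings can be dropped entirely. A single-head numpy sketch, with identity projections for brevity:

```python
import numpy as np

def causal_self_attention(x):
    """Single-head causal attention with NO positional encoding (NoPE):
    order information enters only through the causal mask."""
    T, d = x.shape
    q, k, v = x, x, x                          # identity projections for brevity
    scores = q @ k.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores[mask] = -np.inf                     # each step attends only to the past
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

x = np.arange(12.0).reshape(4, 3)  # 4 tokens, d = 3
y = causal_self_attention(x)       # y[0] == x[0]: the first token sees only itself
```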
### Phase 4: Domain Demos
- Finance: transaction recommendation, fraud detection, credit scoring
- E-commerce: next purchase prediction, customer segmentation
- Healthcare: clinical event prediction

## Repo Structure

```
domainTokenizer/
├── docs/
│   ├── research_report.md            # 51KB – Full research survey (31+ papers)
│   └── nubank_nuformer_analysis.md   # Nubank pipeline reverse-engineering
├── src/                              # (Phase 2) Core library
│   ├── tokenizers/                   # Per-field tokenizer implementations
│   ├── models/                       # Small Transformer backbones
│   └── training/                     # Pre-training and fine-tuning
├── examples/                         # (Phase 4) Domain-specific demos
└── README.md
```

## Key Papers

| Paper | Year | What It Does | Link |
|-------|------|--------------|------|
| **nuFormer** (Nubank) | 2025 | Transaction foundation model at production scale | [arXiv](https://arxiv.org/abs/2507.23267) |
| TIGER (Google) | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| ActionPiece (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| RecFormer | 2023 | Items as key-value text representations | [arXiv](https://arxiv.org/abs/2305.13731) |
| PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | [arXiv](https://arxiv.org/abs/2203.05556) |
| DCN V2 (Google) | 2021 | Feature crossing for tabular data | [arXiv](https://arxiv.org/abs/2008.13535) |
| NoPE | 2023 | No positional encoding beats RoPE/ALiBi | [arXiv](https://arxiv.org/abs/2305.19466) |
| Banking TF | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| Meta-Transformer | 2023 | 12 modalities, one token space | [arXiv](https://arxiv.org/abs/2307.10802) |
| NTP Survey | 2024 | Comprehensive multimodal NTP taxonomy | [arXiv](https://arxiv.org/abs/2412.18619) |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |

See the [full reference table](docs/research_report.md#10-complete-paper-reference-table) with 35+ papers in the research report.

## License