rtferraz committed · Commit e30a14d · verified · 1 Parent(s): 51149fa

Update README with Nubank case study and expanded repo structure

Files changed (1):
  1. README.md +49 -27
README.md CHANGED
@@ -18,6 +18,17 @@ Text LLM: "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat]
  domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
  ```

+ ## 🏦 Industry Validation: Nubank's nuFormer
+
+ This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:
+
+ - **Paper:** ["Your spending needs attention: Modeling financial habits with transformers"](https://arxiv.org/abs/2507.23267) (arXiv: 2507.23267)
+ - **Blog series:** [Building Nubank — Foundation Models](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/)
+
+ **Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M-330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).
+
+ 📄 **Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) — complete pipeline reconstruction with code patterns, scaling laws, and adaptation guides for e-commerce and healthcare.
+
  ## Why This Matters

  | Problem | Text Tokenizer | Domain Tokenizer |
@@ -29,7 +40,7 @@ domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCred

  ## Research Foundation

- This project is grounded in 30+ papers from Google, Google DeepMind, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm** — the challenge is *how* to tokenize.
+ This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm** — the challenge is *how* to tokenize.

  Five paradigms have emerged:

@@ -37,35 +48,42 @@ Five paradigms have emerged:
  |----------|--------|-----------|
  | **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
  | **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
- | **Transaction Tokenization** | Composite (date + amount + text) encoding | [Banking TF](https://arxiv.org/abs/2410.08243) (2024) |
- | **Tabular Tokenization** | Relative magnitude encoding for numbers | [TP-BERTa](https://arxiv.org/abs/2403.01841) (2024) |
+ | **Transaction Tokenization** | Special tokens + BPE hybrid | [nuFormer](https://arxiv.org/abs/2507.23267) (Nubank, 2025) |
+ | **Tabular Tokenization** | Periodic embeddings for numbers | [PLR](https://arxiv.org/abs/2203.05556) (Yandex, 2022) |
  | **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |

- 📄 **Full research report:** [`docs/research_report.md`](docs/research_report.md)
+ ## Documentation
+
+ | Document | Description |
+ |----------|-------------|
+ | [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey** — 31 papers across 5 paradigms, technical taxonomy of tokenization methods, full blueprint for building domain tokenizers |
+ | [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** — complete pipeline reconstruction (tokenizer → pre-training → joint fusion), all 4 academic pillars explained, adaptation playbooks for finance/e-commerce/healthcare |

  ## Project Vision

- ### Phase 1: Research & Survey (✅ Current)
- - Literature survey of domain tokenization methods
- - Analysis of tokenization strategies across recommendation, finance, tabular, and universal domains
+ ### Phase 1: Research & Survey ✅
+ - Literature survey of domain tokenization methods (31+ papers)
+ - Nubank nuFormer full reverse-engineering
  - Blueprint for a general-purpose domain tokenizer

- ### Phase 2: Core Tokenizer Library
+ ### Phase 2: Core Tokenizer Library (Next)
  - Implement per-field tokenizers:
- - `SemanticIDTokenizer` — RQ-VAE for entity encoding
- - `MagnitudeTokenizer` — relative magnitude binning for numerical values
- - `TemporalTokenizer` — calendar + relative delta encoding
+ - `SignBucketTokenizer` — sign + magnitude bins for monetary values (Nubank-style)
+ - `CalendarTokenizer` — month/dow/dom/hour decomposition
+ - `SemanticIDTokenizer` — RQ-VAE for entity encoding (TIGER-style)
+ - `PLREmbedding` — periodic linear representation for numerical tabular features
  - `CompositeTokenizer` — BPE-like merging of multi-field patterns (ActionPiece-style)
  - Schema-driven automatic tokenizer selection

  ### Phase 3: Pre-training Framework
- - Self-supervised objectives: Masked Event Prediction, Next Event Prediction
- - Small Transformer backbone (10M–350M parameters)
- - Domain-agnostic training loop that works with any tokenizer configuration
+ - GPT-style causal LM with NoPE (no positional encoding)
+ - Small Transformer backbone (24M–350M parameters, following Nubank's sizes)
+ - Joint fusion with DCNv2 + PLR for tabular features
+ - Domain-agnostic training loop

  ### Phase 4: Domain Demos
+ - Finance: transaction recommendation, fraud detection, credit scoring
  - E-commerce: next purchase prediction, customer segmentation
- - Finance: fraud detection, credit scoring
  - Healthcare: clinical event prediction

  ## Repo Structure
@@ -73,12 +91,13 @@ Five paradigms have emerged:
  ```
  domainTokenizer/
  ├── docs/
- │   └── research_report.md            # Detailed research findings (30+ papers)
- ├── src/                              # (coming) Core library
- │   ├── tokenizers/                   # Per-field tokenizer implementations
- │   ├── models/                       # Small Transformer backbones
- │   └── training/                     # Pre-training and fine-tuning
- ├── examples/                         # (coming) Domain-specific demos
+ │   ├── research_report.md            # 51KB — Full research survey (31+ papers)
+ │   └── nubank_nuformer_analysis.md   # Nubank pipeline reverse-engineering
+ ├── src/                              # (Phase 2) Core library
+ │   ├── tokenizers/                   # Per-field tokenizer implementations
+ │   ├── models/                       # Small Transformer backbones
+ │   └── training/                     # Pre-training and fine-tuning
+ ├── examples/                         # (Phase 4) Domain-specific demos
  └── README.md
  ```

@@ -86,16 +105,19 @@ domainTokenizer/

  | Paper | Year | What It Does | Link |
  |-------|------|-------------|------|
- | TIGER | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
- | ActionPiece | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
- | Banking TF | 2024 | Tokenizer for financial transactions | [arXiv](https://arxiv.org/abs/2410.08243) |
- | LETTER | 2024 | Learnable item tokenization | [arXiv](https://arxiv.org/abs/2405.07314) |
- | TP-BERTa | 2024 | Numerical value tokenization | [arXiv](https://arxiv.org/abs/2403.01841) |
+ | **nuFormer** (Nubank) | 2025 | Transaction foundation model at production scale | [arXiv](https://arxiv.org/abs/2507.23267) |
+ | TIGER (Google) | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
+ | ActionPiece (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
+ | RecFormer | 2023 | Items as key-value text representations | [arXiv](https://arxiv.org/abs/2305.13731) |
+ | PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | [arXiv](https://arxiv.org/abs/2203.05556) |
+ | DCN V2 (Google) | 2021 | Feature crossing for tabular data | [arXiv](https://arxiv.org/abs/2008.13535) |
+ | NoPE | 2023 | No positional encoding beats RoPE/ALiBi | [arXiv](https://arxiv.org/abs/2305.19466) |
+ | Banking TF | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
  | Meta-Transformer | 2023 | 12 modalities, one token space | [arXiv](https://arxiv.org/abs/2307.10802) |
  | NTP Survey | 2024 | Comprehensive multimodal NTP taxonomy | [arXiv](https://arxiv.org/abs/2412.18619) |
  | Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |

- See the [full reference table](docs/research_report.md#10-complete-paper-reference-table) with 31 papers in the research report.
+ See the [full reference table](docs/research_report.md#10-complete-paper-reference-table) with 35+ papers in the research report.

  ## License

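The special-token scheme the updated README describes (credit/debit sign + amount-magnitude bin + calendar decomposition per transaction, with merchant text handled by a separate BPE pass) can be sketched as follows. Everything here is a hypothetical illustration under stated assumptions, not Nubank's actual code: the function name `tokenize_transaction`, the token spellings, and the `AMOUNT_BINS` edges are all invented for this sketch.

```python
# Sketch of a Nubank-style structured-field tokenizer: one transaction
# becomes a short sequence of special tokens instead of raw text.
from datetime import datetime

AMOUNT_BINS = [1, 5, 10, 50, 100, 500, 1000, 5000]  # bin edges (hypothetical)

def tokenize_transaction(amount: float, when: datetime, merchant: str) -> list[str]:
    """Map one transaction to a short sequence of structured-field tokens."""
    tokens = ["[TX]"]
    # Credit/debit sign token.
    tokens.append("[CREDIT]" if amount >= 0 else "[DEBIT]")
    # Amount-magnitude token: index of the first bin edge >= |amount|.
    mag = abs(amount)
    bin_idx = next((i for i, edge in enumerate(AMOUNT_BINS) if mag <= edge),
                   len(AMOUNT_BINS))
    tokens.append(f"[AMT_BIN_{bin_idx}]")
    # Calendar decomposition: month / day-of-week / day-of-month / hour.
    tokens += [f"[MONTH_{when.month}]", f"[DOW_{when.weekday()}]",
               f"[DOM_{when.day}]", f"[HOUR_{when.hour}]"]
    # The merchant description would go through a trained BPE tokenizer here;
    # a normalized placeholder stands in for that step.
    tokens.append(merchant.upper().replace(" ", "_"))
    return tokens

print(tokenize_transaction(-42.90, datetime(2025, 3, 14, 9, 30), "coffee shop"))
```

With real BPE on the merchant string, the total lands in the same ballpark as the ~14 tokens per transaction the README cites for nuFormer.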