Add Nubank nuFormer reverse-engineering analysis — full pipeline reconstruction

docs/nubank_nuformer_analysis.md (new file, +610 lines)

# Reverse-Engineering Nubank's nuFormer: A Transaction Foundation Model

> **How Nubank built a domain tokenizer for 100M+ customers and O(100 billion) transactions — and how to replicate this for finance, e-commerce, and other domains.**
>
> *Analysis based on: arXiv:2507.23267 ("Your Spending Needs Attention"), the Building Nubank blog series, and all referenced academic papers.*

---

## Table of Contents

1. [Why This Matters for domainTokenizer](#1-why-this-matters-for-domaintokenizer)
2. [The Nubank Blog Series: Complete Inventory](#2-the-nubank-blog-series-complete-inventory)
3. [The nuFormer Architecture: Full Reconstruction](#3-the-nuformer-architecture-full-reconstruction)
   - 3.1 [Step 1: The Domain Tokenizer — Transactions → Tokens](#31-step-1-the-domain-tokenizer--transactions--tokens)
   - 3.2 [Step 2: The Transaction Transformer — Pre-training](#32-step-2-the-transaction-transformer--pre-training)
   - 3.3 [Step 3: Joint Fusion — Combining Sequences + Tabular Features](#33-step-3-joint-fusion--combining-sequences--tabular-features)
4. [The Four Academic Pillars](#4-the-four-academic-pillars)
   - 4.1 [RecFormer: Items as Sentences, Not IDs](#41-recformer-items-as-sentences-not-ids)
   - 4.2 [PLR Embeddings: Making Numbers First-Class Citizens](#42-plr-embeddings-making-numbers-first-class-citizens)
   - 4.3 [DCN V2: Explicit Feature Crossing](#43-dcn-v2-explicit-feature-crossing)
   - 4.4 [NoPE: No Positional Encoding Needed](#44-nope-no-positional-encoding-needed)
5. [Results & Scaling Laws](#5-results--scaling-laws)
6. [Connection to domainTokenizer Research](#6-connection-to-domaintokenizer-research)
7. [The Playbook: How to Walk Nubank's Path](#7-the-playbook-how-to-walk-nubanks-path)
8. [Complete Reference List](#8-complete-reference-list)

---

## 1. Why This Matters for domainTokenizer

Nubank didn't just build a model — they built **exactly what domainTokenizer envisions**: a domain-specific tokenizer that converts financial transactions into tokens, trains a small Transformer on those tokens, and uses it as a foundation model for downstream business tasks.

**The connection is direct:**

| domainTokenizer Concept | Nubank's Implementation |
|------------------------|------------------------|
| Domain tokens (not words) | Special tokens for amount, date, sign + BPE for descriptions |
| Small models that understand domain data | 24M and 330M parameter Transformers |
| Pre-training on domain sequences | Next-token prediction on transaction sequences |
| Fine-tuning for business tasks | Product recommendation (binary: will user activate?) |
| Beating traditional ML baselines | +1.25% relative AUC over LightGBM = 3× launch threshold |

Nubank **validated** the domainTokenizer thesis at production scale (100M+ users, 100B+ transactions) and published both the recipe and results. This is our blueprint.

---

## 2. The Nubank Blog Series: Complete Inventory

Nubank published a comprehensive series on the *Building Nubank* blog documenting their foundation model journey:

| # | Title | Focus | URL |
|---|-------|-------|-----|
| 1 | **Unlocking financial insights: How Nubank powers personalized experiences with foundation models** | Overview & motivation | [building.nubank.com/unlocking-financial-insights...](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/) |
| 2 | **Defining an interface between transaction data and foundation models** | The tokenizer design | [Braithwaite & Udagawa, 2025a] |
| 3 | **Fine-tuning transaction user models** | nuFormer fine-tuning recipe | [Braithwaite, Cavalcanti & Udagawa, 2025b] |
| 4 | **Understanding our customers' finances through foundation models** | Application layer & results | [Braithwaite & Udagawa, 2025c] |
| 5 | **Optimizing user narratives for foundation models** | Context window optimization | [Foust, 2025] |
| 6 | **Building foundation models into Nubank's AI platform** | MLOps & infrastructure | [Udagawa, 2025] |

**The arXiv paper** consolidating all technical details:
- **"Your spending needs attention: Modeling financial habits with transformers"** — [arXiv:2507.23267](https://arxiv.org/abs/2507.23267) (Braithwaite et al., July 2025)

---

## 3. The nuFormer Architecture: Full Reconstruction

### 3.1 Step 1: The Domain Tokenizer — Transactions → Tokens

This is the **core innovation** and the part most relevant to domainTokenizer. Nubank's tokenizer converts raw financial transactions into discrete token sequences.

#### Raw Transaction Data

Each transaction has three raw fields:

```
{
  "amount": 79.99,                     // float (positive or negative)
  "date": "2025-03-15T14:23:00",       // timestamp
  "description": "AMAZON MARKETPLACE"  // free text
}
```

#### The Tokenization Decision

Nubank considered three approaches and explicitly **rejected** the two extremes:

1. ❌ **Pure text serialization** (JSON stringification → BPE): Too many tokens per transaction. A JSON string like `{"amount": 79.99, "date": "2025-03-15", "desc": "AMAZON MARKETPLACE"}` would consume ~35-50 BPE tokens per transaction, leaving room for only ~40-60 transactions in a 2048-token context window.

2. ❌ **Pure numerical encoding** (all fields as embeddings, no text): Loses the rich information in transaction descriptions (merchant names, payment categories, etc.).

3. ✅ **Hybrid: special tokens for structured fields + BPE for text**: Best of both worlds.

#### The Special Token Vocabulary

Each structured field gets its own small, fixed vocabulary of **special tokens**:

| Field | Tokenizer Function | Vocabulary Size | Example |
|-------|-------------------|-----------------|---------|
| **Amount Sign** | `ϕ_sign : ℝ → V_sign` | **2 tokens** | `[CREDIT]` or `[DEBIT]` |
| **Amount Bucket** | `ϕ_amt : ℝ → V_amt` (quantized bins) | **21 tokens** | `[AMT_BIN_14]` (e.g., $50-$100 range) |
| **Month** | `ϕ_month : date → V_month` | **12 tokens** | `[MARCH]` |
| **Day of Week** | `ϕ_dow : date → V_dow` | **7 tokens** | `[WEDNESDAY]` |
| **Day of Month** | `ϕ_dom : date → V_dom` | **31 tokens** | `[DAY_15]` |
| **Hour** | `ϕ_hour : date → V_hour` | **24 tokens** | `[HOUR_14]` |

**Total special tokens:** 2 + 21 + 12 + 7 + 31 + 24 = **97 special tokens**

The text description field uses standard **BPE tokenization**, producing a variable number of subword tokens.

#### Combined Vocabulary

```
V = V_special (97 tokens) ∪ V_BPE (standard BPE vocabulary)
```

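Neither the paper nor the blog publishes the 21 bucket boundaries. A minimal sketch of how such a quantizer is typically fit, assuming quantile bins over absolute amounts (the bin-fitting strategy and the token spellings are assumptions, not Nubank's published choices):

```python
import numpy as np

def fit_amount_buckets(train_amounts: np.ndarray, n_bins: int = 21) -> np.ndarray:
    """Fit quantile-based bin edges on absolute transaction amounts."""
    quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]   # 20 interior cut points
    return np.quantile(np.abs(train_amounts), quantiles)

def tokenize_amount(amount: float, bin_edges: np.ndarray) -> list[str]:
    """Map a signed amount to its two special tokens: sign + magnitude bucket."""
    sign_token = "[CREDIT]" if amount >= 0 else "[DEBIT]"
    bucket = int(np.searchsorted(bin_edges, abs(amount)))  # 0..20
    return [sign_token, f"[AMT_BIN_{bucket}]"]
```

Quantile bins keep each of the 21 buckets roughly equally populated, which is the standard choice when the amount distribution is heavy-tailed.
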
#### Token Sequence Layout Per Transaction

```
Transaction t_i = [
    AMT_SIGN_TOKEN,     # 1 token: CREDIT or DEBIT
    AMT_BUCKET_TOKEN,   # 1 token: one of 21 quantized bins
    MONTH_TOKEN,        # 1 token: Jan–Dec
    DOW_TOKEN,          # 1 token: Mon–Sun
    DOM_TOKEN,          # 1 token: 1–31
    HOUR_TOKEN,         # 1 token: 0–23
    desc_tok_1,         # variable: BPE tokens for "AMAZON"
    desc_tok_2,         # "MARKET"
    desc_tok_3,         # "PLACE"
    ...
]
```

**Average: ~14 tokens per transaction.**

This means a **2048-token context window holds approximately 146 transactions** — enough to capture several months of financial behavior for a typical consumer.

#### User Sequence Construction

For each user, transactions are ordered chronologically:

```
user_sequence = [t_1, t_2, t_3, ..., t_N]
```

where N varies per user; sequences are truncated to fit the context window, keeping the most recent transactions.

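A minimal sketch of this packing step. Whether Nubank truncates at exact transaction boundaries is not stated; this sketch assumes whole transactions are kept, newest first, with `tokenize_transaction` being a per-transaction tokenizer like the one sketched in the playbook (§7):

```python
def build_user_sequence(transactions, tokenize_transaction, max_len: int = 2048) -> list[str]:
    """Pack a user's most recent transactions into at most max_len tokens."""
    packed, budget = [], max_len
    # Walk backwards from the newest transaction, keeping whole transactions only.
    for txn in sorted(transactions, key=lambda t: t.date, reverse=True):
        toks = tokenize_transaction(txn)
        if len(toks) > budget:
            break
        packed.append(toks)
        budget -= len(toks)
    packed.reverse()  # restore chronological (oldest → newest) order
    return [tok for txn_toks in packed for tok in txn_toks]
```
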
#### Why This Design Wins

| Metric | Pure Text | Pure Embedding | Nubank Hybrid |
|--------|-----------|----------------|---------------|
| Tokens per transaction | ~35-50 | 1 (but fixed-dim) | **~14** |
| Transactions in 2048 context | ~40-60 | 2048 | **~146** |
| Captures description text | ✅ | ❌ | ✅ |
| Captures numerical structure | ❌ (fragmented) | ✅ | ✅ |
| Captures temporal patterns | ❌ | Partial | ✅ |
| Works with standard Transformer | ✅ | Needs custom arch | ✅ |

### 3.2 Step 2: The Transaction Transformer — Pre-training

#### Architecture Choice: GPT-style Causal Decoder

Nubank chose a **decoder-only, GPT-style causal Transformer**, not BERT-style bidirectional. Reasons:

1. **Industry precedent:** State-of-the-art sequential recommendation systems (Pinterest PinnerFormer, Meta NxtPost) use causal architectures
2. **No autoregressive generation needed:** At inference, the model produces a single user embedding from the full sequence — no token-by-token generation required
3. **Better for long-range dependencies:** Causal attention naturally models temporal ordering

#### No Positional Encoding (NoPE)

Based on Kazemnejad et al. (2023), nuFormer uses **no explicit positional encoding**. The finding: NoPE outperforms RoPE, ALiBi, and learned absolute position embeddings on length generalization. Since users have varying transaction history lengths, length generalization is critical.

#### Model Sizes

| Variant | Parameters | Hidden Dim | Layers | Heads | Context |
|---------|-----------|------------|--------|-------|---------|
| **nuFormer-Small** | **24M** | 256 | 24 | 16 | 2048 |
| **nuFormer-Large** | **330M** | 1024 | 24 | 16 | 2048 |

Both share the same depth (24 layers, 16 heads) — they differ only in embedding dimension.

#### Pre-training Objective

**Causal Language Modeling (CLM):** Standard next-token prediction on the flattened transaction token sequences.

Given a user's transaction sequence tokenized as `[w_1, w_2, ..., w_T]`, the loss is:

```
L = -Σ_{t=1}^{T} log P(w_t | w_1, ..., w_{t-1})
```

This is the same objective as GPT — but instead of predicting the next word in a sentence, the model predicts the next token in a transaction sequence. This could be the next amount bucket, the next merchant name token, or the next month token.

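Concretely, the loss is the usual shifted cross-entropy. A minimal PyTorch sketch, where `model` is any causal decoder returning per-position logits (padding and masking omitted for brevity):

```python
import torch
import torch.nn.functional as F

def clm_loss(model, token_ids: torch.Tensor) -> torch.Tensor:
    """Next-token prediction over a batch of transaction token sequences.

    token_ids: (batch, seq_len) integer ids drawn from V_special ∪ V_BPE.
    """
    logits = model(token_ids)                              # (batch, seq_len, vocab)
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # position t predicts t+1
    target = token_ids[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```
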
#### Pre-training Data

- **20M user rows** for baseline experiments
- Up to **203M labeled rows** for fine-tuning experiments
- Data spans credit card, debit card, open finance, wires, transfers, and bill items
- **O(100 billion) total transactions** across Nubank's 100M+ member base

### 3.3 Step 3: Joint Fusion — Combining Sequences + Tabular Features

Nubank explored three fusion strategies for combining the transaction transformer with traditional tabular features:

#### Strategy A: Early Fusion (Extract → Downstream)

```
Transaction Sequence → Pre-trained Transformer → User Embedding (frozen)
                                                        ↓
                                 Feed into LightGBM with other features
```

Fastest to iterate, but loses end-to-end gradients.

#### Strategy B: Late Fusion (Concatenate → Joint Head)

```
Transaction Sequence → Transformer → User Embedding ──────────┐
                                                              ├─→ MLP Head → Prediction
Tabular Features (291) → Simple Embedding ────────────────────┘
```

Better than early fusion, but the tabular branch is underparameterized.

#### Strategy C: Joint Fusion = nuFormer (Best)

```
Transaction Sequence → Transformer → User Embedding ──────────────────────┐
                                                                          ├─→ Shared MLP → Prediction
Tabular Features (291) → PLR Embeddings → DCNv2 → Feature Embedding ──────┘
```

**This is the production architecture.** The key insight: the tabular branch needs its own powerful backbone (DCNv2) to match the expressiveness of the transformer branch. Joint end-to-end training allows both branches to co-adapt.

#### The Tabular Branch: DCNv2 + PLR

**291 hand-crafted features** (numerical + categorical), processed as follows (a code sketch follows the list):

1. **Numerical features:** Transformed via PLR (Periodic → Linear → ReLU) embeddings:
   ```
   PLR(x) = ReLU(Linear([sin(2πc₁x), cos(2πc₁x), ..., sin(2πcₖx), cos(2πcₖx)]))
   ```
   where the frequencies `c₁..cₖ` are **learned parameters**. This maps scalars to high-dimensional dense vectors that capture both magnitude and periodicity.

2. **Categorical features:** Standard embedding lookup tables.

3. **Feature interaction:** DCN V2 (Deep Cross Network V2) models explicit feature interactions:
   ```
   x_{l+1} = x₀ ⊙ (W_l · x_l + b_l) + x_l
   ```
   Full-rank weight matrices `W_l` enable capturing pairwise and higher-order feature interactions.

4. **Regularization:** L2 regularization on DCNv2 cross-layer weights to prevent overfitting.

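A minimal end-to-end sketch of this branch. The embedding sizes (`k`, `d`) and per-feature frequency layout are assumptions, since Nubank does not publish them; the L2 term from item 4 corresponds to weight decay on the cross-layer weights:

```python
import torch
import torch.nn as nn

class PLREmbeddings(nn.Module):
    """Periodic → Linear → ReLU embedding applied to each of n_features scalars."""
    def __init__(self, n_features: int, k: int = 16, d: int = 24):
        super().__init__()
        self.freq = nn.Parameter(torch.randn(n_features, k))  # learned frequencies
        self.linear = nn.Linear(2 * k, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (batch, n_features)
        v = 2 * torch.pi * self.freq * x.unsqueeze(-1)        # (batch, n_features, k)
        periodic = torch.cat([torch.sin(v), torch.cos(v)], dim=-1)
        return torch.relu(self.linear(periodic))              # (batch, n_features, d)

class TabularBranch(nn.Module):
    """PLR embeddings → flatten → stacked DCN-V2 cross layers."""
    def __init__(self, n_features: int = 291, k: int = 16, d: int = 24, n_cross: int = 3):
        super().__init__()
        self.plr = PLREmbeddings(n_features, k, d)
        dim = n_features * d
        self.cross = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_cross)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x0 = xl = self.plr(x).flatten(1)         # (batch, n_features * d)
        for layer in self.cross:
            xl = x0 * layer(xl) + xl             # x_{l+1} = x0 ⊙ (W·xl + b) + xl
        return xl
```
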
---

## 4. The Four Academic Pillars

Nubank's architecture stands on four papers. Understanding them is essential for replication.

### 4.1 RecFormer: Items as Sentences, Not IDs

**Paper:** "Text Is All You Need: Learning Language Representations for Sequential Recommendation"
**Authors:** Li et al. (UCSD + Amazon) | **KDD 2023** | [arXiv:2305.13731](https://arxiv.org/abs/2305.13731) | [GitHub 130⭐](https://github.com/aaronheee/recformer)

**Core idea:** Abolish item IDs entirely. Represent each item as a key-value attribute dictionary flattened into text:

```
Item: {Color: Black, Brand: Nike, Category: Shoes}
→ Tokens: ["Color", "Black", "Brand", "Nike", "Category", "Shoes"]
```

A user's interaction sequence becomes a sequence of these "item sentences."

**Four-embedding architecture:**

```
E_token = LayerNorm(A_token + B_position + C_type + D_item_position)
```

- A = token embedding (shared vocabulary)
- B = token position in full sequence
- C = token type (key vs. value vs. special)
- D = item position (which item in the user sequence)

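A minimal sketch of that input layer (all dimensions are illustrative, not RecFormer's actual configuration):

```python
import torch
import torch.nn as nn

class RecFormerEmbedding(nn.Module):
    """Sum of four embeddings followed by LayerNorm, RecFormer-style."""
    def __init__(self, vocab=30522, max_pos=1024, n_types=3, max_items=64, d=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, d)        # A: shared token vocabulary
        self.pos = nn.Embedding(max_pos, d)      # B: position in the full sequence
        self.typ = nn.Embedding(n_types, d)      # C: key / value / special
        self.item = nn.Embedding(max_items, d)   # D: which item in the user history
        self.norm = nn.LayerNorm(d)

    def forward(self, tok_ids, type_ids, item_ids):
        pos_ids = torch.arange(tok_ids.size(1), device=tok_ids.device)
        return self.norm(self.tok(tok_ids) + self.pos(pos_ids)
                         + self.typ(type_ids) + self.item(item_ids))
```
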
**What Nubank took:** The key-value flattening philosophy, but modified it with special tokens for structured fields (amount, date) to reduce tokens per transaction from ~35 to ~14.

### 4.2 PLR Embeddings: Making Numbers First-Class Citizens

**Paper:** "On Embeddings for Numerical Features in Tabular Deep Learning"
**Authors:** Gorishniy et al. (Yandex) | **NeurIPS 2022** | [arXiv:2203.05556](https://arxiv.org/abs/2203.05556) | [GitHub](https://github.com/yandex-research/tabular-dl-num-embeddings)

**Core idea:** Raw scalar features fed into MLPs/Transformers are poorly optimized. **Lifting scalars into high-dimensional periodic embeddings** dramatically improves performance.

**PLR (Periodic → Linear → ReLU):**

```python
import torch
import torch.nn.functional as F

def plr_embedding(x, frequencies, linear):
    # x: (batch,) scalar feature values
    # frequencies: (k,) LEARNED parameters, one per periodic channel
    # linear: learned torch.nn.Linear(2 * k, d_embedding)
    v = 2 * torch.pi * frequencies * x.unsqueeze(-1)            # (batch, k)
    periodic = torch.cat([torch.sin(v), torch.cos(v)], dim=-1)  # (batch, 2k)
    return F.relu(linear(periodic))                             # (batch, d_embedding)
```

**Key result:** With PLR embeddings, a plain MLP can match attention-based Transformers on tabular benchmarks. In Nubank's ablation, PLR is what pushed DCNv2 past LightGBM.

**What Nubank took:** PLR embeddings for all 291 numerical tabular features in the joint fusion branch. This was the critical ingredient:

| Model | Relative AUC vs. LightGBM |
|-------|--------------------------|
| DCNv2 (without PLR) | -0.09% |
| DCNv2 + PLR | **+0.06%** ← first to beat GBDT |
| DCNv2 + PLR + L2 | +0.08% |
| **nuFormer (full)** | **+0.31% to +0.52%** |

### 4.3 DCN V2: Explicit Feature Crossing

**Paper:** "DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems"
**Authors:** Wang et al. (Google) | **WebConf 2021** | [arXiv:2008.13535](https://arxiv.org/abs/2008.13535) | **Production at Google**

**Core idea:** Explicitly model feature interactions (crosses) via specialized cross layers with full-rank weight matrices:

```
x_{l+1} = x₀ ⊙ (W_l · x_l + b_l) + x_l   # element-wise product with input anchor
```

This captures feature interactions up to degree L+1 for an L-layer cross network. DCNv2 improves on DCN (2017) by using full-rank matrices instead of rank-1 updates.

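As a single module, the cross layer is a few lines. A minimal sketch (dimension names illustrative):

```python
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """One DCN-V2 cross layer: x_{l+1} = x0 ⊙ (W·x_l + b) + x_l."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)   # full-rank W_l plus bias b_l

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.linear(xl) + xl    # * is the element-wise (⊙) product

# Stacking three layers yields interactions up to degree 4:
#   x1 = layer1(x0, x0); x2 = layer2(x0, x1); x3 = layer3(x0, x2)
```
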
**What Nubank took:** DCNv2 as the backbone for the tabular feature branch (291 features). Combined with PLR embeddings, it forms the "tabular half" of the joint fusion nuFormer architecture.

### 4.4 NoPE: No Positional Encoding Needed

**Paper:** "The Impact of Positional Encoding on Length Generalization in Transformers"
**Authors:** Kazemnejad et al. (McGill/Mila) | **NeurIPS 2023** | [arXiv:2305.19466](https://arxiv.org/abs/2305.19466) | [HF Paper](https://huggingface.co/papers/2305.19466)

**Core finding:** Decoder-only Transformers with **no positional encoding** (NoPE) outperform those with RoPE, ALiBi, and absolute position embeddings on length generalization tasks.

**Why it works (theoretically):**
- **Theorem 1:** The first layer of a NoPE causal Transformer can recover absolute positions from causal attention patterns alone
- **Theorem 2:** Subsequent layers can implement relative PE via learned query-key interactions
- **Empirically:** NoPE's learned attention patterns converge to T5's relative PE — it gets relative PE "for free"

**What Nubank took:** No positional encoding in the transaction Transformer. Since users have vastly different transaction history lengths (some have 20 transactions, some have 2000+), length generalization is critical for production deployment.

---

## 5. Results & Scaling Laws

### Production Results

| Model | Relative AUC vs. LightGBM |
|-------|--------------------------|
| MLP (raw features) | -0.44% |
| DCNv2 | -0.09% |
| MLP + PLR | -0.23% |
| LightGBM (baseline) | 0.00% |
| DCNv2 + PLR | +0.06% |
| DCNv2 + PLR + L2 | +0.08% |
| **nuFormer-Small (24M, Joint Fusion)** | **+0.31%** |
| **nuFormer-Large (330M, Joint Fusion)** | **+0.52%** |

**Final production deployment: +1.25% relative AUC improvement** — cited as **3× the typical model launch threshold** at Nubank. This is a massive result for a production recommendation system.

### Scaling Laws

Nubank observed clear scaling laws across three dimensions:

**Model size scaling:**

| Model | Parameters | AUC Improvement |
|-------|-----------|-----------------|
| nuFormer-Small | 24M | +0.31% |
| nuFormer-Large | 330M | +0.52% |

**Context length scaling:**

| Context | Transactions Covered | Effect |
|---------|---------------------|--------|
| 512 tokens | ~36 transactions | Baseline |
| 1024 tokens | ~73 transactions | Better |
| 2048 tokens | ~146 transactions | **Best** (monotonic improvement) |

Larger models benefit more from longer context — the 330M model extracts more value from additional transaction history than the 24M model.

**Fine-tuning data scaling:**

| Training Rows | Effect |
|--------------|--------|
| 5M | Baseline |
| 20M | Better |
| 40M | Better still |
| 100M | Best |

Again, larger models show steeper improvement with more data.

### Data Source Ablation (Critical Insight)

Nubank tested three anonymized data sources (A, B, C — likely credit card, debit, open finance):

| Sources | AUC vs. ABC Baseline |
|---------|---------------------|
| A alone | +0.72 |
| B alone | -8.21 |
| C alone | -20.52 |
| **AB** | **+0.91 (best!)** |
| BC | -12.24 |
| AC | -0.27 |
| ABC (all) | 0.00 (baseline) |

**Key insight:** More data sources can **hurt** performance. Sources B and C carry less information per token — when they crowd out high-signal transactions (source A) in the fixed 2048-token context window, overall performance drops. **AB outperforms ABC**, meaning source C (likely the open-finance data) was actually diluting the signal from the other two sources.

**Implication for domainTokenizer:** The context window is a **resource allocation problem**. You must carefully choose which data to include, not just maximize volume.

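One way to act on this in code is to make the packing step from §3.1 source-aware, so low-signal sources can be excluded outright. The selection policy below is an assumption for illustration, not something Nubank describes:

```python
def build_user_sequence_filtered(transactions, tokenize_transaction,
                                 allowed_sources=frozenset({"A", "B"}),
                                 max_len: int = 2048) -> list[str]:
    """Pack only transactions from high-signal sources into the context window."""
    kept = [t for t in transactions if t.source in allowed_sources]
    return build_user_sequence(kept, tokenize_transaction, max_len)
```
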
---

## 6. Connection to domainTokenizer Research

### Direct Mapping to Our Framework

| Our Research Report Section | Nubank's Implementation |
|---------------------------|------------------------|
| §4.1 Semantic ID Tokenization | Not used — Nubank uses special tokens instead of RQ-VAE |
| §4.2 Action Sequence Tokenization (ActionPiece) | Partially analogous — the BPE-on-descriptions is similar, but no cross-field merging |
| §4.3 Financial Transaction Tokenization | **Exact match** — special tokens for amount/date + BPE for text |
| §4.4 Tabular Feature Tokenization (PLR) | **Exact match** — PLR embeddings for the 291 tabular features |
| §6.1 Quantization-Based (RQ-VAE) | Not used |
| §6.2 BPE-Inspired Merging | Only for text descriptions, not for structured fields |
| §6.3 Magnitude & Binning | **Exact match** — amount quantized to 21 bins |
| §6.5 Serialization-Based | Explicitly rejected as too token-hungry |

### What Nubank Validates

1. ✅ **Domain tokens work better than text tokens** — the special token vocabulary is the key innovation
2. ✅ **Small models (24M-330M) are sufficient** — you don't need 7B+ parameter LLMs
3. ✅ **Self-supervised pre-training transfers** — the pre-trained transaction Transformer improves downstream tasks
4. ✅ **Hybrid tokenization wins** — special tokens for structured data + BPE for text
5. ✅ **GPT-style causal modeling works for event sequences** — not just BERT-style masking

### What Nubank Didn't Do (Opportunities for domainTokenizer)

1. ❌ **No Semantic IDs (RQ-VAE):** Nubank tokenizes merchant descriptions via BPE but doesn't create learned codebook-based product/merchant IDs. This could be a significant improvement — merchants that always appear together could share semantic ID prefixes.

2. ❌ **No cross-field composite tokens (ActionPiece-style):** Each field is tokenized independently. A BPE-like merging of `{amount_bin + category + time_of_day}` into composite tokens could further compress the sequence and capture higher-order patterns.

3. ❌ **No continual learning (HOPE-style):** nuFormer is frozen after pre-training. The Nested Learning / HOPE paradigm could enable continuous adaptation to new spending patterns, new merchants, and seasonal shifts.

4. ❌ **No multi-resolution memory (CMS):** All tokens are treated equally in the attention window. A Continuum Memory System with different update frequencies could better handle the difference between recent transactions (high signal) and historical patterns (persistent knowledge).

### Nubank's Recipe = Our Blueprint for Phase 2

Nubank's exact pipeline maps to domainTokenizer's planned implementation:

```
domainTokenizer Phase 2 Implementation Plan
(directly following Nubank's validated recipe)

1. Schema Analysis → Identify field types
   [Nubank: amount(float), date(timestamp), description(text)]

2. Per-Field Tokenizer Construction
   [Nubank: ϕ_sign(2), ϕ_amt(21), ϕ_month(12), ϕ_dow(7), ϕ_dom(31), ϕ_hour(24), BPE(text)]
   [Us: same pattern, extensible to any domain schema]

3. Pre-train GPT-style Causal Transformer (NoPE)
   [Nubank: 24M-330M params, 2048 context, CLM objective]
   [Us: configurable sizes, same objective]

4. Joint Fusion Fine-tuning
   [Nubank: Transformer embeddings + DCNv2(PLR) on tabular features]
   [Us: pluggable fusion with any tabular backbone]
```

---

## 7. The Playbook: How to Walk Nubank's Path

### For Finance (Replicating Nubank)

**Step 1: Define your transaction schema**

```python
schema = {
    "amount": {"type": "numerical", "tokenizer": "sign_bucket", "sign_vocab": 2, "bucket_vocab": 21},
    "timestamp": {"type": "temporal", "tokenizer": "calendar",
                  "fields": ["month(12)", "dow(7)", "dom(31)", "hour(24)"]},
    "description": {"type": "text", "tokenizer": "bpe"},
    # Extensions beyond Nubank:
    "merchant_category": {"type": "categorical", "tokenizer": "vocab", "vocab_size": 50},
    "channel": {"type": "categorical", "tokenizer": "vocab", "vocab_size": 10},
}
```

**Step 2: Build the tokenizer (97 special tokens + BPE)**

```python
class TransactionTokenizer:
    def __init__(self, schema):
        # build_special_vocab and the per-field helpers below are assumed utilities.
        self.special_tokens = build_special_vocab(schema)          # ~97-150 tokens
        self.bpe_tokenizer = AutoTokenizer.from_pretrained("...")  # for text fields

    def tokenize_transaction(self, txn):
        tokens = []
        tokens.append(self.sign_token(txn.amount))                 # 1 token
        tokens.append(self.amount_bucket(txn.amount))              # 1 token
        tokens.extend(self.calendar_tokens(txn.timestamp))         # 4 tokens
        tokens.extend(self.bpe_tokenizer.tokenize(txn.description))  # ~8 tokens avg
        return tokens                                              # ~14 tokens total
```

**Step 3: Pre-train (24M params, CLM)**

```python
# GPTCausalLM and train_clm are stand-ins for any decoder-only implementation.
model = GPTCausalLM(
    vocab_size=len(special_tokens) + bpe_vocab_size,
    d_model=256, n_layers=24, n_heads=16,
    max_seq_len=2048,
    positional_encoding=None,  # NoPE!
)
# Pre-train on transaction sequences
train_clm(model, transaction_sequences, epochs=...)
```

**Step 4: Joint Fusion Fine-tuning**

```python
import torch
import torch.nn as nn

class NuFormer(nn.Module):
    def __init__(self, txn_transformer, tabular_features):
        super().__init__()
        self.txn_branch = txn_transformer  # pre-trained, unfrozen
        self.tab_branch = DCNv2(           # DCNv2 / PLREmbed / MLP: assumed modules
            input_dim=len(tabular_features),
            num_embeddings=PLREmbed(n_frequencies=64),
            cross_layers=3, deep_layers=3,
        )
        self.head = MLP(txn_dim + tab_dim, hidden, 1)

    def forward(self, txn_tokens, tabular_features):
        txn_embed = self.txn_branch(txn_tokens)[:, -1, :]  # last-token embedding
        tab_embed = self.tab_branch(tabular_features)
        combined = torch.cat([txn_embed, tab_embed], dim=-1)
        return self.head(combined)
```

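A minimal fine-tuning loop for the class above, assuming a binary activation label. The optimizer choice, learning rate, and the `pretrained_txn_model` / `feature_list` / `dataloader` names are illustrative assumptions; weight decay here plays the role of the L2 regularization from §3.3:

```python
import torch

model = NuFormer(txn_transformer=pretrained_txn_model, tabular_features=feature_list)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-2)
loss_fn = torch.nn.BCEWithLogitsLoss()

for txn_tokens, tab_feats, labels in dataloader:     # joint end-to-end training
    logits = model(txn_tokens, tab_feats).squeeze(-1)
    loss = loss_fn(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
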
### For E-Commerce (Adapting Nubank's Recipe)

**The adaptation is straightforward — replace transaction fields with e-commerce event fields:**

| Finance (Nubank) | E-Commerce (Adaptation) |
|------------------|----------------------|
| amount (float) | price (float) — same ϕ_amt tokenizer |
| amount sign (credit/debit) | event_type (view/cart/purchase/return) — expand to 4+ tokens |
| timestamp (month/dow/dom/hour) | timestamp — same calendar tokens |
| description (merchant text) | product_title (BPE) — same approach |
| — | category (hierarchical) — add special tokens |
| — | brand — add special tokens or BPE |
| — | quantity — small fixed vocab (1-10+) |

**E-commerce special token vocabulary:**

```python
e_commerce_special_tokens = {
    "event_type": 5,      # view, cart, purchase, return, wishlist
    "price_bucket": 21,   # same binning as Nubank
    "quantity": 11,       # 1-10, 10+
    "category_l1": 30,    # top-level categories
    "category_l2": 200,   # subcategories
    "month": 12,
    "dow": 7,
    "dom": 31,
    "hour": 24,
}
# Total: ~341 special tokens + BPE for product titles
# ~16 tokens per event → 2048 context ≈ 128 events
```

**Pre-training objectives (same as Nubank):**
- Causal LM: predict the next token in the event sequence
- Downstream: next purchase prediction, churn, product recommendation, customer segmentation

### For Healthcare (Same Pattern)

```python
healthcare_special_tokens = {
    "event_type": 10,     # diagnosis, procedure, lab, medication, visit, ...
    "icd_category": 50,   # top-level ICD-10 groups
    "cpt_category": 40,   # procedure categories
    "cost_bucket": 21,    # same binning
    "provider_type": 15,  # PCP, specialist, ER, ...
    "month": 12, "dow": 7, "dom": 31,
}
# Description: BPE on clinical notes/medication names
```

---

## 8. Complete Reference List

### Nubank Sources

| Ref | Authors | Title | Link |
|-----|---------|-------|------|
| **Primary** | Braithwaite et al. | Your spending needs attention: Modeling financial habits with transformers | [arXiv:2507.23267](https://arxiv.org/abs/2507.23267) |
| Blog 1 | — | Unlocking financial insights: How Nubank powers personalized experiences | [building.nubank.com](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/) |
| Blog 2 | Braithwaite & Udagawa | Defining an interface between transaction data and foundation models | Building Nubank, 2025a |
| Blog 3 | Braithwaite, Cavalcanti & Udagawa | Fine-tuning transaction user models | Building Nubank, 2025b |
| Blog 4 | Braithwaite & Udagawa | Understanding our customers' finances through foundation models | Building Nubank, 2025c |
| Blog 5 | Foust | Optimizing user narratives for foundation models | Building Nubank, 2025 |
| Blog 6 | Udagawa | Building foundation models into Nubank's AI platform | Building Nubank, 2025 |

### Academic References (Used by nuFormer)

| Paper | Authors | Year | arXiv | Role in nuFormer |
|-------|---------|------|-------|-----------------|
| **RecFormer** | Li et al. | 2023 | [2305.13731](https://arxiv.org/abs/2305.13731) | Tokenization philosophy: items as key-value text |
| **PLR Embeddings** | Gorishniy et al. | 2022 | [2203.05556](https://arxiv.org/abs/2203.05556) | Numerical feature → periodic embeddings |
| **DCN V2** | Wang et al. | 2021 | [2008.13535](https://arxiv.org/abs/2008.13535) | Tabular feature cross-interaction backbone |
| **NoPE** | Kazemnejad et al. | 2023 | [2305.19466](https://arxiv.org/abs/2305.19466) | No positional encoding for length generalization |
| **FlashAttention** | Dao et al. | 2022 | [2205.14135](https://arxiv.org/abs/2205.14135) | Efficient attention computation |
| **Banking TF** | Delestre & Sola | 2024 | [2410.08243](https://arxiv.org/abs/2410.08243) | Parallel work: French bank transaction tokenizer |

### Related Papers from domainTokenizer Research

| Paper | Year | arXiv | Connection |
|-------|------|-------|-----------|
| **TIGER** | 2023 | [2305.05065](https://arxiv.org/abs/2305.05065) | Alternative: RQ-VAE Semantic IDs (Nubank didn't use) |
| **ActionPiece** | 2025 | [2502.13581](https://arxiv.org/abs/2502.13581) | Alternative: BPE-like merging of action features (Nubank didn't use) |
| **Nested Learning (HOPE)** | 2025 | [2512.24695](https://arxiv.org/abs/2512.24695) | Future: continual learning for domain models |

---

*This analysis reconstructs Nubank's full pipeline from public sources. The actual production system may have additional proprietary components not disclosed in the blog series or arXiv paper.*