| # Reverse-Engineering Nubank's nuFormer: A Transaction Foundation Model |
|
|
> **How Nubank built a domain tokenizer for 100M+ customers and O(100 billion) transactions, and how to replicate this for finance, e-commerce, and other domains.**
| > |
| > *Analysis based on: arXiv:2507.23267 ("Your Spending Needs Attention"), the Building Nubank blog series, and all referenced academic papers.* |
|
|
| --- |
|
|
| ## Table of Contents |
|
|
| 1. [Why This Matters for domainTokenizer](#1-why-this-matters-for-domaintokenizer) |
| 2. [The Nubank Blog Series: Complete Inventory](#2-the-nubank-blog-series-complete-inventory) |
| 3. [The nuFormer Architecture: Full Reconstruction](#3-the-nuformer-architecture-full-reconstruction) |
   - 3.1 [Step 1: The Domain Tokenizer – Transactions → Tokens](#31-step-1-the-domain-tokenizer--transactions--tokens)
   - 3.2 [Step 2: The Transaction Transformer – Pre-training](#32-step-2-the-transaction-transformer--pre-training)
   - 3.3 [Step 3: Joint Fusion – Combining Sequences + Tabular Features](#33-step-3-joint-fusion--combining-sequences--tabular-features)
| 4. [The Four Academic Pillars](#4-the-four-academic-pillars) |
| - 4.1 [RecFormer: Items as Sentences, Not IDs](#41-recformer-items-as-sentences-not-ids) |
| - 4.2 [PLR Embeddings: Making Numbers First-Class Citizens](#42-plr-embeddings-making-numbers-first-class-citizens) |
| - 4.3 [DCN V2: Explicit Feature Crossing](#43-dcn-v2-explicit-feature-crossing) |
| - 4.4 [NoPE: No Positional Encoding Needed](#44-nope-no-positional-encoding-needed) |
| 5. [Results & Scaling Laws](#5-results--scaling-laws) |
| 6. [Connection to domainTokenizer Research](#6-connection-to-domaintokenizer-research) |
| 7. [The Playbook: How to Walk Nubank's Path](#7-the-playbook-how-to-walk-nubanks-path) |
| 8. [Complete Reference List](#8-complete-reference-list) |
|
|
| --- |
|
|
| ## 1. Why This Matters for domainTokenizer |
|
|
Nubank didn't just build a model; they built **exactly what domainTokenizer envisions**: a domain-specific tokenizer that converts financial transactions into tokens, trains a small Transformer on those tokens, and uses it as a foundation model for downstream business tasks.
|
|
| **The connection is direct:** |
|
|
| | domainTokenizer Concept | Nubank's Implementation | |
| |------------------------|------------------------| |
| | Domain tokens (not words) | Special tokens for amount, date, sign + BPE for descriptions | |
| | Small models that understand domain data | 24M and 330M parameter Transformers | |
| | Pre-training on domain sequences | Next-token prediction on transaction sequences | |
| | Fine-tuning for business tasks | Product recommendation (binary: will user activate?) | |
| Beating traditional ML baselines | +1.25% relative AUC over LightGBM = 3× launch threshold |
|
|
| Nubank **validated** the domainTokenizer thesis at production scale (100M+ users, 100B+ transactions) and published both the recipe and results. This is our blueprint. |
|
|
| --- |
|
|
| ## 2. The Nubank Blog Series: Complete Inventory |
|
|
| Nubank published a comprehensive blog series on Building Nubank documenting their foundation model journey: |
|
|
| | # | Title | Focus | URL | |
| |---|-------|-------|-----| |
| | 1 | **Unlocking financial insights: How Nubank powers personalized experiences with foundation models** | Overview & motivation | [building.nubank.com/unlocking-financial-insights...](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/) | |
| | 2 | **Defining an interface between transaction data and foundation models** | The tokenizer design | [Braithwaite & Udagawa, 2025a] | |
| | 3 | **Fine-tuning transaction user models** | nuFormer fine-tuning recipe | [Braithwaite, Cavalcanti & Udagawa, 2025b] | |
| | 4 | **Understanding our customers' finances through foundation models** | Application layer & results | [Braithwaite & Udagawa, 2025c] | |
| | 5 | **Optimizing user narratives for foundation models** | Context window optimization | [Foust, 2025] | |
| | 6 | **Building foundation models into Nubank's AI platform** | MLOps & infrastructure | [Udagawa, 2025] | |
|
|
| **The arXiv paper** consolidating all technical details: |
| - **"Your spending needs attention: Modeling financial habits with transformers"** β [arXiv: 2507.23267](https://arxiv.org/abs/2507.23267) (Braithwaite et al., July 2025) |
|
|
| --- |
|
|
| ## 3. The nuFormer Architecture: Full Reconstruction |
|
|
### 3.1 Step 1: The Domain Tokenizer – Transactions → Tokens
|
|
| This is the **core innovation** and the part most relevant to domainTokenizer. Nubank's tokenizer converts raw financial transactions into discrete token sequences. |
|
|
| #### Raw Transaction Data |
| Each transaction has three raw fields: |
| ``` |
| { |
| "amount": 79.99, // float (positive or negative) |
| "date": "2025-03-15T14:23:00", // timestamp |
| "description": "AMAZON MARKETPLACE" // free text |
| } |
| ``` |
|
|
| #### The Tokenization Decision |
|
|
Nubank explicitly considered, and **rejected**, two extreme approaches before settling on a hybrid:
|
|
| 1. β **Pure text serialization** (JSON stringification β BPE): Too many tokens per transaction. A JSON string like `{"amount": 79.99, "date": "2025-03-15", "desc": "AMAZON MARKETPLACE"}` would consume ~30-50 BPE tokens per transaction, leaving only ~40-60 transactions in a 2048-token context window. |
|
|
| 2. β **Pure numerical encoding** (all fields as embeddings, no text): Loses the rich information in transaction descriptions (merchant names, payment categories, etc.) |
|
|
| 3. β
**Hybrid: Special tokens for structured fields + BPE for text**: Best of both worlds. |
|
|
| #### The Special Token Vocabulary |
|
|
| Each structured field gets its own small, fixed vocabulary of **special tokens**: |
|
|
| | Field | Tokenizer Function | Vocabulary Size | Example | |
| |-------|-------------------|-----------------|---------| |
| **Amount Sign** | `τ_sign : ℝ → V_sign` | **2 tokens** | `[CREDIT]` or `[DEBIT]` |
| **Amount Bucket** | `τ_amt : ℝ → V_amt` (quantized bins) | **21 tokens** | `[AMT_BIN_14]` (e.g., $50-$100 range) |
| **Month** | `τ_month : date → V_month` | **12 tokens** | `[MARCH]` |
| **Day of Week** | `τ_dow : date → V_dow` | **7 tokens** | `[WEDNESDAY]` |
| **Day of Month** | `τ_dom : date → V_dom` | **31 tokens** | `[DAY_15]` |
| **Hour** | `τ_hour : date → V_hour` | **24 tokens** | `[HOUR_14]` |
|
|
| **Total special tokens:** 2 + 21 + 12 + 7 + 31 + 24 = **97 special tokens** |
|
|
| The text description field uses standard **BPE tokenization**, producing a variable number of subword tokens. |
|
|
| #### Combined Vocabulary |
|
|
| ``` |
V = V_special (97 tokens) ∪ V_BPE (standard BPE vocabulary)
| ``` |
|
|
| #### Token Sequence Layout Per Transaction |
|
|
| ``` |
| Transaction t_i = [ |
| AMT_SIGN_TOKEN, # 1 token: CREDIT or DEBIT |
| AMT_BUCKET_TOKEN, # 1 token: one of 21 quantized bins |
| MONTH_TOKEN, # 1 token: JanβDec |
| DOW_TOKEN, # 1 token: MonβSun |
| DOM_TOKEN, # 1 token: 1β31 |
| HOUR_TOKEN, # 1 token: 0β23 |
| desc_tok_1, # variable: BPE tokens for "AMAZON" |
| desc_tok_2, # "MARKET" |
| desc_tok_3, # "PLACE" |
| ... |
| ] |
| ``` |
|
|
| **Average: ~14 tokens per transaction.** |
|
|
This means a **2048-token context window holds approximately 146 transactions** (2048 / 14 ≈ 146), enough to capture several months of financial behavior for a typical consumer.
|
|
| #### User Sequence Construction |
|
|
| For each user, transactions are ordered chronologically: |
| ``` |
| user_sequence = [t_1, t_2, t_3, ..., t_N] |
| ``` |
where N varies per user; the sequence is truncated to fit the context window, keeping the most recent transactions.
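
A sketch of how such a sequence might be assembled under a token budget; `tokenize_transaction` is the per-transaction tokenizer from above, and the truncation rule (keep the most recent transactions that fit) follows the description in the text:

```python
def build_user_sequence(transactions, tokenize_transaction, max_tokens=2048):
    """Keep the most recent transactions that fit in the context window."""
    transactions = sorted(transactions, key=lambda t: t.timestamp)  # oldest → newest
    kept, budget = [], max_tokens
    for txn in reversed(transactions):               # fill the budget from the newest end
        tokens = tokenize_transaction(txn)           # ~14 tokens on average
        if len(tokens) > budget:
            break
        kept.append(tokens)
        budget -= len(tokens)
    kept.reverse()                                   # restore chronological order
    return [tok for txn_tokens in kept for tok in txn_tokens]
```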
|
|
| #### Why This Design Wins |
|
|
| | Metric | Pure Text | Pure Embedding | Nubank Hybrid | |
| |--------|-----------|----------------|---------------| |
| | Tokens per transaction | ~35-50 | 1 (but fixed-dim) | **~14** | |
| | Transactions in 2048 context | ~40-60 | 2048 | **~146** | |
| Captures description text | ✅ | ❌ | ✅ |
| Captures numerical structure | ❌ (fragmented) | ✅ | ✅ |
| Captures temporal patterns | ❌ | Partial | ✅ |
| Works with standard Transformer | ✅ | Needs custom arch | ✅ |
|
|
### 3.2 Step 2: The Transaction Transformer – Pre-training
|
|
| #### Architecture Choice: GPT-style Causal Decoder |
|
|
Nubank chose a **decoder-only, GPT-style causal Transformer**, not a BERT-style bidirectional encoder. Reasons:
|
|
| 1. **Industry precedent:** State-of-the-art sequential recommendation systems (Pinterest PinnerFormer, Meta NxtPost) use causal architectures |
2. **No autoregressive generation needed:** At inference, the model produces a single user embedding from the full sequence; no token-by-token generation required
| 3. **Better for long-range dependencies:** Causal attention naturally models temporal ordering |
|
|
| #### No Positional Encoding (NoPE) |
|
|
| Based on Kazemnejad et al. (2023), nuFormer uses **no explicit positional encoding**. The finding: NoPE outperforms RoPE, ALiBi, and learned absolute position embeddings on length generalization. Since users have varying transaction history lengths, length generalization is critical. |
|
|
| #### Model Sizes |
|
|
| | Variant | Parameters | Hidden Dim | Layers | Heads | Context | |
| |---------|-----------|------------|--------|-------|---------| |
| | **nuFormer-Small** | **24M** | 256 | 24 | 16 | 2048 | |
| | **nuFormer-Large** | **330M** | 1024 | 24 | 16 | 2048 | |
|
|
Both variants share the same depth and head count (24 layers, 16 heads); they differ only in hidden dimension.
|
|
| #### Pre-training Objective |
|
|
| **Causal Language Modeling (CLM):** Standard next-token prediction on the flattened transaction token sequences. |
|
|
| Given a user's transaction sequence tokenized as `[w_1, w_2, ..., w_T]`, the loss is: |
|
|
| ``` |
L = -Σ_{t=1}^{T} log P(w_t | w_1, ..., w_{t-1})
| ``` |
|
|
This is the same objective as GPT, but instead of predicting the next word in a sentence, the model predicts the next token in a transaction sequence: the next amount bucket, the next merchant-name token, or the next month token.
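
In code, this is ordinary next-token cross-entropy over the flattened sequence. A minimal PyTorch sketch, assuming `model` maps token ids to per-position logits and padding uses a dedicated id:

```python
import torch.nn.functional as F

def clm_loss(model, token_ids, pad_id=0):
    # token_ids: (batch, seq_len) flattened transaction tokens for a batch of users
    logits = model(token_ids)                                   # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = token_ids[:, 1:].reshape(-1)                 # predict token t from tokens < t
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=pad_id)
```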
|
|
| #### Pre-training Data |
|
|
| - **20M user rows** for baseline experiments |
| - Up to **203M labeled rows** for fine-tuning experiments |
| - Data spans credit card, debit card, open finance, wires, transfers, and bill items |
| - **O(100 billion) total transactions** across Nubank's 100M+ member base |
|
|
### 3.3 Step 3: Joint Fusion – Combining Sequences + Tabular Features
|
|
| Nubank explored three fusion strategies for combining the transaction transformer with traditional tabular features: |
|
|
#### Strategy A: Early Fusion (Extract → Downstream)
```
Transaction Sequence → Pre-trained Transformer → User Embedding (frozen)
                                                       ↓
                                    Feed into LightGBM with other features
```
| Fastest to iterate but loses end-to-end gradients. |
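
A sketch of this variant using LightGBM's scikit-learn interface; `txn_transformer`, `user_token_sequences`, `tabular_features`, and `labels` are placeholders, and the assumption that the model returns (batch, seq_len, dim) hidden states is ours:

```python
import numpy as np
import torch
import lightgbm as lgb

# 1. Extract frozen user embeddings (last-token hidden state) from the pre-trained model.
with torch.no_grad():
    user_embeddings = np.stack([
        txn_transformer(tokens.unsqueeze(0))[0, -1].cpu().numpy()
        for tokens in user_token_sequences            # each: LongTensor of token ids
    ])

# 2. Concatenate with the hand-crafted tabular features and train LightGBM as usual.
X = np.concatenate([tabular_features, user_embeddings], axis=1)
clf = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
clf.fit(X, labels)
```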
|
|
#### Strategy B: Late Fusion (Concatenate → Joint Head)
```
Transaction Sequence → Transformer → User Embedding ───────┐
                                                            ├─→ MLP Head → Prediction
Tabular Features (291) → Simple Embedding ─────────────────┘
```
| Better than early fusion but the tabular branch is underparameterized. |
|
|
| #### Strategy C: Joint Fusion = nuFormer (Best) |
| ``` |
Transaction Sequence → Transformer → User Embedding ─────────────────────┐
                                                                          ├─→ Shared MLP → Prediction
Tabular Features (291) → PLR Embeddings → DCNv2 → Feature Embedding ─────┘
| ``` |
|
|
| **This is the production architecture.** The key insight: the tabular branch needs its own powerful backbone (DCNv2) to match the expressiveness of the transformer branch. Joint end-to-end training allows both branches to co-adapt. |
|
|
| #### The Tabular Branch: DCNv2 + PLR |
|
|
| **291 hand-crafted features** (numerical + categorical), processed as follows: |
|
|
1. **Numerical features:** Transformed via PLR (Periodic → Linear → ReLU) embeddings:
| ``` |
PLR(x) = ReLU(Linear([sin(2π·w_1·x + b_1), cos(2π·w_1·x + b_1), ..., sin(2π·w_k·x + b_k), cos(2π·w_k·x + b_k)]))
| ``` |
| Where frequencies `w` and phases `b` are **learned parameters**. This maps scalars to high-dimensional dense vectors that capture both magnitude and periodicity. |
|
|
| 2. **Categorical features:** Standard embedding lookup tables. |
|
|
| 3. **Feature interaction:** DCN V2 (Deep Cross Network V2) models explicit feature interactions: |
| ``` |
x_{l+1} = x_0 ⊙ (W_l · x_l + b_l) + x_l
| ``` |
| Full-rank weight matrices enable capturing all pairwise and higher-order feature interactions. |
|
|
| 4. **Regularization:** L2 regularization on DCNv2 cross-layer weights to prevent overfitting. |
|
|
| --- |
|
|
| ## 4. The Four Academic Pillars |
|
|
| Nubank's architecture stands on four papers. Understanding them is essential for replication. |
|
|
| ### 4.1 RecFormer: Items as Sentences, Not IDs |
|
|
| **Paper:** "Text Is All You Need: Learning Language Representations for Sequential Recommendation" |
**Authors:** Li et al. (UCSD + Amazon) | **KDD 2023** | [arXiv: 2305.13731](https://arxiv.org/abs/2305.13731) | [GitHub (~130★)](https://github.com/aaronheee/recformer)
|
|
| **Core idea:** Abolish item IDs entirely. Represent each item as a key-value attribute dictionary flattened into text: |
| ``` |
| Item: {Color: Black, Brand: Nike, Category: Shoes} |
| β Tokens: ["Color", "Black", "Brand", "Nike", "Category", "Shoes"] |
| ``` |
|
|
| A user's interaction sequence becomes a sequence of these "item sentences." |
|
|
| **Four-embedding architecture:** |
| ``` |
| E_token = LayerNorm(A_token + B_position + C_type + D_item_position) |
| ``` |
| - A = token embedding (shared vocabulary) |
| - B = token position in full sequence |
| - C = token type (key vs. value vs. special) |
| - D = item position (which item in the user sequence) |
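
A sketch of how that composition could look in PyTorch, with four standard embedding tables summed and normalized (dimensions and table sizes are illustrative, not RecFormer's published values):

```python
import torch
import torch.nn as nn

class RecFormerInputEmbedding(nn.Module):
    def __init__(self, vocab_size, max_tokens=1024, n_types=3, max_items=64, dim=256):
        super().__init__()
        self.token = nn.Embedding(vocab_size, dim)          # A: shared token vocabulary
        self.position = nn.Embedding(max_tokens, dim)       # B: token position in full sequence
        self.type = nn.Embedding(n_types, dim)              # C: key / value / special
        self.item_position = nn.Embedding(max_items, dim)   # D: which item in the user sequence
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_ids, type_ids, item_ids):
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        e = (self.token(token_ids) + self.position(positions)
             + self.type(type_ids) + self.item_position(item_ids))
        return self.norm(e)
```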
|
|
**What Nubank took:** The key-value flattening philosophy, modified with special tokens for structured fields (amount, date) to reduce tokens per transaction from ~35 to ~14.
|
|
| ### 4.2 PLR Embeddings: Making Numbers First-Class Citizens |
|
|
| **Paper:** "On Embeddings for Numerical Features in Tabular Deep Learning" |
| **Authors:** Gorishniy et al. (Yandex) | **NeurIPS 2022** | [arXiv: 2203.05556](https://arxiv.org/abs/2203.05556) | [GitHub](https://github.com/yandex-research/tabular-dl-num-embeddings) |
|
|
| **Core idea:** Raw scalar features fed into MLPs/Transformers are poorly optimized. **Lifting scalars into high-dimensional periodic embeddings** dramatically improves performance. |
|
|
**PLR (Periodic → Linear → ReLU):**
```python
import math
import torch
import torch.nn.functional as F

def plr_embedding(x, frequencies, phases, linear):
    # x: scalar feature value (tensor)
    # frequencies, phases: LEARNED parameters of shape [k]
    # linear: a learned torch.nn.Linear(2 * k, d_embedding)
    periodic = torch.cat([
        torch.sin(2 * math.pi * frequencies * x + phases),
        torch.cos(2 * math.pi * frequencies * x + phases)
    ])
    return F.relu(linear(periodic))
```
|
|
| **Key result:** With PLR embeddings, a plain MLP can match attention-based Transformers on tabular benchmarks. PLR is what lets DCNv2 beat LightGBM. |
|
|
| **What Nubank took:** PLR embeddings for all 291 numerical tabular features in the joint fusion branch. This was the critical ingredient: |
|
|
| | Model | Relative AUC vs. LightGBM | |
| |-------|--------------------------| |
| | DCNv2 (without PLR) | -0.09% | |
| DCNv2 + PLR | **+0.06%** (first to beat GBDT) |
| | DCNv2 + PLR + L2 | +0.08% | |
| | **nuFormer (full)** | **+0.31% to +0.52%** | |
|
|
| ### 4.3 DCN V2: Explicit Feature Crossing |
|
|
| **Paper:** "DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems" |
| **Authors:** Wang et al. (Google) | **WebConf 2021** | [arXiv: 2008.13535](https://arxiv.org/abs/2008.13535) | **Production at Google** |
|
|
| **Core idea:** Explicitly model feature interactions (crosses) via specialized cross layers with full-rank weight matrices: |
| ``` |
x_{l+1} = x_0 ⊙ (W_l · x_l + b_l) + x_l   # element-wise product with the input anchor x_0
| ``` |
|
|
| This captures feature interactions of degree L+1 for an L-layer cross network. DCNv2 improves on DCN (2017) by using full-rank matrices instead of rank-1. |
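
A minimal PyTorch sketch of the cross network; the layer count and the surrounding deep branch are Nubank's to choose and are not fully specified in the paper:

```python
import torch.nn as nn

class CrossLayer(nn.Module):
    """One DCN V2 cross layer: x_{l+1} = x_0 * (W_l x_l + b_l) + x_l."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)        # full-rank W_l and bias b_l

    def forward(self, x0, xl):
        return x0 * self.linear(xl) + xl         # element-wise product with the input anchor

class CrossNetwork(nn.Module):
    def __init__(self, dim, n_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([CrossLayer(dim) for _ in range(n_layers)])

    def forward(self, x0):
        x = x0
        for layer in self.layers:
            x = layer(x0, x)                     # interaction degree grows with depth
        return x
```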
|
|
| **What Nubank took:** DCNv2 as the backbone for the tabular feature branch (291 features). Combined with PLR embeddings, it forms the "tabular half" of the joint fusion nuFormer architecture. |
|
|
| ### 4.4 NoPE: No Positional Encoding Needed |
|
|
| **Paper:** "The Impact of Positional Encoding on Length Generalization in Transformers" |
| **Authors:** Kazemnejad et al. (McGill/Mila) | **NeurIPS 2023** | [arXiv: 2305.19466](https://arxiv.org/abs/2305.19466) | [HF Paper](https://huggingface.co/papers/2305.19466) |
|
|
| **Core finding:** Decoder-only Transformers with **no positional encoding** (NoPE) outperform those with RoPE, ALiBi, and absolute position embeddings on length generalization tasks. |
|
|
| **Why it works (theoretically):** |
| - **Theorem 1:** The first layer of a NoPE causal Transformer can recover absolute positions from causal attention patterns alone |
| - **Theorem 2:** Subsequent layers can implement relative PE via learned query-key interactions |
- **Empirically:** NoPE's learned attention patterns converge to T5-style relative PE; it gets relative PE "for free"
|
|
| **What Nubank took:** No positional encoding in the transaction Transformer. Since users have vastly different transaction history lengths (some have 20 transactions, some have 2000+), length generalization is critical for production deployment. |
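
Mechanically, "no positional encoding" just means token embeddings feed a causally masked stack with nothing added for position. A sketch using PyTorch's built-in encoder layer as a stand-in for the actual nuFormer stack (hyperparameters mirror the table above but are otherwise assumptions):

```python
import torch
import torch.nn as nn

class NoPETransactionModel(nn.Module):
    def __init__(self, vocab_size, dim=256, n_layers=24, n_heads=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)   # token embeddings only; no position table
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)

    def forward(self, token_ids):
        causal_mask = nn.Transformer.generate_square_subsequent_mask(
            token_ids.size(1)).to(token_ids.device)
        h = self.blocks(self.embed(token_ids), mask=causal_mask)
        return h[:, -1, :]                           # last hidden state as the user embedding
```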
|
|
| --- |
|
|
| ## 5. Results & Scaling Laws |
|
|
| ### Production Results |
|
|
| | Model | Relative AUC vs. LightGBM | |
| |-------|--------------------------| |
| | MLP (raw features) | -0.44% | |
| | DCNv2 | -0.09% | |
| | MLP + PLR | -0.23% | |
| | LightGBM (baseline) | 0.00% | |
| | DCNv2 + PLR | +0.06% | |
| | DCNv2 + PLR + L2 | +0.08% | |
| | **nuFormer-Small (24M, Joint Fusion)** | **+0.31%** | |
| | **nuFormer-Large (330M, Joint Fusion)** | **+0.52%** | |
|
|
**Final production deployment: +1.25% relative AUC improvement**, cited as **3× the typical model launch threshold** at Nubank. For a production recommendation system, this is a substantial gain.
|
|
| ### Scaling Laws |
|
|
| Nubank observed clear scaling laws across three dimensions: |
|
|
| **Model size scaling:** |
| | Model | Parameters | AUC Improvement | |
| |-------|-----------|-----------------| |
| | nuFormer-Small | 24M | +0.31% | |
| | nuFormer-Large | 330M | +0.52% | |
|
|
| **Context length scaling:** |
| | Context | Transactions Covered | Effect | |
| |---------|---------------------|--------| |
| | 512 tokens | ~36 transactions | Baseline | |
| | 1024 tokens | ~73 transactions | Better | |
| | 2048 tokens | ~146 transactions | **Best** (monotonic improvement) | |
|
|
Larger models benefit more from longer context: the 330M model extracts more value from additional transaction history than the 24M model.
|
|
| **Fine-tuning data scaling:** |
| | Training Rows | Effect | |
| |--------------|--------| |
| | 5M | Baseline | |
| | 20M | Better | |
| | 40M | Better still | |
| | 100M | Best | |
|
|
| Again, larger models show steeper improvement with more data. |
|
|
| ### Data Source Ablation (Critical Insight) |
|
|
Nubank tested three anonymized data sources (A, B, C; likely credit card, debit, and open finance):
|
|
| | Sources | AUC vs. ABC Baseline | |
| |---------|---------------------| |
| | A alone | +0.72 | |
| | B alone | -8.21 | |
| | C alone | -20.52 | |
| | **AB** | **+0.91 (best!)** | |
| | BC | -12.24 | |
| | AC | -0.27 | |
| | ABC (all) | 0.00 (baseline) | |
|
|
**Key insight:** More data sources can **hurt** performance. Sources B and C have lower information density; when they crowd out high-signal transactions (source A) in the fixed 2048-token context window, overall performance drops. **AB outperforms ABC**: the likely debit/open-finance data was diluting the credit card signal.
|
|
| **Implication for domainTokenizer:** Context window is a **resource allocation problem**. You must carefully choose which data to include, not just maximize volume. |
|
|
| --- |
|
|
| ## 6. Connection to domainTokenizer Research |
|
|
| ### Direct Mapping to Our Framework |
|
|
| Our Research Report Section | Nubank's Implementation |
|---------------------------|------------------------|
| §4.1 Semantic ID Tokenization | Not used; Nubank uses special tokens instead of RQ-VAE |
| §4.2 Action Sequence Tokenization (ActionPiece) | Partially analogous; the BPE-on-descriptions is similar, but there is no cross-field merging |
| §4.3 Financial Transaction Tokenization | **Exact match**: special tokens for amount/date + BPE for text |
| §4.4 Tabular Feature Tokenization (PLR) | **Exact match**: PLR embeddings for the 291 tabular features |
| §6.1 Quantization-Based (RQ-VAE) | Not used |
| §6.2 BPE-Inspired Merging | Only for text descriptions, not for structured fields |
| §6.3 Magnitude & Binning | **Exact match**: amount quantized to 21 bins |
| §6.5 Serialization-Based | Explicitly rejected as too token-hungry |
|
|
| ### What Nubank Validates |
|
|
| 1. β
**Domain tokens work better than text tokens** β the special token vocabulary is the key innovation |
| 2. β
**Small models (24M-330M) are sufficient** β you don't need 7B+ parameter LLMs |
| 3. β
**Self-supervised pre-training transfers** β pre-trained transaction Transformer improves downstream tasks |
| 4. β
**Hybrid tokenization wins** β special tokens for structured data + BPE for text |
| 5. β
**GPT-style causal modeling works for event sequences** β not just BERT-style masking |
|
|
| ### What Nubank Didn't Do (Opportunities for domainTokenizer) |
|
|
| 1. β **No Semantic IDs (RQ-VAE):** Nubank tokenizes merchant descriptions via BPE but doesn't create learned codebook-based product/merchant IDs. This could be a significant improvement β merchants that always appear together could share semantic ID prefixes. |
|
|
| 2. β **No cross-field composite tokens (ActionPiece-style):** Each field is tokenized independently. A BPE-like merging of `{amount_bin + category + time_of_day}` into composite tokens could further compress the sequence and capture higher-order patterns. |
|
|
| 3. β **No continual learning (HOPE-style):** nuFormer is frozen after pre-training. The Nested Learning / HOPE paradigm could enable continuous adaptation to new spending patterns, new merchants, and seasonal shifts. |
|
|
| 4. β **No multi-resolution memory (CMS):** All tokens are treated equally in the attention window. A Continuum Memory System with different update frequencies could better handle the difference between recent transactions (high signal) and historical patterns (persistent knowledge). |
|
|
| ### Nubank's Recipe = Our Blueprint for Phase 2 |
|
|
| Nubank's exact pipeline maps to domainTokenizer's planned implementation: |
|
|
| ``` |
| domainTokenizer Phase 2 Implementation Plan |
| (directly following Nubank's validated recipe) |
| |
1. Schema Analysis → Identify field types
| [Nubank: amount(float), date(timestamp), description(text)] |
| |
| 2. Per-Field Tokenizer Construction |
   [Nubank: τ_sign(2), τ_amt(21), τ_month(12), τ_dow(7), τ_dom(31), τ_hour(24), BPE(text)]
| [Us: same pattern, extensible to any domain schema] |
| |
| 3. Pre-train GPT-style Causal Transformer (NoPE) |
| [Nubank: 24M-330M params, 2048 context, CLM objective] |
| [Us: configurable sizes, same objective] |
| |
| 4. Joint Fusion Fine-tuning |
| [Nubank: Transformer embeddings + DCNv2(PLR) on tabular features] |
| [Us: pluggable fusion with any tabular backbone] |
| ``` |
|
|
| --- |
|
|
| ## 7. The Playbook: How to Walk Nubank's Path |
|
|
| ### For Finance (Replicating Nubank) |
|
|
| **Step 1: Define your transaction schema** |
| ```python |
| schema = { |
| "amount": {"type": "numerical", "tokenizer": "sign_bucket", "sign_vocab": 2, "bucket_vocab": 21}, |
| "timestamp": {"type": "temporal", "tokenizer": "calendar", |
| "fields": ["month(12)", "dow(7)", "dom(31)", "hour(24)"]}, |
| "description": {"type": "text", "tokenizer": "bpe"}, |
| # Extensions beyond Nubank: |
| "merchant_category": {"type": "categorical", "tokenizer": "vocab", "vocab_size": 50}, |
| "channel": {"type": "categorical", "tokenizer": "vocab", "vocab_size": 10}, |
| } |
| ``` |
|
|
| **Step 2: Build tokenizer (97 special tokens + BPE)** |
```python
from transformers import AutoTokenizer

class TransactionTokenizer:
    def __init__(self, schema):
        self.special_tokens = build_special_vocab(schema)            # ~97-150 tokens
        self.bpe_tokenizer = AutoTokenizer.from_pretrained("...")    # for text fields

    def tokenize_transaction(self, txn):
        tokens = []
        tokens.append(self.sign_token(txn.amount))                   # 1 token
        tokens.append(self.amount_bucket(txn.amount))                # 1 token
        tokens.extend(self.calendar_tokens(txn.timestamp))           # 4 tokens
        tokens.extend(self.bpe_tokenizer.tokenize(txn.description))  # ~8 tokens avg
        return tokens                                                # ~14 tokens total
```
|
|
| **Step 3: Pre-train (24M params, CLM)** |
| ```python |
| model = GPTCausalLM( |
| vocab_size=len(special_tokens) + bpe_vocab_size, |
| d_model=256, n_layers=24, n_heads=16, |
| max_seq_len=2048, |
| positional_encoding=None, # NoPE! |
| ) |
| # Pre-train on transaction sequences |
| train_clm(model, transaction_sequences, epochs=...) |
| ``` |
|
|
| **Step 4: Joint Fusion Fine-tuning** |
| ```python |
| class NuFormer(nn.Module): |
    def __init__(self, txn_transformer, tabular_features):
        super().__init__()
        self.txn_branch = txn_transformer   # pre-trained, unfrozen
| self.tab_branch = DCNv2( |
| input_dim=len(tabular_features), |
| num_embeddings=PLREmbed(n_frequencies=64), |
| cross_layers=3, deep_layers=3, |
| ) |
| self.head = MLP(txn_dim + tab_dim, hidden, 1) |
| |
| def forward(self, txn_tokens, tabular_features): |
| txn_embed = self.txn_branch(txn_tokens)[:, -1, :] # last token embedding |
| tab_embed = self.tab_branch(tabular_features) |
| combined = torch.cat([txn_embed, tab_embed], dim=-1) |
| return self.head(combined) |
| ``` |
|
|
| ### For E-Commerce (Adapting Nubank's Recipe) |
|
|
**The adaptation is straightforward: replace transaction fields with e-commerce event fields:**
|
|
| | Finance (Nubank) | E-Commerce (Adaptation) | |
| |------------------|----------------------| |
| amount (float) | price (float) → same τ_amt tokenizer |
| amount sign (credit/debit) | event_type (view/cart/purchase/return) → expand to 4+ tokens |
| timestamp (month/dow/dom/hour) | timestamp → same calendar tokens |
| description (merchant text) | product_title (BPE) → same approach |
| (no equivalent) | category (hierarchical) → add special tokens |
| (no equivalent) | brand → add special tokens or BPE |
| (no equivalent) | quantity → small fixed vocab (1-10+) |
| |
| **E-commerce special token vocabulary:** |
| ```python |
| e_commerce_special_tokens = { |
| "event_type": 5, # view, cart, purchase, return, wishlist |
| "price_bucket": 21, # same binning as Nubank |
| "quantity": 11, # 1-10, 10+ |
| "category_l1": 30, # top-level categories |
| "category_l2": 200, # subcategories |
| "month": 12, |
| "dow": 7, |
| "dom": 31, |
| "hour": 24, |
| } |
| # Total: ~341 special tokens + BPE for product titles |
# ~16 tokens per event → a 2048-token context holds ~128 events
| ``` |
| **Pre-training objectives (same as Nubank):** |
| - Causal LM: predict next token in the event sequence |
| - Downstream: next purchase prediction, churn, product recommendation, customer segmentation |
|
|
| ### For Healthcare (Same Pattern) |
|
|
| ```python |
| healthcare_special_tokens = { |
| "event_type": 10, # diagnosis, procedure, lab, medication, visit, ... |
| "icd_category": 50, # top-level ICD-10 groups |
| "cpt_category": 40, # procedure categories |
| "cost_bucket": 21, # same binning |
| "provider_type": 15, # PCP, specialist, ER, ... |
| "month": 12, "dow": 7, "dom": 31, |
| } |
| # Description: BPE on clinical notes/medication names |
| ``` |
|
|
| --- |
|
|
| ## 8. Complete Reference List |
|
|
| ### Nubank Sources |
|
|
| | Ref | Authors | Title | Link | |
| |-----|---------|-------|------| |
| | **Primary** | Braithwaite et al. | Your spending needs attention: Modeling financial habits with transformers | [arXiv: 2507.23267](https://arxiv.org/abs/2507.23267) | |
| | Blog 1 | β | Unlocking financial insights: How Nubank powers personalized experiences | [building.nubank.com](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/) | |
| | Blog 2 | Braithwaite & Udagawa | Defining an interface between transaction data and foundation models | Building Nubank, 2025a | |
| | Blog 3 | Braithwaite, Cavalcanti & Udagawa | Fine-tuning transaction user models | Building Nubank, 2025b | |
| | Blog 4 | Braithwaite & Udagawa | Understanding our customers' finances through foundation models | Building Nubank, 2025c | |
| | Blog 5 | Foust | Optimizing user narratives for foundation models | Building Nubank, 2025 | |
| | Blog 6 | Udagawa | Building foundation models into Nubank's AI platform | Building Nubank, 2025 | |
|
|
| ### Academic References (Used by nuFormer) |
|
|
| | Paper | Authors | Year | ArXiv | Role in nuFormer | |
| |-------|---------|------|-------|-----------------| |
| | **RecFormer** | Li et al. | 2023 | [2305.13731](https://arxiv.org/abs/2305.13731) | Tokenization philosophy: items as key-value text | |
| | **PLR Embeddings** | Gorishniy et al. | 2022 | [2203.05556](https://arxiv.org/abs/2203.05556) | Numerical feature β periodic embeddings | |
| | **DCN V2** | Wang et al. | 2021 | [2008.13535](https://arxiv.org/abs/2008.13535) | Tabular feature cross-interaction backbone | |
| | **NoPE** | Kazemnejad et al. | 2023 | [2305.19466](https://arxiv.org/abs/2305.19466) | No positional encoding for length generalization | |
| | **FlashAttention** | Dao et al. | 2022 | [2205.14135](https://arxiv.org/abs/2205.14135) | Efficient attention computation | |
| | **Banking TF** | Delestre & Sola | 2024 | [2410.08243](https://arxiv.org/abs/2410.08243) | Parallel work: French bank transaction tokenizer | |
|
|
| ### Related Papers from domainTokenizer Research |
|
|
| | Paper | Year | ArXiv | Connection | |
| |-------|------|-------|-----------| |
| | **TIGER** | 2023 | [2305.05065](https://arxiv.org/abs/2305.05065) | Alternative: RQ-VAE Semantic IDs (Nubank didn't use) | |
| | **ActionPiece** | 2025 | [2502.13581](https://arxiv.org/abs/2502.13581) | Alternative: BPE-like merging of action features (Nubank didn't use) | |
| | **Nested Learning (HOPE)** | 2025 | [2512.24695](https://arxiv.org/abs/2512.24695) | Future: continual learning for domain models | |
|
|
| --- |
|
|
| *This analysis reconstructs Nubank's full pipeline from public sources. The actual production system may have additional proprietary components not disclosed in the blog series or arXiv paper.* |