# Reverse-Engineering Nubank's nuFormer: A Transaction Foundation Model
> **How Nubank built a domain tokenizer for 100M+ customers and O(100 billion) transactions, and how to replicate this for finance, e-commerce, and other domains.**
>
> *Analysis based on: arXiv:2507.23267 ("Your Spending Needs Attention"), the Building Nubank blog series, and all referenced academic papers.*
---
## Table of Contents
1. [Why This Matters for domainTokenizer](#1-why-this-matters-for-domaintokenizer)
2. [The Nubank Blog Series: Complete Inventory](#2-the-nubank-blog-series-complete-inventory)
3. [The nuFormer Architecture: Full Reconstruction](#3-the-nuformer-architecture-full-reconstruction)
- 3.1 [Step 1: The Domain Tokenizer (Transactions → Tokens)](#31-step-1-the-domain-tokenizer--transactions--tokens)
- 3.2 [Step 2: The Transaction Transformer (Pre-training)](#32-step-2-the-transaction-transformer--pre-training)
- 3.3 [Step 3: Joint Fusion (Combining Sequences + Tabular Features)](#33-step-3-joint-fusion--combining-sequences--tabular-features)
4. [The Four Academic Pillars](#4-the-four-academic-pillars)
- 4.1 [RecFormer: Items as Sentences, Not IDs](#41-recformer-items-as-sentences-not-ids)
- 4.2 [PLR Embeddings: Making Numbers First-Class Citizens](#42-plr-embeddings-making-numbers-first-class-citizens)
- 4.3 [DCN V2: Explicit Feature Crossing](#43-dcn-v2-explicit-feature-crossing)
- 4.4 [NoPE: No Positional Encoding Needed](#44-nope-no-positional-encoding-needed)
5. [Results & Scaling Laws](#5-results--scaling-laws)
6. [Connection to domainTokenizer Research](#6-connection-to-domaintokenizer-research)
7. [The Playbook: How to Walk Nubank's Path](#7-the-playbook-how-to-walk-nubanks-path)
8. [Complete Reference List](#8-complete-reference-list)
---
## 1. Why This Matters for domainTokenizer
Nubank didn't just build a model; they built **exactly what domainTokenizer envisions**: a domain-specific tokenizer that converts financial transactions into tokens, trains a small Transformer on those tokens, and uses it as a foundation model for downstream business tasks.
**The connection is direct:**
| domainTokenizer Concept | Nubank's Implementation |
|------------------------|------------------------|
| Domain tokens (not words) | Special tokens for amount, date, sign + BPE for descriptions |
| Small models that understand domain data | 24M and 330M parameter Transformers |
| Pre-training on domain sequences | Next-token prediction on transaction sequences |
| Fine-tuning for business tasks | Product recommendation (binary: will user activate?) |
| Beating traditional ML baselines | +1.25% relative AUC over LightGBM = 3× launch threshold |
Nubank **validated** the domainTokenizer thesis at production scale (100M+ users, 100B+ transactions) and published both the recipe and results. This is our blueprint.
---
## 2. The Nubank Blog Series: Complete Inventory
Nubank published a comprehensive blog series on Building Nubank documenting their foundation model journey:
| # | Title | Focus | Link / Reference |
|---|-------|-------|-----|
| 1 | **Unlocking financial insights: How Nubank powers personalized experiences with foundation models** | Overview & motivation | [building.nubank.com/unlocking-financial-insights...](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/) |
| 2 | **Defining an interface between transaction data and foundation models** | The tokenizer design | [Braithwaite & Udagawa, 2025a] |
| 3 | **Fine-tuning transaction user models** | nuFormer fine-tuning recipe | [Braithwaite, Cavalcanti & Udagawa, 2025b] |
| 4 | **Understanding our customers' finances through foundation models** | Application layer & results | [Braithwaite & Udagawa, 2025c] |
| 5 | **Optimizing user narratives for foundation models** | Context window optimization | [Foust, 2025] |
| 6 | **Building foundation models into Nubank's AI platform** | MLOps & infrastructure | [Udagawa, 2025] |
**The arXiv paper** consolidating all technical details:
- **"Your spending needs attention: Modeling financial habits with transformers"** β€” [arXiv: 2507.23267](https://arxiv.org/abs/2507.23267) (Braithwaite et al., July 2025)
---
## 3. The nuFormer Architecture: Full Reconstruction
### 3.1 Step 1: The Domain Tokenizer (Transactions → Tokens)
This is the **core innovation** and the part most relevant to domainTokenizer. Nubank's tokenizer converts raw financial transactions into discrete token sequences.
#### Raw Transaction Data
Each transaction has three raw fields:
```
{
"amount": 79.99, // float (positive or negative)
"date": "2025-03-15T14:23:00", // timestamp
"description": "AMAZON MARKETPLACE" // free text
}
```
#### The Tokenization Decision
Nubank explicitly considered and **rejected** two extremes:
1. ❌ **Pure text serialization** (JSON stringification → BPE): Too many tokens per transaction. A JSON string like `{"amount": 79.99, "date": "2025-03-15", "desc": "AMAZON MARKETPLACE"}` would consume ~30-50 BPE tokens per transaction, leaving only ~40-60 transactions in a 2048-token context window.
2. ❌ **Pure numerical encoding** (all fields as embeddings, no text): Loses the rich information in transaction descriptions (merchant names, payment categories, etc.)
3. ✅ **Hybrid: Special tokens for structured fields + BPE for text**: Best of both worlds.
#### The Special Token Vocabulary
Each structured field gets its own small, fixed vocabulary of **special tokens**:
| Field | Tokenizer Function | Vocabulary Size | Example |
|-------|-------------------|-----------------|---------|
| **Amount Sign** | `ϕ_sign : ℝ → V_sign` | **2 tokens** | `[CREDIT]` or `[DEBIT]` |
| **Amount Bucket** | `ϕ_amt : ℝ → V_amt` (quantized bins) | **21 tokens** | `[AMT_BIN_14]` (e.g., $50-$100 range) |
| **Month** | `ϕ_month : date → V_month` | **12 tokens** | `[MARCH]` |
| **Day of Week** | `ϕ_dow : date → V_dow` | **7 tokens** | `[WEDNESDAY]` |
| **Day of Month** | `ϕ_dom : date → V_dom` | **31 tokens** | `[DAY_15]` |
| **Hour** | `ϕ_hour : date → V_hour` | **24 tokens** | `[HOUR_14]` |
**Total special tokens:** 2 + 21 + 12 + 7 + 31 + 24 = **97 special tokens**
The text description field uses standard **BPE tokenization**, producing a variable number of subword tokens.
#### Combined Vocabulary
```
V = V_special (97 tokens) ∪ V_BPE (standard BPE vocabulary)
```
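A minimal sketch of assembling such a combined vocabulary with the Hugging Face `transformers` API (the `gpt2` checkpoint and the exact special-token naming are illustrative assumptions, not Nubank's actual choices):

```python
import calendar
from transformers import AutoTokenizer

# Assumption: any BPE tokenizer can serve as V_BPE; "gpt2" is just an example checkpoint.
bpe = AutoTokenizer.from_pretrained("gpt2")

special_tokens = (
    ["[CREDIT]", "[DEBIT]"]                                          # amount sign (2)
    + [f"[AMT_BIN_{i}]" for i in range(21)]                          # amount buckets (21)
    + [f"[{calendar.month_name[m].upper()}]" for m in range(1, 13)]  # month (12)
    + [f"[{calendar.day_name[d].upper()}]" for d in range(7)]        # day of week (7)
    + [f"[DAY_{d}]" for d in range(1, 32)]                           # day of month (31)
    + [f"[HOUR_{h}]" for h in range(24)]                             # hour (24)
)
assert len(special_tokens) == 97

# V = V_special ∪ V_BPE: append the 97 domain tokens to the BPE vocabulary.
bpe.add_tokens(special_tokens, special_tokens=True)
print(len(bpe))  # standard BPE vocab size + 97
```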
#### Token Sequence Layout Per Transaction
```
Transaction t_i = [
AMT_SIGN_TOKEN, # 1 token: CREDIT or DEBIT
AMT_BUCKET_TOKEN, # 1 token: one of 21 quantized bins
    MONTH_TOKEN,        # 1 token: Jan–Dec
    DOW_TOKEN,          # 1 token: Mon–Sun
    DOM_TOKEN,          # 1 token: 1–31
    HOUR_TOKEN,         # 1 token: 0–23
desc_tok_1, # variable: BPE tokens for "AMAZON"
desc_tok_2, # "MARKET"
desc_tok_3, # "PLACE"
...
]
```
**Average: ~14 tokens per transaction.**
This means a **2048-token context window holds approximately 146 transactions**, enough to capture several months of financial behavior for a typical consumer.
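As a concrete illustration, the sample transaction from above (79.99 at "AMAZON MARKETPLACE" on 2025-03-15 14:23) would flatten to something like the following; the debit/credit assignment, bucket index, and BPE split are illustrative guesses:

```python
# Illustrative flattening of the example transaction (sign, bucket index, and BPE split assumed):
tokens = ["[DEBIT]", "[AMT_BIN_14]", "[MARCH]", "[SATURDAY]", "[DAY_15]", "[HOUR_14]",
          "AMAZON", "MARKET", "PLACE"]
# 6 special tokens + 3 description tokens = 9; longer descriptions push the average to ~14.
```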
#### User Sequence Construction
For each user, transactions are ordered chronologically:
```
user_sequence = [t_1, t_2, t_3, ..., t_N]
```
Where N varies per user (truncated to fit context window, taking the most recent transactions).
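A minimal sketch of that construction, assuming each transaction record exposes a timestamp, a `tokenize_transaction` helper like the one in the playbook below, and the paper's 2048-token budget:

```python
def build_user_sequence(transactions, tokenize_transaction, max_tokens=2048):
    """Chronologically ordered, truncated to the most recent transactions that fit."""
    transactions = sorted(transactions, key=lambda t: t.timestamp)
    tokens = []
    # Walk backwards from the newest transaction so the most recent history survives truncation.
    for txn in reversed(transactions):
        txn_tokens = tokenize_transaction(txn)
        if len(tokens) + len(txn_tokens) > max_tokens:
            break
        tokens = txn_tokens + tokens  # prepend to preserve chronological order
    return tokens
```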
#### Why This Design Wins
| Metric | Pure Text | Pure Embedding | Nubank Hybrid |
|--------|-----------|----------------|---------------|
| Tokens per transaction | ~35-50 | 1 (but fixed-dim) | **~14** |
| Transactions in 2048 context | ~40-60 | 2048 | **~146** |
| Captures description text | ✅ | ❌ | ✅ |
| Captures numerical structure | ❌ (fragmented) | ✅ | ✅ |
| Captures temporal patterns | ❌ | Partial | ✅ |
| Works with standard Transformer | ✅ | Needs custom arch | ✅ |
### 3.2 Step 2: The Transaction Transformer (Pre-training)
#### Architecture Choice: GPT-style Causal Decoder
Nubank chose a **decoder-only, GPT-style causal Transformer**, not BERT-style bidirectional. Reasons:
1. **Industry precedent:** State-of-the-art sequential recommendation systems (Pinterest PinnerFormer, Meta NxtPost) use causal architectures
2. **No autoregressive generation needed:** At inference, the model produces a single user embedding from the full sequence; no token-by-token generation is required
3. **Better for long-range dependencies:** Causal attention naturally models temporal ordering
#### No Positional Encoding (NoPE)
Based on Kazemnejad et al. (2023), nuFormer uses **no explicit positional encoding**. The finding: NoPE outperforms RoPE, ALiBi, and learned absolute position embeddings on length generalization. Since users have varying transaction history lengths, length generalization is critical.
#### Model Sizes
| Variant | Parameters | Hidden Dim | Layers | Heads | Context |
|---------|-----------|------------|--------|-------|---------|
| **nuFormer-Small** | **24M** | 256 | 24 | 16 | 2048 |
| **nuFormer-Large** | **330M** | 1024 | 24 | 16 | 2048 |
Both share the same depth (24 layers) and head count (16); they differ only in hidden dimension.
#### Pre-training Objective
**Causal Language Modeling (CLM):** Standard next-token prediction on the flattened transaction token sequences.
Given a user's transaction sequence tokenized as `[w_1, w_2, ..., w_T]`, the loss is:
```
L = -Σ_{t=1}^{T} log P(w_t | w_1, ..., w_{t-1})
```
This is the same objective as GPT, but instead of predicting the next word in a sentence, the model predicts the next token in a transaction sequence. This could be the next amount bucket, the next merchant name token, or the next month token.
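In PyTorch terms this is the usual shifted cross-entropy for causal language models; a generic sketch, not Nubank's training code:

```python
import torch.nn.functional as F

def clm_loss(logits, token_ids):
    # logits: (batch, T, vocab) from the causal Transformer
    # token_ids: (batch, T) flattened transaction token sequence
    # Predict token t from tokens < t: shift targets left by one position.
    shift_logits = logits[:, :-1, :]
    shift_targets = token_ids[:, 1:]
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )
```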
#### Pre-training Data
- **20M user rows** for baseline experiments
- Up to **203M labeled rows** for fine-tuning experiments
- Data spans credit card, debit card, open finance, wires, transfers, and bill items
- **O(100 billion) total transactions** across Nubank's 100M+ member base
### 3.3 Step 3: Joint Fusion (Combining Sequences + Tabular Features)
Nubank explored three fusion strategies for combining the transaction transformer with traditional tabular features:
#### Strategy A: Early Fusion (Extract β†’ Downstream)
```
Transaction Sequence → Pre-trained Transformer → User Embedding (frozen)
                                                        ↓
                                 Feed into LightGBM with other features
```
Fastest to iterate but loses end-to-end gradients.
#### Strategy B: Late Fusion (Concatenate β†’ Joint Head)
```
Transaction Sequence → Transformer → User Embedding ──────────┐
                                                               ├─→ MLP Head → Prediction
Tabular Features (291) → Simple Embedding ────────────────────┘
```
Better than early fusion but the tabular branch is underparameterized.
#### Strategy C: Joint Fusion = nuFormer (Best)
```
Transaction Sequence → Transformer → User Embedding ─────────────────┐
                                                                      ├─→ Shared MLP → Prediction
Tabular Features (291) → PLR Embeddings → DCNv2 → Feature Embedding ─┘
```
**This is the production architecture.** The key insight: the tabular branch needs its own powerful backbone (DCNv2) to match the expressiveness of the transformer branch. Joint end-to-end training allows both branches to co-adapt.
#### The Tabular Branch: DCNv2 + PLR
**291 hand-crafted features** (numerical + categorical), processed as follows:
1. **Numerical features:** Transformed via PLR (Periodic → Linear → ReLU) embeddings:
```
PLR(x) = ReLU(Linear([sin(2πw₁x + b₁), cos(2πw₁x + b₁), ..., sin(2πwₙx + bₙ), cos(2πwₙx + bₙ)]))
```
Where frequencies `w` and phases `b` are **learned parameters**. This maps scalars to high-dimensional dense vectors that capture both magnitude and periodicity.
2. **Categorical features:** Standard embedding lookup tables.
3. **Feature interaction:** DCN V2 (Deep Cross Network V2) models explicit feature interactions:
```
x_{l+1} = x₀ ⊙ (W_l · x_l + b_l) + x_l
```
Full-rank weight matrices enable capturing all pairwise and higher-order feature interactions.
4. **Regularization:** L2 regularization on DCNv2 cross-layer weights to prevent overfitting.
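Putting the four steps together, the tabular branch can be sketched roughly as follows. The module names `PLREmbedding` and `CrossNetV2` and all dimensions are placeholders; only the wiring follows the description above, with the L2 term typically handled as weight decay on the cross-layer parameters:

```python
import torch
import torch.nn as nn

class TabularBranch(nn.Module):
    def __init__(self, num_numerical, cat_cardinalities, d_embed=64, n_cross=3):
        super().__init__()
        # Step 1: one PLR embedding per numerical feature (see Section 4.2).
        self.num_embed = nn.ModuleList([PLREmbedding(d_embed) for _ in range(num_numerical)])
        # Step 2: lookup tables for categorical features.
        self.cat_embed = nn.ModuleList([nn.Embedding(card, d_embed) for card in cat_cardinalities])
        d_in = (num_numerical + len(cat_cardinalities)) * d_embed
        # Step 3: DCNv2 cross layers for explicit feature interactions (see Section 4.3).
        self.cross = CrossNetV2(d_in, n_layers=n_cross)

    def forward(self, x_num, x_cat):
        feats = [emb(x_num[:, i]) for i, emb in enumerate(self.num_embed)]
        feats += [emb(x_cat[:, j]) for j, emb in enumerate(self.cat_embed)]
        x0 = torch.cat(feats, dim=-1)
        return self.cross(x0)  # feature embedding fed to the shared MLP head
```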
---
## 4. The Four Academic Pillars
Nubank's architecture stands on four papers. Understanding them is essential for replication.
### 4.1 RecFormer: Items as Sentences, Not IDs
**Paper:** "Text Is All You Need: Learning Language Representations for Sequential Recommendation"
**Authors:** Li et al. (UCSD + Amazon) | **KDD 2023** | [arXiv: 2305.13731](https://arxiv.org/abs/2305.13731) | [GitHub 130⭐](https://github.com/aaronheee/recformer)
**Core idea:** Abolish item IDs entirely. Represent each item as a key-value attribute dictionary flattened into text:
```
Item: {Color: Black, Brand: Nike, Category: Shoes}
→ Tokens: ["Color", "Black", "Brand", "Nike", "Category", "Shoes"]
```
A user's interaction sequence becomes a sequence of these "item sentences."
**Four-embedding architecture:**
```
E_token = LayerNorm(A_token + B_position + C_type + D_item_position)
```
- A = token embedding (shared vocabulary)
- B = token position in full sequence
- C = token type (key vs. value vs. special)
- D = item position (which item in the user sequence)
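A compact sketch of that four-way sum (the dimensions and the three-way type vocabulary are assumptions; see the RecFormer paper for the exact configuration):

```python
import torch.nn as nn

class RecFormerEmbedding(nn.Module):
    def __init__(self, vocab_size, max_seq_len, max_items, d_model=256):
        super().__init__()
        self.token = nn.Embedding(vocab_size, d_model)          # A: shared subword vocabulary
        self.position = nn.Embedding(max_seq_len, d_model)      # B: token position in full sequence
        self.token_type = nn.Embedding(3, d_model)              # C: key / value / special
        self.item_position = nn.Embedding(max_items, d_model)   # D: which item in the user history
        self.norm = nn.LayerNorm(d_model)

    def forward(self, token_ids, positions, type_ids, item_positions):
        return self.norm(
            self.token(token_ids) + self.position(positions)
            + self.token_type(type_ids) + self.item_position(item_positions)
        )
```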
**What Nubank took:** The key-value flattening philosophy, but modified it with special tokens for structured fields (amount, date) to reduce tokens per transaction from ~35 to ~14.
### 4.2 PLR Embeddings: Making Numbers First-Class Citizens
**Paper:** "On Embeddings for Numerical Features in Tabular Deep Learning"
**Authors:** Gorishniy et al. (Yandex) | **NeurIPS 2022** | [arXiv: 2203.05556](https://arxiv.org/abs/2203.05556) | [GitHub](https://github.com/yandex-research/tabular-dl-num-embeddings)
**Core idea:** Raw scalar features fed into MLPs/Transformers are poorly optimized. **Lifting scalars into high-dimensional periodic embeddings** dramatically improves performance.
**PLR (Periodic → Linear → ReLU):**
```python
import math
import torch
import torch.nn.functional as F

def plr_embedding(x, frequencies, phases, linear):
    # x: scalar feature value (tensor)
    # frequencies, phases: LEARNED parameters (tensors of shape [n_frequencies])
    # linear: a learned nn.Linear projection applied after the periodic expansion
    periodic = torch.cat([
        torch.sin(2 * math.pi * frequencies * x + phases),
        torch.cos(2 * math.pi * frequencies * x + phases)
    ])
    return F.relu(linear(periodic))
```
**Key result:** With PLR embeddings, a plain MLP can match attention-based Transformers on tabular benchmarks. PLR is what lets DCNv2 beat LightGBM.
**What Nubank took:** PLR embeddings for all 291 numerical tabular features in the joint fusion branch. This was the critical ingredient:
| Model | Relative AUC vs. LightGBM |
|-------|--------------------------|
| DCNv2 (without PLR) | -0.09% |
| DCNv2 + PLR | **+0.06%** ← first to beat GBDT |
| DCNv2 + PLR + L2 | +0.08% |
| **nuFormer (full)** | **+0.31% to +0.52%** |
### 4.3 DCN V2: Explicit Feature Crossing
**Paper:** "DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-Scale Learning to Rank Systems"
**Authors:** Wang et al. (Google) | **WebConf 2021** | [arXiv: 2008.13535](https://arxiv.org/abs/2008.13535) | **Production at Google**
**Core idea:** Explicitly model feature interactions (crosses) via specialized cross layers with full-rank weight matrices:
```
x_{l+1} = x₀ ⊙ (W_l · x_l + b_l) + x_l    # element-wise product with input anchor
```
This captures feature interactions of degree L+1 for an L-layer cross network. DCNv2 improves on DCN (2017) by using full-rank matrices instead of rank-1.
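A single cross layer from this recurrence, as a minimal PyTorch sketch (full-rank `W_l` plus bias, with the element-wise product anchored on the layer-0 input):

```python
import torch.nn as nn

class CrossLayerV2(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.linear = nn.Linear(d, d)  # full-rank W_l and bias b_l

    def forward(self, x0, xl):
        # x_{l+1} = x0 ⊙ (W_l · x_l + b_l) + x_l
        return x0 * self.linear(xl) + xl
```

Stacking three such layers, as in the playbook configuration below (`cross_layers=3`), captures interactions up to degree 4.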
**What Nubank took:** DCNv2 as the backbone for the tabular feature branch (291 features). Combined with PLR embeddings, it forms the "tabular half" of the joint fusion nuFormer architecture.
### 4.4 NoPE: No Positional Encoding Needed
**Paper:** "The Impact of Positional Encoding on Length Generalization in Transformers"
**Authors:** Kazemnejad et al. (McGill/Mila) | **NeurIPS 2023** | [arXiv: 2305.19466](https://arxiv.org/abs/2305.19466) | [HF Paper](https://huggingface.co/papers/2305.19466)
**Core finding:** Decoder-only Transformers with **no positional encoding** (NoPE) outperform those with RoPE, ALiBi, and absolute position embeddings on length generalization tasks.
**Why it works (theoretically):**
- **Theorem 1:** The first layer of a NoPE causal Transformer can recover absolute positions from causal attention patterns alone
- **Theorem 2:** Subsequent layers can implement relative PE via learned query-key interactions
- **Empirically:** NoPE's learned attention patterns converge to T5's relative PE; it gets relative PE "for free"
**What Nubank took:** No positional encoding in the transaction Transformer. Since users have vastly different transaction history lengths (some have 20 transactions, some have 2000+), length generalization is critical for production deployment.
---
## 5. Results & Scaling Laws
### Production Results
| Model | Relative AUC vs. LightGBM |
|-------|--------------------------|
| MLP (raw features) | -0.44% |
| DCNv2 | -0.09% |
| MLP + PLR | -0.23% |
| LightGBM (baseline) | 0.00% |
| DCNv2 + PLR | +0.06% |
| DCNv2 + PLR + L2 | +0.08% |
| **nuFormer-Small (24M, Joint Fusion)** | **+0.31%** |
| **nuFormer-Large (330M, Joint Fusion)** | **+0.52%** |
**Final production deployment: +1.25% relative AUC improvement**, cited as **3× the typical model launch threshold** at Nubank. This is a massive result for a production recommendation system.
### Scaling Laws
Nubank observed clear scaling laws across three dimensions:
**Model size scaling:**
| Model | Parameters | AUC Improvement |
|-------|-----------|-----------------|
| nuFormer-Small | 24M | +0.31% |
| nuFormer-Large | 330M | +0.52% |
**Context length scaling:**
| Context | Transactions Covered | Effect |
|---------|---------------------|--------|
| 512 tokens | ~36 transactions | Baseline |
| 1024 tokens | ~73 transactions | Better |
| 2048 tokens | ~146 transactions | **Best** (monotonic improvement) |
Larger models benefit more from longer context: the 330M model extracts more value from additional transaction history than the 24M model.
**Fine-tuning data scaling:**
| Training Rows | Effect |
|--------------|--------|
| 5M | Baseline |
| 20M | Better |
| 40M | Better still |
| 100M | Best |
Again, larger models show steeper improvement with more data.
### Data Source Ablation (Critical Insight)
Nubank tested three anonymized data sources (A, B, C; likely credit card, debit, and open finance):
| Sources | AUC vs. ABC Baseline |
|---------|---------------------|
| A alone | +0.72 |
| B alone | -8.21 |
| C alone | -20.52 |
| **AB** | **+0.91 (best!)** |
| BC | -12.24 |
| AC | -0.27 |
| ABC (all) | 0.00 (baseline) |
**Key insight:** More data sources can **hurt** performance. Sources B and C have lower information density; when they crowd out high-signal transactions (source A) in the fixed 2048-token context window, overall performance drops. **AB outperforms ABC**, meaning the debit/open-finance data was actually diluting the credit card signal.
**Implication for domainTokenizer:** Context window is a **resource allocation problem**. You must carefully choose which data to include, not just maximize volume.
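One deliberately simple way to act on this finding is to filter sources before building the user sequence. The snippet below is only an illustration of the framing, not Nubank's method (they chose the AB combination empirically via the ablation above, and the `source` attribute is an assumed field):

```python
PREFERRED_SOURCES = {"A", "B"}  # the AB combination won the ablation above

def select_transactions(transactions):
    # Keep only high-signal sources so they are not crowded out of the 2048-token window.
    return [t for t in transactions if t.source in PREFERRED_SOURCES]
```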
---
## 6. Connection to domainTokenizer Research
### Direct Mapping to Our Framework
| Our Research Report Section | Nubank's Implementation |
|---------------------------|------------------------|
| §4.1 Semantic ID Tokenization | Not used; Nubank uses special tokens instead of RQ-VAE |
| §4.2 Action Sequence Tokenization (ActionPiece) | Partially analogous; the BPE-on-descriptions is similar, but no cross-field merging |
| §4.3 Financial Transaction Tokenization | **Exact match**: special tokens for amount/date + BPE for text |
| §4.4 Tabular Feature Tokenization (PLR) | **Exact match**: PLR embeddings for the 291 tabular features |
| §6.1 Quantization-Based (RQ-VAE) | Not used |
| §6.2 BPE-Inspired Merging | Only for text descriptions, not for structured fields |
| §6.3 Magnitude & Binning | **Exact match**: amount quantized to 21 bins |
| §6.5 Serialization-Based | Explicitly rejected as too token-hungry |
### What Nubank Validates
1. ✅ **Domain tokens work better than text tokens**: the special token vocabulary is the key innovation
2. ✅ **Small models (24M-330M) are sufficient**: you don't need 7B+ parameter LLMs
3. ✅ **Self-supervised pre-training transfers**: a pre-trained transaction Transformer improves downstream tasks
4. ✅ **Hybrid tokenization wins**: special tokens for structured data + BPE for text
5. ✅ **GPT-style causal modeling works for event sequences**: not just BERT-style masking
### What Nubank Didn't Do (Opportunities for domainTokenizer)
1. ❌ **No Semantic IDs (RQ-VAE):** Nubank tokenizes merchant descriptions via BPE but doesn't create learned codebook-based product/merchant IDs. This could be a significant improvement: merchants that always appear together could share semantic ID prefixes.
2. ❌ **No cross-field composite tokens (ActionPiece-style):** Each field is tokenized independently. A BPE-like merging of `{amount_bin + category + time_of_day}` into composite tokens could further compress the sequence and capture higher-order patterns.
3. ❌ **No continual learning (HOPE-style):** nuFormer is frozen after pre-training. The Nested Learning / HOPE paradigm could enable continuous adaptation to new spending patterns, new merchants, and seasonal shifts.
4. ❌ **No multi-resolution memory (CMS):** All tokens are treated equally in the attention window. A Continuum Memory System with different update frequencies could better handle the difference between recent transactions (high signal) and historical patterns (persistent knowledge).
### Nubank's Recipe = Our Blueprint for Phase 2
Nubank's exact pipeline maps to domainTokenizer's planned implementation:
```
domainTokenizer Phase 2 Implementation Plan
(directly following Nubank's validated recipe)
1. Schema Analysis → Identify field types
[Nubank: amount(float), date(timestamp), description(text)]
2. Per-Field Tokenizer Construction
   [Nubank: ϕ_sign(2), ϕ_amt(21), ϕ_month(12), ϕ_dow(7), ϕ_dom(31), ϕ_hour(24), BPE(text)]
[Us: same pattern, extensible to any domain schema]
3. Pre-train GPT-style Causal Transformer (NoPE)
[Nubank: 24M-330M params, 2048 context, CLM objective]
[Us: configurable sizes, same objective]
4. Joint Fusion Fine-tuning
[Nubank: Transformer embeddings + DCNv2(PLR) on tabular features]
[Us: pluggable fusion with any tabular backbone]
```
---
## 7. The Playbook: How to Walk Nubank's Path
### For Finance (Replicating Nubank)
**Step 1: Define your transaction schema**
```python
schema = {
"amount": {"type": "numerical", "tokenizer": "sign_bucket", "sign_vocab": 2, "bucket_vocab": 21},
"timestamp": {"type": "temporal", "tokenizer": "calendar",
"fields": ["month(12)", "dow(7)", "dom(31)", "hour(24)"]},
"description": {"type": "text", "tokenizer": "bpe"},
# Extensions beyond Nubank:
"merchant_category": {"type": "categorical", "tokenizer": "vocab", "vocab_size": 50},
"channel": {"type": "categorical", "tokenizer": "vocab", "vocab_size": 10},
}
```
**Step 2: Build tokenizer (97 special tokens + BPE)**
```python
from transformers import AutoTokenizer

class TransactionTokenizer:
    def __init__(self, schema):
        self.special_tokens = build_special_vocab(schema)           # ~97-150 tokens
        self.bpe_tokenizer = AutoTokenizer.from_pretrained("...")   # for text fields

    def tokenize_transaction(self, txn):
        tokens = []
        tokens.append(self.sign_token(txn.amount))                  # 1 token
        tokens.append(self.amount_bucket(txn.amount))               # 1 token
        tokens.extend(self.calendar_tokens(txn.timestamp))          # 4 tokens
        tokens.extend(self.bpe_tokenizer.tokenize(txn.description)) # ~8 tokens avg
        return tokens                                               # ~14 tokens total
```
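Hypothetical usage, assuming a simple transaction record whose fields mirror the schema above (the printed output is illustrative; the sign convention and bucket index are assumptions):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Txn:
    amount: float
    timestamp: datetime
    description: str

tokenizer = TransactionTokenizer(schema)
txn = Txn(amount=79.99, timestamp=datetime(2025, 3, 15, 14, 23), description="AMAZON MARKETPLACE")
print(tokenizer.tokenize_transaction(txn))
# e.g. ['[DEBIT]', '[AMT_BIN_14]', '[MARCH]', '[SATURDAY]', '[DAY_15]', '[HOUR_14]', 'AMAZON', ...]
```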
**Step 3: Pre-train (24M params, CLM)**
```python
model = GPTCausalLM(
vocab_size=len(special_tokens) + bpe_vocab_size,
d_model=256, n_layers=24, n_heads=16,
max_seq_len=2048,
positional_encoding=None, # NoPE!
)
# Pre-train on transaction sequences
train_clm(model, transaction_sequences, epochs=...)
```
**Step 4: Joint Fusion Fine-tuning**
```python
import torch
import torch.nn as nn

class NuFormer(nn.Module):
    def __init__(self, txn_transformer, tabular_features):
        super().__init__()
        self.txn_branch = txn_transformer  # pre-trained, unfrozen
        self.tab_branch = DCNv2(
            input_dim=len(tabular_features),
            num_embeddings=PLREmbed(n_frequencies=64),
            cross_layers=3, deep_layers=3,
        )
        self.head = MLP(txn_dim + tab_dim, hidden, 1)

    def forward(self, txn_tokens, tabular_features):
        txn_embed = self.txn_branch(txn_tokens)[:, -1, :]  # last token embedding
        tab_embed = self.tab_branch(tabular_features)
        combined = torch.cat([txn_embed, tab_embed], dim=-1)
        return self.head(combined)
```
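A minimal fine-tuning loop for the binary activation target could look like this (the optimizer settings, `pretrained_transformer`, `feature_list`, and `train_loader` are placeholders; the weight-decay term stands in for the L2 regularization on the cross layers mentioned above):

```python
import torch
import torch.nn.functional as F

model = NuFormer(txn_transformer=pretrained_transformer, tabular_features=feature_list)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)

for txn_tokens, tab_feats, label in train_loader:  # label: did the user activate the product?
    logits = model(txn_tokens, tab_feats).squeeze(-1)
    loss = F.binary_cross_entropy_with_logits(logits, label.float())
    optimizer.zero_grad()
    loss.backward()   # gradients flow into BOTH branches (joint, end-to-end training)
    optimizer.step()
```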
### For E-Commerce (Adapting Nubank's Recipe)
**The adaptation is straightforward; replace transaction fields with e-commerce event fields:**
| Finance (Nubank) | E-Commerce (Adaptation) |
|------------------|----------------------|
| amount (float) | price (float), same ϕ_amt tokenizer |
| amount sign (credit/debit) | event_type (view/cart/purchase/return), expand to 4+ tokens |
| timestamp (month/dow/dom/hour) | timestamp, same calendar tokens |
| description (merchant text) | product_title (BPE), same approach |
| – | category (hierarchical), add special tokens |
| – | brand, add special tokens or BPE |
| – | quantity, small fixed vocab (1-10+) |
**E-commerce special token vocabulary:**
```python
e_commerce_special_tokens = {
"event_type": 5, # view, cart, purchase, return, wishlist
"price_bucket": 21, # same binning as Nubank
"quantity": 11, # 1-10, 10+
"category_l1": 30, # top-level categories
"category_l2": 200, # subcategories
"month": 12,
"dow": 7,
"dom": 31,
"hour": 24,
}
# Total: ~341 special tokens + BPE for product titles
# ~16 tokens per event → 2048 context ≈ 128 events
```
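Feeding this vocabulary through the same schema-driven tokenizer from the finance playbook above is mostly a schema swap (a sketch; the field and tokenizer names mirror the hypothetical `schema` dict from Step 1):

```python
ecommerce_schema = {
    "event_type":    {"type": "categorical", "tokenizer": "vocab", "vocab_size": 5},
    "price":         {"type": "numerical",   "tokenizer": "bucket", "bucket_vocab": 21},
    "quantity":      {"type": "categorical", "tokenizer": "vocab", "vocab_size": 11},
    "category_l1":   {"type": "categorical", "tokenizer": "vocab", "vocab_size": 30},
    "category_l2":   {"type": "categorical", "tokenizer": "vocab", "vocab_size": 200},
    "timestamp":     {"type": "temporal",    "tokenizer": "calendar",
                      "fields": ["month(12)", "dow(7)", "dom(31)", "hour(24)"]},
    "product_title": {"type": "text",        "tokenizer": "bpe"},
}
event_tokenizer = TransactionTokenizer(ecommerce_schema)  # same class, different domain schema
```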
**Pre-training objectives (same as Nubank):**
- Causal LM: predict next token in the event sequence
- Downstream: next purchase prediction, churn, product recommendation, customer segmentation
### For Healthcare (Same Pattern)
```python
healthcare_special_tokens = {
"event_type": 10, # diagnosis, procedure, lab, medication, visit, ...
"icd_category": 50, # top-level ICD-10 groups
"cpt_category": 40, # procedure categories
"cost_bucket": 21, # same binning
"provider_type": 15, # PCP, specialist, ER, ...
"month": 12, "dow": 7, "dom": 31,
}
# Description: BPE on clinical notes/medication names
```
---
## 8. Complete Reference List
### Nubank Sources
| Ref | Authors | Title | Link |
|-----|---------|-------|------|
| **Primary** | Braithwaite et al. | Your spending needs attention: Modeling financial habits with transformers | [arXiv: 2507.23267](https://arxiv.org/abs/2507.23267) |
| Blog 1 | β€” | Unlocking financial insights: How Nubank powers personalized experiences | [building.nubank.com](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/) |
| Blog 2 | Braithwaite & Udagawa | Defining an interface between transaction data and foundation models | Building Nubank, 2025a |
| Blog 3 | Braithwaite, Cavalcanti & Udagawa | Fine-tuning transaction user models | Building Nubank, 2025b |
| Blog 4 | Braithwaite & Udagawa | Understanding our customers' finances through foundation models | Building Nubank, 2025c |
| Blog 5 | Foust | Optimizing user narratives for foundation models | Building Nubank, 2025 |
| Blog 6 | Udagawa | Building foundation models into Nubank's AI platform | Building Nubank, 2025 |
### Academic References (Used by nuFormer)
| Paper | Authors | Year | ArXiv | Role in nuFormer |
|-------|---------|------|-------|-----------------|
| **RecFormer** | Li et al. | 2023 | [2305.13731](https://arxiv.org/abs/2305.13731) | Tokenization philosophy: items as key-value text |
| **PLR Embeddings** | Gorishniy et al. | 2022 | [2203.05556](https://arxiv.org/abs/2203.05556) | Numerical feature β†’ periodic embeddings |
| **DCN V2** | Wang et al. | 2021 | [2008.13535](https://arxiv.org/abs/2008.13535) | Tabular feature cross-interaction backbone |
| **NoPE** | Kazemnejad et al. | 2023 | [2305.19466](https://arxiv.org/abs/2305.19466) | No positional encoding for length generalization |
| **FlashAttention** | Dao et al. | 2022 | [2205.14135](https://arxiv.org/abs/2205.14135) | Efficient attention computation |
| **Banking TF** | Delestre & Sola | 2024 | [2410.08243](https://arxiv.org/abs/2410.08243) | Parallel work: French bank transaction tokenizer |
### Related Papers from domainTokenizer Research
| Paper | Year | ArXiv | Connection |
|-------|------|-------|-----------|
| **TIGER** | 2023 | [2305.05065](https://arxiv.org/abs/2305.05065) | Alternative: RQ-VAE Semantic IDs (Nubank didn't use) |
| **ActionPiece** | 2025 | [2502.13581](https://arxiv.org/abs/2502.13581) | Alternative: BPE-like merging of action features (Nubank didn't use) |
| **Nested Learning (HOPE)** | 2025 | [2512.24695](https://arxiv.org/abs/2512.24695) | Future: continual learning for domain models |
---
*This analysis reconstructs Nubank's full pipeline from public sources. The actual production system may have additional proprietary components not disclosed in the blog series or arXiv paper.*