# πŸ”‘ domainTokenizer

**Building small models that understand domain tokens β€” not just words.**

---

## The Idea

LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.

But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day β€” purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.

**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** β€” products, transactions, medical codes, user actions β€” as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?

```
Text LLM:      "The cat sat on the mat" β†’ [The] [cat] [sat] [on] [the] [mat] β†’ Transformer β†’ next word

domainTokenizer: Customer purchase history β†’ [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] β†’ Transformer β†’ next purchase
```
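
To see how far a plain text tokenizer is from this, run an off-the-shelf subword tokenizer over a domain record. The snippet below uses the GPT-2 tokenizer from `transformers` purely as an illustration of the problem:

```python
# Illustration only: a general-purpose BPE tokenizer shatters domain fields
# into fragments that carry no notion of product, price bin, or calendar.
from transformers import AutoTokenizer  # pip install transformers

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("SKU-8847291 cost $79.99 on 2025-03-15"))
# -> a long list of subword pieces; the SKU, price, and date all splinter apart
```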

## Quick Start

```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# 1. Build tokenizer from schema (Nubank-style: 97 domain tokens + BPE)
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)                                          # fit magnitude bins on training data
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data (100% token utilization, zero padding waste)
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

# 3. Create model (GPT-style, NoPE, pre-norm β€” 24M params)
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)

# 4. Pre-train with HF Trainer (cosine schedule, CLM objective)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",       # auto push to HF Hub
    num_epochs=10, learning_rate=3e-4,
    bf16=True,                             # A100/H100
    report_to="trackio",                   # live monitoring
)

# 5. Fine-tune for downstream tasks (nuFormer-style joint fusion)
from domain_tokenizer import JointFusionModel
fusion = JointFusionModel(
    transformer_model=model,               # pre-trained, unfrozen
    n_tabular_features=291,                # hand-crafted tabular features
    n_classes=1,                           # binary: will user activate product?
)
# Train fusion model end-to-end on labeled data...
```
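
The quick start assumes event data is already loaded as `all_events`, `user_sequences`, and `descriptions`. The exact fields expected by `FINANCE_SCHEMA` are defined in `schemas/predefined.py`; the sketch below (with hypothetical field names) only shows the general shape of those inputs:

```python
# Hypothetical input shapes for the quick start above — field names are
# illustrative; see schemas/predefined.py for the actual FINANCE_SCHEMA fields.
all_events = [
    {"amount": -79.99, "timestamp": "2025-03-15T14:02:00", "description": "AMAZON MKTPLACE"},
    {"amount": 2500.00, "timestamp": "2025-03-14T09:00:00", "description": "PAYROLL DEPOSIT"},
    # ... one record per transaction, pooled across users, used to fit magnitude bins
]
user_sequences = [all_events]                           # per-user, time-ordered event lists
descriptions = [e["description"] for e in all_events]   # free-text corpus for the BPE sub-tokenizer
```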

## 🏦 Industry Validation: Nubank's nuFormer

This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:

- **Paper:** ["Your spending needs attention: Modeling financial habits with transformers"](https://arxiv.org/abs/2507.23267) (arXiv: 2507.23267)
- **Blog series:** [Building Nubank β€” Foundation Models](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/)

**Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions β†’ **~14 tokens per transaction** β†’ GPT-style Transformer (24M-330M params) β†’ **+1.25% relative AUC over LightGBM** (3Γ— their production launch threshold).
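
As a rough illustration of what "~14 tokens per transaction" means in practice (the token names below are invented for this sketch, not Nubank's or this library's actual vocabulary):

```python
# One card transaction, nuFormer-style: a handful of structured field tokens
# plus BPE pieces for the free-text merchant description.
structured_tokens = [
    "[DEBIT]",        # credit/debit sign
    "[AMT_BIN_37]",   # amount magnitude bucket (percentile bin)
    "[SATURDAY]",     # calendar: day of week
    "[AFTERNOON]",    # calendar: time-of-day bucket
    "[SAME_DAY]",     # calendar: gap since the previous transaction
]
# The merchant description ("AMAZON MKTPLACE", say) is then split by the BPE
# sub-tokenizer into several more pieces, giving ~14 tokens per transaction.
```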

πŸ“„ **Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)

## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
|---------|---------------|-----------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` β†’ composite token |
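
As a concrete sketch of the price row, percentile-based magnitude binning needs nothing more than quantile edges and a lookup (illustrative only; the library's `MagnitudeBucket` field tokenizer may expose a different API):

```python
import numpy as np

# Fit equal-frequency bin edges on training amounts, then map any amount to a token.
rng = np.random.default_rng(0)
train_amounts = rng.lognormal(mean=3.5, sigma=1.0, size=10_000)  # stand-in purchase amounts

n_bins = 50
edges = np.quantile(train_amounts, np.linspace(0, 1, n_bins + 1)[1:-1])

def amount_to_token(amount: float) -> str:
    # searchsorted returns the index of the quantile bucket the amount falls into
    return f"price_bin_{np.searchsorted(edges, amount)}"

print(amount_to_token(79.99))  # a bin index that tracks the amount's percentile in training data
```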

## Documentation

| Document | Description |
|----------|-------------|
| πŸ“„ [`docs/research_report.md`](docs/research_report.md) | **Research survey** β€” 31 papers across 5 paradigms, technical taxonomy, blueprint |
| 🏦 [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** β€” full pipeline reconstruction, 4 academic pillars |
| πŸ—οΈ [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record** β€” PyTorch+HF vs JAX/Keras, trade-offs, roadmap |
| πŸ“Š [`docs/phase2_implementation_report.md`](docs/phase2_implementation_report.md) | **Implementation report** β€” Phase 2A-2C technical decisions, architecture, 124 tests |

## Project Roadmap

### Phase 1: Research & Survey βœ…
- Literature survey (35+ papers), Nubank reverse-engineering, framework ADR

### Phase 2: Core Library βœ… (v0.3.0 β€” 124 tests passing)
- **2A:** Domain tokenizer library β€” schema, 5 field tokenizers, HF-compatible builder
- **2B:** Model architecture β€” DomainTransformerForCausalLM (NoPE GPT), PLR embeddings, DCNv2 + JointFusion
- **2C:** Pre-training pipeline β€” sequence packing, DataCollatorForLanguageModeling, HF Trainer
- **2D:** Fine-tuning pipeline (next)

### Phase 3: Domain Demos
- Finance: fraud detection, credit scoring on real data
- E-commerce: next purchase prediction, customer segmentation

### Phase 4: Scale & Optimize
- 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary

## Repo Structure

```
src/domain_tokenizer/
β”œβ”€β”€ __init__.py                     # v0.3.0 β€” all public exports
β”œβ”€β”€ schema.py                       # DomainSchema, FieldSpec, FieldType
β”œβ”€β”€ tokenizers/
β”‚   β”œβ”€β”€ field_tokenizers.py         # Sign, MagnitudeBucket, Calendar, Categorical, Discrete
β”‚   └── domain_tokenizer.py         # DomainTokenizerBuilder β†’ HF PreTrainedTokenizerFast
β”œβ”€β”€ schemas/
β”‚   └── predefined.py               # FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ configuration.py            # DomainTransformerConfig (24M/85M/330M presets)
β”‚   β”œβ”€β”€ modeling.py                 # DomainTransformerForCausalLM (NoPE, SDPA, weight-tied)
β”‚   β”œβ”€β”€ plr_embeddings.py           # PeriodicLinearReLU (Gorishniy et al. 2022)
β”‚   └── joint_fusion.py             # DCNv2 + JointFusionModel (nuFormer-style)
└── training/
    β”œβ”€β”€ data_pipeline.py            # tokenize β†’ pack β†’ HFDataset
    └── pretrain.py                 # pretrain_domain_model (HF Trainer)
tests/
β”œβ”€β”€ test_tokenizer.py               # 72 tests
β”œβ”€β”€ test_model.py                   # 33 tests
└── test_training.py                # 19 tests
```
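
The least familiar piece above is probably `plr_embeddings.py`. As intuition for the periodic → linear → ReLU idea from Gorishniy et al. (2022), here is a minimal PyTorch sketch of the technique (not the library's actual implementation):

```python
import torch
import torch.nn as nn

class PeriodicLinearReLU(nn.Module):
    """Sketch of PLR numerical-feature embeddings (Gorishniy et al. 2022)."""

    def __init__(self, n_features: int, n_frequencies: int = 48,
                 d_embedding: int = 64, sigma: float = 0.01):
        super().__init__()
        # one set of learnable frequencies per scalar feature
        self.frequencies = nn.Parameter(
            torch.normal(0.0, sigma, (n_features, n_frequencies)))
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features) raw scalars, e.g. normalized transaction amounts
        v = 2 * torch.pi * self.frequencies[None] * x[..., None]    # (batch, n_features, n_freq)
        periodic = torch.cat([torch.cos(v), torch.sin(v)], dim=-1)  # periodic activation
        return torch.relu(self.linear(periodic))                    # (batch, n_features, d_embedding)
```

Each raw scalar thus gets its own learned embedding vector, which a fusion model can combine with the Transformer's token representations.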

## Key References

| Paper | Year | Role in domainTokenizer | Link |
|-------|------|------------------------|------|
| **nuFormer** (Nubank) | 2025 | Overall architecture blueprint | [arXiv](https://arxiv.org/abs/2507.23267) |
| **NoPE** | 2023 | No positional encoding β€” our attention design | [arXiv](https://arxiv.org/abs/2305.19466) |
| **PLR Embeddings** (Yandex) | 2022 | Numerical feature embeddings | [arXiv](https://arxiv.org/abs/2203.05556) |
| **DCN V2** (Google) | 2021 | Tabular feature crossing in joint fusion | [arXiv](https://arxiv.org/abs/2008.13535) |
| **RecFormer** | 2023 | Items-as-text tokenization philosophy | [arXiv](https://arxiv.org/abs/2305.13731) |
| **TIGER** (Google) | 2023 | Semantic IDs via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| **ActionPiece** (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| **Banking TF** | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| **Nested Learning (HOPE)** | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |

Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)

## License

MIT