# domainTokenizer
**Building small models that understand domain tokens, not just words.**
---
## The Idea
LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.
But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.
**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?
```
Text LLM:        "The cat sat on the mat"  → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word
domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```
## Quick Start
```python
from domain_tokenizer import (
DomainTokenizerBuilder, DomainTransformerConfig,
DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA
# 1. Build tokenizer from schema (Nubank-style: 97 domain tokens + BPE)
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events) # fit magnitude bins on training data
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)
# 2. Prepare packed training data (100% token utilization, zero padding waste)
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)
# 3. Create model (GPT-style, NoPE, pre-norm; 24M params)
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)
# 4. Pre-train with HF Trainer (cosine schedule, CLM objective)
pretrain_domain_model(
model, hf_tokenizer, dataset,
hub_model_id="org/finance-24m", # auto push to HF Hub
num_epochs=10, learning_rate=3e-4,
bf16=True, # A100/H100
report_to="trackio", # live monitoring
)
# 5. Fine-tune for downstream tasks (nuFormer-style joint fusion)
from domain_tokenizer import JointFusionModel
fusion = JointFusionModel(
transformer_model=model, # pre-trained, unfrozen
n_tabular_features=291, # hand-crafted tabular features
n_classes=1, # binary: will user activate product?
)
# Train fusion model end-to-end on labeled data...
```
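Once pre-trained, the model can be queried autoregressively for the next event. A minimal inference sketch, assuming the model exposes Hugging Face's `generate()` (it is a CausalLM) and using a hypothetical `builder.encode_events` helper that renders a user's raw events as the same domain-token string used during training:

```python
import torch

# `builder.encode_events` is a hypothetical helper standing in for whatever
# the builder uses to render raw events as a domain-token string.
history = builder.encode_events(user_events)
inputs = hf_tokenizer(history, return_tensors="pt")

model.eval()
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=14, do_sample=False)

# The generated suffix is the model's greedy prediction of the next event,
# expressed as domain tokens (~14 per transaction in the Nubank recipe).
print(hf_tokenizer.decode(out[0, inputs.input_ids.shape[1]:]))
```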
## Industry Validation: Nubank's nuFormer
This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:
- **Paper:** ["Your spending needs attention: Modeling financial habits with transformers"](https://arxiv.org/abs/2507.23267) (arXiv: 2507.23267)
- **Blog series:** [Building Nubank β Foundation Models](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/)
**Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M-330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).
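To make "~14 tokens per transaction" concrete, here is an illustrative expansion (field layout and token names are guesses at the shape, not Nubank's published vocabulary):

```
{amount: -79.99, timestamp: 2025-03-12T14:03, description: "NETFLIX.COM"}

  structured fields → special tokens         description → BPE subwords
  [DEBIT] [AMT_BIN_37] [WED] [AFTERNOON] ... [NET] [FLIX] [.COM]
```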
**Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)
## Why This Matters
| Problem | Text Tokenizer | Domain Tokenizer |
|---------|---------------|-----------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Wednesday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
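The price row above relies on percentile bins fitted to training data (what `builder.fit` does in the Quick Start). A minimal sketch of the assumed mechanism in plain NumPy; the library's `MagnitudeBucket` tokenizer may differ in detail:

```python
import numpy as np

class MagnitudeBins:
    """Percentile-based magnitude binning (illustrative, not the library's code)."""

    def __init__(self, n_bins: int = 50):
        self.n_bins = n_bins
        self.edges: np.ndarray | None = None

    def fit(self, amounts: np.ndarray) -> None:
        # Quantile edges: each bin holds roughly the same share of training data.
        qs = np.linspace(0, 100, self.n_bins + 1)
        self.edges = np.percentile(np.abs(amounts), qs)

    def tokenize(self, amount: float) -> str:
        idx = int(np.searchsorted(self.edges, abs(amount))) - 1
        idx = max(0, min(idx, self.n_bins - 1))   # clamp out-of-range amounts
        return f"[AMT_BIN_{idx}]"

bins = MagnitudeBins(n_bins=50)
bins.fit(np.array([3.20, 12.50, 79.99, 450.00, 1200.00]))
print(bins.tokenize(79.99))   # "[AMT_BIN_24]" on this toy sample (the median)
```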
## Documentation
| Document | Description |
|----------|-------------|
| [`docs/research_report.md`](docs/research_report.md) | **Research survey**: 31 papers across 5 paradigms, technical taxonomy, blueprint |
| [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering**: full pipeline reconstruction, 4 academic pillars |
| [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record**: PyTorch+HF vs JAX/Keras, trade-offs, roadmap |
| [`docs/phase2_implementation_report.md`](docs/phase2_implementation_report.md) | **Implementation report**: Phase 2A-2C technical decisions, architecture, 124 tests |
## Project Roadmap
### Phase 1: Research & Survey ✅
- Literature survey (35+ papers), Nubank reverse-engineering, framework ADR
### Phase 2: Core Library ✅ (v0.3.0, 124 tests passing)
- **2A:** Domain tokenizer library: schema, 5 field tokenizers, HF-compatible builder
- **2B:** Model architecture: DomainTransformerForCausalLM (NoPE GPT), PLR embeddings, DCNv2 + JointFusion
- **2C:** Pre-training pipeline: sequence packing (sketched after this list), DataCollatorForLanguageModeling, HF Trainer
- **2D:** Fine-tuning pipeline (next)
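A minimal sketch of the sequence packing behind the Quick Start's "100% token utilization, zero padding waste" claim, assuming the common concatenate-and-slice approach (see `training/data_pipeline.py` for the real implementation):

```python
def pack_sequences(tokenized_users: list[list[int]],
                   block_size: int, eos_id: int) -> list[list[int]]:
    # Concatenate every user's token ids into one stream, with EOS marking
    # user boundaries, then slice into fixed-size blocks: no padding needed.
    stream: list[int] = []
    for ids in tokenized_users:
        stream.extend(ids + [eos_id])
    n_blocks = len(stream) // block_size   # trailing remainder is dropped
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

print(pack_sequences([[5, 9, 11], [7, 3]], block_size=3, eos_id=2))
# [[5, 9, 11], [2, 7, 3]]
```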
### Phase 3: Domain Demos
- Finance: fraud detection, credit scoring on real data
- E-commerce: next purchase prediction, customer segmentation
### Phase 4: Scale & Optimize
- 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
## Repo Structure
```
src/domain_tokenizer/
├── __init__.py              # v0.3.0: all public exports
├── schema.py                # DomainSchema, FieldSpec, FieldType
├── tokenizers/
│   ├── field_tokenizers.py  # Sign, MagnitudeBucket, Calendar, Categorical, Discrete
│   └── domain_tokenizer.py  # DomainTokenizerBuilder → HF PreTrainedTokenizerFast
├── schemas/
│   └── predefined.py        # FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
├── models/
│   ├── configuration.py     # DomainTransformerConfig (24M/85M/330M presets)
│   ├── modeling.py          # DomainTransformerForCausalLM (NoPE, SDPA, weight-tied)
│   ├── plr_embeddings.py    # PeriodicLinearReLU (Gorishniy et al. 2022)
│   └── joint_fusion.py      # DCNv2 + JointFusionModel (nuFormer-style)
└── training/
    ├── data_pipeline.py     # tokenize → pack → HFDataset
    └── pretrain.py          # pretrain_domain_model (HF Trainer)
tests/
├── test_tokenizer.py        # 72 tests
├── test_model.py            # 33 tests
└── test_training.py         # 19 tests
```
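For reference, a sketch of the Periodic-Linear-ReLU numerical embedding implemented in `plr_embeddings.py`, following Gorishniy et al. 2022 (hyperparameters and signature here are illustrative, not the module's actual API):

```python
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic → Linear → ReLU embedding for one scalar feature."""

    def __init__(self, n_frequencies: int = 48, d_embedding: int = 64,
                 sigma: float = 0.01):
        super().__init__()
        # Trainable frequencies, initialized from N(0, sigma^2) as in the paper.
        self.frequencies = nn.Parameter(torch.randn(n_frequencies) * sigma)
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) scalar feature → (batch, d_embedding)
        v = 2 * torch.pi * self.frequencies * x.unsqueeze(-1)
        periodic = torch.cat([torch.cos(v), torch.sin(v)], dim=-1)
        return torch.relu(self.linear(periodic))

emb = PLREmbedding()
print(emb(torch.tensor([0.5, -1.2])).shape)   # torch.Size([2, 64])
```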
## Key References
| Paper | Year | Role in domainTokenizer | Link |
|-------|------|------------------------|------|
| **nuFormer** (Nubank) | 2025 | Overall architecture blueprint | [arXiv](https://arxiv.org/abs/2507.23267) |
| **NoPE** | 2023 | No positional encoding; informs our attention design | [arXiv](https://arxiv.org/abs/2305.19466) |
| **PLR Embeddings** (Yandex) | 2022 | Numerical feature embeddings | [arXiv](https://arxiv.org/abs/2203.05556) |
| **DCN V2** (Google) | 2021 | Tabular feature crossing in joint fusion | [arXiv](https://arxiv.org/abs/2008.13535) |
| **RecFormer** | 2023 | Items-as-text tokenization philosophy | [arXiv](https://arxiv.org/abs/2305.13731) |
| **TIGER** (Google) | 2023 | Semantic IDs via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| **ActionPiece** (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| **Banking TF** | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| **Nested Learning (HOPE)** | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)
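For the DCN V2 row: one cross layer computes `x_{l+1} = x_0 * (W x_l + b) + x_l` (element-wise product), so stacking layers yields explicit bounded-degree feature crosses over the tabular inputs. A minimal sketch, not the `joint_fusion.py` implementation:

```python
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """One DCN V2 cross layer (Wang et al. 2021), illustrative only."""

    def __init__(self, d: int):
        super().__init__()
        self.linear = nn.Linear(d, d)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        # x0: original features; xl: output of the previous cross layer.
        return x0 * self.linear(xl) + xl

x0 = torch.randn(8, 291)       # e.g. the 291 hand-crafted tabular features
layer = CrossLayerV2(291)
print(layer(x0, x0).shape)     # torch.Size([8, 291])
```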
## License
MIT