# 🔑 domainTokenizer
**Building small models that understand domain tokens, not just words.**
---
## The Idea
LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.
But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.
**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?
```
Text LLM:        "The cat sat on the mat"   → [The] [cat] [sat] [on] [the] [mat]                        → Transformer → next word
domainTokenizer: Customer purchase history  → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay]  → Transformer → next purchase
```
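To make the mapping concrete, here is a minimal, hypothetical sketch of flattening one purchase event into domain tokens. The field names, token strings, and bin edges are invented for illustration; this is not the `domain_tokenizer` API.

```python
# Hypothetical illustration only: token strings, field names, and bin edges
# are invented; this is not the domain_tokenizer API.
def amount_bin(amount: float, edges=(10, 50, 100, 500, 1000)) -> int:
    """Toy stand-in for quantile binning fit on training data."""
    return sum(amount >= e for e in edges)

def to_domain_tokens(event: dict) -> list[str]:
    """Flatten one structured event into a short sequence of domain tokens."""
    return [
        f"[CAT_{event['category'].upper()}]",        # categorical field -> one token
        f"[AMT_BIN_{amount_bin(event['amount'])}]",  # numeric field -> magnitude bin token
        f"[DOW_{event['weekday'].upper()}]",         # calendar feature derived from the timestamp
        f"[PAY_{event['payment'].upper()}]",         # payment type / sign
    ]

print(to_domain_tokens(
    {"category": "Electronics", "amount": 379.90, "weekday": "Sat", "payment": "credit"}
))
# ['[CAT_ELECTRONICS]', '[AMT_BIN_3]', '[DOW_SAT]', '[PAY_CREDIT]']
```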
## Quick Start
```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# 1. Build tokenizer from schema (Nubank-style: 97 domain tokens + BPE)
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)  # fit magnitude bins on training data
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data (100% token utilization, zero padding waste)
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

# 3. Create model (GPT-style, NoPE, pre-norm; 24M params)
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)

# 4. Pre-train with HF Trainer (cosine schedule, CLM objective)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",  # auto-push to HF Hub
    num_epochs=10, learning_rate=3e-4,
    bf16=True,            # A100/H100
    report_to="trackio",  # live monitoring
)

# 5. Fine-tune for downstream tasks (nuFormer-style joint fusion)
from domain_tokenizer import JointFusionModel

fusion = JointFusionModel(
    transformer_model=model,  # pre-trained, unfrozen
    n_tabular_features=291,   # hand-crafted tabular features
    n_classes=1,              # binary: will user activate product?
)
# Train fusion model end-to-end on labeled data...
```
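The "packed training data" in step 2 means that tokenized user sequences are concatenated and cut into fixed `block_size` chunks, so every position in a batch carries a real token rather than padding. A minimal sketch of that idea (not the `prepare_clm_dataset` internals; the separator token is an assumption):

```python
# Illustrative sequence packing: concatenate tokenized user sequences, then slice
# into fixed-size blocks. Not the actual prepare_clm_dataset implementation.
def pack_sequences(tokenized_seqs: list[list[int]], block_size: int, sep_id: int) -> list[list[int]]:
    stream: list[int] = []
    for seq in tokenized_seqs:
        stream.extend(seq + [sep_id])        # mark the boundary between users
    n_blocks = len(stream) // block_size     # drop the ragged tail: no padding needed
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

blocks = pack_sequences([[5, 7, 9], [11, 13], [17, 19, 23, 29]], block_size=4, sep_id=0)
print(blocks)  # [[5, 7, 9, 0], [11, 13, 0, 17], [19, 23, 29, 0]]
```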
## 🏦 Industry Validation: Nubank's nuFormer
This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:
- **Paper:** ["Your spending needs attention: Modeling financial habits with transformers"](https://arxiv.org/abs/2507.23267) (arXiv: 2507.23267)
- **Blog series:** [Building Nubank - Foundation Models](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/)
**Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M-330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).
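For a rough sense of what "~14 tokens per transaction" looks like, here is a hypothetical token stream for one card transaction. The token names are invented for illustration and are not Nubank's actual vocabulary (that reconstruction lives in the analysis doc linked below).

```python
# Hypothetical example: ~5 structured-field tokens + ~8 BPE pieces of the
# merchant description + an end marker. Token names are invented.
transaction_tokens = [
    "[TX_START]", "[AMT_BIN_41]", "[DEBIT]", "[DOW_FRI]", "[HOUR_EVENING]",
    "ama", "zon", " market", "pla", "ce", " br", " s", "p",
    "[TX_END]",
]
assert len(transaction_tokens) == 14
```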
📄 **Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)
## Why This Matters
| Problem | Text Tokenizer | Domain Tokenizer |
|---------|---------------|-----------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
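The price row above relies on quantile binning fit on training data. A minimal sketch of that step with NumPy (illustrative only: the bin count and token naming are assumptions, and this is not the library's magnitude-bucket tokenizer):

```python
import numpy as np

# Fit percentile bin edges on training prices, then map any price to a bin token.
rng = np.random.default_rng(0)
train_prices = rng.lognormal(mean=3.5, sigma=1.0, size=10_000)  # stand-in for real price data

edges = np.percentile(train_prices, np.arange(2, 100, 2))       # 49 internal edges -> 50 bins

def price_token(price: float) -> str:
    bin_idx = int(np.searchsorted(edges, price))                # 0..49, the price's percentile bin
    return f"price_bin_{bin_idx}"

print(price_token(79.99))  # e.g. 'price_bin_40'; the exact bin depends on the fitted distribution
```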
## Documentation
| Document | Description |
|----------|-------------|
| 📄 [`docs/research_report.md`](docs/research_report.md) | **Research survey**: 31 papers across 5 paradigms, technical taxonomy, blueprint |
| 🏦 [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering**: full pipeline reconstruction, 4 academic pillars |
| 🏗️ [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record**: PyTorch+HF vs JAX/Keras, trade-offs, roadmap |
| 📊 [`docs/phase2_implementation_report.md`](docs/phase2_implementation_report.md) | **Implementation report**: Phase 2A-2C technical decisions, architecture, 124 tests |
## Project Roadmap
### Phase 1: Research & Survey ✅
- Literature survey (35+ papers), Nubank reverse-engineering, framework ADR
### Phase 2: Core Library ✅ (v0.3.0, 124 tests passing)
- **2A:** Domain tokenizer library: schema, 5 field tokenizers, HF-compatible builder
- **2B:** Model architecture: DomainTransformerForCausalLM (NoPE GPT), PLR embeddings (see the sketch after this list), DCNv2 + JointFusion
- **2C:** Pre-training pipeline: sequence packing, DataCollatorForLanguageModeling, HF Trainer
- **2D:** Fine-tuning pipeline (next)
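The PLR embeddings mentioned in 2B follow Gorishniy et al. (2022): each scalar feature passes through a periodic encoding (sin/cos of learnable frequencies), then a linear layer and a ReLU. Below is a minimal PyTorch sketch of that idea, not the library's `plr_embeddings.py` (which may differ, e.g. by using per-feature linear weights as in the paper); all sizes are placeholder hyperparameters.

```python
import torch
import torch.nn as nn

class PLRSketch(nn.Module):
    """Periodic -> Linear -> ReLU embedding for scalar numeric features
    (sketch of Gorishniy et al. 2022)."""

    def __init__(self, n_features: int, n_frequencies: int = 48, d_embedding: int = 64, sigma: float = 0.1):
        super().__init__()
        # one set of learnable frequencies per feature
        self.frequencies = nn.Parameter(torch.randn(n_features, n_frequencies) * sigma)
        # shared projection for simplicity; the paper uses per-feature weights
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features) raw scalar values
        angles = 2 * torch.pi * self.frequencies[None] * x[..., None]     # (batch, n_features, n_freq)
        periodic = torch.cat([torch.sin(angles), torch.cos(angles)], -1)  # periodic encoding
        return torch.relu(self.linear(periodic))                          # (batch, n_features, d_embedding)

emb = PLRSketch(n_features=3)
print(emb(torch.randn(8, 3)).shape)  # torch.Size([8, 3, 64])
```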
### Phase 3: Domain Demos
- Finance: fraud detection, credit scoring on real data
- E-commerce: next purchase prediction, customer segmentation
### Phase 4: Scale & Optimize
- 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
## Repo Structure
```
src/domain_tokenizer/
├── __init__.py                # v0.3.0: all public exports
├── schema.py                  # DomainSchema, FieldSpec, FieldType
├── tokenizers/
│   ├── field_tokenizers.py    # Sign, MagnitudeBucket, Calendar, Categorical, Discrete
│   └── domain_tokenizer.py    # DomainTokenizerBuilder → HF PreTrainedTokenizerFast
├── schemas/
│   └── predefined.py          # FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
├── models/
│   ├── configuration.py       # DomainTransformerConfig (24M/85M/330M presets)
│   ├── modeling.py            # DomainTransformerForCausalLM (NoPE, SDPA, weight-tied)
│   ├── plr_embeddings.py      # PeriodicLinearReLU (Gorishniy et al. 2022)
│   └── joint_fusion.py        # DCNv2 + JointFusionModel (nuFormer-style)
└── training/
    ├── data_pipeline.py       # tokenize → pack → HFDataset
    └── pretrain.py            # pretrain_domain_model (HF Trainer)
tests/
├── test_tokenizer.py          # 72 tests
├── test_model.py              # 33 tests
└── test_training.py           # 19 tests
```
## Key References
| Paper | Year | Role in domainTokenizer | Link |
|-------|------|------------------------|------|
| **nuFormer** (Nubank) | 2025 | Overall architecture blueprint | [arXiv](https://arxiv.org/abs/2507.23267) |
| **NoPE** | 2023 | No positional encoding (our attention design) | [arXiv](https://arxiv.org/abs/2305.19466) |
| **PLR Embeddings** (Yandex) | 2022 | Numerical feature embeddings | [arXiv](https://arxiv.org/abs/2203.05556) |
| **DCN V2** (Google) | 2021 | Tabular feature crossing in joint fusion | [arXiv](https://arxiv.org/abs/2008.13535) |
| **RecFormer** | 2023 | Items-as-text tokenization philosophy | [arXiv](https://arxiv.org/abs/2305.13731) |
| **TIGER** (Google) | 2023 | Semantic IDs via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| **ActionPiece** (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| **Banking TF** | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| **Nested Learning (HOPE)** | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)
## License
MIT