# domainTokenizer

**Building small models that understand domain tokens – not just words.**

---

## The Idea

LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.

But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day – purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.

**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** – products, transactions, medical codes, user actions – as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?

```
Text LLM:        "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word

domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```

## Quick Start

```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# 1. Build tokenizer from schema (Nubank-style: 97 domain tokens + BPE)
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)  # fit magnitude bins on training data
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data (100% token utilization, zero padding waste)
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

# 3. Create model (GPT-style, NoPE, pre-norm – 24M params)
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)

# 4. Pre-train with HF Trainer (cosine schedule, CLM objective)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",  # auto push to HF Hub
    num_epochs=10, learning_rate=3e-4,
    bf16=True,                       # A100/H100
    report_to="trackio",             # live monitoring
)

# 5. Fine-tune for downstream tasks (nuFormer-style joint fusion)
from domain_tokenizer import JointFusionModel
fusion = JointFusionModel(
    transformer_model=model,     # pre-trained, unfrozen
    n_tabular_features=291,      # hand-crafted tabular features
    n_classes=1,                 # binary: will user activate product?
)
# Train fusion model end-to-end on labeled data...
```
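The snippet assumes three inputs that the library does not create for you: `all_events` (raw events used to fit the magnitude bins), `descriptions` (a text corpus for the BPE part of the vocabulary), and `user_sequences` (per-user, chronologically ordered histories). A rough sketch of plausible shapes – the field names and container types below are illustrative assumptions, not the documented contract of `FINANCE_SCHEMA`:

```python
# Hypothetical input shapes – field names and containers are assumptions, not the library's spec.
all_events = [
    {"amount": -42.90, "timestamp": "2025-03-15T14:05:00",
     "category": "electronics", "description": "BESTBUY 8847 SAO PAULO"},
    {"amount": 3500.00, "timestamp": "2025-03-20T09:12:00",
     "category": "salary", "description": "PAYROLL ACME LTDA"},
    # ... one dict per event, pooled across all users (used by builder.fit)
]

descriptions = [e["description"] for e in all_events]   # free-text corpus for BPE training

user_sequences = [                   # one chronologically ordered event list per user
    [all_events[0], all_events[1]],
    # ...
]
```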
## Industry Validation: Nubank's nuFormer

This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:

- **Paper:** ["Your spending needs attention: Modeling financial habits with transformers"](https://arxiv.org/abs/2507.23267) (arXiv:2507.23267)
- **Blog series:** [Building Nubank – Foundation Models](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/)

**Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M–330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).
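To make "~14 tokens per transaction" concrete, here is a rough illustration of how one card purchase could expand into a handful of special field tokens plus a few BPE pieces of the merchant string. The token names are invented for this sketch; they are not Nubank's published vocabulary.

```python
# Illustrative only – invented token names, not Nubank's actual 97-token vocabulary.
txn = {
    "amount": -79.99,                    # negative = money out
    "timestamp": "2025-03-15T14:05:00",  # a Saturday afternoon
    "description": "BESTBUY 8847 SAO PAULO",
}

field_tokens = [
    "[TXN]",          # transaction boundary
    "[DEBIT]",        # credit/debit sign of the amount
    "[AMT_BIN_37]",   # quantile-binned magnitude
    "[SATURDAY]",     # calendar: day of week
    "[AFTERNOON]",    # calendar: time of day
]
merchant_bpe = ["best", "buy", "88", "47", "sao", "paulo"]  # BPE pieces of the description

tokens = field_tokens + merchant_bpe
print(len(tokens))  # 11 here; real transactions land around ~14 tokens
```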
**Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)

## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
|---------|----------------|------------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
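The `price_bin_37` row above is plain quantile binning: fit bin edges once on training amounts, then map each new amount to its bin index. A minimal sketch of the idea (not the library's `MagnitudeBucket` tokenizer, whose exact behavior may differ):

```python
import numpy as np

# Fit 50 quantile bin edges on training amounts, then map amounts to bin tokens.
train_amounts = np.abs(np.array([12.5, 79.99, 3.2, 250.0, 18.0, 999.0, 42.0]))
edges = np.quantile(train_amounts, np.linspace(0, 1, 51))  # 50 bins

def amount_to_token(amount: float) -> str:
    bin_idx = int(np.searchsorted(edges, abs(amount), side="right")) - 1
    bin_idx = min(max(bin_idx, 0), 49)  # clamp to [0, 49]
    return f"price_bin_{bin_idx}"

print(amount_to_token(79.99))  # "price_bin_33" on this toy data (≈ the 67th percentile)
```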
## Documentation

| Document | Description |
|----------|-------------|
| [`docs/research_report.md`](docs/research_report.md) | **Research survey** – 31 papers across 5 paradigms, technical taxonomy, blueprint |
| [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** – full pipeline reconstruction, 4 academic pillars |
| [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record** – PyTorch+HF vs JAX/Keras, trade-offs, roadmap |
| [`docs/phase2_implementation_report.md`](docs/phase2_implementation_report.md) | **Implementation report** – Phase 2A-2C technical decisions, architecture, 124 tests |
## Project Roadmap

### Phase 1: Research & Survey ✅
- Literature survey (35+ papers), Nubank reverse-engineering, framework ADR

### Phase 2: Core Library ✅ (v0.3.0 – 124 tests passing)
- **2A:** Domain tokenizer library – schema, 5 field tokenizers, HF-compatible builder
- **2B:** Model architecture – DomainTransformerForCausalLM (NoPE GPT), PLR embeddings, DCNv2 + JointFusion
- **2C:** Pre-training pipeline – sequence packing (see the sketch below), DataCollatorForLanguageModeling, HF Trainer
- **2D:** Fine-tuning pipeline (next)
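The "sequence packing" in 2C is the standard causal-LM trick: concatenate every tokenized user history into one long stream with separators, then slice it into fixed `block_size` chunks so no position is padding. A minimal sketch of the idea – the library's `data_pipeline.py` may differ in details:

```python
from typing import Iterable

def pack_sequences(tokenized_users: Iterable[list[int]], eos_id: int, block_size: int = 512) -> list[list[int]]:
    """Concatenate per-user token ids (separated by EOS) and slice into full blocks.

    Sketch of the packing idea only: every block is exactly block_size tokens,
    so the CLM collator never needs padding.
    """
    stream: list[int] = []
    for ids in tokenized_users:
        stream.extend(ids)
        stream.append(eos_id)
    n_blocks = len(stream) // block_size   # drop the ragged tail
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]
```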
### Phase 3: Domain Demos
- Finance: fraud detection, credit scoring on real data
- E-commerce: next purchase prediction, customer segmentation

### Phase 4: Scale & Optimize
- 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary

## Repo Structure

```
src/domain_tokenizer/
├── __init__.py                  # v0.3.0 – all public exports
├── schema.py                    # DomainSchema, FieldSpec, FieldType
├── tokenizers/
│   ├── field_tokenizers.py      # Sign, MagnitudeBucket, Calendar, Categorical, Discrete
│   └── domain_tokenizer.py      # DomainTokenizerBuilder → HF PreTrainedTokenizerFast
├── schemas/
│   └── predefined.py            # FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
├── models/
│   ├── configuration.py         # DomainTransformerConfig (24M/85M/330M presets)
│   ├── modeling.py              # DomainTransformerForCausalLM (NoPE, SDPA, weight-tied)
│   ├── plr_embeddings.py        # PeriodicLinearReLU (Gorishniy et al. 2022)
│   └── joint_fusion.py          # DCNv2 + JointFusionModel (nuFormer-style)
└── training/
    ├── data_pipeline.py         # tokenize → pack → HFDataset
    └── pretrain.py              # pretrain_domain_model (HF Trainer)
tests/
├── test_tokenizer.py            # 72 tests
├── test_model.py                # 33 tests
└── test_training.py             # 19 tests
```
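For intuition on two of the modules listed above – `plr_embeddings.py` and `joint_fusion.py` – here is a compressed sketch of the standard formulations they are named after (PLR numerical embeddings from Gorishniy et al. 2022, and a DCNv2 cross layer). The repo's actual modules may differ in shapes and details.

```python
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic → Linear → ReLU embedding of one scalar feature (sketch, not the repo's code)."""
    def __init__(self, n_frequencies: int = 16, d_embedding: int = 32, sigma: float = 1.0):
        super().__init__()
        self.frequencies = nn.Parameter(torch.randn(n_frequencies) * sigma)  # learnable c_i
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:            # x: (batch,)
        v = 2 * torch.pi * self.frequencies * x[:, None]            # (batch, k)
        periodic = torch.cat([torch.cos(v), torch.sin(v)], dim=-1)  # (batch, 2k)
        return torch.relu(self.linear(periodic))                    # (batch, d_embedding)

class CrossLayerV2(nn.Module):
    """One DCNv2 cross layer: x_{l+1} = x_0 * (W x_l + b) + x_l."""
    def __init__(self, d: int):
        super().__init__()
        self.w = nn.Linear(d, d)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.w(xl) + xl

plr = PLREmbedding()
emb = plr(torch.tensor([0.37, 12.5]))  # two scalar amounts → (2, 32) embeddings
```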
## Key References

| Paper | Year | Role in domainTokenizer | Link |
|-------|------|-------------------------|------|
| **nuFormer** (Nubank) | 2025 | Overall architecture blueprint | [arXiv](https://arxiv.org/abs/2507.23267) |
| **NoPE** | 2023 | No positional encoding → our attention design | [arXiv](https://arxiv.org/abs/2305.19466) |
| **PLR Embeddings** (Yandex) | 2022 | Numerical feature embeddings | [arXiv](https://arxiv.org/abs/2203.05556) |
| **DCN V2** (Google) | 2021 | Tabular feature crossing in joint fusion | [arXiv](https://arxiv.org/abs/2008.13535) |
| **RecFormer** | 2023 | Items-as-text tokenization philosophy | [arXiv](https://arxiv.org/abs/2305.13731) |
| **TIGER** (Google) | 2023 | Semantic IDs via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| **ActionPiece** (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| **Banking TF** | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| **Nested Learning (HOPE)** | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |

Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)

## License

MIT