Update README v0.3.0 — add usage example, update roadmap status, add implementation report link
README.md

```
Text LLM: "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat]
domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```

## Quick Start

```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# 1. Build tokenizer from schema (Nubank-style: 97 domain tokens + BPE)
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)  # fit magnitude bins on training data
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data (100% token utilization, zero padding waste)
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

# 3. Create model (GPT-style, NoPE, pre-norm — 24M params)
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)

# 4. Pre-train with HF Trainer (cosine schedule, CLM objective)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",  # auto-push to HF Hub
    num_epochs=10, learning_rate=3e-4,
    bf16=True,  # A100/H100
    report_to="trackio",  # live monitoring
)

# 5. Fine-tune for downstream tasks (nuFormer-style joint fusion)
from domain_tokenizer import JointFusionModel
fusion = JointFusionModel(
    transformer_model=model,  # pre-trained, unfrozen
    n_tabular_features=291,   # hand-crafted tabular features
    n_classes=1,              # binary: will the user activate the product?
)
# Train the fusion model end-to-end on labeled data...
```
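
`FINANCE_SCHEMA` is one of the predefined schemas shipped in `domain_tokenizer.schemas`. As a rough, self-contained sketch of what such a schema boils down to (the field names and the `FieldSpec` shape below are illustrative assumptions; the real `DomainSchema`/`FieldSpec` live in `schema.py` and may differ):

```python
from dataclasses import dataclass
from enum import Enum

class FieldType(Enum):
    # illustrative stand-in for the FieldType in schema.py
    CATEGORICAL = "categorical"
    NUMERIC = "numeric"
    TIMESTAMP = "timestamp"

@dataclass
class FieldSpec:
    # illustrative stand-in for the FieldSpec in schema.py
    name: str
    field_type: FieldType
    n_bins: int | None = None  # magnitude buckets for numeric fields

# A finance-style event schema: each field gets its own field tokenizer
FINANCE_SCHEMA_SKETCH = [
    FieldSpec("merchant_category", FieldType.CATEGORICAL),
    FieldSpec("amount", FieldType.NUMERIC, n_bins=16),
    FieldSpec("timestamp", FieldType.TIMESTAMP),
]
```
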
## 🏦 Industry Validation: Nubank's nuFormer
This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:

| Timestamp `2025-03-12` | Calendar-unaware text fragments | `[Wednesday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
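
To make the calendar-aware row concrete, here is a minimal sketch of that mapping (the time-of-day buckets and token spellings are my own assumptions, not the library's):

```python
from datetime import datetime

def calendar_tokens(ts: datetime, prev: datetime | None = None) -> list[str]:
    """Map a raw timestamp to discrete, calendar-aware tokens."""
    tokens = [ts.strftime("%A")]  # weekday name, e.g. "Wednesday"
    buckets = [(6, "Night"), (12, "Morning"), (18, "Afternoon"), (24, "Evening")]
    tokens.append(next(label for bound, label in buckets if ts.hour < bound))
    if prev is not None:  # bucketed gap to the previous event
        days = (ts - prev).days
        tokens.append(f"{days}_days_later" if days > 0 else "same_day")
    return tokens

print(calendar_tokens(datetime(2025, 3, 12, 15, 0), prev=datetime(2025, 3, 10, 9, 0)))
# ['Wednesday', 'Afternoon', '2_days_later']
```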
## Research Foundation

This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm** — the challenge is *how* to tokenize.

| Paradigm | Method | Key Paper |
|----------|--------|-----------|
| **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
| **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
| **Transaction Tokenization** | Special tokens + BPE hybrid | [nuFormer](https://arxiv.org/abs/2507.23267) (Nubank, 2025) |
| **Tabular Tokenization** | Periodic embeddings for numbers | [PLR](https://arxiv.org/abs/2203.05556) (Yandex, 2022) |
| **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |

## Documentation

| Document | Description |
|----------|-------------|
| 📄 [`docs/research_report.md`](docs/research_report.md) | **Research survey** — 31 papers across 5 paradigms, technical taxonomy, blueprint |
| 🏦 [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** — full pipeline reconstruction, 4 academic pillars |
| 🏗️ [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record** — PyTorch+HF vs JAX/Keras, trade-offs, roadmap |
| 📊 [`docs/phase2_implementation_report.md`](docs/phase2_implementation_report.md) | **Implementation report** — Phase 2A-2C technical decisions, architecture, 124 tests |

## Implementation Decision

After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:

**Decision: PyTorch + HuggingFace Transformers** (with JAX as a future scaling path)

Key reasons:
- **5 of 6 reference papers use PyTorch** (including Google DeepMind's ActionPiece)
- **HuggingFace has the only complete custom tokenizer pipeline** (`PreTrainedTokenizerFast` → Trainer → `push_to_hub`; sketched below)
- **Production deployment is direct:** ONNX, TGI, and vLLM are all first-class
- JAX advantages (TPU, XLA) only matter at >1B params on 256+ accelerators — not at our 24M–330M scale

Full analysis: [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md)
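
To make the tokenizer-pipeline point concrete, the plain-HuggingFace route looks roughly like this (a minimal sketch; `DomainTokenizerBuilder` wraps the same machinery, and the special tokens here are sample placeholders):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Train a small BPE vocabulary, reserving domain tokens up front
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=8000,
    special_tokens=["[UNK]", "[PAD]", "[Wednesday]", "[Afternoon]"],
)
tok.train_from_iterator(["electronics purchase 59.90", "grocery 12.40"], trainer)

# Wrap as an HF fast tokenizer: now usable with Trainer, datasets, push_to_hub
hf_tok = PreTrainedTokenizerFast(tokenizer_object=tok, unk_token="[UNK]", pad_token="[PAD]")
print(hf_tok.tokenize("electronics purchase"))
```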
## Project Roadmap
### Phase 1: Research & Survey ✅
- Literature survey (35+ papers), Nubank reverse-engineering, framework ADR

### Phase 2: Core Library ✅ (v0.3.0 — 124 tests passing)
- **2A:** Domain tokenizer library — schema, 5 field tokenizers, HF-compatible builder
- **2B:** Model architecture — DomainTransformerForCausalLM (NoPE GPT), PLR embeddings, DCNv2 + JointFusion
- **2C:** Pre-training pipeline — sequence packing (sketched below), DataCollatorForLanguageModeling, HF Trainer
- **2D:** Fine-tuning pipeline (next)
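
The packing step in 2C concatenates tokenized event sequences into one EOS-separated stream and slices it into fixed-size blocks, so every position in a batch carries a real token. Conceptually (a simplified sketch, not the actual `data_pipeline.py`):

```python
def pack_sequences(sequences: list[list[int]], block_size: int, eos_id: int) -> list[list[int]]:
    """Concatenate token sequences (EOS-separated) and cut into full blocks."""
    stream: list[int] = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)  # sequence boundary
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

blocks = pack_sequences([[5, 8, 2], [7, 7, 1, 9]], block_size=4, eos_id=0)
# → [[5, 8, 2, 0], [7, 7, 1, 9]]: every block fully used, zero padding
```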

### Phase 3: Domain Demos
- Finance: fraud detection, credit scoring on real data
- E-commerce: next purchase prediction, customer segmentation

### Phase 4: Scale & Optimize
- 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
## Repo Structure

```
src/domain_tokenizer/
├── __init__.py              # v0.3.0 — all public exports
├── schema.py                # DomainSchema, FieldSpec, FieldType
├── tokenizers/
│   ├── field_tokenizers.py  # Sign, MagnitudeBucket, Calendar, Categorical, Discrete
│   └── domain_tokenizer.py  # DomainTokenizerBuilder → HF PreTrainedTokenizerFast
├── schemas/
│   └── predefined.py        # FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
├── models/
│   ├── configuration.py     # DomainTransformerConfig (24M/85M/330M presets)
│   ├── modeling.py          # DomainTransformerForCausalLM (NoPE, SDPA, weight-tied)
│   ├── plr_embeddings.py    # PeriodicLinearReLU (Gorishniy et al. 2022)
│   └── joint_fusion.py      # DCNv2 + JointFusionModel (nuFormer-style)
└── training/
    ├── data_pipeline.py     # tokenize → pack → HFDataset
    └── pretrain.py          # pretrain_domain_model (HF Trainer)
tests/
├── test_tokenizer.py        # 72 tests
├── test_model.py            # 33 tests
└── test_training.py         # 19 tests
```
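
`plr_embeddings.py` follows the periodic-embedding recipe of Gorishniy et al. (2022): each scalar feature is expanded into sine/cosine features with learned frequencies, then passed through a linear layer and a ReLU. A minimal sketch of the idea (dimensions are arbitrary here, and the module's real interface may differ):

```python
import math
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic → Linear → ReLU embedding for a single numeric feature."""
    def __init__(self, n_frequencies: int = 8, d_embedding: int = 16, sigma: float = 1.0):
        super().__init__()
        # Learned frequencies, initialized from N(0, sigma^2) as in the paper
        self.frequencies = nn.Parameter(torch.randn(n_frequencies) * sigma)
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch,)
        v = 2 * math.pi * self.frequencies * x.unsqueeze(-1)  # (batch, n_frequencies)
        periodic = torch.cat([torch.sin(v), torch.cos(v)], dim=-1)
        return torch.relu(self.linear(periodic))  # (batch, d_embedding)

emb = PLREmbedding()
print(emb(torch.tensor([0.5, -1.2])).shape)  # torch.Size([2, 16])
```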
## Key References

| Paper | Year | Role in domainTokenizer | Link |
|-------|------|-------------------------|------|
| **nuFormer** (Nubank) | 2025 | Overall architecture blueprint | [arXiv](https://arxiv.org/abs/2507.23267) |
| **NoPE** | 2023 | No positional encoding — our attention design | [arXiv](https://arxiv.org/abs/2305.19466) |
| **PLR Embeddings** (Yandex) | 2022 | Numerical feature embeddings | [arXiv](https://arxiv.org/abs/2203.05556) |
| **DCN V2** (Google) | 2021 | Tabular feature crossing in joint fusion | [arXiv](https://arxiv.org/abs/2008.13535) |
| **RecFormer** | 2023 | Items-as-text tokenization philosophy | [arXiv](https://arxiv.org/abs/2305.13731) |
| **TIGER** (Google) | 2023 | Semantic IDs via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| **ActionPiece** (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| **Banking TF** | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| **Nested Learning (HOPE)** | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
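
The DCN V2 row refers to the cross layers used on the tabular side of the fusion model: each layer computes `x_{l+1} = x_0 ⊙ (W·x_l + b) + x_l`, an explicit feature interaction plus a residual. A standalone sketch (layer width is illustrative):

```python
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """One DCNv2 cross layer: x_{l+1} = x0 * (W @ x_l + b) + x_l."""
    def __init__(self, d: int):
        super().__init__()
        self.w = nn.Linear(d, d)  # full-rank W with bias b

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.w(xl) + xl  # crossing with x0, plus residual

x0 = torch.randn(4, 291)  # e.g. a batch of 291 hand-crafted tabular features
x = x0
for layer in [CrossLayerV2(291), CrossLayerV2(291)]:
    x = layer(x0, x)  # stacking layers builds higher-order feature crosses
```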
Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)
## License
MIT
---
*domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.*