Update README v0.3.0 — add usage example, update roadmap status, add implementation report link
README.md

```
Text LLM: "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat]
domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```

## Quick Start

```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# 1. Build tokenizer from schema (Nubank-style: 97 domain tokens + BPE)
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)  # fit magnitude bins on training data
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data (100% token utilization, zero padding waste)
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

# 3. Create model (GPT-style, NoPE, pre-norm — 24M params)
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)

# 4. Pre-train with HF Trainer (cosine schedule, CLM objective)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",  # auto-push to HF Hub
    num_epochs=10, learning_rate=3e-4,
    bf16=True,  # A100/H100
    report_to="trackio",  # live monitoring
)

# 5. Fine-tune for downstream tasks (nuFormer-style joint fusion)
from domain_tokenizer import JointFusionModel
fusion = JointFusionModel(
    transformer_model=model,  # pre-trained, unfrozen
    n_tabular_features=291,   # hand-crafted tabular features
    n_classes=1,              # binary: will the user activate the product?
)
# Train the fusion model end-to-end on labeled data...
```
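
`FINANCE_SCHEMA` is one of the predefined schemas shipped in `domain_tokenizer.schemas`. As a rough, self-contained sketch of what such a schema boils down to (the field names and the `FieldSpec` shape below are illustrative assumptions; the real `DomainSchema`/`FieldSpec` live in `schema.py` and may differ):

```python
from dataclasses import dataclass
from enum import Enum

class FieldType(Enum):
    # illustrative stand-in for the FieldType in schema.py
    CATEGORICAL = "categorical"
    NUMERIC = "numeric"
    TIMESTAMP = "timestamp"

@dataclass
class FieldSpec:
    # illustrative stand-in for the FieldSpec in schema.py
    name: str
    field_type: FieldType
    n_bins: int | None = None  # magnitude buckets for numeric fields

# A finance-style event schema: each field gets its own field tokenizer
FINANCE_SCHEMA_SKETCH = [
    FieldSpec("merchant_category", FieldType.CATEGORICAL),
    FieldSpec("amount", FieldType.NUMERIC, n_bins=16),
    FieldSpec("timestamp", FieldType.TIMESTAMP),
]
```
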
## 🏦 Industry Validation: Nubank's nuFormer
This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:

| Timestamp `2025-03-12` | Calendar-unaware text fragments | `[Wednesday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
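
To make the calendar-aware row concrete, here is a minimal sketch of that mapping (the time-of-day buckets and token spellings are my own assumptions, not the library's):

```python
from datetime import datetime

def calendar_tokens(ts: datetime, prev: datetime | None = None) -> list[str]:
    """Map a raw timestamp to discrete, calendar-aware tokens."""
    tokens = [ts.strftime("%A")]  # weekday name, e.g. "Wednesday"
    buckets = [(6, "Night"), (12, "Morning"), (18, "Afternoon"), (24, "Evening")]
    tokens.append(next(label for bound, label in buckets if ts.hour < bound))
    if prev is not None:  # bucketed gap to the previous event
        days = (ts - prev).days
        tokens.append(f"{days}_days_later" if days > 0 else "same_day")
    return tokens

print(calendar_tokens(datetime(2025, 3, 12, 15, 0), prev=datetime(2025, 3, 10, 9, 0)))
# ['Wednesday', 'Afternoon', '2_days_later']
```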
## Research Foundation

This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm** — the challenge is *how* to tokenize.

| Paradigm | Method | Key Paper |
|----------|--------|-----------|
| **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
| **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
| **Transaction Tokenization** | Special tokens + BPE hybrid | [nuFormer](https://arxiv.org/abs/2507.23267) (Nubank, 2025) |
| **Tabular Tokenization** | Periodic embeddings for numbers | [PLR](https://arxiv.org/abs/2203.05556) (Yandex, 2022) |
| **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |

## Documentation

| Document | Description |
|----------|-------------|
| 📄 [`docs/research_report.md`](docs/research_report.md) | **Research survey** — 31 papers across 5 paradigms, technical taxonomy, blueprint |
| 🏦 [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** — full pipeline reconstruction, 4 academic pillars |
| 🏗️ [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record** — PyTorch+HF vs JAX/Keras, trade-offs, roadmap |
| 📊 [`docs/phase2_implementation_report.md`](docs/phase2_implementation_report.md) | **Implementation report** — Phase 2A-2C technical decisions, architecture, 124 tests |

## Implementation Decision

After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:

**Decision: PyTorch + HuggingFace Transformers** (with JAX as a future scaling path)

Key reasons:
- **5 of 6 reference papers use PyTorch** (including Google DeepMind's ActionPiece)
- **HuggingFace has the only complete custom tokenizer pipeline** (`PreTrainedTokenizerFast` → Trainer → `push_to_hub`; sketched below)
- **Production deployment is direct:** ONNX, TGI, and vLLM are all first-class
- JAX advantages (TPU, XLA) only matter at >1B params on 256+ accelerators — not at our 24M–330M scale

Full analysis: [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md)
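
To make the tokenizer-pipeline point concrete, the plain-HuggingFace route looks roughly like this (a minimal sketch; `DomainTokenizerBuilder` wraps the same machinery, and the special tokens here are sample placeholders):

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Train a small BPE vocabulary, reserving domain tokens up front
tok = Tokenizer(models.BPE(unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(
    vocab_size=8000,
    special_tokens=["[UNK]", "[PAD]", "[Wednesday]", "[Afternoon]"],
)
tok.train_from_iterator(["electronics purchase 59.90", "grocery 12.40"], trainer)

# Wrap as an HF fast tokenizer: now usable with Trainer, datasets, push_to_hub
hf_tok = PreTrainedTokenizerFast(tokenizer_object=tok, unk_token="[UNK]", pad_token="[PAD]")
print(hf_tok.tokenize("electronics purchase"))
```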
## Project Roadmap
### Phase 1: Research & Survey ✅
- Literature survey (35+ papers), Nubank reverse-engineering, framework ADR

### Phase 2: Core Library ✅ (v0.3.0 — 124 tests passing)
- **2A:** Domain tokenizer library — schema, 5 field tokenizers, HF-compatible builder
- **2B:** Model architecture — DomainTransformerForCausalLM (NoPE GPT), PLR embeddings, DCNv2 + JointFusion
- **2C:** Pre-training pipeline — sequence packing (sketched below), DataCollatorForLanguageModeling, HF Trainer
- **2D:** Fine-tuning pipeline (next)
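
The packing step in 2C concatenates tokenized event sequences into one EOS-separated stream and slices it into fixed-size blocks, so every position in a batch carries a real token. Conceptually (a simplified sketch, not the actual `data_pipeline.py`):

```python
def pack_sequences(sequences: list[list[int]], block_size: int, eos_id: int) -> list[list[int]]:
    """Concatenate token sequences (EOS-separated) and cut into full blocks."""
    stream: list[int] = []
    for seq in sequences:
        stream.extend(seq)
        stream.append(eos_id)  # sequence boundary
    n_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

blocks = pack_sequences([[5, 8, 2], [7, 7, 1, 9]], block_size=4, eos_id=0)
# → [[5, 8, 2, 0], [7, 7, 1, 9]]: every block fully used, zero padding
```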

### Phase 3: Domain Demos
- Finance: fraud detection, credit scoring on real data
- E-commerce: next purchase prediction, customer segmentation

### Phase 4: Scale & Optimize
- 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
## Repo Structure

```
src/domain_tokenizer/
├── __init__.py              # v0.3.0 — all public exports
├── schema.py                # DomainSchema, FieldSpec, FieldType
├── tokenizers/
│   ├── field_tokenizers.py  # Sign, MagnitudeBucket, Calendar, Categorical, Discrete
│   └── domain_tokenizer.py  # DomainTokenizerBuilder → HF PreTrainedTokenizerFast
├── schemas/
│   └── predefined.py        # FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
├── models/
│   ├── configuration.py     # DomainTransformerConfig (24M/85M/330M presets)
│   ├── modeling.py          # DomainTransformerForCausalLM (NoPE, SDPA, weight-tied)
│   ├── plr_embeddings.py    # PeriodicLinearReLU (Gorishniy et al. 2022)
│   └── joint_fusion.py      # DCNv2 + JointFusionModel (nuFormer-style)
└── training/
    ├── data_pipeline.py     # tokenize → pack → HFDataset
    └── pretrain.py          # pretrain_domain_model (HF Trainer)
tests/
├── test_tokenizer.py        # 72 tests
├── test_model.py            # 33 tests
└── test_training.py         # 19 tests
```
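
`plr_embeddings.py` follows the periodic-embedding recipe of Gorishniy et al. (2022): each scalar feature is expanded into sine/cosine features with learned frequencies, then passed through a linear layer and a ReLU. A minimal sketch of the idea (dimensions are arbitrary here, and the module's real interface may differ):

```python
import math
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic → Linear → ReLU embedding for a single numeric feature."""
    def __init__(self, n_frequencies: int = 8, d_embedding: int = 16, sigma: float = 1.0):
        super().__init__()
        # Learned frequencies, initialized from N(0, sigma^2) as in the paper
        self.frequencies = nn.Parameter(torch.randn(n_frequencies) * sigma)
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch,)
        v = 2 * math.pi * self.frequencies * x.unsqueeze(-1)  # (batch, n_frequencies)
        periodic = torch.cat([torch.sin(v), torch.cos(v)], dim=-1)
        return torch.relu(self.linear(periodic))  # (batch, d_embedding)

emb = PLREmbedding()
print(emb(torch.tensor([0.5, -1.2])).shape)  # torch.Size([2, 16])
```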
## Key References

| Paper | Year | Role in domainTokenizer | Link |
|-------|------|-------------------------|------|
| **nuFormer** (Nubank) | 2025 | Overall architecture blueprint | [arXiv](https://arxiv.org/abs/2507.23267) |
| **NoPE** | 2023 | No positional encoding — our attention design | [arXiv](https://arxiv.org/abs/2305.19466) |
| **PLR Embeddings** (Yandex) | 2022 | Numerical feature embeddings | [arXiv](https://arxiv.org/abs/2203.05556) |
| **DCN V2** (Google) | 2021 | Tabular feature crossing in joint fusion | [arXiv](https://arxiv.org/abs/2008.13535) |
| **RecFormer** | 2023 | Items-as-text tokenization philosophy | [arXiv](https://arxiv.org/abs/2305.13731) |
| **TIGER** (Google) | 2023 | Semantic IDs via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| **ActionPiece** (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| **Banking TF** | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| **Nested Learning (HOPE)** | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
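
The DCN V2 row refers to the cross layers used on the tabular side of the fusion model: each layer computes `x_{l+1} = x_0 ⊙ (W·x_l + b) + x_l`, an explicit feature interaction plus a residual. A standalone sketch (layer width is illustrative):

```python
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """One DCNv2 cross layer: x_{l+1} = x0 * (W @ x_l + b) + x_l."""
    def __init__(self, d: int):
        super().__init__()
        self.w = nn.Linear(d, d)  # full-rank W with bias b

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.w(xl) + xl  # crossing with x0, plus residual

x0 = torch.randn(4, 291)  # e.g. a batch of 291 hand-crafted tabular features
x = x0
for layer in [CrossLayerV2(291), CrossLayerV2(291)]:
    x = layer(x0, x)  # stacking layers builds higher-order feature crosses
```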
Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)
## License
MIT
---
*domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.*