rtferraz committed (verified)
Commit a239d6e · Parent: 25a1093

Update README: add ADR reference, update documentation table and repo structure

Files changed (1): README.md (+45 -38)
README.md CHANGED
@@ -27,7 +27,7 @@ This isn't just theory. **Nubank** (100M+ customers, Latin America's largest dig
 
  **Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M–330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).
 
- 📄 **Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) — complete pipeline reconstruction with code patterns, scaling laws, and adaptation guides for e-commerce and healthcare.
+ 📄 **Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)
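To make that layout concrete, here is a minimal sketch of the field-token + BPE scheme the line above describes. The bin edges and token names are hypothetical, and whitespace splitting stands in for the trained BPE over merchant strings:

```python
# Illustrative sketch of a Nubank-style transaction tokenizer: a few special
# tokens for structured fields plus subword pieces for the description.
# Bin edges and token names are hypothetical, not Nubank's actual vocabulary.
from datetime import datetime

AMOUNT_BINS = [1, 5, 10, 50, 100, 500, 1000, 5000]  # toy magnitude bin edges

def amount_bin_token(amount: float) -> str:
    """Map |amount| to a coarse magnitude-bin special token."""
    for i, edge in enumerate(AMOUNT_BINS):
        if abs(amount) < edge:
            return f"<AMT_BIN_{i}>"
    return f"<AMT_BIN_{len(AMOUNT_BINS)}>"

def tokenize_transaction(amount: float, ts: datetime, description: str) -> list[str]:
    tokens = [
        "<TX>",                                   # transaction boundary
        "<DEBIT>" if amount < 0 else "<CREDIT>",  # credit/debit sign token
        amount_bin_token(amount),                 # amount magnitude bin
        f"<MONTH_{ts.month}>",                    # calendar decomposition
        f"<DOW_{ts.weekday()}>",
        f"<HOUR_{ts.hour}>",
    ]
    tokens += description.lower().split()         # stand-in for BPE pieces
    return tokens

print(tokenize_transaction(-42.90, datetime(2024, 3, 14, 9, 30), "uber trip sao paulo"))
# ['<TX>', '<DEBIT>', '<AMT_BIN_3>', '<MONTH_3>', '<DOW_3>', '<HOUR_9>',
#  'uber', 'trip', 'sao', 'paulo']
# 10 tokens here; a fuller field set lands near the quoted ~14 per transaction.
```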
 
  ## Why This Matters
 
@@ -42,8 +42,6 @@ This isn't just theory. **Nubank** (100M+ customers, Latin America's largest dig
 
  This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm** — the challenge is *how* to tokenize.
 
- Five paradigms have emerged:
-
  | Paradigm | Method | Key Paper |
  |----------|--------|-----------|
  | **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
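A toy view of the Semantic-ID row above: residual quantization, the mechanism inside RQ-VAE, maps an item embedding to a short tuple of discrete codes. The random codebooks here are purely illustrative; TIGER learns them end to end with an encoder-decoder:

```python
# Toy residual quantization: an item embedding becomes a short tuple of
# discrete codes, one per codebook level. Codebooks are random for brevity.
import numpy as np

rng = np.random.default_rng(0)
DIM, LEVELS, CODES = 64, 3, 256
codebooks = rng.normal(size=(LEVELS, CODES, DIM))  # one codebook per level

def semantic_id(item_embedding: np.ndarray) -> tuple:
    """Greedily quantize: each level encodes the residual left by the previous."""
    residual, codes = item_embedding, []
    for level in range(LEVELS):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(dists.argmin())                    # nearest code at this level
        codes.append(idx)
        residual = residual - codebooks[level][idx]  # carry the remainder down
    return tuple(codes)

item = rng.normal(size=DIM)
print(semantic_id(item))  # e.g. (17, 203, 88): three tokens name one item
```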
@@ -56,48 +54,58 @@ Five paradigms have emerged:
 
  | Document | Description |
  |----------|-------------|
- | [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey** — 31 papers across 5 paradigms, technical taxonomy of tokenization methods, full blueprint for building domain tokenizers |
- | [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** — complete pipeline reconstruction (tokenizer → pre-training → joint fusion), all 4 academic pillars explained, adaptation playbooks for finance/e-commerce/healthcare |
+ | 📄 [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey** — 31 papers across 5 paradigms, technical taxonomy, full blueprint |
+ | 🏦 [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** — complete pipeline reconstruction, 4 academic pillars, adaptation playbooks |
+ | 🏗️ [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record** — framework choice (PyTorch+HF), trade-offs vs JAX/Keras, detailed implementation roadmap with code |
 
- ## Project Vision
+ ## Implementation Decision
+
+ After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:
+
+ **Decision: PyTorch + HuggingFace Transformers** (with JAX as a future scaling path)
+
+ Key reasons:
+ - **5 of 6 reference papers use PyTorch** (including Google DeepMind's ActionPiece)
+ - **HuggingFace has the only complete custom tokenizer pipeline** (`PreTrainedTokenizerFast` → Trainer → push_to_hub)
+ - **Production deployment is direct:** ONNX, TGI, and vLLM are all first-class
+ - JAX advantages (TPU, XLA) only matter at >1B params on 256+ accelerators — not at our 24M–330M scale
+
+ Full analysis: [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md)
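As a rough sketch of the `PreTrainedTokenizerFast` route named in the second bullet (toy word-level vocabulary; the project's real schema-driven builder is Phase 2 work):

```python
# Sketch: wrap a custom word-level domain vocabulary as a PreTrainedTokenizerFast.
# The vocabulary below is a toy stand-in for a generated domain vocabulary.
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

vocab = {tok: i for i, tok in enumerate([
    "[PAD]", "[UNK]", "<TX>", "<DEBIT>", "<CREDIT>",
    "<AMT_BIN_3>", "<MONTH_3>", "<DOW_3>", "<HOUR_9>", "uber", "trip",
])}
tok = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.WhitespaceSplit()  # fields are space-separated

hf_tok = PreTrainedTokenizerFast(
    tokenizer_object=tok, pad_token="[PAD]", unk_token="[UNK]"
)
print(hf_tok("<TX> <DEBIT> <AMT_BIN_3> uber trip")["input_ids"])  # [2, 3, 5, 9, 10]
```

From here, `save_pretrained`, `push_to_hub`, and the `Trainer` treat it like any stock tokenizer, which is the point of the pipeline bullet above.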
 
+ ## Project Roadmap
 
  ### Phase 1: Research & Survey ✅
- - Literature survey of domain tokenization methods (31+ papers)
- - Nubank nuFormer full reverse-engineering
- - Blueprint for a general-purpose domain tokenizer
+ - Literature survey (35+ papers)
+ - Nubank nuFormer reverse-engineering
+ - Framework ADR with detailed implementation plan
 
- ### Phase 2: Core Tokenizer Library (Next)
- - Implement per-field tokenizers:
-   - `SignBucketTokenizer` — sign + magnitude bins for monetary values (Nubank-style)
-   - `CalendarTokenizer` — month/dow/dom/hour decomposition
-   - `SemanticIDTokenizer` — RQ-VAE for entity encoding (TIGER-style)
-   - `PLREmbedding` — periodic linear representation for numerical tabular features
-   - `CompositeTokenizer` — BPE-like merging of multi-field patterns (ActionPiece-style)
- - Schema-driven automatic tokenizer selection
+ ### Phase 2: Core Library (Next — ~9 weeks)
+ - **Weeks 1–3:** Domain tokenizer library (schema → per-field tokenizers → HF-compatible composite tokenizer)
+ - **Weeks 3–5:** GPT-style Transformer with NoPE + PLR embeddings + DCNv2 joint fusion
+ - **Weeks 5–7:** Pre-training pipeline (CLM on domain sequences via HF Trainer)
+ - **Weeks 7–9:** Fine-tuning pipeline (nuFormer-style joint fusion)
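One possible shape for a Weeks 1–3 field tokenizer, reusing the `SignBucketTokenizer` name from the earlier plan above; the quantile-fitted bin edges are an assumption, not a committed design:

```python
# Hypothetical SignBucketTokenizer: sign token + quantile-bucketed magnitude
# token for monetary values. Interface and names are provisional (Phase 2 work).
import numpy as np

class SignBucketTokenizer:
    """Sign + magnitude-bin tokens for monetary values, with data-fitted bins."""

    def __init__(self, n_bins: int = 8):
        self.n_bins = n_bins
        self.edges = None

    def fit(self, amounts: np.ndarray) -> "SignBucketTokenizer":
        # Quantile edges over |amount| give roughly equal-mass bins.
        qs = np.linspace(0, 1, self.n_bins + 1)[1:-1]
        self.edges = np.quantile(np.abs(amounts), qs)
        return self

    def tokenize(self, amount: float) -> list[str]:
        sign = "<DEBIT>" if amount < 0 else "<CREDIT>"
        bin_idx = int(np.searchsorted(self.edges, abs(amount)))
        return [sign, f"<AMT_BIN_{bin_idx}>"]

amounts = np.array([-3.5, 12.0, -89.9, 250.0, -7.25, 1999.0, 42.0, -5.0])
tok = SignBucketTokenizer(n_bins=4).fit(amounts)
print(tok.tokenize(-89.9))  # ['<DEBIT>', '<AMT_BIN_2>']
```

Fitting edges on training data keeps the buckets balanced even when transaction amounts are heavy-tailed.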
 
- ### Phase 3: Pre-training Framework
- - GPT-style causal LM with NoPE (no positional encoding)
- - Small Transformer backbone (24M–350M parameters, following Nubank's sizes)
- - Joint fusion with DCNv2 + PLR for tabular features
- - Domain-agnostic training loop
-
- ### Phase 4: Domain Demos
- - Finance: transaction recommendation, fraud detection, credit scoring
+ ### Phase 3: Domain Demos (Weeks 9–12)
+ - Finance: fraud detection, credit scoring
  - E-commerce: next purchase prediction, customer segmentation
- - Healthcare: clinical event prediction
+
+ ### Phase 4: Scale & Optimize (Weeks 12+)
+ - 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
 
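PLR, which survives from the old Phase 3 into the Weeks 3–5 item, embeds each scalar as sin/cos of learned frequencies followed by a linear layer and ReLU, per the Yandex paper cited below. A compact PyTorch sketch with illustrative sizes:

```python
# Sketch of PLR (periodic-linear) numerical embeddings: scalar -> sin/cos of
# learned frequencies -> Linear -> ReLU. Hyperparameters here are illustrative.
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic -> Linear -> ReLU embedding for one numerical feature."""

    def __init__(self, n_frequencies: int = 16, d_embedding: int = 24, sigma: float = 1.0):
        super().__init__()
        # Learned frequencies, initialized from N(0, sigma^2) as in the paper.
        self.freq = nn.Parameter(torch.randn(n_frequencies) * sigma)
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) scalar feature -> (batch, d_embedding)
        angles = 2 * torch.pi * self.freq * x.unsqueeze(-1)  # (batch, n_freq)
        periodic = torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)
        return torch.relu(self.linear(periodic))

emb = PLREmbedding()
print(emb(torch.tensor([0.1, 2.3, -0.7])).shape)  # torch.Size([3, 24])
```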
  ## Repo Structure
 
  ```
  domainTokenizer/
  ├── docs/
- │   ├── research_report.md            # 51KB — Full research survey (31+ papers)
- │   └── nubank_nuformer_analysis.md   # Nubank pipeline reverse-engineering
- ├── src/                              # (Phase 2) Core library
- │   ├── tokenizers/                   # Per-field tokenizer implementations
- │   ├── models/                       # Small Transformer backbones
- │   └── training/                     # Pre-training and fine-tuning
- ├── examples/                         # (Phase 4) Domain-specific demos
+ │   ├── research_report.md                        # 51KB — Full research survey
+ │   ├── nubank_nuformer_analysis.md               # 29KB — Nubank pipeline analysis
+ │   └── adr/
+ │       └── ADR-001-implementation-framework.md   # Framework decision + roadmap
+ ├── src/                                          # (Phase 2) Core library
+ │   ├── tokenizers/                               # Schema, field tokenizers, composite builder
+ │   ├── models/                                   # DomainTransformer, PLR, DCNv2, JointFusion
+ │   └── training/                                 # Data pipeline, pre-training, fine-tuning
+ ├── examples/                                     # (Phase 3) Domain-specific demos
  └── README.md
  ```
 
@@ -112,12 +120,11 @@ domainTokenizer/
  | PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | [arXiv](https://arxiv.org/abs/2203.05556) |
  | DCN V2 (Google) | 2021 | Feature crossing for tabular data | [arXiv](https://arxiv.org/abs/2008.13535) |
  | NoPE | 2023 | No positional encoding beats RoPE/ALiBi | [arXiv](https://arxiv.org/abs/2305.19466) |
+ | KL3M Tokenizers | 2025 | Domain-specific BPE for finance/legal | [arXiv](https://arxiv.org/abs/2503.17247) |
  | Banking TF | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
- | Meta-Transformer | 2023 | 12 modalities, one token space | [arXiv](https://arxiv.org/abs/2307.10802) |
- | NTP Survey | 2024 | Comprehensive multimodal NTP taxonomy | [arXiv](https://arxiv.org/abs/2412.18619) |
  | Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
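For orientation, the DCN V2 feature crossing in the table reduces to one recurrence, x_{l+1} = x_0 ⊙ (W x_l + b) + x_l. A minimal sketch with arbitrary dimensions:

```python
# Minimal DCN V2 cross layer: explicit feature crossing via
# x_{l+1} = x0 * (W x_l + b) + x_l, where * is element-wise.
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """One DCN V2 cross layer; stacking L of them models degree-(L+1) crosses."""

    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Linear(dim, dim)  # full-rank W and bias b

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.w(xl) + xl   # element-wise cross with residual

x0 = torch.randn(4, 32)               # (batch, feature_dim) tabular input
x = x0
for layer in [CrossLayerV2(32), CrossLayerV2(32)]:
    x = layer(x0, x)                   # two stacked cross layers
print(x.shape)                         # torch.Size([4, 32])
```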
 
 
- See the [full reference table](docs/research_report.md#10-complete-paper-reference-table) with 35+ papers in the research report.
+ Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)
 
  ## License
 