Update README: add ADR reference, update documentation table and repo structure

README.md (changed)
@@ -27,7 +27,7 @@ This isn't just theory. **Nubank** (100M+ customers, Latin America's largest dig

**Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M–330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).
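As a concrete sketch of that recipe (every field name, bin edge, and special token below is a hypothetical stand-in, and a plain whitespace split stands in for the trained merchant BPE):

```python
# Illustrative only -- the actual token inventory is reconstructed in
# docs/nubank_nuformer_analysis.md.
from datetime import datetime

AMOUNT_BINS = [1, 5, 10, 50, 100, 500, 1000]  # hypothetical bin edges

def amount_token(amount: float) -> str:
    """Map a transaction amount onto a small set of bin tokens."""
    for i, edge in enumerate(AMOUNT_BINS):
        if abs(amount) < edge:
            return f"<amt_bin_{i}>"
    return f"<amt_bin_{len(AMOUNT_BINS)}>"

def tokenize_transaction(amount: float, ts: datetime, merchant: str) -> list[str]:
    """One transaction -> special tokens for the structured fields,
    plus subwords for the free-text merchant descriptor."""
    tokens = [
        "<txn>",
        "<debit>" if amount < 0 else "<credit>",  # credit/debit sign
        amount_token(amount),                     # amount bin
        f"<dow_{ts.weekday()}>",                  # calendar features
        f"<dom_{ts.day}>",
    ]
    tokens += merchant.lower().split()            # stand-in for merchant BPE
    tokens.append("</txn>")
    return tokens

print(tokenize_transaction(-37.90, datetime(2024, 5, 17), "PADARIA SILVA SP"))
# ['<txn>', '<debit>', '<amt_bin_3>', '<dow_4>', '<dom_17>',
#  'padaria', 'silva', 'sp', '</txn>']
```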

🔍 **Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)

## Why This Matters

@@ -42,8 +42,6 @@ This isn't just theory. **Nubank** (100M+ customers, Latin America's largest dig

This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm** – the challenge is *how* to tokenize.

| Paradigm | Method | Key Paper |
|----------|--------|-----------|
| **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |

@@ -56,48 +54,58 @@

| Document | Description |
|----------|-------------|
| 🔍 [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey** – 31 papers across 5 paradigms, technical taxonomy, full blueprint |
| 📦 [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** – complete pipeline reconstruction, 4 academic pillars, adaptation playbooks |
| 🏛️ [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record** – framework choice (PyTorch + HF), trade-offs vs JAX/Keras, detailed implementation roadmap with code |

## Implementation Decision

After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:

**Decision: PyTorch + HuggingFace Transformers** (with JAX as a future scaling path)

Key reasons:

- **5 of 6 reference papers use PyTorch** (including Google DeepMind's ActionPiece)
- **HuggingFace has the only complete custom tokenizer pipeline** (`PreTrainedTokenizerFast` → Trainer → `push_to_hub`; sketched below)
- **Production deployment is direct:** ONNX, TGI, and vLLM are all first-class
- JAX's advantages (TPU, XLA) only matter at >1B params on 256+ accelerators – not at our 24M–330M scale
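A minimal sketch of that pipeline, assuming a tiny hypothetical word-level vocabulary of domain tokens:

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

# Tiny hypothetical vocabulary; the real one comes from the schema builder.
vocab = {"[UNK]": 0, "[PAD]": 1, "<txn>": 2, "</txn>": 3,
         "<debit>": 4, "<credit>": 5, "<amt_bin_3>": 6}
tok = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.WhitespaceSplit()

# Wrapping makes it a drop-in for Trainer, datasets, and the Hub.
hf_tok = PreTrainedTokenizerFast(tokenizer_object=tok,
                                 unk_token="[UNK]", pad_token="[PAD]")
print(hf_tok("<txn> <debit> <amt_bin_3> </txn>")["input_ids"])  # [2, 4, 6, 3]
# hf_tok.push_to_hub("org/domain-tokenizer")  # placeholder repo id
```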

Full analysis: [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md)

## Project Roadmap

### Phase 1: Research & Survey ✅

- Literature survey (35+ papers)
- Nubank nuFormer reverse-engineering
- Framework ADR with detailed implementation plan

### Phase 2: Core Library (Next – ~9 weeks)

- **Weeks 1–3:** Domain tokenizer library (schema → per-field tokenizers → HF-compatible composite tokenizer)
- **Weeks 3–5:** GPT-style Transformer with NoPE + PLR embeddings + DCNv2 joint fusion (see the sketch after this list)
- **Weeks 5–7:** Pre-training pipeline (CLM on domain sequences via HF Trainer)
- **Weeks 7–9:** Fine-tuning pipeline (nuFormer-style joint fusion)
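For the Weeks 3–5 building blocks, a minimal PLR (periodic-linear-ReLU) embedding sketch in the spirit of Gorishniy et al. ([arXiv:2203.05556](https://arxiv.org/abs/2203.05556)); dimensions are illustrative, and the linear layer is shared across features here for brevity where the paper uses per-feature weights:

```python
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Embed each scalar feature with trainable periodic frequencies,
    then a linear projection and ReLU."""
    def __init__(self, n_features: int, n_frequencies: int = 48,
                 d_embedding: int = 24, sigma: float = 0.01):
        super().__init__()
        # One set of trainable frequencies per feature.
        self.freq = nn.Parameter(torch.randn(n_features, n_frequencies) * sigma)
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features) -> (batch, n_features, d_embedding)
        angles = 2 * torch.pi * self.freq * x[..., None]
        periodic = torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)
        return torch.relu(self.linear(periodic))

emb = PLREmbedding(n_features=3)
print(emb(torch.randn(8, 3)).shape)  # torch.Size([8, 3, 24])
```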

### Phase 3: Domain Demos (Weeks 9–12)

- Finance: fraud detection, credit scoring
- E-commerce: next purchase prediction, customer segmentation

### Phase 4: Scale & Optimize (Weeks 12+)

- 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary (see the sketch below)
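A hypothetical sketch of those optimization toggles on a stand-in model:

```python
import torch
import torch.nn as nn

# "model" stands in for the eventual DomainTransformer.
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000)).eval()
compiled = torch.compile(model)                # fused kernels for PyTorch serving
example_ids = torch.randint(0, 1000, (1, 14))  # ~14 tokens per transaction
logits = compiled(example_ids)                 # first call triggers compilation

# Export the eager model for ONNX-based deployment.
torch.onnx.export(model, (example_ids,), "domain_transformer.onnx",
                  input_names=["input_ids"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq"}})
```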

## Repo Structure

```
domainTokenizer/
├── docs/
│   ├── research_report.md                      # 51KB – Full research survey
│   ├── nubank_nuformer_analysis.md             # 29KB – Nubank pipeline analysis
│   └── adr/
│       └── ADR-001-implementation-framework.md # Framework decision + roadmap
├── src/                                        # (Phase 2) Core library
│   ├── tokenizers/                             # Schema, field tokenizers, composite builder
│   ├── models/                                 # DomainTransformer, PLR, DCNv2, JointFusion
│   └── training/                               # Data pipeline, pre-training, fine-tuning
├── examples/                                   # (Phase 3) Domain-specific demos
└── README.md
```

@@ -112,12 +120,11 @@

| PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | [arXiv](https://arxiv.org/abs/2203.05556) |
| DCN V2 (Google) | 2021 | Feature crossing for tabular data | [arXiv](https://arxiv.org/abs/2008.13535) |
| NoPE | 2023 | No positional encoding beats RoPE/ALiBi | [arXiv](https://arxiv.org/abs/2305.19466) |
| KL3M Tokenizers | 2025 | Domain-specific BPE for finance/legal | [arXiv](https://arxiv.org/abs/2503.17247) |
| Banking TF | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |

Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)

## License