rtferraz committed (verified)
Commit a239d6e · Parent: 25a1093

Update README: add ADR reference, update documentation table and repo structure

Files changed (1): README.md (+45 -38)
README.md CHANGED
@@ -27,7 +27,7 @@ This isn't just theory. **Nubank** (100M+ customers, Latin America's largest dig
 
  **Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M–330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).
 
- 📄 **Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) — complete pipeline reconstruction with code patterns, scaling laws, and adaptation guides for e-commerce and healthcare.
+ 📄 **Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)
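To make that layout concrete, here is a minimal sketch of the field-token + BPE scheme the line above describes. The bin edges and token names are hypothetical, and whitespace splitting stands in for the trained BPE over merchant strings:

```python
# Illustrative sketch of a Nubank-style transaction tokenizer: a few special
# tokens for structured fields plus subword pieces for the description.
# Bin edges and token names are hypothetical, not Nubank's actual vocabulary.
from datetime import datetime

AMOUNT_BINS = [1, 5, 10, 50, 100, 500, 1000, 5000]  # toy magnitude bin edges

def amount_bin_token(amount: float) -> str:
    """Map |amount| to a coarse magnitude-bin special token."""
    for i, edge in enumerate(AMOUNT_BINS):
        if abs(amount) < edge:
            return f"<AMT_BIN_{i}>"
    return f"<AMT_BIN_{len(AMOUNT_BINS)}>"

def tokenize_transaction(amount: float, ts: datetime, description: str) -> list[str]:
    tokens = [
        "<TX>",                                   # transaction boundary
        "<DEBIT>" if amount < 0 else "<CREDIT>",  # credit/debit sign token
        amount_bin_token(amount),                 # amount magnitude bin
        f"<MONTH_{ts.month}>",                    # calendar decomposition
        f"<DOW_{ts.weekday()}>",
        f"<HOUR_{ts.hour}>",
    ]
    tokens += description.lower().split()         # stand-in for BPE pieces
    return tokens

print(tokenize_transaction(-42.90, datetime(2024, 3, 14, 9, 30), "uber trip sao paulo"))
# ['<TX>', '<DEBIT>', '<AMT_BIN_3>', '<MONTH_3>', '<DOW_3>', '<HOUR_9>',
#  'uber', 'trip', 'sao', 'paulo']
# 10 tokens here; a fuller field set lands near the quoted ~14 per transaction.
```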
 
  ## Why This Matters
 
@@ -42,8 +42,6 @@ This isn't just theory. **Nubank** (100M+ customers, Latin America's largest dig
 
  This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm** — the challenge is *how* to tokenize.
 
- Five paradigms have emerged:
-
  | Paradigm | Method | Key Paper |
  |----------|--------|-----------|
  | **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
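A toy view of the Semantic-ID row above: residual quantization, the mechanism inside RQ-VAE, maps an item embedding to a short tuple of discrete codes. The random codebooks here are purely illustrative; TIGER learns them end to end with an encoder-decoder:

```python
# Toy residual quantization: an item embedding becomes a short tuple of
# discrete codes, one per codebook level. Codebooks are random for brevity.
import numpy as np

rng = np.random.default_rng(0)
DIM, LEVELS, CODES = 64, 3, 256
codebooks = rng.normal(size=(LEVELS, CODES, DIM))  # one codebook per level

def semantic_id(item_embedding: np.ndarray) -> tuple:
    """Greedily quantize: each level encodes the residual left by the previous."""
    residual, codes = item_embedding, []
    for level in range(LEVELS):
        dists = np.linalg.norm(codebooks[level] - residual, axis=1)
        idx = int(dists.argmin())                    # nearest code at this level
        codes.append(idx)
        residual = residual - codebooks[level][idx]  # carry the remainder down
    return tuple(codes)

item = rng.normal(size=DIM)
print(semantic_id(item))  # e.g. (17, 203, 88): three tokens name one item
```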
@@ -56,48 +54,58 @@ Five paradigms have emerged:
 
  | Document | Description |
  |----------|-------------|
- | [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey** — 31 papers across 5 paradigms, technical taxonomy of tokenization methods, full blueprint for building domain tokenizers |
- | [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** — complete pipeline reconstruction (tokenizer → pre-training → joint fusion), all 4 academic pillars explained, adaptation playbooks for finance/e-commerce/healthcare |
+ | 📄 [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey** — 31 papers across 5 paradigms, technical taxonomy, full blueprint |
+ | 🏦 [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** — complete pipeline reconstruction, 4 academic pillars, adaptation playbooks |
+ | 🏗️ [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record** — framework choice (PyTorch+HF), trade-offs vs JAX/Keras, detailed implementation roadmap with code |
 
- ## Project Vision
+ ## Implementation Decision
+
+ After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:
+
+ **Decision: PyTorch + HuggingFace Transformers** (with JAX as a future scaling path)
+
+ Key reasons:
+ - **5 of 6 reference papers use PyTorch** (including Google DeepMind's ActionPiece)
+ - **HuggingFace has the only complete custom tokenizer pipeline** (`PreTrainedTokenizerFast` → Trainer → push_to_hub)
+ - **Production deployment is direct:** ONNX, TGI, and vLLM are all first-class
+ - JAX advantages (TPU, XLA) only matter at >1B params on 256+ accelerators — not at our 24M–330M scale
+
+ Full analysis: [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md)
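As a rough sketch of the `PreTrainedTokenizerFast` route named in the second bullet (toy word-level vocabulary; the project's real schema-driven builder is Phase 2 work):

```python
# Sketch: wrap a custom word-level domain vocabulary as a PreTrainedTokenizerFast.
# The vocabulary below is a toy stand-in for a generated domain vocabulary.
from tokenizers import Tokenizer, models, pre_tokenizers
from transformers import PreTrainedTokenizerFast

vocab = {tok: i for i, tok in enumerate([
    "[PAD]", "[UNK]", "<TX>", "<DEBIT>", "<CREDIT>",
    "<AMT_BIN_3>", "<MONTH_3>", "<DOW_3>", "<HOUR_9>", "uber", "trip",
])}
tok = Tokenizer(models.WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = pre_tokenizers.WhitespaceSplit()  # fields are space-separated

hf_tok = PreTrainedTokenizerFast(
    tokenizer_object=tok, pad_token="[PAD]", unk_token="[UNK]"
)
print(hf_tok("<TX> <DEBIT> <AMT_BIN_3> uber trip")["input_ids"])  # [2, 3, 5, 9, 10]
```

From here, `save_pretrained`, `push_to_hub`, and the `Trainer` treat it like any stock tokenizer, which is the point of the pipeline bullet above.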
 
+ ## Project Roadmap
 
  ### Phase 1: Research & Survey ✅
- - Literature survey of domain tokenization methods (31+ papers)
- - Nubank nuFormer full reverse-engineering
- - Blueprint for a general-purpose domain tokenizer
+ - Literature survey (35+ papers)
+ - Nubank nuFormer reverse-engineering
+ - Framework ADR with detailed implementation plan
 
- ### Phase 2: Core Tokenizer Library (Next)
- - Implement per-field tokenizers:
-   - `SignBucketTokenizer` — sign + magnitude bins for monetary values (Nubank-style)
-   - `CalendarTokenizer` — month/dow/dom/hour decomposition
-   - `SemanticIDTokenizer` — RQ-VAE for entity encoding (TIGER-style)
-   - `PLREmbedding` — periodic linear representation for numerical tabular features
-   - `CompositeTokenizer` — BPE-like merging of multi-field patterns (ActionPiece-style)
- - Schema-driven automatic tokenizer selection
+ ### Phase 2: Core Library (Next — ~9 weeks)
+ - **Weeks 1–3:** Domain tokenizer library (schema → per-field tokenizers → HF-compatible composite tokenizer)
+ - **Weeks 3–5:** GPT-style Transformer with NoPE + PLR embeddings + DCNv2 joint fusion
+ - **Weeks 5–7:** Pre-training pipeline (CLM on domain sequences via HF Trainer)
+ - **Weeks 7–9:** Fine-tuning pipeline (nuFormer-style joint fusion)
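One possible shape for a Weeks 1–3 field tokenizer, reusing the `SignBucketTokenizer` name from the earlier plan above; the quantile-fitted bin edges are an assumption, not a committed design:

```python
# Hypothetical SignBucketTokenizer: sign token + quantile-bucketed magnitude
# token for monetary values. Interface and names are provisional (Phase 2 work).
import numpy as np

class SignBucketTokenizer:
    """Sign + magnitude-bin tokens for monetary values, with data-fitted bins."""

    def __init__(self, n_bins: int = 8):
        self.n_bins = n_bins
        self.edges = None

    def fit(self, amounts: np.ndarray) -> "SignBucketTokenizer":
        # Quantile edges over |amount| give roughly equal-mass bins.
        qs = np.linspace(0, 1, self.n_bins + 1)[1:-1]
        self.edges = np.quantile(np.abs(amounts), qs)
        return self

    def tokenize(self, amount: float) -> list[str]:
        sign = "<DEBIT>" if amount < 0 else "<CREDIT>"
        bin_idx = int(np.searchsorted(self.edges, abs(amount)))
        return [sign, f"<AMT_BIN_{bin_idx}>"]

amounts = np.array([-3.5, 12.0, -89.9, 250.0, -7.25, 1999.0, 42.0, -5.0])
tok = SignBucketTokenizer(n_bins=4).fit(amounts)
print(tok.tokenize(-89.9))  # ['<DEBIT>', '<AMT_BIN_2>']
```

Fitting edges on training data keeps the buckets balanced even when transaction amounts are heavy-tailed.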
 
- ### Phase 3: Pre-training Framework
- - GPT-style causal LM with NoPE (no positional encoding)
- - Small Transformer backbone (24M–350M parameters, following Nubank's sizes)
- - Joint fusion with DCNv2 + PLR for tabular features
- - Domain-agnostic training loop
-
- ### Phase 4: Domain Demos
- - Finance: transaction recommendation, fraud detection, credit scoring
+ ### Phase 3: Domain Demos (Weeks 9–12)
+ - Finance: fraud detection, credit scoring
  - E-commerce: next purchase prediction, customer segmentation
- - Healthcare: clinical event prediction
+
+ ### Phase 4: Scale & Optimize (Weeks 12+)
+ - 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
 
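PLR, which survives from the old Phase 3 into the Weeks 3–5 item, embeds each scalar as sin/cos of learned frequencies followed by a linear layer and ReLU, per the Yandex paper cited below. A compact PyTorch sketch with illustrative sizes:

```python
# Sketch of PLR (periodic-linear) numerical embeddings: scalar -> sin/cos of
# learned frequencies -> Linear -> ReLU. Hyperparameters here are illustrative.
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic -> Linear -> ReLU embedding for one numerical feature."""

    def __init__(self, n_frequencies: int = 16, d_embedding: int = 24, sigma: float = 1.0):
        super().__init__()
        # Learned frequencies, initialized from N(0, sigma^2) as in the paper.
        self.freq = nn.Parameter(torch.randn(n_frequencies) * sigma)
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) scalar feature -> (batch, d_embedding)
        angles = 2 * torch.pi * self.freq * x.unsqueeze(-1)  # (batch, n_freq)
        periodic = torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)
        return torch.relu(self.linear(periodic))

emb = PLREmbedding()
print(emb(torch.tensor([0.1, 2.3, -0.7])).shape)  # torch.Size([3, 24])
```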
  ## Repo Structure
 
  ```
  domainTokenizer/
  ├── docs/
- │   ├── research_report.md            # 51KB — Full research survey (31+ papers)
- │   └── nubank_nuformer_analysis.md   # Nubank pipeline reverse-engineering
- ├── src/                              # (Phase 2) Core library
- │   ├── tokenizers/                   # Per-field tokenizer implementations
- │   ├── models/                       # Small Transformer backbones
- │   └── training/                     # Pre-training and fine-tuning
- ├── examples/                         # (Phase 4) Domain-specific demos
+ │   ├── research_report.md                        # 51KB — Full research survey
+ │   ├── nubank_nuformer_analysis.md               # 29KB — Nubank pipeline analysis
+ │   └── adr/
+ │       └── ADR-001-implementation-framework.md   # Framework decision + roadmap
+ ├── src/                                          # (Phase 2) Core library
+ │   ├── tokenizers/                               # Schema, field tokenizers, composite builder
+ │   ├── models/                                   # DomainTransformer, PLR, DCNv2, JointFusion
+ │   └── training/                                 # Data pipeline, pre-training, fine-tuning
+ ├── examples/                                     # (Phase 3) Domain-specific demos
  └── README.md
  ```
 
@@ -112,12 +120,11 @@ domainTokenizer/
  | PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | [arXiv](https://arxiv.org/abs/2203.05556) |
  | DCN V2 (Google) | 2021 | Feature crossing for tabular data | [arXiv](https://arxiv.org/abs/2008.13535) |
  | NoPE | 2023 | No positional encoding beats RoPE/ALiBi | [arXiv](https://arxiv.org/abs/2305.19466) |
+ | KL3M Tokenizers | 2025 | Domain-specific BPE for finance/legal | [arXiv](https://arxiv.org/abs/2503.17247) |
  | Banking TF | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
- | Meta-Transformer | 2023 | 12 modalities, one token space | [arXiv](https://arxiv.org/abs/2307.10802) |
- | NTP Survey | 2024 | Comprehensive multimodal NTP taxonomy | [arXiv](https://arxiv.org/abs/2412.18619) |
  | Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
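For orientation, the DCN V2 feature crossing in the table reduces to one recurrence, x_{l+1} = x_0 ⊙ (W x_l + b) + x_l. A minimal sketch with arbitrary dimensions:

```python
# Minimal DCN V2 cross layer: explicit feature crossing via
# x_{l+1} = x0 * (W x_l + b) + x_l, where * is element-wise.
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """One DCN V2 cross layer; stacking L of them models degree-(L+1) crosses."""

    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Linear(dim, dim)  # full-rank W and bias b

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.w(xl) + xl   # element-wise cross with residual

x0 = torch.randn(4, 32)               # (batch, feature_dim) tabular input
x = x0
for layer in [CrossLayerV2(32), CrossLayerV2(32)]:
    x = layer(x0, x)                   # two stacked cross layers
print(x.shape)                         # torch.Size([4, 32])
```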
 
 
- See the [full reference table](docs/research_report.md#10-complete-paper-reference-table) with 35+ papers in the research report.
+ Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)
 
  ## License
 