# 🔑 domainTokenizer
**Building small models that understand domain tokens, not just words.**
---
## The Idea
LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.
But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.
**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?
```
Text LLM:        "The cat sat on the mat"  → [The] [cat] [sat] [on] [the] [mat] → Transformer → next word
domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
```
## 🏦 Industry Validation: Nubank's nuFormer
This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:
- **Paper:** ["Your spending needs attention: Modeling financial habits with transformers"](https://arxiv.org/abs/2507.23267) (arXiv: 2507.23267)
- **Blog series:** [Building Nubank – Foundation Models](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/)
**Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M–330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).
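As a rough sketch of that recipe (the bin edges, token names, and field layout below are illustrative assumptions, not Nubank's published 97-token vocabulary), a single transaction flattens into a handful of special tokens plus BPE pieces for the merchant string:

```python
from datetime import datetime

# Illustrative only: bin edges and token names are our assumptions,
# not Nubank's published special-token vocabulary.
AMOUNT_BINS = [5, 20, 50, 100, 250, 1000]  # USD bin edges

def tokenize_transaction(amount: float, timestamp: str, merchant: str) -> list[str]:
    dt = datetime.fromisoformat(timestamp)
    sign = "[CREDIT]" if amount < 0 else "[DEBIT]"   # sign convention is assumed
    bin_idx = sum(abs(amount) >= edge for edge in AMOUNT_BINS)
    tokens = [
        "[TXN]",                                # transaction boundary marker
        sign,                                   # credit/debit sign token
        f"[AMOUNT_BIN_{bin_idx}]",              # quantized amount
        f"[DOW_{dt.strftime('%a').upper()}]",   # day-of-week calendar token
        f"[HOUR_{dt.hour:02d}]",                # hour-of-day calendar token
    ]
    # Merchant descriptions would go through a trained BPE tokenizer;
    # whitespace splitting stands in for it here.
    tokens += [f"[M:{piece.lower()}]" for piece in merchant.split()]
    return tokens

print(tokenize_transaction(79.99, "2025-03-15T14:32:00", "AMAZON MKTPLACE"))
# ['[TXN]', '[DEBIT]', '[AMOUNT_BIN_3]', '[DOW_SAT]', '[HOUR_14]', '[M:amazon]', '[M:mktplace]']
```

Adding more structured fields would push the count toward the ~14 tokens per transaction reported in the paper.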
📄 **Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)
## Why This Matters
| Problem | Text Tokenizer | Domain Tokenizer |
|---------|---------------|-----------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Wednesday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
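For the Semantic ID row, the hierarchical code comes from residual quantization of a learned item embedding: each codebook level quantizes what the previous level left over. A toy NumPy sketch, with random codebooks standing in for the RQ-VAE that TIGER actually trains:

```python
import numpy as np

rng = np.random.default_rng(0)
d, codebook_size, levels = 64, 256, 3

# Illustrative random codebooks; TIGER learns them jointly inside an RQ-VAE.
codebooks = [rng.normal(size=(codebook_size, d)) for _ in range(levels)]

def semantic_id(item_embedding: np.ndarray) -> list[int]:
    residual = item_embedding
    codes = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest code
        codes.append(idx)
        residual = residual - cb[idx]  # quantize what is left over
    return codes

item = rng.normal(size=d)  # stand-in for a learned product embedding
print(semantic_id(item))   # e.g. [137, 52, 201]: a coarse-to-fine hierarchical ID
```

After training, early levels tend to capture coarse structure such as product category, which is why the leading code can be read as `Electronics`.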
## Research Foundation
This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm**; the challenge is *how* to tokenize.
| Paradigm | Method | Key Paper |
|----------|--------|-----------|
| **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
| **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
| **Transaction Tokenization** | Special tokens + BPE hybrid | [nuFormer](https://arxiv.org/abs/2507.23267) (Nubank, 2025) |
| **Tabular Tokenization** | Periodic embeddings for numbers | [PLR](https://arxiv.org/abs/2203.05556) (Yandex, 2022) |
| **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |
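To make the second paradigm concrete, here is a toy version of the merge step (simplified: ActionPiece merges within unordered feature sets with context-aware counting, while this sketch does plain adjacent-pair BPE over invented tokens):

```python
from collections import Counter

# Toy corpus of per-event feature tokens; all names are illustrative.
corpus = [
    ["Electronics", "$50-100", "Weekday"],
    ["Electronics", "$50-100", "Weekend"],
    ["Grocery", "$0-20", "Weekday"],
    ["Electronics", "$50-100", "Weekday"],
]

def merge_most_frequent_pair(sequences):
    pairs = Counter(
        (seq[i], seq[i + 1]) for seq in sequences for i in range(len(seq) - 1)
    )
    (a, b), _ = pairs.most_common(1)[0]
    merged_token = f"{{{a}+{b}}}"
    out = []
    for seq in sequences:
        new_seq, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == (a, b):
                new_seq.append(merged_token)  # the pair becomes one composite token
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        out.append(new_seq)
    return out, merged_token

corpus, new_token = merge_most_frequent_pair(corpus)
print(new_token)  # {Electronics+$50-100}: a cross-field pattern, now a single token
```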
## Documentation
| Document | Description |
|----------|-------------|
| 📄 [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey**: 31 papers across 5 paradigms, technical taxonomy, full blueprint |
| 🏦 [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering**: complete pipeline reconstruction, 4 academic pillars, adaptation playbooks |
| 🏗️ [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record**: framework choice (PyTorch + HF), trade-offs vs JAX/Keras, detailed implementation roadmap with code |
## Implementation Decision
After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:
**Decision: PyTorch + HuggingFace Transformers** (with JAX as future scaling path)
Key reasons:
- **5 of 6 reference papers use PyTorch** (including Google DeepMind's ActionPiece)
- **HuggingFace has the only complete custom tokenizer pipeline** (`PreTrainedTokenizerFast` → Trainer → `push_to_hub`), sketched below
- **Production deployment is direct:** ONNX, TGI, vLLM all first-class
- JAX advantages (TPU, XLA) only matter at >1B params on 256+ accelerators, not at our 24M–330M scale
Full analysis: [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md)
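A minimal sketch of that tokenizer pipeline, using a placeholder word-level vocabulary of domain tokens:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast

# Placeholder domain vocabulary; a real one would be built from the schema.
vocab = {"[PAD]": 0, "[UNK]": 1, "[TXN]": 2, "[DEBIT]": 3, "[AMOUNT_BIN_3]": 4}

tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = WhitespaceSplit()  # domain tokens arrive pre-joined with spaces

fast_tok = PreTrainedTokenizerFast(
    tokenizer_object=tok, pad_token="[PAD]", unk_token="[UNK]"
)
print(fast_tok("[TXN] [DEBIT] [AMOUNT_BIN_3]").input_ids)  # [2, 3, 4]
# fast_tok now plugs into transformers' Trainer and fast_tok.push_to_hub(...)
```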
## Project Roadmap
### Phase 1: Research & Survey ✅
- Literature survey (35+ papers)
- Nubank nuFormer reverse-engineering
- Framework ADR with detailed implementation plan
### Phase 2: Core Library (Next: ~9 weeks)
- **Weeks 1–3:** Domain tokenizer library (schema → per-field tokenizers → HF-compatible composite tokenizer)
- **Weeks 3–5:** GPT-style Transformer with NoPE + PLR embeddings + DCNv2 joint fusion (PLR sketched after this list)
- **Weeks 5–7:** Pre-training pipeline (CLM on domain sequences via HF Trainer)
- **Weeks 7–9:** Fine-tuning pipeline (nuFormer-style joint fusion)
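As a taste of the Weeks 3–5 components, here is a minimal PyTorch sketch of a PLR-style periodic embedding for one scalar feature (hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic-Linear-ReLU embedding for one scalar feature (after PLR, arXiv:2203.05556)."""
    def __init__(self, n_frequencies: int = 16, d_embed: int = 32, sigma: float = 1.0):
        super().__init__()
        # Learned frequencies, initialized from N(0, sigma^2) as in the paper.
        self.freq = nn.Parameter(torch.randn(n_frequencies) * sigma)
        self.proj = nn.Linear(2 * n_frequencies, d_embed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) scalar feature, e.g. a normalized transaction amount.
        v = 2 * torch.pi * self.freq * x.unsqueeze(-1)          # (batch, n_freq)
        periodic = torch.cat([torch.sin(v), torch.cos(v)], dim=-1)
        return torch.relu(self.proj(periodic))                  # (batch, d_embed)

amounts = torch.tensor([0.12, 0.85, 0.40])  # normalized amounts
print(PLREmbedding()(amounts).shape)        # torch.Size([3, 32])
```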
### Phase 3: Domain Demos (Weeks 9–12)
- Finance: fraud detection, credit scoring
- E-commerce: next purchase prediction, customer segmentation
### Phase 4: Scale & Optimize (Weeks 12+)
- 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
## Repo Structure
```
domainTokenizer/
├── docs/
│   ├── research_report.md                        # 51KB – Full research survey
│   ├── nubank_nuformer_analysis.md               # 29KB – Nubank pipeline analysis
│   └── adr/
│       └── ADR-001-implementation-framework.md   # Framework decision + roadmap
├── src/                                          # (Phase 2) Core library
│   ├── tokenizers/                               # Schema, field tokenizers, composite builder
│   ├── models/                                   # DomainTransformer, PLR, DCNv2, JointFusion
│   └── training/                                 # Data pipeline, pre-training, fine-tuning
├── examples/                                     # (Phase 3) Domain-specific demos
└── README.md
```
## Key References
| Paper | Year | What It Does | Link |
|-------|------|-------------|------|
| **nuFormer** (Nubank) | 2025 | Transaction foundation model at production scale | [arXiv](https://arxiv.org/abs/2507.23267) |
| TIGER (Google) | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| ActionPiece (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| RecFormer | 2023 | Items as key-value text representations | [arXiv](https://arxiv.org/abs/2305.13731) |
| PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | [arXiv](https://arxiv.org/abs/2203.05556) |
| DCN V2 (Google) | 2021 | Feature crossing for tabular data | [arXiv](https://arxiv.org/abs/2008.13535) |
| NoPE | 2023 | No positional encoding beats RoPE/ALiBi | [arXiv](https://arxiv.org/abs/2305.19466) |
| KL3M Tokenizers | 2025 | Domain-specific BPE for finance/legal | [arXiv](https://arxiv.org/abs/2503.17247) |
| Banking TF | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)
## License
MIT
---
*domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.*