
# 🔑 domainTokenizer

Building small models that understand domain tokens, not just words.


## The Idea

LLMs process text by breaking it into tokens (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.

But words are just one type of sequential data. Businesses generate massive amounts of non-textual sequential data every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.

domainTokenizer explores a different approach: what if we built tokenizers that encode domain-specific entities (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?

    Text LLM:        "The cat sat on the mat"   → [The] [cat] [sat] [on] [the] [mat]                       → Transformer → next word
    domainTokenizer: Customer purchase history  → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase

## 🏦 Industry Validation: Nubank's nuFormer

This isn't just theory. Nubank (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:

Their approach: 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → ~14 tokens per transaction → GPT-style Transformer (24M-330M params) → +1.25% relative AUC over LightGBM (3× their production launch threshold).
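
To make the recipe concrete, below is a rough, hypothetical sketch of the flattening step: structured fields become special tokens and the merchant string goes through BPE. The field names, bin edges, and token strings are placeholders, not Nubank's actual vocabulary.

```python
# Hypothetical sketch of nuFormer-style transaction tokenization (not Nubank's code):
# structured fields -> special tokens, merchant description -> BPE pieces.
import bisect

# Illustrative amount-bin edges; a real pipeline would fit percentile bins on data.
AMOUNT_BIN_EDGES = [5, 10, 20, 50, 100, 250, 500, 1000]

def fake_bpe(text):
    """Toy stand-in for a trained BPE tokenizer (illustration only)."""
    return text.split()

def transaction_to_tokens(amount, weekday, hour, is_credit, merchant, bpe_encode=fake_bpe):
    """Flatten one transaction into a short sequence of special tokens + merchant BPE."""
    tokens = ["[TX]"]
    tokens.append("[CREDIT]" if is_credit else "[DEBIT]")
    tokens.append(f"[AMOUNT_BIN_{bisect.bisect(AMOUNT_BIN_EDGES, abs(amount))}]")
    tokens.append(f"[WEEKDAY_{weekday}]")        # 0 = Monday ... 6 = Sunday
    tokens.append(f"[HOUR_BIN_{hour // 6}]")     # four coarse day-part bins
    tokens.extend(bpe_encode(merchant.lower()))  # merchant text via BPE
    tokens.append("[/TX]")
    return tokens

print(transaction_to_tokens(79.99, weekday=2, hour=15, is_credit=False,
                            merchant="ACME Electronics"))
# ['[TX]', '[DEBIT]', '[AMOUNT_BIN_4]', '[WEEKDAY_2]', '[HOUR_BIN_2]', 'acme', 'electronics', '[/TX]']
```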

📄 Full reverse-engineering: `docs/nubank_nuformer_analysis.md`

## Why This Matters

| Problem | Text Tokenizer | Domain Tokenizer |
|---|---|---|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Wednesday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
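
The last row is the least obvious, so here is a deliberately simplified sketch of the idea behind it: count which field-level tokens co-occur within the same event and promote the most frequent pair to a single composite token. The events and token names below are made up, and methods like ActionPiece handle ordering and weighting far more carefully.

```python
from collections import Counter
from itertools import combinations

# Each event is a set of field-level tokens (toy data, hypothetical token names).
events = [
    {"cat=Electronics", "price_bin=50-100", "day=Weekday"},
    {"cat=Electronics", "price_bin=50-100", "day=Weekend"},
    {"cat=Grocery",     "price_bin=0-20",   "day=Weekday"},
    {"cat=Electronics", "price_bin=50-100", "day=Weekday"},
]

# Count co-occurring token pairs within each event.
pair_counts = Counter()
for event in events:
    for a, b in combinations(sorted(event), 2):
        pair_counts[(a, b)] += 1

# One BPE-like step: merge the most frequent pair into a composite token.
(best_a, best_b), count = pair_counts.most_common(1)[0]
composite = f"{{{best_a} + {best_b}}}"
print(f"merge ({count}x): {composite}")
# merge (3x): {cat=Electronics + price_bin=50-100}

# Rewrite events with the merged token; repeating this grows a composite vocabulary.
merged_events = [
    (event - {best_a, best_b}) | {composite} if {best_a, best_b} <= event else event
    for event in events
]
```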

## Research Foundation

This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: any sequential domain data can be tokenized and modeled with the LLM paradigm; the challenge is how to tokenize.

| Paradigm | Method | Key Paper |
|---|---|---|
| Semantic IDs | RQ-VAE quantization of item embeddings | TIGER (Google, 2023) |
| Action Tokenization | BPE-like merging of feature patterns | ActionPiece (DeepMind, 2025) |
| Transaction Tokenization | Special tokens + BPE hybrid | nuFormer (Nubank, 2025) |
| Tabular Tokenization | Periodic embeddings for numbers | PLR (Yandex, 2022) |
| Universal Tokenization | All modalities → shared discrete space | Meta-Transformer (2023) |
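
For the Semantic ID row, the core mechanism is residual quantization: match an item embedding against a codebook, match the leftover residual against the next codebook, and so on, until the item is reduced to a short tuple of codebook indices. The sketch below uses random, untrained codebooks purely to show the shape of the computation; TIGER learns them jointly inside an RQ-VAE.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, LEVELS, CODEBOOK_SIZE = 64, 3, 256

# Untrained random codebooks, one per quantization level (illustration only).
codebooks = [rng.normal(size=(CODEBOOK_SIZE, DIM)) for _ in range(LEVELS)]

def semantic_id(item_embedding):
    """Residual-quantize an item embedding into a tuple of codebook indices."""
    residual = item_embedding
    indices = []
    for codebook in codebooks:
        # Nearest codeword at this level.
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        indices.append(idx)
        # Subtract the chosen codeword and carry the residual to the next level.
        residual = residual - codebook[idx]
    return tuple(indices)

item = rng.normal(size=DIM)  # stand-in for a learned item embedding
print(semantic_id(item))     # e.g. (204, 17, 151) -> tokens like [sid_0_204] [sid_1_17] [sid_2_151]
```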

## Documentation

| Document | Description |
|---|---|
| 📄 `docs/research_report.md` | Comprehensive research survey: 31 papers across 5 paradigms, technical taxonomy, full blueprint |
| 🏦 `docs/nubank_nuformer_analysis.md` | Nubank reverse-engineering: complete pipeline reconstruction, 4 academic pillars, adaptation playbooks |
| 🏗️ `docs/adr/ADR-001-implementation-framework.md` | Architecture Decision Record: framework choice (PyTorch + HF), trade-offs vs. JAX/Keras, detailed implementation roadmap with code |

## Implementation Decision

After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:

**Decision: PyTorch + HuggingFace Transformers** (with JAX as a future scaling path)

Key reasons:

- 5 of 6 reference papers use PyTorch (including Google DeepMind's ActionPiece)
- HuggingFace has the only complete custom tokenizer pipeline (`PreTrainedTokenizerFast` → `Trainer` → `push_to_hub`); see the sketch below
- Production deployment is direct: ONNX, TGI, and vLLM are all first-class
- JAX advantages (TPU, XLA) only matter at >1B params on 256+ accelerators, not at our 24M-330M scale

Full analysis: `docs/adr/ADR-001-implementation-framework.md`
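
A minimal sketch of that pipeline for a closed domain vocabulary (placeholder tokens, not the Phase 2 schema): build a word-level `tokenizers` model over the domain tokens and wrap it in `PreTrainedTokenizerFast`, so the usual `Trainer` / `push_to_hub` machinery applies unchanged.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import WhitespaceSplit
from transformers import PreTrainedTokenizerFast

# Placeholder domain vocabulary; Phase 2 would generate this from a field schema.
vocab = {"[PAD]": 0, "[UNK]": 1, "[TX]": 2, "[/TX]": 3,
         "[DEBIT]": 4, "[CREDIT]": 5, "[AMOUNT_BIN_4]": 6, "[WEEKDAY_2]": 7}

core = Tokenizer(WordLevel(vocab=vocab, unk_token="[UNK]"))
core.pre_tokenizer = WhitespaceSplit()  # domain tokens arrive as whitespace-separated strings

tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=core,
    unk_token="[UNK]",
    pad_token="[PAD]",
)

encoded = tokenizer("[TX] [DEBIT] [AMOUNT_BIN_4] [WEEKDAY_2] [/TX]")
print(encoded["input_ids"])  # [2, 4, 6, 7, 3]
# From here, tokenizer.push_to_hub(...) and the HF Trainer work as with any text tokenizer.
```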

## Project Roadmap

### Phase 1: Research & Survey ✅

- Literature survey (35+ papers)
- Nubank nuFormer reverse-engineering
- Framework ADR with detailed implementation plan

### Phase 2: Core Library (next, ~9 weeks)

- Weeks 1–3: Domain tokenizer library (schema → per-field tokenizers → HF-compatible composite tokenizer)
- Weeks 3–5: GPT-style Transformer with NoPE + PLR embeddings + DCNv2 joint fusion (a rough PLR sketch follows this list)
- Weeks 5–7: Pre-training pipeline (CLM on domain sequences via the HF Trainer)
- Weeks 7–9: Fine-tuning pipeline (nuFormer-style joint fusion)
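
The PLR piece from Weeks 3–5 is small enough to sketch here. Following the general PLR recipe (periodic features of a scalar, then a linear layer, then ReLU), each raw numeric field gets a learned frequency bank; the sizes and initialization scale below are arbitrary placeholders rather than tuned values.

```python
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic + Linear + ReLU embedding for one scalar feature (rough sketch).

    x -> [cos(2*pi*c_i*x), sin(2*pi*c_i*x)] -> Linear -> ReLU, with learnable c_i.
    """

    def __init__(self, n_frequencies: int = 48, d_embedding: int = 64, sigma: float = 0.1):
        super().__init__()
        # Learnable frequencies, initialized from N(0, sigma^2).
        self.frequencies = nn.Parameter(torch.randn(n_frequencies) * sigma)
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) raw scalar values -> (batch, d_embedding)
        angles = 2 * torch.pi * self.frequencies * x.unsqueeze(-1)
        periodic = torch.cat([torch.cos(angles), torch.sin(angles)], dim=-1)
        return torch.relu(self.linear(periodic))

amounts = torch.tensor([3.50, 79.99, 1200.0])   # e.g. raw transaction amounts
print(PLREmbedding()(amounts).shape)            # torch.Size([3, 64])
```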

### Phase 3: Domain Demos (Weeks 9–12)

- Finance: fraud detection, credit scoring
- E-commerce: next-purchase prediction, customer segmentation

### Phase 4: Scale & Optimize (Weeks 12+)

- 330M-param models, `torch.compile()`, ONNX export, ActionPiece vocabulary

## Repo Structure

    domainTokenizer/
    ├── docs/
    │   ├── research_report.md              # 51KB - Full research survey
    │   ├── nubank_nuformer_analysis.md     # 29KB - Nubank pipeline analysis
    │   └── adr/
    │       └── ADR-001-implementation-framework.md  # Framework decision + roadmap
    ├── src/                                 # (Phase 2) Core library
    │   ├── tokenizers/                      # Schema, field tokenizers, composite builder
    │   ├── models/                          # DomainTransformer, PLR, DCNv2, JointFusion
    │   └── training/                        # Data pipeline, pre-training, fine-tuning
    ├── examples/                            # (Phase 3) Domain-specific demos
    └── README.md

## Key References

| Paper | Year | What It Does | Link |
|---|---|---|---|
| nuFormer (Nubank) | 2025 | Transaction foundation model at production scale | arXiv |
| TIGER (Google) | 2023 | Semantic IDs for products via RQ-VAE | arXiv |
| ActionPiece (DeepMind) | 2025 | BPE for user action sequences | arXiv |
| RecFormer | 2023 | Items as key-value text representations | arXiv |
| PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | arXiv |
| DCN V2 (Google) | 2021 | Feature crossing for tabular data | arXiv |
| NoPE | 2023 | No positional encoding beats RoPE/ALiBi | arXiv |
| KL3M Tokenizers | 2025 | Domain-specific BPE for finance/legal | arXiv |
| Banking TF | 2024 | Transaction tokenizer for French banking | arXiv |
| Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | arXiv |

Full reference table (35+ papers): `docs/research_report.md`

## License

MIT


domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.