# 🔑 domainTokenizer
**Building small models that understand domain tokens, not just words.**
---
## The Idea
LLMs process text by breaking it into **tokens** (subword units like `ing`, `the`, `un-`). This tokenization is what enables Transformers to model sequential patterns.
But words are just one type of sequential data. Businesses generate massive amounts of **non-textual sequential data** every day: purchase transactions, banking flows, medical events, logistics chains, ad funnels. These carry rich structure that text tokenizers cannot capture.
**domainTokenizer** explores a different approach: what if we built tokenizers that encode **domain-specific entities** (products, transactions, medical codes, user actions) as first-class tokens, and then trained small, efficient models that understand domain patterns the way LLMs understand language?
```
Text LLM:        "The cat sat on the mat"   → [The] [cat] [sat] [on] [the] [mat]                        → Transformer → next word
domainTokenizer: Customer purchase history  → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay]  → Transformer → next purchase
```
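To make the mapping concrete, here is a minimal, hypothetical sketch of flattening one purchase event into domain tokens. The field names, token strings, and bin edges are invented for illustration; this is not the `domain_tokenizer` API.

```python
# Hypothetical illustration only: token strings, field names, and bin edges
# are invented; this is not the domain_tokenizer API.
def amount_bin(amount: float, edges=(10, 50, 100, 500, 1000)) -> int:
    """Toy stand-in for quantile binning fit on training data."""
    return sum(amount >= e for e in edges)

def to_domain_tokens(event: dict) -> list[str]:
    """Flatten one structured event into a short sequence of domain tokens."""
    return [
        f"[CAT_{event['category'].upper()}]",        # categorical field -> one token
        f"[AMT_BIN_{amount_bin(event['amount'])}]",  # numeric field -> magnitude bin token
        f"[DOW_{event['weekday'].upper()}]",         # calendar feature derived from the timestamp
        f"[PAY_{event['payment'].upper()}]",         # payment type / sign
    ]

print(to_domain_tokens(
    {"category": "Electronics", "amount": 379.90, "weekday": "Sat", "payment": "credit"}
))
# ['[CAT_ELECTRONICS]', '[AMT_BIN_3]', '[DOW_SAT]', '[PAY_CREDIT]']
```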
## Quick Start
```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# 1. Build tokenizer from schema (Nubank-style: 97 domain tokens + BPE)
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)  # fit magnitude bins on training data
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data (100% token utilization, zero padding waste)
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

# 3. Create model (GPT-style, NoPE, pre-norm; 24M params)
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)

# 4. Pre-train with HF Trainer (cosine schedule, CLM objective)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",  # auto-push to HF Hub
    num_epochs=10, learning_rate=3e-4,
    bf16=True,            # A100/H100
    report_to="trackio",  # live monitoring
)

# 5. Fine-tune for downstream tasks (nuFormer-style joint fusion)
from domain_tokenizer import JointFusionModel

fusion = JointFusionModel(
    transformer_model=model,  # pre-trained, unfrozen
    n_tabular_features=291,   # hand-crafted tabular features
    n_classes=1,              # binary: will user activate product?
)
# Train fusion model end-to-end on labeled data...
```
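The "packed training data" in step 2 means that tokenized user sequences are concatenated and cut into fixed `block_size` chunks, so every position in a batch carries a real token rather than padding. A minimal sketch of that idea (not the `prepare_clm_dataset` internals; the separator token is an assumption):

```python
# Illustrative sequence packing: concatenate tokenized user sequences, then slice
# into fixed-size blocks. Not the actual prepare_clm_dataset implementation.
def pack_sequences(tokenized_seqs: list[list[int]], block_size: int, sep_id: int) -> list[list[int]]:
    stream: list[int] = []
    for seq in tokenized_seqs:
        stream.extend(seq + [sep_id])        # mark the boundary between users
    n_blocks = len(stream) // block_size     # drop the ragged tail: no padding needed
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]

blocks = pack_sequences([[5, 7, 9], [11, 13], [17, 19, 23, 29]], block_size=4, sep_id=0)
print(blocks)  # [[5, 7, 9, 0], [11, 13, 0, 17], [19, 23, 29, 0]]
```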
## 🏦 Industry Validation: Nubank's nuFormer
This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:
- **Paper:** ["Your spending needs attention: Modeling financial habits with transformers"](https://arxiv.org/abs/2507.23267) (arXiv: 2507.23267)
- **Blog series:** [Building Nubank - Foundation Models](https://building.nubank.com/unlocking-financial-insights-how-nubank-powers-personalized-experiences-with-foundation-models/)
**Their approach:** 97 special tokens for structured fields (amount bins, calendar features, credit/debit sign) + BPE for merchant descriptions → **~14 tokens per transaction** → GPT-style Transformer (24M-330M params) → **+1.25% relative AUC over LightGBM** (3× their production launch threshold).
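For a rough sense of what "~14 tokens per transaction" looks like, here is a hypothetical token stream for one card transaction. The token names are invented for illustration and are not Nubank's actual vocabulary (that reconstruction lives in the analysis doc linked below).

```python
# Hypothetical example: ~5 structured-field tokens + ~8 BPE pieces of the
# merchant description + an end marker. Token names are invented.
transaction_tokens = [
    "[TX_START]", "[AMT_BIN_41]", "[DEBIT]", "[DOW_FRI]", "[HOUR_EVENING]",
    "ama", "zon", " market", "pla", "ce", " br", " s", "p",
    "[TX_END]",
]
assert len(transaction_tokens) == 14
```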
📄 **Full reverse-engineering:** [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md)
## Why This Matters
| Problem | Text Tokenizer | Domain Tokenizer |
|---------|---------------|-----------------|
| Product `SKU-8847291` | Split into meaningless fragments: `SK`, `U-`, `884`... | Encoded as Semantic ID: `[Electronics, 23, 7, 41]` with hierarchical meaning |
| Price `$79.99` | Fragmented: `$`, `79`, `.`, `99` | Tokenized as `price_bin_37` (73rd percentile = "mid-range") |
| Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
| Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
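The price row above relies on quantile binning fit on training data. A minimal sketch of that step with NumPy (illustrative only: the bin count and token naming are assumptions, and this is not the library's magnitude-bucket tokenizer):

```python
import numpy as np

# Fit percentile bin edges on training prices, then map any price to a bin token.
rng = np.random.default_rng(0)
train_prices = rng.lognormal(mean=3.5, sigma=1.0, size=10_000)  # stand-in for real price data

edges = np.percentile(train_prices, np.arange(2, 100, 2))       # 49 internal edges -> 50 bins

def price_token(price: float) -> str:
    bin_idx = int(np.searchsorted(edges, price))                # 0..49, the price's percentile bin
    return f"price_bin_{bin_idx}"

print(price_token(79.99))  # e.g. 'price_bin_40'; the exact bin depends on the fitted distribution
```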
## Documentation
| Document | Description |
|----------|-------------|
| 📄 [`docs/research_report.md`](docs/research_report.md) | **Research survey**: 31 papers across 5 paradigms, technical taxonomy, blueprint |
| 🏦 [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering**: full pipeline reconstruction, 4 academic pillars |
| 🏗️ [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record**: PyTorch+HF vs JAX/Keras, trade-offs, roadmap |
| 📊 [`docs/phase2_implementation_report.md`](docs/phase2_implementation_report.md) | **Implementation report**: Phase 2A-2C technical decisions, architecture, 124 tests |
## Project Roadmap
### Phase 1: Research & Survey ✅
- Literature survey (35+ papers), Nubank reverse-engineering, framework ADR
### Phase 2: Core Library ✅ (v0.3.0, 124 tests passing)
- **2A:** Domain tokenizer library: schema, 5 field tokenizers, HF-compatible builder
- **2B:** Model architecture: DomainTransformerForCausalLM (NoPE GPT), PLR embeddings (see the sketch after this list), DCNv2 + JointFusion
- **2C:** Pre-training pipeline: sequence packing, DataCollatorForLanguageModeling, HF Trainer
- **2D:** Fine-tuning pipeline (next)
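The PLR embeddings mentioned in 2B follow Gorishniy et al. (2022): each scalar feature passes through a periodic encoding (sin/cos of learnable frequencies), then a linear layer and a ReLU. Below is a minimal PyTorch sketch of that idea, not the library's `plr_embeddings.py` (which may differ, e.g. by using per-feature linear weights as in the paper); all sizes are placeholder hyperparameters.

```python
import torch
import torch.nn as nn

class PLRSketch(nn.Module):
    """Periodic -> Linear -> ReLU embedding for scalar numeric features
    (sketch of Gorishniy et al. 2022)."""

    def __init__(self, n_features: int, n_frequencies: int = 48, d_embedding: int = 64, sigma: float = 0.1):
        super().__init__()
        # one set of learnable frequencies per feature
        self.frequencies = nn.Parameter(torch.randn(n_features, n_frequencies) * sigma)
        # shared projection for simplicity; the paper uses per-feature weights
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features) raw scalar values
        angles = 2 * torch.pi * self.frequencies[None] * x[..., None]     # (batch, n_features, n_freq)
        periodic = torch.cat([torch.sin(angles), torch.cos(angles)], -1)  # periodic encoding
        return torch.relu(self.linear(periodic))                          # (batch, n_features, d_embedding)

emb = PLRSketch(n_features=3)
print(emb(torch.randn(8, 3)).shape)  # torch.Size([8, 3, 64])
```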
### Phase 3: Domain Demos
- Finance: fraud detection, credit scoring on real data
- E-commerce: next purchase prediction, customer segmentation
### Phase 4: Scale & Optimize
- 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
## Repo Structure
```
src/domain_tokenizer/
├── __init__.py                # v0.3.0: all public exports
├── schema.py                  # DomainSchema, FieldSpec, FieldType
├── tokenizers/
│   ├── field_tokenizers.py    # Sign, MagnitudeBucket, Calendar, Categorical, Discrete
│   └── domain_tokenizer.py    # DomainTokenizerBuilder → HF PreTrainedTokenizerFast
├── schemas/
│   └── predefined.py          # FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
├── models/
│   ├── configuration.py       # DomainTransformerConfig (24M/85M/330M presets)
│   ├── modeling.py            # DomainTransformerForCausalLM (NoPE, SDPA, weight-tied)
│   ├── plr_embeddings.py      # PeriodicLinearReLU (Gorishniy et al. 2022)
│   └── joint_fusion.py        # DCNv2 + JointFusionModel (nuFormer-style)
└── training/
    ├── data_pipeline.py       # tokenize → pack → HFDataset
    └── pretrain.py            # pretrain_domain_model (HF Trainer)
tests/
├── test_tokenizer.py          # 72 tests
├── test_model.py              # 33 tests
└── test_training.py           # 19 tests
```
## Key References
| Paper | Year | Role in domainTokenizer | Link |
|-------|------|------------------------|------|
| **nuFormer** (Nubank) | 2025 | Overall architecture blueprint | [arXiv](https://arxiv.org/abs/2507.23267) |
| **NoPE** | 2023 | No positional encoding (our attention design) | [arXiv](https://arxiv.org/abs/2305.19466) |
| **PLR Embeddings** (Yandex) | 2022 | Numerical feature embeddings | [arXiv](https://arxiv.org/abs/2203.05556) |
| **DCN V2** (Google) | 2021 | Tabular feature crossing in joint fusion | [arXiv](https://arxiv.org/abs/2008.13535) |
| **RecFormer** | 2023 | Items-as-text tokenization philosophy | [arXiv](https://arxiv.org/abs/2305.13731) |
| **TIGER** (Google) | 2023 | Semantic IDs via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
| **ActionPiece** (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
| **Banking TF** | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
| **Nested Learning (HOPE)** | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)
## License
MIT