rtferraz committed on
Commit 25a1093 · verified · 1 Parent(s): e30a14d

Add ADR-001: Implementation framework decision with detailed roadmap

docs/adr/ADR-001-implementation-framework.md ADDED
@@ -0,0 +1,978 @@
1
+ # ADR-001: Implementation Framework for domainTokenizer
2
+
3
+ > **Status:** Accepted
4
+ > **Date:** April 29, 2026
5
+ > **Decision:** PyTorch + HuggingFace Transformers as primary framework, with JAX/Flax NNX as future scaling path
6
+ > **Deciders:** domainTokenizer core team
7
+
8
+ ---
9
+
10
+ ## Table of Contents
11
+
12
+ 1. [Context](#1-context)
13
+ 2. [Goal](#2-goal)
14
+ 3. [Options Evaluated](#3-options-evaluated)
15
+ 4. [Decision](#4-decision)
16
+ 5. [Trade-offs and Justification](#5-trade-offs-and-justification)
17
+ 6. [Consequences](#6-consequences)
18
+ 7. [Implementation Roadmap](#7-implementation-roadmap)
19
+ 8. [Appendix A: Framework Usage Across Reference Papers](#appendix-a-framework-usage-across-reference-papers)
20
+ 9. [Appendix B: Head-to-Head Comparison Matrix](#appendix-b-head-to-head-comparison-matrix)
21
+ 10. [Appendix C: Key Code Patterns](#appendix-c-key-code-patterns)
22
+
23
+ ---
24
+
25
+ ## 1. Context
26
+
27
+ ### What We're Building
28
+
29
+ domainTokenizer is a library for building **small Transformer models (24M–330M parameters)** that process **domain-specific tokens** (financial transactions, e-commerce events, healthcare records) instead of natural language text. The architecture follows the validated pattern from Nubank's nuFormer ([arXiv: 2507.23267](https://arxiv.org/abs/2507.23267)):
30
+
31
+ ```
32
+ Domain Events → Custom Tokenizer → GPT-style Transformer → Foundation Model → Downstream Tasks
33
+ ```
34
+
35
+ ### The Implementation Question
36
+
37
+ A video from Google for Developers presents the **Keras 3 + JAX/Flax NNX integration** as a potential framework, citing:
38
+ - **Explicit state management** (Flax NNX): useful for tracking sequential transaction state
+ - **Custom training loops**: Keras structure + JAX/Optax for domain-specific gradient control
+ - **JIT compilation** (`@nnx.jit`): high-performance processing of millions of transactions
+ - **Paradigm mixing**: Keras layers for standard components + NNX for custom sequential encoders
42
+
43
+ The question: **Is Keras + JAX/Flax NNX the right framework for domainTokenizer, or is there a better choice?**
44
+
45
+ ### Constraints
46
+
47
+ 1. **Custom tokenizer required:** We need a tokenizer that maps structured fields (amounts, dates, categories) to special tokens, not a standard text tokenizer
+ 2. **Small models:** 24M–330M parameters, not 70B+; framework overhead matters less than developer velocity
49
+ 3. **Production deployment:** Models must be servable with low latency for real-time applications (fraud detection, recommendations)
50
+ 4. **GPU hardware:** Development on A100/A10G GPUs, not TPUs (standard cloud environment)
51
+ 5. **Team context:** ML engineers familiar with Python, PyTorch, and the HuggingFace ecosystem
52
+ 6. **Iteration speed:** Need to prototype quickly across multiple domains (finance, e-commerce, healthcare)
53
+
54
+ ---
55
+
56
+ ## 2. Goal
57
+
58
+ Choose an implementation framework that:
59
+
60
+ 1. **Minimizes time from research to working prototype** (weeks, not months)
61
+ 2. **Supports custom domain tokenizers** as first-class citizens
62
+ 3. **Integrates with the HuggingFace Hub** for model sharing, versioning, and community
63
+ 4. **Enables production deployment** via standard serving infrastructure (ONNX, TGI, vLLM, etc.)
64
+ 5. **Scales to 330M parameters** on 4–8 GPUs without heroic engineering
65
+ 6. **Does not preclude future migration** to JAX/TPU if we need to scale beyond 1B parameters
66
+
67
+ ---
68
+
69
+ ## 3. Options Evaluated
70
+
71
+ ### Option A: PyTorch + HuggingFace Transformers
72
+
73
+ The dominant ecosystem for custom NLP/sequential models. Provides `PreTrainedModel`, `PreTrainedTokenizerFast`, `Trainer`, `push_to_hub`, ONNX export, and integration with TRL, PEFT, Accelerate, DeepSpeed.
74
+
75
+ ### Option B: Keras 3 + JAX Backend + Flax NNX
76
+
77
+ Google's multi-backend framework. Keras provides high-level APIs; JAX provides XLA compilation and functional transforms; Flax NNX provides PyTorch-like stateful modules on top of JAX.
78
+
79
+ ### Option C: Pure JAX + Flax NNX + Optax
80
+
81
+ Skip Keras entirely. Use Flax NNX for model definition, Optax for optimization, Orbax for checkpointing, and Grain/tf.data for data loading. Google's MaxText framework follows this pattern.
82
+
83
+ ### Option D: PyTorch + Custom (no HuggingFace)
84
+
85
+ Use PyTorch directly without the HuggingFace abstraction layer. Full control but no ecosystem integration.
86
+
87
+ ---
88
+
89
+ ## 4. Decision
90
+
91
+ ### Primary: PyTorch + HuggingFace Transformers (Option A)
92
+
93
+ ### Future scaling path: JAX/Flax NNX (Option C), if and when we need TPU training at >1B parameters
94
+
95
+ ---
96
+
97
+ ## 5. Trade-offs and Justification
98
+
99
+ ### 5.1 What the Reference Papers Actually Use
100
+
101
+ We audited the frameworks used by every paper in the domainTokenizer research corpus. The result is overwhelming:
102
+
103
+ | Paper | Framework | Confidence |
104
+ |-------|-----------|------------|
105
+ | **nuFormer** (Nubank) | PyTorch + HF Transformers (inferred) | ~90% |
106
+ | **TIGER** (Google) | JAX + T5X (official); PyTorch (community reimpl) | 100% |
107
+ | **ActionPiece** (Google DeepMind) | **PyTorch + HF Transformers** (stated verbatim in paper) | 100% |
108
+ | **RecFormer** (UCSD/Amazon) | **PyTorch + HF Transformers (Longformer)** (stated verbatim) | 100% |
109
+ | **Banking Transaction Flow** | **PyTorch** (stated verbatim in appendix) | 100% |
110
+ | **PLR Embeddings** (Yandex) | **PyTorch** + scikit-learn + Optuna | 100% |
111
+
112
+ **5 of 6 papers use PyTorch.** The sole JAX user (TIGER) was a Google-internal project using T5X, and even its most popular community reimplementation (781 GitHub stars) is in PyTorch.
113
+
114
+ Even **Google DeepMind's own ActionPiece**, the paper most relevant to our domain tokenization approach, uses PyTorch + HuggingFace. This is the strongest signal possible.
115
+
116
+ ### 5.2 Custom Tokenizer Story
117
+
118
+ This is the **decisive factor**. domainTokenizer's core innovation is the tokenizer itself. The framework must provide first-class support for custom token vocabularies.
119
+
120
+ **PyTorch + HuggingFace:**
121
+ - Train custom BPE tokenizer via `tokenizers` library (Rust-backed, fast)
122
+ - Wrap in `PreTrainedTokenizerFast` for full Trainer compatibility
+ - Add domain special tokens via `add_special_tokens()`, then resize the model embeddings to match
124
+ - Push tokenizer to Hub: `tokenizer.push_to_hub("org/my-tokenizer")`
125
+ - Load anywhere: `AutoTokenizer.from_pretrained("org/my-tokenizer")`
126
+ - **KL3M** ([arXiv: 2503.17247](https://arxiv.org/abs/2503.17247)), the gold standard for financial domain tokenizers, is built entirely on this stack
127
+
128
+ **Keras + JAX/Flax NNX:**
129
+ - No equivalent to `PreTrainedTokenizerFast`
130
+ - No Hub-integrated tokenizer format
131
+ - Must build custom tokenizer from scratch with no ecosystem support
132
+ - No standard serialization/deserialization for domain vocabularies
133
+
134
+ **Verdict:** PyTorch/HF has a **complete, production-tested** custom tokenizer pipeline. JAX/Keras has **nothing**; you'd build everything from scratch.
135
+
136
+ ### 5.3 Production Deployment
137
+
138
+ | Path | PyTorch | JAX/Keras |
139
+ |------|---------|-----------|
140
+ | ONNX export | `torch.onnx.export()`, one line | Requires TF backend intermediate or experimental `jax.export` |
141
+ | TensorRT | ONNX β†’ TRT (standard) | Multi-hop, fragile |
142
+ | TGI (HuggingFace inference) | First-class | Not supported |
143
+ | vLLM | First-class | Not supported |
144
+ | Triton Inference Server | Direct ONNX/TorchScript | Via ONNX (workaround) |
145
+ | BentoML | Supported | Supported |
146
+ | Model Hub sharing | `push_to_hub()` → `from_pretrained()` | Works but fragmented (`.msgpack` weights, no Trainer compat) |
147
+
148
+ **Verdict:** PyTorch has **direct, tested paths** to every major serving framework. JAX requires **multiple intermediate conversions**, each introducing failure points.
149
+
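+ As a concreteness check on the "one line" claim, a minimal export sketch with a stand-in module (the module, file name, and shapes are illustrative; the real input would be the trained domain Transformer):
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class TinyLM(nn.Module):
+     """Stand-in for a trained domain model: embedding -> linear head over the vocab."""
+     def __init__(self, vocab_size=1024, hidden=64):
+         super().__init__()
+         self.embed = nn.Embedding(vocab_size, hidden)
+         self.head = nn.Linear(hidden, vocab_size)
+
+     def forward(self, input_ids):
+         return self.head(self.embed(input_ids))
+
+ model = TinyLM().eval()
+ dummy = torch.randint(0, 1024, (1, 128))  # (batch, seq_len)
+
+ torch.onnx.export(
+     model, (dummy,), "domain_model.onnx",
+     input_names=["input_ids"], output_names=["logits"],
+     dynamic_axes={"input_ids": {0: "batch", 1: "seq"}, "logits": {0: "batch", 1: "seq"}},
+     opset_version=17,
+ )
+ ```
+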
150
+ ### 5.4 Training Speed
151
+
152
+ At our scale (24M–330M parameters on 4–8 A100s):
153
+
154
+ | Scenario | PyTorch | JAX |
155
+ |----------|---------|-----|
156
+ | Steady-state training throughput | **Comparable** (`torch.compile`) | **Comparable** (XLA JIT) |
157
+ | Variable-length sequences | **Native** (dynamic shapes) | **Problematic** (recompiles on new shapes; must pad to buckets) |
+ | Multi-GPU (FSDP) | `accelerate` + FSDP2 (mature) | `pmap`/`shard_map` (works but harder to configure) |
+ | First-run compilation | Instant (eager mode) | 5–20 s JIT compilation overhead |
160
+ | Debugging | Standard Python debugger | `print` debugging; cryptic XLA errors |
161
+
162
+ **Verdict:** At 330M parameters, training speed is a **wash**. JAX's advantages (XLA kernel fusion, TPU native) only matter at 10B+ parameters on 256+ accelerators. At our scale, **developer velocity dominates throughput**.
163
+
164
+ ### 5.5 The JAX Advantage: When It Would Win
165
+
166
+ JAX/Flax NNX would be the right choice **if**:
167
+
168
+ 1. **Training exclusively on Google TPUs:** JAX is the native TPU compiler; PyTorch/XLA is a port with overhead
+ 2. **Models >1B parameters:** XLA's whole-program optimization shines at scale
+ 3. **Fixed-shape workloads:** images, fixed-length token sequences (no variable-length padding issues)
+ 4. **Need functional transforms:** `vmap` (per-sample gradients), `pmap` (data parallelism), `grad` (higher-order derivatives)
+ 5. **Google Cloud infrastructure:** Vertex AI, TPU VMs, GCS integration
173
+
174
+ For domainTokenizer's current scope (24M–330M, GPU, variable-length sequences, fast iteration), **none of these conditions apply**.
175
+
176
+ ### 5.6 The Keras + JAX Mixing Argument
177
+
178
+ The Google for Developers video argues for mixing Keras layers (high-level) with NNX modules (custom, high-performance). In theory, this lets you:
179
+ - Use Keras for standard Transformer layers
180
+ - Use NNX for custom sequential transaction encoders
181
+ - Get JIT compilation on the NNX parts
182
+
183
+ **In practice, this creates problems:**
184
+ 1. **Two mental models:** Keras (layer-oriented, `fit/compile`) vs. NNX (functional, explicit state); context switching slows development
+ 2. **Limited interop documentation:** Keras ↔ NNX examples are thin; edge cases are poorly documented
+ 3. **No HF ecosystem integration:** you lose Trainer, push_to_hub, PEFT, TRL, Accelerate, the entire ecosystem Nubank and ActionPiece rely on
+ 4. **Debugging complexity:** errors at the Keras ↔ NNX boundary are hard to diagnose
188
+
189
+ **Better approach with PyTorch:** Use `torch.compile()` on performance-critical modules to get JIT compilation benefits without leaving the PyTorch ecosystem. Write custom `nn.Module` subclasses for domain-specific components. This gives you the same "standard parts + custom parts" architecture without framework mixing.
190
+
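+ A minimal sketch of that pattern (the `TransactionEncoder` module below is hypothetical, purely to show where `torch.compile()` is applied):
+
+ ```python
+ import torch
+ import torch.nn as nn
+
+ class TransactionEncoder(nn.Module):
+     """Hypothetical custom sequential encoder for per-event feature vectors."""
+     def __init__(self, dim=256):
+         super().__init__()
+         self.proj = nn.Linear(dim, dim)
+         self.norm = nn.LayerNorm(dim)
+
+     def forward(self, x):  # x: (batch, events, dim)
+         return self.norm(torch.relu(self.proj(x)))
+
+ encoder = TransactionEncoder()
+
+ # JIT-compile only the hot module; the rest of the model stays in eager mode.
+ encoder = torch.compile(encoder)
+
+ out = encoder(torch.randn(8, 512, 256))  # first call triggers compilation
+ ```
+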
191
+ ---
192
+
193
+ ## 6. Consequences
194
+
195
+ ### What We Gain
196
+
197
+ 1. **Immediate access to the entire HuggingFace ecosystem:** Trainer, Accelerate, PEFT (LoRA), TRL, Evaluate, push_to_hub, from_pretrained, ONNX export, TGI serving
198
+ 2. **Copy-paste from reference implementations:** ActionPiece, RecFormer, Banking TF, and PLR embeddings are all PyTorch, so we can directly reuse their code
199
+ 3. **KL3M tokenizer as starting point:** The best financial domain tokenizer already exists in PyTorch/HF format at `alea-institute/kl3m-004-128k-cased`
200
+ 4. **Standard production deployment:** ONNX → TensorRT → Triton, or direct TGI/vLLM serving
201
+ 5. **Community and hiring:** PyTorch is the dominant ML framework; finding contributors and documentation is easy
202
+ 6. **`torch.compile()` for performance:** When we need JIT compilation on hot paths, `torch.compile()` provides 10–30% speedups without leaving the ecosystem
203
+
204
+ ### What We Accept
205
+
206
+ 1. **No native TPU support:** If we later need to train on Google TPUs, we'll need PyTorch/XLA (slower than native JAX) or migrate the model code
207
+ 2. **Weaker functional transforms:** per-sample gradients via `vmap` require `torch.func` (the successor to the experimental `functorch`; see the sketch after this list), which is less battle-tested than JAX's transforms. If we need advanced gradient manipulation for meta-learning or Nested Learning (HOPE-style), JAX would be better
208
+ 3. **Potential future migration cost:** If we scale beyond 1B parameters and move to TPUs, we'll need to rewrite model code in Flax NNX. This is mitigated by keeping model definitions clean and modular
209
+
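+ For reference, a toy sketch of per-sample gradients via `torch.func` (illustrative model and data, following the standard per-sample-gradient recipe):
+
+ ```python
+ import torch
+ import torch.nn as nn
+ from torch.func import functional_call, grad, vmap
+
+ model = nn.Linear(16, 2)  # toy stand-in model
+ params = {k: v.detach() for k, v in model.named_parameters()}
+
+ def loss_fn(params, x, y):
+     logits = functional_call(model, params, (x.unsqueeze(0),))
+     return nn.functional.cross_entropy(logits, y.unsqueeze(0))
+
+ x = torch.randn(32, 16)         # batch of 32 samples
+ y = torch.randint(0, 2, (32,))
+
+ # grad() differentiates w.r.t. params; vmap() maps that over the batch dimension.
+ per_sample_grads = vmap(grad(loss_fn), in_dims=(None, 0, 0))(params, x, y)
+ # per_sample_grads["weight"].shape == (32, 2, 16)
+ ```
+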
210
+ ### Migration Strategy (If Needed Later)
211
+
212
+ If domainTokenizer grows to >1B parameters and we need TPU training:
213
+
214
+ 1. **Tokenizer layer stays in Python/HF:** Tokenizer is framework-agnostic; it produces integer sequences regardless of whether the model is PyTorch or JAX
+ 2. **Model architecture translates 1:1:** PyTorch `nn.Module` → Flax NNX `nnx.Module` mapping is straightforward for standard Transformer components (see the NNX sketch below)
+ 3. **Training loop changes:** PyTorch Trainer → custom Flax NNX training loop with Optax
217
+ 4. **Reference:** Google's MaxText (`github.com/google/maxtext`) provides production-grade JAX Transformer patterns we can follow
218
+
219
+ **Estimated migration effort:** 2–4 weeks for a clean, well-separated codebase.
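+
+ To make the claimed 1:1 mapping concrete, a minimal Flax NNX sketch of the MLP sub-block (illustrative only, not part of the current codebase):
+
+ ```python
+ import jax
+ import jax.numpy as jnp
+ from flax import nnx
+
+ class DomainMLP(nnx.Module):
+     """NNX counterpart of the PyTorch MLP sub-block (illustrative only)."""
+     def __init__(self, hidden: int, intermediate: int, *, rngs: nnx.Rngs):
+         self.fc1 = nnx.Linear(hidden, intermediate, rngs=rngs)
+         self.fc2 = nnx.Linear(intermediate, hidden, rngs=rngs)
+
+     def __call__(self, x):
+         return self.fc2(jax.nn.gelu(self.fc1(x)))
+
+ mlp = DomainMLP(256, 1024, rngs=nnx.Rngs(0))
+ y = mlp(jnp.ones((2, 16, 256)))  # same call shape as the PyTorch module
+ ```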
220
+
221
+ ---
222
+
223
+ ## 7. Implementation Roadmap
224
+
225
+ ### Phase 2A: Core Tokenizer Library (Weeks 1–3)
226
+
227
+ #### Step 1: Domain Schema Definition
228
+
229
+ Create a declarative schema format that describes the fields in a domain's event data:
230
+
231
+ ```python
232
+ # src/tokenizers/schema.py
233
+
234
+ from dataclasses import dataclass, field
235
+ from enum import Enum
236
+ from typing import List, Optional
237
+
238
+ class FieldType(Enum):
239
+ NUMERICAL_CONTINUOUS = "numerical_continuous" # prices, amounts β†’ magnitude bins
240
+ NUMERICAL_DISCRETE = "numerical_discrete" # quantities β†’ small fixed vocab
241
+ CATEGORICAL_FIXED = "categorical_fixed" # categories, days of week β†’ direct mapping
242
+ CATEGORICAL_ENTITY = "categorical_entity" # products, merchants β†’ Semantic IDs (RQ-VAE)
243
+ TEMPORAL = "temporal" # timestamps β†’ calendar decomposition
244
+ TEXT = "text" # descriptions β†’ BPE subwords
245
+ SIGN = "sign" # credit/debit β†’ 2 tokens
246
+
247
+ @dataclass
248
+ class FieldSpec:
249
+ name: str
250
+ field_type: FieldType
251
+ vocab_size: Optional[int] = None # for fixed categorical
252
+ n_bins: int = 21 # for numerical (Nubank uses 21)
253
+ calendar_fields: List[str] = field( # for temporal
254
+ default_factory=lambda: ["month", "dow", "dom", "hour"]
255
+ )
256
+
257
+ @dataclass
258
+ class DomainSchema:
259
+ name: str # e.g., "ecommerce", "finance"
260
+ fields: List[FieldSpec] # ordered list of fields per event
261
+
262
+ @property
263
+ def special_token_count(self) -> int:
264
+ """Total domain-specific special tokens needed."""
265
+ count = 0
266
+ for f in self.fields:
267
+ if f.field_type == FieldType.SIGN:
268
+ count += 2
269
+             elif f.field_type == FieldType.NUMERICAL_CONTINUOUS:
+                 count += f.n_bins
+             elif f.field_type == FieldType.NUMERICAL_DISCRETE:
+                 count += f.vocab_size or 0
+             elif f.field_type == FieldType.CATEGORICAL_FIXED:
+                 count += f.vocab_size or 0
273
+ elif f.field_type == FieldType.TEMPORAL:
274
+ count += sum({
275
+ "month": 12, "dow": 7, "dom": 31, "hour": 24,
276
+ "quarter": 4, "year": 10
277
+ }.get(cf, 0) for cf in f.calendar_fields)
278
+ return count
279
+
280
+ # Example: Nubank-style financial schema
281
+ FINANCE_SCHEMA = DomainSchema(
282
+ name="finance",
283
+ fields=[
284
+ FieldSpec("amount_sign", FieldType.SIGN),
285
+ FieldSpec("amount", FieldType.NUMERICAL_CONTINUOUS, n_bins=21),
286
+ FieldSpec("timestamp", FieldType.TEMPORAL,
287
+ calendar_fields=["month", "dow", "dom", "hour"]),
288
+ FieldSpec("description", FieldType.TEXT),
289
+ ]
290
+ )
291
+
292
+ # Example: E-commerce schema
293
+ ECOMMERCE_SCHEMA = DomainSchema(
294
+ name="ecommerce",
295
+ fields=[
296
+ FieldSpec("event_type", FieldType.CATEGORICAL_FIXED, vocab_size=5),
297
+ FieldSpec("price", FieldType.NUMERICAL_CONTINUOUS, n_bins=21),
298
+ FieldSpec("quantity", FieldType.NUMERICAL_DISCRETE, vocab_size=11),
299
+ FieldSpec("category_l1", FieldType.CATEGORICAL_FIXED, vocab_size=30),
300
+ FieldSpec("category_l2", FieldType.CATEGORICAL_FIXED, vocab_size=200),
301
+ FieldSpec("timestamp", FieldType.TEMPORAL,
302
+ calendar_fields=["month", "dow", "dom", "hour"]),
303
+ FieldSpec("product_title", FieldType.TEXT),
304
+ ]
305
+ )
306
+ ```
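+
+ A quick sanity check of the special-token budget each schema implies (counts follow the rule above, with `NUMERICAL_DISCRETE` counted by its `vocab_size`):
+
+ ```python
+ # finance:   2 (sign) + 21 (amount bins) + 12+7+31+24 (calendar) = 97
+ # ecommerce: 5 + 21 + 11 + 30 + 200 + 74 (calendar)              = 341
+ print(FINANCE_SCHEMA.special_token_count)
+ print(ECOMMERCE_SCHEMA.special_token_count)
+ ```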
307
+
308
+ #### Step 2: Per-Field Tokenizers
309
+
310
+ Implement each field type tokenizer as a standalone module:
311
+
312
+ ```python
313
+ # src/tokenizers/field_tokenizers.py
314
+
315
+ import numpy as np
316
+ from typing import List
317
+
318
+ class SignTokenizer:
319
+ """Tokenizes sign of a numerical value (credit/debit, inflow/outflow)."""
320
+
321
+ def __init__(self, prefix: str = "SIGN"):
322
+ self.tokens = [f"[{prefix}_POS]", f"[{prefix}_NEG]"]
323
+
324
+ def __call__(self, value: float) -> str:
325
+ return self.tokens[0] if value >= 0 else self.tokens[1]
326
+
327
+ @property
328
+ def vocab(self) -> List[str]:
329
+ return self.tokens
330
+
331
+
332
+ class MagnitudeBucketTokenizer:
333
+ """Quantizes continuous values into bins (Nubank-style).
334
+
335
+ Uses quantile-based binning on the training distribution.
336
+ Follows the Relative Magnitude Tokenization principle from TP-BERTa.
337
+ """
338
+
339
+ def __init__(self, n_bins: int = 21, prefix: str = "AMT"):
340
+ self.n_bins = n_bins
341
+ self.prefix = prefix
342
+ self.bin_edges = None # fitted from data
343
+
344
+ def fit(self, values: np.ndarray):
345
+ """Compute bin edges from training data using quantiles."""
346
+ # Use absolute values for magnitude binning
347
+ abs_vals = np.abs(values[~np.isnan(values)])
348
+ quantiles = np.linspace(0, 100, self.n_bins + 1)
349
+ self.bin_edges = np.percentile(abs_vals, quantiles)
350
+ return self
351
+
352
+ def __call__(self, value: float) -> str:
353
+ if self.bin_edges is None:
354
+ raise ValueError("Tokenizer not fitted. Call .fit() first.")
355
+ bin_idx = np.searchsorted(self.bin_edges[1:-1], abs(value))
356
+ return f"[{self.prefix}_{bin_idx:02d}]"
357
+
358
+ @property
359
+ def vocab(self) -> List[str]:
360
+ return [f"[{self.prefix}_{i:02d}]" for i in range(self.n_bins)]
361
+
362
+
363
+ class CalendarTokenizer:
364
+ """Decomposes timestamps into calendar components (Nubank-style)."""
365
+
366
+ FIELD_VOCABS = {
367
+ "month": ([f"[MON_{i:02d}]" for i in range(1, 13)], lambda dt: dt.month - 1),
368
+ "dow": ([f"[DOW_{i}]" for i in range(7)], lambda dt: dt.weekday()),
369
+ "dom": ([f"[DOM_{i:02d}]" for i in range(1, 32)], lambda dt: dt.day - 1),
370
+ "hour": ([f"[HOUR_{i:02d}]" for i in range(24)], lambda dt: dt.hour),
371
+ }
372
+
373
+ def __init__(self, fields: List[str] = None):
374
+ self.fields = fields or ["month", "dow", "dom", "hour"]
375
+
376
+ def __call__(self, timestamp) -> List[str]:
377
+ tokens = []
378
+ for field_name in self.fields:
379
+ vocab, extractor = self.FIELD_VOCABS[field_name]
380
+ idx = extractor(timestamp)
381
+ tokens.append(vocab[idx])
382
+ return tokens
383
+
384
+ @property
385
+ def vocab(self) -> List[str]:
386
+ all_tokens = []
387
+ for field_name in self.fields:
388
+ all_tokens.extend(self.FIELD_VOCABS[field_name][0])
389
+ return all_tokens
390
+
391
+
392
+ class CategoricalTokenizer:
393
+ """Maps categorical values to fixed vocabulary tokens."""
394
+
395
+ def __init__(self, categories: List[str], prefix: str = "CAT"):
396
+ self.prefix = prefix
397
+ self.token_map = {cat: f"[{prefix}_{i:03d}]" for i, cat in enumerate(categories)}
398
+ self.unk_token = f"[{prefix}_UNK]"
399
+
400
+ def __call__(self, value: str) -> str:
401
+ return self.token_map.get(value, self.unk_token)
402
+
403
+ @property
404
+ def vocab(self) -> List[str]:
405
+ return list(self.token_map.values()) + [self.unk_token]
406
+ ```
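+
+ A short usage sketch (synthetic amounts; the bucket token depends on the fitted quantiles):
+
+ ```python
+ import numpy as np
+ from datetime import datetime
+
+ amount_tok = MagnitudeBucketTokenizer(n_bins=21, prefix="AMT").fit(
+     np.random.lognormal(mean=3.0, sigma=1.5, size=100_000)  # synthetic training amounts
+ )
+ cal_tok = CalendarTokenizer(["month", "dow", "dom", "hour"])
+
+ print(SignTokenizer("AMT")(-42.90))           # "[AMT_NEG]"
+ print(amount_tok(42.90))                      # e.g. "[AMT_07]", depends on the fit
+ print(cal_tok(datetime(2025, 3, 14, 9, 30)))  # ["[MON_03]", "[DOW_4]", "[DOM_14]", "[HOUR_09]"]
+ ```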
407
+
408
+ #### Step 3: Composite Domain Tokenizer
409
+
410
+ Assemble per-field tokenizers into a complete domain tokenizer, wrapped as `PreTrainedTokenizerFast`:
411
+
412
+ ```python
413
+ # src/tokenizers/domain_tokenizer.py
414
+
415
+ from tokenizers import Tokenizer, models, trainers, pre_tokenizers
416
+ from transformers import PreTrainedTokenizerFast
417
+
418
+ class DomainTokenizerBuilder:
419
+ """Builds a HuggingFace-compatible tokenizer from a DomainSchema."""
420
+
421
+ def __init__(self, schema: DomainSchema):
422
+ self.schema = schema
423
+ self.field_tokenizers = {} # name β†’ field tokenizer instance
424
+ self._build_field_tokenizers()
425
+
426
+ def _build_field_tokenizers(self):
427
+ for field_spec in self.schema.fields:
428
+ if field_spec.field_type == FieldType.SIGN:
429
+ self.field_tokenizers[field_spec.name] = SignTokenizer(field_spec.name.upper())
430
+ elif field_spec.field_type == FieldType.NUMERICAL_CONTINUOUS:
431
+ self.field_tokenizers[field_spec.name] = MagnitudeBucketTokenizer(
432
+ n_bins=field_spec.n_bins, prefix=field_spec.name.upper()
433
+ )
434
+ elif field_spec.field_type == FieldType.TEMPORAL:
435
+ self.field_tokenizers[field_spec.name] = CalendarTokenizer(field_spec.calendar_fields)
436
+ # ... other types
437
+
438
+ def fit(self, data):
439
+ """Fit data-dependent tokenizers (magnitude bins, etc.)."""
440
+ for field_spec in self.schema.fields:
441
+ if field_spec.field_type == FieldType.NUMERICAL_CONTINUOUS:
442
+ values = [getattr(event, field_spec.name) for event in data]
443
+ self.field_tokenizers[field_spec.name].fit(np.array(values))
444
+ return self
445
+
446
+ def build_hf_tokenizer(self, text_corpus=None, bpe_vocab_size=8000) -> PreTrainedTokenizerFast:
447
+ """Build a complete HuggingFace tokenizer.
448
+
449
+ 1. Collect all domain special tokens
450
+ 2. Train BPE on text fields (if any)
451
+ 3. Merge into a single PreTrainedTokenizerFast
452
+ """
453
+ # Collect all special tokens from field tokenizers
454
+ all_special_tokens = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]", "[BOS]", "[EOS]"]
455
+ for name, tok in self.field_tokenizers.items():
456
+ if hasattr(tok, 'vocab'):
457
+ all_special_tokens.extend(tok.vocab)
458
+
459
+ # Train BPE on text fields
460
+ bpe_tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
461
+ bpe_tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
462
+ trainer = trainers.BpeTrainer(
463
+ vocab_size=bpe_vocab_size,
464
+ special_tokens=all_special_tokens,
465
+ min_frequency=2,
466
+ )
467
+         if text_corpus:
+             bpe_tokenizer.train_from_iterator(text_corpus, trainer=trainer)
+         else:
+             # No text fields to learn BPE merges from: still register the domain
+             # special tokens so they receive stable ids.
+             bpe_tokenizer.add_special_tokens(all_special_tokens)
469
+
470
+ # Wrap as HuggingFace tokenizer
471
+ hf_tokenizer = PreTrainedTokenizerFast(
472
+ tokenizer_object=bpe_tokenizer,
473
+ bos_token="[BOS]",
474
+ eos_token="[EOS]",
475
+ pad_token="[PAD]",
476
+ unk_token="[UNK]",
477
+ mask_token="[MASK]",
478
+ )
479
+
480
+ return hf_tokenizer
481
+
482
+ def tokenize_event(self, event) -> List[str]:
483
+ """Convert a single domain event into a list of token strings."""
484
+ tokens = []
485
+ for field_spec in self.schema.fields:
486
+ value = getattr(event, field_spec.name, None)
487
+ if value is None:
488
+ tokens.append("[UNK]")
489
+ continue
490
+ tok = self.field_tokenizers[field_spec.name]
491
+ result = tok(value)
492
+ if isinstance(result, list):
493
+ tokens.extend(result)
494
+ else:
495
+ tokens.append(result)
496
+ return tokens
497
+ ```
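+
+ An end-to-end usage sketch (the `training_events` list and the Hub repo id are placeholders):
+
+ ```python
+ builder = DomainTokenizerBuilder(FINANCE_SCHEMA).fit(training_events)
+ hf_tok = builder.build_hf_tokenizer(
+     text_corpus=(e.description for e in training_events),
+     bpe_vocab_size=8000,
+ )
+
+ tokens = builder.tokenize_event(training_events[0])  # list of token strings
+ ids = hf_tok.convert_tokens_to_ids(tokens)           # integer ids for the model
+
+ hf_tok.push_to_hub("org/finance-domain-tokenizer")   # hypothetical repo id
+ ```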
498
+
499
+ ### Phase 2B: Model Architecture (Weeks 3–5)
500
+
501
+ #### Step 4: GPT-style Causal Transformer (NoPE)
502
+
503
+ Implement as a HuggingFace-compatible `PreTrainedModel`:
504
+
505
+ ```python
506
+ # src/models/configuration_domain_transformer.py
507
+
508
+ from transformers import PretrainedConfig
509
+
510
+ class DomainTransformerConfig(PretrainedConfig):
511
+ model_type = "domain_transformer"
512
+
513
+ def __init__(
514
+ self,
515
+ vocab_size: int = 32000,
516
+ hidden_size: int = 256, # 256 = 24M params, 1024 = 330M (Nubank sizes)
517
+ num_hidden_layers: int = 24, # Nubank uses 24 for both sizes
518
+ num_attention_heads: int = 16, # Nubank uses 16 for both sizes
519
+ intermediate_size: int = None, # defaults to 4 * hidden_size
520
+ max_position_embeddings: int = 2048,
521
+ dropout: float = 0.1,
522
+ use_positional_encoding: bool = False, # NoPE by default!
523
+ **kwargs
524
+ ):
525
+ self.vocab_size = vocab_size
526
+ self.hidden_size = hidden_size
527
+ self.num_hidden_layers = num_hidden_layers
528
+ self.num_attention_heads = num_attention_heads
529
+ self.intermediate_size = intermediate_size or 4 * hidden_size
530
+ self.max_position_embeddings = max_position_embeddings
531
+ self.dropout = dropout
532
+ self.use_positional_encoding = use_positional_encoding
533
+ super().__init__(**kwargs)
534
+ ```
535
+
536
+ ```python
537
+ # src/models/modeling_domain_transformer.py
538
+
539
+ import torch
540
+ import torch.nn as nn
541
+ from transformers import PreTrainedModel
542
+ from transformers.modeling_outputs import CausalLMOutputWithPast
543
+
544
+ class DomainTransformerBlock(nn.Module):
545
+ def __init__(self, config):
546
+ super().__init__()
547
+ self.ln1 = nn.LayerNorm(config.hidden_size)
548
+ self.attn = nn.MultiheadAttention(
549
+ config.hidden_size, config.num_attention_heads,
550
+ dropout=config.dropout, batch_first=True
551
+ )
552
+ self.ln2 = nn.LayerNorm(config.hidden_size)
553
+ self.mlp = nn.Sequential(
554
+ nn.Linear(config.hidden_size, config.intermediate_size),
555
+ nn.GELU(),
556
+ nn.Linear(config.intermediate_size, config.hidden_size),
557
+ nn.Dropout(config.dropout),
558
+ )
559
+
560
+     def forward(self, x, key_padding_mask=None):
+         # Pre-norm architecture with an explicit causal mask.
+         # key_padding_mask: (batch, seq) bool, True where a position is padding.
+         seq_len = x.size(1)
+         causal_mask = torch.triu(
+             torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
+         )
+         h = self.ln1(x)
+         h, _ = self.attn(h, h, h, attn_mask=causal_mask,
+                          key_padding_mask=key_padding_mask, need_weights=False)
+         x = x + h
+         x = x + self.mlp(self.ln2(x))
+         return x
567
+
568
+
569
+ class DomainTransformerForCausalLM(PreTrainedModel):
570
+ config_class = DomainTransformerConfig
571
+
572
+ def __init__(self, config):
573
+ super().__init__(config)
574
+ self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
575
+
576
+ # NoPE: no positional encoding by default (Kazemnejad et al. 2023)
577
+ if config.use_positional_encoding:
578
+ self.embed_positions = nn.Embedding(
579
+ config.max_position_embeddings, config.hidden_size
580
+ )
581
+ else:
582
+ self.embed_positions = None
583
+
584
+ self.drop = nn.Dropout(config.dropout)
585
+ self.blocks = nn.ModuleList([
586
+ DomainTransformerBlock(config)
587
+ for _ in range(config.num_hidden_layers)
588
+ ])
589
+ self.ln_f = nn.LayerNorm(config.hidden_size)
590
+ self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
591
+
592
+ # Weight tying (standard for small models)
593
+ self.lm_head.weight = self.embed_tokens.weight
594
+
595
+ self.post_init()
596
+
597
+     def forward(self, input_ids, attention_mask=None, labels=None, **kwargs):
+         x = self.embed_tokens(input_ids)
+
+         if self.embed_positions is not None:
+             positions = torch.arange(input_ids.size(1), device=input_ids.device)
+             x = x + self.embed_positions(positions)
+
+         x = self.drop(x)
+
+         # HF convention: attention_mask is 1 for real tokens, 0 for padding.
+         # nn.MultiheadAttention expects the inverse as a boolean key_padding_mask.
+         key_padding_mask = (attention_mask == 0) if attention_mask is not None else None
+
+         for block in self.blocks:
+             x = block(x, key_padding_mask=key_padding_mask)
+
+         x = self.ln_f(x)
+         logits = self.lm_head(x)
+
+         loss = None
+         if labels is not None:
+             shift_logits = logits[..., :-1, :].contiguous()
+             shift_labels = labels[..., 1:].contiguous()
+             loss = nn.functional.cross_entropy(
+                 shift_logits.view(-1, shift_logits.size(-1)),
+                 shift_labels.view(-1),
+                 ignore_index=-100,
+             )
+
+         # Expose the final hidden states so downstream heads (e.g., joint fusion)
+         # can consume representations instead of vocabulary logits.
+         return CausalLMOutputWithPast(loss=loss, logits=logits, hidden_states=(x,))
623
+
624
+ # Register with AutoClass for Hub compatibility
625
+ DomainTransformerConfig.register_for_auto_class()
626
+ DomainTransformerForCausalLM.register_for_auto_class("AutoModelForCausalLM")
627
+ ```
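+
+ A quick instantiation check of the small configuration (the vocab size is illustrative; the exact count depends on the final tokenizer vocabulary):
+
+ ```python
+ config_24m = DomainTransformerConfig(vocab_size=32000, hidden_size=256,
+                                      num_hidden_layers=24, num_attention_heads=16)
+ model_24m = DomainTransformerForCausalLM(config_24m)
+
+ n_params = sum(p.numel() for p in model_24m.parameters())
+ print(f"{n_params / 1e6:.1f}M parameters")  # ~27M with this vocab; a smaller domain vocab (~16k) lands near 24M
+ ```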
628
+
629
+ #### Step 5: PLR Numerical Embeddings (for Joint Fusion)
630
+
631
+ Port from Yandex's implementation:
632
+
633
+ ```python
634
+ # src/models/plr_embeddings.py
635
+
636
+ import torch
637
+ import torch.nn as nn
638
+ import math
639
+
640
+ class PeriodicLinearReLU(nn.Module):
641
+ """PLR numerical embeddings (Gorishniy et al. 2022).
642
+
643
+     Maps scalar x → [sin(2π·w·x + b), cos(2π·w·x + b)] → Linear → ReLU
644
+ Frequencies w and phases b are LEARNED parameters.
645
+ """
646
+
647
+ def __init__(self, n_features: int, n_frequencies: int = 64, embedding_dim: int = 64):
648
+ super().__init__()
649
+ self.n_features = n_features
650
+ self.n_frequencies = n_frequencies
651
+
652
+ # Learnable frequencies and phases (per feature)
653
+ self.frequencies = nn.Parameter(
654
+ torch.randn(n_features, n_frequencies) * 0.01
655
+ )
656
+ self.phases = nn.Parameter(
657
+ torch.zeros(n_features, n_frequencies)
658
+ )
659
+
660
+ # Linear projection: 2*n_frequencies β†’ embedding_dim
661
+ self.linear = nn.Linear(2 * n_frequencies, embedding_dim)
662
+
663
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
664
+ """
665
+ Args:
666
+ x: (batch, n_features) β€” raw scalar feature values
667
+ Returns:
668
+ (batch, n_features, embedding_dim)
669
+ """
670
+ # x: (B, F) β†’ (B, F, 1)
671
+ x = x.unsqueeze(-1)
672
+
673
+ # Periodic encoding: (B, F, n_freq)
674
+ angles = 2 * math.pi * self.frequencies.unsqueeze(0) * x + self.phases.unsqueeze(0)
675
+ periodic = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1) # (B, F, 2*n_freq)
676
+
677
+ # Linear + ReLU: (B, F, embedding_dim)
678
+ return torch.relu(self.linear(periodic))
679
+ ```
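+
+ A quick shape check (illustrative sizes):
+
+ ```python
+ import torch
+
+ plr = PeriodicLinearReLU(n_features=5, n_frequencies=64, embedding_dim=64)
+ x = torch.randn(8, 5)   # 8 samples, 5 raw tabular features
+ print(plr(x).shape)     # torch.Size([8, 5, 64])
+ ```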
680
+
681
+ ### Phase 2C: Pre-training (Weeks 5–7)
682
+
683
+ #### Step 6: Data Pipeline
684
+
685
+ ```python
686
+ # src/training/data_pipeline.py
687
+
688
+ from torch.utils.data import Dataset
689
+ from typing import List
690
+
691
+ class DomainSequenceDataset(Dataset):
692
+ """Converts user event sequences into token sequences for CLM training."""
693
+
694
+ def __init__(self, user_sequences, tokenizer_builder, hf_tokenizer, max_length=2048):
695
+ self.user_sequences = user_sequences
696
+ self.tokenizer_builder = tokenizer_builder
697
+ self.hf_tokenizer = hf_tokenizer
698
+ self.max_length = max_length
699
+
700
+ def __len__(self):
701
+ return len(self.user_sequences)
702
+
703
+ def __getitem__(self, idx):
704
+ events = self.user_sequences[idx]
705
+
706
+ # Tokenize each event into token strings
707
+ token_strings = []
708
+ for event in events:
709
+ event_tokens = self.tokenizer_builder.tokenize_event(event)
710
+ token_strings.extend(event_tokens)
711
+
712
+ # Convert token strings to IDs via HF tokenizer
713
+ encoding = self.hf_tokenizer(
714
+ " ".join(token_strings),
715
+ max_length=self.max_length,
716
+ truncation=True,
717
+ padding="max_length",
718
+ return_tensors="pt",
719
+ )
720
+
721
+         input_ids = encoding["input_ids"].squeeze(0)
+         attention_mask = encoding["attention_mask"].squeeze(0)
+
+         # CLM: labels are the inputs themselves; the model shifts them by one internally.
+         # Mask padding positions so they do not contribute to the loss.
+         labels = input_ids.clone()
+         labels[attention_mask == 0] = -100
+
+         return {
+             "input_ids": input_ids,
+             "labels": labels,
+             "attention_mask": attention_mask,
+         }
728
+ ```
729
+
730
+ #### Step 7: Pre-training with HuggingFace Trainer
731
+
732
+ ```python
733
+ # src/training/pretrain.py
734
+
735
+ from transformers import Trainer, TrainingArguments
736
+
737
+ def pretrain_domain_model(
738
+ model,
739
+ train_dataset,
740
+ eval_dataset=None,
741
+ output_dir="./checkpoints",
742
+ hub_model_id="org/domain-model-24m",
743
+ num_epochs=3,
744
+ batch_size=64,
745
+ learning_rate=3e-4,
746
+ context_length=2048,
747
+ ):
748
+ training_args = TrainingArguments(
749
+ output_dir=output_dir,
750
+ num_train_epochs=num_epochs,
751
+ per_device_train_batch_size=batch_size,
752
+ gradient_accumulation_steps=4,
753
+ learning_rate=learning_rate,
754
+ lr_scheduler_type="cosine",
755
+ warmup_ratio=0.05,
756
+ weight_decay=0.01,
757
+ logging_strategy="steps",
758
+ logging_steps=100,
759
+ logging_first_step=True,
760
+ disable_tqdm=True, # plain text logging for cloud
761
+ eval_strategy="steps" if eval_dataset else "no",
762
+ eval_steps=500,
763
+ save_strategy="steps",
764
+ save_steps=1000,
765
+ save_total_limit=3,
766
+ push_to_hub=True,
767
+ hub_model_id=hub_model_id,
768
+ bf16=True,
769
+ dataloader_num_workers=4,
770
+ report_to="trackio",
771
+ )
772
+
773
+ trainer = Trainer(
774
+ model=model,
775
+ args=training_args,
776
+ train_dataset=train_dataset,
777
+ eval_dataset=eval_dataset,
778
+ )
779
+
780
+ trainer.train()
781
+ trainer.push_to_hub()
782
+ ```
783
+
784
+ ### Phase 2D: Joint Fusion Fine-tuning (Weeks 7–9)
785
+
786
+ #### Step 8: nuFormer-style Joint Fusion
787
+
788
+ ```python
789
+ # src/models/joint_fusion.py
790
+
791
+ import torch
792
+ import torch.nn as nn
793
+
794
+ class DCNv2CrossLayer(nn.Module):
795
+ """Single cross layer from DCN V2 (Wang et al. 2021)."""
796
+
797
+ def __init__(self, dim):
798
+ super().__init__()
799
+ self.weight = nn.Linear(dim, dim, bias=True)
800
+
801
+ def forward(self, x0, x):
802
+ return x0 * self.weight(x) + x # element-wise product with anchor
803
+
804
+
805
+ class JointFusionModel(nn.Module):
806
+ """nuFormer-style: Transaction Transformer + DCNv2(PLR) β†’ Joint Prediction.
807
+
808
+ Architecture:
809
+ Transaction Sequence β†’ Pre-trained DomainTransformer β†’ user_embedding
810
+ Tabular Features β†’ PLR β†’ DCNv2 β†’ tab_embedding
811
+ Concatenate β†’ MLP Head β†’ prediction
812
+ """
813
+
814
+ def __init__(self, transformer_model, n_tabular_features, n_classes=1,
815
+ plr_frequencies=64, dcn_layers=3, hidden_dim=256):
816
+ super().__init__()
817
+
818
+ self.transformer = transformer_model # pre-trained, unfrozen for fine-tuning
819
+ transformer_dim = transformer_model.config.hidden_size
820
+
821
+ # Tabular branch: PLR β†’ DCNv2
822
+ self.plr = PeriodicLinearReLU(n_tabular_features, plr_frequencies, hidden_dim)
823
+ tab_input_dim = n_tabular_features * hidden_dim
824
+
825
+ self.dcn_layers = nn.ModuleList([
826
+ DCNv2CrossLayer(tab_input_dim) for _ in range(dcn_layers)
827
+ ])
828
+ self.dcn_deep = nn.Sequential(
829
+ nn.Linear(tab_input_dim, hidden_dim),
830
+ nn.ReLU(),
831
+ nn.Linear(hidden_dim, hidden_dim),
832
+ nn.ReLU(),
833
+ )
834
+
835
+ # Joint head
836
+ self.head = nn.Sequential(
837
+ nn.Linear(transformer_dim + hidden_dim, hidden_dim),
838
+ nn.ReLU(),
839
+ nn.Dropout(0.1),
840
+ nn.Linear(hidden_dim, n_classes),
841
+ )
842
+
843
+ def forward(self, input_ids, attention_mask, tabular_features, labels=None):
844
+         # Transaction branch: hidden state of the last non-padding token.
+         # (Raw LM logits live in vocab space; the fusion head expects hidden_size features.)
+         transformer_output = self.transformer(input_ids, attention_mask=attention_mask)
+         hidden = transformer_output.hidden_states[-1]                     # (B, T, hidden_size)
+         last_idx = attention_mask.sum(dim=1).long() - 1                   # last real token per sample
+         user_embedding = hidden[torch.arange(hidden.size(0)), last_idx]   # (B, hidden_size)
847
+
848
+ # Tabular branch: PLR β†’ flatten β†’ DCNv2
849
+ tab_embedded = self.plr(tabular_features) # (B, F, D)
850
+ tab_flat = tab_embedded.view(tab_embedded.size(0), -1) # (B, F*D)
851
+
852
+ x0 = tab_flat
853
+ x = tab_flat
854
+ for cross_layer in self.dcn_layers:
855
+ x = cross_layer(x0, x)
856
+ tab_output = self.dcn_deep(x) # (B, hidden_dim)
857
+
858
+ # Joint fusion
859
+ combined = torch.cat([user_embedding, tab_output], dim=-1)
860
+ logits = self.head(combined)
861
+
862
+ loss = None
863
+ if labels is not None:
864
+ loss = nn.functional.binary_cross_entropy_with_logits(logits.squeeze(-1), labels.float())
865
+
866
+ return {"loss": loss, "logits": logits}
867
+ ```
868
+
869
+ ### Phase 3: Domain Demos (Weeks 9–12)
870
+
871
+ | Week | Deliverable | Hardware |
872
+ |------|------------|----------|
873
+ | 9–10 | **Finance demo:** Transaction tokenizer + 24M model pre-trained on synthetic/public financial data + fraud detection fine-tuning | a10g-large |
874
+ | 10–11 | **E-commerce demo:** Event tokenizer + 24M model pre-trained on Amazon review sequences + next-purchase prediction | a10g-large |
875
+ | 11–12 | **Evaluation & benchmarking:** Compare domain tokenizer vs. text serialization vs. LightGBM baselines on each domain | a10g-large |
876
+
877
+ ### Phase 4: Scale & Optimize (Weeks 12+)
878
+
879
+ | Task | Details |
880
+ |------|---------|
881
+ | Scale to 330M params | Increase `hidden_size` to 1024, train on a100-large |
882
+ | `torch.compile()` | Apply to attention and MLP blocks for 10–30% speedup |
883
+ | ONNX export | `torch.onnx.export()` for production serving |
884
+ | Context window experiments | Ablate 512/1024/2048/4096 context lengths |
885
+ | Data source ablation | Test impact of different event types (Nubank found adding low-signal sources hurts) |
886
+ | ActionPiece vocabulary | Implement BPE-like cross-field merging on top of per-field tokens |
887
+
888
+ ---
889
+
890
+ ## Appendix A: Framework Usage Across Reference Papers
891
+
892
+ | Paper | ArXiv | Framework | Verbatim Evidence |
893
+ |-------|-------|-----------|-------------------|
894
+ | nuFormer (Nubank) | 2507.23267 | PyTorch + HF (inferred) | All dependencies are PyTorch-based |
895
+ | TIGER (Google) | 2305.05065 | JAX + T5X | "We use the open-sourced T5X framework" |
896
+ | ActionPiece (DeepMind) | 2502.13581 | PyTorch + HF | "HuggingFace Transformers and PyTorch" (Appendix H) |
897
+ | RecFormer | 2305.13731 | PyTorch + HF Longformer | "Longformer implemented by Huggingface" (§3.1.4) |
898
+ | Banking TF | 2410.08243 | PyTorch | "Pytorch backend is used" (Appendix B) |
899
+ | PLR Embeddings (Yandex) | 2203.05556 | PyTorch | Repository: pure PyTorch + scikit-learn |
900
+ | KL3M Tokenizers | 2503.17247 | HF `tokenizers` + PyTorch | "tokenizers" BPE for HF compatibility |
901
+
902
+ ---
903
+
904
+ ## Appendix B: Head-to-Head Comparison Matrix
905
+
906
+ | Criterion | PyTorch + HF | JAX/Flax NNX | Keras 3 + JAX |
+ |-----------|-------------|-------------|---------------|
+ | Custom domain tokenizer | ✅ `PreTrainedTokenizerFast` | ❌ Build from scratch | ❌ Build from scratch |
+ | HF Trainer integration | ✅ Native | ❌ Not compatible | ❌ Not compatible |
+ | Hub push/pull | ✅ `push_to_hub()` | ⚠️ Works, fragmented | ⚠️ Limited |
+ | PEFT/LoRA | ✅ Drop-in | ❌ Manual | ❌ Manual |
+ | ONNX export | ✅ One-line | ❌ Multi-hop | ⚠️ TF backend required |
+ | TGI/vLLM serving | ✅ First-class | ❌ Not supported | ❌ Not supported |
+ | TPU training | ⚠️ PyTorch/XLA (overhead) | ✅ Native | ✅ Native |
+ | JIT compilation | ✅ `torch.compile()` | ✅ `@nnx.jit` | ✅ XLA via JAX |
+ | Dynamic shapes (NLP) | ✅ Native | ❌ Recompiles | ❌ Recompiles |
+ | Debugging | ✅ Eager mode, std debugger | ⚠️ Challenging | ⚠️ Challenging |
+ | Reference implementations | 5/6 papers | 1/6 papers | 0/6 papers |
+ | Community/hiring pool | 🟢 Large | 🟡 Small | 🟡 Small |
920
+
921
+ ---
922
+
923
+ ## Appendix C: Key Code Patterns
924
+
925
+ ### Adding Domain Special Tokens to an Existing Tokenizer
926
+
927
+ ```python
928
+ from transformers import AutoTokenizer
929
+
930
+ tokenizer = AutoTokenizer.from_pretrained("gpt2") # or any base tokenizer
931
+
932
+ # Add all domain special tokens
933
+ special_tokens = {
934
+ "additional_special_tokens": [
935
+ # Amount tokens (Nubank-style)
936
+ "[AMT_POS]", "[AMT_NEG]",
937
+ *[f"[AMT_{i:02d}]" for i in range(21)],
938
+ # Calendar tokens
939
+ *[f"[MON_{i:02d}]" for i in range(1, 13)],
940
+ *[f"[DOW_{i}]" for i in range(7)],
941
+ *[f"[DOM_{i:02d}]" for i in range(1, 32)],
942
+ *[f"[HOUR_{i:02d}]" for i in range(24)],
943
+ ]
944
+ }
945
+ num_added = tokenizer.add_special_tokens(special_tokens)
946
+ print(f"Added {num_added} domain tokens. Vocab size: {len(tokenizer)}")
947
+
948
+ # CRITICAL: resize the model's input embeddings to match the enlarged vocabulary
+ # (assumes `model` was loaded beforehand, e.g. AutoModelForCausalLM.from_pretrained("gpt2"))
+ model.resize_token_embeddings(len(tokenizer))
950
+ ```
951
+
952
+ ### Registering a Custom Model for Hub Deployment
953
+
954
+ ```python
955
+ # In your package's __init__.py or a registration script:
956
+ from transformers import AutoConfig, AutoModelForCausalLM
957
+
958
+ from .configuration_domain_transformer import DomainTransformerConfig
959
+ from .modeling_domain_transformer import DomainTransformerForCausalLM
960
+
961
+ # Register so AutoClass can find your model
962
+ AutoConfig.register("domain_transformer", DomainTransformerConfig)
963
+ AutoModelForCausalLM.register(DomainTransformerConfig, DomainTransformerForCausalLM)
964
+
965
+ # Enable push_to_hub with custom code
966
+ DomainTransformerConfig.register_for_auto_class()
967
+ DomainTransformerForCausalLM.register_for_auto_class("AutoModelForCausalLM")
968
+
969
+ # Push: uploads configuration.py, modeling.py, config.json, model.safetensors
970
+ model.push_to_hub("org/domain-transformer-24m")
971
+
972
+ # Load anywhere:
973
+ model = AutoModelForCausalLM.from_pretrained("org/domain-transformer-24m", trust_remote_code=True)
974
+ ```
975
+
976
+ ---
977
+
978
+ *This ADR is a living document and will be updated as implementation progresses.*