# Phase 2A–2C Implementation Report
> **domainTokenizer v0.3.0** — Core library complete: tokenizers, models, pre-training pipeline
>
> **124 tests passing** (72 tokenizer + 33 model + 19 training)
>
> *April 2026*
---
## Overview
Phase 2 implements the core domainTokenizer library — everything needed to go from raw domain events (financial transactions, e-commerce actions, clinical encounters) to a pre-trained Transformer foundation model. The implementation directly follows the validated patterns from Nubank's nuFormer (arXiv:2507.23267), with architecture decisions grounded in 6 audited reference papers.
The library is organized as three layers, each built and tested independently before composing into the next:
```
Phase 2A: Tokenizers  →  Phase 2B: Models  →  Phase 2C: Training Pipeline
  (schema → tokens)       (tokens → loss)      (data → Trainer → checkpoints)
```
---
## Phase 2A: Domain Tokenizer Library (Weeks 1–3)
### What Was Built
A declarative schema system and 5 per-field tokenizers that convert raw domain events into HuggingFace-compatible token sequences.
| Component | Purpose | Output |
|-----------|---------|--------|
| `DomainSchema` + `FieldSpec` | Declarative event definition — fields, types, bin counts | Schema object |
| `SignTokenizer` | Credit/debit, +/- | `79.99 → [AMT_SIGN_POS]` |
| `MagnitudeBucketTokenizer` | Quantile-based numerical bins (fits on data) | `79.99 → [AMT_15]` |
| `CalendarTokenizer` | Timestamp → month/dow/dom/hour decomposition | `Mar 15 2pm → 4 tokens` |
| `CategoricalTokenizer` | Fixed category mapping with UNK fallback | `"purchase" → [EVT_001]` |
| `DiscreteNumericalTokenizer` | Small integers with overflow | `3 → [QTY_03]`, `15 → [QTY_OVER]` |
| `DomainTokenizerBuilder` | Assembles per-field tokenizers → HF `PreTrainedTokenizerFast` | HF tokenizer |
Three predefined schemas ship out of the box:
- `FINANCE_SCHEMA` — 97 domain tokens (Nubank-compatible: sign + amount bins + calendar)
- `ECOMMERCE_SCHEMA` — event type + price + quantity + category + calendar + product title
- `HEALTHCARE_SCHEMA` — clinical event type + cost + severity + provider + calendar + description
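To make the declarative pattern concrete, here is a minimal self-contained sketch of what a schema declaration looks like. The `FieldSpec`/`DomainSchema` shapes below are illustrative stand-ins, not the library's actual signatures:

```python
from dataclasses import dataclass
from enum import Enum

# Illustrative sketch only -- the real FieldSpec/DomainSchema
# signatures may differ from these stand-ins.
class FieldType(Enum):
    SIGN = "sign"
    MAGNITUDE = "magnitude"
    CALENDAR = "calendar"
    CATEGORICAL = "categorical"
    TEXT = "text"

@dataclass
class FieldSpec:
    name: str
    field_type: FieldType
    num_bins: int = 0        # used by MAGNITUDE fields
    categories: tuple = ()   # used by CATEGORICAL fields

@dataclass
class DomainSchema:
    name: str
    fields: list

toy_finance = DomainSchema(
    name="finance",
    fields=[
        FieldSpec("amount_sign", FieldType.SIGN),
        FieldSpec("amount", FieldType.MAGNITUDE, num_bins=64),
        FieldSpec("timestamp", FieldType.CALENDAR),
        FieldSpec("merchant_category", FieldType.CATEGORICAL,
                  categories=("grocery", "travel", "dining")),
        FieldSpec("description", FieldType.TEXT),
    ],
)
print(len(toy_finance.fields))  # 5 fields, one tokenizer each
```

A new domain is just another `DomainSchema` value like `toy_finance`; the factory pattern described below turns each `FieldSpec` into the matching tokenizer.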
### Key Technical Decisions
1. **Hybrid vocabulary: special tokens + BPE.** Following Nubank exactly, structured fields (amounts, dates, categories) become single special tokens while free-text fields (descriptions, product titles) use standard BPE. This compresses each event to ~14 tokens vs ~35–50 with pure text serialization, tripling the number of events that fit in a 2048-token context window.
2. **Quantile-based magnitude binning (not linear).** The `MagnitudeBucketTokenizer` places bin edges at quantiles of the absolute values, not at uniform intervals. Financial data is heavily skewed (many small transactions, few large ones), so uniform bins would leave most of the range nearly empty. Quantile bins give each bin roughly equal representation in the training data, maximizing the model's ability to distinguish between common transaction sizes.
3. **Separate sign and magnitude tokenization.** Following Nubank's `ϕ_sign` + `ϕ_amt` pattern, the sign (credit/debit) is tokenized independently from the magnitude. This lets the model learn that "a $500 inflow" and "a $500 outflow" share magnitude semantics but differ in direction — without wasting bins on both positive and negative ranges.
4. **Schema-driven factory pattern.** Field tokenizers are created automatically from `FieldSpec` declarations via `create_field_tokenizer()`. Adding a new domain requires only defining a `DomainSchema` — no code changes to the tokenizer pipeline. This enables rapid domain iteration (finance → e-commerce → healthcare) without engineering overhead.
5. **Data-dependent tokenizers require explicit fitting.** `MagnitudeBucketTokenizer` must be `.fit()` on training data before use. Calling `.build()` on an unfitted schema raises `RuntimeError`. This prevents a subtle bug where bin edges are computed on test data, leaking information.
6. **HuggingFace-native output.** The `DomainTokenizerBuilder.build()` method produces a standard `PreTrainedTokenizerFast` — the same type returned by `AutoTokenizer.from_pretrained()`. This means zero adaptation for HF Trainer, `push_to_hub()`, `save_pretrained()`, ONNX export, etc.
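The quantile binning from decision 2 can be sketched in a few lines of plain Python. This is illustrative only: the real `MagnitudeBucketTokenizer` exposes `.fit()` and tokenize methods, not these hypothetical helpers.

```python
import bisect

def fit_quantile_edges(amounts, num_bins):
    """Place bin edges at the q/num_bins quantiles of |amount|,
    computed from training data (the .fit() step)."""
    values = sorted(abs(a) for a in amounts)
    edges = []
    for q in range(1, num_bins):
        idx = int(q * len(values) / num_bins)
        edges.append(values[min(idx, len(values) - 1)])
    return edges

def to_bin(amount, edges):
    """Map a value to its bin index; bisect keeps lookup O(log n)."""
    return bisect.bisect_right(edges, abs(amount))

# Skewed data: many small transactions, a few large ones.
train = [1.5, 2.0, 3.0, 4.5, 5.0, 8.0, 9.5, 12.0, 450.0, 9000.0]
edges = fit_quantile_edges(train, num_bins=4)
# Each bin captures ~1/4 of the training mass despite the skew.
print([to_bin(a, edges) for a in [2.0, 10.0, 500.0]])  # → [0, 2, 3]
```

Note that a uniform binning of the same range (0 to 9000) would put nine of the ten training values into bin 0, which is exactly the failure mode decision 2 avoids.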
### Test Results
**72 tests passing** covering: field spec validation, all 5 tokenizer types (including edge cases: NaN, None, overflow, unknown categories), predefined schemas (including Nubank 97-token compatibility check), builder fit/build/tokenize/encode pipeline, and full end-to-end sequence encoding.
---
## Phase 2B: Model Architecture (Weeks 3–5)
### What Was Built
A GPT-style causal decoder Transformer registered as a HuggingFace `PreTrainedModel`, plus numerical embeddings and joint fusion components.
| Component | Purpose | Based On |
|-----------|---------|----------|
| `DomainTransformerConfig` | HF-compatible config with presets (`"24m"`, `"85m"`, `"330m"`) | Nubank nuFormer sizes |
| `DomainTransformerForCausalLM` | Causal decoder: NoPE, pre-norm, SDPA attention, weight tying | NoPE (arXiv:2305.19466) + GPT-2 |
| `PeriodicLinearReLU` | Learned sin/cos embeddings for numerical features | Gorishniy et al. (arXiv:2203.05556) |
| `DCNv2` + `JointFusionModel` | Transformer + tabular feature fusion for fine-tuning | Nubank + DCN V2 (arXiv:2008.13535) |
### Key Technical Decisions
1. **NoPE (No Positional Encoding).** Following Kazemnejad et al. (NeurIPS 2023), the model uses no positional encoding at all: no absolute embeddings, no RoPE, no ALiBi. In their length-generalization benchmarks, NoPE outperformed every PE scheme tested. For domain sequences where users have vastly different history lengths (20 to 2000+ events), length generalization is critical. The model implicitly learns relative position from the causal attention mask pattern.
2. **`F.scaled_dot_product_attention` with `is_causal=True`, not `nn.MultiheadAttention`.** PyTorch's `nn.MultiheadAttention(is_causal=True)` has a known bug requiring an explicit `attn_mask` even when `is_causal=True` is set. We implement attention directly using `F.scaled_dot_product_attention`, which auto-dispatches to FlashAttention/cuDNN when available on CUDA, and uses an efficient C++ kernel on CPU.
3. **HF attention mask conversion.** HuggingFace Trainer sends attention masks as `(B, T)` long tensors (1=attend, 0=pad). PyTorch SDPA requires either `None` (use `is_causal`) or a float mask where masked positions are `-inf`. The attention module handles this conversion: when a mask is provided, it's expanded to `(B, 1, 1, T)`, converted to float, and inverted (`0 → -inf`, `1 → 0.0`). When no mask is provided, `is_causal=True` handles causality for free.
4. **Weight tying via HF v5.7+ dict format.** The `_tied_weights_keys` API changed from a list to a dict in transformers 5.7. We use `{"lm_head.weight": "model.embed_tokens.weight"}` with proper `get/set_input_embeddings` and `get/set_output_embeddings` implementations. `post_init()` handles the actual tying.
5. **Pre-norm architecture (LayerNorm before attention/FFN).** GPT-2 and most modern LLMs use pre-norm. This makes training more stable than post-norm, especially at the 24M–330M scale where we don't have the luxury of extensive hyperparameter tuning.
6. **`get_user_embedding()` method on the CausalLM class.** For downstream tasks (classification, joint fusion), we need a single vector representing the user's transaction history. This method extracts the hidden state at the last non-padding position — the standard approach for decoder-only models. It uses `attention_mask.sum(dim=1) - 1` to find the last real token position per sequence.
7. **PLR frequencies and phases are learned parameters.** Unlike fixed Fourier features, PLR initializes frequencies and phases as trainable `nn.Parameter` tensors. This lets the model discover the most informative frequency decomposition for each numerical feature during training — crucial for financial data where relevant scales span 4+ orders of magnitude.
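Decisions 3 and 6 both come down to small index arithmetic on the `(B, T)` mask. A plain-Python sketch of the two transforms (real code operates on tensors, and these helper names are illustrative, not the library's API):

```python
NEG_INF = float("-inf")

def hf_mask_to_additive(mask_row):
    """Decision 3: convert one row of an HF (B, T) attention mask
    to the additive float form SDPA expects:
    1 -> 0.0 (attend), 0 -> -inf (zeroed out by softmax)."""
    return [0.0 if m == 1 else NEG_INF for m in mask_row]

def last_real_position(mask_row):
    """Decision 6: index of the last non-padding token,
    i.e. attention_mask.sum(dim=1) - 1 for one sequence."""
    return sum(mask_row) - 1

# Batch of 2 sequences; the second is padded at the last two positions.
batch_mask = [[1, 1, 1, 1], [1, 1, 0, 0]]
additive = [hf_mask_to_additive(row) for row in batch_mask]
print(additive[1])                        # [0.0, 0.0, -inf, -inf]
print(last_real_position(batch_mask[1]))  # 1
```

In the actual model the additive mask is also expanded to `(B, 1, 1, T)` so it broadcasts across heads and query positions, and the `last_real_position` index is used to gather the user-embedding hidden state.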
### Test Results
**33 tests passing** covering: config presets and serialization, base model forward shapes, CausalLM with/without labels, loss differentiability, weight tying verification, user embedding extraction (with and without mask), parameter counts for tiny and 24M configs, gradient checkpointing, causal masking verification, PLR shapes and gradients, DCNv2 cross layers, JointFusion binary and multiclass, and full tokenizer→model→loss integration.
---
## Phase 2C: Pre-training Pipeline (Weeks 5–7)
### What Was Built
A data pipeline and training harness that connects the tokenizer and model layers into a complete CLM pre-training workflow.
| Component | Purpose |
|-----------|---------|
| `tokenize_user_sequences()` | Converts lists of user event sequences → variable-length token ID lists |
| `pack_sequences()` | Packs variable-length sequences into fixed-length blocks (run_clm.py pattern) |
| `prepare_clm_dataset()` | Convenience pipeline: user events → tokenize → pack → HFDataset |
| `pretrain_domain_model()` | Pre-trains via HF Trainer with DataCollatorForLanguageModeling, cosine schedule |
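The concatenate-and-chunk pattern behind `pack_sequences()` can be sketched as follows (illustrative, not the library's actual signature):

```python
def pack_sequences(tokenized, block_size):
    """Concatenate all tokenized sequences into one stream, then
    split into fixed-length blocks (the run_clm.py pattern).
    The trailing remainder shorter than block_size is dropped."""
    stream = [tok for seq in tokenized for tok in seq]
    n_blocks = len(stream) // block_size
    if n_blocks == 0:
        raise ValueError("not enough tokens for a single block")
    return [stream[i * block_size:(i + 1) * block_size]
            for i in range(n_blocks)]

seqs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
print(pack_sequences(seqs, block_size=4))  # → [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Every block is exactly `block_size` real tokens, which is what makes the 100% token utilization claim in decision 1 below hold.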
### Key Technical Decisions
1. **Sequence packing, not padding.** Following the official HF `run_clm.py` pattern, all tokenized user sequences are concatenated into one long stream and split into fixed-length blocks. This achieves 100% token utilization — every position in every training example is a real token contributing gradient signal. Padding wastes 30-70% of tokens for variable-length sequences, which is unacceptable when training data is finite (typical business scenario). The trade-off: cross-sequence boundaries exist within blocks. For domain events delimited by `[BOS]`/`[EOS]`/`[SEP_EVENT]` tokens, this is benign — the model learns to handle delimiters naturally.
2. **`DataCollatorForLanguageModeling(mlm=False)` handles label creation.** The HF Trainer does NOT auto-inject `labels`. The data collator does: it clones `input_ids`, sets `labels = input_ids`, and masks any padding positions (token_id == pad_token_id) with `-100` so they don't contribute to loss. Our packed sequences have no padding, so `labels == input_ids` exactly — every token is a training target.
3. **`processing_class` parameter (not `tokenizer`).** HuggingFace Trainer v5.7 renamed `tokenizer` to `processing_class` in `Trainer.__init__()`; the old name now raises `TypeError`. This API break surfaces only at runtime, not at import time — caught and fixed during testing.
4. **Cosine learning rate schedule with warmup.** Following Nubank and standard GPT pre-training practice. The cosine schedule decays smoothly from peak LR to near-zero, avoiding the abrupt drops of step schedules. Warmup prevents early training instability when loss gradients are large and noisy.
5. **`disable_tqdm=True` and `logging_strategy="steps"`.** For cloud/headless execution, tqdm progress bars are useless (they produce thousands of `\r` characters in log files). Plain text step-by-step logging (`loss=X.XXX, grad_norm=Y.YYY, lr=Z.ZZZ`) is greppable and parseable by monitoring tools.
6. **Dataset yields only `{"input_ids": [...]}`.** The collator adds `labels` and `attention_mask`. The Trainer's `remove_unused_columns=True` (default) auto-drops any extra columns not in the model's `forward()` signature. This means you can safely store metadata (user IDs, sequence lengths) in the dataset — they're dropped before batching.
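The label handling in decision 2 reduces to a one-line transform per sequence. A plain-Python sketch of the collator's behavior (the real `DataCollatorForLanguageModeling` works on batched tensors; `make_labels` and `PAD_ID` here are illustrative):

```python
PAD_ID = 0  # assumed pad token id for this sketch

def make_labels(input_ids, pad_id=PAD_ID):
    """Copy input_ids into labels, replacing pad positions with -100
    so CrossEntropyLoss ignores them."""
    return [tok if tok != pad_id else -100 for tok in input_ids]

# Packed block: no padding, so labels == input_ids exactly.
packed = [5, 17, 9, 42]
print(make_labels(packed))  # [5, 17, 9, 42]

# Padded sequence: pad positions are excluded from the loss.
padded = [5, 17, 0, 0]
print(make_labels(padded))  # [5, 17, -100, -100]
```

The shift-by-one alignment between predictions and labels happens inside the model's loss computation, not in the collator, which is why `labels == input_ids` is correct here.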
### Smoke Test Results
24-step training on CPU with a tiny model (64-dim, 2 layers) confirmed the full pipeline:
```
Step 1: loss=5.419 grad_norm=7.227 lr=1.000e-03
Step 12: loss=4.510 grad_norm=3.668 lr=5.653e-04
Step 24: loss=4.322 grad_norm=3.636 lr=4.278e-06
```
Loss decreased monotonically from 5.42 to 4.32 with cosine decay — the tokenizer→packing→collator→model→loss→optimizer pipeline is end-to-end functional.
### Test Results
**19 tests passing** covering: tokenization of user sequences (variable lengths, BOS/EOS presence), packing (fixed blocks, concatenation, remainder dropping, error on insufficient data), full dataset preparation, DataCollator behavior (label creation, shapes, all-ones attention mask for packed data), integration forward pass with backward, Trainer smoke test (24 steps), and validation that missing pad_token raises correctly.
---
## Cumulative Test Summary
| Phase | Tests | Coverage |
|-------|-------|----------|
| 2A: Tokenizers | 72 | Schema validation, 5 field tokenizers (edge cases: NaN, None, overflow, UNK), predefined schemas, builder pipeline, end-to-end encoding |
| 2B: Models | 33 | Config presets, forward pass shapes, loss computation, weight tying, user embeddings, param counts, gradient checkpointing, causal masking, PLR, DCNv2, JointFusion, tokenizer→model integration |
| 2C: Training | 19 | Tokenization, packing, collation, DataCollator behavior, forward+backward integration, 24-step Trainer smoke test, error handling |
| **Total** | **124** | **All passing** |
---
## Library API Summary (v0.3.0)
```python
from domain_tokenizer import (
    # Schemas
    DomainSchema, FieldSpec, FieldType,
    # Tokenizers
    DomainTokenizerBuilder,
    # Models
    DomainTransformerConfig, DomainTransformerForCausalLM,
    PeriodicLinearReLU, JointFusionModel, DCNv2,
    # Training
    prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
```
### End-to-End Usage
```python
# 1. Build tokenizer from schema
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)
# 2. Prepare packed training data
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)
# 3. Create model
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)
# 4. Pre-train
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",
    num_epochs=10, learning_rate=3e-4, bf16=True,
)
```