
Phase 2A–2C Implementation Report

domainTokenizer v0.3.0 — Core library complete: tokenizers, models, pre-training pipeline

124 tests passing (72 tokenizer + 33 model + 19 training)

April 2026


Overview

Phase 2 implements the core domainTokenizer library — everything needed to go from raw domain events (financial transactions, e-commerce actions, clinical encounters) to a pre-trained Transformer foundation model. The implementation directly follows the validated patterns from Nubank's nuFormer (arXiv:2507.23267), with architecture decisions grounded in 6 audited reference papers.

The library is organized as three layers, each built and tested independently before composing into the next:

Phase 2A: Tokenizers  →  Phase 2B: Models  →  Phase 2C: Training Pipeline
(schema → tokens)        (tokens → loss)      (data → Trainer → checkpoints)

Phase 2A: Domain Tokenizer Library (Weeks 1–3)

What Was Built

A declarative schema system and 5 per-field tokenizers that convert raw domain events into HuggingFace-compatible token sequences.

| Component | Purpose | Output |
|---|---|---|
| DomainSchema + FieldSpec | Declarative event definition: fields, types, bin counts | Schema object |
| SignTokenizer | Credit/debit, +/- | 79.99 → [AMT_SIGN_POS] |
| MagnitudeBucketTokenizer | Quantile-based numerical bins (fits on data) | 79.99 → [AMT_15] |
| CalendarTokenizer | Timestamp → month/dow/dom/hour decomposition | Mar 15 2pm → 4 tokens |
| CategoricalTokenizer | Fixed category mapping with UNK fallback | "purchase" → [EVT_001] |
| DiscreteNumericalTokenizer | Small integers with overflow | 3 → [QTY_03], 15 → [QTY_OVER] |
| DomainTokenizerBuilder | Assembles per-field tokenizers → HF PreTrainedTokenizerFast | HF tokenizer |
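As a rough illustration of the calendar decomposition listed above, the sketch below turns one timestamp into four tokens using only the standard library. The function name and the token spellings ([MONTH_..], [DOW_..], [DOM_..], [HOUR_..]) are hypothetical and chosen for this example; the report does not state the names CalendarTokenizer actually emits.

```python
from datetime import datetime


def calendar_tokens(ts: datetime) -> list[str]:
    """Decompose a timestamp into month, day-of-week, day-of-month and hour tokens.

    Token names are illustrative assumptions, not the library's vocabulary.
    """
    return [
        f"[MONTH_{ts.month:02d}]",   # 1..12
        f"[DOW_{ts.weekday()}]",     # Monday=0 ... Sunday=6
        f"[DOM_{ts.day:02d}]",       # 1..31
        f"[HOUR_{ts.hour:02d}]",     # 0..23
    ]


# "Mar 15 2pm" from the table, with the year assumed to be 2026.
tokens = calendar_tokens(datetime(2026, 3, 15, 14, 0))
print(tokens)
```

Under this decomposition the calendar fields need 12 + 7 + 31 + 24 = 74 distinct tokens, regardless of how many distinct timestamps appear in the data.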

Three predefined schemas ship out of the box:

  • FINANCE_SCHEMA — 97 domain tokens (Nubank-compatible: sign + amount bins + calendar)
  • ECOMMERCE_SCHEMA — event type + price + quantity + category + calendar + product title
  • HEALTHCARE_SCHEMA — clinical event type + cost + severity + provider + calendar + description

Key Technical Decisions

  1. Hybrid vocabulary: special tokens + BPE. Following Nubank, structured fields (amounts, dates, categories) become single special tokens while free-text fields (descriptions, product titles) use standard BPE. This compresses each event to ~14 tokens vs ~35-50 with pure text serialization, so a 2048-token context window holds about 146 events instead of roughly 40 to 58, a 2.5x to 3.6x increase.

  2. Quantile-based magnitude binning (not linear). The MagnitudeBucketTokenizer uses quantile percentiles on absolute values, not uniform bins. Financial data is heavily skewed (many small transactions, few large ones). Quantile bins ensure each bin gets roughly equal representation in the training data, maximizing the model's ability to distinguish between common transaction sizes.

  3. Separate sign and magnitude tokenization. Following Nubank's ϕ_sign + ϕ_amt pattern, the sign (credit/debit) is tokenized independently from the magnitude. This lets the model learn that "a $500 inflow" and "a $500 outflow" share magnitude semantics but differ in direction — without wasting bins on both positive and negative ranges.

  4. Schema-driven factory pattern. Field tokenizers are created automatically from FieldSpec declarations via create_field_tokenizer(). Adding a new domain requires only defining a DomainSchema — no code changes to the tokenizer pipeline. This enables rapid domain iteration (finance → e-commerce → healthcare) without engineering overhead.

  5. Data-dependent tokenizers require explicit fitting. MagnitudeBucketTokenizer must be fitted with .fit() on training data before use. Calling .build() on an unfitted schema raises RuntimeError. Making the fit step explicit forces the caller to choose which data defines the bin edges, which guards against a subtle bug where edges are computed on test data and leak information.

  6. HuggingFace-native output. The DomainTokenizerBuilder.build() method produces a standard PreTrainedTokenizerFast — the same type returned by AutoTokenizer.from_pretrained(). This means zero adaptation for HF Trainer, push_to_hub(), save_pretrained(), ONNX export, etc.
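Decisions 2 and 3 above (quantile bins on absolute values, with the sign tokenized separately) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the MagnitudeBucketTokenizer implementation: the function names, the [AMT_SIGN_NEG] token, the two-digit bin format, and the choice of four bins are all hypothetical.

```python
import bisect
import statistics


def fit_bin_edges(amounts, n_bins):
    """Return the n_bins - 1 interior quantile cut points of |amount|."""
    magnitudes = [abs(a) for a in amounts]
    return statistics.quantiles(magnitudes, n=n_bins, method="inclusive")


def tokenize_amount(amount, edges):
    """Emit one sign token and one magnitude-bin token for a signed amount."""
    sign = "[AMT_SIGN_POS]" if amount >= 0 else "[AMT_SIGN_NEG]"
    bin_idx = bisect.bisect_right(edges, abs(amount))  # 0 .. len(edges)
    return [sign, f"[AMT_{bin_idx:02d}]"]


# Skewed toy data: many small amounts, a few large ones, mixed signs.
train_amounts = [-1, 2, -2, 3, 4, -5, 8, 10, -50, 900]
edges = fit_bin_edges(train_amounts, n_bins=4)

# Count how many training amounts land in each bin.
bin_counts = [0] * 4
for a in train_amounts:
    bin_counts[bisect.bisect_right(edges, abs(a))] += 1

print(edges)                         # quantile cut points
print(bin_counts)                    # occupancy per bin
print(tokenize_amount(500, edges))   # inflow
print(tokenize_amount(-500, edges))  # outflow: same magnitude token, different sign
```

With uniform-width bins over the same data, nine of the ten amounts would fall into the first bin because of the single 900 outlier; the quantile edges spread them 3/2/2/3.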

Test Results

72 tests passing covering: field spec validation, all 5 tokenizer types (including edge cases: NaN, None, overflow, unknown categories), predefined schemas (including Nubank 97-token compatibility check), builder fit/build/tokenize/encode pipeline, and full end-to-end sequence encoding.


Phase 2B: Model Architecture (Weeks 3–5)

What Was Built

A GPT-style causal decoder Transformer registered as a HuggingFace PreTrainedModel, plus numerical embeddings and joint fusion components.

| Component | Purpose | Based On |
|---|---|---|
| DomainTransformerConfig | HF-compatible config with presets ("24m", "85m", "330m") | Nubank nuFormer sizes |
| DomainTransformerForCausalLM | Causal decoder: NoPE, pre-norm, SDPA attention, weight tying | NoPE (arXiv:2305.19466) + GPT-2 |
| PeriodicLinearReLU | Learned sin/cos embeddings for numerical features | Gorishniy et al. (arXiv:2203.05556) |
| DCNv2 + JointFusionModel | Transformer + tabular feature fusion for fine-tuning | Nubank + DCN V2 (arXiv:2008.13535) |
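To make the PeriodicLinearReLU row concrete, here is a minimal PyTorch sketch of a periodic → linear → ReLU embedding with trainable frequencies and phases. It is an illustration under assumptions: the class name, the default sizes, the initialization scale, and the use of one Linear layer shared across features are choices made for this example and may differ from the library's module.

```python
import math

import torch
import torch.nn as nn


class PeriodicLinearReLUSketch(nn.Module):
    """Embed each scalar feature as ReLU(Linear([sin(a), cos(a)])), a = 2*pi*f*x + phase."""

    def __init__(self, n_features: int, n_frequencies: int = 8, d_out: int = 16, sigma: float = 1.0):
        super().__init__()
        # Frequencies and phases are trainable, one set per feature.
        self.frequencies = nn.Parameter(torch.randn(n_features, n_frequencies) * sigma)
        self.phases = nn.Parameter(torch.zeros(n_features, n_frequencies))
        # Simplification: a single projection shared by all features.
        self.linear = nn.Linear(2 * n_frequencies, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_features) -> angle: (batch, n_features, n_frequencies)
        angle = 2 * math.pi * self.frequencies * x.unsqueeze(-1) + self.phases
        periodic = torch.cat([torch.sin(angle), torch.cos(angle)], dim=-1)
        return torch.relu(self.linear(periodic))  # (batch, n_features, d_out)


torch.manual_seed(0)
plr = PeriodicLinearReLUSketch(n_features=3)
x = torch.randn(4, 3)
out = plr(x)
out.sum().backward()  # gradients flow back to frequencies and phases
print(out.shape)
```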

Key Technical Decisions

  1. NoPE (No Positional Encoding). Following Kazemnejad et al. (NeurIPS 2023), the model uses no explicit positional encoding: no absolute embeddings, no RoPE, no ALiBi. In that paper's experiments on decoder-only Transformers, NoPE generalized to longer sequences better than the explicit positional encoding schemes it was compared against. For domain sequences where users have vastly different history lengths (20 to 2000+ events), length generalization is critical. Position information is not lost: the causal attention mask lets the model infer token order implicitly.

  2. F.scaled_dot_product_attention with is_causal=True, not nn.MultiheadAttention. In nn.MultiheadAttention, is_causal is an argument to forward() that acts only as a hint: an explicit attn_mask must still be supplied alongside it. We instead implement attention directly with F.scaled_dot_product_attention, which dispatches to fused kernels such as FlashAttention when available on CUDA, and uses an efficient C++ kernel on CPU.

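  3. HF attention mask conversion. HuggingFace Trainer sends attention masks as (B, T) long tensors (1=attend, 0=pad). PyTorch SDPA takes either no mask (with is_causal=True), a boolean mask (True=attend), or a float mask that is added to the attention scores, with masked positions set to -inf. The attention module handles this conversion: when a mask is provided, it is expanded to (B, 1, 1, T), converted to float, and inverted (0 → -inf, 1 → 0.0). When no mask is provided, is_causal=True handles causality at no extra cost. Because SDPA does not accept attn_mask together with is_causal=True, the masked path must also carry the causal constraint in the mask itself.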

  4. Weight tying via HF v5.7+ dict format. The _tied_weights_keys API changed from a list to a dict in transformers 5.7. We use {"lm_head.weight": "model.embed_tokens.weight"} with proper get/set_input_embeddings and get/set_output_embeddings implementations. post_init() handles the actual tying.

  5. Pre-norm architecture (LayerNorm before attention/FFN). GPT-2 and most modern LLMs use pre-norm. This makes training more stable than post-norm, especially at the 24M–330M scale where we don't have the luxury of extensive hyperparameter tuning.

  6. get_user_embedding() method on the CausalLM class. For downstream tasks (classification, joint fusion), we need a single vector representing the user's transaction history. This method extracts the hidden state at the last non-padding position, the standard approach for decoder-only models. It uses attention_mask.sum(dim=1) - 1 to find the last real token position per sequence. This indexing assumes right padding; with left-padded batches the last real token always sits at position T - 1.

  7. PLR frequencies and phases are learned parameters. Unlike fixed Fourier features, PLR initializes frequencies and phases as trainable nn.Parameter tensors. This lets the model discover the most informative frequency decomposition for each numerical feature during training — crucial for financial data where relevant scales span 4+ orders of magnitude.
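The mask conversion in decision 3 can be sketched as follows. PyTorch documents that F.scaled_dot_product_attention raises an error when attn_mask and is_causal=True are both given, so this sketch assumes the padding mask is combined with a lower-triangular causal mask into one additive float mask. The helper name is hypothetical, and the report does not show how the library's attention module combines the two.

```python
import torch
import torch.nn.functional as F


def build_sdpa_mask(attention_mask: torch.Tensor, dtype: torch.dtype) -> torch.Tensor:
    """Turn an HF (B, T) mask (1=attend, 0=pad) into an additive (B, 1, T, T) mask.

    Allowed positions get 0.0, blocked positions get -inf. Assumes right padding,
    so every query row can still see at least the first (real) token.
    """
    B, T = attention_mask.shape
    device = attention_mask.device
    key_is_real = attention_mask[:, None, None, :].bool()                # (B, 1, 1, T)
    causal = torch.tril(torch.ones(T, T, dtype=torch.bool, device=device))  # (T, T)
    allowed = key_is_real & causal                                       # (B, 1, T, T)
    mask = torch.zeros(B, 1, T, T, dtype=dtype, device=device)
    return mask.masked_fill(~allowed, float("-inf"))


torch.manual_seed(0)
B, H, T, D = 2, 2, 4, 8
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])  # second row is right-padded

# Path with a padding mask: causality is folded into the explicit mask.
out_masked = F.scaled_dot_product_attention(q, k, v, attn_mask=build_sdpa_mask(attention_mask, q.dtype))
# Path without a mask: is_causal=True alone is enough.
out_causal = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# The unpadded sequence gives the same result on both paths.
print(torch.allclose(out_masked[0], out_causal[0], atol=1e-5))
```

For the padded sequence, the outputs at the two real positions also match the causal-only path, because causality already prevents them from attending to the later padding positions.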

Test Results

33 tests passing covering: config presets and serialization, base model forward shapes, CausalLM with/without labels, loss differentiability, weight tying verification, user embedding extraction (with and without mask), parameter counts for tiny and 24M configs, gradient checkpointing, causal masking verification, PLR shapes and gradients, DCNv2 cross layers, JointFusion binary and multiclass, and full tokenizer→model→loss integration.


Phase 2C: Pre-training Pipeline (Weeks 5–7)

What Was Built

A data pipeline and training harness that connects the tokenizer and model layers into a complete CLM pre-training workflow.

| Component | Purpose |
|---|---|
| tokenize_user_sequences() | Converts lists of user event sequences → variable-length token ID lists |
| pack_sequences() | Packs variable-length sequences into fixed-length blocks (run_clm.py pattern) |
| prepare_clm_dataset() | Convenience pipeline: user events → tokenize → pack → HFDataset |
| pretrain_domain_model() | Pre-trains via HF Trainer with DataCollatorForLanguageModeling, cosine schedule |
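The packing step in the table can be illustrated in a few lines of plain Python. This is a sketch of the concatenate-then-chunk pattern, not the library's pack_sequences(): the function name, the token IDs (1 standing in for [BOS], 2 for [EOS]), and the choice of ValueError for insufficient data are assumptions made for the example.

```python
def pack_token_streams(sequences, block_size):
    """Concatenate token ID lists and split the stream into full blocks.

    Tokens left over after the last full block are dropped.
    """
    stream = [tok for seq in sequences for tok in seq]
    usable = (len(stream) // block_size) * block_size
    if usable == 0:
        raise ValueError(f"need at least {block_size} tokens, got {len(stream)}")
    return [stream[i:i + block_size] for i in range(0, usable, block_size)]


# Four users with histories of different lengths (1 = BOS, 2 = EOS in this toy vocabulary).
user_token_ids = [
    [1, 10, 11, 2],
    [1, 20, 2],
    [1, 30, 31, 32, 2],
    [1, 40, 2],
]
blocks = pack_token_streams(user_token_ids, block_size=4)
print(blocks)
```

The 15 input tokens yield three full blocks of four; the last three tokens do not fill a block and are dropped. The second block crosses a user boundary, which is the trade-off discussed in the decisions that follow.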

Key Technical Decisions

  1. Sequence packing, not padding. Following the official HF run_clm.py pattern, all tokenized user sequences are concatenated into one long stream and split into fixed-length blocks. This achieves 100% token utilization within each block: every position in every training example is a real token contributing gradient signal, and only the trailing remainder that does not fill a block is dropped. Padding can waste 30-70% of tokens for variable-length sequences, which is unacceptable when training data is finite (the typical business scenario). The trade-off is that cross-sequence boundaries exist within blocks. For domain events delimited by [BOS]/[EOS]/[SEP_EVENT] tokens, this is benign, since the model learns to handle the delimiters naturally.

  2. DataCollatorForLanguageModeling(mlm=False) handles label creation. The HF Trainer does NOT auto-inject labels. The data collator does: it clones input_ids, sets labels = input_ids, and masks any padding positions (token_id == pad_token_id) with -100 so they don't contribute to loss. Our packed sequences have no padding, so labels == input_ids exactly — every token is a training target.

  3. processing_class parameter (not tokenizer). HuggingFace Trainer v5.7 takes the tokenizer through the processing_class argument of Trainer.__init__(); passing the old tokenizer= keyword raises TypeError. The break is easy to miss because nothing flags it until the Trainer is constructed at runtime. It was caught and fixed during testing.

  4. Cosine learning rate schedule with warmup. Following Nubank and standard GPT pre-training practice. The cosine schedule decays smoothly from peak LR to near-zero, avoiding the abrupt drops of step schedules. Warmup prevents early training instability when loss gradients are large and noisy.

  5. disable_tqdm=True and logging_strategy="steps". For cloud/headless execution, tqdm progress bars are useless (they produce thousands of \r characters in log files). Plain text step-by-step logging (loss=X.XXX, grad_norm=Y.YYY, lr=Z.ZZZ) is greppable and parseable by monitoring tools.

  6. Dataset yields only {"input_ids": [...]}. The collator adds labels and attention_mask. The Trainer's remove_unused_columns=True (default) auto-drops any extra columns not in the model's forward() signature. This means you can safely store metadata (user IDs, sequence lengths) in the dataset — they're dropped before batching.
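The label creation described in decision 2 can be mimicked in plain Python to show why packed data ends up with labels identical to input_ids. This is an illustrative stand-in for the collator's behavior as described above, not a call into transformers; the function name and the token IDs are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def make_clm_labels(batch_input_ids, pad_token_id):
    """Copy input_ids into labels, replacing padding positions with IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in row]
        for row in batch_input_ids
    ]


PAD = 0
packed_batch = [[1, 10, 11, 2], [1, 20, 2, 1]]    # packed blocks: no padding anywhere
padded_batch = [[1, 10, 11, 2], [1, 20, 2, PAD]]  # same data, padded instead of packed

packed_labels = make_clm_labels(packed_batch, PAD)
padded_labels = make_clm_labels(padded_batch, PAD)

print(packed_labels)  # identical to packed_batch
print(padded_labels)  # the PAD position is masked out of the loss
```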

Smoke Test Results

24-step training on CPU with a tiny model (64-dim, 2 layers) confirmed the full pipeline:

Step  1: loss=5.419  grad_norm=7.227  lr=1.000e-03
Step 12: loss=4.510  grad_norm=3.668  lr=5.653e-04
Step 24: loss=4.322  grad_norm=3.636  lr=4.278e-06

Loss decreased at every logged step, from 5.42 to 4.32, while the learning rate followed the cosine decay. The tokenizer→packing→collator→model→loss→optimizer pipeline is end-to-end functional.
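The logged learning rates can be checked against the standard half-cosine formula. The sketch below assumes a peak LR of 1e-3, 24 total steps, and zero warmup steps for the smoke test; the zero-warmup setting is inferred from the logged values and is not stated in the report. It also assumes the value logged at step k is the rate computed after k - 1 scheduler updates.

```python
import math


def cosine_lr(completed_steps, total_steps, peak_lr, warmup_steps=0):
    """Linear warmup followed by half-cosine decay to zero."""
    if completed_steps < warmup_steps:
        return peak_lr * completed_steps / max(1, warmup_steps)
    progress = (completed_steps - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Rates for logged steps 1, 12 and 24 (i.e. after 0, 11 and 23 scheduler updates).
lrs = {step: cosine_lr(step - 1, total_steps=24, peak_lr=1e-3) for step in (1, 12, 24)}
for step, lr in lrs.items():
    print(f"Step {step:2d}: lr={lr:.3e}")
```

Under these assumptions the formula reproduces the three logged values (1.000e-03, 5.653e-04, 4.278e-06).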

Test Results

19 tests passing covering: tokenization of user sequences (variable lengths, BOS/EOS presence), packing (fixed blocks, concatenation, remainder dropping, error on insufficient data), full dataset preparation, DataCollator behavior (label creation, shapes, all-ones attention mask for packed data), integration forward pass with backward, Trainer smoke test (24 steps), and validation that missing pad_token raises correctly.


Cumulative Test Summary

| Phase | Tests | Coverage |
|---|---|---|
| 2A: Tokenizers | 72 | Schema validation, 5 field tokenizers (edge cases: NaN, None, overflow, UNK), predefined schemas, builder pipeline, end-to-end encoding |
| 2B: Models | 33 | Config presets, forward pass shapes, loss computation, weight tying, user embeddings, param counts, gradient checkpointing, causal masking, PLR, DCNv2, JointFusion, tokenizer→model integration |
| 2C: Training | 19 | Tokenization, packing, collation, DataCollator behavior, forward+backward integration, 24-step Trainer smoke test, error handling |
| Total | 124 | All passing |

Library API Summary (v0.3.0)

from domain_tokenizer import (
    # Schemas
    DomainSchema, FieldSpec, FieldType,
    # Tokenizers
    DomainTokenizerBuilder,
    # Models
    DomainTransformerConfig, DomainTransformerForCausalLM,
    PeriodicLinearReLU, JointFusionModel, DCNv2,
    # Training
    prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA

End-to-End Usage

# 1. Build tokenizer from schema
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

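# 3. Create model. vocab_size must cover every token ID the tokenizer can emit;
#    if domain tokens are registered as added tokens, len(hf_tokenizer) counts them
#    while hf_tokenizer.vocab_size may not.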
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)

# 4. Pre-train
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",
    num_epochs=10, learning_rate=3e-4, bf16=True,
)