rtferraz committed
Commit 7aac458 · verified · 1 Parent(s): abab711

Update implementation report: add Phase 2D, update header to v0.4.0 / 139 tests, update cumulative summary and API

Files changed (1)
  1. docs/phase2_implementation_report.md +78 -15
docs/phase2_implementation_report.md CHANGED
@@ -1,8 +1,8 @@
- # Phase 2A–2C Implementation Report

- > **domainTokenizer v0.3.0** — Core library complete: tokenizers, models, pre-training pipeline
  >
- > **124 tests passing** (72 tokenizer + 33 model + 19 training)
  >
  > *April 2026*

@@ -10,13 +10,13 @@

  ## Overview

- Phase 2 implements the core domainTokenizer library — everything needed to go from raw domain events (financial transactions, e-commerce actions, clinical encounters) to a pre-trained Transformer foundation model. The implementation directly follows the validated patterns from Nubank's nuFormer (arXiv:2507.23267), with architecture decisions grounded in 6 audited reference papers.

- The library is organized as three layers, each built and tested independently before composing into the next:

  ```
- Phase 2A: Tokenizers → Phase 2B: Models → Phase 2C: Training Pipeline
- (schema → tokens) (tokens → loss) (data → Trainer → checkpoints)
  ```

  ---
@@ -142,18 +142,62 @@ Loss decreased monotonically from 5.42 to 4.32 with cosine decay — the tokeniz

  ---

  ## Cumulative Test Summary

  | Phase | Tests | Coverage |
  |-------|-------|----------|
  | 2A: Tokenizers | 72 | Schema validation, 5 field tokenizers (edge cases: NaN, None, overflow, UNK), predefined schemas, builder pipeline, end-to-end encoding |
  | 2B: Models | 33 | Config presets, forward pass shapes, loss computation, weight tying, user embeddings, param counts, gradient checkpointing, causal masking, PLR, DCNv2, JointFusion, tokenizer→model integration |
- | 2C: Training | 19 | Tokenization, packing, collation, DataCollator behavior, forward+backward integration, 24-step Trainer smoke test, error handling |
- | **Total** | **124** | **All passing** |

  ---

- ## Library API Summary (v0.3.0)

  ```python
  from domain_tokenizer import (
@@ -164,13 +208,15 @@ from domain_tokenizer import (
      # Models
      DomainTransformerConfig, DomainTransformerForCausalLM,
      PeriodicLinearReLU, JointFusionModel, DCNv2,
-     # Training
      prepare_clm_dataset, pretrain_domain_model,
  )
  from domain_tokenizer.schemas import FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
  ```

- ### End-to-End Usage

  ```python
  # 1. Build tokenizer from schema
@@ -181,14 +227,31 @@ hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)
  # 2. Prepare packed training data
  dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

- # 3. Create model
  config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
  model = DomainTransformerForCausalLM(config)
-
- # 4. Pre-train
  pretrain_domain_model(
      model, hf_tokenizer, dataset,
      hub_model_id="org/finance-24m",
      num_epochs=10, learning_rate=3e-4, bf16=True,
  )
  ```
 
+ # Phase 2A–2D Implementation Report

+ > **domainTokenizer v0.4.0** — Core library complete: tokenizers, models, pre-training, fine-tuning
  >
+ > **139 tests passing** (72 tokenizer + 33 model + 19 pre-training + 15 fine-tuning)
  >
  > *April 2026*

 

  ## Overview

+ Phase 2 implements the complete domainTokenizer library — everything needed to go from raw domain events (financial transactions, e-commerce actions, clinical encounters) to a fine-tuned downstream prediction model. The implementation directly follows the validated patterns from Nubank's nuFormer (arXiv:2507.23267), with architecture decisions grounded in 6 audited reference papers.

+ The library is organized as four layers, each built and tested independently before composing into the next:

  ```
+ Phase 2A: Tokenizers → Phase 2B: Models → Phase 2C: Pre-training → Phase 2D: Fine-tuning
+ (schema → tokens) (tokens → loss) (CLM on sequences) (joint fusion on labels)
  ```

  ---
 

  ---

+ ## Phase 2D: Fine-tuning Pipeline (Weeks 7–9)
+
+ ### What Was Built
+
+ A supervised fine-tuning pipeline for the JointFusionModel — the nuFormer-style architecture that combines a pre-trained transaction Transformer with DCNv2(PLR) tabular features for downstream prediction tasks.
+
+ | Component | Purpose |
+ |-----------|---------|
+ | `DomainFinetuneDataset` | Per-user torch Dataset yielding `{input_ids, attention_mask, tabular_features, labels}` |
+ | `prepare_finetune_dataset()` | Convenience constructor with validation and logging |
+ | `finetune_domain_model()` | Fine-tunes JointFusionModel via HF Trainer — zero subclassing needed |
+
+ ### Key Technical Decisions
+
+ 1. **HF Trainer Pattern A — zero custom code required.** The critical discovery is that HuggingFace Trainer inspects `JointFusionModel.forward(self, input_ids, attention_mask, tabular_features, labels)` via `inspect.signature()`. Because `tabular_features` is a named parameter in the forward signature, the Trainer automatically keeps that dataset column and passes it to the model. No `compute_loss` override, no `remove_unused_columns=False`, no Trainer subclass (see the sketch after this list). This was verified empirically on transformers 5.7.0 — the Trainer's `_set_signature_columns_if_needed()` method builds the allowed column list directly from the model's `forward()` parameters, and this works identically for a plain `nn.Module` and a `PreTrainedModel`.
+
+ 2. **Per-user padding, not packing.** Unlike pre-training (which packs sequences for 100% token utilization), fine-tuning uses per-user padded sequences. The reason: each training sample needs its own label. In pre-training, the "label" is the next token, shared across the packed block. In fine-tuning, the label is a user-level outcome (e.g., "will this user activate a product?"), so each user is a separate sample with its own label. Padding tokens are masked out of the attention via `attention_mask`, so they don't affect the user embedding extracted by `get_user_embedding()`.
+
+ 3. **Dataset returns tensors directly, no custom collator.** `DomainFinetuneDataset.__getitem__()` returns pre-tokenized, pre-padded torch tensors. The default PyTorch `DataLoader` collation (stacking tensors into batches) is sufficient. No `DataCollatorForLanguageModeling` is needed — that collator is for pre-training only. This simplifies the pipeline and avoids double-padding issues.
+
+ 4. **`save_strategy` is configurable (not hardcoded).** During testing, we discovered that saving JointFusionModel checkpoints via safetensors fails because the wrapped DomainTransformerForCausalLM has tied weights (lm_head ↔ embed_tokens), and safetensors rejects shared tensor storage by default. The fix: `save_strategy` is exposed as a parameter so users can set it to `"no"` during experimentation or plug in custom saving logic for production. This is a known HF issue with wrapper models that contain tied-weight sub-models.
+
+ 5. **Binary and multiclass via the `n_classes` parameter.** The same `JointFusionModel` and `finetune_domain_model()` handle both binary classification (`n_classes=1`, BCE loss) and multiclass (`n_classes>1`, CE loss). The loss function switches automatically based on `n_classes`. Labels are `float` for binary and `long` for multiclass — the dataset returns `float32` by default, and the caller casts to `long` for multiclass.
+
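+ The sketch below illustrates decisions 1, 2, 3, and 5 in one place. It is a minimal, self-contained toy: `ToySequenceEncoder`, `ToyFusionModel`, and all layer sizes are illustrative stand-ins, not the library's actual classes.
+
+ ```python
+ import inspect
+ import torch
+ import torch.nn as nn
+
+ class ToySequenceEncoder(nn.Module):
+     """Stand-in for the pre-trained Transformer branch."""
+     hidden_size = 32
+
+     def __init__(self, vocab_size=100):
+         super().__init__()
+         self.embed = nn.Embedding(vocab_size, self.hidden_size)
+
+     def forward(self, input_ids, attention_mask):
+         return self.embed(input_ids)                      # (B, T, H)
+
+ class ToyFusionModel(nn.Module):
+     """Stand-in for JointFusionModel: sequence branch + tabular branch + head."""
+
+     def __init__(self, encoder, n_tabular_features, n_classes=1):
+         super().__init__()
+         self.encoder = encoder
+         self.tabular = nn.Linear(n_tabular_features, 16)  # stand-in for PLR + DCNv2
+         self.n_classes = n_classes
+         self.head = nn.Linear(encoder.hidden_size + 16, n_classes)
+
+     # Decision 1: tabular_features is a named forward() parameter, so the default
+     # HF Trainer keeps that dataset column without any subclassing.
+     def forward(self, input_ids, attention_mask, tabular_features, labels=None):
+         hidden = self.encoder(input_ids, attention_mask)
+         # Decision 2: mask padding before pooling, so pad tokens never reach the user embedding
+         mask = attention_mask.unsqueeze(-1).float()
+         user_emb = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
+         logits = self.head(torch.cat([user_emb, self.tabular(tabular_features)], dim=-1))
+         loss = None
+         if labels is not None:
+             if self.n_classes == 1:                       # Decision 5: binary -> BCE on float labels
+                 loss = nn.functional.binary_cross_entropy_with_logits(logits.squeeze(-1), labels.float())
+             else:                                         # multiclass -> CE on long labels
+                 loss = nn.functional.cross_entropy(logits, labels.long())
+         return {"loss": loss, "logits": logits}
+
+ # What Trainer._set_signature_columns_if_needed() effectively does: keep only the
+ # dataset columns whose names appear in the model's forward() signature.
+ print(list(inspect.signature(ToyFusionModel.forward).parameters))
+ # ['self', 'input_ids', 'attention_mask', 'tabular_features', 'labels']
+ ```
+
+ Because `DomainFinetuneDataset.__getitem__()` already returns fixed-length tensors for exactly these keys (decision 3), the default stacking collation produces batches that feed straight into this signature.
+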
+ ### Smoke Test Results
+
+ A 5-step fine-tuning run on CPU with a tiny model confirmed the full pipeline:
+
+ ```
+ Step 1: loss=0.750 grad_norm=7.158 lr=1.000e-03
+ Step 3: loss=0.996 grad_norm=3.771 lr=6.545e-04
+ Step 5: loss=0.818 grad_norm=2.681 lr=9.549e-05
+ Train loss: 0.752 (5 steps, 20 samples, batch=4)
+ ```
+
+ Both the Transformer branch and the PLR+DCNv2 tabular branch received gradients — end-to-end joint training is functional. A minimal version of that check is sketched below.
+
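+ As an illustration only (reusing the hypothetical `ToyFusionModel` sketch above, not the library's test suite), the gradient-flow claim can be checked like this:
+
+ ```python
+ model = ToyFusionModel(ToySequenceEncoder(), n_tabular_features=8, n_classes=1)
+ batch = {
+     "input_ids": torch.randint(0, 100, (4, 16)),
+     "attention_mask": torch.ones(4, 16, dtype=torch.long),
+     "tabular_features": torch.randn(4, 8),
+     "labels": torch.randint(0, 2, (4,)).float(),
+ }
+ model(**batch)["loss"].backward()
+ # both branches must have received gradients from the joint loss
+ assert all(p.grad is not None for p in model.encoder.parameters())
+ assert all(p.grad is not None for p in model.tabular.parameters())
+ ```
+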
+ ### Test Results
+
+ **15 tests passing**, covering: dataset creation (length, keys, shapes, padding correctness, attention mask alignment, dtypes, length mismatch error, stats), DataLoader batching, forward pass on real dataset batches, backward gradient flow through both branches, multiclass classification, an HF Trainer smoke test (5 steps), and the `prepare_finetune_dataset` convenience function.
+
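+ As a flavour of what those dataset checks look like, here is a hypothetical pytest-style sketch of the "attention mask alignment" case. It assumes the `ft_dataset` and `hf_tokenizer` objects from the usage example below, and that padding uses the tokenizer's pad token:
+
+ ```python
+ import torch
+
+ def test_attention_mask_matches_padding():
+     sample = ft_dataset[0]
+     ids, mask = sample["input_ids"], sample["attention_mask"]
+     # real tokens are attended to; every pad position is masked out
+     assert torch.equal(mask.bool(), ids.ne(hf_tokenizer.pad_token_id))
+ ```
+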
+ ---
+
  ## Cumulative Test Summary

  | Phase | Tests | Coverage |
  |-------|-------|----------|
  | 2A: Tokenizers | 72 | Schema validation, 5 field tokenizers (edge cases: NaN, None, overflow, UNK), predefined schemas, builder pipeline, end-to-end encoding |
  | 2B: Models | 33 | Config presets, forward pass shapes, loss computation, weight tying, user embeddings, param counts, gradient checkpointing, causal masking, PLR, DCNv2, JointFusion, tokenizer→model integration |
+ | 2C: Pre-training | 19 | Tokenization, packing, collation, DataCollator behavior, forward+backward integration, 24-step Trainer smoke test, error handling |
+ | 2D: Fine-tuning | 15 | Dataset creation/validation, batching, forward/backward through JointFusion, 5-step Trainer smoke test, multiclass, convenience function |
+ | **Total** | **139** | **All passing** |

  ---

+ ## Library API Summary (v0.4.0)

  ```python
  from domain_tokenizer import (

      # Models
      DomainTransformerConfig, DomainTransformerForCausalLM,
      PeriodicLinearReLU, JointFusionModel, DCNv2,
+     # Pre-training
      prepare_clm_dataset, pretrain_domain_model,
+     # Fine-tuning
+     DomainFinetuneDataset, prepare_finetune_dataset, finetune_domain_model,
  )
  from domain_tokenizer.schemas import FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
  ```

+ ### End-to-End Usage: Pre-training → Fine-tuning

  ```python
  # 1. Build tokenizer from schema

  # 2. Prepare packed training data
  dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

+ # 3. Create and pre-train model
  config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
  model = DomainTransformerForCausalLM(config)
  pretrain_domain_model(
      model, hf_tokenizer, dataset,
      hub_model_id="org/finance-24m",
      num_epochs=10, learning_rate=3e-4, bf16=True,
  )
+
+ # 4. Create joint fusion model for fine-tuning
+ fusion = JointFusionModel(
+     transformer_model=model,     # pre-trained, unfrozen
+     n_tabular_features=291,      # hand-crafted tabular features
+     n_classes=1,                 # binary: will user activate product?
+ )
+
+ # 5. Prepare fine-tuning data
+ ft_dataset = prepare_finetune_dataset(
+     user_sequences, tabular_features, labels,
+     builder, hf_tokenizer, max_length=512,
+ )
+
+ # 6. Fine-tune
+ finetune_domain_model(
+     fusion, ft_dataset,
+     num_epochs=5, learning_rate=1e-4, bf16=True,
+ )
  ```
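+
+ After fine-tuning, the fused model can score individual users. The snippet below is an illustrative assumption based only on the `forward()` signature described in Phase 2D, decision 1; how outputs are packaged depends on the actual `JointFusionModel` implementation.
+
+ ```python
+ import torch
+
+ # 7. (Illustrative) score one user with the fine-tuned model
+ fusion.eval()
+ sample = ft_dataset[0]
+ with torch.no_grad():
+     output = fusion(
+         input_ids=sample["input_ids"].unsqueeze(0),            # add a batch dimension
+         attention_mask=sample["attention_mask"].unsqueeze(0),
+         tabular_features=sample["tabular_features"].unsqueeze(0),
+         labels=sample["labels"].unsqueeze(0),                  # optional here; also yields a loss
+     )
+ ```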