Update implementation report: add Phase 2D, update header to v0.4.0 / 139 tests, update cumulative summary and API
docs/phase2_implementation_report.md

# Phase 2A–2D Implementation Report

> **domainTokenizer v0.4.0** – Core library complete: tokenizers, models, pre-training, fine-tuning
>
> **139 tests passing** (72 tokenizer + 33 model + 19 pre-training + 15 fine-tuning)
>
> *April 2026*

## Overview

Phase 2 implements the complete domainTokenizer library – everything needed to go from raw domain events (financial transactions, e-commerce actions, clinical encounters) to a fine-tuned downstream prediction model. The implementation directly follows the validated patterns from Nubank's nuFormer (arXiv:2507.23267), with architecture decisions grounded in 6 audited reference papers.

The library is organized as four layers, each built and tested independently before composing into the next:

```
Phase 2A: Tokenizers  →  Phase 2B: Models  →  Phase 2C: Pre-training  →  Phase 2D: Fine-tuning
  (schema → tokens)       (tokens → loss)      (CLM on sequences)       (joint fusion on labels)
```

---

## Phase 2D: Fine-tuning Pipeline (Weeks 7–9)

### What Was Built

A supervised fine-tuning pipeline for the JointFusionModel – the nuFormer-style architecture that combines a pre-trained transaction Transformer with DCNv2(PLR) tabular features for downstream prediction tasks (sketched after the table below).

| Component | Purpose |
|-----------|---------|
| `DomainFinetuneDataset` | Per-user torch Dataset yielding `{input_ids, attention_mask, tabular_features, labels}` |
| `prepare_finetune_dataset()` | Convenience constructor with validation and logging |
| `finetune_domain_model()` | Fine-tunes JointFusionModel via HF Trainer – zero subclassing needed |
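
How these pieces compose is easiest to see as a miniature forward pass. The sketch below is illustrative only: the class name, constructor arguments, and dimensions are assumptions, not library code; only `get_user_embedding()`, the two branches, and the fusion head reflect what this report describes.

```python
import torch
import torch.nn as nn

class FusionForwardSketch(nn.Module):
    """Illustrative stand-in for JointFusionModel -- names/shapes are assumptions."""

    def __init__(self, transformer, plr, dcn, d_model, d_tab, n_classes=1):
        super().__init__()
        self.transformer, self.plr, self.dcn = transformer, plr, dcn
        self.head = nn.Linear(d_model + d_tab, n_classes)

    def forward(self, input_ids, attention_mask, tabular_features, labels=None):
        # Branch 1: pre-trained Transformer -> one embedding per user,
        # with padding excluded via attention_mask (decision 2 below).
        user_emb = self.transformer.get_user_embedding(input_ids, attention_mask)
        # Branch 2: hand-crafted tabular features -> PLR numeric embeddings -> DCNv2 crosses.
        tab = self.dcn(self.plr(tabular_features))
        # Fuse and classify; the loss switch on n_classes is decision 5 below.
        logits = self.head(torch.cat([user_emb, tab], dim=-1))
        loss = None
        if labels is not None:  # binary case shown; see decision 5 for multiclass
            loss = nn.BCEWithLogitsLoss()(logits.squeeze(-1), labels.float())
        return {"loss": loss, "logits": logits}
```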

### Key Technical Decisions

1. **HF Trainer Pattern A – zero custom code required.** The critical discovery: HuggingFace Trainer inspects `JointFusionModel.forward(self, input_ids, attention_mask, tabular_features, labels)` via `inspect.signature()`. Because `tabular_features` is a named parameter in the forward signature, the Trainer automatically keeps that column from the dataset and passes it to the model. No `compute_loss` override, no `remove_unused_columns=False`, no Trainer subclass. This was verified empirically on transformers 5.7.0: the Trainer's `_set_signature_columns_if_needed()` method builds the allowed column list directly from the model's `forward()` parameters, and it works identically for a plain `nn.Module` and a `PreTrainedModel`. (See the first sketch after this list.)

2. **Per-user padding, not packing.** Unlike pre-training, which packs sequences for 100% token utilization, fine-tuning uses per-user padded sequences. The reason: each training sample needs its own label. In pre-training the "label" is the next token, shared across the packed block; in fine-tuning the label is a user-level outcome (e.g., "will this user activate a product?"), so each user is a separate sample with its own label. Padding tokens are masked in the attention via `attention_mask`, so they don't affect the user embedding extracted by `get_user_embedding()`. (See the second sketch after this list.)

3. **Dataset returns tensors directly; no custom collator.** `DomainFinetuneDataset.__getitem__()` returns pre-tokenized, pre-padded torch tensors, so the default PyTorch `DataLoader` collation (stacking tensors into batches) is sufficient. No `DataCollatorForLanguageModeling` is needed – that is pre-training only. This simplifies the pipeline and avoids double-padding issues. (The second sketch after this list covers this as well.)

4. **`save_strategy` is configurable, not hardcoded.** During testing, we discovered that saving JointFusionModel checkpoints via safetensors fails: the wrapped DomainTransformerForCausalLM has tied weights (lm_head ↔ embed_tokens), and safetensors rejects shared tensor storage by default. The fix: `save_strategy` is exposed as a parameter so users can set `"no"` during experimentation or use custom saving logic for production. This is a known HF issue with wrapper models containing tied-weight sub-models. (See the third sketch after this list.)

5. **Binary and multiclass via the `n_classes` parameter.** The same `JointFusionModel` and `finetune_domain_model()` handle both binary classification (`n_classes=1`, BCE loss) and multiclass (`n_classes>1`, CE loss); the loss function switches automatically based on `n_classes`. Labels are `float` for binary and `long` for multiclass – the dataset returns `float32` by default, and the caller casts to `long` for multiclass. (See the final sketch after this list.)
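
A minimal, self-contained illustration of the signature mechanism in decision 1. The `forward` stub below only mirrors the parameter names quoted above; it is not the library's code.

```python
import inspect

# Stub with the same parameter names as JointFusionModel.forward (per this report).
def forward(self, input_ids=None, attention_mask=None, tabular_features=None, labels=None):
    ...

# Trainer's _set_signature_columns_if_needed() builds its allowed-column list from
# these names, so a "tabular_features" dataset column survives column pruning.
print([p for p in inspect.signature(forward).parameters if p != "self"])
# -> ['input_ids', 'attention_mask', 'tabular_features', 'labels']
```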
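
Decisions 2 and 3 in miniature: a simplified, hypothetical stand-in for `DomainFinetuneDataset` (assuming already-tokenized id lists and pad id 0) showing per-user right-padding and why the default `DataLoader` collation suffices.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class PaddedUserDataset(Dataset):
    """Simplified stand-in: one user per sample, right-padded to max_length."""

    def __init__(self, id_lists, tabular, labels, max_length=512, pad_id=0):
        self.id_lists, self.tabular, self.labels = id_lists, tabular, labels
        self.max_length, self.pad_id = max_length, pad_id

    def __len__(self):
        return len(self.id_lists)

    def __getitem__(self, i):
        ids = self.id_lists[i][: self.max_length]
        pad = self.max_length - len(ids)
        return {
            "input_ids": torch.tensor(ids + [self.pad_id] * pad),
            "attention_mask": torch.tensor([1] * len(ids) + [0] * pad),
            "tabular_features": torch.tensor(self.tabular[i], dtype=torch.float32),
            "labels": torch.tensor(self.labels[i], dtype=torch.float32),  # float for binary
        }

# Default collation stacks equal-shaped tensors into a batch dict -- no custom collator.
ds = PaddedUserDataset([[5, 6, 7], [8, 9]], [[0.1, 0.2], [0.3, 0.4]], [1.0, 0.0], max_length=6)
batch = next(iter(DataLoader(ds, batch_size=2)))
print(batch["input_ids"].shape)        # torch.Size([2, 6])
print(batch["attention_mask"].sum(1))  # tensor([3, 2])
```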
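
A minimal reproduction of the failure mode behind decision 4, assuming safetensors' default shared-storage check (this is not the library's code). The exposed parameter lets callers opt out, e.g. `finetune_domain_model(..., save_strategy="no")`, instead of hitting this at checkpoint time.

```python
import torch.nn as nn
from safetensors.torch import save_file

# Two modules whose weights share storage, as lm_head <-> embed_tokens do
# in the wrapped DomainTransformerForCausalLM.
emb = nn.Linear(8, 4, bias=False)
head = nn.Linear(8, 4, bias=False)
head.weight = emb.weight  # tie the weights

try:
    save_file({"emb.weight": emb.weight, "head.weight": head.weight}, "tied.safetensors")
except RuntimeError as err:
    print(err)  # safetensors rejects shared tensor storage by default
```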
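
And decision 5's switching rule, written out standalone (a sketch of the described behavior, not the library source):

```python
import torch
import torch.nn as nn

def classification_loss(logits, labels, n_classes):
    if n_classes == 1:
        # Binary: a single logit per sample, float labels.
        return nn.BCEWithLogitsLoss()(logits.squeeze(-1), labels.float())
    # Multiclass: n_classes logits per sample, integer class labels.
    return nn.CrossEntropyLoss()(logits, labels.long())

binary = classification_loss(torch.randn(4, 1), torch.tensor([1.0, 0, 1, 0]), n_classes=1)
multi = classification_loss(torch.randn(4, 3), torch.tensor([2, 0, 1, 2]), n_classes=3)
```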
|
### Smoke Test Results

5-step fine-tuning on CPU with a tiny model confirmed the full pipeline:

```
Step 1: loss=0.750 grad_norm=7.158 lr=1.000e-03
Step 3: loss=0.996 grad_norm=3.771 lr=6.545e-04
Step 5: loss=0.818 grad_norm=2.681 lr=9.549e-05
Train loss: 0.752 (5 steps, 20 samples, batch=4)
```

Both the Transformer branch and the PLR+DCNv2 tabular branch received gradients – end-to-end joint training is functional.

### Test Results

**15 tests passing**, covering: dataset creation (length, keys, shapes, padding correctness, attention-mask alignment, dtypes, length-mismatch error, stats), DataLoader batching, forward pass on real dataset batches, backward gradient flow through both branches, multiclass classification, HF Trainer smoke test (5 steps), and the `prepare_finetune_dataset` convenience function.

---

## Cumulative Test Summary

| Phase | Tests | Coverage |
|-------|-------|----------|
| 2A: Tokenizers | 72 | Schema validation, 5 field tokenizers (edge cases: NaN, None, overflow, UNK), predefined schemas, builder pipeline, end-to-end encoding |
| 2B: Models | 33 | Config presets, forward pass shapes, loss computation, weight tying, user embeddings, param counts, gradient checkpointing, causal masking, PLR, DCNv2, JointFusion, tokenizer–model integration |
| 2C: Pre-training | 19 | Tokenization, packing, collation, DataCollator behavior, forward+backward integration, 24-step Trainer smoke test, error handling |
| 2D: Fine-tuning | 15 | Dataset creation/validation, batching, forward/backward through JointFusion, 5-step Trainer smoke test, multiclass, convenience function |
| **Total** | **139** | **All passing** |

---

## Library API Summary (v0.4.0)

```python
from domain_tokenizer import (
    # ...
    # Models
    DomainTransformerConfig, DomainTransformerForCausalLM,
    PeriodicLinearReLU, JointFusionModel, DCNv2,
    # Pre-training
    prepare_clm_dataset, pretrain_domain_model,
    # Fine-tuning
    DomainFinetuneDataset, prepare_finetune_dataset, finetune_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
```

### End-to-End Usage: Pre-training → Fine-tuning

```python
# 1. Build tokenizer from schema
# ...
hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)

# 2. Prepare packed training data
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

# 3. Create and pre-train model
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    hub_model_id="org/finance-24m",
    num_epochs=10, learning_rate=3e-4, bf16=True,
)

# 4. Create joint fusion model for fine-tuning
fusion = JointFusionModel(
    transformer_model=model,   # pre-trained, unfrozen
    n_tabular_features=291,    # hand-crafted tabular features
    n_classes=1,               # binary: will user activate product?
)

# 5. Prepare fine-tuning data
ft_dataset = prepare_finetune_dataset(
    user_sequences, tabular_features, labels,
    builder, hf_tokenizer, max_length=512,
)

# 6. Fine-tune
finetune_domain_model(
    fusion, ft_dataset,
    num_epochs=5, learning_rate=1e-4, bf16=True,
)
```
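
A natural step after step 6 is scoring, continuing with the `fusion` and `ft_dataset` names from the block above. This sketch assumes the model can be called without `labels` and yields logits, either directly or in a dict; that return shape is an assumption, not documented here.

```python
import torch
from torch.utils.data import DataLoader

# 7. Score users with the fine-tuned model (hypothetical sketch).
fusion.eval()
batch = next(iter(DataLoader(ft_dataset, batch_size=32)))
with torch.no_grad():
    out = fusion(
        input_ids=batch["input_ids"],
        attention_mask=batch["attention_mask"],
        tabular_features=batch["tabular_features"],
    )
# Return type is an assumption: unwrap a dict-style output, else treat as raw logits.
logits = out["logits"] if isinstance(out, dict) else out
probs = torch.sigmoid(logits.squeeze(-1))  # binary head (n_classes=1)
```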