rtferraz committed
Commit 7aac458 · verified · 1 Parent(s): abab711

Update implementation report: add Phase 2D, update header to v0.4.0 / 139 tests, update cumulative summary and API

Files changed (1)
  1. docs/phase2_implementation_report.md +78 -15
docs/phase2_implementation_report.md CHANGED
@@ -1,8 +1,8 @@
- # Phase 2A–2C Implementation Report

- > **domainTokenizer v0.3.0** — Core library complete: tokenizers, models, pre-training pipeline
  >
- > **124 tests passing** (72 tokenizer + 33 model + 19 training)
  >
  > *April 2026*

@@ -10,13 +10,13 @@

  ## Overview

- Phase 2 implements the core domainTokenizer library — everything needed to go from raw domain events (financial transactions, e-commerce actions, clinical encounters) to a pre-trained Transformer foundation model. The implementation directly follows the validated patterns from Nubank's nuFormer (arXiv:2507.23267), with architecture decisions grounded in 6 audited reference papers.

- The library is organized as three layers, each built and tested independently before composing into the next:

  ```
- Phase 2A: Tokenizers → Phase 2B: Models → Phase 2C: Training Pipeline
- (schema → tokens) (tokens → loss) (data → Trainer → checkpoints)
  ```

  ---
@@ -142,18 +142,62 @@ Loss decreased monotonically from 5.42 to 4.32 with cosine decay — the tokeniz

  ---

  ## Cumulative Test Summary

  | Phase | Tests | Coverage |
  |-------|-------|----------|
  | 2A: Tokenizers | 72 | Schema validation, 5 field tokenizers (edge cases: NaN, None, overflow, UNK), predefined schemas, builder pipeline, end-to-end encoding |
  | 2B: Models | 33 | Config presets, forward pass shapes, loss computation, weight tying, user embeddings, param counts, gradient checkpointing, causal masking, PLR, DCNv2, JointFusion, tokenizer→model integration |
- | 2C: Training | 19 | Tokenization, packing, collation, DataCollator behavior, forward+backward integration, 24-step Trainer smoke test, error handling |
- | **Total** | **124** | **All passing** |

  ---

- ## Library API Summary (v0.3.0)

  ```python
  from domain_tokenizer import (
@@ -164,13 +208,15 @@ from domain_tokenizer import (
      # Models
      DomainTransformerConfig, DomainTransformerForCausalLM,
      PeriodicLinearReLU, JointFusionModel, DCNv2,
-     # Training
      prepare_clm_dataset, pretrain_domain_model,
  )
  from domain_tokenizer.schemas import FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
  ```

- ### End-to-End Usage

  ```python
  # 1. Build tokenizer from schema
@@ -181,14 +227,31 @@ hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)
  # 2. Prepare packed training data
  dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

- # 3. Create model
  config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
  model = DomainTransformerForCausalLM(config)
-
- # 4. Pre-train
  pretrain_domain_model(
      model, hf_tokenizer, dataset,
      hub_model_id="org/finance-24m",
      num_epochs=10, learning_rate=3e-4, bf16=True,
  )
  ```
 
+ # Phase 2A–2D Implementation Report

+ > **domainTokenizer v0.4.0** — Core library complete: tokenizers, models, pre-training, fine-tuning
  >
+ > **139 tests passing** (72 tokenizer + 33 model + 19 pre-training + 15 fine-tuning)
  >
  > *April 2026*

 

  ## Overview

+ Phase 2 implements the complete domainTokenizer library — everything needed to go from raw domain events (financial transactions, e-commerce actions, clinical encounters) to a fine-tuned downstream prediction model. The implementation directly follows the validated patterns from Nubank's nuFormer (arXiv:2507.23267), with architecture decisions grounded in 6 audited reference papers.

+ The library is organized as four layers, each built and tested independently before composing into the next:

  ```
+ Phase 2A: Tokenizers → Phase 2B: Models → Phase 2C: Pre-training → Phase 2D: Fine-tuning
+ (schema → tokens) (tokens → loss) (CLM on sequences) (joint fusion on labels)
  ```

  ---
 

  ---

+ ## Phase 2D: Fine-tuning Pipeline (Weeks 7–9)
+
+ ### What Was Built
+
+ A supervised fine-tuning pipeline for the JointFusionModel — the nuFormer-style architecture that combines a pre-trained transaction Transformer with DCNv2(PLR) tabular features for downstream prediction tasks.
+
+ | Component | Purpose |
+ |-----------|---------|
+ | `DomainFinetuneDataset` | Per-user torch Dataset yielding `{input_ids, attention_mask, tabular_features, labels}` |
+ | `prepare_finetune_dataset()` | Convenience constructor with validation and logging |
+ | `finetune_domain_model()` | Fine-tunes JointFusionModel via HF Trainer — zero subclassing needed |
+
+ ### Key Technical Decisions
+
+ 1. **HF Trainer Pattern A — zero custom code required.** The critical discovery is that HuggingFace Trainer inspects `JointFusionModel.forward(self, input_ids, attention_mask, tabular_features, labels)` via `inspect.signature()`. Because `tabular_features` is a named parameter in the forward signature, the Trainer automatically keeps that dataset column and passes it to the model. No `compute_loss` override, no `remove_unused_columns=False`, no Trainer subclass (see the sketch after this list). This was verified empirically on transformers 5.7.0 — the Trainer's `_set_signature_columns_if_needed()` method builds the allowed column list directly from the model's `forward()` parameters, and this works identically for a plain `nn.Module` and a `PreTrainedModel`.
+
+ 2. **Per-user padding, not packing.** Unlike pre-training (which packs sequences for 100% token utilization), fine-tuning uses per-user padded sequences. The reason: each training sample needs its own label. In pre-training, the "label" is the next token, shared across the packed block. In fine-tuning, the label is a user-level outcome (e.g., "will this user activate a product?"), so each user is a separate sample with its own label. Padding tokens are masked out of the attention via `attention_mask`, so they don't affect the user embedding extracted by `get_user_embedding()`.
+
+ 3. **Dataset returns tensors directly, no custom collator.** `DomainFinetuneDataset.__getitem__()` returns pre-tokenized, pre-padded torch tensors. The default PyTorch `DataLoader` collation (stacking tensors into batches) is sufficient. No `DataCollatorForLanguageModeling` is needed — that collator is for pre-training only. This simplifies the pipeline and avoids double-padding issues.
+
+ 4. **`save_strategy` is configurable (not hardcoded).** During testing, we discovered that saving JointFusionModel checkpoints via safetensors fails because the wrapped DomainTransformerForCausalLM has tied weights (lm_head ↔ embed_tokens), and safetensors rejects shared tensor storage by default. The fix: `save_strategy` is exposed as a parameter so users can set it to `"no"` during experimentation or plug in custom saving logic for production. This is a known HF issue with wrapper models that contain tied-weight sub-models.
+
+ 5. **Binary and multiclass via the `n_classes` parameter.** The same `JointFusionModel` and `finetune_domain_model()` handle both binary classification (`n_classes=1`, BCE loss) and multiclass (`n_classes>1`, CE loss). The loss function switches automatically based on `n_classes`. Labels are `float` for binary and `long` for multiclass — the dataset returns `float32` by default, and the caller casts to `long` for multiclass.
+
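+ The sketch below illustrates decisions 1, 2, 3, and 5 in one place. It is a minimal, self-contained toy: `ToySequenceEncoder`, `ToyFusionModel`, and all layer sizes are illustrative stand-ins, not the library's actual classes.
+
+ ```python
+ import inspect
+ import torch
+ import torch.nn as nn
+
+ class ToySequenceEncoder(nn.Module):
+     """Stand-in for the pre-trained Transformer branch."""
+     hidden_size = 32
+
+     def __init__(self, vocab_size=100):
+         super().__init__()
+         self.embed = nn.Embedding(vocab_size, self.hidden_size)
+
+     def forward(self, input_ids, attention_mask):
+         return self.embed(input_ids)                      # (B, T, H)
+
+ class ToyFusionModel(nn.Module):
+     """Stand-in for JointFusionModel: sequence branch + tabular branch + head."""
+
+     def __init__(self, encoder, n_tabular_features, n_classes=1):
+         super().__init__()
+         self.encoder = encoder
+         self.tabular = nn.Linear(n_tabular_features, 16)  # stand-in for PLR + DCNv2
+         self.n_classes = n_classes
+         self.head = nn.Linear(encoder.hidden_size + 16, n_classes)
+
+     # Decision 1: tabular_features is a named forward() parameter, so the default
+     # HF Trainer keeps that dataset column without any subclassing.
+     def forward(self, input_ids, attention_mask, tabular_features, labels=None):
+         hidden = self.encoder(input_ids, attention_mask)
+         # Decision 2: mask padding before pooling, so pad tokens never reach the user embedding
+         mask = attention_mask.unsqueeze(-1).float()
+         user_emb = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1)
+         logits = self.head(torch.cat([user_emb, self.tabular(tabular_features)], dim=-1))
+         loss = None
+         if labels is not None:
+             if self.n_classes == 1:                       # Decision 5: binary -> BCE on float labels
+                 loss = nn.functional.binary_cross_entropy_with_logits(logits.squeeze(-1), labels.float())
+             else:                                         # multiclass -> CE on long labels
+                 loss = nn.functional.cross_entropy(logits, labels.long())
+         return {"loss": loss, "logits": logits}
+
+ # What Trainer._set_signature_columns_if_needed() effectively does: keep only the
+ # dataset columns whose names appear in the model's forward() signature.
+ print(list(inspect.signature(ToyFusionModel.forward).parameters))
+ # ['self', 'input_ids', 'attention_mask', 'tabular_features', 'labels']
+ ```
+
+ Because `DomainFinetuneDataset.__getitem__()` already returns fixed-length tensors for exactly these keys (decision 3), the default stacking collation produces batches that feed straight into this signature.
+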
+ ### Smoke Test Results
+
+ A 5-step fine-tuning run on CPU with a tiny model confirmed the full pipeline:
+
+ ```
+ Step 1: loss=0.750 grad_norm=7.158 lr=1.000e-03
+ Step 3: loss=0.996 grad_norm=3.771 lr=6.545e-04
+ Step 5: loss=0.818 grad_norm=2.681 lr=9.549e-05
+ Train loss: 0.752 (5 steps, 20 samples, batch=4)
+ ```
+
+ Both the Transformer branch and the PLR+DCNv2 tabular branch received gradients — end-to-end joint training is functional. A minimal version of that check is sketched below.
+
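+ As an illustration only (reusing the hypothetical `ToyFusionModel` sketch above, not the library's test suite), the gradient-flow claim can be checked like this:
+
+ ```python
+ model = ToyFusionModel(ToySequenceEncoder(), n_tabular_features=8, n_classes=1)
+ batch = {
+     "input_ids": torch.randint(0, 100, (4, 16)),
+     "attention_mask": torch.ones(4, 16, dtype=torch.long),
+     "tabular_features": torch.randn(4, 8),
+     "labels": torch.randint(0, 2, (4,)).float(),
+ }
+ model(**batch)["loss"].backward()
+ # both branches must have received gradients from the joint loss
+ assert all(p.grad is not None for p in model.encoder.parameters())
+ assert all(p.grad is not None for p in model.tabular.parameters())
+ ```
+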
+ ### Test Results
+
+ **15 tests passing**, covering: dataset creation (length, keys, shapes, padding correctness, attention mask alignment, dtypes, length mismatch error, stats), DataLoader batching, forward pass on real dataset batches, backward gradient flow through both branches, multiclass classification, an HF Trainer smoke test (5 steps), and the `prepare_finetune_dataset` convenience function.
+
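+ As a flavour of what those dataset checks look like, here is a hypothetical pytest-style sketch of the "attention mask alignment" case. It assumes the `ft_dataset` and `hf_tokenizer` objects from the usage example below, and that padding uses the tokenizer's pad token:
+
+ ```python
+ import torch
+
+ def test_attention_mask_matches_padding():
+     sample = ft_dataset[0]
+     ids, mask = sample["input_ids"], sample["attention_mask"]
+     # real tokens are attended to; every pad position is masked out
+     assert torch.equal(mask.bool(), ids.ne(hf_tokenizer.pad_token_id))
+ ```
+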
+ ---
+
  ## Cumulative Test Summary

  | Phase | Tests | Coverage |
  |-------|-------|----------|
  | 2A: Tokenizers | 72 | Schema validation, 5 field tokenizers (edge cases: NaN, None, overflow, UNK), predefined schemas, builder pipeline, end-to-end encoding |
  | 2B: Models | 33 | Config presets, forward pass shapes, loss computation, weight tying, user embeddings, param counts, gradient checkpointing, causal masking, PLR, DCNv2, JointFusion, tokenizer→model integration |
+ | 2C: Pre-training | 19 | Tokenization, packing, collation, DataCollator behavior, forward+backward integration, 24-step Trainer smoke test, error handling |
+ | 2D: Fine-tuning | 15 | Dataset creation/validation, batching, forward/backward through JointFusion, 5-step Trainer smoke test, multiclass, convenience function |
+ | **Total** | **139** | **All passing** |

  ---

+ ## Library API Summary (v0.4.0)

  ```python
  from domain_tokenizer import (

      # Models
      DomainTransformerConfig, DomainTransformerForCausalLM,
      PeriodicLinearReLU, JointFusionModel, DCNv2,
+     # Pre-training
      prepare_clm_dataset, pretrain_domain_model,
+     # Fine-tuning
+     DomainFinetuneDataset, prepare_finetune_dataset, finetune_domain_model,
  )
  from domain_tokenizer.schemas import FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
  ```

+ ### End-to-End Usage: Pre-training → Fine-tuning

  ```python
  # 1. Build tokenizer from schema

  # 2. Prepare packed training data
  dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)

+ # 3. Create and pre-train model
  config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
  model = DomainTransformerForCausalLM(config)
  pretrain_domain_model(
      model, hf_tokenizer, dataset,
      hub_model_id="org/finance-24m",
      num_epochs=10, learning_rate=3e-4, bf16=True,
  )
+
+ # 4. Create joint fusion model for fine-tuning
+ fusion = JointFusionModel(
+     transformer_model=model,     # pre-trained, unfrozen
+     n_tabular_features=291,      # hand-crafted tabular features
+     n_classes=1,                 # binary: will user activate product?
+ )
+
+ # 5. Prepare fine-tuning data
+ ft_dataset = prepare_finetune_dataset(
+     user_sequences, tabular_features, labels,
+     builder, hf_tokenizer, max_length=512,
+ )
+
+ # 6. Fine-tune
+ finetune_domain_model(
+     fusion, ft_dataset,
+     num_epochs=5, learning_rate=1e-4, bf16=True,
+ )
  ```
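+
+ After fine-tuning, the fused model can score individual users. The snippet below is an illustrative assumption based only on the `forward()` signature described in Phase 2D, decision 1; how outputs are packaged depends on the actual `JointFusionModel` implementation.
+
+ ```python
+ import torch
+
+ # 7. (Illustrative) score one user with the fine-tuned model
+ fusion.eval()
+ sample = ft_dataset[0]
+ with torch.no_grad():
+     output = fusion(
+         input_ids=sample["input_ids"].unsqueeze(0),            # add a batch dimension
+         attention_mask=sample["attention_mask"].unsqueeze(0),
+         tabular_features=sample["tabular_features"].unsqueeze(0),
+         labels=sample["labels"].unsqueeze(0),                  # optional here; also yields a loss
+     )
+ ```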