rtferraz committed · Commit 756d197 · verified · Parent(s): 7aac458

Add ADR-002: Dataset selection for Phase 3 demos — research findings, rationale, phased plan

Files changed (1): docs/adr/ADR-002-dataset-selection.md (ADDED, +322 −0)
# ADR-002: Dataset Selection for Phase 3 Domain Demos

> **Status:** Accepted
> **Date:** April 30, 2026
> **Decision:** Start with `mindweave/bank-transactions-us` for pipeline validation, then scale to Sparkov (finance), REES46 (e-commerce), and Synthea (healthcare)

---

## 1. Context

Phase 2 delivered a complete library (v0.4.0, 139 tests) with tokenizers, models, and training pipelines — all validated on synthetic data generated in test fixtures. Phase 3 requires running the full pipeline on **real public datasets** to produce trained models and benchmark against baselines.

We need datasets for three domains matching our predefined schemas (an illustrative event for each is sketched after the table):

| Schema | Required Fields | Minimum Scale for Demo |
|--------|----------------|------------------------|
| `FINANCE_SCHEMA` | timestamp, signed amount, text description | 100+ users × 10+ events |
| `ECOMMERCE_SCHEMA` | timestamp, price, event type, category, product text | 1000+ users × 10+ events |
| `HEALTHCARE_SCHEMA` | timestamp, event type, severity, cost, clinical text | 1000+ patients × 10+ events |
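For concreteness, here is a hedged illustration of what a single event dict might look like under each schema. The finance field names match the `row_to_event()` code in §4.1; the e-commerce and healthcare names are assumptions derived from the required-fields column above, not the library's actual schema definitions.

```python
from datetime import datetime

# Illustrative only. FINANCE field names match §4.1's row_to_event(); the
# ECOMMERCE/HEALTHCARE field names are assumptions based on the table above.
finance_event = {
    "timestamp": datetime(2024, 1, 4),
    "amount_sign": -17584.14,  # sign consumed by SignTokenizer
    "amount": -17584.14,       # magnitude consumed by MagnitudeBucketTokenizer
    "description": "Payroll - net wages",
}
ecommerce_event = {
    "timestamp": datetime(2019, 10, 1, 12, 30),
    "price": 499.99,
    "event_type": "purchase",
    "category": "electronics.smartphone",
    "description": "samsung electronics.smartphone",
}
healthcare_event = {
    "timestamp": datetime(2020, 5, 12),
    "event_type": "encounter",
    "severity": "moderate",
    "cost": 129.50,
    "description": "Encounter for symptom",
}
```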
The strategy: **start small to validate the pipeline end-to-end, then scale to production-sized datasets.**

---

## 2. Dataset Analysis

### 2.1 Candidates Evaluated

We evaluated 8 datasets across 3 domains. Each was checked for: HuggingFace Hub availability, schema compatibility with our field types, scale (users × events), licensing, and accessibility (instant download vs. gated/external).

#### Finance Candidates

| Dataset | Source | Users | Events | Schema Fit | Access |
|---------|--------|-------|--------|-----------|--------|
| **mindweave/bank-transactions-us** | HF Hub | ~20 accounts | ~400 | ✅ Perfect | Instant |
| **Sparkov CC Fraud** (kartik2112) | Kaggle | ~1,000 | 1.3M | ✅ Excellent | Kaggle account |
| IBM AML Transactions | GitHub | Thousands | 550K–55M | ✅ Good | Direct download |

**mindweave/bank-transactions-us** — inspected in detail:
- Config `bank_transactions`: 11 columns, 0.4MB Parquet
- `transaction_date` (string, `"2024-01-04"`) → maps to `FINANCE_SCHEMA.timestamp` ✅
- `amount` (float64, signed: `-17584.14` for payroll, `+1413.94` for deposits) → maps to `amount` and `amount_sign` ✅
- `description` (string, `"Payroll - net wages"`, `"Customer payment received"`) → maps to `description` ✅
- `source_module` (string, `"payroll"`, `"sales"`, `"purchases"`) → bonus categorical field ✅
- `transaction_type` (string, `"withdrawal"`, `"deposit"`) → redundant with the amount's sign, but useful for validation
- `bank_account_id` (UUID) → user grouping key ✅
- Linked `bank_accounts` table has company/bank metadata for potential tabular features

**Scale limitation:** ~20 accounts × ~20 transactions each = ~400 total. This is too small for meaningful pre-training, but **schema-perfect for pipeline validation**: every field maps directly to `FINANCE_SCHEMA` without transformation. The model will overfit immediately, but that's exactly what confirms the pipeline works.

**Sparkov CC Fraud** — the scale-up target:
- ~1,000 cardholders × ~1,300 transactions each = 1.3M total events
- `trans_date_trans_time`, `amt`, `merchant`, `category`, `cc_num`, `is_fraud`
- CC0 license (public domain)
- `is_fraud` provides a natural fine-tuning label (binary classification)
- Requires a Kaggle account for download (free, instant)

#### E-Commerce Candidates

| Dataset | Source | Users | Events | Schema Fit | Access |
|---------|--------|-------|--------|-----------|--------|
| **REES46 Behavioral** | HF Hub | Millions | 42M | ✅ Perfect | Instant |
| Amazon Reviews 2023 | HF Hub (gated) | 33M | 571M | ⚠️ No price in reviews | HF token |

**REES46** (`kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019`) — inspected:
- 2GB Parquet (10 files), fully accessible on HF Hub
- `event_time` (ISO 8601), `event_type` (`"view"`, `"cart"`, `"purchase"`), `product_id`, `category_code` (`"electronics.smartphone"`), `brand`, `price` (float64), `user_id`
- Every `ECOMMERCE_SCHEMA` field maps directly
- For pre-training: filter to `purchase` events for clean transaction sequences
- Scale: can subsample to 10K–100K users for the demo; millions available for production

#### Healthcare Candidates

| Dataset | Source | Patients | Events | Schema Fit | Access |
|---------|--------|----------|--------|-----------|--------|
| **Synthea 575K** | HF Hub | 575K | Millions | ✅ Excellent | Instant |
| Synthea Direct | synthea.mitre.org | 100K–1M | Millions | ✅ Same | Direct download |
| MIMIC-IV | PhysioNet | 40K+ ICU | Millions | ✅ Gold standard | 1–2 day DUA |

**Synthea 575K** (`richardyoung/synthea-575k-patients`) — inspected:
- 136GB total across 18 Parquet files (allergies, conditions, encounters, medications, observations, procedures, etc.)
- Default config shows the allergies table: `START` (date), `PATIENT` (UUID), `DESCRIPTION`, `TYPE`, `CATEGORY`, `SEVERITY1`
- For richer sequences: load `encounters.parquet` (5.1GB) with `Start`, `DESCRIPTION`, `Base_Cost`, `REASONDESCRIPTION`
- Fully synthetic — no IRB, no access restrictions, MIT/Apache 2.0 license

### 2.2 Schema Mapping Verification

Direct field mapping from `mindweave/bank-transactions-us` to `FINANCE_SCHEMA`:

```
Dataset Column     → FINANCE_SCHEMA Field   → Tokenizer
─────────────────────────────────────────────────────────────────
amount (sign)      → amount_sign            → SignTokenizer (2 tokens)
amount (magnitude) → amount                 → MagnitudeBucketTokenizer (21 bins)
transaction_date   → timestamp              → CalendarTokenizer (month/dow/dom/hour)
description        → description            → BPE subword tokenizer
─────────────────────────────────────────────────────────────────
bank_account_id    → (user grouping key)    → group-by for user sequences
source_module      → (bonus: not in schema) → could extend schema
transaction_type   → (redundant with sign)  → validation check
```

**Zero transformation needed.** The `amount` field is already signed (negative = withdrawal, positive = deposit). The `description` field contains natural text suitable for BPE. The `transaction_date` is a standard date string. This is the cleanest possible mapping to our schema. A sketch of how the tokenizers might consume these fields follows.
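To make the sign/magnitude/calendar split concrete, here is a minimal sketch of the decomposition. The token names and the log-scale bucket edges are illustrative assumptions, not the library's actual `SignTokenizer`/`MagnitudeBucketTokenizer`/`CalendarTokenizer` internals.

```python
import math
from datetime import datetime

# Illustrative sketch only: token names and the log-scale bucket edges are
# assumptions, not the library's actual tokenizer implementations.
def sign_token(amount: float) -> str:
    return "[AMT_SIGN_NEG]" if amount < 0 else "[AMT_SIGN_POS]"

def magnitude_token(amount: float, num_bins: int = 21) -> str:
    mag = abs(amount)
    if mag < 1:
        bucket = 0  # sub-dollar amounts share the first bucket
    else:
        # roughly 5 buckets per decade, capped at the last (open-ended) bucket
        bucket = min(int(math.log10(mag) * 5) + 1, num_bins - 1)
    return f"[AMT_MAG_{bucket:02d}]"

def calendar_tokens(ts: datetime) -> list[str]:
    return [
        f"[MONTH_{ts.month:02d}]",
        f"[DOW_{ts.weekday()}]",
        f"[DOM_{ts.day:02d}]",
        f"[HOUR_{ts.hour:02d}]",
    ]

# "Payroll - net wages", -17584.14 on 2024-01-04:
ts = datetime(2024, 1, 4)
print(sign_token(-17584.14), magnitude_token(-17584.14), *calendar_tokens(ts))
```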
---

## 3. Decision

### Phased approach: validate small → scale up

| Phase | Dataset | Purpose | Scale |
|-------|---------|---------|-------|
| **3.0: Pipeline Validation** | `mindweave/bank-transactions-us` | Verify end-to-end: load → tokenize → pack → train → loss decreases | ~400 events, ~20 accounts |
| **3.1: Finance Demo** | Sparkov CC Fraud (Kaggle) | Train 24M model, fine-tune fraud detection, benchmark vs LightGBM | 1.3M events, 1K users |
| **3.2: E-Commerce Demo** | REES46 (HF Hub) | Train 24M model, next-purchase prediction | 42M events, subsample to 100K users |
| **3.3: Healthcare Demo** | Synthea 575K (HF Hub) | Train 24M model, condition prediction | 575K patients, subsample encounters |

### Rationale

1. **Start with mindweave because it's schema-perfect and instant.** No data cleaning, no field renaming, no Kaggle credentials needed. The pipeline either works or it doesn't — this dataset tells us in minutes.

2. **The model will overfit on 400 events — that's the point.** If loss doesn't decrease on 400 events, the pipeline is broken. If it does, the pipeline works and we can scale with confidence.

3. **Sparkov is the real finance demo.** 1,000 users × 1,300 events is the exact scale where a 24M-parameter model should learn meaningful patterns. The `is_fraud` label enables a direct comparison with LightGBM on the same data.

4. **REES46 is the flagship demo.** Millions of events, real behavioral data, perfect schema fit, instant HF download. This is the dataset that demonstrates domainTokenizer's value proposition most compellingly.

5. **Synthea is the healthcare proof point.** Fully synthetic (no access barriers), massive scale, multiple event types. It validates that the domain-tokenizer approach generalizes beyond finance and e-commerce.

---

## 4. Implementation

### 4.1 Phase 3.0: Pipeline Validation with mindweave

**Goal:** Run the complete pipeline end-to-end on real data, verify loss decreases, confirm no bugs.

**Step 1: Load and explore the data**

```python
from datasets import load_dataset
import pandas as pd

# Load bank transactions
ds = load_dataset("mindweave/bank-transactions-us", "bank_transactions", split="train")
df = ds.to_pandas()

# Basic stats
print(f"Total transactions: {len(df)}")
print(f"Unique accounts: {df['bank_account_id'].nunique()}")
print(f"Date range: {df['transaction_date'].min()} to {df['transaction_date'].max()}")
print(f"Amount range: {df['amount'].min():.2f} to {df['amount'].max():.2f}")
print(f"Descriptions: {df['description'].nunique()} unique")
print(f"Source modules: {df['source_module'].value_counts().to_dict()}")
```
**Step 2: Convert to domainTokenizer event format**

```python
from datetime import datetime

def row_to_event(row):
    """Convert a DataFrame row to a FINANCE_SCHEMA event dict."""
    return {
        "amount_sign": row["amount"],  # SignTokenizer reads the sign
        "amount": row["amount"],       # MagnitudeBucketTokenizer reads abs value
        "timestamp": datetime.strptime(row["transaction_date"], "%Y-%m-%d"),
        "description": row["description"],  # BPE tokenizer
    }

# Group by account → list of event sequences
user_sequences = []
for account_id, group in df.sort_values("transaction_date").groupby("bank_account_id"):
    events = [row_to_event(row) for _, row in group.iterrows()]
    user_sequences.append(events)

print(f"Users: {len(user_sequences)}")
print(f"Events per user: {[len(s) for s in user_sequences]}")
```
**Step 3: Build tokenizer, prepare data, train**

```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# Build tokenizer
all_events = [e for seq in user_sequences for e in seq]
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)
hf_tokenizer = builder.build(
    text_corpus=[e["description"] for e in all_events],
    bpe_vocab_size=500,  # small vocab for small dataset
)

# Prepare packed dataset
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=128)
print(f"Packed blocks: {len(dataset)} × 128 tokens")

# Create tiny model (for validation, not real training)
config = DomainTransformerConfig(
    vocab_size=hf_tokenizer.vocab_size,
    hidden_size=128, num_hidden_layers=4, num_attention_heads=4,
    intermediate_size=512,
)
model = DomainTransformerForCausalLM(config)
print(f"Model params: {sum(p.numel() for p in model.parameters()):,}")

# Train — expect loss to decrease rapidly (overfitting on small data = pipeline works)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    num_epochs=20,
    per_device_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    warmup_steps=10,
    logging_steps=5,
    save_strategy="no",
    report_to="none",
)
```

**Expected outcome:** Loss should drop from ~6.0 to <2.0 within 20 epochs on 400 events. If it does, the pipeline is validated. If it doesn't, there's a bug in tokenization, packing, or the model architecture.
**Validation checks after training:**
- [ ] Loss decreased monotonically (overfitting expected and desired)
- [ ] No NaN/inf in loss or gradients
- [ ] Token distribution is reasonable (no >50% UNK tokens; see the sketch after this list)
- [ ] `builder.tokenize_event()` produces expected token strings for sample events
- [ ] `hf_tokenizer.decode()` on model output produces recognizable token strings
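A minimal sketch of the UNK-rate check, assuming the packed dataset yields an `input_ids` list per block and that the built tokenizer exposes a standard `unk_token_id`; adapt to the actual dataset format.

```python
# Hedged sketch: assumes each packed block is a dict with an "input_ids" list
# and that hf_tokenizer has a standard unk_token_id attribute.
unk_id = hf_tokenizer.unk_token_id
total_tokens = unk_tokens = 0
for block in dataset:
    ids = block["input_ids"]
    total_tokens += len(ids)
    unk_tokens += sum(1 for t in ids if t == unk_id)

unk_rate = unk_tokens / total_tokens
print(f"UNK rate: {unk_rate:.1%}")
assert unk_rate < 0.5, "tokenizer vocabulary is missing too many events"
```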
### 4.2 Phase 3.1: Finance Demo with Sparkov (After Validation)

```bash
# Download from Kaggle
kaggle datasets download kartik2112/fraud-detection -p data/
unzip data/fraud-detection.zip -d data/sparkov/
```
```python
import pandas as pd
from datetime import datetime

df = pd.read_csv("data/sparkov/fraudTrain.csv")

def sparkov_to_event(row):
    return {
        "amount_sign": row["amt"],  # always positive in Sparkov; sign from context
        "amount": row["amt"],
        "timestamp": datetime.strptime(row["trans_date_trans_time"], "%Y-%m-%d %H:%M:%S"),
        "description": f"{row['merchant']} {row['category']}",
    }

# Group by cardholder
user_sequences = []
labels = []  # for fine-tuning: any fraud in user's history?
for cc_num, group in df.sort_values("trans_date_trans_time").groupby("cc_num"):
    events = [sparkov_to_event(row) for _, row in group.iterrows()]
    user_sequences.append(events)
    labels.append(int(group["is_fraud"].any()))

# Pre-train 24M model on 1K users × 1.3K events
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)
# ... pretrain_domain_model(model, ..., bf16=True)  # requires GPU

# Fine-tune for fraud detection
# ... finetune_domain_model(fusion_model, ft_dataset, ...)
```
**Hardware:** a10g-large (24GB VRAM), ~2–3 hours for the 24M model on 1.3M events. A sketch of the LightGBM side of the benchmark follows.
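For the LightGBM comparison mentioned in the rationale, here is a hedged sketch of a per-cardholder baseline. The aggregate features (`n_tx`, `amt_mean`, etc.) are illustrative assumptions, not the project's actual benchmark feature set.

```python
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative baseline only: simple per-cardholder aggregates. groupby sorts
# by cc_num, so these rows align with the `labels` list built above.
feats = df.groupby("cc_num").agg(
    n_tx=("amt", "size"),
    amt_mean=("amt", "mean"),
    amt_max=("amt", "max"),
    n_categories=("category", "nunique"),
)

X_train, X_test, y_train, y_test = train_test_split(
    feats, labels, test_size=0.2, random_state=42, stratify=labels
)
clf = lgb.LGBMClassifier(n_estimators=200)
clf.fit(X_train, y_train)
print("Baseline AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```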
### 4.3 Phase 3.2: E-Commerce Demo with REES46

```python
from datasets import load_dataset

ds = load_dataset(
    "kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019",
    split="train",
)

# Filter to purchases and subsample users (see the sketch below)
purchases = ds.filter(lambda x: x["event_type"] == "purchase")
# Group by user_id, take top 100K users by event count
# ... build ECOMMERCE_SCHEMA tokenizer, train 24M model
```
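One way the subsampling step could look, assuming the purchase-only subset fits in memory as a DataFrame (purchases are a small fraction of the 42M events); column names come from the inspection in §2.1.

```python
# Hedged sketch of the "top 100K users" subsampling step.
pdf = purchases.to_pandas()
top_users = pdf["user_id"].value_counts().head(100_000).index
subset = pdf[pdf["user_id"].isin(top_users)].sort_values("event_time")
print(f"Users: {subset['user_id'].nunique()}, events: {len(subset)}")
```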
### 4.4 Phase 3.3: Healthcare Demo with Synthea

```python
from huggingface_hub import hf_hub_download
import pandas as pd

encounters = pd.read_parquet(hf_hub_download(
    "richardyoung/synthea-575k-patients",
    "data/encounters.parquet",
    repo_type="dataset",
))

# Group by PATIENT, sort by Start date (see the sketch below)
# Map: Start→timestamp, Base_Cost→amount, DESCRIPTION→description
# ... build HEALTHCARE_SCHEMA tokenizer, train 24M model
```
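A minimal sketch of that grouping/mapping step, using the column names from the §2.1 inspection and assuming the encounters table fits in memory (in practice, subsample first, as the phased plan notes). Fields such as event type and severity would need to come from other Synthea tables, so this is only the partial mapping named in the comments above.

```python
# Hedged sketch: builds per-patient encounter sequences from the mapping
# comments above. Event dict field names mirror the FINANCE_SCHEMA style used
# in §4.1 and are assumptions; the real HEALTHCARE_SCHEMA may differ.
def encounter_to_event(row):
    reason = row["REASONDESCRIPTION"] if pd.notna(row["REASONDESCRIPTION"]) else ""
    return {
        "timestamp": pd.to_datetime(row["Start"]),
        "amount": row["Base_Cost"],
        "description": f"{row['DESCRIPTION']} {reason}".strip(),
    }

patient_sequences = [
    [encounter_to_event(row) for _, row in group.iterrows()]
    for _, group in encounters.sort_values("Start").groupby("PATIENT")
]
print(f"Patients: {len(patient_sequences)}")
```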
---

## 5. Risks and Mitigations

| Risk | Impact | Mitigation |
|------|--------|-----------|
| mindweave too small to catch scale bugs | Bugs only surface at 1M+ events | Run Sparkov immediately after validation passes |
| Sparkov has no negative amounts | `SignTokenizer` always produces `[AMT_SIGN_POS]` | Concatenate merchant+category as description; test the sign tokenizer separately on mindweave (which has signed amounts) |
| REES46 2GB download slow | Delays e-commerce demo | Stream via HF datasets `streaming=True` or subsample first |
| Synthea encounters lack numerical values | `MagnitudeBucketTokenizer` underutilized | Use `Base_Cost` for cost binning; join with `observations.parquet` for lab values |
| Model overfits on 400 events | Expected — not a bug | Overfitting on the pipeline-validation dataset confirms the pipeline works. Move to Sparkov for real training. |

---

*This ADR will be updated with results from each phase as demos are completed.*