# ADR-002: Dataset Selection for Phase 3 Domain Demos
> **Status:** Accepted
> **Date:** April 30, 2026
> **Decision:** Start with `mindweave/bank-transactions-us` for pipeline validation, then scale to Sparkov (finance), REES46 (e-commerce), and Synthea (healthcare)
---
## 1. Context
Phase 2 delivered a complete library (v0.4.0, 139 tests) with tokenizers, models, and training pipelines, all validated on synthetic data generated in test fixtures. Phase 3 requires running the full pipeline on **real public datasets** to produce trained models and benchmark against baselines.
We need datasets for three domains matching our predefined schemas:
| Schema | Required Fields | Minimum Scale for Demo |
|--------|----------------|----------------------|
| `FINANCE_SCHEMA` | timestamp, signed amount, text description | 100+ users × 10+ events |
| `ECOMMERCE_SCHEMA` | timestamp, price, event type, category, product text | 1000+ users × 10+ events |
| `HEALTHCARE_SCHEMA` | timestamp, event type, severity, cost, clinical text | 1000+ patients × 10+ events |
The strategy: **start small to validate the pipeline end-to-end, then scale to production-sized datasets.**
---
## 2. Dataset Analysis
### 2.1 Candidates Evaluated
We evaluated 8 datasets across 3 domains. Each was checked for HuggingFace Hub availability, schema compatibility with our field types, scale (users × events), licensing, and accessibility (instant download vs. gated/external).
#### Finance Candidates
| Dataset | Source | Users | Events | Schema Fit | Access |
|---------|--------|-------|--------|-----------|--------|
| **mindweave/bank-transactions-us** | HF Hub | ~20 accounts | ~400 | ✅ Perfect | Instant |
| **Sparkov CC Fraud** (kartik2112) | Kaggle | ~1,000 | 1.3M | ✅ Excellent | Kaggle account |
| IBM AML Transactions | GitHub | Thousands | 550K–55M | ✅ Good | Direct download |
**mindweave/bank-transactions-us**, inspected in detail:
- Config `bank_transactions`: 11 columns, 0.4MB Parquet
- `transaction_date` (string, `"2024-01-04"`) → maps to `FINANCE_SCHEMA.timestamp` ✅
- `amount` (float64, signed: `-17584.14` for payroll, `+1413.94` for deposits) → maps to `amount` and `amount_sign` ✅
- `description` (string, `"Payroll - net wages"`, `"Customer payment received"`) → maps to `description` ✅
- `source_module` (string, `"payroll"`, `"sales"`, `"purchases"`) → bonus categorical field ✅
- `transaction_type` (string, `"withdrawal"`, `"deposit"`) → redundant with sign, but useful for validation
- `bank_account_id` (UUID) → user grouping key ✅
- Linked `bank_accounts` table has company/bank metadata for potential tabular features
**Scale limitation:** ~20 accounts × ~20 transactions each = ~400 total. This is too small for meaningful pre-training, but **schema-perfect for pipeline validation**: every field maps directly to `FINANCE_SCHEMA` without transformation. The model will overfit immediately, but that's exactly what confirms the pipeline works.
**Sparkov CC Fraud**, the scale-up target:
- ~1,000 cardholders × ~1,300 transactions each = 1.3M total events
- `trans_date_trans_time`, `amt`, `merchant`, `category`, `cc_num`, `is_fraud`
- CC0 license (public domain)
- `is_fraud` provides a natural fine-tuning label (binary classification)
- Requires Kaggle account for download (free, instant)
#### E-Commerce Candidates
| Dataset | Source | Users | Events | Schema Fit | Access |
|---------|--------|-------|--------|-----------|--------|
| **REES46 Behavioral** | HF Hub | Millions | 42M | ✅ Perfect | Instant |
| Amazon Reviews 2023 | HF Hub (gated) | 33M | 571M | ⚠️ No price in reviews | HF token |
**REES46** (`kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019`), inspected:
- 2GB Parquet (10 files), fully accessible on HF Hub
- `event_time` (ISO 8601), `event_type` (`"view"`, `"cart"`, `"purchase"`), `product_id`, `category_code` (`"electronics.smartphone"`), `brand`, `price` (float64), `user_id`
- Every ECOMMERCE_SCHEMA field maps directly
- For pre-training: filter to `purchase` events for clean transaction sequences
- Scale: can subsample to 10K–100K users for demo, millions available for production
#### Healthcare Candidates
| Dataset | Source | Patients | Events | Schema Fit | Access |
|---------|--------|----------|--------|-----------|--------|
| **Synthea 575K** | HF Hub | 575K | Millions | ✅ Excellent | Instant |
| Synthea Direct | synthea.mitre.org | 100K–1M | Millions | ✅ Same | Direct download |
| MIMIC-IV | PhysioNet | 40K+ ICU | Millions | ✅ Gold standard | 1-2 day DUA |
**Synthea 575K** (`richardyoung/synthea-575k-patients`), inspected:
- 136GB total across 18 Parquet files (allergies, conditions, encounters, medications, observations, procedures, etc.)
- Default config shows allergies table: `START` (date), `PATIENT` (UUID), `DESCRIPTION`, `TYPE`, `CATEGORY`, `SEVERITY1`
- For richer sequences: load `encounters.parquet` (5.1GB) with `Start`, `DESCRIPTION`, `Base_Cost`, `REASONDESCRIPTION`
- Fully synthetic: no IRB, no access restrictions, MIT/Apache 2.0 license
### 2.2 Schema Mapping Verification
Direct field mapping from `mindweave/bank-transactions-us` to `FINANCE_SCHEMA`:
```
Dataset Column       →  FINANCE_SCHEMA Field     →  Tokenizer
─────────────────────────────────────────────────────────────────
amount (sign)        →  amount_sign              →  SignTokenizer (2 tokens)
amount (magnitude)   →  amount                   →  MagnitudeBucketTokenizer (21 bins)
transaction_date     →  timestamp                →  CalendarTokenizer (month/dow/dom/hour)
description          →  description              →  BPE subword tokenizer
─────────────────────────────────────────────────────────────────
bank_account_id      →  (user grouping key)      →  group-by for user sequences
source_module        →  (bonus: not in schema)   →  could extend schema
transaction_type     →  (redundant with sign)    →  validation check
```
**Zero transformation needed.** The `amount` field is already signed (negative = withdrawal, positive = deposit). The `description` field contains natural text suitable for BPE. The `transaction_date` is a standard date string. This is the cleanest possible mapping to our schema.
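For illustration, a single mindweave row maps to a `FINANCE_SCHEMA` event dict like the one below; values are taken from the sample fields above, and the keys match the `row_to_event()` helper shown in §4.1:
```python
from datetime import datetime

# One mindweave transaction expressed as a FINANCE_SCHEMA event dict (illustrative values).
event = {
    "amount_sign": -17584.14,             # SignTokenizer reads only the sign
    "amount": -17584.14,                  # MagnitudeBucketTokenizer buckets the absolute value
    "timestamp": datetime(2024, 1, 4),    # CalendarTokenizer derives month/dow/dom/hour
    "description": "Payroll - net wages", # BPE subword tokenizer
}
```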
---
## 3. Decision
### Phased approach: validate small → scale up
| Phase | Dataset | Purpose | Scale |
|-------|---------|---------|-------|
| **3.0: Pipeline Validation** | `mindweave/bank-transactions-us` | Verify end-to-end: load → tokenize → pack → train → loss decreases | ~400 events, ~20 accounts |
| **3.1: Finance Demo** | Sparkov CC Fraud (Kaggle) | Train 24M model, fine-tune fraud detection, benchmark vs LightGBM | 1.3M events, 1K users |
| **3.2: E-Commerce Demo** | REES46 (HF Hub) | Train 24M model, next-purchase prediction | 42M events, subsample to 100K users |
| **3.3: Healthcare Demo** | Synthea 575K (HF Hub) | Train 24M model, condition prediction | 575K patients, subsample encounters |
### Rationale
1. **Start with mindweave because it's schema-perfect and instant.** No data cleaning, no field renaming, no Kaggle credentials needed. The pipeline either works or it doesn't; this dataset tells us in minutes.
2. **The model will overfit on 400 events; that's the point.** If loss doesn't decrease on 400 events, the pipeline is broken. If it does, the pipeline works and we can scale with confidence.
3. **Sparkov is the real finance demo.** 1,000 users × 1,300 events is the exact scale where a 24M-parameter model should learn meaningful patterns. The `is_fraud` label enables a direct comparison with LightGBM on the same data (a minimal baseline sketch follows this list).
4. **REES46 is the flagship demo.** Millions of events, real behavioral data, perfect schema fit, instant HF download. This is the dataset that demonstrates domainTokenizer's value proposition most compellingly.
5. **Synthea is the healthcare proof point.** Fully synthetic (no access barriers), massive scale, multiple event types. Validates that the domain tokenizer approach generalizes beyond finance and e-commerce.
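A minimal sketch of the tabular baseline point 3 refers to, assuming a LightGBM classifier over a toy per-transaction feature set; the actual feature engineering and evaluation split are not specified in this ADR:
```python
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("data/sparkov/fraudTrain.csv")

# Toy feature set: amount plus a label-encoded category (illustrative only).
X = pd.DataFrame({
    "amt": df["amt"],
    "category": df["category"].astype("category").cat.codes,
})
y = df["is_fraud"]

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
clf = lgb.LGBMClassifier(n_estimators=200)
clf.fit(X_train, y_train)
print("Baseline AUC:", roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1]))
```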
---
## 4. Implementation
### 4.1 Phase 3.0: Pipeline Validation with mindweave
**Goal:** Run the complete pipeline end-to-end on real data, verify loss decreases, confirm no bugs.
**Step 1: Load and explore the data**
```python
from datasets import load_dataset
import pandas as pd
# Load bank transactions
ds = load_dataset("mindweave/bank-transactions-us", "bank_transactions", split="train")
df = ds.to_pandas()
# Basic stats
print(f"Total transactions: {len(df)}")
print(f"Unique accounts: {df['bank_account_id'].nunique()}")
print(f"Date range: {df['transaction_date'].min()} to {df['transaction_date'].max()}")
print(f"Amount range: {df['amount'].min():.2f} to {df['amount'].max():.2f}")
print(f"Descriptions: {df['description'].nunique()} unique")
print(f"Source modules: {df['source_module'].value_counts().to_dict()}")
```
**Step 2: Convert to domainTokenizer event format**
```python
from datetime import datetime
def row_to_event(row):
    """Convert a DataFrame row to a FINANCE_SCHEMA event dict."""
    return {
        "amount_sign": row["amount"],  # SignTokenizer reads the sign
        "amount": row["amount"],       # MagnitudeBucketTokenizer reads abs value
        "timestamp": datetime.strptime(row["transaction_date"], "%Y-%m-%d"),
        "description": row["description"],  # BPE tokenizer
    }

# Group by account → list of event sequences
user_sequences = []
for account_id, group in df.sort_values("transaction_date").groupby("bank_account_id"):
    events = [row_to_event(row) for _, row in group.iterrows()]
    user_sequences.append(events)
print(f"Users: {len(user_sequences)}")
print(f"Events per user: {[len(s) for s in user_sequences]}")
```
**Step 3: Build tokenizer, prepare data, train**
```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# Build tokenizer
all_events = [e for seq in user_sequences for e in seq]
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)
hf_tokenizer = builder.build(
    text_corpus=[e["description"] for e in all_events],
    bpe_vocab_size=500,  # small vocab for small dataset
)

# Prepare packed dataset
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=128)
print(f"Packed blocks: {len(dataset)} × 128 tokens")

# Create tiny model (for validation, not real training)
config = DomainTransformerConfig(
    vocab_size=hf_tokenizer.vocab_size,
    hidden_size=128, num_hidden_layers=4, num_attention_heads=4,
    intermediate_size=512,
)
model = DomainTransformerForCausalLM(config)
print(f"Model params: {sum(p.numel() for p in model.parameters()):,}")

# Train: expect loss to decrease rapidly (overfitting on small data = pipeline works)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    num_epochs=20,
    per_device_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    warmup_steps=10,
    logging_steps=5,
    save_strategy="no",
    report_to="none",
)
```
**Expected outcome:** Loss should drop from ~6.0 to <2.0 within 20 epochs on 400 events; the ~6.0 starting point is roughly the uniform-prediction baseline, ln of the ~500-token vocabulary ≈ 6.2. If it does, the pipeline is validated. If it doesn't, there's a bug in tokenization, packing, or model architecture.
**Validation checks after training** (a sketch of the token-level spot checks follows this list):
- [ ] Loss decreased monotonically (overfitting expected and desired)
- [ ] No NaN/inf in loss or gradients
- [ ] Token distribution is reasonable (no >50% UNK tokens)
- [ ] `builder.tokenize_event()` produces expected token strings for sample events
- [ ] `hf_tokenizer.decode()` on model output produces recognizable token strings
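A minimal sketch of the last two checks, assuming `builder.tokenize_event()` returns a list of token strings and that `hf_tokenizer` behaves like a standard HuggingFace tokenizer (`convert_tokens_to_ids` is an assumption about that interface):
```python
# Spot-check tokenization on one real event.
sample_event = user_sequences[0][0]
tokens = builder.tokenize_event(sample_event)
print(tokens)  # expect sign, magnitude-bucket, calendar, and BPE subword tokens

# Round-trip through the HF tokenizer: token strings -> ids -> decoded string.
ids = hf_tokenizer.convert_tokens_to_ids(tokens)
print(hf_tokenizer.decode(ids))
```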
### 4.2 Phase 3.1: Finance Demo with Sparkov (After Validation)
```bash
# Download from Kaggle
kaggle datasets download kartik2112/fraud-detection -p data/
unzip data/fraud-detection.zip -d data/sparkov/
```
```python
import pandas as pd
from datetime import datetime

df = pd.read_csv("data/sparkov/fraudTrain.csv")

def sparkov_to_event(row):
    return {
        "amount_sign": row["amt"],  # always positive in Sparkov; sign from context
        "amount": row["amt"],
        "timestamp": datetime.strptime(row["trans_date_trans_time"], "%Y-%m-%d %H:%M:%S"),
        "description": f"{row['merchant']} {row['category']}",
    }

# Group by cardholder
user_sequences = []
labels = []  # for fine-tuning: any fraud in user's history?
for cc_num, group in df.sort_values("trans_date_trans_time").groupby("cc_num"):
    events = [sparkov_to_event(row) for _, row in group.iterrows()]
    user_sequences.append(events)
    labels.append(int(group["is_fraud"].any()))

# Pre-train 24M model on 1K users × 1.3K events
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)
# ... pretrain_domain_model(model, ..., bf16=True)  # requires GPU

# Fine-tune for fraud detection
# ... finetune_domain_model(fusion_model, ft_dataset, ...)
```
**Hardware:** a10g-large (24GB VRAM), ~2-3 hours for 24M model on 1.3M events.
### 4.3 Phase 3.2: E-Commerce Demo with REES46
```python
from datasets import load_dataset
ds = load_dataset(
    "kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019",
    split="train",
)
# Filter to purchases and subsample users
purchases = ds.filter(lambda x: x["event_type"] == "purchase")
# Group by user_id, take top 100K users by event count
# ... build ECOMMERCE_SCHEMA tokenizer, train 24M model
```
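A minimal sketch of the grouping and subsampling step noted in the comments above, assuming the filtered purchases fit in memory as a pandas DataFrame (for the full 42M events, the streaming route in §5 may be preferable):
```python
import pandas as pd

# Keep the 100K most active purchasers and order each user's events by time.
pdf = purchases.to_pandas()
top_users = pdf["user_id"].value_counts().head(100_000).index
pdf = pdf[pdf["user_id"].isin(top_users)].sort_values("event_time")

user_groups = {uid: group for uid, group in pdf.groupby("user_id")}
print(f"Users kept: {len(user_groups)}")
```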
### 4.4 Phase 3.3: Healthcare Demo with Synthea
```python
from huggingface_hub import hf_hub_download
import pandas as pd
encounters = pd.read_parquet(hf_hub_download(
    "richardyoung/synthea-575k-patients",
    "data/encounters.parquet",
    repo_type="dataset",
))
# Group by PATIENT, sort by Start date
# Map: Start→timestamp, Base_Cost→amount, DESCRIPTION→description
# ... build HEALTHCARE_SCHEMA tokenizer, train 24M model
```
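A minimal sketch of the grouping and field mapping described in the comments above; the event keys follow the Start→timestamp, Base_Cost→amount, DESCRIPTION→description mapping and are illustrative, since the exact `HEALTHCARE_SCHEMA` keys come from the library:
```python
import pandas as pd

def encounter_to_event(row):
    """Map one encounters.parquet row to an event dict (illustrative keys)."""
    return {
        "timestamp": pd.to_datetime(row["Start"]),
        "amount": row["Base_Cost"],
        "description": row["DESCRIPTION"],
    }

# One chronologically ordered event sequence per patient.
patient_sequences = [
    [encounter_to_event(row) for _, row in group.sort_values("Start").iterrows()]
    for _, group in encounters.groupby("PATIENT")
]
print(f"Patients: {len(patient_sequences)}")
```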
---
## 5. Risks and Mitigations
| Risk | Impact | Mitigation |
|------|--------|-----------|
| mindweave too small to catch scale bugs | Bugs only surface at 1M+ events | Run Sparkov immediately after validation passes |
| Sparkov has no negative amounts | `SignTokenizer` always produces `[AMT_SIGN_POS]` | Concatenate merchant+category as description; test sign tokenizer separately on mindweave (which has signed amounts) |
| REES46 2GB download slow | Delays e-commerce demo | Stream via HF datasets `streaming=True` or subsample first |
| Synthea encounters lack numerical values | `MagnitudeBucketTokenizer` underutilized | Use `Base_Cost` for cost binning; join with `observations.parquet` for lab values |
| Model overfits on 400 events | Expected, not a bug | Overfitting during the Phase 3.0 validation run = pipeline works. Move to Sparkov for real training. |
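For the REES46 download risk, a minimal streaming sketch; `streaming=True` is a standard `datasets` option, and the sample size is illustrative:
```python
from itertools import islice
from datasets import load_dataset

# Stream REES46 instead of downloading the full 2GB up front.
ds = load_dataset(
    "kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019",
    split="train",
    streaming=True,
)
purchases = (row for row in ds if row["event_type"] == "purchase")
sample = list(islice(purchases, 1_000))  # small sample for schema checks
```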
---
*This ADR will be updated with results from each phase as demos are completed.*