# ADR-002: Dataset Selection for Phase 3 Domain Demos

> **Status:** Accepted
> **Date:** April 30, 2026
> **Decision:** Start with `mindweave/bank-transactions-us` for pipeline validation, then scale to Sparkov (finance), REES46 (e-commerce), and Synthea (healthcare)

---

## 1. Context

Phase 2 delivered a complete library (v0.4.0, 139 tests) with tokenizers, models, and training pipelines, all validated on synthetic data generated in test fixtures. Phase 3 requires running the full pipeline on **real public datasets** to produce trained models and benchmark against baselines.

We need datasets for three domains matching our predefined schemas:

| Schema | Required Fields | Minimum Scale for Demo |
|--------|----------------|------------------------|
| `FINANCE_SCHEMA` | timestamp, signed amount, text description | 100+ users × 10+ events |
| `ECOMMERCE_SCHEMA` | timestamp, price, event type, category, product text | 1000+ users × 10+ events |
| `HEALTHCARE_SCHEMA` | timestamp, event type, severity, cost, clinical text | 1000+ patients × 10+ events |

The strategy: **start small to validate the pipeline end-to-end, then scale to production-sized datasets.**

---

## 2. Dataset Analysis

### 2.1 Candidates Evaluated

We evaluated 8 datasets across 3 domains. Each was checked for: HuggingFace Hub availability, schema compatibility with our field types, scale (users × events), licensing, and accessibility (instant download vs. gated/external).

#### Finance Candidates

| Dataset | Source | Users | Events | Schema Fit | Access |
|---------|--------|-------|--------|-----------|--------|
| **mindweave/bank-transactions-us** | HF Hub | ~20 accounts | ~400 | ✅ Perfect | Instant |
| **Sparkov CC Fraud** (kartik2112) | Kaggle | ~1,000 | 1.3M | ✅ Excellent | Kaggle account |
| IBM AML Transactions | GitHub | Thousands | 550K–55M | ✅ Good | Direct download |

**mindweave/bank-transactions-us**, inspected in detail:
- Config `bank_transactions`: 11 columns, 0.4MB Parquet
- `transaction_date` (string, `"2024-01-04"`) → maps to `FINANCE_SCHEMA.timestamp` ✅
- `amount` (float64, signed: `-17584.14` for payroll, `+1413.94` for deposits) → maps to `amount` and `amount_sign` ✅
- `description` (string, `"Payroll - net wages"`, `"Customer payment received"`) → maps to `description` ✅
- `source_module` (string, `"payroll"`, `"sales"`, `"purchases"`) → bonus categorical field ✅
- `transaction_type` (string, `"withdrawal"`, `"deposit"`) → redundant with sign, but useful for validation
- `bank_account_id` (UUID) → user grouping key ✅
- Linked `bank_accounts` table has company/bank metadata for potential tabular features

**Scale limitation:** ~20 accounts × ~20 transactions each = ~400 total. This is too small for meaningful pre-training, but **schema-perfect for pipeline validation**: every field maps directly to `FINANCE_SCHEMA` without transformation. The model will overfit immediately, but that's exactly what confirms the pipeline works.

**Sparkov CC Fraud**, the scale-up target:
- ~1,000 cardholders × ~1,300 transactions each = 1.3M total events
- `trans_date_trans_time`, `amt`, `merchant`, `category`, `cc_num`, `is_fraud`
- CC0 license (public domain)
- `is_fraud` provides a natural fine-tuning label (binary classification)
- Requires Kaggle account for download (free, instant)

#### E-Commerce Candidates

| Dataset | Source | Users | Events | Schema Fit | Access |
|---------|--------|-------|--------|-----------|--------|
| **REES46 Behavioral** | HF Hub | Millions | 42M | ✅ Perfect | Instant |
| Amazon Reviews 2023 | HF Hub (gated) | 33M | 571M | ⚠️ No price in reviews | HF token |

**REES46** (`kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019`), inspected:
- 2GB Parquet (10 files), fully accessible on HF Hub
- `event_time` (ISO 8601), `event_type` (`"view"`, `"cart"`, `"purchase"`), `product_id`, `category_code` (`"electronics.smartphone"`), `brand`, `price` (float64), `user_id`
- Every `ECOMMERCE_SCHEMA` field maps directly
- For pre-training: filter to `purchase` events for clean transaction sequences
- Scale: can subsample to 10K–100K users for demo, millions available for production

#### Healthcare Candidates

| Dataset | Source | Patients | Events | Schema Fit | Access |
|---------|--------|----------|--------|-----------|--------|
| **Synthea 575K** | HF Hub | 575K | Millions | ✅ Excellent | Instant |
| Synthea Direct | synthea.mitre.org | 100K–1M | Millions | ✅ Same | Direct download |
| MIMIC-IV | PhysioNet | 40K+ ICU | Millions | ✅ Gold standard | 1-2 day DUA |

**Synthea 575K** (`richardyoung/synthea-575k-patients`), inspected:
- 136GB total across 18 Parquet files (allergies, conditions, encounters, medications, observations, procedures, etc.)
- Default config shows the allergies table: `START` (date), `PATIENT` (UUID), `DESCRIPTION`, `TYPE`, `CATEGORY`, `SEVERITY1`
- For richer sequences: load `encounters.parquet` (5.1GB) with `Start`, `DESCRIPTION`, `Base_Cost`, `REASONDESCRIPTION`
- Fully synthetic: no IRB, no access restrictions, MIT/Apache 2.0 license

### 2.2 Schema Mapping Verification

Direct field mapping from `mindweave/bank-transactions-us` to `FINANCE_SCHEMA`:

```
Dataset Column       →  FINANCE_SCHEMA Field    →  Tokenizer
─────────────────────────────────────────────────────────────────────────────
amount (sign)        →  amount_sign             →  SignTokenizer (2 tokens)
amount (magnitude)   →  amount                  →  MagnitudeBucketTokenizer (21 bins)
transaction_date     →  timestamp               →  CalendarTokenizer (month/dow/dom/hour)
description          →  description             →  BPE subword tokenizer
─────────────────────────────────────────────────────────────────────────────
bank_account_id      →  (user grouping key)     →  group-by for user sequences
source_module        →  (bonus: not in schema)  →  could extend schema
transaction_type     →  (redundant with sign)   →  validation check
```

**Zero transformation needed.** The `amount` field is already signed (negative = withdrawal, positive = deposit). The `description` field contains natural text suitable for BPE. The `transaction_date` is a standard date string. This is the cleanest possible mapping to our schema.
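
These claims are easy to spot-check on the loaded data. A minimal sketch, assuming the `df` from §4.1 Step 1; the assertions simply mirror the mapping table above:

```python
import pandas as pd

# Signed amounts: both withdrawals (negative) and deposits (positive) must appear
assert (df["amount"] < 0).any() and (df["amount"] > 0).any()

# transaction_date parses as a plain ISO date string
parsed = pd.to_datetime(df["transaction_date"], format="%Y-%m-%d", errors="coerce")
assert parsed.notna().all()

# Descriptions are non-empty natural text for the BPE tokenizer
assert df["description"].str.strip().str.len().gt(0).all()

# transaction_type should agree with the sign of amount (redundancy check)
sign_type = df["amount"].lt(0).map({True: "withdrawal", False: "deposit"})
print((sign_type == df["transaction_type"]).mean())  # expect 1.0 if fully consistent
```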

---

## 3. Decision

### Phased approach: validate small → scale up

| Phase | Dataset | Purpose | Scale |
|-------|---------|---------|-------|
| **3.0: Pipeline Validation** | `mindweave/bank-transactions-us` | Verify end-to-end: load → tokenize → pack → train → loss decreases | ~400 events, ~20 accounts |
| **3.1: Finance Demo** | Sparkov CC Fraud (Kaggle) | Train 24M model, fine-tune fraud detection, benchmark vs LightGBM | 1.3M events, 1K users |
| **3.2: E-Commerce Demo** | REES46 (HF Hub) | Train 24M model, next-purchase prediction | 42M events, subsample to 100K users |
| **3.3: Healthcare Demo** | Synthea 575K (HF Hub) | Train 24M model, condition prediction | 575K patients, subsample encounters |

### Rationale

1. **Start with mindweave because it's schema-perfect and instant.** No data cleaning, no field renaming, no Kaggle credentials needed. The pipeline either works or it doesn't; this dataset tells us in minutes.

2. **The model will overfit on 400 events; that's the point.** If loss doesn't decrease on 400 events, the pipeline is broken. If it does, the pipeline works and we can scale with confidence.

3. **Sparkov is the real finance demo.** 1,000 users × 1,300 events is the exact scale where a 24M-parameter model should learn meaningful patterns. The `is_fraud` label enables a direct comparison with LightGBM on the same data.

4. **REES46 is the flagship demo.** Millions of events, real behavioral data, perfect schema fit, instant HF download. This is the dataset that demonstrates domainTokenizer's value proposition most compellingly.

5. **Synthea is the healthcare proof point.** Fully synthetic (no access barriers), massive scale, multiple event types. Validates that the domain tokenizer approach generalizes beyond finance and e-commerce.

---

## 4. Implementation

### 4.1 Phase 3.0: Pipeline Validation with mindweave

**Goal:** Run the complete pipeline end-to-end on real data, verify loss decreases, confirm no bugs.

**Step 1: Load and explore the data**

```python
from datasets import load_dataset
import pandas as pd

# Load bank transactions
ds = load_dataset("mindweave/bank-transactions-us", "bank_transactions", split="train")
df = ds.to_pandas()

# Basic stats
print(f"Total transactions: {len(df)}")
print(f"Unique accounts: {df['bank_account_id'].nunique()}")
print(f"Date range: {df['transaction_date'].min()} to {df['transaction_date'].max()}")
print(f"Amount range: {df['amount'].min():.2f} to {df['amount'].max():.2f}")
print(f"Descriptions: {df['description'].nunique()} unique")
print(f"Source modules: {df['source_module'].value_counts().to_dict()}")
```

**Step 2: Convert to domainTokenizer event format**

```python
from datetime import datetime

def row_to_event(row):
    """Convert a DataFrame row to a FINANCE_SCHEMA event dict."""
    return {
        "amount_sign": row["amount"],    # SignTokenizer reads the sign
        "amount": row["amount"],         # MagnitudeBucketTokenizer reads abs value
        "timestamp": datetime.strptime(row["transaction_date"], "%Y-%m-%d"),
        "description": row["description"],  # BPE tokenizer
    }

# Group by account → list of event sequences
user_sequences = []
for account_id, group in df.sort_values("transaction_date").groupby("bank_account_id"):
    events = [row_to_event(row) for _, row in group.iterrows()]
    user_sequences.append(events)

print(f"Users: {len(user_sequences)}")
print(f"Events per user: {[len(s) for s in user_sequences]}")
```

**Step 3: Build tokenizer, prepare data, train**

```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# Build tokenizer
all_events = [e for seq in user_sequences for e in seq]
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)
hf_tokenizer = builder.build(
    text_corpus=[e["description"] for e in all_events],
    bpe_vocab_size=500,  # small vocab for small dataset
)

# Prepare packed dataset
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=128)
print(f"Packed blocks: {len(dataset)} × 128 tokens")

# Create tiny model (for validation, not real training)
config = DomainTransformerConfig(
    vocab_size=hf_tokenizer.vocab_size,
    hidden_size=128, num_hidden_layers=4, num_attention_heads=4,
    intermediate_size=512,
)
model = DomainTransformerForCausalLM(config)
print(f"Model params: {sum(p.numel() for p in model.parameters()):,}")

# Train: expect loss to decrease rapidly (overfitting on small data = pipeline works)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    num_epochs=20,
    per_device_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    warmup_steps=10,
    logging_steps=5,
    save_strategy="no",
    report_to="none",
)
```

**Expected outcome:** Loss should drop from ~6.0 to <2.0 within 20 epochs on 400 events. If it does, the pipeline is validated. If it doesn't, there's a bug in tokenization, packing, or model architecture.

**Validation checks after training** (a sketch of the token-level checks follows the list):
- [ ] Loss decreased monotonically (overfitting expected and desired)
- [ ] No NaN/inf in loss or gradients
- [ ] Token distribution is reasonable (no >50% UNK tokens)
- [ ] `builder.tokenize_event()` produces expected token strings for sample events
- [ ] `hf_tokenizer.decode()` on model output produces recognizable token strings
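
A minimal sketch of the last three checks, assuming `builder`, `hf_tokenizer`, `model`, and `all_events` from Step 3, that `builder.tokenize_event()` returns a list of token strings, and that the model supports HF-style `generate()` (unconfirmed here):

```python
import torch

# 1. UNK fraction over the whole corpus should be near zero, far below 50%
all_tokens = [t for e in all_events for t in builder.tokenize_event(e)]
ids = hf_tokenizer.convert_tokens_to_ids(all_tokens)
unk_frac = sum(i == hf_tokenizer.unk_token_id for i in ids) / len(ids)
print(f"UNK fraction: {unk_frac:.1%}")

# 2. Spot-check tokenization of a sample event
print(builder.tokenize_event(all_events[0]))

# 3. Decode a short greedy continuation and eyeball the token strings
model.eval()
with torch.no_grad():
    prompt = torch.tensor([ids[:16]])                # first 16 corpus tokens as prompt
    out = model.generate(prompt, max_new_tokens=16)  # assumes HF generate() support
print(hf_tokenizer.decode(out[0]))
```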

### 4.2 Phase 3.1: Finance Demo with Sparkov (After Validation)

```bash
# Download from Kaggle
kaggle datasets download kartik2112/fraud-detection -p data/
unzip data/fraud-detection.zip -d data/sparkov/
```

```python
import pandas as pd
from datetime import datetime

df = pd.read_csv("data/sparkov/fraudTrain.csv")

def sparkov_to_event(row):
    return {
        "amount_sign": row["amt"],  # always positive in Sparkov; sign from context
        "amount": row["amt"],
        "timestamp": datetime.strptime(row["trans_date_trans_time"], "%Y-%m-%d %H:%M:%S"),
        "description": f"{row['merchant']} {row['category']}",
    }

# Group by cardholder
user_sequences = []
labels = []  # for fine-tuning: any fraud in user's history?
for cc_num, group in df.sort_values("trans_date_trans_time").groupby("cc_num"):
    events = [sparkov_to_event(row) for _, row in group.iterrows()]
    user_sequences.append(events)
    labels.append(int(group["is_fraud"].any()))

# Pre-train 24M model on 1K users × 1.3K events
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)
# ... pretrain_domain_model(model, ..., bf16=True)  # requires GPU

# Fine-tune for fraud detection
# ... finetune_domain_model(fusion_model, ft_dataset, ...)
```

**Hardware:** a10g-large (24GB VRAM), ~2-3 hours for 24M model on 1.3M events.
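
Section 3 calls for a LightGBM comparison on the same data. A minimal baseline sketch, assuming simple per-user aggregates of `amt` as features (illustrative, not the final benchmark feature set); pandas `groupby` sorts keys, so the aggregate rows line up with the `labels` built above:

```python
import lightgbm as lgb
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Illustrative per-user features; groupby sorts by cc_num, matching `labels` order
agg = df.groupby("cc_num")["amt"].agg(["count", "mean", "std", "max"]).fillna(0)
X, y = agg.to_numpy(), np.array(labels)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
clf = lgb.LGBMClassifier(n_estimators=200)
clf.fit(X_tr, y_tr)
print("Baseline AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```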

### 4.3 Phase 3.2: E-Commerce Demo with REES46

```python
from datasets import load_dataset

ds = load_dataset(
    "kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019",
    split="train",
)

# Filter to purchases and subsample users
purchases = ds.filter(lambda x: x["event_type"] == "purchase")
# Group by user_id, take top 100K users by event count (see the sketch below)
# ... build ECOMMERCE_SCHEMA tokenizer, train 24M model
```
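
A minimal sketch of the subsampling step, assuming the filtered purchase split fits in memory as a DataFrame (for the full 42M rows, prefer the streaming mitigation in §5):

```python
import pandas as pd

# Assumes the filtered `purchases` dataset fits in memory as a DataFrame
pdf = purchases.to_pandas()

# Keep the 100K most active users by purchase count
top_users = pdf["user_id"].value_counts().nlargest(100_000).index
pdf = pdf[pdf["user_id"].isin(top_users)].sort_values(["user_id", "event_time"])

# Per-user groups; each row still needs the row → event conversion as in §4.1 Step 2
user_groups = [group for _, group in pdf.groupby("user_id")]
print(f"Users: {len(user_groups)}, events: {len(pdf)}")
```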

### 4.4 Phase 3.3: Healthcare Demo with Synthea

```python
from huggingface_hub import hf_hub_download
import pandas as pd

encounters = pd.read_parquet(hf_hub_download(
    "richardyoung/synthea-575k-patients",
    "data/encounters.parquet",
    repo_type="dataset",
))

# Group by PATIENT, sort by Start date
# Map: Start → timestamp, Base_Cost → amount, DESCRIPTION → description (sketch below)
# ... build HEALTHCARE_SCHEMA tokenizer, train 24M model
```
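
A minimal sketch of that grouping and mapping, assuming the column names match the §2.1 inspection and that `HEALTHCARE_SCHEMA` events take `timestamp`/`amount`/`description` keys as the comment above suggests; `iterrows()` is fine for a subsample but should be vectorized for the full 5.1GB table:

```python
import pandas as pd

def encounter_to_event(row):
    # Column names (Start, Base_Cost, DESCRIPTION, PATIENT) per the §2.1 inspection;
    # the event keys mirror the mapping comment above and are assumptions here.
    return {
        "timestamp": pd.to_datetime(row["Start"]),  # permissive: date or datetime
        "amount": float(row["Base_Cost"]),
        "description": row["DESCRIPTION"],
    }

user_sequences = []
for patient_id, group in encounters.sort_values("Start").groupby("PATIENT"):
    user_sequences.append([encounter_to_event(r) for _, r in group.iterrows()])
print(f"Patients: {len(user_sequences)}")
```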

---

## 5. Risks and Mitigations

| Risk | Impact | Mitigation |
|------|--------|-----------|
| mindweave too small to catch scale bugs | Bugs only surface at 1M+ events | Run Sparkov immediately after validation passes |
| Sparkov has no negative amounts | `SignTokenizer` always produces `[AMT_SIGN_POS]` | Concatenate merchant+category as description; test the sign tokenizer separately on mindweave (which has signed amounts) |
| REES46 2GB download slow | Delays e-commerce demo | Stream via HF datasets `streaming=True` (see the sketch after this table) or subsample first |
| Synthea encounters lack numerical values | `MagnitudeBucketTokenizer` underutilized | Use `Base_Cost` for cost binning; join with `observations.parquet` for lab values |
| Model overfits on 400 events | Expected; not a bug | Overfitting on the pipeline-validation dataset = pipeline works. Move to Sparkov for real training. |
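
A minimal sketch of the streaming mitigation; `streaming=True`, `IterableDataset.filter`, and `.take()` are standard `datasets` API, while the 1M-event cap is illustrative:

```python
from collections import defaultdict
from datasets import load_dataset

# Stream instead of downloading the full 2GB up front; filter on the fly
stream = load_dataset(
    "kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019",
    split="train",
    streaming=True,
)
purchases = stream.filter(lambda x: x["event_type"] == "purchase").take(1_000_000)

by_user = defaultdict(list)
for event in purchases:
    by_user[event["user_id"]].append(event)  # accumulate per-user sequences
print(f"Users collected: {len(by_user)}")
```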

---

*This ADR will be updated with results from each phase as demos are completed.*
|