Add 01_finance_pretrain.ipynb — Phase 3.1 notebook for pre-training on 5M Nigerian financial transactions
29 cells (18 code + 11 markdown):
- Loads electricsheepafrica/Nigerian-Financial-Transactions dataset from HF Hub
- Data profiling with matplotlib visualizations
- Converts to FINANCE_SCHEMA, groups by sender_account
- Builds hybrid domain tokenizer (97 special + BPE)
- Packs sequences, trains 24M DomainTransformer
- Loss curves, next-token predictions, t-SNE user embeddings
- Saves artifacts for fine-tuning notebook
Auto-detects GPU (L4/CPU) and adjusts batch size/epochs accordingly.
notebooks/01_finance_pretrain.ipynb
ADDED
@@ -0,0 +1,581 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 01 — Finance Pre-Training: Domain Tokenizer on Real Financial Transactions\n",
    "\n",
    "**Goal:** Pre-train a 24M-parameter DomainTransformer on 5M synthetic Nigerian financial transactions, demonstrating that the domain-tokenizer pipeline works at scale on a realistic, full-size dataset.\n",
    "\n",
    "**Dataset:** [electricsheepafrica/Nigerian-Financial-Transactions-and-Fraud-Detection-Dataset](https://huggingface.co/datasets/electricsheepafrica/Nigerian-Financial-Transactions-and-Fraud-Detection-Dataset) — 5M transactions, 45 features, fraud labels.\n",
    "\n",
    "**Pipeline:**\n",
    "1. Load data from HuggingFace Hub\n",
    "2. Explore and profile the dataset\n",
    "3. Convert to FINANCE_SCHEMA events, group by user\n",
    "4. Build domain tokenizer (special tokens + BPE)\n",
    "5. Pack into CLM training dataset\n",
    "6. Pre-train 24M DomainTransformer (NoPE, GPT-style)\n",
    "7. Inspect learned representations\n",
    "\n",
    "**Hardware:** L4 GPU (24GB VRAM) — the 24M model fits comfortably.\n",
    "\n",
    "**Reference:** Nubank nuFormer ([arXiv:2507.23267](https://arxiv.org/abs/2507.23267)) — same architecture pattern."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Uncomment and run once to install dependencies:\n",
    "# !pip install datasets transformers torch accelerate tokenizers numpy pandas matplotlib scikit-learn"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "import time\n",
    "import pickle\n",
    "from datetime import datetime\n",
    "from collections import Counter\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import torch\n",
    "from datasets import load_dataset\n",
    "\n",
    "# If running from a cloned repo, add src/ to the path\n",
    "import sys, os\n",
    "if os.path.exists('../src'):\n",
    "    sys.path.insert(0, '../src')\n",
    "elif os.path.exists('src'):\n",
    "    sys.path.insert(0, 'src')\n",
    "\n",
    "from domain_tokenizer import (\n",
    "    DomainTokenizerBuilder, DomainTransformerConfig,\n",
    "    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,\n",
    ")\n",
    "from domain_tokenizer.schemas import FINANCE_SCHEMA\n",
    "\n",
    "logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')\n",
    "print(f'torch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')\n",
    "if torch.cuda.is_available():\n",
    "    print(f'GPU: {torch.cuda.get_device_name(0)}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 — Load Dataset from HuggingFace Hub\n",
    "\n",
    "5M synthetic Nigerian fintech transactions with 45 features including merchant categories, device info, risk scores, and fraud labels."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "ds = load_dataset(\n",
    "    'electricsheepafrica/Nigerian-Financial-Transactions-and-Fraud-Detection-Dataset',\n",
    "    split='train',\n",
    ")\n",
    "print(f'Loaded: {len(ds):,} transactions, {len(ds.column_names)} columns')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = ds.to_pandas()\n",
    "print(f'Shape: {df.shape}')\n",
    "df.head(3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 — Data Profiling\n",
    "\n",
    "Understanding what we're tokenizing: user counts, amount distributions, transaction types, merchant categories."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Unique senders (users): {df['sender_account'].nunique():,}\")\n",
    "print(f\"Timestamp range: {df['timestamp'].min()} to {df['timestamp'].max()}\")\n",
    "print(f\"Amount range: {df['amount_ngn'].min():,.2f} to {df['amount_ngn'].max():,.2f} NGN\")\n",
    "print(f\"Amount mean: {df['amount_ngn'].mean():,.2f}, median: {df['amount_ngn'].median():,.2f}\")\n",
    "print(f\"\\nTransaction types:\\n{df['transaction_type'].value_counts().to_string()}\")\n",
    "print(f\"\\nMerchant categories (top 15):\\n{df['merchant_category'].value_counts().head(15).to_string()}\")\n",
    "print(f\"\\nFraud rate: {df['is_fraud'].mean()*100:.2f}%\")\n",
    "print(f\"\\nPayment channels:\\n{df['payment_channel'].value_counts().to_string()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Events per user distribution\n",
    "events_per_user = df.groupby('sender_account').size()\n",
    "print(f\"Events per user: min={events_per_user.min()}, max={events_per_user.max()}, \"\n",
    "      f\"mean={events_per_user.mean():.1f}, median={events_per_user.median():.1f}\")\n",
    "print(f\"Users with 5+ events: {(events_per_user >= 5).sum():,}\")\n",
    "print(f\"Users with 10+ events: {(events_per_user >= 10).sum():,}\")\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
    "\n",
    "axes[0].hist(np.log10(df['amount_ngn'].clip(lower=1)), bins=50, edgecolor='black', alpha=0.7)\n",
    "axes[0].set_xlabel('log10(Amount NGN)')\n",
    "axes[0].set_ylabel('Count')\n",
    "axes[0].set_title('Amount Distribution (log scale)')\n",
    "\n",
    "axes[1].hist(events_per_user.clip(upper=50), bins=50, edgecolor='black', alpha=0.7)\n",
    "axes[1].set_xlabel('Events per User')\n",
    "axes[1].set_ylabel('Count')\n",
    "axes[1].set_title('Events per User')\n",
    "\n",
    "df['transaction_type'].value_counts().head(10).plot(kind='barh', ax=axes[2])\n",
    "axes[2].set_xlabel('Count')\n",
    "axes[2].set_title('Transaction Types')\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 — Convert to FINANCE_SCHEMA Events\n",
    "\n",
    "Mapping:\n",
    "- `timestamp` → CalendarTokenizer (month, day-of-week, day-of-month, hour)\n",
    "- `amount_ngn` → SignTokenizer (credit/debit) + MagnitudeBucketTokenizer (21 quantile bins; see the binning sketch below)\n",
    "- `merchant_category` + `transaction_type` → BPE text description\n",
    "- `sender_account` → user grouping key"
   ]
  },
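  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Aside:* a minimal sketch of the quantile binning a MagnitudeBucketTokenizer could apply to `amount_ngn`: 21 quantile bins, each mapped to a special token. The real implementation lives in `src/domain_tokenizer`; the helper and the `[AMT_BIN_i]` token names below are illustrative assumptions, not the library's API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: quantile-bin amounts into 21 buckets, one special token per bucket.\n",
    "amounts = df['amount_ngn'].abs().to_numpy()\n",
    "edges = np.quantile(amounts, np.linspace(0, 1, 22))  # 22 edges -> 21 bins\n",
    "\n",
    "def amount_to_bucket_token(amount):\n",
    "    # searchsorted maps an amount to its quantile bin index in [0, 20]\n",
    "    idx = int(np.clip(np.searchsorted(edges, abs(amount), side='right') - 1, 0, 20))\n",
    "    return f'[AMT_BIN_{idx}]'\n",
    "\n",
    "for a in [50, 1_000, 25_000, 2_000_000]:\n",
    "    print(f'{a:>12,} NGN -> {amount_to_bucket_token(a)}')"
   ]
  },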
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def row_to_event(row):\n",
    "    \"\"\"Convert a DataFrame row to a FINANCE_SCHEMA event dict.\"\"\"\n",
    "    dt = datetime.strptime(row['timestamp'][:19], '%Y-%m-%d %H:%M:%S')\n",
    "    desc = f\"{row['merchant_category']} {row['transaction_type']}\"\n",
    "    amt = row['amount_ngn']\n",
    "    if row['transaction_type'] == 'withdrawal':\n",
    "        amt = -abs(amt)\n",
    "    return {\n",
    "        'amount_sign': amt,\n",
    "        'amount': amt,\n",
    "        'timestamp': dt,\n",
    "        'description': desc,\n",
    "    }\n",
    "\n",
    "sample = row_to_event(df.iloc[0])\n",
    "print(f'Sample event: {sample}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "MIN_EVENTS = 5\n",
    "MAX_EVENTS = 500  # cap to prevent very long sequences from dominating\n",
    "\n",
    "user_sequences = []\n",
    "user_ids = []\n",
    "user_fraud_labels = []\n",
    "\n",
    "for sender, group in df.sort_values('timestamp').groupby('sender_account'):\n",
    "    if len(group) < MIN_EVENTS:\n",
    "        continue\n",
    "    events = [row_to_event(row) for _, row in group.head(MAX_EVENTS).iterrows()]\n",
    "    user_sequences.append(events)\n",
    "    user_ids.append(sender)\n",
    "    user_fraud_labels.append(int(group['is_fraud'].any()))\n",
    "\n",
    "print(f'Users with {MIN_EVENTS}+ events: {len(user_sequences):,}')\n",
    "print(f'Total events: {sum(len(s) for s in user_sequences):,}')\n",
    "print(f'Events per user: min={min(len(s) for s in user_sequences)}, '\n",
    "      f'max={max(len(s) for s in user_sequences)}, '\n",
    "      f'mean={np.mean([len(s) for s in user_sequences]):.1f}')\n",
    "print(f'Fraud rate (user-level): {np.mean(user_fraud_labels)*100:.2f}%')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 — Build Domain Tokenizer\n",
    "\n",
    "Hybrid vocabulary: 97 special tokens (sign + amount bins + calendar) + BPE for descriptions,\n",
    "following Nubank nuFormer's tokenization approach. A sketch of the underlying pattern follows."
   ]
  },
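  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Aside:* roughly what a hybrid build can look like with the `tokenizers` library, assuming `DomainTokenizerBuilder.build` wraps it: domain special tokens are registered up front so BPE never splits them, then BPE is trained on the description corpus. `demo_tok` and the token names here are illustrative only; `builder.build` below is the real entry point."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: special tokens + BPE in one vocabulary.\n",
    "from tokenizers import Tokenizer, models, pre_tokenizers, trainers\n",
    "\n",
    "special = ['[UNK]', '[PAD]', '[EVENT]'] + [f'[AMT_BIN_{i}]' for i in range(21)]\n",
    "demo_tok = Tokenizer(models.BPE(unk_token='[UNK]'))\n",
    "demo_tok.pre_tokenizer = pre_tokenizers.Whitespace()\n",
    "trainer = trainers.BpeTrainer(vocab_size=2000, special_tokens=special)\n",
    "demo_tok.train_from_iterator(['grocery purchase', 'airtime topup'], trainer)\n",
    "print(demo_tok.encode('[AMT_BIN_7] grocery purchase').tokens)"
   ]
  },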
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_events = [e for seq in user_sequences for e in seq]\n",
    "print(f'Total events for fitting: {len(all_events):,}')\n",
    "\n",
    "builder = DomainTokenizerBuilder(FINANCE_SCHEMA)\n",
    "builder.fit(all_events)\n",
    "\n",
    "text_corpus = [e['description'] for e in all_events]\n",
    "unique_descs = sorted(set(text_corpus))\n",
    "print(f'Unique descriptions: {len(unique_descs)}')\n",
    "for d in unique_descs[:10]:\n",
    "    print(f\"  '{d}'\")\n",
    "if len(unique_descs) > 10:\n",
    "    print(f'  ... and {len(unique_descs) - 10} more')\n",
    "\n",
    "hf_tokenizer = builder.build(\n",
    "    text_corpus=text_corpus,\n",
    "    bpe_vocab_size=2000,\n",
    ")\n",
    "\n",
    "print(f'\\nVocab size: {hf_tokenizer.vocab_size}')\n",
    "print(f'Stats: {builder.get_stats()}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Inspect tokenized output\n",
    "print('--- Sample event tokenized ---')\n",
    "sample_tokens = builder.tokenize_event(user_sequences[0][0])\n",
    "for i, t in enumerate(sample_tokens):\n",
    "    print(f'  [{i}] {t}')\n",
    "\n",
    "print('\\n--- First user, first 3 events ---')\n",
    "seq_tokens = builder.tokenize_sequence(user_sequences[0][:3])\n",
    "for i, t in enumerate(seq_tokens):\n",
    "    print(f'  [{i:3d}] {t}')\n",
    "\n",
    "seq_ids = hf_tokenizer(' '.join(seq_tokens), add_special_tokens=False)['input_ids']\n",
    "unk_id = hf_tokenizer.unk_token_id\n",
    "unk_count = sum(1 for i in seq_ids if i == unk_id)\n",
    "print(f'\\nUNK rate: {unk_count}/{len(seq_ids)} ({unk_count/max(len(seq_ids),1)*100:.1f}%)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5 — Pack into CLM Training Dataset\n",
    "\n",
    "Sequence packing (the run_clm.py pattern): concatenate all user sequences, then split the stream into fixed-length blocks,\n",
    "giving 100% token utilization and zero padding waste. A toy sketch of the pattern follows."
   ]
  },
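  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Aside:* a toy sketch of the packing pattern (`prepare_clm_dataset` does the real work): concatenate every tokenized sequence into one stream, drop the ragged tail, and reshape into fixed-size blocks. `pack_blocks` is a hypothetical helper for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Illustrative only: pack a token stream into fixed-length blocks (run_clm.py pattern).\n",
    "def pack_blocks(token_ids, block_size):\n",
    "    n = (len(token_ids) // block_size) * block_size  # drop the ragged tail\n",
    "    return [token_ids[i:i + block_size] for i in range(0, n, block_size)]\n",
    "\n",
    "stream = list(range(23))\n",
    "for block in pack_blocks(stream, 8):\n",
    "    print(block)  # two full blocks of 8; the last 7 tokens are dropped"
   ]
  },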
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "BLOCK_SIZE = 512  # Nubank uses 2048; 512 for faster iteration\n",
    "\n",
    "dataset = prepare_clm_dataset(\n",
    "    user_sequences, builder, hf_tokenizer,\n",
    "    block_size=BLOCK_SIZE,\n",
    ")\n",
    "\n",
    "print(f'Packed: {len(dataset):,} blocks x {BLOCK_SIZE} = {len(dataset)*BLOCK_SIZE:,} training tokens')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Decode a sample block to verify it looks right\n",
    "sample_block = dataset[0]['input_ids']\n",
    "print('Sample block decoded (first 60 tokens):')\n",
    "print(hf_tokenizer.decode(sample_block[:60]))\n",
    "\n",
    "# Token frequency analysis\n",
    "all_ids = [i for row in dataset for i in row['input_ids']]\n",
    "counts = Counter(all_ids)\n",
    "unk_pct = counts.get(unk_id, 0) / len(all_ids) * 100\n",
    "\n",
    "print(f'\\nTotal tokens: {len(all_ids):,}')\n",
    "print(f'Unique token IDs used: {len(counts)}/{hf_tokenizer.vocab_size}')\n",
    "print(f'UNK tokens: {counts.get(unk_id, 0):,} ({unk_pct:.2f}%)')\n",
    "\n",
    "print('\\nTop 20 tokens:')\n",
    "for tid, count in counts.most_common(20):\n",
    "    tok_str = hf_tokenizer.decode([tid]).strip() or '(space/control)'\n",
    "    pct = count / len(all_ids) * 100\n",
    "    print(f'  {tid:5d} {count:8,} ({pct:5.1f}%) {tok_str}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6 — Pre-Train 24M DomainTransformer\n",
    "\n",
    "Architecture (Nubank nuFormer):\n",
    "- GPT-style causal decoder, NoPE (no positional encoding)\n",
    "- 24M preset: d=512, 6 layers, 8 heads, FFN=2048\n",
    "- Cosine LR schedule with warmup, AdamW optimizer\n",
    "- CLM objective (next-token prediction on transaction sequences)\n",
    "\n",
    "A rough parameter-count check follows the list."
   ]
  },
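  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Aside:* a back-of-the-envelope parameter count from the stated preset (d=512, 6 layers, FFN=2048). It lands within a few million of the 24M label; the exact total depends on details the preset owns (tied vs. untied LM head, biases, norms), so treat this as a sanity check, not the source of truth."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Rough estimate: attention (4*d^2 per layer) + FFN (2*d*ffn per layer) + embeddings.\n",
    "d, n_layers, ffn, vocab = 512, 6, 2048, 2097  # vocab approximate: 97 special + ~2000 BPE\n",
    "per_layer = 4 * d * d + 2 * d * ffn\n",
    "total = n_layers * per_layer + vocab * d  # tied LM head assumed; untied adds vocab*d again\n",
    "print(f'per layer: {per_layer/1e6:.2f}M, total: ~{total/1e6:.1f}M parameters')"
   ]
  },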
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "config = DomainTransformerConfig.from_preset('24m', vocab_size=hf_tokenizer.vocab_size)\n",
    "model = DomainTransformerForCausalLM(config)\n",
    "\n",
    "n_params = sum(p.numel() for p in model.parameters())\n",
    "print(f'Model: {n_params:,} parameters')\n",
    "print(f'Config: d={config.hidden_size}, L={config.num_hidden_layers}, H={config.num_attention_heads}')\n",
    "print(f'Memory estimate: ~{n_params * 2 / 1e6:.0f}MB of weights in bf16 (full AdamW training state is roughly 3-4x that)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "USE_GPU = torch.cuda.is_available()\n",
    "\n",
    "trainer = pretrain_domain_model(\n",
    "    model=model,\n",
    "    tokenizer=hf_tokenizer,\n",
    "    train_dataset=dataset,\n",
    "    output_dir='./finance_pretrain_checkpoints',\n",
    "    hub_model_id=None,  # set to 'your-username/finance-domain-24m' to auto-push\n",
    "    num_epochs=3 if USE_GPU else 1,\n",
    "    per_device_batch_size=32 if USE_GPU else 4,\n",
    "    gradient_accumulation_steps=4 if USE_GPU else 1,\n",
    "    learning_rate=3e-4,\n",
    "    warmup_steps=200 if USE_GPU else 10,\n",
    "    logging_steps=50 if USE_GPU else 10,\n",
    "    save_steps=1000 if USE_GPU else 999999,\n",
    "    bf16=USE_GPU,\n",
    "    report_to='none',\n",
    "    seed=42,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 7 — Inspect Training Results"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Loss curve\n",
    "losses = [h['loss'] for h in trainer.state.log_history if 'loss' in h]\n",
    "\n",
    "print(f'Steps: {trainer.state.global_step:,}')\n",
    "print(f'Loss: {losses[0]:.4f} -> {losses[-1]:.4f} ({(1-losses[-1]/losses[0])*100:.1f}% reduction)')\n",
    "print(f'Min loss: {min(losses):.4f}')\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 5))\n",
    "ax.plot(losses, linewidth=0.5, alpha=0.5, label='Per-step')\n",
    "window = max(len(losses) // 50, 1)\n",
    "if len(losses) > window:\n",
    "    smoothed = pd.Series(losses).rolling(window=window, min_periods=1).mean()\n",
    "    ax.plot(smoothed, linewidth=2, color='red', label=f'Smoothed (w={window})')\n",
    "ax.set_xlabel('Step')\n",
    "ax.set_ylabel('Loss')\n",
    "ax.set_title('Pre-Training Loss Curve')\n",
    "ax.legend()\n",
    "ax.grid(True, alpha=0.3)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Next-token prediction test\n",
    "model.eval()\n",
    "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
    "model = model.to(device)\n",
    "\n",
    "test_tokens = builder.tokenize_sequence(user_sequences[0][:3])\n",
    "test_ids = hf_tokenizer(' '.join(test_tokens), return_tensors='pt', add_special_tokens=False)['input_ids'].to(device)\n",
    "\n",
    "with torch.no_grad():\n",
    "    logits = model(input_ids=test_ids).logits\n",
    "    top5 = torch.topk(logits[0, -1, :], 5)\n",
    "\n",
    "print('Last 5 input tokens:')\n",
    "for tid in test_ids[0, -5:]:\n",
    "    print(f\"  {tid.item():5d} -> '{hf_tokenizer.decode([tid.item()])}'\")\n",
    "\n",
    "print('\\nTop-5 next-token predictions (raw logits):')\n",
    "for score, tid in zip(top5.values, top5.indices):\n",
    "    print(f\"  {tid.item():5d} -> '{hf_tokenizer.decode([tid.item()])}' (logit={score.item():.3f})\")"
   ]
  },
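  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Aside:* the scores above are raw logits; a softmax over the full vocabulary turns them into probabilities, which are easier to eyeball."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Same top-5, expressed as probabilities via softmax over the vocabulary.\n",
    "probs = torch.softmax(logits[0, -1, :], dim=-1)\n",
    "top5p = torch.topk(probs, 5)\n",
    "for p, tid in zip(top5p.values, top5p.indices):\n",
    "    print(f\"  {tid.item():5d} -> '{hf_tokenizer.decode([tid.item()])}' (p={p.item():.3f})\")"
   ]
  },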
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# User embedding visualization (t-SNE)\n",
    "n_sample = min(200, len(user_sequences))\n",
    "embeddings = []\n",
    "labels_sample = []\n",
    "\n",
    "for i in range(n_sample):\n",
    "    tokens = builder.tokenize_sequence(user_sequences[i][:50])\n",
    "    enc = hf_tokenizer(' '.join(tokens), return_tensors='pt', add_special_tokens=False,\n",
    "                       max_length=256, truncation=True, padding='max_length')\n",
    "    with torch.no_grad():\n",
    "        emb = model.get_user_embedding(enc['input_ids'].to(device), enc['attention_mask'].to(device))\n",
    "    embeddings.append(emb.cpu().numpy().flatten())\n",
    "    labels_sample.append(user_fraud_labels[i])\n",
    "\n",
    "embeddings = np.array(embeddings)\n",
    "labels_sample = np.array(labels_sample)\n",
    "print(f'Embeddings: {embeddings.shape}, Fraud: {labels_sample.sum()}/{len(labels_sample)}')\n",
    "\n",
    "if len(embeddings) >= 20:\n",
    "    from sklearn.manifold import TSNE\n",
    "    coords = TSNE(n_components=2, random_state=42, perplexity=min(30, len(embeddings)-1)).fit_transform(embeddings)\n",
    "\n",
    "    fig, ax = plt.subplots(figsize=(8, 6))\n",
    "    for label, color, name in [(0, 'tab:green', 'Normal'), (1, 'tab:red', 'Fraud')]:\n",
    "        mask = labels_sample == label\n",
    "        ax.scatter(coords[mask, 0], coords[mask, 1], c=color, label=name, alpha=0.6, edgecolors='black', linewidth=0.3, s=30)\n",
    "    ax.set_title('User Embeddings (t-SNE) — Pre-trained DomainTransformer')\n",
    "    ax.legend()\n",
    "    plt.tight_layout()\n",
    "    plt.show()"
   ]
  },
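  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*Aside:* a quick linear probe over the same frozen embeddings, logistic regression against the user-level fraud labels. This is only a sanity check that the pre-trained representations carry signal; the proper evaluation is the fine-tuning notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sanity check: can a linear model separate fraud from normal on frozen embeddings?\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
    "if labels_sample.sum() >= 5:  # need a few positives for stratified CV\n",
    "    probe = LogisticRegression(max_iter=1000)\n",
    "    aucs = cross_val_score(probe, embeddings, labels_sample, cv=3, scoring='roc_auc')\n",
    "    print(f'Linear probe ROC-AUC: {aucs.mean():.3f} +/- {aucs.std():.3f}')\n",
    "else:\n",
    "    print('Too few fraud users in the sample for a meaningful probe.')"
   ]
  },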
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Save Artifacts for Fine-Tuning Notebook\n",
    "\n",
    "Saves the pre-trained model, tokenizer, and user data so `02_finance_finetune.ipynb` can pick up where we left off."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Save tokenizer\n",
    "hf_tokenizer.save_pretrained('./finance_tokenizer')\n",
    "builder.save('./finance_tokenizer')\n",
    "\n",
    "# Save model\n",
    "model.save_pretrained('./finance_pretrain_checkpoints/final')\n",
    "\n",
    "# Save user data\n",
    "artifacts = {\n",
    "    'user_sequences': user_sequences,\n",
    "    'user_ids': user_ids,\n",
    "    'user_fraud_labels': user_fraud_labels,\n",
    "}\n",
    "with open('./finance_artifacts.pkl', 'wb') as f:\n",
    "    pickle.dump(artifacts, f)\n",
    "\n",
    "print('Saved: tokenizer, model, user data')\n",
    "print('  ./finance_tokenizer/')\n",
    "print('  ./finance_pretrain_checkpoints/final/')\n",
    "print('  ./finance_artifacts.pkl')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Metric | Value |\n",
    "|--------|-------|\n",
    "| Dataset | Nigerian Financial Transactions (5M) |\n",
    "| Users (5+ events) | *see output above* |\n",
    "| Training tokens | *see output above* |\n",
    "| Model | DomainTransformer 24M (NoPE, GPT-style) |\n",
    "| Final loss | *see output above* |\n",
    "| UNK rate | *see output above* |\n",
    "\n",
    "**Next:** `02_finance_finetune.ipynb` — Fine-tune for fraud detection with JointFusionModel, compare vs LightGBM."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}