{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 03 — E-Commerce Fine-Tuning: Future Purchase Prediction\n",
    "\n",
    "**Goal:** Fine-tune the pre-trained DomainTransformer for predicting whether a user will purchase in the future, using only their past browsing history.\n",
    "\n",
    "**Task:** Binary classification: given the first 70% of a user's events, predict whether they purchase in the remaining 30%.\n",
    "\n",
    "**Why temporal split:** Avoids label leakage. The previous version used `n_purchases` as a feature to predict `has_purchase`, which made the task trivial (AUC 1.0). This version simulates the production scenario: predict future behavior from past behavior.\n",
    "\n",
    "**Pre-trained model:** [rtferraz/ecommerce-domain-24m](https://huggingface.co/rtferraz/ecommerce-domain-24m)\n",
    "\n",
    "**Architecture:** JointFusionModel (pre-trained Transformer + DCNv2 with PLR tabular embeddings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# !pip install datasets transformers torch accelerate tokenizers numpy pandas matplotlib scikit-learn wandb huggingface_hub lightgbm safetensors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging, pickle, os, sys, gc\n",
    "from datetime import datetime\n",
    "from collections import Counter\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import torch\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import roc_auc_score\n",
    "\n",
    "if os.path.exists('../src'): sys.path.insert(0, '../src')\n",
    "elif os.path.exists('src'): sys.path.insert(0, 'src')\n",
    "\n",
    "from domain_tokenizer import (\n",
    "    DomainTokenizerBuilder, DomainTransformerConfig,\n",
    "    DomainTransformerForCausalLM, JointFusionModel,\n",
    "    DomainFinetuneDataset, finetune_domain_model,\n",
    ")\n",
    "from domain_tokenizer.schema import DomainSchema, FieldSpec, FieldType\n",
    "\n",
    "logging.basicConfig(level=logging.INFO, format='%(asctime)s %(message)s')\n",
    "print(f'torch: {torch.__version__}, CUDA: {torch.cuda.is_available()}')\n",
    "if torch.cuda.is_available():\n",
    "    print(f'GPU: {torch.cuda.get_device_name(0)}, VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f}GB')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from huggingface_hub import login\n",
    "login()\n",
    "\n",
    "import wandb\n",
    "wandb.login()\n",
    "os.environ['WANDB_PROJECT'] = 'domainTokenizer'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1 — Load Pre-trained Artifacts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('./ecommerce_artifacts.pkl', 'rb') as f:\n",
    "    artifacts = pickle.load(f)\n",
    "user_sequences = artifacts['user_sequences']\n",
    "user_ids = artifacts['user_ids']\n",
    "print(f'Loaded {len(user_sequences):,} users')\n",
    "\n",
    "from transformers import PreTrainedTokenizerFast\n",
    "hf_tokenizer = PreTrainedTokenizerFast.from_pretrained('./ecommerce_tokenizer')\n",
    "print(f'Tokenizer vocab: {hf_tokenizer.vocab_size}')\n",
    "\n",
    "ECOMMERCE_REES46_SCHEMA = DomainSchema(\n",
    "    name='ecommerce_rees46',\n",
    "    fields=[\n",
    "        FieldSpec(name='event_type', field_type=FieldType.CATEGORICAL_FIXED, prefix='EVT',\n",
    "                  categories=['view', 'cart', 'remove_from_cart', 'purchase']),\n",
    "        FieldSpec(name='price', field_type=FieldType.NUMERICAL_CONTINUOUS, prefix='PRICE', n_bins=21),\n",
    "        FieldSpec(name='category', field_type=FieldType.TEXT, prefix='CAT'),\n",
    "        FieldSpec(name='timestamp', field_type=FieldType.TEMPORAL, calendar_fields=['dow', 'hour']),\n",
    "    ],\n",
    ")\n",
    "builder = DomainTokenizerBuilder(ECOMMERCE_REES46_SCHEMA)\n",
    "all_events_flat = [e for seq in user_sequences for e in seq]\n",
    "builder.fit(all_events_flat)\n",
    "del all_events_flat; gc.collect()\n",
    "print('Builder fitted')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = DomainTransformerForCausalLM.from_pretrained('./ecommerce_pretrain_checkpoints/final/')\n",
    "print(f'Pre-trained model loaded: {sum(p.numel() for p in model.parameters()):,} params')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2 — Temporal Split: Labels and Features\n",
    "\n",
    "**The key design (avoids leakage):**\n",
    "- Split each user's events at the 70% mark temporally\n",
    "- **Input to model:** first 70% of events (history)\n",
    "- **Label:** did the user purchase in the last 30%? (future)\n",
    "- **Tabular features:** computed only from the first 70% (no future info)\n",
    "\n",
    "This matches Nubank's setup: predict future behavior from past history."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "SPLIT_RATIO = 0.7  # 70% history, 30% future\n",
    "MIN_HISTORY = 5    # need at least 5 events in history\n",
    "MIN_FUTURE = 3     # need at least 3 events in future\n",
    "\n",
    "history_sequences = []  # input to model\n",
    "future_labels = []      # target: purchased in future?\n",
    "valid_user_ids = []\n",
    "\n",
    "for i, events in enumerate(user_sequences):\n",
    "    split_idx = int(len(events) * SPLIT_RATIO)\n",
    "    history = events[:split_idx]\n",
    "    future = events[split_idx:]\n",
    "    \n",
    "    if len(history) < MIN_HISTORY or len(future) < MIN_FUTURE:\n",
    "        continue\n",
    "    \n",
    "    # Label: did user purchase in the future window?\n",
    "    has_future_purchase = any(e['event_type'] == 'purchase' for e in future)\n",
    "    \n",
    "    history_sequences.append(history)\n",
    "    future_labels.append(1.0 if has_future_purchase else 0.0)\n",
    "    valid_user_ids.append(user_ids[i])\n",
    "\n",
    "future_labels = np.array(future_labels)\n",
    "print(f'Valid users (enough history + future): {len(history_sequences):,}')\n",
    "print(f'Future purchasers: {future_labels.sum():.0f} / {len(future_labels)} ({future_labels.mean()*100:.1f}%)')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def compute_history_features(events):\n",
    "    \"\"\"Features from HISTORY ONLY — no future information leaks.\"\"\"\n",
    "    n_events = len(events)\n",
    "    n_views = sum(1 for e in events if e['event_type'] == 'view')\n",
    "    n_carts = sum(1 for e in events if e['event_type'] == 'cart')\n",
    "    n_removes = sum(1 for e in events if e['event_type'] == 'remove_from_cart')\n",
    "    # NOTE: n_purchases in HISTORY is allowed — it's past behavior, not future\n",
    "    n_hist_purchases = sum(1 for e in events if e['event_type'] == 'purchase')\n",
    "    \n",
    "    prices = [e['price'] for e in events if e['price'] > 0]\n",
    "    avg_price = np.mean(prices) if prices else 0\n",
    "    max_price = max(prices) if prices else 0\n",
    "    std_price = np.std(prices) if len(prices) > 1 else 0\n",
    "    \n",
    "    n_unique_categories = len(set(e['category'] for e in events))\n",
    "    avg_hour = np.mean([e['timestamp'].hour for e in events])\n",
    "    \n",
    "    # Funnel ratios from history\n",
    "    cart_rate = n_carts / max(n_views, 1)\n",
    "    remove_rate = n_removes / max(n_carts, 1) if n_carts > 0 else 0\n",
    "    hist_purchase_rate = n_hist_purchases / max(n_events, 1)\n",
    "    \n",
    "    # Session intensity (events per day approximation)\n",
    "    if len(events) >= 2:\n",
    "        time_span = (events[-1]['timestamp'] - events[0]['timestamp']).total_seconds() / 86400  # days\n",
    "        events_per_day = n_events / max(time_span, 1)\n",
    "    else:\n",
    "        events_per_day = 0\n",
    "    \n",
    "    return [\n",
    "        n_events, n_views, n_carts, n_removes, n_hist_purchases,\n",
    "        avg_price, max_price, std_price,\n",
    "        n_unique_categories, avg_hour,\n",
    "        cart_rate, remove_rate, hist_purchase_rate, events_per_day,\n",
    "    ]\n",
    "\n",
    "FEATURE_NAMES = [\n",
    "    'n_events', 'n_views', 'n_carts', 'n_removes', 'n_hist_purchases',\n",
    "    'avg_price', 'max_price', 'std_price',\n",
    "    'n_unique_categories', 'avg_hour',\n",
    "    'cart_rate', 'remove_rate', 'hist_purchase_rate', 'events_per_day',\n",
    "]\n",
    "\n",
    "print(f'Computing features from history only...')\n",
    "tabular_features = np.array([compute_history_features(seq) for seq in history_sequences], dtype=np.float32)\n",
    "print(f'Features: {tabular_features.shape}, {len(FEATURE_NAMES)} features')\n",
    "print(f'Feature names: {FEATURE_NAMES}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train/test split (80/20, stratified)\n",
    "train_idx, test_idx = train_test_split(\n",
    "    range(len(history_sequences)), test_size=0.2, random_state=42, stratify=future_labels\n",
    ")\n",
    "\n",
    "train_seqs = [history_sequences[i] for i in train_idx]\n",
    "test_seqs = [history_sequences[i] for i in test_idx]\n",
    "train_features = tabular_features[train_idx]\n",
    "test_features = tabular_features[test_idx]\n",
    "train_labels = future_labels[train_idx]\n",
    "test_labels = future_labels[test_idx]\n",
    "\n",
    "print(f'Train: {len(train_seqs):,} ({train_labels.mean()*100:.1f}% will purchase in future)')\n",
    "print(f'Test: {len(test_seqs):,} ({test_labels.mean()*100:.1f}% will purchase in future)')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3 — LightGBM Baseline (history features only)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import lightgbm as lgb\n",
    "\n",
    "lgb_model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05, max_depth=6, random_state=42, verbose=-1)\n",
    "lgb_model.fit(train_features, train_labels)\n",
    "\n",
    "lgb_test_probs = lgb_model.predict_proba(test_features)[:, 1]\n",
    "lgb_test_auc = roc_auc_score(test_labels, lgb_test_probs)\n",
    "\n",
    "print(f'LightGBM Baseline (history features only):')\n",
    "print(f'  Test AUC: {lgb_test_auc:.4f}')\n",
    "\n",
    "importance = pd.Series(lgb_model.feature_importances_, index=FEATURE_NAMES).sort_values(ascending=False)\n",
    "print(f'\\nTop features:')\n",
    "for feat, imp in importance.head(7).items(): print(f'  {feat}: {imp}')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4 — JointFusionModel Fine-Tuning\n",
    "\n",
    "The transformer sees the **raw event sequence** (history only). The DCNv2 branch sees the **hand-crafted features** (also history only). The question: does the raw sequence add signal beyond what the features capture?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "MAX_LENGTH = 256\n",
    "\n",
    "train_dataset = DomainFinetuneDataset(\n",
    "    train_seqs, train_features, train_labels, builder, hf_tokenizer, max_length=MAX_LENGTH)\n",
    "test_dataset = DomainFinetuneDataset(\n",
    "    test_seqs, test_features, test_labels, builder, hf_tokenizer, max_length=MAX_LENGTH)\n",
    "\n",
    "print(f'Train: {len(train_dataset)}, Test: {len(test_dataset)}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fusion_model = JointFusionModel(\n",
    "    transformer_model=model,\n",
    "    n_tabular_features=len(FEATURE_NAMES),\n",
    "    n_classes=1,\n",
    "    plr_frequencies=32, plr_embedding_dim=32,\n",
    "    dcn_cross_layers=3, dcn_deep_layers=2, dcn_deep_dim=128,\n",
    "    head_hidden_dim=128, dropout=0.1,\n",
    ")\n",
    "print(f'JointFusion: {sum(p.numel() for p in fusion_model.parameters()):,} params')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "USE_GPU = torch.cuda.is_available()\n",
    "GPU_NAME = torch.cuda.get_device_name(0) if USE_GPU else ''\n",
    "USE_BF16 = USE_GPU and 'T4' not in GPU_NAME\n",
    "USE_FP16 = USE_GPU and not USE_BF16\n",
    "\n",
    "trainer = finetune_domain_model(\n",
    "    model=fusion_model,\n",
    "    train_dataset=train_dataset,\n",
    "    eval_dataset=test_dataset,\n",
    "    output_dir='./ecommerce_finetune_checkpoints',\n",
    "    num_epochs=5 if USE_GPU else 2,\n",
    "    per_device_batch_size=32 if USE_GPU else 8,\n",
    "    gradient_accumulation_steps=1,\n",
    "    learning_rate=1e-4,\n",
    "    warmup_steps=50,\n",
    "    logging_steps=20,\n",
    "    eval_steps=200 if USE_GPU else 50,\n",
    "    save_strategy='no',\n",
    "    bf16=USE_BF16, fp16=USE_FP16,\n",
    "    report_to='wandb',\n",
    "    run_name='ecommerce-finetune-temporal-5ep',\n",
    "    seed=42,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5 — Evaluate and Compare"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fusion_model.eval()\n",
    "device = 'cuda' if torch.cuda.is_available() else 'cpu'\n",
    "fusion_model = fusion_model.to(device)\n",
    "\n",
    "all_probs, all_labels_eval = [], []\n",
    "loader = torch.utils.data.DataLoader(test_dataset, batch_size=64, shuffle=False)\n",
    "\n",
    "with torch.no_grad():\n",
    "    for batch in loader:\n",
    "        batch = {k: v.to(device) for k, v in batch.items()}\n",
    "        labels_batch = batch.pop('labels')\n",
    "        out = fusion_model(**batch)\n",
    "        probs = torch.sigmoid(out['logits'].squeeze(-1))\n",
    "        all_probs.extend(probs.cpu().numpy())\n",
    "        all_labels_eval.extend(labels_batch.cpu().numpy())\n",
    "\n",
    "fusion_test_auc = roc_auc_score(np.array(all_labels_eval), np.array(all_probs))\n",
    "print(f'JointFusion Test AUC: {fusion_test_auc:.4f}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('=' * 60)\n",
    "print('MODEL COMPARISON — Future Purchase Prediction (AUC)')\n",
    "print('=' * 60)\n",
    "print(f'  LightGBM (history features only):     {lgb_test_auc:.4f}')\n",
    "print(f'  JointFusion (Transformer + features): {fusion_test_auc:.4f}')\n",
    "print(f'  Difference:                           {fusion_test_auc - lgb_test_auc:+.4f}')\n",
    "print('=' * 60)\n",
    "\n",
    "if fusion_test_auc > lgb_test_auc:\n",
    "    print(f'\\n✅ JointFusion beats LightGBM by {(fusion_test_auc - lgb_test_auc)*100:.2f} pp')\n",
    "    print(f'   The sequential patterns from domain tokens add value beyond tabular features.')\n",
    "elif abs(fusion_test_auc - lgb_test_auc) < 0.005:\n",
    "    print(f'\\n≈ Roughly tied. The transformer embeddings match LightGBM.')\n",
    "    print(f'   More pre-training epochs would likely push JointFusion ahead.')\n",
    "else:\n",
    "    print(f'\\n⚠️ LightGBM leads by {(lgb_test_auc - fusion_test_auc)*100:.2f} pp')\n",
    "    print(f'   More pre-training (10+ epochs) and longer context (1024+) needed.')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "losses = [h['loss'] for h in trainer.state.log_history if 'loss' in h]\n",
    "eval_losses = [h['eval_loss'] for h in trainer.state.log_history if 'eval_loss' in h]\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 5))\n",
    "ax.plot(losses, label='Train Loss', alpha=0.7)\n",
    "if eval_losses:\n",
    "    eval_x = np.linspace(0, len(losses), len(eval_losses))\n",
    "    ax.plot(eval_x, eval_losses, 'ro-', label='Eval Loss', markersize=4)\n",
    "ax.set_xlabel('Step'); ax.set_ylabel('Loss'); ax.set_title('Fine-Tuning Loss (Temporal Split)')\n",
    "ax.legend(); ax.grid(True, alpha=0.3); plt.tight_layout(); plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "wandb.finish()\n",
    "print('Done!')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "| Model | Test AUC | Input |\n",
    "|-------|----------|-------|\n",
    "| LightGBM | *see above* | 14 history-only features |\n",
    "| JointFusion | *see above* | Pre-trained domain token sequence + same 14 features |\n",
    "\n",
    "**Task:** Predict future purchase from past browsing history (temporal split, no leakage).\n",
    "\n",
    "The pre-trained DomainTransformer captures sequential patterns (browsing funnels, category stickiness, temporal habits) that may add predictive signal beyond aggregate features."
   ]
  }
 ],
 "metadata": {
  "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" },
  "language_info": { "name": "python", "version": "3.12.0" }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}