---
tags:
- xgboost
- catboost
- conversion-prediction
- view-through-attribution
- advertising
- classification
- imbalanced-data
- ml-intern
license: apache-2.0
library_name: catboost
---

# View-Through Conversion (VTC) Predictor v2 — Inventory Signals Only

**Task**: Predict conversion probability from ad-impression inventory signals, where conversions happen *independently of clicks* (post-impression / view-through).

**Key constraint**: False negatives are expensive — we optimize for **recall / F2**.

---

## Dataset

- **Source**: [`criteo/criteo-attribution-dataset`](https://huggingface.co/datasets/criteo/criteo-attribution-dataset)
- **Size**: 300K sampled rows (6M+ available in the full dataset)
- **Target**: `conversion` (0/1) — independent of click
- **Features**: 28 engineered features from **inventory signals only** (no click data)

### Inventory Features Used

| Feature | Description |
|---|---|
| `timestamp`, `uid`, `campaign` | Basic impression metadata |
| `cost`, `cpo` | Bid / pricing signals |
| `cat1`–`cat9` | Categorical inventory features |
| `time_since_last_click` | Recency (set to the median if no prior click) |
| `uid_freq`, `campaign_freq` | Frequency encoding (how common this user/campaign is) |
| `campaign_conv_rate` | Target-encoded campaign conversion rate |
| `cpo_log`, `cost_log` | Log-transformed prices |
| `tsc_bucket` | Binned time-since-last-click |
| `cost_cpo_ratio`, `cost_cpo_diff` | Bid competitiveness signals |
| `campaign_cat4`, `cat1_cat2`, `cat3_cat5` | Cross-categorical interactions |
| `campaign_cost_mean`, `campaign_cpo_mean` | Campaign-level price statistics |
| `uid_impression_num` | User impression sequence number |
| `cat_hash` | Hashed combination of 4 categories |

**Excluded click signals**: `click`, `click_pos`, `click_nb`, `attribution`, `conversion_timestamp`, `conversion_id`

---

## Model

- **Architecture**: CatBoost classifier (gradient-boosted trees with native categorical handling)
- **Objective**: `Logloss` (binary cross-entropy; CatBoost's equivalent of XGBoost's `binary:logistic`)
- **Class imbalance**: `scale_pos_weight = 18` (derived from the class ratio)
- **Key hyperparameters**: iterations=1500, depth=8, lr=0.05, l2_leaf_reg=3
- **Early stopping**: 100 rounds on validation PR-AUC

---

## Metrics (Test Set, 45,000 impressions)

### Threshold Scan

| Threshold | Recall | Precision | F1 | F2 | FN | FP | FPR |
|---|---|---|---|---|---|---|---|
| 0.01 | **0.9996** | 0.051 | 0.097 | 0.211 | **1** | 42,256 | 0.989 |
| 0.05 | **0.993** | 0.058 | 0.109 | 0.235 | 15 | 36,537 | 0.855 |
| 0.10 | **0.979** | 0.068 | 0.127 | 0.266 | 48 | 30,263 | 0.708 |
| 0.15 | **0.955** | 0.078 | 0.144 | 0.294 | 101 | 25,527 | 0.597 |
| **0.20** | **0.932** | 0.088 | 0.161 | **0.319** | **154** | 21,799 | 0.510 |
| 0.25 | **0.913** | 0.100 | 0.180 | **0.347** | 197 | 18,613 | 0.436 |
| 0.30 | 0.881 | 0.111 | 0.198 | 0.370 | 268 | 15,862 | 0.371 |
| 0.40 | 0.810 | 0.140 | 0.239 | 0.415 | 429 | 11,194 | 0.262 |
| 0.50 | 0.736 | 0.173 | 0.280 | **0.446** | 597 | 7,943 | 0.186 |

### Ranking Quality

| Metric | Value |
|---|---|
| **AUC** | **0.8581** |
| **PR-AUC** | **0.3430** |

### Feature Importance (Top 10)

| Rank | Feature | Importance |
|---|---|---|
| 1 | `campaign_conv_rate` | 9.94 |
| 2 | `cost_cpo_ratio` | 8.87 |
| 3 | `cat1_cat2` | 5.27 |
| 4 | `cost` | 5.17 |
| 5 | `cost_log` | 4.99 |
| 6 | `cat1` | 4.98 |
| 7 | `campaign_cost_mean` | 4.09 |
| 8 | `campaign_cpo_mean` | 3.57 |
| 9 | `uid` | 3.43 |
| 10 | `cat_hash` | 3.43 |

---

## Choosing Your Threshold

The **cost of a false negative** (missed conversion) vs. the **cost of a false positive** (wasted impression) determines the right threshold:

| If an FN is … times more expensive than an FP | Recommended Threshold | Recall | FN Rate |
|---|---|---|---|
| 1× (equal cost) | 0.50 | 73.6% | 26.4% |
| 5× | 0.20 | 93.2% | 6.8% |
| 10× | 0.10 | 97.9% | 2.1% |
| 50×+ | 0.01 | 99.96% | 0.04% |

---

## Improvements Over v1 (Baseline XGBoost)

| Metric | v1 Baseline | v2 CatBoost + Engineering | Δ (points) |
|---|---|---|---|
| AUC | 0.847 | **0.858** | **+1.1** |
| PR-AUC | 0.318 | **0.343** | **+2.5** |
| Recall@0.50 | 0.673 | 0.736 | +6.3 |
| F2@0.50 | 0.447 | **0.446** | ~same |
| Recall@0.20 | — | **0.932** | New |
| FN@0.20 | — | **154** (of 2,257) | New |

**Key improvements**:

1. **Campaign conversion rate** (target encoding) is the #1 feature
2. **Cross-categorical interactions** (`cat1_cat2`, `campaign_cat4`) capture context
3. **Bid competitiveness signals** (`cost_cpo_ratio`, `cost_log`) separate high-intent inventory
4. **CatBoost's native categorical handling** avoids one-hot explosion for high-cardinality IDs

---

## Inference Example

```python
import pandas as pd
from catboost import CatBoostClassifier

# Load the model
model = CatBoostClassifier()
model.load_model("vtc_best.cbm")

# Features needed (inventory signals only)
features = [
    "timestamp", "uid", "campaign", "cost", "cpo", "time_since_last_click",
    "cat1", "cat2", "cat3", "cat4", "cat5", "cat6", "cat7", "cat8", "cat9",
    "uid_freq", "campaign_freq", "campaign_conv_rate",
    "cpo_log", "cost_log", "tsc_bucket", "cost_cpo_ratio", "cost_cpo_diff",
    "campaign_cost_mean", "campaign_cost_std", "campaign_cpo_mean", "campaign_cpo_std",
    "uid_impression_num", "cat_hash",
    "campaign_cat4", "cat1_cat2", "cat3_cat5",
]

# Build a single row (you must compute the engineered features from raw data)
row = pd.DataFrame({...})

# Predict
proba = model.predict_proba(row[features])[0, 1]
print(f"Conversion probability: {proba:.4f}")

# Apply a threshold based on the business cost of FN vs. FP:
# thr=0.20 → ~93% recall (catches most converters)
# thr=0.50 → ~74% recall, lower FP rate
```

---

## Training Details

- Framework: CatBoost 1.2.7
- Data: 300K rows (210K train / 45K val / 45K test)
- Conversion rate: 5.02%
- Class weight: `scale_pos_weight = 18`
- Hardware: CPU-only
- Time: ~8 minutes

---

## Citation

Original dataset: [Criteo Attribution Dataset](https://huggingface.co/datasets/criteo/criteo-attribution-dataset)

---

## Files

- `vtc_best.cbm` — CatBoost model
- `final_importance.csv` —
Per-feature importance

## Generated by ML Intern

This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern

## Usage

This is a CatBoost model, not a `transformers` model, so it cannot be loaded with the `AutoModel` classes. Download the `.cbm` file from the Hub and load it with CatBoost:

```python
from catboost import CatBoostClassifier
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="nithish277/vtc-inventory-signals",
    filename="vtc_best.cbm",
)
model = CatBoostClassifier()
model.load_model(model_path)
```
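
## Appendix: Recreating the Engineered Features (Sketch)

The inference example above notes that you must compute the engineered features from raw data yourself. As a starting point, here is a minimal, hedged sketch of a few of them (frequency encoding, target-encoded campaign conversion rate, log prices, and bid-competitiveness signals). The raw column names (`uid`, `campaign`, `conversion`, `cost`, `cpo`) follow the Criteo attribution dataset; the smoothing constant `k` and the `add_engineered_features` helper are illustrative assumptions, not the exact pipeline used to train this model.

```python
import numpy as np
import pandas as pd


def add_engineered_features(df: pd.DataFrame, train: pd.DataFrame) -> pd.DataFrame:
    """Illustrative recreation of a few engineered features.

    `train` supplies the statistics, so encodings are fit on training
    data only (avoids target leakage at inference time).
    """
    out = df.copy()

    # Frequency encoding: how common each user / campaign is in the training data
    for col in ["uid", "campaign"]:
        freq = train[col].value_counts(normalize=True)
        out[f"{col}_freq"] = out[col].map(freq).fillna(0.0)

    # Smoothed target encoding of the per-campaign conversion rate
    prior = train["conversion"].mean()
    stats = train.groupby("campaign")["conversion"].agg(["mean", "count"])
    k = 100  # smoothing strength (assumed, not from the model card)
    smoothed = (stats["mean"] * stats["count"] + prior * k) / (stats["count"] + k)
    out["campaign_conv_rate"] = out["campaign"].map(smoothed).fillna(prior)

    # Log-transformed prices and bid-competitiveness signals
    out["cost_log"] = np.log1p(out["cost"])
    out["cpo_log"] = np.log1p(out["cpo"])
    out["cost_cpo_ratio"] = out["cost"] / (out["cpo"] + 1e-9)
    out["cost_cpo_diff"] = out["cost"] - out["cpo"]
    return out
```

Fitting the encodings on the training split and mapping them onto new rows (with fallbacks to 0 or the global prior for unseen IDs) mirrors how frequency and target encodings are normally deployed; the remaining features in the list (buckets, crosses, hashes, per-campaign price statistics) follow the same fit-on-train, map-at-inference pattern.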
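
## Appendix: Why F2 Is the Tuning Metric

Because false negatives are the expensive error here, the threshold scan tracks F2 alongside F1: F2 is the F-beta score with beta=2, which weights recall four times as heavily as precision. A minimal check of the formula against the thr=0.50 row of the scan (precision 0.173, recall 0.736, F1 0.280, F2 0.446):

```python
def f_beta(precision: float, recall: float, beta: float) -> float:
    # F_beta = (1 + b^2) * P * R / (b^2 * P + R); beta > 1 weights recall more
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Values from the thr=0.50 row of the threshold-scan table
f1 = f_beta(0.173, 0.736, beta=1)  # ≈ 0.280
f2 = f_beta(0.173, 0.736, beta=2)  # ≈ 0.446
```

This is why the F2 column peaks at higher thresholds than recall alone would suggest: F2 still rewards recall strongly, but not to the exclusion of precision.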