| --- |
| tags: |
| - xgboost |
| - catboost |
| - conversion-prediction |
| - view-through-attribution |
| - advertising |
| - classification |
| - imbalanced-data |
| - ml-intern |
| license: apache-2.0 |
| library_name: catboost |
| --- |
| |
| # View-Through Conversion (VTC) Predictor v2 — Inventory Signals Only |
|
|
| **Task**: Predict conversion probability from ad-impression inventory signals, |
| with conversions happening *independent of clicks* (post-impression / view-through). |
|
|
| **Key constraint**: False Negatives are expensive — we optimize for **Recall / F2**. |
|
|
| --- |
|
|
| ## Dataset |
|
|
| - **Source**: [`criteo/criteo-attribution-dataset`](https://huggingface.co/datasets/criteo/criteo-attribution-dataset) |
| - **Size**: 300K sampled rows (6M+ available in full dataset) |
| - **Target**: `conversion` (0/1) — independent of click |
| - **Features**: 28 engineered features from **inventory signals only** (no click data) |
|
|
| ### Inventory Features Used |
|
|
| | Feature | Description | |
| |---|---| |
| | `timestamp`, `uid`, `campaign` | Basic impression metadata | |
| | `cost`, `cpo` | Bid / pricing signals | |
| | `cat1`–`cat9` | Categorical inventory features | |
| | `time_since_last_click` | Recency (set to median if no prior click) | |
| | `uid_freq`, `campaign_freq` | Frequency encoding (how common is this user/campaign) | |
| | `campaign_conv_rate` | Target-encoded campaign conversion rate | |
| | `cpo_log`, `cost_log` | Log-transformed prices | |
| | `tsc_bucket` | Binned time-since-last-click | |
| | `cost_cpo_ratio`, `cost_cpo_diff` | Bid competitiveness signals | |
| | `campaign_cat4`, `cat1_cat2`, `cat3_cat5` | Cross-categorical interactions | |
| | `campaign_cost_mean`, `campaign_cost_std`, `campaign_cpo_mean`, `campaign_cpo_std` | Campaign-level price statistics | |
| | `uid_impression_num` | User impression sequence number | |
| | `cat_hash` | Hashed combination of 4 categories | |
|
|
| **Excluded click signals**: `click`, `click_pos`, `click_nb`, `attribution`, `conversion_timestamp`, `conversion_id` |
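
The card does not include the feature-engineering code, so the following is only a sketch of how a few of the listed columns could be derived with pandas. The function name, the `log1p` transform, the `eps` guard, the string join for `cat1_cat2`, and the fallback values for unseen campaigns are all assumptions, not the original pipeline. The one design point worth keeping is that frequency and target encodings are fitted on training rows only, so no label information from validation or test rows leaks into the features.

```python
import numpy as np
import pandas as pd


def add_inventory_features(train: pd.DataFrame, df: pd.DataFrame) -> pd.DataFrame:
    """Add a few engineered columns to `df`, using statistics fitted on `train` only."""
    out = df.copy()

    # Log-transformed prices (log1p keeps zero prices finite; assumed transform).
    out["cost_log"] = np.log1p(out["cost"])
    out["cpo_log"] = np.log1p(out["cpo"])

    # Bid competitiveness: ratio and difference of cost and cpo.
    eps = 1e-9  # guards against division by zero
    out["cost_cpo_ratio"] = out["cost"] / (out["cpo"] + eps)
    out["cost_cpo_diff"] = out["cost"] - out["cpo"]

    # Frequency encoding: share of training rows that belong to each campaign.
    campaign_freq = train["campaign"].value_counts(normalize=True)
    out["campaign_freq"] = out["campaign"].map(campaign_freq).fillna(0.0)

    # Target encoding: per-campaign conversion rate from training rows;
    # campaigns unseen in training fall back to the global training rate.
    global_rate = train["conversion"].mean()
    campaign_rate = train.groupby("campaign")["conversion"].mean()
    out["campaign_conv_rate"] = out["campaign"].map(campaign_rate).fillna(global_rate)

    # Cross-categorical interaction as a single string key.
    out["cat1_cat2"] = out["cat1"].astype(str) + "_" + out["cat2"].astype(str)
    return out


# Tiny hypothetical example: four training impressions, two new impressions.
train = pd.DataFrame({
    "campaign": [1, 1, 2, 2],
    "cost": [0.0, 1.0, 2.0, 3.0],
    "cpo": [1.0, 1.0, 2.0, 2.0],
    "cat1": [5, 5, 6, 6],
    "cat2": [7, 8, 7, 8],
    "conversion": [1, 0, 0, 0],
})
new = pd.DataFrame({
    "campaign": [1, 3],
    "cost": [1.0, 2.0],
    "cpo": [2.0, 4.0],
    "cat1": [5, 6],
    "cat2": [7, 8],
})
feats = add_inventory_features(train, new)
print(feats[["campaign_freq", "campaign_conv_rate", "cost_cpo_ratio", "cat1_cat2"]])
```

If the same function is applied to the training rows themselves, the target-encoded column sees each row's own label; out-of-fold encoding is a common way to avoid that.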
|
|
| --- |
|
|
| ## Model |
|
|
| - **Architecture**: CatBoost Classifier (gradient-boosted trees with native categorical handling) |
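| - **Objective**: binary log loss (`Logloss` in CatBoost; `binary:logistic` is the XGBoost name for the same objective) |
| - **Class imbalance**: `scale_pos_weight = 18`, close to the negative-to-positive ratio implied by the 5.02% conversion rate (about 18.9) |
| - **Key hyperparameters**: `iterations=1500`, `depth=8`, `learning_rate=0.05`, `l2_leaf_reg=3` |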
| - **Early stopping**: 100 rounds on validation PR-AUC |
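
The settings above could be expressed as CatBoost parameters roughly as follows. This is a reconstruction, not the original training script: the `Logloss` loss, the `PRAUC` metric name, and the `build_model` helper are assumptions made for illustration. The small helper also shows where a weight near 18 comes from, given the 5.02% conversion rate.

```python
def negative_to_positive_ratio(conversion_rate: float) -> float:
    """Number of negatives per positive for a given positive rate."""
    return (1.0 - conversion_rate) / conversion_rate


# Hyperparameters as listed in this card; loss and metric names are assumed.
params = {
    "iterations": 1500,
    "depth": 8,
    "learning_rate": 0.05,
    "l2_leaf_reg": 3,
    "scale_pos_weight": 18,        # the card's value; the exact ratio is slightly higher
    "loss_function": "Logloss",    # CatBoost's binary log loss (assumed)
    "eval_metric": "PRAUC",        # validation PR-AUC (assumed metric name)
    "early_stopping_rounds": 100,
}


def build_model(cat_features):
    """Create an unfitted classifier; catboost is imported lazily on purpose."""
    from catboost import CatBoostClassifier
    return CatBoostClassifier(cat_features=cat_features, **params)


ratio = negative_to_positive_ratio(0.0502)
print(f"negatives per positive at 5.02% conversion: {ratio:.2f}")
```

Early stopping only takes effect when a validation set is passed to `fit` through `eval_set`.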
|
|
| --- |
|
|
| ## Metrics (Test Set, 45,000 impressions) |
|
|
| ### Threshold Scan |
|
|
| | Threshold | Recall | Precision | F1 | F2 | FN | FP | FPR | |
| |---|---|---|---|---|---|---|---| |
| | 0.01 | **0.9996** | 0.051 | 0.097 | 0.211 | **1** | 42,256 | 0.989 | |
| | 0.05 | **0.993** | 0.058 | 0.109 | 0.235 | 15 | 36,537 | 0.855 | |
| | 0.10 | **0.979** | 0.068 | 0.127 | 0.266 | 48 | 30,263 | 0.708 | |
| | 0.15 | **0.955** | 0.078 | 0.144 | 0.294 | 101 | 25,527 | 0.597 | |
| | **0.20** | **0.932** | 0.088 | 0.161 | **0.319** | **154** | 21,799 | 0.510 | |
| | 0.25 | **0.913** | 0.100 | 0.180 | **0.347** | 197 | 18,613 | 0.436 | |
| | 0.30 | 0.881 | 0.111 | 0.198 | 0.370 | 268 | 15,862 | 0.371 | |
| | 0.40 | 0.810 | 0.140 | 0.239 | 0.415 | 429 | 11,194 | 0.262 | |
| | 0.50 | 0.736 | 0.173 | 0.280 | **0.446** | 597 | 7,943 | 0.186 | |
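
Every column in the scan follows from the FN and FP counts once the number of positives is known. As a sketch, the function below recomputes one row, taking the 2,257 test-set conversions (the figure quoted in the v1 comparison) and the 45,000 test impressions as given. The function name is illustrative and not part of the released code.

```python
def scan_row(fn: int, fp: int, n_pos: int = 2257, n_total: int = 45000) -> dict:
    """Recompute threshold-scan metrics from false-negative and false-positive counts."""
    tp = n_pos - fn              # conversions that were caught
    n_neg = n_total - n_pos      # non-converting impressions
    return {
        "recall": tp / n_pos,
        "precision": tp / (tp + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
        # F-beta with beta=2 weights recall four times as heavily as precision.
        "f2": 5 * tp / (5 * tp + 4 * fn + fp),
        "fpr": fp / n_neg,
    }


row = scan_row(fn=154, fp=21799)  # counts from the 0.20 row
print({name: round(value, 3) for name, value in row.items()})
```

With the counts from the 0.20 row, the rounded values match that row of the table.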
|
|
| ### Ranking Quality |
|
|
| | Metric | Value | |
| |---|---| |
| | **AUC** | **0.8581** | |
| | **PR-AUC** | **0.3430** | |
|
|
| ### Feature Importance (Top 10) |
|
|
| | Rank | Feature | Importance | |
| |---|---|---| |
| | 1 | `campaign_conv_rate` | 9.94 | |
| | 2 | `cost_cpo_ratio` | 8.87 | |
| | 3 | `cat1_cat2` | 5.27 | |
| | 4 | `cost` | 5.17 | |
| | 5 | `cost_log` | 4.99 | |
| | 6 | `cat1` | 4.98 | |
| | 7 | `campaign_cost_mean` | 4.09 | |
| | 8 | `campaign_cpo_mean` | 3.57 | |
| | 9 | `uid` | 3.43 | |
| | 10 | `cat_hash` | 3.43 | |
|
|
| --- |
|
|
| ## Choosing Your Threshold |
|
|
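| The right threshold depends on the **cost of a False Negative** (missed conversion) relative to the **cost of a False Positive** (wasted impression). The rows below are starting points; for a specific cost ratio, recompute total cost from the FN and FP counts in the threshold scan: |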
|
|
| | If FN is … times more expensive than FP | Recommended Threshold | Recall | FN Rate | |
| |---|---|---|---| |
| | 1× (equal cost) | 0.50 | 73.6% | 26.4% | |
| | 5× | 0.20 | 93.2% | 6.8% | |
| | 10× | 0.10 | 97.9% | 2.1% | |
| | 50×+ | 0.01 | 99.96% | 0.04% | |
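
The recommendations above are close to what the textbook rule for calibrated probabilities gives: with a false negative costing `r` times a false positive, the cost-minimizing cutoff is `1 / (1 + r)`. The card does not state its cost model, so the sketch below assumes a simple linear one (total cost = `r * FN + FP`, measured in units of one false positive) and evaluates it on the FN and FP counts from the threshold scan. Both function names are illustrative.

```python
# (threshold, FN, FP) copied from the threshold scan on the 45,000-impression test set.
SCAN = [
    (0.01, 1, 42256), (0.05, 15, 36537), (0.10, 48, 30263),
    (0.15, 101, 25527), (0.20, 154, 21799), (0.25, 197, 18613),
    (0.30, 268, 15862), (0.40, 429, 11194), (0.50, 597, 7943),
]


def calibrated_threshold(fn_cost_ratio: float) -> float:
    """Cost-minimizing cutoff when scores are calibrated probabilities."""
    return 1.0 / (1.0 + fn_cost_ratio)


def cheapest_scanned_threshold(fn_cost_ratio: float) -> tuple:
    """Return (threshold, total_cost) minimizing ratio * FN + FP over the scan."""
    costs = [(fn_cost_ratio * fn + fp, thr) for thr, fn, fp in SCAN]
    cost, thr = min(costs)
    return thr, cost


for ratio in (1, 5, 10, 50):
    rule = calibrated_threshold(ratio)
    thr, cost = cheapest_scanned_threshold(ratio)
    print(f"FN = {ratio}x FP: calibrated rule {rule:.3f}, cheapest scanned {thr:.2f} (cost {cost})")
```

The two approaches can disagree. Training with `scale_pos_weight` tends to push predicted probabilities upward, so the scores may not be calibrated and the empirical counts can favor a higher cutoff than the rule suggests. The scan also stops at 0.50, so a cutoff reported at that boundary may not be the true optimum.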
|
|
| --- |
|
|
| ## Improvements Over v1 (Baseline XGBoost) |
|
|
| | Metric | v1 Baseline | v2 CatBoost + Engineering | Δ (absolute) | |
| |---|---|---|---| |
| | AUC | 0.847 | **0.858** | **+0.011** | |
| | PR-AUC | 0.318 | **0.343** | **+0.025** | |
| | Recall@0.50 | 0.673 | **0.736** | **+0.063** | |
| | F2@0.50 | 0.447 | 0.446 | -0.001 (about the same) | |
| | Recall@0.20 | — | **0.932** | New! | |
| | FN@0.20 | — | **154** (of 2,257) | New! | |
|
|
| **Key improvements**: |
| 1. **Campaign conversion rate** (target encoding) is the #1 feature |
| 2. **Cross-categorical interactions** (`cat1_cat2`, `campaign_cat4`) capture context |
| 3. **Bid competitiveness signals** (`cost_cpo_ratio`, `cost_log`) separate high-intent inventory |
| 4. **CatBoost native categorical handling** avoids one-hot explosion for high-cardinality IDs |
|
|
| --- |
|
|
| ## Inference Example |
|
|
| ```python |
| import pandas as pd |
| from catboost import CatBoostClassifier |
| |
| # Load model |
| model = CatBoostClassifier() |
| model.load_model("vtc_best.cbm") |
| |
| # Features needed (inventory signals only) |
| features = [ |
| "timestamp","uid","campaign","cost","cpo","time_since_last_click", |
| "cat1","cat2","cat3","cat4","cat5","cat6","cat7","cat8","cat9", |
| "uid_freq","campaign_freq","campaign_conv_rate", |
| "cpo_log","cost_log","tsc_bucket","cost_cpo_ratio","cost_cpo_diff", |
| "campaign_cost_mean","campaign_cost_std","campaign_cpo_mean","campaign_cpo_std", |
| "uid_impression_num","cat_hash", |
| "campaign_cat4","cat1_cat2","cat3_cat5" |
| ] |
| |
| # Build a single-row DataFrame containing every column in `features` |
| # (engineered features must be computed from raw data the same way as in training) |
| row = pd.DataFrame({...}) |
| |
| # Probability of the positive class (conversion) |
| proba = model.predict_proba(row[features])[0, 1] |
| print(f"Conversion probability: {proba:.4f}") |
| |
| # Apply a threshold based on the business cost of FN vs FP |
| # thr=0.20 → ~93% recall (test-set FPR ~0.51) |
| # thr=0.50 → ~74% recall (test-set FPR ~0.19) |
| ``` |
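
Turning probabilities into decisions is a comparison against the chosen cutoff. A minimal sketch follows; the probabilities are made up, and treating a score equal to the threshold as positive is an assumption, since the card does not say how ties were handled in the scan.

```python
def flag_likely_converters(probas, threshold=0.20):
    """Return True for each probability at or above the threshold."""
    return [p >= threshold for p in probas]


example_probas = [0.03, 0.18, 0.22, 0.61]  # hypothetical model outputs
recall_oriented = flag_likely_converters(example_probas)        # thr=0.20
conservative = flag_likely_converters(example_probas, 0.50)     # thr=0.50
print(recall_oriented)
print(conservative)
```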
|
|
| --- |
|
|
| ## Training Details |
|
|
| - Framework: CatBoost 1.2.7 |
| - Data: 300K rows (210K train / 45K val / 45K test) |
| - Conversion rate: 5.02% |
| - Class weight: `scale_pos_weight = 18` |
| - Hardware: CPU-only |
| - Time: ~8 minutes |
|
|
| --- |
|
|
| ## Citation |
|
|
| Original dataset: [Criteo Attribution Dataset](https://huggingface.co/datasets/criteo/criteo-attribution-dataset) |
|
|
| --- |
|
|
| ## Files |
|
|
| - `vtc_best.cbm` — CatBoost model |
| - `final_importance.csv` — Per-feature importance |
|
|
| <!-- ml-intern-provenance --> |
| ## Generated by ML Intern |
|
|
| This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub. |
|
|
| - Try ML Intern: https://smolagents-ml-intern.hf.space |
| - Source code: https://github.com/huggingface/ml-intern |
|
|
| ## Usage |
|
|
| ```python |
| from catboost import CatBoostClassifier |
| from huggingface_hub import hf_hub_download |
| |
| # Download the CatBoost model file from the Hub, then load it |
| model_id = 'nithish277/vtc-inventory-signals' |
| model_path = hf_hub_download(repo_id=model_id, filename="vtc_best.cbm") |
| model = CatBoostClassifier() |
| model.load_model(model_path) |
| ``` |
|
|
| The `transformers` `AutoModel` classes do not apply here: the repository holds a CatBoost `.cbm` file, not a Transformers checkpoint. |
|
|