nithish277's picture
Update ML Intern artifact metadata
691f31e verified
---
tags:
- xgboost
- catboost
- conversion-prediction
- view-through-attribution
- advertising
- classification
- imbalanced-data
- ml-intern
license: apache-2.0
library_name: catboost
---
# View-Through Conversion (VTC) Predictor v2 — Inventory Signals Only
**Task**: Predict conversion probability from ad-impression inventory signals,
with conversions happening *independent of clicks* (post-impression / view-through).
**Key constraint**: False Negatives are expensive — we optimize for **Recall / F2**.
---
## Dataset
- **Source**: [`criteo/criteo-attribution-dataset`](https://huggingface.co/datasets/criteo/criteo-attribution-dataset)
- **Size**: 300K sampled rows (6M+ available in full dataset)
- **Target**: `conversion` (0/1) — independent of click
- **Features**: 28 engineered features from **inventory signals only** (no click data)
### Inventory Features Used
| Feature | Description |
|---|---|
| `timestamp`, `uid`, `campaign` | Basic impression metadata |
| `cost`, `cpo` | Bid / pricing signals |
| `cat1``cat9` | Categorical inventory features |
| `time_since_last_click` | Recency (set to median if no prior click) |
| `uid_freq`, `campaign_freq` | Frequency encoding (how common is this user/campaign) |
| `campaign_conv_rate` | Target-encoded campaign conversion rate |
| `cpo_log`, `cost_log` | Log-transformed prices |
| `tsc_bucket` | Binned time-since-last-click |
| `cost_cpo_ratio`, `cost_cpo_diff` | Bid competitiveness signals |
| `campaign_cat4`, `cat1_cat2`, `cat3_cat5` | Cross-categorical interactions |
| `campaign_cost_mean`, `campaign_cpo_mean` | Campaign-level price statistics |
| `uid_impression_num` | User impression sequence number |
| `cat_hash` | Hashed combination of 4 categories |
**Excluded click signals**: `click`, `click_pos`, `click_nb`, `attribution`, `conversion_timestamp`, `conversion_id`
---
## Model
- **Architecture**: CatBoost Classifier (gradient-boosted trees with native categorical handling)
- **Objective**: `binary:logistic`
- **Class imbalance**: `scale_pos_weight = 18` (derived from class ratio)
- **Key hyperparameters**: iterations=1500, depth=8, lr=0.05, l2_leaf_reg=3
- **Early stopping**: 100 rounds on validation PR-AUC
---
## Metrics (Test Set, 45,000 impressions)
### Threshold Scan
| Threshold | Recall | Precision | F1 | F2 | FN | FP | FPR |
|---|---|---|---|---|---|---|---|
| 0.01 | **0.9996** | 0.051 | 0.097 | 0.211 | **1** | 42,256 | 0.989 |
| 0.05 | **0.993** | 0.058 | 0.109 | 0.235 | 15 | 36,537 | 0.855 |
| 0.10 | **0.979** | 0.068 | 0.127 | 0.266 | 48 | 30,263 | 0.708 |
| 0.15 | **0.955** | 0.078 | 0.144 | 0.294 | 101 | 25,527 | 0.597 |
| **0.20** | **0.932** | 0.088 | 0.161 | **0.319** | **154** | 21,799 | 0.510 |
| 0.25 | **0.913** | 0.100 | 0.180 | **0.347** | 197 | 18,613 | 0.436 |
| 0.30 | 0.881 | 0.111 | 0.198 | 0.370 | 268 | 15,862 | 0.371 |
| 0.40 | 0.810 | 0.140 | 0.239 | 0.415 | 429 | 11,194 | 0.262 |
| 0.50 | 0.736 | 0.173 | 0.280 | **0.446** | 597 | 7,943 | 0.186 |
### Ranking Quality
| Metric | Value |
|---|---|
| **AUC** | **0.8581** |
| **PR-AUC** | **0.3430** |
### Feature Importance (Top 10)
| Rank | Feature | Importance |
|---|---|---|
| 1 | `campaign_conv_rate` | 9.94 |
| 2 | `cost_cpo_ratio` | 8.87 |
| 3 | `cat1_cat2` | 5.27 |
| 4 | `cost` | 5.17 |
| 5 | `cost_log` | 4.99 |
| 6 | `cat1` | 4.98 |
| 7 | `campaign_cost_mean` | 4.09 |
| 8 | `campaign_cpo_mean` | 3.57 |
| 9 | `uid` | 3.43 |
| 10 | `cat_hash` | 3.43 |
---
## Choosing Your Threshold
The **cost of a False Negative** (missed conversion) vs **cost of a False Positive** (wasted impression) determines the right threshold:
| If FN is … times more expensive than FP | Recommended Threshold | Recall | FN Rate |
|---|---|---|---|
| 1× (equal cost) | 0.50 | 73.6% | 26.4% |
| 5× | 0.20 | 93.2% | 6.8% |
| 10× | 0.10 | 97.9% | 2.1% |
| 50×+ | 0.01 | 99.96% | 0.04% |
---
## Improvements Over v1 (Baseline XGBoost)
| Metric | v1 Baseline | v2 CatBoost + Engineering | Δ |
|---|---|---|---|
| AUC | 0.847 | **0.858** | **+1.1%** |
| PR-AUC | 0.318 | **0.343** | **+2.5%** |
| Recall@0.50 | 0.673 | 0.736 | +6.3% |
| F2@0.50 | 0.447 | **0.446** | ~same |
| Recall@0.20 | — | **0.932** | New! |
| FN@0.20 | — | **154** (of 2,257) | New! |
**Key improvements**:
1. **Campaign conversion rate** (target encoding) is the #1 feature
2. **Cross-categorical interactions** (`cat1_cat2`, `campaign_cat4`) capture context
3. **Bid competitiveness signals** (`cost_cpo_ratio`, `cost_log`) separate high-intent inventory
4. **CatBoost native categorical handling** avoids one-hot explosion for high-cardinality IDs
---
## Inference Example
```python
import pandas as pd
from catboost import CatBoostClassifier
# Load model
model = CatBoostClassifier()
model.load_model("vtc_best.cbm")
# Features needed (inventory signals only)
features = [
"timestamp","uid","campaign","cost","cpo","time_since_last_click",
"cat1","cat2","cat3","cat4","cat5","cat6","cat7","cat8","cat9",
"uid_freq","campaign_freq","campaign_conv_rate",
"cpo_log","cost_log","tsc_bucket","cost_cpo_ratio","cost_cpo_diff",
"campaign_cost_mean","campaign_cost_std","campaign_cpo_mean","campaign_cpo_std",
"uid_impression_num","cat_hash",
"campaign_cat4","cat1_cat2","cat3_cat5"
]
# Build a single row (you must compute engineered features from raw data)
row = pd.DataFrame({...})
# Predict
proba = model.predict_proba(row[features])[0][1]
print(f"Conversion probability: {proba:.4f}")
# Apply threshold based on business cost of FN vs FP
# thr=0.20 → ~93% recall (catches most converters)
# thr=0.50 → ~74% recall, lower FP rate
```
---
## Training Details
- Framework: CatBoost 1.2.7
- Data: 300K rows (210K train / 45K val / 45K test)
- Conversion rate: 5.02%
- Class weight: `scale_pos_weight = 18`
- Hardware: CPU-only
- Time: ~8 minutes
---
## Citation
Original dataset: [Criteo Attribution Dataset](https://huggingface.co/datasets/criteo/criteo-attribution-dataset)
---
## Files
- `vtc_best.cbm` — CatBoost model
- `final_importance.csv` — Per-feature importance
<!-- ml-intern-provenance -->
## Generated by ML Intern
This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = 'nithish277/vtc-inventory-signals'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
```
For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.