	---
	tags:
	- xgboost
	- catboost
	- conversion-prediction
	- view-through-attribution
	- advertising
	- classification
	- imbalanced-data
	- ml-intern
	license: apache-2.0
	library_name: catboost
	---

	# View-Through Conversion (VTC) Predictor v2 — Inventory Signals Only

	Task: Predict the probability that an ad impression leads to a conversion, using inventory signals only.
	Conversions are counted independently of clicks (post-impression, or view-through).

	Key constraint: false negatives are expensive, so we optimize for recall and F2 rather than precision.

	---

	## Dataset

	- Source: [`criteo/criteo-attribution-dataset`](https://huggingface.co/datasets/criteo/criteo-attribution-dataset)
	- Size: 300K sampled rows (6M+ available in full dataset)
	- Target: `conversion` (0/1) — independent of click
	- Features: 28 engineered features from inventory signals only (no click data)

	### Inventory Features Used

	| Feature | Description |
	|---|---|
	| `timestamp`, `uid`, `campaign` | Basic impression metadata |
	| `cost`, `cpo` | Bid / pricing signals |
	| `cat1`–`cat9` | Categorical inventory features |
	| `time_since_last_click` | Recency (set to median if no prior click) |
	| `uid_freq`, `campaign_freq` | Frequency encoding (how common this user or campaign is) |
	| `campaign_conv_rate` | Target-encoded campaign conversion rate |
	| `cpo_log`, `cost_log` | Log-transformed prices |
	| `tsc_bucket` | Binned time-since-last-click |
	| `cost_cpo_ratio`, `cost_cpo_diff` | Bid competitiveness signals |
	| `campaign_cat4`, `cat1_cat2`, `cat3_cat5` | Cross-categorical interactions |
	| `campaign_cost_mean`, `campaign_cpo_mean` | Campaign-level price statistics |
	| `uid_impression_num` | User impression sequence number |
	| `cat_hash` | Hashed combination of 4 categories |
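
The card does not give the exact formulas behind these columns, so the following is only a minimal pandas sketch of how a few of them might be derived. The function name `add_inventory_features`, the use of `log1p`, the `eps` guard, and the `_` separator for crossed categories are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def add_inventory_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a subset of the engineered columns (assumed formulas, not the author's)."""
    # Sort by time so the per-user sequence number follows impression order.
    out = df.sort_values("timestamp").copy()

    # Frequency encoding: number of rows sharing the same user / campaign.
    out["uid_freq"] = out.groupby("uid")["uid"].transform("count")
    out["campaign_freq"] = out.groupby("campaign")["campaign"].transform("count")

    # Log-transformed prices; log1p keeps zero values finite.
    out["cost_log"] = np.log1p(out["cost"])
    out["cpo_log"] = np.log1p(out["cpo"])

    # Bid competitiveness: ratio and difference of cost and cpo.
    eps = 1e-9  # assumed guard against division by zero
    out["cost_cpo_ratio"] = out["cost"] / (out["cpo"] + eps)
    out["cost_cpo_diff"] = out["cost"] - out["cpo"]

    # Cross-categorical interaction as a joined string key.
    out["cat1_cat2"] = out["cat1"].astype(str) + "_" + out["cat2"].astype(str)

    # 1-based position of each impression within its user's history.
    out["uid_impression_num"] = out.groupby("uid").cumcount() + 1
    return out
```

Frequency counts computed this way depend on the rows passed in, so for inference they would need to come from the same reference data used at training time.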

	Excluded click signals: `click`, `click_pos`, `click_nb`, `attribution`, `conversion_timestamp`, `conversion_id`

	---

	## Model

	- Architecture: CatBoost Classifier (gradient-boosted trees with native categorical handling)
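	- Objective: `Logloss` (binary log loss, the CatBoost counterpart of XGBoost's `binary:logistic`)
	- Class imbalance: `scale_pos_weight = 18` (close to the negative-to-positive ratio, 94.98 / 5.02 ≈ 18.9)
	- Key hyperparameters: `iterations=1500`, `depth=8`, `learning_rate=0.05`, `l2_leaf_reg=3`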
	- Early stopping: 100 rounds on validation PR-AUC
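
As a rough sketch, the settings above could be expressed as CatBoost keyword arguments like this. `Logloss` is CatBoost's binary log-loss objective and `PRAUC` its precision-recall AUC metric; the mapping of the card's settings onto these exact parameter names is an assumption, as are the names `CATBOOST_PARAMS` and `build_model`. The class ratio is recomputed from the 5.02% conversion rate reported under Training Details.

```python
# Conversion rate reported in Training Details.
conversion_rate = 0.0502
# Negatives per positive; the card's weight of 18 is close to this value.
neg_pos_ratio = (1 - conversion_rate) / conversion_rate

# Assumed mapping of the card's settings onto CatBoost parameter names.
CATBOOST_PARAMS = {
    "loss_function": "Logloss",
    "eval_metric": "PRAUC",
    "iterations": 1500,
    "depth": 8,
    "learning_rate": 0.05,
    "l2_leaf_reg": 3,
    "scale_pos_weight": 18,
    "early_stopping_rounds": 100,
}


def build_model():
    """Create an unfitted classifier; catboost is imported only when called."""
    from catboost import CatBoostClassifier

    return CatBoostClassifier(**CATBOOST_PARAMS)
```

Early stopping only takes effect when a validation set is supplied, for example `build_model().fit(X_train, y_train, eval_set=(X_val, y_val))`, where the four data objects are placeholders.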

	---

	## Metrics (Test Set, 45,000 impressions)

	### Threshold Scan

	| Threshold | Recall | Precision | F1 | F2 | FN | FP | FPR |
	|---|---|---|---|---|---|---|---|
	| 0.01 | 0.9996 | 0.051 | 0.097 | 0.211 | 1 | 42,256 | 0.989 |
	| 0.05 | 0.993 | 0.058 | 0.109 | 0.235 | 15 | 36,537 | 0.855 |
	| 0.10 | 0.979 | 0.068 | 0.127 | 0.266 | 48 | 30,263 | 0.708 |
	| 0.15 | 0.955 | 0.078 | 0.144 | 0.294 | 101 | 25,527 | 0.597 |
	| 0.20 | 0.932 | 0.088 | 0.161 | 0.319 | 154 | 21,799 | 0.510 |
	| 0.25 | 0.913 | 0.100 | 0.180 | 0.347 | 197 | 18,613 | 0.436 |
	| 0.30 | 0.881 | 0.111 | 0.198 | 0.370 | 268 | 15,862 | 0.371 |
	| 0.40 | 0.810 | 0.140 | 0.239 | 0.415 | 429 | 11,194 | 0.262 |
	| 0.50 | 0.736 | 0.173 | 0.280 | 0.446 | 597 | 7,943 | 0.186 |
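
Each row of the scan can be rebuilt from its FN and FP counts. The sketch below assumes 2,257 positives among the 45,000 test impressions (the figure quoted in the v1 comparison) and uses the standard F-beta definition, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); the function name `scan_metrics` is illustrative.

```python
def scan_metrics(fn: int, fp: int, positives: int = 2257,
                 total: int = 45000, beta: float = 2.0) -> dict:
    """Derive recall, precision, F-beta and FPR from error counts."""
    negatives = total - positives
    tp = positives - fn  # converters the model caught
    recall = tp / positives
    precision = tp / (tp + fp)
    b2 = beta ** 2
    # beta > 1 weights recall more heavily than precision.
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    fpr = fp / negatives
    return {"recall": recall, "precision": precision, "fbeta": fbeta, "fpr": fpr}


# Threshold 0.20 row: FN = 154, FP = 21,799.
row_020 = scan_metrics(fn=154, fp=21799)
```

With the defaults, `row_020` should agree with the 0.20 row of the table up to rounding.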

	### Ranking Quality

	| Metric | Value |
	|---|---|
	| AUC | 0.8581 |
	| PR-AUC | 0.3430 |

	### Feature Importance (Top 10)

	| Rank | Feature | Importance |
	|---|---|---|
	| 1 | `campaign_conv_rate` | 9.94 |
	| 2 | `cost_cpo_ratio` | 8.87 |
	| 3 | `cat1_cat2` | 5.27 |
	| 4 | `cost` | 5.17 |
	| 5 | `cost_log` | 4.99 |
	| 6 | `cat1` | 4.98 |
	| 7 | `campaign_cost_mean` | 4.09 |
	| 8 | `campaign_cpo_mean` | 3.57 |
	| 9 | `uid` | 3.43 |
	| 10 | `cat_hash` | 3.43 |

	---

	## Choosing Your Threshold

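	The right threshold depends on how much a false negative (missed conversion) costs relative to a false positive (wasted impression). The table is a rule of thumb; the recall in each row comes from the threshold scan above, and FN rate is 1 minus recall: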

	| If FN is … times more expensive than FP | Recommended Threshold | Recall | FN Rate |
	|---|---|---|---|
	| 1× (equal cost) | 0.50 | 73.6% | 26.4% |
	| 5× | 0.20 | 93.2% | 6.8% |
	| 10× | 0.10 | 97.9% | 2.1% |
	| 50×+ | 0.01 | 99.96% | 0.04% |
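
One way to sanity-check a threshold for a specific cost ratio is to total the error cost over the scanned thresholds. The sketch below assumes a simple count-based cost, `ratio * FN + FP`, which ignores any value earned from true positives; the names `SCAN`, `total_cost`, and `cheapest_threshold` are illustrative. Under this particular cost model the cheapest scanned threshold can differ from the table's recommendation, so both are best treated as starting points.

```python
# (threshold, FN, FP) rows copied from the Threshold Scan table.
SCAN = [
    (0.01, 1, 42256), (0.05, 15, 36537), (0.10, 48, 30263),
    (0.15, 101, 25527), (0.20, 154, 21799), (0.25, 197, 18613),
    (0.30, 268, 15862), (0.40, 429, 11194), (0.50, 597, 7943),
]


def total_cost(fn: int, fp: int, fn_to_fp_ratio: float) -> float:
    """Cost in units of one false positive (assumed cost model)."""
    return fn_to_fp_ratio * fn + fp


def cheapest_threshold(fn_to_fp_ratio: float, scan=SCAN) -> float:
    """Return the scanned threshold with the lowest total cost."""
    best = min(scan, key=lambda row: total_cost(row[1], row[2], fn_to_fp_ratio))
    return best[0]


# Compare a few cost ratios across the scanned thresholds.
for ratio in (1, 5, 10, 50):
    print(ratio, cheapest_threshold(ratio))
```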

	---

	## Improvements Over v1 (Baseline XGBoost)

	| Metric | v1 Baseline | v2 CatBoost + Engineering | Δ (absolute) |
	|---|---|---|---|
	| AUC | 0.847 | 0.858 | +0.011 |
	| PR-AUC | 0.318 | 0.343 | +0.025 |
	| Recall@0.50 | 0.673 | 0.736 | +0.063 |
	| F2@0.50 | 0.447 | 0.446 | -0.001 |
	| Recall@0.20 | not reported | 0.932 | n/a |
	| FN@0.20 | not reported | 154 (of 2,257) | n/a |

	Key improvements:
	1. Campaign conversion rate (target encoding) is the #1 feature
	2. Cross-categorical interactions (`cat1_cat2`, `campaign_cat4`) capture context
	3. Bid competitiveness signals (`cost_cpo_ratio`, `cost_log`) separate high-intent inventory
	4. CatBoost native categorical handling avoids one-hot explosion for high-cardinality IDs
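
The card does not say how `campaign_conv_rate` was computed. A common way to target-encode without leaking labels is to learn the rates on training rows only and shrink rare campaigns toward the global rate; the sketch below assumes that approach, and the function names and the `prior_weight` smoothing parameter are hypothetical.

```python
import pandas as pd


def fit_campaign_conv_rate(train: pd.DataFrame, prior_weight: float = 20.0):
    """Learn a smoothed per-campaign conversion rate from training rows only."""
    global_rate = train["conversion"].mean()
    stats = train.groupby("campaign")["conversion"].agg(["sum", "count"])
    # Campaigns with few impressions are pulled toward the global rate.
    smoothed = (stats["sum"] + prior_weight * global_rate) / (
        stats["count"] + prior_weight
    )
    return smoothed, global_rate


def apply_campaign_conv_rate(df: pd.DataFrame, smoothed: pd.Series,
                             global_rate: float) -> pd.Series:
    """Map learned rates onto any split; unseen campaigns get the global rate."""
    return df["campaign"].map(smoothed).fillna(global_rate)
```

Fitting on the training split and then applying the same mapping to validation, test, and inference rows keeps the encoded feature from seeing the labels it is later evaluated on.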

	---

	## Inference Example

	```python
	import pandas as pd
	from catboost import CatBoostClassifier

	# Load the trained model from the repository file
	model = CatBoostClassifier()
	model.load_model("vtc_best.cbm")

	# Features the model expects (inventory signals only)
	features = [
	    "timestamp", "uid", "campaign", "cost", "cpo", "time_since_last_click",
	    "cat1", "cat2", "cat3", "cat4", "cat5", "cat6", "cat7", "cat8", "cat9",
	    "uid_freq", "campaign_freq", "campaign_conv_rate",
	    "cpo_log", "cost_log", "tsc_bucket", "cost_cpo_ratio", "cost_cpo_diff",
	    "campaign_cost_mean", "campaign_cost_std", "campaign_cpo_mean", "campaign_cpo_std",
	    "uid_impression_num", "cat_hash",
	    "campaign_cat4", "cat1_cat2", "cat3_cat5",
	]

	# Build a single-row DataFrame with one column per entry in `features`.
	# Placeholder: you must compute the engineered features from raw data first.
	row = pd.DataFrame({...})

	# Probability of the positive class (conversion) for the first row
	proba = model.predict_proba(row[features])[0, 1]
	print(f"Conversion probability: {proba:.4f}")

	# Apply a threshold based on the business cost of FN vs FP
	# threshold=0.20 -> ~93% recall (catches most converters)
	# threshold=0.50 -> ~74% recall, lower FP rate
	threshold = 0.20
	print(f"Flag as likely converter: {proba >= threshold}")
	```

	---

	## Training Details

	- Framework: CatBoost 1.2.7
	- Data: 300K rows (210K train / 45K val / 45K test)
	- Conversion rate: 5.02%
	- Class weight: `scale_pos_weight = 18`
	- Hardware: CPU-only
	- Time: ~8 minutes
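
The 210K / 45K / 45K figures correspond to a 70 / 15 / 15 split. The card does not state whether the split was random, stratified, or time-based; the sketch below assumes a plain seeded random split, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_rows: int, val_frac: float = 0.15,
                  test_frac: float = 0.15, seed: int = 42):
    """Shuffle row indices and cut them into train / val / test lists."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)  # seeded for reproducibility
    n_val = round(n_rows * val_frac)
    n_test = round(n_rows * test_frac)
    n_train = n_rows - n_val - n_test
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]
    return train, val, test


train_idx, val_idx, test_idx = split_indices(300_000)
```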

	---

	## Citation

	Original dataset: [Criteo Attribution Dataset](https://huggingface.co/datasets/criteo/criteo-attribution-dataset)

	---

	## Files

	- `vtc_best.cbm` — CatBoost model
	- `final_importance.csv` — Per-feature importance

	<!-- ml-intern-provenance -->
	## Generated by ML Intern

	This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.

	- Try ML Intern: https://smolagents-ml-intern.hf.space
	- Source code: https://github.com/huggingface/ml-intern

	## Usage

	```python
	from catboost import CatBoostClassifier
	from huggingface_hub import hf_hub_download

	model_id = 'nithish277/vtc-inventory-signals'
	# Download the CatBoost model file from the Hub, then load it
	model_path = hf_hub_download(repo_id=model_id, filename='vtc_best.cbm')
	model = CatBoostClassifier()
	model.load_model(model_path)
	```

	The repository holds a CatBoost model file rather than a `transformers` checkpoint; see Inference Example for building the feature row and scoring it.