Trelegy Ellipta: Copay Fraud Detection with Isolation Forest
Product: Trelegy Ellipta (GSK), Fluticasone furoate / Umeclidinium / Vilanterol
Method: Hybrid Rules + Isolation Forest + SHAP Explainability
Use Case: Detecting copay card fraud in pharmaceutical patient assistance programs
Project Overview
This project implements an end-to-end copay fraud detection system for Trelegy Ellipta, GSK's blockbuster triple-combination inhaler (~$8B+ annual revenue). The system uses a hybrid approach combining hard business rules with unsupervised anomaly detection (Isolation Forest) and SHAP-based explainability.
Why Trelegy?
- High-value drug (~$600-700/month retail), so each claim carries a high copay card value
- Chronic daily use: monthly refills create recurring fraud opportunities
- Fixed 30-day supply (exactly 30 doses per inhaler), which makes refill anomalies easy to detect
- Multiple assistance programs, which create card stacking opportunities
- Even 2% fraud leakage on Trelegy = ~$160M/year in losses
Fraud Types Detected
| # | Fraud Type | Description | Detection Method |
|---|---|---|---|
| 1 | Early Refill Abuse | Refilling before 30-day supply runs out (< 23 days) | Rules + IF |
| 2 | Pharmacy Hopping | Filling at multiple pharmacies to stack copay cards | IF (behavioral) |
| 3 | Ghost Fills | Pharmacy bills but doesn't dispense the inhaler | IF (pharmacy patterns) |
| 4 | Prescriber Collusion | Single prescriber generating abnormal volume; wrong specialty | Rules + IF |
| 5 | Strength-Switch | Alternating between NDCs (100/62.5/25 ↔ 200/62.5/25) without clinical reason | Rules + IF |
| 6 | Government Insurance Abuse | Medicare/Medicaid patient using copay card (violates program terms) | Rules |
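As a minimal sketch of how the rule-based rows in the table could be expressed, the snippet below checks two hard rules on a single claim. The `Claim` record, its field names, and the `apply_hard_rules` helper are hypothetical and are not taken from the project's code; only the 23-day early-refill threshold and the Medicare/Medicaid restriction come from the table.

```python
from dataclasses import dataclass
from typing import List, Optional

EARLY_REFILL_DAYS = 23  # threshold from the table: refill before day 23
GOVT_INSURANCE = {"medicare", "medicaid"}


@dataclass
class Claim:
    # Hypothetical claim record; field names are assumptions for illustration.
    claim_id: str
    insurance_type: str
    days_since_last_fill: Optional[int]  # None for a patient's first fill


def apply_hard_rules(claim: Claim) -> List[str]:
    """Return the names of the hard rules this claim violates."""
    flags = []
    gap = claim.days_since_last_fill
    if gap is not None and gap < EARLY_REFILL_DAYS:
        flags.append("early_refill")
    if claim.insurance_type.lower() in GOVT_INSURANCE:
        flags.append("govt_insurance")
    return flags


claims = [
    Claim("C1", "commercial", 30),  # normal monthly refill
    Claim("C2", "commercial", 12),  # refilled after only 12 days
    Claim("C3", "Medicare", None),  # government-insured patient using a copay card
]
flagged = {c.claim_id: apply_hard_rules(c) for c in claims}
print(flagged)
```

Returning rule names rather than a single boolean keeps the reason for each flag available to an investigator.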
Architecture
Raw Copay Claims
         │
         ▼
┌─────────────────┐
│ Phase 1: Rules  │ ← 10 Trelegy-specific hard rules
│ (Hard Flags)    │   (early refill, impossible qty, govt insurance, etc.)
└────────┬────────┘
         ▼
┌─────────────────┐
│ Phase 2: IF     │ ← Isolation Forest trained on rule-clean data
│ (Anomaly Detect)│   (catches unknown/novel fraud patterns)
└────────┬────────┘
         ▼
┌─────────────────┐
│ Phase 3: Score  │ ← Combined priority = 50% IF + 30% rule severity + 20% rule flag
│ (Priority Rank) │   Risk tiers: Low / Medium / High / Critical
└────────┬────────┘
         ▼
┌─────────────────┐
│ Phase 4: SHAP   │ ← TreeExplainer → "Why was this claim flagged?"
│ (Explainability)│   Top features driving each anomaly score
└────────┬────────┘
         ▼
┌─────────────────┐
│ Phase 5: Eval   │ ← AUPRC, AUROC, F1, Precision@K, detection by fraud type
│ (Reporting)     │   + 5 publication-quality visualizations
└─────────────────┘
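The Phase 3 blend in the diagram can be sketched as follows. The weights (50% / 30% / 20%) and the four tier names come from the diagram. The tier cut-offs and the assumption that both the Isolation Forest score and the rule severity are already scaled to [0, 1] are illustrative, since the README does not state them.

```python
def combined_priority(if_score: float, rule_severity: float, rule_flag: bool) -> float:
    """Weighted blend from the diagram: 50% IF + 30% rule severity + 20% rule flag.

    Assumes if_score and rule_severity are already scaled to [0, 1].
    """
    return 0.5 * if_score + 0.3 * rule_severity + 0.2 * (1.0 if rule_flag else 0.0)


def risk_tier(priority: float) -> str:
    # Cut-offs are illustrative assumptions; the README does not state them.
    if priority >= 0.75:
        return "Critical"
    if priority >= 0.50:
        return "High"
    if priority >= 0.25:
        return "Medium"
    return "Low"


examples = [
    (0.9, 1.0, True),   # strong anomaly plus a severe rule hit
    (0.6, 0.0, False),  # anomaly only, no rule fired
    (0.1, 0.0, False),  # unremarkable claim
]
for if_score, severity, flag in examples:
    p = combined_priority(if_score, severity, flag)
    print(f"{p:.2f} {risk_tier(p)}")
```

Because half of the weight sits on the Isolation Forest score, a claim that trips no rule can still reach the Medium or High tier on anomaly score alone.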
Project Structure
trelegy_copay_fraud/
├── README.md                           # This file
├── run_all.py                          # Master script: run this
├── generate_synthetic_data.py          # Synthetic Trelegy copay claims generator
├── feature_engineering.py              # Feature engineering pipeline (48 features)
├── fraud_detection_pipeline.py         # Full detection pipeline (rules + IF + SHAP + eval)
├── requirements.txt                    # Python dependencies
├── data/                               # Generated synthetic data
│   ├── trelegy_copay_claims.csv        # Main claims dataset
│   ├── patient_master.csv              # Patient demographics
│   ├── pharmacy_master.csv             # Pharmacy reference
│   └── prescriber_master.csv           # Prescriber reference
└── results/                            # Pipeline outputs
    ├── investigation_queue_top500.csv  # Top 500 flagged claims for review
    ├── scored_claims_full.csv          # All claims with scores & risk tiers
    ├── feature_importance_shap.csv     # SHAP feature importance ranking
    ├── metrics.json                    # Evaluation metrics
    ├── 01_evaluation_metrics.png       # ROC, PR curve, confusion matrix
    ├── 02_shap_summary.png             # SHAP beeswarm plot
    ├── 03_shap_bar_importance.png      # SHAP bar chart
    ├── 04_detection_by_fraud_type.png  # Detection rates by fraud type
    ├── 05_risk_tier_distribution.png   # Risk tier analysis
    └── model/
        ├── isolation_forest_model.pkl  # Trained IF model
        ├── scaler.pkl                  # StandardScaler
        ├── encoder.pkl                 # OrdinalEncoder
        └── feature_names.pkl           # Feature name list
Quick Start
# 1. Install dependencies
pip install -r requirements.txt
# 2. Run the full pipeline
python run_all.py
This will:
- Generate ~50,000 synthetic Trelegy copay claims (with ~3% injected fraud)
- Engineer 48 features (temporal, behavioral, rolling windows, Trelegy-specific)
- Apply 10 hard business rules
- Train Isolation Forest on rule-clean data
- Compute SHAP explanations for top features
- Generate evaluation metrics and 5 visualizations
- Output an investigation queue (top 500 highest-priority claims)
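To illustrate the "train Isolation Forest on rule-clean data" step, here is a small sketch using scikit-learn's `IsolationForest` and `StandardScaler`. The two-column toy matrix, the hyperparameters, and the all-false `rule_flag` mask are assumptions made for this example and do not reflect the pipeline's actual 48-feature input or settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Hypothetical 2-feature matrix: [days_between_fills, unique_pharmacies_overall]
normal = np.column_stack([rng.normal(30, 1.5, 500), rng.integers(1, 3, 500)])
suspicious = np.array([[8.0, 6.0], [10.0, 7.0], [5.0, 9.0]])

# Stand-in for the hard-rule output: here no "normal" row is rule-flagged.
rule_flag = np.zeros(len(normal), dtype=bool)

# Fit the scaler and the forest only on claims that passed the hard rules.
scaler = StandardScaler().fit(normal[~rule_flag])
model = IsolationForest(n_estimators=200, contamination="auto", random_state=42)
model.fit(scaler.transform(normal[~rule_flag]))

# score_samples returns higher values for more normal points, so negate it
# to get an anomaly score where higher means more suspicious.
anomaly_normal = -model.score_samples(scaler.transform(normal))
anomaly_suspicious = -model.score_samples(scaler.transform(suspicious))
print(anomaly_normal.mean(), anomaly_suspicious.min())
```

Fitting only on rule-clean rows keeps claims already caught by the rules from shaping what the forest treats as normal.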
Features (48 Total)
Temporal Features
- days_between_fills: Days since last fill (normal: 28-33 for Trelegy)
- early_refill_flag: Binary; fill before day 23
- days_since_first_fill: Patient tenure in program
- claim_month, claim_dow: Seasonality
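The temporal features above could be derived from one patient's fill dates roughly as shown here. This is a standard-library sketch: the `temporal_features` helper and its tuple layout are hypothetical, and the first fill is given no gap (`None`), which is an assumption about how the pipeline treats it.

```python
from datetime import date
from typing import List, Optional, Tuple

EARLY_REFILL_DAYS = 23  # "fill before day 23" per the feature list


def temporal_features(fill_dates: List[date]) -> List[Tuple[Optional[int], int, int]]:
    """For one patient's fills, return (days_between_fills, early_refill_flag,
    days_since_first_fill) per fill, in chronological order."""
    ordered = sorted(fill_dates)
    rows = []
    for i, d in enumerate(ordered):
        gap = (d - ordered[i - 1]).days if i > 0 else None  # first fill has no gap
        early = int(gap is not None and gap < EARLY_REFILL_DAYS)
        tenure = (d - ordered[0]).days
        rows.append((gap, early, tenure))
    return rows


fills = [date(2024, 1, 1), date(2024, 1, 31), date(2024, 2, 10), date(2024, 3, 11)]
features = temporal_features(fills)
for row in features:
    print(row)
```

In this example only the third fill, which comes 10 days after the previous one, gets `early_refill_flag = 1`.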
Rolling Window Aggregates (7d/30d/90d)
- patient_fill_count_{7,30,90}d: Fill velocity per patient
- patient_copay_spend_{7,30,90}d: Copay card spend per patient
- patient_total_claim_{7,30,90}d: Total claim amount per patient
- pharmacy_claim_count_{30,90}d: Volume per pharmacy
- prescriber_claim_count_{30,90}d: Volume per prescriber
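One way to compute the per-patient fill counts over trailing windows is sketched below with the standard library. The window convention (the N calendar days ending on, and including, the current fill date) and the `rolling_fill_counts` helper are assumptions; the actual pipeline may define its windows differently.

```python
from bisect import bisect_left
from datetime import date, timedelta
from typing import Dict, List

WINDOWS = (7, 30, 90)


def rolling_fill_counts(fill_dates: List[date]) -> List[Dict[str, int]]:
    """For each fill, count the patient's fills in the trailing N-day window
    ending on that fill date (the current fill is included)."""
    ordered = sorted(fill_dates)
    out = []
    for i, d in enumerate(ordered):
        row = {}
        for n in WINDOWS:
            start = d - timedelta(days=n - 1)  # window is [d - (n - 1), d]
            first = bisect_left(ordered, start, 0, i + 1)  # first fill inside the window
            row[f"patient_fill_count_{n}d"] = i + 1 - first
        out.append(row)
    return out


fills = [date(2024, 1, 1), date(2024, 1, 4), date(2024, 1, 20), date(2024, 3, 1)]
counts = rolling_fill_counts(fills)
for d, row in zip(sorted(fills), counts):
    print(d, row)
```

Only fills up to the current one are counted, so the feature does not leak information from future claims.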
Patient Behavioral
- unique_pharmacies_overall: Number of distinct pharmacies used
- unique_programs_per_patient: Card stacking indicator
- total_fills_per_patient: Lifetime fill count
- avg_days_between_fills: Average refill gap
- std_days_between_fills: Refill pattern consistency
- max_fills_any_30d: Peak fill velocity
Pharmacy & Prescriber
- pharmacy_claims_per_patient_ratio: Ghost fill indicator
- prescriber_specialty_valid: Specialty match flag
- prescriber_total_claims: Volume anomaly
Trelegy-Specific
- ndc_switch_flag: Strength switching (0173-0893 ↔ 0173-0887)
- govt_insurance_flag: Medicare/Medicaid with copay card
- cross_state_fill: Patient state ≠ pharmacy state
- new_patient_burst: >1 fill in first 7 days of enrollment
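Three of the Trelegy-specific flags can be sketched as below. The dictionary keys, the `trelegy_specific_flags` helper, the use of the first fill date as the enrollment date, and the comparison of NDCs on their first nine characters (labeler and product code) are all assumptions made for illustration.

```python
from datetime import date
from typing import Dict, List


def trelegy_specific_flags(claims: List[Dict]) -> List[Dict[str, int]]:
    """Compute per-claim flags for one patient's claims.

    Each claim is a dict with hypothetical keys: fill_date, ndc,
    patient_state, pharmacy_state.
    """
    if not claims:
        return []
    ordered = sorted(claims, key=lambda c: c["fill_date"])
    first_date = ordered[0]["fill_date"]  # assumed to stand in for enrollment date
    # new_patient_burst: more than one fill within the first 7 days
    early_fills = sum((c["fill_date"] - first_date).days < 7 for c in ordered)
    burst = int(early_fills > 1)
    rows = []
    for i, c in enumerate(ordered):
        prev_product = ordered[i - 1]["ndc"][:9] if i > 0 else None
        rows.append({
            # strength switch: product code differs from the previous fill
            "ndc_switch_flag": int(prev_product is not None and c["ndc"][:9] != prev_product),
            "cross_state_fill": int(c["patient_state"] != c["pharmacy_state"]),
            "new_patient_burst": burst,
        })
    return rows


claims = [
    {"fill_date": date(2024, 1, 1), "ndc": "0173-0893-14",
     "patient_state": "OH", "pharmacy_state": "OH"},
    {"fill_date": date(2024, 1, 4), "ndc": "0173-0887-14",
     "patient_state": "OH", "pharmacy_state": "KY"},
]
flags = trelegy_specific_flags(claims)
print(flags)
```

Here `new_patient_burst` is treated as a patient-level flag copied onto every claim, which is one possible reading of the feature description.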
Evaluation Metrics
Per literature best practices (arXiv:2312.13896, arXiv:2208.11904):
- AUPRC: Primary metric (best for imbalanced fraud data)
- Precision@K: Operational metric (investigator queue efficiency)
- F1-Score: Balance of precision and recall
- AUROC: Supplementary (can be misleading at <1% fraud rate)
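For reference, Precision@K and a step-wise AUPRC (average precision) can be computed from labels and scores as in this standard-library sketch. The helper names and toy data are made up for the example, and ties in scores are broken arbitrarily, so results on tied scores may differ from a library implementation.

```python
from typing import Sequence


def precision_at_k(labels: Sequence[int], scores: Sequence[float], k: int) -> float:
    """Fraction of true fraud cases among the k highest-scored claims."""
    if k <= 0:
        raise ValueError("k must be positive")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:k]
    return sum(label for _, label in top) / len(top)


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    hits, ap = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            ap += hits / rank  # precision at each rank where a fraud case appears
    return ap / total_pos if total_pos else 0.0


# Toy data: 1 = fraud, 0 = legitimate; higher score = more suspicious.
labels = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
scores = [0.10, 0.90, 0.30, 0.80, 0.70, 0.20, 0.05, 0.40, 0.35, 0.15]
p_at_3 = precision_at_k(labels, scores, 3)
ap = average_precision(labels, scores)
print(round(p_at_3, 3), round(ap, 3))
```

Precision@K maps directly onto the investigation queue: with K = 500 it is the share of the top-500 flagged claims that are actually fraudulent.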
Trelegy Product Details
| Attribute | Value |
|---|---|
| Brand | Trelegy Ellipta |
| Manufacturer | GSK (GlaxoSmithKline) |
| Active Ingredients | Fluticasone furoate / Umeclidinium / Vilanterol |
| Indications | COPD (maintenance), Asthma (adults ≥18) |
| NDC (COPD/Asthma) | 0173-0893-14 (100/62.5/25 mcg) |
| NDC (Asthma) | 0173-0887-14 (200/62.5/25 mcg) |
| Days Supply | 30 days (30 doses, 1 inhalation/day) |
| Retail Price | ~$600-700/month |
References
- Liu, Ting, Zhou. "Isolation Forest" (IEEE ICDM 2008). Original algorithm.
- Hariri et al. "Extended Isolation Forest" (2019). High-dimensional improvement.
- Thimonier et al. arXiv:2312.13896. IF vs LightGBM fraud benchmark.
- Amazon FDB. arXiv:2208.14417. Feature engineering best practices.
- ZS Associates. Pharma Copay Fraud Case Study.
- KPMG. Fighting Copay Fraud (PDF).
Disclaimer
This project uses synthetic data only. No real patient, pharmacy, or prescriber data is included. The synthetic data is designed to be realistic for development and testing purposes but does not represent actual GSK copay program transactions.
License
MIT License