Trelegy Ellipta: Copay Fraud Detection with Isolation Forest

Product: Trelegy Ellipta (GSK), Fluticasone furoate / Umeclidinium / Vilanterol
Method: Hybrid Rules + Isolation Forest + SHAP Explainability
Use Case: Detecting copay card fraud in pharmaceutical patient assistance programs

🏗️ Project Overview

This project implements an end-to-end copay fraud detection system for Trelegy Ellipta, GSK's blockbuster triple-combination inhaler with multi-billion-dollar annual revenue. The system is a hybrid: hard business rules catch known violations, unsupervised anomaly detection (Isolation Forest) surfaces patterns the rules do not cover, and SHAP explains why each claim was flagged.

Why Trelegy?

  • High-value drug (~$600-700/month retail) → high copay card value per claim
  • Chronic daily use → monthly refills create recurring fraud opportunities
  • Fixed 30-day supply (30 doses per inhaler) → refill anomalies are easy to detect
  • Multiple assistance programs → card stacking opportunities
  • Even 2% fraud leakage on a multi-billion-dollar product means tens of millions of dollars or more in losses per year

📊 Fraud Types Detected

| # | Fraud Type | Description | Detection Method |
|---|---|---|---|
| 1 | Early Refill Abuse | Refilling before the 30-day supply runs out (< 23 days) | Rules + IF |
| 2 | Pharmacy Hopping | Filling at multiple pharmacies to stack copay cards | IF (behavioral) |
| 3 | Ghost Fills | Pharmacy bills but doesn't dispense the inhaler | IF (pharmacy patterns) |
| 4 | Prescriber Collusion | Single prescriber generating abnormal volume; wrong specialty | Rules + IF |
| 5 | Strength-Switch | Alternating between NDCs (100/62.5/25 ↔ 200/62.5/25) without clinical reason | Rules + IF |
| 6 | Government Insurance Abuse | Medicare/Medicaid patient using copay card (violates program terms) | Rules |
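
The rule-driven rows of the table can be pictured as simple predicates over a claim record. The sketch below is only an illustration: the `Claim` fields (`days_since_last_fill`, `insurance_type`) are hypothetical names, and the real rules in `fraud_detection_pipeline.py` may be defined differently. The 23-day threshold is the one given in the table.

```python
from dataclasses import dataclass
from typing import List, Optional

EARLY_REFILL_DAYS = 23  # threshold taken from the fraud-type table
GOVT_INSURANCE = {"medicare", "medicaid"}


@dataclass
class Claim:
    # Field names are illustrative assumptions, not the project's schema.
    claim_id: str
    days_since_last_fill: Optional[int]  # None for a patient's first fill
    insurance_type: str


def rule_flags(claim: Claim) -> List[str]:
    """Return the names of the hard rules this claim violates."""
    flags = []
    gap = claim.days_since_last_fill
    if gap is not None and gap < EARLY_REFILL_DAYS:
        flags.append("early_refill")
    if claim.insurance_type.lower() in GOVT_INSURANCE:
        flags.append("govt_insurance")
    return flags


claims = [
    Claim("c1", 30, "commercial"),  # normal monthly refill
    Claim("c2", 12, "commercial"),  # refilled far too early
    Claim("c3", None, "Medicare"),  # government-insured patient using a copay card
]
flagged = {c.claim_id: rule_flags(c) for c in claims}
```

Returning a list of rule names rather than a single boolean keeps the reason for each flag available to later stages such as severity scoring.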

🏛️ Architecture

Raw Copay Claims
      │
      ▼
┌─────────────────┐
│ Phase 1: Rules  │ ← 10 Trelegy-specific hard rules
│ (Hard Flags)    │   (early refill, impossible qty, govt insurance, etc.)
└────────┬────────┘
         ▼
┌─────────────────┐
│ Phase 2: IF     │ ← Isolation Forest trained on rule-clean data
│ (Anomaly Detect)│   (catches unknown/novel fraud patterns)
└────────┬────────┘
         ▼
┌─────────────────┐
│ Phase 3: Score  │ ← Combined priority = 50% IF + 30% rule severity + 20% rule flag
│ (Priority Rank) │   Risk tiers: Low / Medium / High / Critical
└────────┬────────┘
         ▼
┌─────────────────┐
│ Phase 4: SHAP   │ ← TreeExplainer → "Why was this claim flagged?"
│ (Explainability)│   Top features driving each anomaly score
└────────┬────────┘
         ▼
┌─────────────────┐
│ Phase 5: Eval   │ ← AUPRC, AUROC, F1, Precision@K, detection by fraud type
│ (Reporting)     │   + 5 publication-quality visualizations
└─────────────────┘
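
The Phase 3 blend can be written out in a few lines. This is a minimal sketch rather than the project's code: it assumes the Isolation Forest score and the rule severity are already scaled to [0, 1], and the tier cut-offs (0.25 / 0.50 / 0.75) are hypothetical, since the diagram names the tiers but not their boundaries.

```python
def combined_priority(if_score: float, rule_severity: float, rule_flag: bool) -> float:
    """Weighted blend from the diagram: 50% IF + 30% rule severity + 20% rule flag.

    Assumes if_score and rule_severity are already scaled to [0, 1].
    """
    return 0.5 * if_score + 0.3 * rule_severity + 0.2 * (1.0 if rule_flag else 0.0)


# Hypothetical cut-offs, checked from the highest tier down.
TIER_CUTOFFS = [(0.75, "Critical"), (0.50, "High"), (0.25, "Medium")]


def risk_tier(priority: float) -> str:
    """Map a priority in [0, 1] to one of the four risk tiers."""
    for cutoff, tier in TIER_CUTOFFS:
        if priority >= cutoff:
            return tier
    return "Low"


examples = {
    "clean": combined_priority(0.10, 0.0, False),
    "anomalous_only": combined_priority(0.90, 0.0, False),
    "rule_and_anomaly": combined_priority(0.90, 1.0, True),
}
tiers = {name: risk_tier(score) for name, score in examples.items()}
```

With these weights, a claim that is anomalous but violates no rule tops out at 0.5, so under the assumed cut-offs only claims that also trip a rule can reach the Critical tier.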

📁 Project Structure

trelegy_copay_fraud/
├── README.md                          # This file
├── run_all.py                         # 🚀 Master script: run this
├── generate_synthetic_data.py         # Synthetic Trelegy copay claims generator
├── feature_engineering.py             # Feature engineering pipeline (48 features)
├── fraud_detection_pipeline.py        # Full detection pipeline (rules + IF + SHAP + eval)
├── requirements.txt                   # Python dependencies
├── data/                              # Generated synthetic data
│   ├── trelegy_copay_claims.csv       # Main claims dataset
│   ├── patient_master.csv             # Patient demographics
│   ├── pharmacy_master.csv            # Pharmacy reference
│   └── prescriber_master.csv          # Prescriber reference
└── results/                           # Pipeline outputs
    ├── investigation_queue_top500.csv # Top 500 flagged claims for review
    ├── scored_claims_full.csv         # All claims with scores & risk tiers
    ├── feature_importance_shap.csv    # SHAP feature importance ranking
    ├── metrics.json                   # Evaluation metrics
    ├── 01_evaluation_metrics.png      # ROC, PR curve, confusion matrix
    ├── 02_shap_summary.png            # SHAP beeswarm plot
    ├── 03_shap_bar_importance.png     # SHAP bar chart
    ├── 04_detection_by_fraud_type.png # Detection rates by fraud type
    ├── 05_risk_tier_distribution.png  # Risk tier analysis
    └── model/
        ├── isolation_forest_model.pkl # Trained IF model
        ├── scaler.pkl                 # StandardScaler
        ├── encoder.pkl                # OrdinalEncoder
        └── feature_names.pkl          # Feature name list

🚀 Quick Start

# 1. Install dependencies
pip install -r requirements.txt

# 2. Run the full pipeline
python run_all.py

This will:

  1. Generate ~50,000 synthetic Trelegy copay claims (with ~3% injected fraud)
  2. Engineer 48 features (temporal, behavioral, rolling windows, Trelegy-specific)
  3. Apply 10 hard business rules
  4. Train Isolation Forest on rule-clean data
  5. Compute SHAP explanations for top features
  6. Generate evaluation metrics and 5 visualizations
  7. Output an investigation queue (top 500 highest-priority claims)
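
Step 4 (train on rule-clean data, then score every claim) is the least obvious part of the list. The project most likely relies on a library implementation such as scikit-learn's `IsolationForest`; that is an assumption based on the saved `StandardScaler` and `OrdinalEncoder` artifacts. The pure-Python sketch below is therefore a swapped-in, from-scratch toy version that only illustrates the idea. The two features (refill gap in days, number of pharmacies used), the toy data, and all function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c_factor(n):
    """Expected path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n


def build_tree(rows, depth, max_depth, rng):
    """Recursively split rows on a random feature at a random threshold."""
    if depth >= max_depth or len(rows) <= 1:
        return {"size": len(rows)}
    feature = rng.randrange(len(rows[0]))
    low = min(r[feature] for r in rows)
    high = max(r[feature] for r in rows)
    if low == high:  # nothing left to split on for this feature
        return {"size": len(rows)}
    split = rng.uniform(low, high)
    left = [r for r in rows if r[feature] < split]
    right = [r for r in rows if r[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(row, node, depth=0):
    """Depth at which row lands in a leaf, plus a correction for the leaf's size."""
    if "size" in node:
        return depth + c_factor(node["size"])
    branch = "left" if row[node["feature"]] < node["split"] else "right"
    return path_length(row, node[branch], depth + 1)


def fit_forest(train_rows, n_trees=100, sample_size=64, seed=0):
    """Build n_trees isolation trees, each on a random subsample (needs >= 2 rows)."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(train_rows))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(train_rows, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    return trees, sample_size


def anomaly_score(row, trees, sample_size):
    """Score in (0, 1); shorter average paths (easier isolation) give higher scores."""
    mean_path = sum(path_length(row, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_path / c_factor(sample_size))


# Toy claims: [days_between_fills, unique_pharmacies]; values are made up.
data_rng = random.Random(42)
all_claims = [{"features": [data_rng.gauss(30, 1.5), data_rng.choice([1, 1, 1, 2])],
               "rule_flag": False} for _ in range(300)]
all_claims.append({"features": [12.0, 1], "rule_flag": True})  # caught by a hard rule

# Train only on claims that passed the rules, then score any claim.
rule_clean = [c["features"] for c in all_claims if not c["rule_flag"]]
trees, sample_size = fit_forest(rule_clean)
typical = anomaly_score([30.0, 1], trees, sample_size)
suspicious = anomaly_score([24.0, 6], trees, sample_size)  # passes the rules, but unusual
```

Fitting on rule-clean claims keeps obvious violations from shaping what the forest treats as normal, which is the motivation the architecture diagram gives for Phase 2.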

📈 Features (48 Total)

Temporal Features

  • days_between_fills: Days since last fill (normal: 28-33 for Trelegy)
  • early_refill_flag: Binary, 1 if the fill occurs before day 23
  • days_since_first_fill: Patient tenure in program
  • claim_month, claim_dow: Seasonality
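
As a rough illustration of how the temporal features could be derived for one patient, the sketch below uses only the standard library. The function name and the dictionary output format are assumptions; `feature_engineering.py` may compute these differently (for example with grouped DataFrame operations).

```python
from datetime import date

EARLY_REFILL_DAY = 23  # from the feature list: a fill before day 23 counts as early


def temporal_features(fill_dates):
    """Compute per-fill temporal features from one patient's fill dates."""
    ordered = sorted(fill_dates)
    if not ordered:
        return []
    first = ordered[0]
    rows = []
    previous = None
    for current in ordered:
        # The first fill has no previous fill, so its gap is undefined.
        gap = (current - previous).days if previous is not None else None
        rows.append({
            "fill_date": current,
            "days_between_fills": gap,
            "early_refill_flag": int(gap is not None and gap < EARLY_REFILL_DAY),
            "days_since_first_fill": (current - first).days,
            "claim_month": current.month,
            "claim_dow": current.weekday(),  # Monday = 0
        })
        previous = current
    return rows


fills = [date(2024, 1, 2), date(2024, 2, 1), date(2024, 2, 15)]
features = temporal_features(fills)
```

In this example the third fill comes 14 days after the second, so it is the only one with `early_refill_flag` set.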

Rolling Window Aggregates (7d/30d/90d)

  • patient_fill_count_{7,30,90}d: Fill velocity per patient
  • patient_copay_spend_{7,30,90}d: Copay card spend per patient
  • patient_total_claim_{7,30,90}d: Total claim amount per patient
  • pharmacy_claim_count_{30,90}d: Volume per pharmacy
  • prescriber_claim_count_{30,90}d: Volume per prescriber
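
A trailing-window count such as `patient_fill_count_{7,30,90}d` might be computed along these lines. The window convention used here (the N days up to and including the current fill) is an assumption; the README does not say whether the current claim is counted.

```python
from datetime import date, timedelta

WINDOWS = (7, 30, 90)


def rolling_fill_counts(fill_dates, windows=WINDOWS):
    """For each fill, count the patient's fills in the trailing N days (current fill included)."""
    ordered = sorted(fill_dates)
    rows = []
    for i, current in enumerate(ordered):
        row = {"fill_date": current}
        for days in windows:
            window_start = current - timedelta(days=days)
            # Only fills up to the current one, strictly after the window start.
            row[f"patient_fill_count_{days}d"] = sum(
                1 for d in ordered[: i + 1] if d > window_start
            )
        rows.append(row)
    return rows


fills = [date(2024, 1, 1), date(2024, 1, 31), date(2024, 2, 3), date(2024, 2, 5)]
counts = rolling_fill_counts(fills)
```

The spend and total-claim aggregates would follow the same pattern, summing an amount over the window instead of counting fills.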

Patient Behavioral

  • unique_pharmacies_overall: Number of distinct pharmacies used
  • unique_programs_per_patient: Card stacking indicator
  • total_fills_per_patient: Lifetime fill count
  • avg_days_between_fills: Average refill gap
  • std_days_between_fills: Refill pattern consistency
  • max_fills_any_30d: Peak fill velocity
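
The patient-level aggregates reduce each patient's claim history to a handful of numbers. A minimal sketch, assuming hypothetical claim keys (`pharmacy_id`, `program_id`, `days_between_fills`) and the sample standard deviation for the refill-gap spread:

```python
from statistics import mean, stdev


def patient_profile(claims):
    """Summarise one patient's claims into behavioral features."""
    gaps = [c["days_between_fills"] for c in claims if c["days_between_fills"] is not None]
    return {
        "unique_pharmacies_overall": len({c["pharmacy_id"] for c in claims}),
        "unique_programs_per_patient": len({c["program_id"] for c in claims}),
        "total_fills_per_patient": len(claims),
        "avg_days_between_fills": mean(gaps) if gaps else None,
        # A single gap has no spread; report 0.0 rather than failing.
        "std_days_between_fills": stdev(gaps) if len(gaps) > 1 else 0.0,
    }


steady_patient = [
    {"pharmacy_id": "PH1", "program_id": "P1", "days_between_fills": None},
    {"pharmacy_id": "PH1", "program_id": "P1", "days_between_fills": 30},
    {"pharmacy_id": "PH1", "program_id": "P1", "days_between_fills": 31},
    {"pharmacy_id": "PH1", "program_id": "P1", "days_between_fills": 29},
]
hopping_patient = [
    {"pharmacy_id": "PH1", "program_id": "P1", "days_between_fills": None},
    {"pharmacy_id": "PH2", "program_id": "P2", "days_between_fills": 9},
    {"pharmacy_id": "PH3", "program_id": "P1", "days_between_fills": 25},
]
steady = patient_profile(steady_patient)
hopper = patient_profile(hopping_patient)
```

A steady patient shows one pharmacy and a tight refill gap, while pharmacy hopping shows up as several pharmacies, several programs, and a wide spread.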

Pharmacy & Prescriber

  • pharmacy_claims_per_patient_ratio: Ghost fill indicator
  • prescriber_specialty_valid: Specialty match flag
  • prescriber_total_claims: Volume anomaly
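
One plausible reading of `pharmacy_claims_per_patient_ratio` is total claims divided by distinct patients at each pharmacy; that definition is an assumption, as the README gives only the feature name. Under that reading it could be computed as follows:

```python
from collections import defaultdict


def claims_per_patient_ratio(claims):
    """claims: iterable of (pharmacy_id, patient_id) pairs, one per claim."""
    claim_counts = defaultdict(int)
    patients = defaultdict(set)
    for pharmacy_id, patient_id in claims:
        claim_counts[pharmacy_id] += 1
        patients[pharmacy_id].add(patient_id)
    return {p: claim_counts[p] / len(patients[p]) for p in claim_counts}


claims = [
    ("PH1", "A"), ("PH1", "B"), ("PH1", "C"),  # three patients, one claim each
    ("PH2", "D"), ("PH2", "D"), ("PH2", "D"),  # two patients, three claims each
    ("PH2", "E"), ("PH2", "E"), ("PH2", "E"),
]
ratios = claims_per_patient_ratio(claims)
```

A pharmacy billing many claims against few patients stands out with a high ratio, which is why the feature is listed as a ghost-fill indicator.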

Trelegy-Specific

  • ndc_switch_flag: Strength switching (0173-0893 ↔ 0173-0887)
  • govt_insurance_flag: Medicare/Medicaid with copay card
  • cross_state_fill: Patient state ≠ pharmacy state
  • new_patient_burst: >1 fill in first 7 days of enrollment
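
Three of these flags can be derived from a patient's ordered claims. In the sketch below the claim keys and the function name are hypothetical, "first 7 days" is interpreted as enrollment day 0 through day 6, and a switch is flagged on the fill whose NDC differs from the previous fill's.

```python
from datetime import date


def trelegy_flags(claims, enrollment_date):
    """claims: one patient's claims sorted by fill_date (dicts with
    fill_date, ndc, patient_state, pharmacy_state)."""
    fills_in_first_week = sum(
        1 for c in claims if 0 <= (c["fill_date"] - enrollment_date).days < 7
    )
    rows = []
    previous_ndc = None
    for c in claims:
        rows.append({
            "ndc_switch_flag": int(previous_ndc is not None and c["ndc"] != previous_ndc),
            "cross_state_fill": int(c["patient_state"] != c["pharmacy_state"]),
            # Patient-level flag, repeated on each of the patient's claims.
            "new_patient_burst": int(fills_in_first_week > 1),
        })
        previous_ndc = c["ndc"]
    return rows


claims = [
    {"fill_date": date(2024, 1, 2), "ndc": "0173-0893", "patient_state": "NC", "pharmacy_state": "NC"},
    {"fill_date": date(2024, 1, 5), "ndc": "0173-0887", "patient_state": "NC", "pharmacy_state": "SC"},
    {"fill_date": date(2024, 2, 4), "ndc": "0173-0887", "patient_state": "NC", "pharmacy_state": "NC"},
]
flags = trelegy_flags(claims, enrollment_date=date(2024, 1, 1))
```

Here the second claim both switches strength and crosses a state line, and the two fills within the first week mark the patient as a burst.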

📏 Evaluation Metrics

Per literature best practices (arXiv:2312.13896, arXiv:2208.11904):

  • AUPRC: Primary metric (best for imbalanced fraud data)
  • Precision@K: Operational metric (investigator queue efficiency)
  • F1-Score: Balance of precision and recall
  • AUROC: Supplementary (can be misleading at <1% fraud rate)
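
To make the two ranking metrics concrete, here is a small standard-library sketch. Average precision is used as a common step-wise estimate of AUPRC; the project may compute AUPRC differently (for example with a library routine), and ties between scores are not handled specially here.

```python
def precision_at_k(labels, scores, k):
    """Fraction of true fraud cases among the k highest-scored claims."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def average_precision(labels, scores):
    """Mean of the precision values at each rank where a fraud case appears."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


# Toy example: 1 = fraud, 0 = legitimate; scores are made-up priorities.
labels = [1, 0, 1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
p_at_3 = precision_at_k(labels, scores, k=3)
ap = average_precision(labels, scores)
```

Precision@K maps directly onto the investigation queue: with K = 500 it is the share of the top-500 claims that turn out to be fraudulent.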

🔧 Trelegy Product Details

| Attribute | Value |
|---|---|
| Brand | Trelegy Ellipta |
| Manufacturer | GSK (GlaxoSmithKline) |
| Active Ingredients | Fluticasone furoate / Umeclidinium / Vilanterol |
| Indications | COPD (maintenance), Asthma (adults ≥18) |
| NDC (COPD/Asthma) | 0173-0893-14 (100/62.5/25 mcg) |
| NDC (Asthma) | 0173-0887-14 (200/62.5/25 mcg) |
| Days Supply | 30 days (30 doses, 1 inhalation/day) |
| Retail Price | ~$600-700/month |

📚 References

⚖️ Disclaimer

This project uses synthetic data only. No real patient, pharmacy, or prescriber data is included. The synthetic data is designed to be realistic for development and testing purposes but does not represent actual GSK copay program transactions.

📄 License

MIT License
