GSK Copay Card Fraud Detection System – v4 Group-Aware
Product Focus: Trelegy Ellipta (configurable for Nucala / any GSK product)
Methodology: Hybrid Rules + Isolation Forest + SHAP Explainability + Hierarchical Summaries + Group-Aware Benefit Validation
Architecture: Drug-agnostic, ground-truth-optional, vendor-format-agnostic, production-ready
Analytical Levels: Transaction → Patient → HCP → Pharmacy
Group-Aware Validation: Group 8141 (Legacy) vs Group 8200 / 2025 benefit designs
Overview
This system detects fraudulent and suspicious copay card claims in GSK pharmaceutical transaction data using a 4-level hierarchical analytical framework:
| Level | What It Detects | Key Features |
|---|---|---|
| Transaction | Per-claim anomalies | Gap between fills, quantity, days supply, benefit amount, OOP cost, NDC switch |
| Patient | Behavioral patterns | One-and-done patients, active duration, avg gap between fills, short/long gap % |
| HCP | Prescriber-driven fraud | Suspicious specialty, one-and-done %, patient concentration, avg benefit per patient |
| Pharmacy | Pharmacy-centric rings | Active/closed flag, HCP concentration, one-and-done %, avg benefit, fraud risk score |
Under the hood, the system combines:
- 23 hard-coded business rules (15 original + 8 v3 hierarchical rules)
- Isolation Forest unsupervised anomaly detection trained on rule-clean data
- SHAP explainability for every flagged claim
- Hierarchical summary exports for investigative lens views
The pipeline supports any vendor format (ELAAD, APLD, IQVIA, CMS DMR, generic CSV). It auto-discovers column names via synonym mapping, handles missing columns gracefully, and produces a schema report showing what it found and what it missed.
Hierarchical Features (v3)
Transaction Level
- `days_between_fills` → gap from last fill
- `early_refill_flag` → refill before 75% of expected days supply consumed
- `quantity_anomaly` → quantity != 1 (Trelegy is 1 inhaler per fill)
- `days_supply_anomaly` → days_supply != 30
- `govt_insurance_flag` → Medicare/Medicaid/Tricare/VA (program violation)
- `ndc_switch_flag` → patient has filled multiple NDCs (strength switching)
- `cross_state_fill` → pharmacy state != inferred patient state
- `benefit_ratio` → benefit_amount / usual_customary
- `transaction_benefit_score` → z-score of benefit relative to population
- `transaction_oop_score` → z-score of OOP cost relative to population
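The gap and early-refill features above can be sketched in a few lines of pandas. This is a minimal illustration, not the pipeline's actual code; the column names (`patient_id`, `fill_date`, `days_supply`) follow the internal schema described later in this README, and the tiny DataFrame is made-up sample data:

```python
import pandas as pd

# Hypothetical sample claims; real data comes from data_ingestion.py
claims = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P2"],
    "fill_date": pd.to_datetime(["2025-01-01", "2025-01-20", "2025-02-25", "2025-01-10"]),
    "days_supply": [30, 30, 30, 30],
})

claims = claims.sort_values(["patient_id", "fill_date"])
# Gap from the previous fill for the same patient (NaN for the first fill)
claims["days_between_fills"] = claims.groupby("patient_id")["fill_date"].diff().dt.days
# Early refill: next fill arrives before 75% of the prior days supply is consumed
claims["early_refill_flag"] = (
    claims["days_between_fills"]
    < 0.75 * claims.groupby("patient_id")["days_supply"].shift()
).astype(int)
```

Here the second P1 fill lands 19 days after the first, which is under the 22.5-day (75% of 30) threshold, so it gets flagged.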
Patient Level
- `patient_one_and_done` → patient has exactly 1 fill ever (hit-and-run pattern)
- `patient_active_duration` → days from first to last fill (short = burst)
- `patient_avg_gap` → average gap between fills per patient
- `patient_short_gap_pct` → % of fills with gap < 15 days (excessive frequency)
- `patient_long_gap_pct` → % of fills with gap > 60 days (irregular adherence)
- `unique_pharmacies_overall` → number of different pharmacies used
- `unique_prescribers_per_patient` → number of different HCPs seen
- `total_fills_per_patient` → total fill count
- `patient_fill_count_7d/30d/90d` → rolling fill counts
HCP Level
- `hcp_suspicious_specialty` → 1 if specialty not in valid list (Dermatology, Orthopedics, etc.)
- `hcp_one_and_done_pct` → % of patients with only 1 fill from this HCP
- `hcp_patient_concentration` → % of claims from top patient (high = patient farming)
- `hcp_avg_benefit_per_patient` → average benefit per unique patient
- `hcp_max_benefit_per_patient` → highest benefit concentrated on a single patient
- `hcp_std_benefit` → variance in benefit amounts (unusually consistent amounts are suspicious)
- `hcp_unique_pharmacies` → number of pharmacies this HCP writes for
- `hcp_total_claims` / `hcp_unique_patients` / `hcp_patient_share`
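As an illustration of one HCP-level aggregate, `hcp_one_and_done_pct` is just the share of an HCP's patients with exactly one fill. The sketch below is a hypothetical reconstruction on made-up data, not the pipeline's actual code:

```python
import pandas as pd

# Hypothetical claims: H1 has one one-and-done patient (P1) and one repeat
# patient (P2); both of H2's patients are one-and-done
claims = pd.DataFrame({
    "prescriber_npi": ["H1", "H1", "H1", "H2", "H2"],
    "patient_id":     ["P1", "P2", "P2", "P3", "P4"],
})

# Fills per (HCP, patient) pair, then the fraction of patients with exactly 1
fills_per_patient = claims.groupby(["prescriber_npi", "patient_id"]).size()
hcp_one_and_done_pct = (fills_per_patient == 1).groupby("prescriber_npi").mean()
```

For this sample, H1 scores 0.5 and H2 scores 1.0; rule 17 would flag any HCP above the 0.6 threshold.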
Pharmacy Level
- `pharmacy_active_flag` → 0 if inactive/closed (suspicious if still claiming)
- `pharmacy_fraud_risk_score` → composite: 0.25×reject_rate + 0.20×paper_rate + 0.30×hcp_conc + 0.25×one_done
- `pharmacy_hcp_concentration` → % of claims from top HCP (high = HCP-driven ring)
- `pharmacy_one_and_done_pct` → % of patients with only 1 fill (churn-and-burn)
- `pharmacy_avg_benefit` / `pharmacy_total_benefit` / `pharmacy_max_benefit_per_patient`
- `pharmacy_reject_rate` / `pharmacy_paper_submission_rate`
- `pharmacy_unique_hcps` → number of distinct prescribers at this pharmacy
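The composite `pharmacy_fraud_risk_score` is a straight weighted sum of the four component rates. A minimal sketch, assuming each input rate is already normalized to [0, 1]:

```python
def pharmacy_fraud_risk_score(reject_rate: float, paper_rate: float,
                              hcp_conc: float, one_done_pct: float) -> float:
    """Weighted composite of four pharmacy risk rates, each in [0, 1]."""
    return (0.25 * reject_rate
            + 0.20 * paper_rate
            + 0.30 * hcp_conc
            + 0.25 * one_done_pct)

# Example: moderate rejects, low paper rate, one dominant HCP, high churn
score = pharmacy_fraud_risk_score(0.4, 0.1, 0.8, 0.6)  # -> 0.51
```

Because the weights sum to 1.0, the composite also stays in [0, 1]; rule 18 flags pharmacies scoring above 0.6.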
Project Structure
gsk_copay_fraud/
├── config.py                      # Product config + column synonym mappings + FEATURE_DEPENDENCIES
├── data_ingestion.py              # Schema discovery + vendor-agnostic ingestion
├── feature_engineering_v2.py      # 60+ features with graceful degradation
├── fraud_detection_pipeline_v3.py # Full pipeline (23 rules + IF + SHAP + hierarchical summaries)
├── run_all_v3.py                  # CLI runner
├── generate_elaad_test_data.py    # Synthetic ELAAD-style test data generator with embedded fraud
├── requirements.txt               # Dependencies
├── README.md                      # This file
├── data/                          # Place raw GSK data here
│   └── elaad_test_trelegy.csv
└── results/                       # All outputs generated here
    ├── investigation_queue_top500.csv
    ├── scored_claims_full.csv
    ├── transaction_level_summary.csv   ← NEW: per-claim analytical view
    ├── hcp_level_summary.csv           ← NEW: per-HCP investigative lens
    ├── pharmacy_level_summary.csv      ← NEW: per-pharmacy investigative lens
    ├── patient_level_summary.csv       ← NEW: per-patient behavioral view
    ├── feature_importance_shap.csv
    ├── metrics.json
    ├── schema_report.json
    ├── 01_evaluation_metrics.png
    ├── 02_shap_summary.png
    ├── 03_shap_bar_importance.png
    ├── 04_rule_breakdown.png
    ├── 05_risk_tier_distribution.png
    ├── 06_hcp_summary.png              ← NEW
    ├── 07_pharmacy_summary.png         ← NEW
    └── 08_patient_summary.png          ← NEW
Quick Start
1. Install Dependencies
pip install -r requirements.txt
2. Run on Real GSK Data
# Auto-detect everything (recommended)
python run_all_v3.py \
--data-path data/vendor_file.txt.gz
# CSV file
python run_all_v3.py \
--data-path data/gsk_copay_transactions.csv \
--file-type csv
# Gzipped TXT (common GSK format)
python run_all_v3.py \
--data-path data/GSK_COPAY_TRANSACTION_DAILY_20250725.TXT.GZ \
--file-type txt.gz
# Force a specific vendor format (skips auto-discovery)
python run_all_v3.py \
--data-path data/vendor_file.csv \
--vendor-format gsk_iqvia \
--contamination 0.03
# Adjust anomaly rate for unusual datasets
python run_all_v3.py \
--data-path data/vendor_file.csv \
--contamination 0.05
3. Generate & Test on Synthetic ELAAD Data
# Generate test data
python generate_elaad_test_data.py
# Creates data/elaad_test_trelegy.csv with embedded fraud patterns
# Run pipeline on synthetic data (high contamination needed because ~40% fraud)
python run_all_v3.py \
--data-path data/elaad_test_trelegy.csv \
--file-type csv \
--contamination 0.40
Vendor Format Handling
The Problem
Vendor files rarely match the idealised spec. A column named IQVIA_PATIENT_ID in the spec might appear as:
- `PATIENT_ID` (ELAAD format)
- `MEMBER_ID` (APLD format)
- `PATIENTID` (no underscore)
- `PAT ID` (space instead of underscore)
- `Patient ID` (mixed case)
- `PATIENT_KEY` (different suffix)
The Solution: Schema Discovery
The pipeline uses `COLUMN_SYNONYMS` in `config.py`: a dictionary where each internal column name maps to a list of possible raw names (20+ synonyms per column). When a file is loaded:
- Scan header → collect all raw column names
- Normalize → uppercase, strip whitespace, replace underscores/spaces with a single space
- Match → for each internal column, try synonyms in order of preference
- Report → log what was mapped and what was missing
- Continue → pipeline runs with whatever columns are available
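The discovery steps can be sketched as below. This is a simplified illustration, not the actual `data_ingestion.py` code, and the `COLUMN_SYNONYMS` dict here is a tiny hypothetical excerpt of the real mapping:

```python
# Hypothetical two-entry excerpt; the real mapping lives in config.py
COLUMN_SYNONYMS = {
    "patient_id": ["IQVIA_PATIENT_ID", "PATIENT_ID", "PAT_ID", "MEMBER_ID"],
    "reject_code": ["REJECT_CODE", "REJECTION_CODE", "DENIAL_CD"],
}

def normalize(name: str) -> str:
    # Uppercase, trim, and collapse underscores/spaces to a single space
    return " ".join(name.upper().replace("_", " ").split())

def discover_schema(raw_columns):
    normalized = {normalize(c): c for c in raw_columns}
    mapped, missing = {}, []
    for internal, synonyms in COLUMN_SYNONYMS.items():
        for syn in synonyms:          # try synonyms in order of preference
            if normalize(syn) in normalized:
                mapped[internal] = normalized[normalize(syn)]
                break
        else:
            missing.append(internal)  # report, then continue without it
    return mapped, missing

mapped, missing = discover_schema(["Pat ID", "NDC", "qty"])
# "Pat ID" normalizes to "PAT ID" and matches the PAT_ID synonym
```

This is why `PAT ID`, `Patient ID`, and `PAT_ID` all land on the same internal `patient_id` column.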
Example Schema Report
2025-04-05 12:00:00 [INFO] Vendor format: detected=generic_csv, requested=auto
2025-04-05 12:00:00 [INFO] Schema discovery: mapped 38 / 50 internal columns
2025-04-05 12:00:00 [INFO] [pharmacy] 6/7 present. Missing: ['pharmacy_subcategory']
2025-04-05 12:00:00 [INFO] [reject] 1/3 present. Missing: ['reject_description', 'reject_type']
2025-04-05 12:00:00 [WARNING] Missing columns (12): ['hcp_id', 'record_type', 'other_coverage', ...]
2025-04-05 12:00:00 [INFO] Feature engineering on 85,432 claims...
2025-04-05 12:00:00 [WARNING] Skipping 'pharmacy_mail_order_pct' β missing mail_order
2025-04-05 12:00:00 [WARNING] Skipping 'prescriber_specialty_valid' β missing prescriber_specialty
Vendor Format Profiles
| Profile | Description |
|---|---|
| `auto` | Scan file header, detect best match (default) |
| `gsk_iqvia` | Expect IQVIA-style column names |
| `generic_csv` | Expect generic lower-case names |
| `cms_dmr` | CMS Drug Monitoring Report format |
| `elaad_apld` | ELAAD/APLD format with MEMBER_ID, HCP_ID, etc. |
| `unknown` | No preconceptions; rely fully on synonym matching |
Adding a New Vendor Synonym
No code changes needed. Edit `COLUMN_SYNONYMS` in `config.py`:
"patient_id": [
"IQVIA_PATIENT_ID", "PATIENT_ID", "PAT_ID", "MEMBER_ID",
"NEW_VENDOR_PATIENT_ID", # β add your vendor's name here
],
Input Data Schema
The pipeline accepts any of the following formats and auto-discovers columns:
| Format | Extension | Auto-detect? |
|---|---|---|
| CSV | `.csv` | Yes |
| Tab-separated | `.txt`, `.tsv` | Yes (scans first line for `\t`) |
| Gzipped tab | `.txt.gz`, `.csv.gz` | Yes |
| ZIP archive | `.zip` | Yes (extracts first CSV/TXT inside) |
| Excel | `.xlsx`, `.xls` | Yes (requires openpyxl/xlrd) |
Key column groups the pipeline looks for:
- Patient: `IQVIA_PATIENT_ID` / `PATIENT_ID` / `MEMBER_ID` → `patient_id`
- HCP: `IQVIA_PRESCRIBER_ID` / `HCP_ID` / `PRESCRIBER_ID` / `DOCTOR_ID` → `prescriber_npi`
- Claim: `CLAIM_NUMBER` / `CLAIM_NUM` / `CLAIMID` → `claim_number`
- Drug: `NDC` / `DRUG_NDC` / `NATIONAL_DRUG_CODE` → `drug_ndc`
- Financial: `COPAY_AFTER_BENEFIT` / `COPAY_AFTER` / `OOP_COST` → `copay_after`
- Pharmacy: `PHARMACY_NABP_NUMBER` / `PHARMACY_ID` / `STORE_ID` → `pharmacy_nabp`
- Insurance: `PRIMARY_PAYER_BIN` / `PAYER_BIN` / `BIN` → `primary_payer_bin`
- Reject: `REJECT_CODE` / `REJECTION_CODE` / `DENIAL_CD` → `reject_code`
See config.py::COLUMN_SYNONYMS for the full list of 100+ synonyms.
23 Business Rules
Original 15 Rules
| # | Rule | Condition | Fraud Signal |
|---|---|---|---|
| 1 | Early Refill | `days_between_fills < 23` | Early Refill Abuse |
| 2 | Impossible Qty | `quantity != 1` | Data Error / Fraud |
| 3 | Wrong Days Supply | `days_supply != 30` | Data Error / Fraud |
| 4 | Govt Insurance | `insurance_type == Government` | Program Violation |
| 5 | Underage | `patient_age < 18` | Program Violation |
| 6 | Duplicate | Same patient + date + pharmacy | Duplicate Billing |
| 7 | NDC Switch | `patient_ndc_count > 1` | Strength Switching |
| 8 | Suspicious Specialty | Prescriber not in valid list | Prescriber Collusion |
| 9 | Multi-Program | `unique_programs_per_patient > 1` | Card Stacking |
| 10 | Excessive Fills (90d) | `patient_fill_count_90d > 4` | Stockpiling |
| 11 | High-Risk Reject | `reject_code in {76, 88, 79}` | Maximizer / DUR |
| 12 | Maximizer Cap | `maximizer_reject == 1` | Benefit Exhaustion |
| 13 | Paper Submission | `paper_submission == 1` | Submission Fraud |
| 14 | Plan Switch | `plan_switch_flag == 1` | Plan Switching |
| 15 | Linked Claim | `has_linked_claim == 1` | Reversal / Adjustment |
v3 Hierarchical Rules
| # | Rule | Condition | Fraud Signal |
|---|---|---|---|
| 16 | HCP High Benefit | `hcp_avg_benefit_per_patient > 500` | Prescriber-driven extraction |
| 17 | HCP One-Done Concentration | `hcp_one_and_done_pct > 0.6` | HCP with hit-and-run patients |
| 18 | Pharmacy Fraud Risk | `pharmacy_fraud_risk_score > 0.6` | Composite pharmacy ring score |
| 19 | Pharmacy HCP Concentration | `pharmacy_hcp_concentration > 0.5` | Single HCP dominates pharmacy |
| 20 | Pharmacy One-Done | `pharmacy_one_and_done_pct > 0.5` | Pharmacy with churn-and-burn |
| 21 | Short Active Burst | `patient_active_duration <= 14 AND total_fills > 1` | Quick-fire multi-fill scheme |
| 22 | Cross-State | `patient_state != pharmacy_state` | Out-of-state fraud |
| 23 | New Patient Burst | `days_since_first <= 7 AND total_fills > 1` | Same-week multiple fills |
Rules whose source columns are missing are silently skipped (count = 0).
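The skip-if-missing behavior can be sketched with a small guard: each rule declares the columns it needs and returns an all-zero flag when any are absent. This is a hypothetical illustration of the pattern, not the pipeline's actual rule engine:

```python
import pandas as pd

def apply_rule(df: pd.DataFrame, needed_cols, condition) -> pd.Series:
    """Evaluate a rule, or return all zeros if its source columns are missing."""
    if not all(c in df.columns for c in needed_cols):
        return pd.Series(0, index=df.index)  # silently skipped: count = 0
    return condition(df).astype(int)

# Hypothetical claims file that has quantity but no days_supply column
claims = pd.DataFrame({"quantity": [1, 3, 1]})

qty_flag = apply_rule(claims, ["quantity"], lambda d: d["quantity"] != 1)
supply_flag = apply_rule(claims, ["days_supply"], lambda d: d["days_supply"] != 30)
# qty_flag fires on the second claim; supply_flag is all zeros (skipped)
```

The guard keeps the pipeline running on degraded vendor files rather than crashing on a missing column.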
Model: Isolation Forest
IsolationForest(
n_estimators=200,
contamination=0.03, # Adjustable via CLI (use 0.40 for synthetic test data)
max_samples="auto",
max_features=1.0,
bootstrap=False,
random_state=42,
n_jobs=-1,
)
Training strategy: Train ONLY on claims where rule_flag == 0 (rule-clean). Score ALL claims.
Degraded mode: If zero rule-clean claims exist (e.g., synthetic data with 40% fraud), the model trains on ALL claims with contamination=min(orig×3, 0.5). This is a safety fallback; real production data will always have rule-clean claims.
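A minimal sketch of the train-on-rule-clean strategy and its fallback, using random stand-in arrays for the feature matrix and rule flags (the real pipeline feeds in the engineered features):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in data: 500 claims, 4 features, ~10% rule-flagged
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
rule_flag = (rng.random(500) < 0.1).astype(int)

contamination = 0.03
clean = X[rule_flag == 0]
if len(clean) > 0:
    # Normal mode: fit only on rule-clean claims
    model = IsolationForest(n_estimators=200, contamination=contamination,
                            random_state=42, n_jobs=-1).fit(clean)
else:
    # Degraded mode: no rule-clean claims, fit on everything
    model = IsolationForest(n_estimators=200,
                            contamination=min(contamination * 3, 0.5),
                            random_state=42, n_jobs=-1).fit(X)

# score_samples returns lower values for anomalies, so negate it
# to get "higher = more anomalous", then score ALL claims
if_anomaly_score = -model.score_samples(X)
```

Fitting on rule-clean claims keeps known-bad patterns out of the model's notion of "normal", so rule violators and rule-evading anomalies both score high at inference time.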
Priority score:
priority_score = 0.50 * if_anomaly_score + 0.30 * rule_severity + 0.20 * rule_flag
Risk Tiers:
| Tier | Score | Action |
|---|---|---|
| Low | 0.0β0.3 | No action |
| Medium | 0.3β0.6 | Monitor |
| High | 0.6β0.8 | Investigate (investigation queue) |
| Critical | 0.8β1.0 | Immediate investigation + audit trail |
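The priority formula and tier cut-points above combine into a few lines of pandas. A sketch on made-up scores, assuming the three inputs are already scaled to [0, 1]:

```python
import pandas as pd

# Hypothetical scored claims; inputs assumed pre-scaled to [0, 1]
scored = pd.DataFrame({
    "if_anomaly_score": [0.1, 0.9, 0.5],
    "rule_severity":    [0.0, 1.0, 0.4],
    "rule_flag":        [0,   1,   1],
})

scored["priority_score"] = (0.50 * scored["if_anomaly_score"]
                            + 0.30 * scored["rule_severity"]
                            + 0.20 * scored["rule_flag"])

scored["risk_tier"] = pd.cut(scored["priority_score"],
                             bins=[0.0, 0.3, 0.6, 0.8, 1.0],
                             labels=["Low", "Medium", "High", "Critical"],
                             include_lowest=True)
```

The three sample claims land at 0.05 (Low), 0.95 (Critical), and 0.57 (Medium); sorting by `priority_score` descending yields the investigation queue.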
Outputs
Core Outputs
| File | Type | Description |
|---|---|---|
| `investigation_queue_top500.csv` | CSV | Top 500 highest-priority claims for manual review |
| `scored_claims_full.csv` | CSV | All claims with scores + risk tiers |
| `feature_importance_shap.csv` | CSV | SHAP ranking of features |
| `metrics.json` | JSON | All evaluation metrics + hierarchical summary counts |
| `schema_report.json` | JSON | Column mapping audit trail |
NEW: Hierarchical Summaries (v3)
| File | Type | Description |
|---|---|---|
| `transaction_level_summary.csv` | CSV | Per-claim analytical view (27 columns) |
| `hcp_level_summary.csv` | CSV | Per-HCP investigative lens (avg scores, specialty, concentration) |
| `pharmacy_level_summary.csv` | CSV | Per-pharmacy investigative lens (fraud risk score, HCP concentration, one-done %) |
| `patient_level_summary.csv` | CSV | Per-patient behavioral view (one-and-done, active duration, gap stats) |
Visualizations
| File | Type | Description |
|---|---|---|
| `01_evaluation_metrics.png` | PNG | Score distribution, ROC, PR, tier counts |
| `02_shap_summary.png` | PNG | SHAP beeswarm (top 20 features) |
| `03_shap_bar_importance.png` | PNG | SHAP bar chart |
| `04_rule_breakdown.png` | PNG | Flagged claims by each rule |
| `05_risk_tier_distribution.png` | PNG | Tier counts + fraud rates |
| `06_hcp_summary.png` | PNG | HCP risk score distribution + benefit vs risk scatter |
| `07_pharmacy_summary.png` | PNG | Pharmacy risk score distribution + fraud risk vs priority |
| `08_patient_summary.png` | PNG | Patient risk score distribution + one-and-done rate |
Model Artifacts
| File | Type | Description |
|---|---|---|
| `model/isolation_forest_model.pkl` | PKL | Trained model |
| `model/scaler.pkl` | PKL | StandardScaler |
| `model/encoder.pkl` | PKL | OrdinalEncoder |
| `model/feature_names.pkl` | PKL | Feature name list |
Configuration: Drug-Agnostic
Edit config.py to configure for any GSK product:
PRODUCT_CONFIG = {
"product_name": "Trelegy Ellipta",
"days_supply_expected": 30,
"quantity_expected": 1,
"ndc_list": {
"00173089314": {"strength": "100/62.5/25", "indication": "COPD/Asthma"},
"00173088714": {"strength": "200/62.5/25", "indication": "Asthma"},
},
"valid_prescriber_specialties": ["Pulmonology", "Allergy/Immunology", ...],
"suspicious_prescriber_specialties": ["Dermatology", "Orthopedics", ...],
"early_refill_threshold_days": 23,
"max_fills_90d": 4,
# HCP thresholds
"hcp_high_benefit_threshold": 500.0,
"hcp_patient_concentration_threshold": 0.6,
# Pharmacy thresholds
"pharmacy_high_benefit_threshold": 450.0,
"pharmacy_hcp_concentration_threshold": 0.5,
"pharmacy_one_done_threshold": 0.5,
# Patient thresholds
"patient_gap_short_threshold": 15,
"patient_gap_long_threshold": 60,
"patient_max_active_duration_days": 180,
...
}
Tech Stack
- Python 3.9+
- pandas β₯ 2.0.0, numpy β₯ 1.24.0, scikit-learn β₯ 1.3.0
- shap β₯ 0.42.0, matplotlib β₯ 3.7.0, seaborn β₯ 0.12.0, joblib β₯ 1.3.0, pyarrow β₯ 12.0.0
License
Proprietary β GSK internal use.
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern