GSK Copay Card Fraud Detection System – v4 Group-Aware

Product Focus: Trelegy Ellipta (configurable for Nucala / any GSK product)
Methodology: Hybrid Rules + Isolation Forest + SHAP Explainability + Hierarchical Summaries + Group-Aware Benefit Validation
Architecture: Drug-agnostic, ground-truth-optional, vendor-format-agnostic, production-ready
Analytical Levels: Transaction → Patient → HCP → Pharmacy
Group-Aware Validation: Group 8141 (Legacy) vs Group 8200 / 2025 benefit designs


Overview

This system detects fraudulent and suspicious copay card claims in GSK pharmaceutical transaction data using a 4-level hierarchical analytical framework:

Level | What It Detects | Key Features
Transaction | Per-claim anomalies | Gap between fills, quantity, days supply, benefit amount, OOP cost, NDC switch
Patient | Behavioral patterns | One-and-done patients, active duration, avg gap between fills, short/long gap %
HCP | Prescriber-driven fraud | Suspicious specialty, one-and-done %, patient concentration, avg benefit per patient
Pharmacy | Pharmacy-centric rings | Active/closed flag, HCP concentration, one-and-done %, avg benefit, fraud risk score

Under the hood, the system combines:

  • 23 hard-coded business rules (15 original + 8 v3 hierarchical rules)
  • Isolation Forest unsupervised anomaly detection trained on rule-clean data
  • SHAP explainability for every flagged claim
  • Hierarchical summary exports for investigative lens views

The pipeline supports any vendor format – ELAAD, APLD, IQVIA, CMS DMR, generic CSV. It auto-discovers column names via synonym mapping, handles missing columns gracefully, and produces a schema report showing what it found and what it missed.


Hierarchical Features (v3)

Transaction Level

  • days_between_fills – gap from last fill
  • early_refill_flag – refill before 75% of expected days supply consumed
  • quantity_anomaly – quantity != 1 (Trelegy is 1 inhaler per fill)
  • days_supply_anomaly – days_supply != 30
  • govt_insurance_flag – Medicare/Medicaid/Tricare/VA (program violation)
  • ndc_switch_flag – patient has filled multiple NDCs (strength switching)
  • cross_state_fill – pharmacy state != inferred patient state
  • benefit_ratio – benefit_amount / usual_customary
  • transaction_benefit_score – z-score of benefit relative to population
  • transaction_oop_score – z-score of OOP cost relative to population
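
Two of the features above, the fill gap and the population z-score, reduce to a few lines of pandas. A minimal sketch with illustrative column names and toy data:

```python
# Sketch of two transaction-level features, assuming a claims frame with
# patient_id, fill_date, and benefit_amount columns (names illustrative).
import pandas as pd

claims = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P2"],
    "fill_date": pd.to_datetime(["2025-01-01", "2025-01-10", "2025-02-20", "2025-01-05"]),
    "benefit_amount": [100.0, 110.0, 600.0, 105.0],
}).sort_values(["patient_id", "fill_date"])

# days_between_fills: gap from the previous fill by the same patient
claims["days_between_fills"] = claims.groupby("patient_id")["fill_date"].diff().dt.days

# transaction_benefit_score: z-score of benefit vs. the whole population
mu, sigma = claims["benefit_amount"].mean(), claims["benefit_amount"].std()
claims["transaction_benefit_score"] = (claims["benefit_amount"] - mu) / sigma
```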

Patient Level

  • patient_one_and_done – patient has exactly 1 fill ever (hit-and-run pattern)
  • patient_active_duration – days from first to last fill (short = burst)
  • patient_avg_gap – average gap between fills per patient
  • patient_short_gap_pct – % of fills with gap < 15 days (excessive frequency)
  • patient_long_gap_pct – % of fills with gap > 60 days (irregular adherence)
  • unique_pharmacies_overall – number of different pharmacies used
  • unique_prescribers_per_patient – number of different HCPs seen
  • total_fills_per_patient – total fill count
  • patient_fill_count_7d/30d/90d – rolling fill counts
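
A hedged sketch of how several of these patient-level aggregates can be derived from the per-claim frame (column names illustrative, not necessarily the exact implementation in feature_engineering_v2.py):

```python
# Patient-level aggregates from a per-claim frame (illustrative data).
import pandas as pd

claims = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P2"],
    "fill_date": pd.to_datetime(["2025-01-01", "2025-01-10", "2025-03-01", "2025-01-05"]),
}).sort_values(["patient_id", "fill_date"])

# per-claim gap, reused for the patient-level average
gaps = claims.groupby("patient_id")["fill_date"].diff().dt.days

patients = claims.groupby("patient_id").agg(
    total_fills_per_patient=("fill_date", "size"),
    patient_active_duration=("fill_date", lambda s: (s.max() - s.min()).days),
)
patients["patient_avg_gap"] = gaps.groupby(claims["patient_id"]).mean()
patients["patient_one_and_done"] = (patients["total_fills_per_patient"] == 1).astype(int)
```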

HCP Level

  • hcp_suspicious_specialty – 1 if specialty is outside the valid list (e.g., Dermatology, Orthopedics)
  • hcp_one_and_done_pct – % of patients with only 1 fill from this HCP
  • hcp_patient_concentration – % of claims from top patient (high = patient farming)
  • hcp_avg_benefit_per_patient – average benefit per unique patient
  • hcp_max_benefit_per_patient – highest benefit concentrated on a single patient
  • hcp_std_benefit – std dev of benefit amounts (unusually consistent amounts are suspicious)
  • hcp_unique_pharmacies – number of pharmacies this HCP writes for
  • hcp_total_claims / hcp_unique_patients / hcp_patient_share
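
The concentration feature is just a top-share groupby. A minimal sketch of hcp_patient_concentration with toy data (names illustrative):

```python
# hcp_patient_concentration: share of an HCP's claims that come from
# their single top patient. Data and column names are illustrative.
import pandas as pd

claims = pd.DataFrame({
    "prescriber_npi": ["H1"] * 5 + ["H2"] * 4,
    "patient_id": ["P1", "P1", "P1", "P1", "P2", "P3", "P4", "P5", "P6"],
})

def top_patient_share(patients: pd.Series) -> float:
    # value_counts sorts descending, so iloc[0] is the top patient's share
    return patients.value_counts(normalize=True).iloc[0]

hcp_patient_concentration = (
    claims.groupby("prescriber_npi")["patient_id"].apply(top_patient_share)
)
```

H1 writes 4 of 5 claims for one patient (share 0.8, a patient-farming signature), while H2's claims are evenly spread (0.25).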

Pharmacy Level

  • pharmacy_active_flag – 0 if inactive/closed (suspicious if still claiming)
  • pharmacy_fraud_risk_score – composite: 0.25×reject_rate + 0.20×paper_rate + 0.30×hcp_conc + 0.25×one_done
  • pharmacy_hcp_concentration – % claims from top HCP (high = HCP-driven ring)
  • pharmacy_one_and_done_pct – % patients with only 1 fill (churn-and-burn)
  • pharmacy_avg_benefit / pharmacy_total_benefit / pharmacy_max_benefit_per_patient
  • pharmacy_reject_rate / pharmacy_paper_submission_rate
  • pharmacy_unique_hcps – number of distinct prescribers at this pharmacy
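
The composite formula from the pharmacy_fraud_risk_score bullet, written out as a function; the component rates are assumed to be pre-computed in [0, 1]:

```python
# Composite pharmacy risk score, as given in the bullet list above.
def pharmacy_fraud_risk_score(reject_rate: float, paper_rate: float,
                              hcp_conc: float, one_done: float) -> float:
    # weights sum to 1.0, so the composite also stays in [0, 1]
    return (0.25 * reject_rate + 0.20 * paper_rate
            + 0.30 * hcp_conc + 0.25 * one_done)

# A pharmacy dominated by a single HCP scores high on the concentration term:
score = pharmacy_fraud_risk_score(0.10, 0.05, 0.80, 0.60)
```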

Project Structure

gsk_copay_fraud/
├── config.py                          # Product config + column synonym mappings + FEATURE_DEPENDENCIES
├── data_ingestion.py                  # Schema discovery + vendor-agnostic ingestion
├── feature_engineering_v2.py          # 60+ features with graceful degradation
├── fraud_detection_pipeline_v3.py     # Full pipeline (23 rules + IF + SHAP + hierarchical summaries)
├── run_all_v3.py                      # CLI runner
├── generate_elaad_test_data.py        # Synthetic ELAAD-style test data generator with embedded fraud
├── requirements.txt                   # Dependencies
├── README.md                          # This file
├── data/                              # Place raw GSK data here
│   └── elaad_test_trelegy.csv
└── results/                           # All outputs generated here
    ├── investigation_queue_top500.csv
    ├── scored_claims_full.csv
    ├── transaction_level_summary.csv    ← NEW: per-claim analytical view
    ├── hcp_level_summary.csv            ← NEW: per-HCP investigative lens
    ├── pharmacy_level_summary.csv       ← NEW: per-pharmacy investigative lens
    ├── patient_level_summary.csv        ← NEW: per-patient behavioral view
    ├── feature_importance_shap.csv
    ├── metrics.json
    ├── schema_report.json
    ├── 01_evaluation_metrics.png
    ├── 02_shap_summary.png
    ├── 03_shap_bar_importance.png
    ├── 04_rule_breakdown.png
    ├── 05_risk_tier_distribution.png
    ├── 06_hcp_summary.png               ← NEW
    ├── 07_pharmacy_summary.png          ← NEW
    └── 08_patient_summary.png           ← NEW

Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Run on Real GSK Data

# Auto-detect everything (recommended)
python run_all_v3.py \
  --data-path data/vendor_file.txt.gz

# CSV file
python run_all_v3.py \
  --data-path data/gsk_copay_transactions.csv \
  --file-type csv

# Gzipped TXT (common GSK format)
python run_all_v3.py \
  --data-path data/GSK_COPAY_TRANSACTION_DAILY_20250725.TXT.GZ \
  --file-type txt.gz

# Force a specific vendor format (skips auto-discovery)
python run_all_v3.py \
  --data-path data/vendor_file.csv \
  --vendor-format gsk_iqvia \
  --contamination 0.03

# Adjust anomaly rate for unusual datasets
python run_all_v3.py \
  --data-path data/vendor_file.csv \
  --contamination 0.05

3. Generate & Test on Synthetic ELAAD Data

# Generate test data
python generate_elaad_test_data.py
# Creates data/elaad_test_trelegy.csv with embedded fraud patterns

# Run pipeline on synthetic data (high contamination needed because ~40% fraud)
python run_all_v3.py \
  --data-path data/elaad_test_trelegy.csv \
  --file-type csv \
  --contamination 0.40

Vendor Format Handling

The Problem

Vendor files rarely match the idealised spec. A column named IQVIA_PATIENT_ID in the spec might appear as:

  • PATIENT_ID (ELAAD format)
  • MEMBER_ID (APLD format)
  • PATIENTID (no underscore)
  • PAT ID (space instead of underscore)
  • Patient ID (mixed case)
  • PATIENT_KEY (different suffix)

The Solution: Schema Discovery

The pipeline uses COLUMN_SYNONYMS in config.py – a dictionary where each internal column name maps to a list of possible raw names (20+ synonyms per column). When a file is loaded:

  1. Scan header → collect all raw column names
  2. Normalize → uppercase, strip whitespace, replace underscores/spaces with single space
  3. Match → for each internal column, try synonyms in order of preference
  4. Report → log what was mapped and what was missing
  5. Continue → pipeline runs with whatever columns are available
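
Steps 2–3 can be sketched in a few lines. The COLUMN_SYNONYMS excerpt below is a toy subset (the real mapping lives in config.py), and discover_schema is an illustrative name, not necessarily the function in data_ingestion.py:

```python
# Sketch of header normalization + ordered synonym matching.
import re

COLUMN_SYNONYMS = {
    "patient_id": ["IQVIA_PATIENT_ID", "PATIENT_ID", "PAT_ID", "MEMBER_ID"],
    "drug_ndc": ["NDC", "DRUG_NDC", "NATIONAL_DRUG_CODE"],
}

def normalize(name: str) -> str:
    # uppercase, trim, collapse underscores/whitespace to single spaces
    return re.sub(r"[_\s]+", " ", name.strip().upper())

def discover_schema(raw_columns):
    normalized = {normalize(c): c for c in raw_columns}
    mapping, missing = {}, []
    for internal, synonyms in COLUMN_SYNONYMS.items():
        for syn in synonyms:  # try synonyms in order of preference
            if normalize(syn) in normalized:
                mapping[internal] = normalized[normalize(syn)]
                break
        else:
            missing.append(internal)
    return mapping, missing

# "Member ID" normalizes to "MEMBER ID" and matches the MEMBER_ID synonym
mapping, missing = discover_schema(["Member ID", "ndc", "PHARMACY_NAME"])
```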

Example Schema Report

2025-04-05 12:00:00 [INFO] Vendor format: detected=generic_csv, requested=auto
2025-04-05 12:00:00 [INFO] Schema discovery: mapped 38 / 50 internal columns
2025-04-05 12:00:00 [INFO] [pharmacy] 6/7 present. Missing: ['pharmacy_subcategory']
2025-04-05 12:00:00 [INFO] [reject] 1/3 present. Missing: ['reject_description', 'reject_type']
2025-04-05 12:00:00 [WARNING] Missing columns (12): ['hcp_id', 'record_type', 'other_coverage', ...]
2025-04-05 12:00:00 [INFO] Feature engineering on 85,432 claims...
2025-04-05 12:00:00 [WARNING] Skipping 'pharmacy_mail_order_pct' - missing mail_order
2025-04-05 12:00:00 [WARNING] Skipping 'prescriber_specialty_valid' - missing prescriber_specialty

Vendor Format Profiles

Profile | Description
auto | Scan file header, detect best match (default)
gsk_iqvia | Expect IQVIA-style column names
generic_csv | Expect generic lower-case names
cms_dmr | CMS Drug Monitoring Report format
elaad_apld | ELAAD/APLD format with MEMBER_ID, HCP_ID, etc.
unknown | No preconceptions, rely fully on synonym matching

Adding a New Vendor Synonym

No code changes needed. Edit config.py → COLUMN_SYNONYMS:

"patient_id": [
    "IQVIA_PATIENT_ID", "PATIENT_ID", "PAT_ID", "MEMBER_ID",
    "NEW_VENDOR_PATIENT_ID",  # ← add your vendor's name here
],

Input Data Schema

The pipeline accepts any of the following formats and auto-discovers columns:

Format | Extension | Auto-detect?
CSV | .csv | Yes
Tab-separated | .txt, .tsv | Yes (scans first line for \t)
Gzipped tab | .txt.gz, .csv.gz | Yes
ZIP archive | .zip | Yes (extracts first CSV/TXT inside)
Excel | .xlsx, .xls | Yes (requires openpyxl/xlrd)
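
The dispatch in the table above can be approximated with pandas alone. This is a simplified sketch of suffix-based dispatch, not the shipped data_ingestion.py (which, per the table, also sniffs the first line for tabs); load_claims is an illustrative name:

```python
# Suffix-based loader dispatch; pandas infers gzip/zip compression
# from the path, so only the separator needs choosing explicitly.
import pandas as pd

def load_claims(path: str) -> pd.DataFrame:
    lower = path.lower()
    if lower.endswith((".xlsx", ".xls")):
        return pd.read_excel(path)  # requires openpyxl / xlrd
    if lower.endswith((".txt", ".tsv", ".txt.gz")):
        # tab-separated; compression inferred from a trailing .gz
        return pd.read_csv(path, sep="\t", compression="infer")
    # .csv, .csv.gz, .zip fall through to the default comma parser
    return pd.read_csv(path, compression="infer")
```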

Key column groups the pipeline looks for:

  • Patient: IQVIA_PATIENT_ID / PATIENT_ID / MEMBER_ID → patient_id
  • HCP: IQVIA_PRESCRIBER_ID / HCP_ID / PRESCRIBER_ID / DOCTOR_ID → prescriber_npi
  • Claim: CLAIM_NUMBER / CLAIM_NUM / CLAIMID → claim_number
  • Drug: NDC / DRUG_NDC / NATIONAL_DRUG_CODE → drug_ndc
  • Financial: COPAY_AFTER_BENEFIT / COPAY_AFTER / OOP_COST → copay_after
  • Pharmacy: PHARMACY_NABP_NUMBER / PHARMACY_ID / STORE_ID → pharmacy_nabp
  • Insurance: PRIMARY_PAYER_BIN / PAYER_BIN / BIN → primary_payer_bin
  • Reject: REJECT_CODE / REJECTION_CODE / DENIAL_CD → reject_code

See config.py::COLUMN_SYNONYMS for the full list of 100+ synonyms.


23 Business Rules

Original 15 Rules

# | Rule | Condition | Fraud Signal
1 | Early Refill | days_between_fills < 23 | Early Refill Abuse
2 | Impossible Qty | quantity != 1 | Data Error / Fraud
3 | Wrong Days Supply | days_supply != 30 | Data Error / Fraud
4 | Govt Insurance | insurance_type == Government | Program Violation
5 | Underage | patient_age < 18 | Program Violation
6 | Duplicate | Same patient + date + pharmacy | Duplicate Billing
7 | NDC Switch | patient_ndc_count > 1 | Strength Switching
8 | Suspicious Specialty | Prescriber not in valid list | Prescriber Collusion
9 | Multi-Program | unique_programs_per_patient > 1 | Card Stacking
10 | Excessive Fills (90d) | patient_fill_count_90d > 4 | Stockpiling
11 | High-Risk Reject | reject_code in {76, 88, 79} | Maximizer / DUR
12 | Maximizer Cap | maximizer_reject == 1 | Benefit Exhaustion
13 | Paper Submission | paper_submission == 1 | Submission Fraud
14 | Plan Switch | plan_switch_flag == 1 | Plan Switching
15 | Linked Claim | has_linked_claim == 1 | Reversal / Adjustment

v3 Hierarchical Rules

# | Rule | Condition | Fraud Signal
16 | HCP High Benefit | hcp_avg_benefit_per_patient > 500 | Prescriber-driven extraction
17 | HCP One-Done Concentration | hcp_one_and_done_pct > 0.6 | HCP with hit-and-run patients
18 | Pharmacy Fraud Risk | pharmacy_fraud_risk_score > 0.6 | Composite pharmacy ring score
19 | Pharmacy HCP Concentration | pharmacy_hcp_concentration > 0.5 | Single HCP dominates pharmacy
20 | Pharmacy One-Done | pharmacy_one_and_done_pct > 0.5 | Pharmacy with churn-and-burn
21 | Short Active Burst | patient_active_duration <= 14 AND total_fills > 1 | Quick-fire multi-fill scheme
22 | Cross-State | patient_state != pharmacy_state | Out-of-state fraud
23 | New Patient Burst | days_since_first <= 7 AND total_fills > 1 | Same-week multiple fills

Rules whose source columns are missing are silently skipped (count = 0).
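A minimal sketch of how a rule row becomes a vectorized flag, including the graceful skip when a source column is absent (apply_rule is an illustrative helper, not necessarily the pipeline's internal API):

```python
# Vectorized rule evaluation with graceful degradation.
import pandas as pd

def apply_rule(df, name, column, predicate):
    if column not in df.columns:
        # silently skipped: the rule contributes a zero flag (count = 0)
        return pd.Series(0, index=df.index, name=name)
    return predicate(df[column]).astype(int).rename(name)

claims = pd.DataFrame({"days_between_fills": [10, 35], "quantity": [1, 3]})

early_refill = apply_rule(claims, "rule_01_early_refill",
                          "days_between_fills", lambda s: s < 23)
bad_qty = apply_rule(claims, "rule_02_impossible_qty",
                     "quantity", lambda s: s != 1)
# insurance_type is missing here, so rule 4 degrades to all-zero flags
skipped = apply_rule(claims, "rule_04_govt_insurance",
                     "insurance_type", lambda s: s == "Government")
```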


Model: Isolation Forest

IsolationForest(
    n_estimators=200,
    contamination=0.03,        # Adjustable via CLI (use 0.40 for synthetic test data)
    max_samples="auto",
    max_features=1.0,
    bootstrap=False,
    random_state=42,
    n_jobs=-1,
)

Training strategy: Train ONLY on claims where rule_flag == 0 (rule-clean). Score ALL claims.

Degraded mode: If zero rule-clean claims exist (e.g., synthetic data with 40% fraud), the model trains on ALL claims with contamination=min(orig×3, 0.5). This is a safety fallback – real production data is expected to contain rule-clean claims.
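
The train-on-clean / score-all strategy, including the degraded-mode fallback, can be sketched as follows (fit_and_score is an illustrative name; scaling and encoding are omitted):

```python
# Fit on rule-clean claims only, then score everything.
import numpy as np
from sklearn.ensemble import IsolationForest

def fit_and_score(X, rule_flag, contamination=0.03):
    clean = X[rule_flag == 0]
    if len(clean) == 0:  # degraded mode for heavily contaminated data
        clean, contamination = X, min(contamination * 3, 0.5)
    model = IsolationForest(n_estimators=200, contamination=contamination,
                            random_state=42, n_jobs=-1).fit(clean)
    # score_samples: higher = more normal, so negate for an anomaly score
    return -model.score_samples(X)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 3)), [[8.0, 8.0, 8.0]]])  # 1 planted outlier
rule_flag = np.zeros(len(X))
rule_flag[-1] = 1  # the outlier is rule-flagged, so it is excluded from training
scores = fit_and_score(X, rule_flag)
```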

Priority score:

priority_score = 0.50 * if_anomaly_score + 0.30 * rule_severity + 0.20 * rule_flag

Risk Tiers:

Tier | Score | Action
Low | 0.0–0.3 | No action
Medium | 0.3–0.6 | Monitor
High | 0.6–0.8 | Investigate (investigation queue)
Critical | 0.8–1.0 | Immediate investigation + audit trail
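
A minimal sketch of the priority blend and the tier cut-offs; the three inputs are assumed pre-scaled to [0, 1], and the boundary handling (upper bound inclusive via >=) is an assumption, not confirmed from the pipeline code:

```python
# Priority blend and tier binning, per the formula and table above.
def priority_score(if_anomaly_score: float, rule_severity: float,
                   rule_flag: int) -> float:
    return 0.50 * if_anomaly_score + 0.30 * rule_severity + 0.20 * rule_flag

def risk_tier(score: float) -> str:
    if score >= 0.8:
        return "Critical"
    if score >= 0.6:
        return "High"
    if score >= 0.3:
        return "Medium"
    return "Low"

s = priority_score(0.9, 0.8, 1)  # 0.45 + 0.24 + 0.20
```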

Outputs

Core Outputs

File | Type | Description
investigation_queue_top500.csv | CSV | Top 500 highest-priority claims for manual review
scored_claims_full.csv | CSV | All claims with scores + risk tiers
feature_importance_shap.csv | CSV | SHAP ranking of features
metrics.json | JSON | All evaluation metrics + hierarchical summary counts
schema_report.json | JSON | Column mapping audit trail

NEW: Hierarchical Summaries (v3)

File | Type | Description
transaction_level_summary.csv | CSV | Per-claim analytical view (27 columns)
hcp_level_summary.csv | CSV | Per-HCP investigative lens (avg scores, specialty, concentration)
pharmacy_level_summary.csv | CSV | Per-pharmacy investigative lens (fraud risk score, HCP conc, one-done %)
patient_level_summary.csv | CSV | Per-patient behavioral view (one-and-done, active duration, gap stats)

Visualizations

File | Type | Description
01_evaluation_metrics.png | PNG | Score distribution, ROC, PR, tier counts
02_shap_summary.png | PNG | SHAP beeswarm (top 20 features)
03_shap_bar_importance.png | PNG | SHAP bar chart
04_rule_breakdown.png | PNG | Flagged claims by each rule
05_risk_tier_distribution.png | PNG | Tier counts + fraud rates
06_hcp_summary.png | PNG | HCP risk score distribution + benefit vs risk scatter
07_pharmacy_summary.png | PNG | Pharmacy risk score distribution + fraud risk vs priority
08_patient_summary.png | PNG | Patient risk score distribution + one-and-done rate

Model Artifacts

File | Type | Description
model/isolation_forest_model.pkl | PKL | Trained model
model/scaler.pkl | PKL | StandardScaler
model/encoder.pkl | PKL | OrdinalEncoder
model/feature_names.pkl | PKL | Feature name list

Configuration: Drug-Agnostic

Edit config.py to configure for any GSK product:

PRODUCT_CONFIG = {
    "product_name": "Trelegy Ellipta",
    "days_supply_expected": 30,
    "quantity_expected": 1,
    "ndc_list": {
        "00173089314": {"strength": "100/62.5/25", "indication": "COPD/Asthma"},
        "00173088714": {"strength": "200/62.5/25", "indication": "Asthma"},
    },
    "valid_prescriber_specialties": ["Pulmonology", "Allergy/Immunology", ...],
    "suspicious_prescriber_specialties": ["Dermatology", "Orthopedics", ...],
    "early_refill_threshold_days": 23,
    "max_fills_90d": 4,
    # HCP thresholds
    "hcp_high_benefit_threshold": 500.0,
    "hcp_patient_concentration_threshold": 0.6,
    # Pharmacy thresholds
    "pharmacy_high_benefit_threshold": 450.0,
    "pharmacy_hcp_concentration_threshold": 0.5,
    "pharmacy_one_done_threshold": 0.5,
    # Patient thresholds
    "patient_gap_short_threshold": 15,
    "patient_gap_long_threshold": 60,
    "patient_max_active_duration_days": 180,
    ...
}

Tech Stack

  • Python 3.9+
  • pandas ≥ 2.0.0, numpy ≥ 1.24.0, scikit-learn ≥ 1.3.0
  • shap ≥ 0.42.0, matplotlib ≥ 3.7.0, seaborn ≥ 0.12.0, joblib ≥ 1.3.0, pyarrow ≥ 12.0.0

License

Proprietary – GSK internal use.

Generated by ML Intern

This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'Harsh2396/gsk-copay-fraud-detection'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

For non-causal architectures, replace AutoModelForCausalLM with the appropriate AutoModel class.
