GSK Copay Card Fraud Detection System – v4 Group-Aware
Product Focus: Trelegy Ellipta (configurable for Nucala / any GSK product)
Methodology: Hybrid Rules + Isolation Forest + SHAP Explainability + Hierarchical Summaries + Group-Aware Benefit Validation
Architecture: Drug-agnostic, ground-truth-optional, vendor-format-agnostic, production-ready
Analytical Levels: Transaction → Patient → HCP → Pharmacy
Group-Aware Validation: Group 8141 (Legacy) vs Group 8200 / 2025 benefit designs
Overview
This system detects fraudulent and suspicious copay card claims in GSK pharmaceutical transaction data using a 4-level hierarchical analytical framework:
| Level | What It Detects | Key Features |
|---|---|---|
| Transaction | Per-claim anomalies | Gap between fills, quantity, days supply, benefit amount, OOP cost, NDC switch |
| Patient | Behavioral patterns | One-and-done patients, active duration, avg gap between fills, short/long gap % |
| HCP | Prescriber-driven fraud | Suspicious specialty, one-and-done %, patient concentration, avg benefit per patient |
| Pharmacy | Pharmacy-centric rings | Active/closed flag, HCP concentration, one-and-done %, avg benefit, fraud risk score |
Under the hood, the system combines:
- 23 hard-coded business rules (15 original + 8 v3 hierarchical rules)
- Isolation Forest unsupervised anomaly detection trained on rule-clean data
- SHAP explainability for every flagged claim
- Hierarchical summary exports for investigative lens views
The pipeline supports any vendor format (ELAAD, APLD, IQVIA, CMS DMR, generic CSV). It auto-discovers column names via synonym mapping, handles missing columns gracefully, and produces a schema report showing what it found and what it missed.
Hierarchical Features (v3)
Transaction Level
- `days_between_fills` → gap from last fill
- `early_refill_flag` → refill before 75% of expected days supply consumed
- `quantity_anomaly` → quantity != 1 (Trelegy is 1 inhaler per fill)
- `days_supply_anomaly` → days_supply != 30
- `govt_insurance_flag` → Medicare/Medicaid/Tricare/VA (program violation)
- `ndc_switch_flag` → patient has filled multiple NDCs (strength switching)
- `cross_state_fill` → pharmacy state != inferred patient state
- `benefit_ratio` → benefit_amount / usual_customary
- `transaction_benefit_score` → z-score of benefit relative to population
- `transaction_oop_score` → z-score of OOP cost relative to population
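The gap and early-refill features above can be sketched in a few lines of pandas. This is a minimal illustration, not the pipeline's actual code; the column names (`patient_id`, `fill_date`, `days_supply`) follow the internal schema described later in this README, and the tiny DataFrame is made-up sample data:

```python
import pandas as pd

# Hypothetical sample claims; real data comes from data_ingestion.py
claims = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P2"],
    "fill_date": pd.to_datetime(["2025-01-01", "2025-01-20", "2025-02-25", "2025-01-10"]),
    "days_supply": [30, 30, 30, 30],
})

claims = claims.sort_values(["patient_id", "fill_date"])
# Gap from the previous fill for the same patient (NaN for the first fill)
claims["days_between_fills"] = claims.groupby("patient_id")["fill_date"].diff().dt.days
# Early refill: next fill arrives before 75% of the prior days supply is consumed
claims["early_refill_flag"] = (
    claims["days_between_fills"]
    < 0.75 * claims.groupby("patient_id")["days_supply"].shift()
).astype(int)
```

Here the second P1 fill lands 19 days after the first, which is under the 22.5-day (75% of 30) threshold, so it gets flagged.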
Patient Level
- `patient_one_and_done` → patient has exactly 1 fill ever (hit-and-run pattern)
- `patient_active_duration` → days from first to last fill (short = burst)
- `patient_avg_gap` → average gap between fills per patient
- `patient_short_gap_pct` → % of fills with gap < 15 days (excessive frequency)
- `patient_long_gap_pct` → % of fills with gap > 60 days (irregular adherence)
- `unique_pharmacies_overall` → number of different pharmacies used
- `unique_prescribers_per_patient` → number of different HCPs seen
- `total_fills_per_patient` → total fill count
- `patient_fill_count_7d/30d/90d` → rolling fill counts
HCP Level
- `hcp_suspicious_specialty` → 1 if specialty not in valid list (Dermatology, Orthopedics, etc.)
- `hcp_one_and_done_pct` → % of patients with only 1 fill from this HCP
- `hcp_patient_concentration` → % of claims from top patient (high = patient farming)
- `hcp_avg_benefit_per_patient` → average benefit per unique patient
- `hcp_max_benefit_per_patient` → highest benefit concentrated on a single patient
- `hcp_std_benefit` → variance in benefit amounts (unusually consistent amounts are suspicious)
- `hcp_unique_pharmacies` → number of pharmacies this HCP writes for
- `hcp_total_claims` / `hcp_unique_patients` / `hcp_patient_share`
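As an illustration of one HCP-level aggregate, `hcp_one_and_done_pct` is just the share of an HCP's patients with exactly one fill. The sketch below is a hypothetical reconstruction on made-up data, not the pipeline's actual code:

```python
import pandas as pd

# Hypothetical claims: H1 has one one-and-done patient (P1) and one repeat
# patient (P2); both of H2's patients are one-and-done
claims = pd.DataFrame({
    "prescriber_npi": ["H1", "H1", "H1", "H2", "H2"],
    "patient_id":     ["P1", "P2", "P2", "P3", "P4"],
})

# Fills per (HCP, patient) pair, then the fraction of patients with exactly 1
fills_per_patient = claims.groupby(["prescriber_npi", "patient_id"]).size()
hcp_one_and_done_pct = (fills_per_patient == 1).groupby("prescriber_npi").mean()
```

For this sample, H1 scores 0.5 and H2 scores 1.0; rule 17 would flag any HCP above the 0.6 threshold.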
Pharmacy Level
- `pharmacy_active_flag` → 0 if inactive/closed (suspicious if still claiming)
- `pharmacy_fraud_risk_score` → composite: 0.25×reject_rate + 0.20×paper_rate + 0.30×hcp_conc + 0.25×one_done
- `pharmacy_hcp_concentration` → % of claims from top HCP (high = HCP-driven ring)
- `pharmacy_one_and_done_pct` → % of patients with only 1 fill (churn-and-burn)
- `pharmacy_avg_benefit` / `pharmacy_total_benefit` / `pharmacy_max_benefit_per_patient`
- `pharmacy_reject_rate` / `pharmacy_paper_submission_rate`
- `pharmacy_unique_hcps` → number of distinct prescribers at this pharmacy
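The composite `pharmacy_fraud_risk_score` is a straight weighted sum of the four component rates. A minimal sketch, assuming each input rate is already normalized to [0, 1]:

```python
def pharmacy_fraud_risk_score(reject_rate: float, paper_rate: float,
                              hcp_conc: float, one_done_pct: float) -> float:
    """Weighted composite of four pharmacy risk rates, each in [0, 1]."""
    return (0.25 * reject_rate
            + 0.20 * paper_rate
            + 0.30 * hcp_conc
            + 0.25 * one_done_pct)

# Example: moderate rejects, low paper rate, one dominant HCP, high churn
score = pharmacy_fraud_risk_score(0.4, 0.1, 0.8, 0.6)  # -> 0.51
```

Because the weights sum to 1.0, the composite also stays in [0, 1]; rule 18 flags pharmacies scoring above 0.6.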
Project Structure
gsk_copay_fraud/
├── config.py                      # Product config + column synonym mappings + FEATURE_DEPENDENCIES
├── data_ingestion.py              # Schema discovery + vendor-agnostic ingestion
├── feature_engineering_v2.py      # 60+ features with graceful degradation
├── fraud_detection_pipeline_v3.py # Full pipeline (23 rules + IF + SHAP + hierarchical summaries)
├── run_all_v3.py                  # CLI runner
├── generate_elaad_test_data.py    # Synthetic ELAAD-style test data generator with embedded fraud
├── requirements.txt               # Dependencies
├── README.md                      # This file
├── data/                          # Place raw GSK data here
│   └── elaad_test_trelegy.csv
└── results/                       # All outputs generated here
    ├── investigation_queue_top500.csv
    ├── scored_claims_full.csv
    ├── transaction_level_summary.csv   ← NEW: per-claim analytical view
    ├── hcp_level_summary.csv           ← NEW: per-HCP investigative lens
    ├── pharmacy_level_summary.csv      ← NEW: per-pharmacy investigative lens
    ├── patient_level_summary.csv       ← NEW: per-patient behavioral view
    ├── feature_importance_shap.csv
    ├── metrics.json
    ├── schema_report.json
    ├── 01_evaluation_metrics.png
    ├── 02_shap_summary.png
    ├── 03_shap_bar_importance.png
    ├── 04_rule_breakdown.png
    ├── 05_risk_tier_distribution.png
    ├── 06_hcp_summary.png              ← NEW
    ├── 07_pharmacy_summary.png         ← NEW
    └── 08_patient_summary.png          ← NEW
Quick Start
1. Install Dependencies
pip install -r requirements.txt
2. Run on Real GSK Data
# Auto-detect everything (recommended)
python run_all_v3.py \
--data-path data/vendor_file.txt.gz
# CSV file
python run_all_v3.py \
--data-path data/gsk_copay_transactions.csv \
--file-type csv
# Gzipped TXT (common GSK format)
python run_all_v3.py \
--data-path data/GSK_COPAY_TRANSACTION_DAILY_20250725.TXT.GZ \
--file-type txt.gz
# Force a specific vendor format (skips auto-discovery)
python run_all_v3.py \
--data-path data/vendor_file.csv \
--vendor-format gsk_iqvia \
--contamination 0.03
# Adjust anomaly rate for unusual datasets
python run_all_v3.py \
--data-path data/vendor_file.csv \
--contamination 0.05
3. Generate & Test on Synthetic ELAAD Data
# Generate test data
python generate_elaad_test_data.py
# Creates data/elaad_test_trelegy.csv with embedded fraud patterns
# Run pipeline on synthetic data (high contamination needed because ~40% fraud)
python run_all_v3.py \
--data-path data/elaad_test_trelegy.csv \
--file-type csv \
--contamination 0.40
Vendor Format Handling
The Problem
Vendor files rarely match the idealised spec. A column named IQVIA_PATIENT_ID in the spec might appear as:
- `PATIENT_ID` (ELAAD format)
- `MEMBER_ID` (APLD format)
- `PATIENTID` (no underscore)
- `PAT ID` (space instead of underscore)
- `Patient ID` (mixed case)
- `PATIENT_KEY` (different suffix)
The Solution: Schema Discovery
The pipeline uses `COLUMN_SYNONYMS` in `config.py`: a dictionary where each internal column name maps to a list of possible raw names (20+ synonyms per column). When a file is loaded:
- Scan header → collect all raw column names
- Normalize → uppercase, strip whitespace, replace underscores/spaces with a single space
- Match → for each internal column, try synonyms in order of preference
- Report → log what was mapped and what was missing
- Continue → pipeline runs with whatever columns are available
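The discovery steps can be sketched as below. This is a simplified illustration, not the actual `data_ingestion.py` code, and the `COLUMN_SYNONYMS` dict here is a tiny hypothetical excerpt of the real mapping:

```python
# Hypothetical two-entry excerpt; the real mapping lives in config.py
COLUMN_SYNONYMS = {
    "patient_id": ["IQVIA_PATIENT_ID", "PATIENT_ID", "PAT_ID", "MEMBER_ID"],
    "reject_code": ["REJECT_CODE", "REJECTION_CODE", "DENIAL_CD"],
}

def normalize(name: str) -> str:
    # Uppercase, trim, and collapse underscores/spaces to a single space
    return " ".join(name.upper().replace("_", " ").split())

def discover_schema(raw_columns):
    normalized = {normalize(c): c for c in raw_columns}
    mapped, missing = {}, []
    for internal, synonyms in COLUMN_SYNONYMS.items():
        for syn in synonyms:          # try synonyms in order of preference
            if normalize(syn) in normalized:
                mapped[internal] = normalized[normalize(syn)]
                break
        else:
            missing.append(internal)  # report, then continue without it
    return mapped, missing

mapped, missing = discover_schema(["Pat ID", "NDC", "qty"])
# "Pat ID" normalizes to "PAT ID" and matches the PAT_ID synonym
```

This is why `PAT ID`, `Patient ID`, and `PAT_ID` all land on the same internal `patient_id` column.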
Example Schema Report
2025-04-05 12:00:00 [INFO] Vendor format: detected=generic_csv, requested=auto
2025-04-05 12:00:00 [INFO] Schema discovery: mapped 38 / 50 internal columns
2025-04-05 12:00:00 [INFO] [pharmacy] 6/7 present. Missing: ['pharmacy_subcategory']
2025-04-05 12:00:00 [INFO] [reject] 1/3 present. Missing: ['reject_description', 'reject_type']
2025-04-05 12:00:00 [WARNING] Missing columns (12): ['hcp_id', 'record_type', 'other_coverage', ...]
2025-04-05 12:00:00 [INFO] Feature engineering on 85,432 claims...
2025-04-05 12:00:00 [WARNING] Skipping 'pharmacy_mail_order_pct' β missing mail_order
2025-04-05 12:00:00 [WARNING] Skipping 'prescriber_specialty_valid' β missing prescriber_specialty
Vendor Format Profiles
| Profile | Description |
|---|---|
| `auto` | Scan file header, detect best match (default) |
| `gsk_iqvia` | Expect IQVIA-style column names |
| `generic_csv` | Expect generic lower-case names |
| `cms_dmr` | CMS Drug Monitoring Report format |
| `elaad_apld` | ELAAD/APLD format with MEMBER_ID, HCP_ID, etc. |
| `unknown` | No preconceptions; rely fully on synonym matching |
Adding a New Vendor Synonym
No code changes needed. Edit `COLUMN_SYNONYMS` in `config.py`:
"patient_id": [
"IQVIA_PATIENT_ID", "PATIENT_ID", "PAT_ID", "MEMBER_ID",
"NEW_VENDOR_PATIENT_ID", # β add your vendor's name here
],
Input Data Schema
The pipeline accepts any of the following formats and auto-discovers columns:
| Format | Extension | Auto-detect? |
|---|---|---|
| CSV | `.csv` | Yes |
| Tab-separated | `.txt`, `.tsv` | Yes (scans first line for `\t`) |
| Gzipped tab | `.txt.gz`, `.csv.gz` | Yes |
| ZIP archive | `.zip` | Yes (extracts first CSV/TXT inside) |
| Excel | `.xlsx`, `.xls` | Yes (requires openpyxl/xlrd) |
Key column groups the pipeline looks for:
- Patient: `IQVIA_PATIENT_ID` / `PATIENT_ID` / `MEMBER_ID` → `patient_id`
- HCP: `IQVIA_PRESCRIBER_ID` / `HCP_ID` / `PRESCRIBER_ID` / `DOCTOR_ID` → `prescriber_npi`
- Claim: `CLAIM_NUMBER` / `CLAIM_NUM` / `CLAIMID` → `claim_number`
- Drug: `NDC` / `DRUG_NDC` / `NATIONAL_DRUG_CODE` → `drug_ndc`
- Financial: `COPAY_AFTER_BENEFIT` / `COPAY_AFTER` / `OOP_COST` → `copay_after`
- Pharmacy: `PHARMACY_NABP_NUMBER` / `PHARMACY_ID` / `STORE_ID` → `pharmacy_nabp`
- Insurance: `PRIMARY_PAYER_BIN` / `PAYER_BIN` / `BIN` → `primary_payer_bin`
- Reject: `REJECT_CODE` / `REJECTION_CODE` / `DENIAL_CD` → `reject_code`
See config.py::COLUMN_SYNONYMS for the full list of 100+ synonyms.
23 Business Rules
Original 15 Rules
| # | Rule | Condition | Fraud Signal |
|---|---|---|---|
| 1 | Early Refill | `days_between_fills < 23` | Early Refill Abuse |
| 2 | Impossible Qty | `quantity != 1` | Data Error / Fraud |
| 3 | Wrong Days Supply | `days_supply != 30` | Data Error / Fraud |
| 4 | Govt Insurance | `insurance_type == Government` | Program Violation |
| 5 | Underage | `patient_age < 18` | Program Violation |
| 6 | Duplicate | Same patient + date + pharmacy | Duplicate Billing |
| 7 | NDC Switch | `patient_ndc_count > 1` | Strength Switching |
| 8 | Suspicious Specialty | Prescriber not in valid list | Prescriber Collusion |
| 9 | Multi-Program | `unique_programs_per_patient > 1` | Card Stacking |
| 10 | Excessive Fills (90d) | `patient_fill_count_90d > 4` | Stockpiling |
| 11 | High-Risk Reject | `reject_code in {76, 88, 79}` | Maximizer / DUR |
| 12 | Maximizer Cap | `maximizer_reject == 1` | Benefit Exhaustion |
| 13 | Paper Submission | `paper_submission == 1` | Submission Fraud |
| 14 | Plan Switch | `plan_switch_flag == 1` | Plan Switching |
| 15 | Linked Claim | `has_linked_claim == 1` | Reversal / Adjustment |
v3 Hierarchical Rules
| # | Rule | Condition | Fraud Signal |
|---|---|---|---|
| 16 | HCP High Benefit | `hcp_avg_benefit_per_patient > 500` | Prescriber-driven extraction |
| 17 | HCP One-Done Concentration | `hcp_one_and_done_pct > 0.6` | HCP with hit-and-run patients |
| 18 | Pharmacy Fraud Risk | `pharmacy_fraud_risk_score > 0.6` | Composite pharmacy ring score |
| 19 | Pharmacy HCP Concentration | `pharmacy_hcp_concentration > 0.5` | Single HCP dominates pharmacy |
| 20 | Pharmacy One-Done | `pharmacy_one_and_done_pct > 0.5` | Pharmacy with churn-and-burn |
| 21 | Short Active Burst | `patient_active_duration <= 14 AND total_fills > 1` | Quick-fire multi-fill scheme |
| 22 | Cross-State | `patient_state != pharmacy_state` | Out-of-state fraud |
| 23 | New Patient Burst | `days_since_first <= 7 AND total_fills > 1` | Same-week multiple fills |
Rules whose source columns are missing are silently skipped (count = 0).
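The skip-if-missing behavior can be sketched with a small guard: each rule declares the columns it needs and returns an all-zero flag when any are absent. This is a hypothetical illustration of the pattern, not the pipeline's actual rule engine:

```python
import pandas as pd

def apply_rule(df: pd.DataFrame, needed_cols, condition) -> pd.Series:
    """Evaluate a rule, or return all zeros if its source columns are missing."""
    if not all(c in df.columns for c in needed_cols):
        return pd.Series(0, index=df.index)  # silently skipped: count = 0
    return condition(df).astype(int)

# Hypothetical claims file that has quantity but no days_supply column
claims = pd.DataFrame({"quantity": [1, 3, 1]})

qty_flag = apply_rule(claims, ["quantity"], lambda d: d["quantity"] != 1)
supply_flag = apply_rule(claims, ["days_supply"], lambda d: d["days_supply"] != 30)
# qty_flag fires on the second claim; supply_flag is all zeros (skipped)
```

The guard keeps the pipeline running on degraded vendor files rather than crashing on a missing column.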
Model: Isolation Forest
IsolationForest(
n_estimators=200,
contamination=0.03, # Adjustable via CLI (use 0.40 for synthetic test data)
max_samples="auto",
max_features=1.0,
bootstrap=False,
random_state=42,
n_jobs=-1,
)
Training strategy: Train ONLY on claims where rule_flag == 0 (rule-clean). Score ALL claims.
Degraded mode: If zero rule-clean claims exist (e.g., synthetic data with 40% fraud), the model trains on ALL claims with contamination=min(orig×3, 0.5). This is a safety fallback; real production data will always have rule-clean claims.
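A minimal sketch of the train-on-rule-clean strategy and its fallback, using random stand-in arrays for the feature matrix and rule flags (the real pipeline feeds in the engineered features):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in data: 500 claims, 4 features, ~10% rule-flagged
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
rule_flag = (rng.random(500) < 0.1).astype(int)

contamination = 0.03
clean = X[rule_flag == 0]
if len(clean) > 0:
    # Normal mode: fit only on rule-clean claims
    model = IsolationForest(n_estimators=200, contamination=contamination,
                            random_state=42, n_jobs=-1).fit(clean)
else:
    # Degraded mode: no rule-clean claims, fit on everything
    model = IsolationForest(n_estimators=200,
                            contamination=min(contamination * 3, 0.5),
                            random_state=42, n_jobs=-1).fit(X)

# score_samples returns lower values for anomalies, so negate it
# to get "higher = more anomalous", then score ALL claims
if_anomaly_score = -model.score_samples(X)
```

Fitting on rule-clean claims keeps known-bad patterns out of the model's notion of "normal", so rule violators and rule-evading anomalies both score high at inference time.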
Priority score:
priority_score = 0.50 * if_anomaly_score + 0.30 * rule_severity + 0.20 * rule_flag
Risk Tiers:
| Tier | Score | Action |
|---|---|---|
| Low | 0.0β0.3 | No action |
| Medium | 0.3β0.6 | Monitor |
| High | 0.6β0.8 | Investigate (investigation queue) |
| Critical | 0.8β1.0 | Immediate investigation + audit trail |
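The priority formula and tier cut-points above combine into a few lines of pandas. A sketch on made-up scores, assuming the three inputs are already scaled to [0, 1]:

```python
import pandas as pd

# Hypothetical scored claims; inputs assumed pre-scaled to [0, 1]
scored = pd.DataFrame({
    "if_anomaly_score": [0.1, 0.9, 0.5],
    "rule_severity":    [0.0, 1.0, 0.4],
    "rule_flag":        [0,   1,   1],
})

scored["priority_score"] = (0.50 * scored["if_anomaly_score"]
                            + 0.30 * scored["rule_severity"]
                            + 0.20 * scored["rule_flag"])

scored["risk_tier"] = pd.cut(scored["priority_score"],
                             bins=[0.0, 0.3, 0.6, 0.8, 1.0],
                             labels=["Low", "Medium", "High", "Critical"],
                             include_lowest=True)
```

The three sample claims land at 0.05 (Low), 0.95 (Critical), and 0.57 (Medium); sorting by `priority_score` descending yields the investigation queue.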
Outputs
Core Outputs
| File | Type | Description |
|---|---|---|
| `investigation_queue_top500.csv` | CSV | Top 500 highest-priority claims for manual review |
| `scored_claims_full.csv` | CSV | All claims with scores + risk tiers |
| `feature_importance_shap.csv` | CSV | SHAP ranking of features |
| `metrics.json` | JSON | All evaluation metrics + hierarchical summary counts |
| `schema_report.json` | JSON | Column mapping audit trail |
NEW: Hierarchical Summaries (v3)
| File | Type | Description |
|---|---|---|
| `transaction_level_summary.csv` | CSV | Per-claim analytical view (27 columns) |
| `hcp_level_summary.csv` | CSV | Per-HCP investigative lens (avg scores, specialty, concentration) |
| `pharmacy_level_summary.csv` | CSV | Per-pharmacy investigative lens (fraud risk score, HCP concentration, one-done %) |
| `patient_level_summary.csv` | CSV | Per-patient behavioral view (one-and-done, active duration, gap stats) |
Visualizations
| File | Type | Description |
|---|---|---|
| `01_evaluation_metrics.png` | PNG | Score distribution, ROC, PR, tier counts |
| `02_shap_summary.png` | PNG | SHAP beeswarm (top 20 features) |
| `03_shap_bar_importance.png` | PNG | SHAP bar chart |
| `04_rule_breakdown.png` | PNG | Flagged claims by each rule |
| `05_risk_tier_distribution.png` | PNG | Tier counts + fraud rates |
| `06_hcp_summary.png` | PNG | HCP risk score distribution + benefit vs risk scatter |
| `07_pharmacy_summary.png` | PNG | Pharmacy risk score distribution + fraud risk vs priority |
| `08_patient_summary.png` | PNG | Patient risk score distribution + one-and-done rate |
Model Artifacts
| File | Type | Description |
|---|---|---|
| `model/isolation_forest_model.pkl` | PKL | Trained model |
| `model/scaler.pkl` | PKL | StandardScaler |
| `model/encoder.pkl` | PKL | OrdinalEncoder |
| `model/feature_names.pkl` | PKL | Feature name list |
Configuration: Drug-Agnostic
Edit config.py to configure for any GSK product:
PRODUCT_CONFIG = {
"product_name": "Trelegy Ellipta",
"days_supply_expected": 30,
"quantity_expected": 1,
"ndc_list": {
"00173089314": {"strength": "100/62.5/25", "indication": "COPD/Asthma"},
"00173088714": {"strength": "200/62.5/25", "indication": "Asthma"},
},
"valid_prescriber_specialties": ["Pulmonology", "Allergy/Immunology", ...],
"suspicious_prescriber_specialties": ["Dermatology", "Orthopedics", ...],
"early_refill_threshold_days": 23,
"max_fills_90d": 4,
# HCP thresholds
"hcp_high_benefit_threshold": 500.0,
"hcp_patient_concentration_threshold": 0.6,
# Pharmacy thresholds
"pharmacy_high_benefit_threshold": 450.0,
"pharmacy_hcp_concentration_threshold": 0.5,
"pharmacy_one_done_threshold": 0.5,
# Patient thresholds
"patient_gap_short_threshold": 15,
"patient_gap_long_threshold": 60,
"patient_max_active_duration_days": 180,
...
}
Tech Stack
- Python 3.9+
- pandas β₯ 2.0.0, numpy β₯ 1.24.0, scikit-learn β₯ 1.3.0
- shap β₯ 0.42.0, matplotlib β₯ 3.7.0, seaborn β₯ 0.12.0, joblib β₯ 1.3.0, pyarrow β₯ 12.0.0
License
Proprietary β GSK internal use.
Generated by ML Intern
This model repository was generated by ML Intern, an agent for machine learning research and development on the Hugging Face Hub.
- Try ML Intern: https://smolagents-ml-intern.hf.space
- Source code: https://github.com/huggingface/ml-intern