# Healthcare Fraud Detection with GRPO
> ⚠️ **PRELIMINARY MODEL - 25 Training Steps Only**
>
> This model is a proof of concept for healthcare claims fraud detection using Group Relative Policy Optimization (GRPO). It has been trained for only 25 steps and requires additional training (200-500+ steps recommended) before production use.
## Model Details

- **Base Model:** Qwen/Qwen2.5-0.5B-Instruct (corrected from Llama 3.2 1B)
- **Training Method:** GRPO (Group Relative Policy Optimization)
- **Training Steps:** 25 (preliminary)
- **Reward Improvement:** +38% (from -40.74 to -25.35)
- **Initial Entropy:** 2.27 (high uncertainty)
- **Hardware:** NVIDIA RTX 4060 (8 GB)
- **Quantization:** 4-bit with LoRA adapters
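For context on the initial-entropy figure: 2.27 nats is close to the entropy of a uniform distribution over about 10 options (ln 10 ≈ 2.30), i.e. the untrained policy is nearly indifferent among its choices. A minimal sketch of the entropy calculation (the probabilities below are illustrative, not taken from the model):

```python
import math

def entropy(probs):
    """Shannon entropy in nats: H = -sum(p * ln p); zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A uniform distribution over 10 options gives H = ln(10) ≈ 2.30,
# close to the reported initial entropy of 2.27.
uniform = [0.1] * 10
print(round(entropy(uniform), 2))  # → 2.3
```

As training progresses and the policy concentrates probability on fewer actions, this value falls, which is what the "entropy decreasing" observation below refers to.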
## Environment: Healthcare Claims Fraud Detection
An OpenEnv-compatible RL environment with:
- **LLM-Native Actions:** text generation (Decision + Rationale + Evidence + Recommendation)
- **Multi-Component Reward:** Decision (40%) + Rationale (30%) + Evidence (20%) + Efficiency (10%)
- **Hybrid Data:** 60% Synthea synthetic + 40% CMS SynPUF real Medicare claims
- **8 Fraud Patterns:** upcoding, unbundling, phantom billing, etc.
- **A2A Protocol:** compatible with Green Agents / AgentBeats
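The multi-component reward can be sketched as a weighted sum. The component names and weights come from the list above; the function name and the per-component scores are hypothetical placeholders, not the environment's actual implementation:

```python
# Weights from the environment description above.
WEIGHTS = {"decision": 0.4, "rationale": 0.3, "evidence": 0.2, "efficiency": 0.1}

def combined_reward(scores: dict) -> float:
    """Weighted sum of per-component scores; each score is assumed to lie in [-1, 1],
    so the combined reward is also bounded in [-1, 1]. Missing components score 0."""
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)

# Example: correct decision, decent rationale, weak evidence, verbose output.
reward = combined_reward(
    {"decision": 1.0, "rationale": 0.5, "evidence": -0.2, "efficiency": -0.5}
)
print(round(reward, 2))
```

Because the decision component dominates (40%), a wrong decision cannot be fully compensated by good rationale or evidence, which matches the intent of a fraud-screening reward.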
## Usage
```python
from transformers import pipeline

# Load the fine-tuned model
generator = pipeline(
    "text-generation",
    model="shylane/healthcare-fraud-detection-grpo",
    device=0,
)

# Generate a fraud assessment
prompt = '''Assess this healthcare claim for fraud:
Provider: Dr. Smith
Service: MRI Scan
Amount: $2,500
...'''

result = generator(prompt, max_new_tokens=300, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```
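Since the model is trained to emit a structured assessment (Decision + Rationale + Evidence + Recommendation), downstream code will typically want to parse it. The parser below is a hypothetical sketch that assumes one `Field: value` pair per line; the exact output format depends on the training prompts:

```python
import re

def parse_assessment(text: str) -> dict:
    """Extract the structured fields from a generated fraud assessment.
    Assumes each field appears on its own line as 'Field: value'."""
    fields = ("Decision", "Rationale", "Evidence", "Recommendation")
    parsed = {}
    for field in fields:
        match = re.search(rf"^{field}:\s*(.+)$", text, flags=re.MULTILINE)
        if match:
            parsed[field.lower()] = match.group(1).strip()
    return parsed

# Illustrative model output (not a real generation).
sample = """Decision: FLAG for review
Rationale: Billed amount is far above the typical range for this service.
Evidence: $2,500 vs. a typical $800-1,200 for an MRI scan.
Recommendation: Route to manual audit."""

print(parse_assessment(sample)["decision"])  # → FLAG for review
```

Fields the model omits are simply absent from the returned dict, so callers should treat a missing `decision` key as a failed generation rather than a clean claim.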
## Training Results

**Preliminary (25 steps):**

- Shows a +38% reward improvement over the random baseline
- Demonstrates that the environment provides a meaningful learning signal
- Entropy is decreasing (the model is becoming more confident)
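The +38% figure is the reward gain relative to the magnitude of the starting reward, which can be checked directly from the numbers reported in Model Details:

```python
# Reward improvement from -40.74 (initial) to -25.35 (after 25 steps),
# expressed relative to the magnitude of the initial reward.
initial, final = -40.74, -25.35
improvement = (final - initial) / abs(initial)
print(f"{improvement:.0%}")  # → 38%
```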
**For production:**

- Requires 200-500+ additional training steps
- Consider a larger model (1B-3B parameters) for complex reasoning
- Multi-GPU training is recommended for faster convergence
## Limitations

⚠️ This is **not** a production-ready model:

- Only 25 training steps completed
- High initial entropy (2.27)
- Rewards are still negative on average
- Small model size (0.5B parameters)
- Limited to synthetic/hybrid data
## Challenge Entry

This model is an entry in the OpenEnv Student Challenge 2026, sponsored by Meta PyTorch, Hugging Face, and Unsloth AI.

**Innovation:** moving from discrete action spaces to LLM-native text generation with structured reasoning for healthcare fraud detection.
## Code & Documentation

- Repository: https://github.com/shylane/healthcare-openenv-challenge
- Blog post: see `docs/BLOG_DRAFT.md` in the repository
## Framework Versions
- PEFT: 0.18.1
- TRL: 0.27.1
- Transformers: 5.0.0
- PyTorch: 2.6.0+cu124
## Citation

```bibtex
@misc{healthcare-fraud-grpo,
  title={Healthcare Fraud Detection with GRPO},
  author={OpenEnv Challenge Entry},
  year={2026},
  howpublished={\url{https://huggingface.co/shylane/healthcare-fraud-detection-grpo}}
}
```