# Healthcare Fraud Detection with GRPO
> ⚠️ **PRELIMINARY MODEL - 25 Training Steps Only**
>
> This model is a proof of concept for healthcare claims fraud detection using Group Relative Policy Optimization (GRPO). It has been trained for only 25 steps and requires additional training (200-500+ steps recommended) before production use.
## Model Details

- **Base Model:** Qwen/Qwen2.5-0.5B-Instruct (corrected from Llama 3.2 1B)
- **Training Method:** GRPO (Group Relative Policy Optimization)
- **Training Steps:** 25 (preliminary)
- **Reward Improvement:** +38% (from -40.74 to -25.35)
- **Initial Entropy:** 2.27 (high uncertainty)
- **Hardware:** NVIDIA RTX 4060 (8 GB)
- **Quantization:** 4-bit with LoRA adapters
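For context on the initial-entropy figure: 2.27 nats is close to the entropy of a uniform distribution over about 10 options (ln 10 ≈ 2.30), i.e. the untrained policy is nearly indifferent among its choices. A minimal sketch of the entropy calculation (the probabilities below are illustrative, not taken from the model):

```python
import math

def entropy(probs):
    """Shannon entropy in nats: H = -sum(p * ln p); zero-probability terms contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A uniform distribution over 10 options gives H = ln(10) ≈ 2.30,
# close to the reported initial entropy of 2.27.
uniform = [0.1] * 10
print(round(entropy(uniform), 2))  # → 2.3
```

As training progresses and the policy concentrates probability on fewer actions, this value falls, which is what the "entropy decreasing" observation below refers to.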
## Environment: Healthcare Claims Fraud Detection
An OpenEnv-compatible RL environment with:
- **LLM-Native Actions:** text generation (Decision + Rationale + Evidence + Recommendation)
- **Multi-Component Reward:** Decision (40%) + Rationale (30%) + Evidence (20%) + Efficiency (10%)
- **Hybrid Data:** 60% Synthea synthetic + 40% CMS SynPUF real Medicare claims
- **8 Fraud Patterns:** upcoding, unbundling, phantom billing, etc.
- **A2A Protocol:** compatible with Green Agents / AgentBeats
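The multi-component reward can be sketched as a weighted sum. The component names and weights come from the list above; the function name and the per-component scores are hypothetical placeholders, not the environment's actual implementation:

```python
# Weights from the environment description above.
WEIGHTS = {"decision": 0.4, "rationale": 0.3, "evidence": 0.2, "efficiency": 0.1}

def combined_reward(scores: dict) -> float:
    """Weighted sum of per-component scores; each score is assumed to lie in [-1, 1],
    so the combined reward is also bounded in [-1, 1]. Missing components score 0."""
    return sum(WEIGHTS[name] * scores.get(name, 0.0) for name in WEIGHTS)

# Example: correct decision, decent rationale, weak evidence, verbose output.
reward = combined_reward(
    {"decision": 1.0, "rationale": 0.5, "evidence": -0.2, "efficiency": -0.5}
)
print(round(reward, 2))
```

Because the decision component dominates (40%), a wrong decision cannot be fully compensated by good rationale or evidence, which matches the intent of a fraud-screening reward.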
## Usage
```python
from transformers import pipeline

# Load the fine-tuned model
generator = pipeline(
    "text-generation",
    model="shylane/healthcare-fraud-detection-grpo",
    device=0,
)

# Generate a fraud assessment
prompt = '''Assess this healthcare claim for fraud:
Provider: Dr. Smith
Service: MRI Scan
Amount: $2,500
...'''

result = generator(prompt, max_new_tokens=300, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```
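Since the model is trained to emit a structured assessment (Decision + Rationale + Evidence + Recommendation), downstream code will typically want to parse it. The parser below is a hypothetical sketch that assumes one `Field: value` pair per line; the exact output format depends on the training prompts:

```python
import re

def parse_assessment(text: str) -> dict:
    """Extract the structured fields from a generated fraud assessment.
    Assumes each field appears on its own line as 'Field: value'."""
    fields = ("Decision", "Rationale", "Evidence", "Recommendation")
    parsed = {}
    for field in fields:
        match = re.search(rf"^{field}:\s*(.+)$", text, flags=re.MULTILINE)
        if match:
            parsed[field.lower()] = match.group(1).strip()
    return parsed

# Illustrative model output (not a real generation).
sample = """Decision: FLAG for review
Rationale: Billed amount is far above the typical range for this service.
Evidence: $2,500 vs. a typical $800-1,200 for an MRI scan.
Recommendation: Route to manual audit."""

print(parse_assessment(sample)["decision"])  # → FLAG for review
```

Fields the model omits are simply absent from the returned dict, so callers should treat a missing `decision` key as a failed generation rather than a clean claim.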
## Training Results

**Preliminary (25 steps):**

- Shows a +38% reward improvement over the random baseline
- Demonstrates that the environment provides a meaningful learning signal
- Entropy is decreasing (the model is becoming more confident)
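The +38% figure is the reward gain relative to the magnitude of the starting reward, which can be checked directly from the numbers reported in Model Details:

```python
# Reward improvement from -40.74 (initial) to -25.35 (after 25 steps),
# expressed relative to the magnitude of the initial reward.
initial, final = -40.74, -25.35
improvement = (final - initial) / abs(initial)
print(f"{improvement:.0%}")  # → 38%
```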
**For production:**

- Requires 200-500+ additional training steps
- Consider a larger model (1B-3B parameters) for complex reasoning
- Multi-GPU training is recommended for faster convergence
## Limitations

⚠️ This is **not** a production-ready model:

- Only 25 training steps completed
- High initial entropy (2.27)
- Rewards are still negative on average
- Small model size (0.5B parameters)
- Limited to synthetic/hybrid data
## Challenge Entry

This model is an entry in the OpenEnv Student Challenge 2026, sponsored by Meta PyTorch, Hugging Face, and Unsloth AI.

**Innovation:** moving from discrete action spaces to LLM-native text generation with structured reasoning for healthcare fraud detection.
## Code & Documentation

- Repository: https://github.com/shylane/healthcare-openenv-challenge
- Blog post: see `docs/BLOG_DRAFT.md` in the repository
## Framework Versions
- PEFT: 0.18.1
- TRL: 0.27.1
- Transformers: 5.0.0
- PyTorch: 2.6.0+cu124
## Citation

```bibtex
@misc{healthcare-fraud-grpo,
  title={Healthcare Fraud Detection with GRPO},
  author={OpenEnv Challenge Entry},
  year={2026},
  howpublished={\url{https://huggingface.co/shylane/healthcare-fraud-detection-grpo}}
}
```