---
title: "We Trained an LLM to Catch UPI Scams – Then Caught It Cheating"
authors:
  - user: ujjwalpardeshi
  - user: omkarkadam
tags:
  - openenv
  - grpo
  - trl
  - lora
  - multi-agent
  - fraud-detection
  - reward-hacking
  - safety
---

We Trained an LLM to Catch UPI Scams – Then Caught It Cheating

*Official writeup for the Meta PyTorch OpenEnv Hackathon 2026 (Bangalore).* Live demo · Source · Analyzer LoRA · Scammer LoRA · Bench


The ₹2 Lakh Message

A 58-year-old retired teacher in Mumbai. Her son lives in Singapore. A WhatsApp message arrives with a matrimonial profile photo: "Hi, I'm a Singapore software engineer, let's talk about marriage. I have crypto investments to discuss." By message 6, ₹2 lakh is gone.

India loses ₹13,000+ crore per year to UPI fraud. 60 crore users are exposed. Rule-based systems degrade on novel attacks – our bench shows scripted detectors catch only 76.5% of post-2024 scam patterns like matrimonial crypto, deepfake CEO calls, and digital arrest schemes.

We built Chakravyuh to close that gap.


What We Built

Chakravyuh is a 5-agent OpenEnv environment for Indian UPI fraud detection. Five agents, asymmetric information, two trained LoRAs on opposite sides of the fraud loop:

```
         CLOUD ┌─────────────────┐
               │   REGULATOR     │  adapts rules from aggregated outcomes
               └────────┬────────┘
                        │
      ON-DEVICE ┌───────▼─────────┐
       ┌───────▶│ BEHAVIORAL      │   runs on victim's phone
       │ chat   │ ANALYZER        │   messages NEVER leave device
       │(local) │ (oversight LLM) │   ← trained with GRPO
   ┌───┴─────┐  └─────────────────┘
   │ SCAMMER │◀───chat─▶┌──────────┐
   └─────────┘          │  VICTIM  │
                        └────┬─────┘
                             │ transaction
          BANK-SIDE ┌────────▼────────┐
                    │ BANK MONITOR    │   sees ONLY tx metadata
                    └─────────────────┘
```

The key design choice: the Analyzer never sees transactions, and the Bank Monitor never sees chat. No single agent can game the outcome. Messages stay on-device; only anonymized risk scores reach the bank.
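
To make that asymmetry concrete, here is a minimal sketch of how the two defender views could be typed as separate observation payloads. This is illustrative only, not the repo's actual OpenEnv schema; the class and field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class AnalyzerObservation:
    """On-device view: raw chat only. Never includes transaction data."""
    chat_history: list[str]   # WhatsApp/SMS text, stays on the victim's phone
    turn: int

@dataclass
class BankMonitorObservation:
    """Bank-side view: transaction metadata only. Never includes chat content."""
    amount_inr: float
    is_new_payee: bool
    device_risk_score: float  # the only thing the Analyzer forwards: an anonymized score

# The two payloads share no fields, so neither agent can reconstruct the
# other's evidence, and neither can single-handedly decide the episode outcome.
```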

We trained two LoRA adapters with TRL's GRPO:

  • Analyzer (Qwen2.5-7B-Instruct + LoRA r=64) – the defender
  • Scammer (Qwen2.5-0.5B-Instruct + LoRA r=16) – the adversary

Both reward-engineered. Both parameter-efficient against frontier models. Both trained in Colab notebooks: Analyzer · Scammer.
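
For readers who want to reproduce the setup, the Analyzer adapter is trained with TRL's GRPOTrainer plus a PEFT LoRA config. The sketch below is a minimal, assumed configuration: the base model, LoRA rank, and KL coefficient match the numbers in this post, while the dataset path, reward stub, and remaining hyperparameters are placeholders rather than the notebook's exact values.

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def composable_reward(completions, **kwargs):
    """Stand-in for the 8-rubric reward in chakravyuh_env/rubrics.py."""
    return [0.0 for _ in completions]

peft_config = LoraConfig(
    r=64,                      # Analyzer uses r=64; the Scammer adapter uses r=16
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = GRPOConfig(
    output_dir="analyzer-lora-v2",
    beta=0.15,                 # KL anchor (v2 value; v1 used 0.08)
    num_generations=8,         # illustrative group size
    max_completion_length=512,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=composable_reward,
    args=args,
    train_dataset=load_dataset("json", data_files="train_scenarios.jsonl", split="train"),
    peft_config=peft_config,
)
trainer.train()
```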


The Reward Hacking Incident

We trained v1 of the Analyzer with a composable 8-rubric reward. The result looked incredible:

Detection: 100%. F1: 0.96.

We celebrated for about four minutes. Then we looked at the false positive rate.

36%.

The model wasn't catching scams – it was flagging everything. Every benign bank SMS, every legitimate RBI advisory, every real transaction notification. All marked as fraud.

The per-difficulty breakdown confirmed it. A model that genuinely understands fraud should show a difficulty ramp – easier scams detected more reliably than harder ones. v1 showed flat 100% across easy, medium, hard, and novel. That uniformity is the fingerprint of reward hacking.

v1 reward-hacking diagnostic: uniform 100% detection = the model is gaming the reward
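
That fingerprint is cheap to check automatically after every eval pass. A minimal sketch (the helper name and thresholds are ours, not part of the repo):

```python
def looks_reward_hacked(per_difficulty_detection: dict[str, float],
                        false_positive_rate: float,
                        flatness_tol: float = 0.02,
                        fpr_budget: float = 0.10) -> bool:
    """Near-uniform detection across difficulty tiers plus an exploding FPR
    is the signature of a policy that simply flags everything."""
    rates = per_difficulty_detection.values()
    is_flat = max(rates) - min(rates) <= flatness_tol
    return is_flat and false_positive_rate > fpr_budget

# v1 numbers from the bench: flat 100% detection on every tier, 36% FPR -> True
print(looks_reward_hacked(
    {"easy": 1.0, "medium": 1.0, "hard": 1.0, "novel": 1.0},
    false_positive_rate=0.36,
))
```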

What went wrong? The v1 reward profile made over-flagging a dominant strategy. The false-positive penalty was only −0.3 (too cheap). The format reward (+0.15) was paid even on wrong predictions. The benign calibration weight was only 0.3 (too weak to push scores down on legitimate messages). The model found the shortcut: always output a high score, collect the detection reward, eat the small FP penalty, and pocket the format bonus regardless.


The Three-Line Fix

Three reward-weight changes:

  1. FP penalty: −0.3 → −0.8 – over-flagging became expensive
  2. Format reward: denied when flagging benign as scam – closed the lazy shortcut
  3. Benign calibration: 0.3 → 0.5 – stronger gradient toward low scores on legitimate messages

Plus a tighter KL anchor (β = 0.08 → 0.15) to prevent drift under the new reward shape.
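
In code, the whole fix is a handful of constants plus one gate. The sketch below is a schematic reconstruction, not the actual chakravyuh_env/rubrics.py: the detection reward of +1.0 and the calibration form are assumed; only the weights named above come from our config.

```python
def reward(flagged_scam: bool, is_scam: bool, well_formatted: bool, risk_score: float,
           fp_penalty: float, format_bonus: float, benign_weight: float,
           gate_format_on_fp: bool, detection_reward: float = 1.0) -> float:
    """Schematic per-example reward (detection_reward=1.0 is an assumed value)."""
    r = detection_reward if (is_scam and flagged_scam) else 0.0
    false_positive = flagged_scam and not is_scam
    if false_positive:
        r += fp_penalty                                   # v1: -0.3, v2: -0.8
    if well_formatted and not (gate_format_on_fp and false_positive):
        r += format_bonus                                 # v2 denies this on false positives
    if not is_scam:
        r += benign_weight * (1.0 - risk_score)           # calibration: v1 0.3, v2 0.5
    return r

v1 = dict(fp_penalty=-0.3, format_bonus=0.15, benign_weight=0.3, gate_format_on_fp=False)
v2 = dict(fp_penalty=-0.8, format_bonus=0.15, benign_weight=0.5, gate_format_on_fp=True)

# A well-formatted false positive on a benign message (risk_score 0.9):
print(reward(True, False, True, 0.9, **v1))   # -0.12 -- cheap enough to flag everything
print(reward(True, False, True, 0.9, **v2))   # -0.75 -- the shortcut no longer pays
```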

v1 → v2: detection stable, FPR drops 5×


v2 Results

| Metric | v1 (reward-hacked) | v2 (this submission) | 95% CI (v2) |
|---|---|---|---|
| Detection rate (n=144 scams) | 100.0% | 99.3% | [97.9%, 100%] |
| False positive rate (n=30 benign) | 36.0% | 6.7% | [0.0%, 16.7%] |
| F1 | 0.96 | 0.99 | [0.976, 1.000] |
| Detection on novel post-2024 (n=34) | 100% | 97.1% | [91.2%, 100%] |

Detection barely moved. FPR dropped 5×. That asymmetric improvement – recall stable, false positives collapsing – is the signal that the model learned the actual task instead of gaming the reward.

The per-difficulty ramp now looks right. The biggest lifts come exactly where scripted rules fail most:

| Difficulty | Scripted baseline | v2 LoRA | Lift |
|---|---|---|---|
| Easy (n=26) | 96.2% | 100% | +3.8 pp |
| Medium (n=66) | 86.4% | 100% | +13.6 pp |
| Hard (n=18) | 72.2% | 100% | +27.8 pp |
| Novel post-2024 (n=34) | 76.5% | 97.1% | +20.6 pp |

Per-difficulty detection: scripted vs v2 LoRA


Two-Sided Parameter Efficiency

We ran the same bench against seven open-weight frontier models. On the defender side, our 7B LoRA is statistically tied with Llama-3.3-70B (p = 0.61) with 10× fewer parameters, and significantly better than DeepSeek-V3-0324 (p = 0.043) and gemma-3-27b-it (p = 0.0002).

On the attacker side, our 0.5B Scammer LoRA (best-of-8) bypasses scripted defenses at 93.75% – beating every untrained frontier model, including the 671B DeepSeek-V3, with 1340× fewer parameters. Against the v2 Analyzer LoRA, the bypass rate drops to 32.8% – a 60 pp gap that quantifies the co-evolution between attacker and defender.

Two independent demonstrations of the same principle: reward design and training beat raw scale.

Full frontier comparison tables with pairwise Fisher's exact tests are in the README.
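
Each of those pairwise comparisons is a Fisher's exact test on a 2×2 detected-vs-missed table. As a worked example of the recipe (using the novel-slice counts implied by the table above, 33/34 for the v2 LoRA vs 26/34 for the scripted baseline, not a number from the README tables):

```python
from scipy.stats import fisher_exact

table = [[33, 1],   # v2 LoRA:           detected, missed (novel post-2024, n=34)
         [26, 8]]   # scripted baseline:  detected, missed
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p = {p_value:.4f}")
```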


Training Evidence

v2's GRPO trajectory over 615 steps on a single A100-80GB:

v2 training curves: reward / loss / KL / grad-norm

  • Reward climbs from 1.29 → ~1.97 and stabilises with shrinking variance
  • Loss stays bounded (no divergence)
  • KL plateaus at 0.25–0.45 (honestly disclosed – v3 adds a KL-early-stop guard at 0.20; a sketch of that guard follows this list)
  • Grad norm is well-behaved
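
The KL guard planned for v3 can be retrofitted as a standard transformers callback. A sketch, assuming the trainer reports a "kl" entry in its logged metrics (TRL's GRPOTrainer does when β > 0, though the exact key can vary across versions):

```python
from transformers import TrainerCallback

class KLEarlyStop(TrainerCallback):
    """Stop training once KL to the reference policy stays above a budget."""

    def __init__(self, max_kl: float = 0.20, patience: int = 5):
        self.max_kl = max_kl
        self.patience = patience
        self.violations = 0

    def on_log(self, args, state, control, logs=None, **kwargs):
        kl = (logs or {}).get("kl")
        if kl is None:
            return
        self.violations = self.violations + 1 if kl > self.max_kl else 0
        if self.violations >= self.patience:   # sustained drift, not a one-step spike
            control.should_training_stop = True

# trainer = GRPOTrainer(..., callbacks=[KLEarlyStop(max_kl=0.20)])
```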

The 8-rubric composable reward system (chakravyuh_env/rubrics.py) ensures each dimension of performance – detection, calibration, explanation quality, signal accuracy, format compliance – is independently introspectable and ablatable. Per-rubric ablation, calibration reliability diagrams, leakage-clean OOD slices, and SFT vs GRPO fingerprint comparisons are all in the README.


What We're Honest About

  1. Semantic leakage. A MiniLM-L6 cosine-similarity audit shows 44.8% of bench scenarios have cosine > 0.85 with training text, so detection on easy/medium/hard is partially memorization. The v1→v2 FPR fix is unaffected (it is a relative comparison on the same bench).

  2. Small benign sample (n=31). FPR 6.7% has a wide Wilson CI of [1.8%, 20.7%] (reproduced in the sketch after this list). We stand behind "~5× reduction vs v1" but not the precise 6.7% as a tight estimate.

  3. Single-seed, one epoch, 619 examples. Multi-seed retrains and larger corpus are v3 work.

  4. Phase-2 co-evolution retraining is compute-gated. Not yet run.
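
For point 2, the interval is the standard Wilson score interval and can be reproduced with statsmodels. The false-positive count of 2 below is inferred from the reported rate, so treat it as an assumption:

```python
from statsmodels.stats.proportion import proportion_confint

# ~2 false positives out of 31 benign scenarios (count inferred, not logged here)
low, high = proportion_confint(count=2, nobs=31, alpha=0.05, method="wilson")
print(f"FPR 95% Wilson CI: [{low:.1%}, {high:.1%}]")   # roughly [1.8%, 20.7%]
```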

Full limitations and v3 roadmap in the README.


Try It

```bash
git clone https://github.com/UjjwalPardeshi/Chakravyuh && cd Chakravyuh
pip install -e '.[llm,eval]'
pytest tests/ -v   # 341 collected · 338 passed · 3 skipped
```

Or paste any suspicious message into the live demo.


Submission Assets

| Asset | Link |
|---|---|
| HF Space (submission URL) | ujjwalpardeshi/chakravyuh |
| Analyzer LoRA v2 (defender) | chakravyuh-analyzer-lora-v2 |
| Scammer LoRA Phase 1 (adversary, gated) | chakravyuh-scammer-lora-phase1 |
| Bench dataset | chakravyuh-bench-v0 |
| Training notebooks | Analyzer v2 · Scammer Phase 1 |
| Source + full README | github.com/UjjwalPardeshi/Chakravyuh |

Chakravyuh is a worked example of catching reward hacking in GRPO post-training. The diagnostic – "detection perfect but FPR exploding = model gaming the reward" – is portable to any RLHF/RLAIF pipeline. We share the bench, both LoRAs, the v1 trainer state, and the live red-team tab so practitioners can apply this to their own training runs.

Built by Ujjwal Pardeshi and Omkar Kadam for the Meta PyTorch OpenEnv Hackathon 2026, Bangalore.