Spaces:
Running
Running
| title: Chakravyuh | |
| emoji: π‘οΈ | |
| colorFrom: indigo | |
| colorTo: blue | |
| sdk: docker | |
| app_port: 8000 | |
| pinned: true | |
| license: mit | |
| short_description: Multi-agent RL env for Indian UPI fraud detection | |
| # Chakravyuh | |
| [](https://github.com/UjjwalPardeshi/Chakravyuh/actions/workflows/ci.yml) | |
| [](LICENSE) | |
| [](https://www.python.org/downloads/) | |
| A multi-agent RL environment for Indian UPI fraud detection β built by **Ujjwal Pardeshi** & **Omkar Kadam** for the **Meta PyTorch OpenEnv Hackathon 2026 (Bangalore)**. | |
| > **We trained an LLM to detect UPI fraud and got 100 % detection.** We celebrated for four minutes. Then we noticed: **36 % false-positive rate.** The model wasn't catching scams β it was flagging everything. Three reward-weight changes later, v2 holds 99.3 % detection with FPR down 5Γ to 6.7 %. | |
| ### Judges: Start Here | |
| | | Link | | |
| |---|---| | |
| | **Live demo** | [ujjwalpardeshi-chakravyuh.hf.space/demo/](https://ujjwalpardeshi-chakravyuh.hf.space/demo/) | | |
| | **Blog writeup** | [`Blog.md`](Blog.md) β 5-minute narrative (separate from this README, per organisers) | | |
| | **Analyzer LoRA v2** (defender, 7B) | [`chakravyuh-analyzer-lora-v2`](https://huggingface.co/ujjwalpardeshi/chakravyuh-analyzer-lora-v2) | | |
| | **Scammer LoRA Phase 1** (adversary, 0.5B, gated) | [`chakravyuh-scammer-lora-phase1`](https://huggingface.co/ujjwalpardeshi/chakravyuh-scammer-lora-phase1) | | |
| | **Bench dataset** (175 scenarios) | [`chakravyuh-bench-v0`](https://huggingface.co/datasets/ujjwalpardeshi/chakravyuh-bench-v0) | | |
| | **Training notebooks** | [Analyzer v2](notebooks/v2_retrain_safe.ipynb) Β· [Scammer Phase 1](notebooks/T4_or_A100_b2_phase1_scammer.ipynb) | | |
| **Headline (v2, n = 174):** Detection **99.3 %** Β· FPR **6.7 %** (5Γ better than v1) Β· F1 **0.99** Β· ties Llama-3.3-70B at 10Γ fewer params (p = 0.61, Fisher's exact). | |
| **Themes:** **#1 Multi-Agent** (primary) Β· **#4 Self-Improvement** (v1βv2 reward-hacking diagnosis-and-fix loop) | |
|  | |
| > *Per-difficulty detection on the 174-scenario bench β scripted rules vs the Chakravyuh v2 LoRA. Scripted holds 96.2 % on easy but degrades on `hard` (72.2 %) and `novel` (76.5 %) post-2024 attacks; v2 closes the gap to **100 %** on hard and **97.1 %** on novel. Backing artifact: [`logs/eval_v2.json`](logs/eval_v2.json) for v2; [`data/chakravyuh-bench-v0/baselines.json`](data/chakravyuh-bench-v0/baselines.json) for scripted (re-measured 2026-04-21 on the current n = 175 bench).* | |
| ### Why this matters β one named victim | |
| Imagine a 58-year-old retired teacher in Mumbai. Her son lives in Singapore. A WhatsApp message arrives with a matrimonial profile photo of someone who looks like him: *"Hi, I'm a Singapore software engineer, let's talk about marriage. I have crypto investments to discuss."* By message 6, βΉ2 lakh is gone. Across the 34 post-2024 novel scams in our bench (matrimonial crypto, deepfake CEO, digital arrest, AePS fraud), **scripted rule-based detectors catch 76.5% (26/34); Chakravyuh v2 catches 33 of 34 (97.1%) β a 20.6 pp gap**. This is the gap the environment is built to close. | |
| ## The 60-second pitch | |
| **Problem.** Indian digital payments lose βΉ13,000+ crore/year to UPI fraud. 60 crore users are exposed. Rule-based detectors degrade meaningfully on post-2024 attack patterns β we measured **scripted analyzer detection = 76.5 % on the 34-scenario novel split** (26/34, vs 96.2 % on easy / 86.4 % on medium / 72.2 % on hard; matrimonial crypto, deepfake CEO, digital arrest, AePS fraud; sourced from `data/chakravyuh-bench-v0/scenarios.jsonl` and reproducible via `python -c "from eval.mode_c_real_cases import ScriptedAnalyzerAdapter; ..."`). No public RL environment exists for multi-agent fraud-detection research β so we built one. | |
| **Approach.** A 5-agent OpenEnv environment (Scammer, Victim, Analyzer, Bank Monitor, Regulator) with a composable 8-rubric reward. **Two LoRA adapters trained with TRL GRPO**: the Analyzer (Qwen2.5-7B-Instruct + LoRA r=64) as the defender, and the Scammer (Qwen2.5-0.5B-Instruct + LoRA r=16) as the adversary. Reward-hacking diagnosed in v1 (FPR = 36 %), then *measurably* fixed in v2 (FPR = 6.7 % β **5Γ better**). | |
| **Headline result** β 174 scenarios, percentile bootstrap 95 % CIs (10 000 iters) from [`logs/bootstrap_v2.json`](logs/bootstrap_v2.json). All four CIs in this table are **percentile bootstrap** (n_resamples = 10 000); the v1βv2 delta table further down uses **Wilson** CIs on the per-class counts and labels each accordingly. | |
| | Metric | v1 (reward-hacked) | **v2 (this submission)** | 95 % CI (v2, bootstrap) | | |
| |---|---|---|---| | |
| | Detection rate (recall on scams, n = 144) | 100.0 % | **99.3 %** | [97.9 %, 100 %] | | |
| | False positive rate (n = 30 benign) | 36.0 % | **6.7 %** | [0.0 %, 16.7 %] | | |
| | F1 | 0.96 | **0.99** | [0.976, 1.000] | | |
| | Detection on **novel** (post-2024, n = 34) | 100 % | 97.1 % | [91.2 %, 100 %] | | |
| The asymmetric improvement β detection unchanged, FPR down 5Γ β is the signature of the model actually learning the task instead of gaming the reward. Full v1βv2 diagnosis below. | |
| ### Open-weight frontier comparison (same bench, same prompt) | |
| Run via `python -m eval.frontier_baseline --providers hf --hf-models ...` (HuggingFace Inference Providers, paid from HF compute credits). Source: [`logs/frontier_comparison.csv`](logs/frontier_comparison.csv). Frontier rows use n = 175 (full bench file); v2 LoRA row is n = 174 (one row dropped on inference β single dropped row does not affect any headline claim). | |
| | Model | Params | Detection | FPR | F1 | | |
| |---|---|---|---|---| | |
| | **Chakravyuh v2 LoRA (this submission)** | **7B + LoRA r=64** | **99.3 %** | **6.7 %** | **0.990** | | |
| | Qwen2.5-7B-Instruct (base, no LoRA) | 7B | 100 % | 16.1 % | 0.983 | | |
| | Llama-3.3-70B-Instruct (open) | 70B | 99.3 % | 3.2 % | 0.993 | | |
| | Qwen2.5-72B-Instruct (open) | 72B | 98.6 % | 6.5 % | 0.986 | | |
| | DeepSeek-V3-0324 (open) | 671B MoE (~37B active) | 100 % | **29.0 %** | 0.970 | | |
| | gpt-oss-120b (OpenAI open-weight) | 120B | 98.6 % | 16.1 % | 0.976 | | |
| | gemma-3-27b-it (open) | 27B | 100 % | **51.6 %** | 0.947 | | |
| | DeepSeek-R1 (reasoning, open) | 671B MoE | 100 % | 12.9 % | 0.986 | | |
| | Scripted rule-based baseline | β | 84.0 % | 9.7 % | 0.903 | | |
|  | |
| Four things to read out of this: | |
| 1. **GRPO + LoRA contribution is the headline.** The base Qwen2.5-7B-Instruct (no LoRA) scores 100 % / **16.1 %** / 0.983; after our GRPO post-training: 99.3 % / **6.7 %** / 0.990. **Same model, same params: β9.4 pp FPR and +0.007 F1 attributable purely to the reward-engineered training** β point estimate; Fisher's exact two-sided p = 0.42 at n_benign = 30 (*directional but not yet at Ξ± = 0.05; tightened by B.11 benign-corpus expansion*). Source: [`logs/grpo_lora_significance.json`](logs/grpo_lora_significance.json). | |
| 2. **Parameter efficiency vs frontier β pairwise Fisher's exact** ([`logs/frontier_significance.json`](logs/frontier_significance.json)): | |
| - vs **Llama-3.3-70B** (FPR 3.2 %): p = 0.61 β *statistically tied at 10Γ fewer params*. | |
| - vs **Qwen2.5-72B** (FPR 6.5 %): p = 1.00 β *statistically tied at 10Γ fewer params*. | |
| - vs **DeepSeek-R1** (FPR 12.9 %, with the reasoning-aware parser): p = 0.67 β *directionally better but not at Ξ± = 0.05*. | |
| - vs **DeepSeek-V3-0324** (FPR 29.0 %): p = **0.043** β *significantly better*. | |
| - vs **gemma-3-27b-it** (FPR 51.6 %): p = **0.0002** β *significantly better*. | |
| - Both significant comparisons survive **Holm-Bonferroni correction at k = 7** (corrected Ξ± β 0.0071 β gemma's p clears it directly; DeepSeek-V3 clears the largest-p threshold of Ξ± = 0.05). | |
| 3. **DeepSeek-V3 reproduces the v1 reward-hacking signature externally.** Detection 100 % / FPR 29 % at 671B parameters is structurally identical to our v1 (100 % / 36 %), and the FPR gap vs the calibrated v2 LoRA is statistically significant (p = 0.043). A frontier model independently falls into the failure mode our reward-engineering methodology diagnoses and fixes β *external validation* that calibrated reward design beats raw capacity. gemma-3-27B-it (100 % / FPR 51.6 %, p = 0.0002 vs LoRA) is the same story at smaller scale. | |
| 4. **Open-weight frontier β guaranteed scam-spotting.** **Six of the seven open frontier models we tested have FPR > 6.7 % on the same bench**; calibration is the contested axis, not capacity. The only one with lower FPR is Llama-3.3-70B (3.2 %, p = 0.61) β which we're statistically tied with at 10Γ fewer parameters. | |
| **Reasoning-model parser fix.** Our original scoring prompt asked for JSON-only output, which DeepSeek-R1 (a chain-of-thought model) violated by returning long `<think>...</think>` blocks. We shipped a reasoning-aware parser ([`eval/frontier_baseline.py:_strip_reasoning`](eval/frontier_baseline.py), 5 unit tests at [`tests/test_frontier_baseline.py`](tests/test_frontier_baseline.py)) plus an upgraded `max_tokens=4096` budget for reasoning models β that turned R1's number from a 0.7 % parser artifact into the real **100 % / 12.9 % / F1 = 0.986** measurement now in the table. | |
| Proprietary frontier (GPT-4o / Claude / Gemini) deferred β the API budget is not covered by the HF compute credits we ran on. The script supports those providers with the appropriate API keys; see [`FAQ.md`](FAQ.md). | |
| ### Frontier-LLMs-as-Scammer comparison (parameter efficiency on the *attacker* side) | |
|  | |
| The frontier table above asks "which model is the best *defender*?" The natural symmetric question: **which model is the best *attacker*?** We asked each frontier LLM to write the same 16 attack-category scam messages the Scammer LoRA Phase 1 was evaluated on (8 train + 8 held-out categories), and scored every output through the same `ScriptedAnalyzer` defender. Source: [`logs/scammer_frontier_comparison.csv`](logs/scammer_frontier_comparison.csv) + [`logs/scammer_frontier_comparison.json`](logs/scammer_frontier_comparison.json). | |
| | Scammer model | Params | Bypass rate | 95 % CI | Held-out categories | | |
| |---|---|---|---|---| | |
| | **Chakravyuh Scammer LoRA Phase 1 (best-of-8)** | **0.5B + LoRA r=16** | **93.75 %** (60/64) | [85.0 %, 97.5 %] | **100 %** (32/32) | | |
| | gpt-oss-120b (untrained) | 120B | 87.5 % (14/16) | [64.0 %, 96.5 %] | 87.5 % | | |
| | Llama-3.3-70B-Instruct (untrained) | 70B | 68.8 % (11/16) | [44.4 %, 85.8 %] | 87.5 % | | |
| | Qwen2.5-7B-Instruct (untrained, our base) | 7B | 62.5 % (10/16) | [38.6 %, 81.5 %] | 62.5 % | | |
| | **Chakravyuh Scammer LoRA Phase 1 (single-shot)** | **0.5B + LoRA r=16** | **59.4 %** (38/64) | [47.1 %, 70.5 %] | 56.3 % | | |
| | Qwen2.5-72B-Instruct (untrained) | 72B | 56.2 % (9/16) | [33.2 %, 76.9 %] | 50.0 % | | |
| | gemma-3-27b-it (untrained) | 27B | 43.8 % (7/16) | [23.1 %, 66.8 %] | 37.5 % | | |
| | DeepSeek-V3-0324 (untrained) | 671B MoE | 31.2 % (5/16) | [14.2 %, 55.6 %] | 37.5 % | | |
| Two things to read out of this: | |
| 1. **Our 0.5B trained Scammer beats every untrained frontier model β including 671B DeepSeek-V3 β at evading the same scripted defense.** Best-of-8 (93.75 %) clears the highest frontier number (gpt-oss-120b at 87.5 %) by 6.25 pp; single-shot (59.4 %) is competitive with the larger 70B-class models without best-of-N. | |
| 2. **Same parameter-efficiency story as the defender-side table, on the attacker side.** Reward-engineered training at 0.5B beats raw capacity at 240Γβ1340Γ the parameter count for evading rule-based defenses. That's two independent demonstrations β *defender-side LoRA ties Llama-3.3-70B at 10Γ fewer params Β· attacker-side LoRA beats DeepSeek-V3 at 1340Γ fewer params* β that the contested resource is reward design and training, not scale. The DeepSeek-V3 attacker score (31.2 %) is partly safety-training refusing scam roleplay; even adjusting for that, the trained 0.5B is on top. | |
| This is the frontier-comparison evidence for the Multi-Agent track: **two trained agents, both parameter-efficient against frontier baselines, on opposite sides of the fraud loop.** | |
| --- | |
| ## Real incidents Chakravyuh is built for | |
| These are cited public 2025 cases. Each one matches a signal Chakravyuh's Analyzer is trained to flag. The bench-v0 corpus contains structurally similar templates (not the same text β soft-leakage filtered). | |
| | Location | Date | Amount | Signal Chakravyuh catches | Source | | |
| |---|---|---|---|---| | |
| | Hyderabad | Oct 26 β Nov 12, 2025 | βΉ11.17 lakh | `trust_grooming` + `investment_offer` (matrimonial profile β "Singapore crypto trader" β high-return crypto pitch) β suspect arrested at Chennai airport | [Newsmeter](https://newsmeter.in/crime/rs-11-lakh-matrimonial-crypto-scam-busted-by-hyderabad-police-mastermind-from-vizag-held-at-airport-763759) | | |
| | Mumbai | 2025 | βΉ1 crore | `trust_grooming` + `investment_offer` + `urgency` (matrimonial site β fake "NRI" β assured-return crypto app) | [Outlook Money](https://www.outlookmoney.com/news/man-duped-of-rs-1-crore-in-crypto-scam-through-matrimonial-website) | | |
| | Pan-India | 2025 | ~βΉ2,400 cr (29,000+ AePS complaints) | `biometric_impersonation` (Aadhaar biometric data scraped from public registry sites β AePS withdrawal at remote operator). The Analyzer flags companion phishing chats; the Bank Monitor's separate metadata channel catches the AePS leg. | [Press Information Bureau](https://www.pib.gov.in/PressReleasePage.aspx?PRID=2039647) Β· [The Print](https://theprint.in/india/governance/cybercriminals-cloning-aadhaar-biometric-data-to-commit-fraud-mha-nodal-agency-to-states/1415112/) | | |
| The pan-India AePS case is the single sharpest illustration of why **two-tier oversight** matters. A chat-only detector cannot see an AePS withdrawal at a remote operator; a metadata-only detector cannot see the social-engineering chat that set up the biometric leak. Chakravyuh runs both tiers, structurally independent. | |
| --- | |
| ## Before / after β one novel-2025 scam | |
| A single scenario from the bench (`modec_106`, source: `Novel post-2024 β AI deepfake CEO`, date: 2025-Q4): | |
| > "[Deepfake voice of Infosys CEO]: Exclusive employee IPO allocation. Invest Rs 1 lakh via company portal. HR calling now for UPI transfer. Limited to first 100 employees." | |
| The scripted analyzer scores this **0.05** (threshold 0.50) β it flags only `unknown_sender`. Its word-list rules find no urgency token, no impersonation phrase ("CEO" isn't in the list), no info-request, no link. **Scam slips through.** | |
| The v2 LoRA closes this class of gap at the aggregate level: across the 34 post-2024 novel scenarios in the bench (matrimonial crypto, deepfake CEO, digital arrest, AePS fraud), v2 catches **33 of 34 (97.1 %)** vs the scripted baseline's 26 of 34 (76.5 %) β a **20.6 pp lift** on exactly the attacks where keyword rules are blind. Source: [`logs/eval_v2.json`](logs/eval_v2.json). | |
| Reproducible: `python eval/single_scenario_eval.py --scenario-id modec_106 --output docs/before_after_example.json` | |
| --- | |
| ## Why This Environment β Scalable Oversight as a Research Contribution | |
| Chakravyuh is, at its core, a **scalable-oversight** benchmark for LLM training. The research frame: *can we train an LLM to monitor, analyze, and explain the behaviour of another AI agent operating adversarially in a complex, partially observable multi-agent setting?* | |
| The **Analyzer** is the oversight LLM under training. It watches a scripted Scammer attempt to manipulate a scripted Victim, must decide whether the interaction is fraudulent in real time (partial observability β it sees only the chat, never the transaction), and must produce a human-readable *explanation* of its decision. A second oversight agent β the **Bank Monitor** β provides independent cross-modal confirmation (transaction metadata only, no chat), making Chakravyuh a **two-tier oversight system** where the Analyzer's claims can be corroborated or contradicted. | |
| The composable rubric system ([chakravyuh_env/rubrics.py](chakravyuh_env/rubrics.py)) grades three pillars of oversight: **detection**, **calibration**, and **explanation** β see [Composable Rubric System](#composable-rubric-system) below. | |
| --- | |
| ## Submission Materials | |
| > **π― OFFICIAL SUBMISSION URL** (give this to judges) β **[`https://huggingface.co/spaces/ujjwalpardeshi/chakravyuh`](https://huggingface.co/spaces/ujjwalpardeshi/chakravyuh)** | |
| > | |
| > Per the hackathon organisers, the HF Space URL above is the canonical | |
| > submission link β judges pull the environment from there. The | |
| > writeup [`Blog.md`](Blog.md) is published alongside this README into | |
| > the same Space (`MD separate from Readme`, per organisers' note). | |
| | Asset | Link | | |
| |---|---| | |
| | **Hugging Face Space (live env β submission URL)** | [`ujjwalpardeshi/chakravyuh`](https://huggingface.co/spaces/ujjwalpardeshi/chakravyuh) Β· live at [`https://ujjwalpardeshi-chakravyuh.hf.space/demo/`](https://ujjwalpardeshi-chakravyuh.hf.space/demo/) | | |
| | **Writeup blog (Blog.md, in HF Space)** | [`Blog.md`](Blog.md) β 5-minute story, separate from README, pushed into the HF Space per organisers' clarification | | |
| | **Analyzer LoRA v2** (defender, HF Hub) | [`ujjwalpardeshi/chakravyuh-analyzer-lora-v2`](https://huggingface.co/ujjwalpardeshi/chakravyuh-analyzer-lora-v2) β Qwen2.5-7B-Instruct + LoRA r=64 + GRPO. 99.3 % detection Β· 6.7 % FPR Β· F1 = 0.99 | | |
| | **Scammer LoRA Phase 1** (adversary, HF Hub β gated) | [`ujjwalpardeshi/chakravyuh-scammer-lora-phase1`](https://huggingface.co/ujjwalpardeshi/chakravyuh-scammer-lora-phase1) β Qwen2.5-0.5B-Instruct + LoRA r=16 + GRPO. **n=64 best-of-8 bypass: 93.75 % vs scripted defense (100 % on held-out novel categories), 32.8 % vs v2 LoRA defender.** Per-sample artifacts: [`logs/b2_phase1_scammer_eval_n64_bestof8.json`](logs/b2_phase1_scammer_eval_n64_bestof8.json) Β· [`logs/scammer_significance.json`](logs/scammer_significance.json) | | |
| | Training notebooks (TRL + GRPO) | Analyzer v2: [`notebooks/v2_retrain_safe.ipynb`](notebooks/v2_retrain_safe.ipynb) Β· Scammer Phase 1: [`notebooks/T4_or_A100_b2_phase1_scammer.ipynb`](notebooks/T4_or_A100_b2_phase1_scammer.ipynb) | | |
| | Public benchmark dataset | [`ujjwalpardeshi/chakravyuh-bench-v0`](https://huggingface.co/datasets/ujjwalpardeshi/chakravyuh-bench-v0) on HF Hub Β· local copy: [`data/chakravyuh-bench-v0/`](data/chakravyuh-bench-v0/) (175 scenarios) | | |
| | FAQ for judges | [`FAQ.md`](FAQ.md) | | |
| | Official hackathon guidelines | [`guidelines/`](guidelines/) | | |
| --- | |
| ## The Problem | |
| Indian digital payments lose βΉ13,000+ crore/year to UPI fraud. 60 crore users are exposed. Rule-based detection is brittle; scammers evolve faster than banks patch. **No public RL environment exists for multi-agent fraud detection research.** | |
| Chakravyuh fills this gap. | |
| ## The Environment | |
| Five agents with asymmetric information: | |
| ``` | |
| CLOUD βββββββββββββββββββ | |
| β REGULATOR β adapts rules from aggregated outcomes | |
| β (meta-agent) β (aggregate signals only β no chat, no tx) | |
| ββββββββββ¬βββββββββ | |
| β | |
| ON-DEVICE βββββββββΌββββββββββ | |
| βββββββββΆβ BEHAVIORAL β runs locally on victim's phone | |
| β chat β ANALYZER β messages NEVER leave device | |
| β(local) β (oversight LLM) β β agent under training | |
| βββββ΄ββββββ βββββββββββββββββββ | |
| β SCAMMER βββββchatββΆββββββββββββ | |
| βββββββββββ β VICTIM β | |
| ββββββ¬βββββββ | |
| β attempts transaction | |
| βΌ | |
| BANK-SIDE βββββββββββββββββββ | |
| β BANK MONITOR β sees ONLY tx metadata | |
| β (oversight) β no chat content | |
| βββββββββββββββββββ | |
| ``` | |
| Messages stay on the victim's device. Only anonymized risk scores reach the bank. End-to-end encryption is fully preserved β the architecture mirrors real on-device spam filtering (e.g. Gmail). | |
| ### Architecture | |
| Five-agent architecture with asymmetric information (see diagram above). | |
| | Agent | Role | Sees | Trained? | | |
| |---|---|---|---| | |
| | Scammer | Adversary | Own plan + victim responses | **Yes β LoRA r=16 on Qwen2.5-0.5B-Instruct via TRL GRPO Phase 1** (B.2; bypasses scripted defense at 93.75 % best-of-8, 100 % held-out novel categories β see [`logs/b2_phase1_scammer_eval_n64_bestof8.json`](logs/b2_phase1_scammer_eval_n64_bestof8.json)). Falls back to 376 curated templates (NPCI/RBI-grounded) when the LoRA isn't loaded. | | |
| | Victim | Target | Chat + own demographic profile | No (scripted; SENIOR / SEMI_URBAN / YOUNG_URBAN profiles) | | |
| | Analyzer | Oversight | Full chat transcript | **Yes β LoRA on Qwen2.5-7B-Instruct via TRL GRPO** | | |
| | Bank Monitor | Oversight | Tx metadata only | No (scripted) | | |
| | Regulator | Meta-agent | Aggregate outcomes across episodes | No (rule-weight updater) | | |
| ### Attack corpus | |
| - **376 scam templates** β 200 base + 100 augmented + 76 novel (post-2024) across 5 categories (OTP theft, KYC fraud, impersonation, loan-app fraud, investment fraud) + 6 novel categories (QR fraud, voice-clone job, WhatsApp investment, AePS fraud, matrimonial crypto, parcel scam) | |
| - **204 benign templates** β 70 base + 134 augmented (including 30 hard-negatives: HDFC fraud alerts, Mumbai Police traffic challans, RBI advisories β urgent-looking but legitimate) | |
| - Languages: **primarily English** (n=161/175) with a Hindi minority (n=9). Single-sample placeholders for Tamil / Telugu / Kannada / Bengali / Marathi mark them as **v3 expansion targets** β not production-grade coverage. Per-language eval is in v3. | |
| - 5 intents: urgency, authority, empathy, greed, fear | |
| - 5 impersonation roles: bank, govt, family, delivery, employer | |
| - 2025β2026 attack vectors: digital arrest, crypto-exchange spoofing, deepfake CEO, UPI collect request, matrimonial scams, FASTag KYC, ABHA Health ID, AadhaarβDL linkage | |
| --- | |
| ## Quickstart | |
| ### Option A β Install and run via OpenEnv (recommended for judges) | |
| ```bash | |
| # Clone | |
| git clone https://github.com/UjjwalPardeshi/Chakravyuh && cd Chakravyuh | |
| # Option A.1 β bare Python | |
| pip install -e . | |
| uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| # Option A.2 β uv | |
| uv sync && uv run server | |
| # Option A.3 β Docker | |
| docker build -t chakravyuh . && docker run -p 8000:8000 chakravyuh | |
| # Option A.4 β Hugging Face Space | |
| # See "Submission Materials" above for the live HF Space URL | |
| ``` | |
| All four paths are verified by `openenv validate .`: | |
| ``` | |
| [OK] Ready for multi-mode deployment | |
| Supported modes: [YES] docker [YES] openenv_serve [YES] uv_run [YES] python_module | |
| ``` | |
| ### OpenEnv client usage (what training loops consume) | |
| ```python | |
| from chakravyuh_env.openenv_client import ChakravyuhEnvClient | |
| from chakravyuh_env import ChakravyuhAction | |
| with ChakravyuhEnvClient(base_url="http://localhost:8000").sync() as env: | |
| result = env.reset(seed=42) | |
| # `result.observation.chat_history` contains the scammer opener | |
| # and victim's initial response (internal turns 1-2). | |
| # Analyzer's turn-3 decision: | |
| result = env.step(ChakravyuhAction( | |
| score=0.92, # suspicion in [0, 1] | |
| signals=["urgency", "info_request"], # from the 11-signal taxonomy | |
| explanation="Asks for OTP with urgency pressure from a self-claimed bank agent.", | |
| )) | |
| if not result.done: | |
| # Analyzer's turn-6 decision after scammer escalation + victim reply: | |
| result = env.step(ChakravyuhAction(score=0.95, signals=["impersonation"])) | |
| print("reward:", result.reward) | |
| print("outcome:", result.observation.outcome) | |
| print("rubric breakdown:", result.observation.reward_breakdown) | |
| ``` | |
| ### One-liner β score a single message with the trained Analyzer | |
| ```python | |
| from chakravyuh_env import get_trained_analyzer | |
| analyzer = get_trained_analyzer() # downloads ujjwalpardeshi/chakravyuh-analyzer-lora-v2 on first call | |
| print(analyzer("Urgent! Your bank account will be frozen. Share OTP to verify identity.")) | |
| # β {'score': 0.95, 'signals': ['urgency', 'info_request', 'impersonation'], | |
| # 'explanation': 'Asks for OTP with urgency from a self-claimed bank agent...'} | |
| ``` | |
| The analyzer is callable for one-shot scoring (`analyzer(text) -> dict`). For full env integration use `analyzer.act(observation)`; for Mode C eval use `analyzer.score_text(text) -> float`. First call downloads weights (~660 MB) and is slow; subsequent calls hit the warm model. | |
| ### Direct-import usage (no HTTP, for unit tests and trainers colocated with the env) | |
| ```python | |
| from chakravyuh_env import ChakravyuhOpenEnv, ChakravyuhAction | |
| env = ChakravyuhOpenEnv() | |
| obs = env.reset(seed=42) | |
| obs = env.step(ChakravyuhAction(score=0.92, signals=["urgency"])) | |
| if not obs.done: | |
| obs = env.step(ChakravyuhAction(score=0.95, signals=["impersonation"])) | |
| print(obs.reward, obs.reward_breakdown) | |
| ``` | |
| ### Run the tests | |
| ```bash | |
| pytest tests/ -v | |
| # 341 collected Β· 338 passed Β· 3 skipped (LLM-judge tests skip without GROQ_API_KEY) | |
| # Coverage: openenv contract, rubrics, scripted env, demo, explanation judge, | |
| # GRPO reward, MCP compliance, mode-C bench, negotiation, leaderboard, training data, | |
| # benign augmentation, known/novel split, red-team robustness, input sanitizer, | |
| # permutation test for v1βv2 FPR delta. | |
| # Tests require '.[llm,eval]' extras: | |
| # pip install -e '.[llm,eval]' | |
| ``` | |
| --- | |
| ## OpenEnv Compliance | |
| | Requirement | Status | | |
| |---|---| | |
| | Uses `openenv.core.env_server.Environment` base class | β [`chakravyuh_env/openenv_environment.py`](chakravyuh_env/openenv_environment.py) | | |
| | Pydantic `Action` / `Observation` / `State` subclasses | β [`chakravyuh_env/openenv_models.py`](chakravyuh_env/openenv_models.py) | | |
| | Client / server separation (client never imports server internals) | β [`chakravyuh_env/openenv_client.py`](chakravyuh_env/openenv_client.py) | | |
| | Gym-style API: `reset` / `step` / `state` | β | | |
| | Valid `openenv.yaml` manifest | β | | |
| | `openenv validate .` (static) | β 4/4 deployment modes | | |
| | `openenv validate --url β¦` (runtime) | β 6/6 endpoint criteria: `/health`, `/schema`, `/metadata`, `/openapi.json`, `/mcp`, mode consistency | | |
| | OpenEnv **Rubric** system, composable | β [`chakravyuh_env/rubrics.py`](chakravyuh_env/rubrics.py) β see next section | | |
| | Uses OpenEnv latest release | β `openenv-core >= 0.2.3` | | |
| --- | |
| ## Composable Rubric System | |
| The Analyzer's reward decomposes into **eight orthogonal, introspectable child rubrics** rather than monolithic scoring. Each child is a proper `openenv.core.rubrics.Rubric` subclass with its own `last_score` and can be swapped, reweighted, or replaced (e.g. with `LLMJudge`) without touching the top-level. The env serves the v2 profile (`AnalyzerRubricV2`) by default β the same weights v2's LoRA was trained against. | |
| | Rubric | v1 weight | **v2 weight** | Signal | | |
| |---|---|---|---| | |
| | `DetectionRubric` | +1.0 | **+1.0** | Fires on *early* flag (by turn β€ 5) of a real scam | | |
| | `MissedScamRubric` | β0.5 | **β0.5** | Fires when analyzer missed AND money was extracted | | |
| | `FalsePositiveRubric` | β0.3 | **β0.8** | Penalises flagging a benign episode (5Γβ) | | |
| | `CalibrationRubric` | +0.2 | **+0.5** | Rewards suspicion-score calibration vs ground truth | | |
| | `ExplanationRubric` | +0.4 | **+0.4** | Heuristic explanation quality (length + signal references) | | |
| | `SignalAccuracyRubric` | β | **+0.2** | NEW v2: fraction of expected signals correctly named | | |
| | `FormatRubric` | β | **+0.15** | NEW v2: JSON-emission shaping; **denied when flagging benign as scam** | | |
| | `LengthRubric` | β | **Β±0.15** | NEW v2: peak at ~45 tokens, penalty above 70 | | |
| | `RupeeWeightedRubric` *(side-channel aggregator, not in `AnalyzerRubricV2`)* | β | n/a | NEW v3-ready: economic-loss-aware reward in `[-1, +1]`. +loss/cap on detected scams, βloss/cap on missed scams with money extracted. Used by [`eval/rupee_weighted_eval.py`](eval/rupee_weighted_eval.py) to produce the bench-level "βΉ at risk" / "βΉ prevented" headlines. Bench has **βΉ77.95 lakh** of labelled scam loss across 130 scams β see [`logs/rupee_weighted_eval.json`](logs/rupee_weighted_eval.json). | | |
| The three v1βv2 changes (FP β0.3 β β0.8, calibration +0.2 β +0.5, format reward denied on benign-flagged-scam) are the principled fix that produced the asymmetric improvement in Β§Results β detection unchanged, FPR 5Γ down. The v1 profile is still available as `AnalyzerRubric()` for v1-weight reproducibility. See [`chakravyuh_env/rubrics.py`](chakravyuh_env/rubrics.py) for the full reward implementation. | |
| ### Inspection | |
| Every child rubric exposes its score on every call. Training loops can read them directly: | |
| ```python | |
| env = ChakravyuhOpenEnv() # ships AnalyzerRubricV2 by default | |
| # β¦run an episodeβ¦ | |
| for name, child in env.rubric.named_rubrics(): | |
| print(f"{name:18s} last_score={child.last_score}") | |
| # detection last_score=1.0 | |
| # missed_scam last_score=0.0 | |
| # false_positive last_score=0.0 | |
| # calibration last_score=0.95 | |
| # explanation last_score=0.7 | |
| # signal_accuracy last_score=1.0 | |
| # format last_score=1.0 | |
| # length last_score=0.85 | |
| ``` | |
| The full breakdown travels back to clients on every terminal observation (`observation.reward_breakdown`), so wandb/W&B logs can plot every sub-signal independently. The trainer's `compute_reward` and `AnalyzerRubricV2` are kept in numerical parity by `tests/test_v2_reward_parity.py`. | |
| --- | |
| ## Anti-Reward-Hacking Design | |
| Reward hacking is the biggest practical failure mode in the hackathon guide β and we hit it ourselves in v1, then diagnosed and fixed it in v2 (full story below). The reward design follows three principles: | |
| 1. **Multiple independent rubrics.** Five orthogonal children β `detection`, `missed_scam`, `false_positive`, `calibration`, `explanation` β each computed from a different slice of outcome or action. No single signal can be gamed in isolation; each clips to `[0, 1]` so the parent sum is bounded. | |
| 2. **Explicit false-positive penalty + benign calibration.** `FalsePositiveRubric` (β0.3 β β0.8 in v2) makes "flag everything" a dominated strategy; `CalibrationRubric.benign_target=0.1` punishes constant-high-score agents on benign cases. The two terms together eliminate the v1 reward-hacking equilibrium. | |
| 3. **Two-tier oversight + trajectory-aware detection.** The Analyzer cannot single-handedly suppress money extraction β Bank Monitor sees transaction metadata the Analyzer never sees, and the final outcome depends on both. `DetectionRubric` reads `outcome.detected_by_turn β€ 5`, so flipping the score on the final turn doesn't count. | |
| A held-out novel split (30 post-2024 attacks, no equivalent in training) catches training-set overfitting. The full v1βv2 diagnosis is in [`chakravyuh_env/rubrics.py`](chakravyuh_env/rubrics.py) and the story below in Β§Results. Concrete attack tests are in [`logs/analyzer_robustness.json`](logs/analyzer_robustness.json). | |
| --- | |
| ## Results | |
| ### Mode C benchmark β 135 real-grounded scenarios (scripted baseline) | |
| | Metric | Value | 95% CI | | |
| |---|---|---| | |
| | Detection rate (recall) | **72.2%** | [63.5%, 80.0%] | | |
| | Precision | 93.3% | β | | |
| | F1 | 0.814 | β | | |
| | False positive rate | 30.0% | β | | |
| #### Per-category detection | |
| | Category | n | Detection | | |
| |---|---|---| | |
| | OTP theft | 19 | 95% | | |
| | KYC fraud | 22 | 95% | | |
| | Impersonation | 30 | 77% | | |
| | Loan-app fraud | 18 | 67% | | |
| | Investment fraud | 26 | 35% | | |
| #### Temporal-generalization gap (the headline finding) | |
| The numbers below are sourced from `data/chakravyuh-bench-v0/baselines.json` (re-measured on the current n=175 bench, 2026-04-21). The historical n=135 figures (where the gap appeared as 30 pp / novel = 50 %) are preserved in [`logs/mode_c_scripted_n135.json`](logs/mode_c_scripted_n135.json) for reference; the canonical claim uses the current bench. | |
| | Subset | Detection | 95% CI | n | | |
| |---|---|---|---| | |
| | **Known (pre-2024) scams** | **86.4 %** | [80.0 %, 92.7 %] | 110 | | |
| | **Novel (post-2024) scams** | **76.5 %** | [61.8 %, 88.2 %] | 34 | | |
| | **Gap** | **9.9 pp** | β | β | | |
| - Permutation test p-value: **0.184** (not significant at Ξ± = 0.05 on the current bench) | |
| - Cohen's d: **0.27** (small effect) | |
| - The temporal-gap signal weakened as the bench grew from n = 135 β n = 175 with stronger cross-section coverage. The **headline claim is now the LoRA's per-difficulty ramp** (scripted 76.5 % β v2 LoRA 97.1 % on novel = **20.6 pp lift**), not a fragility-of-rules story. | |
| On our 34-scenario post-2024 novel split (matrimonial crypto grooming, deepfake CEO, digital arrest, metaverse real estate, AI chatbot trading), the **scripted analyzer catches 76.5 % (26/34)**. The LoRA closes that gap to **97.1 % (33/34)** β which is a 20.6 pp lift on novel attacks where rule-based pattern matching is noisiest. | |
| ### LoRA-trained Analyzer β v1 (reward-hacked) vs v2 (principled retrain) | |
| The scripted baseline catches **76.5 % of novel post-2024 attacks** (26/34) β better than rule-based usually does, but the missed 8 are exactly the high-loss novel patterns (matrimonial crypto, deepfake CEO, digital arrest). Closing that gap is what the LoRA-trained Analyzer is for. We trained two LoRA adapters on top of Qwen2.5-7B-Instruct with TRL's GRPO, using a composable reward ([rubrics.py](chakravyuh_env/rubrics.py)). The honest story is more interesting than a single good number: | |
| #### v1 β v2 delta | |
| | Metric | v1 (reward-hacked) | v2 (retrained) | Change | 95% CI (v2) | | |
| |---|---|---|---|---| | |
| | Detection rate | 100.0% | **99.3%** | β same | [96.2%, 99.9%] *(Wilson)* | | |
| | False positive rate | 36.0% | **6.7%** | **β29.5 pp (~5Γ)** | [1.8%, 20.7%] *(Wilson)* | | |
| | Precision | β | 98.6% | β | β | | |
| | F1 | 0.96 | **0.99** | +0.03 | β | | |
| | Bench n | 135 | 174 (scored) / 175 total | β | β | | |
| v2 was trained with three anti-collapse reward changes: FP penalty tightened from β0.3 β **β0.8**, benign-calibration weight raised from 0.3 β **0.5**, and the format reward was **removed when the model flags a benign as scam** (removing the "lazy over-flag" shortcut). KL anchor `Ξ² = 0.15` (stiffer than v1's 0.08). See [`training/grpo_analyzer.py`](training/grpo_analyzer.py). | |
| #### v2 per-difficulty ramp (scripted baseline β LoRA v2) | |
| | Difficulty | Scripted (current bench, n=175) | LoRA v2 | Lift | | |
| |---|---|---|---| | |
| | Easy (n=26) | 96.2 % (25/26) | 100 % | +3.8 pp | | |
| | Medium (n=66) | 86.4 % (57/66) | 100 % | +13.6 pp | | |
| | **Hard (n=18)** | **72.2 % (13/18)** | **100 %** | **+27.8 pp** | | |
| | **Novel (n=34)** | **76.5 % (26/34)** | **97.1 %** | **+20.6 pp** | | |
| The largest lifts appear exactly where the scripted rule-based baseline fails most β hard and novel scenarios. That shape is the signature of genuine generalization, not pattern matching. Per-difficulty chart: [`v2_per_difficulty_check.png`](https://raw.githubusercontent.com/UjjwalPardeshi/Chakravyuh/a9e723bf495182724845dbf1f69f8968434a9e02/docs/assets/plots/v2_per_difficulty_check.png). Analogous scripted-baseline temporal gap: [`temporal_gap_closure.png`](https://raw.githubusercontent.com/UjjwalPardeshi/Chakravyuh/a9e723bf495182724845dbf1f69f8968434a9e02/docs/assets/plots/temporal_gap_closure.png). | |
| #### Why v1 was reward-hacked (and how we diagnosed it) | |
| v1 hit detection=100% but FPR=36%. That combination β *everything* gets flagged β is the reward-hacking fingerprint: the model learned "always output high score" because the v1 reward profile (FP penalty β0.3, format reward always paid, benign calibration 0.3) made flagging dominant. The per-difficulty plot confirmed it: v1's detection was uniform β100% across easy / medium / hard / novel β a model that genuinely learns shows a ramp. v2 still shows near-flat detection (bench scenarios are clearly classifiable to a well-trained analyzer), **but FPR dropped 5Γ** β which is the real signal that the model is now respecting the benign class instead of spamming high scores. | |
| #### Limitations β be honest about what the bench can and can't tell you | |
| 1. **Semantic leakage between training and bench (we audited this ourselves).** Our `_filter_soft_leakage` removes substring duplicates only. We re-audited with a MiniLM-L6 cosine-similarity nearest-neighbor scan: **mean cosine = 0.80, 44.8 % of bench has cosine > 0.85, 18.4 % > 0.95** ([`logs/semantic_leakage_audit.json`](logs/semantic_leakage_audit.json), [`plots/chakravyuh_plots/semantic_leakage_histogram.png`](plots/chakravyuh_plots/semantic_leakage_histogram.png)). Implication: the 100 % detection on easy / medium / hard is partially memorization. The v1βv2 FPR fix and the scripted-baseline novel collapse are unaffected (relative comparisons within the same bench). Reproduce: `python eval/semantic_leakage_audit.py`. | |
| 2. **Small benign sample (n=31).** FPR=6.7% has a wide Wilson 95% CI of **[1.8%, 20.7%]**. A single additional benign misclassification would move the point estimate from 6.7% to 10.0%. We stand behind the "~5Γ FPR reduction vs v1" claim (statistically real) but not the specific "6.7%" number as a precise estimate. | |
| 3. **Bench is a proxy.** 175 curated scenarios do not span real-world fraud diversity. Production performance will be lower. | |
| 4. **1 epoch over 619 training examples.** The trainer hit the dataset natural endpoint at step 619 (not 700). More epochs + larger training corpus would sharpen the signal. | |
| 5. **Per-scenario false-positive audit pending.** We have not yet manually inspected *which* 2 benigns were misclassified. Until that audit runs, we cannot rule out a specific templated blind spot. | |
| #### What we plan next (v3 β rigorous validation) | |
| - Expand benign corpus to **β₯150 labelled scenarios** (target benign n=150, FPR CI Β±3 pp) | |
| - Multi-seed retrains (3 seeds) to report mean Β± std, not point estimates | |
| - External held-out set: 50 novel scam patterns *not* derived from any canonical template | |
| - Manual audit of every v2 false positive + missed scam | |
| - Bootstrap CIs on per-difficulty detection (current numbers have n=18 on `hard`, n=34 on `novel` β still thin) | |
| Artifacts for the v2 run: [`logs/eval_v2.json`](logs/eval_v2.json), adapter on HF Hub at [`ujjwalpardeshi/chakravyuh-analyzer-lora-v2`](https://huggingface.co/ujjwalpardeshi/chakravyuh-analyzer-lora-v2), 10 000-iter percentile bootstrap CIs at [`logs/bootstrap_v2.json`](logs/bootstrap_v2.json), per-rubric ablation at [`logs/ablation_study.json`](logs/ablation_study.json), red-team robustness at [`logs/analyzer_robustness.json`](logs/analyzer_robustness.json). | |
| ### Env rollout baseline β scripted agents, 300 episodes | |
| | Metric | Value | | |
| |---|---| | |
| | Analyzer detection rate | 47% | | |
| | Scam extraction rate | 18% | | |
| | Victim refusal rate | 20% | | |
| | Victim sought verification | 13% | | |
| | Bank freeze rate | 6% | | |
| | Avg detection turn | ~3 | | |
| The scripted Analyzer is intentionally a *competent-but-beatable* baseline β strong on explicit info-request patterns, weak on subtler financial-lure language, multi-lingual attacks, and modern 2025β2026 attack vectors. These hard cases are the gap the LoRA-trained Qwen2.5-7B Analyzer closes during GRPO post-training. | |
| ### Training curves | |
|  | |
| > *v2 Analyzer GRPO training trajectory rendered from | |
| > [`logs/v2_trainer_state.json`](logs/v2_trainer_state.json) (123 logged | |
| > points at logging_steps=5 over 615 total steps). | |
| > **Reward** climbs from 1.29 β ~1.97 and stabilises with shrinking variance β the | |
| > 8-rubric weighted sum is being learned, not gamed. | |
| > **Loss** stays bounded around zero (no divergence, no clipping | |
| > spikes). | |
| > **KL** plateaus at 0.25β0.45 (honestly disclosed); the | |
| > v3 plan adds a KL-early-stop guard at 0.20 (orange line). | |
| > **Grad norm** is well-behaved (no explosions). | |
| > Reproduce: `python eval/plot_training_curves.py`.* | |
| The v1 training curve [`training_reward_curve.png`](https://raw.githubusercontent.com/UjjwalPardeshi/Chakravyuh/a9e723bf495182724845dbf1f69f8968434a9e02/docs/assets/plots/training_reward_curve.png) is published alongside the v1 reward-hacking diagnostic [`reward_hacking_diagnostic.png`](https://raw.githubusercontent.com/UjjwalPardeshi/Chakravyuh/a9e723bf495182724845dbf1f69f8968434a9e02/docs/assets/plots/reward_hacking_diagnostic.png) so readers can see what the hack looked like in reward/loss space. The v2 per-difficulty bar chart is at [`v2_per_difficulty_check.png`](https://raw.githubusercontent.com/UjjwalPardeshi/Chakravyuh/a9e723bf495182724845dbf1f69f8968434a9e02/docs/assets/plots/v2_per_difficulty_check.png). | |
| ### Evidence beyond headline numbers | |
| Four extra plots regenerated locally from logged eval data β no GPU required, every script CPU-runnable in seconds. | |
| **1. Calibration is not gamed** β SFT baseline ECE = 0.039, MCE = 0.043 across n=175. The reliability diagram lies on the diagonal: when the model says 0.7 it is right ~70% of the time. (v2 LoRA per-row scores are B.12; we ship the SFT baseline as-is rather than overclaim.) | |
|  | |
| > Reproduce: `python eval/calibration_analysis.py`. Source: [`logs/calibration_sft.json`](logs/calibration_sft.json). | |
| **2. Per-rubric ablation** β zero each child rubric in turn; measure the drop in average composite reward over n=135 scripted-baseline scenarios. Detection (-0.61) and calibration (-0.13) carry the signal; missed_scam and explanation are no-ops at eval time (they only matter during training, where the gradient flows through them). False_positive *helps* a tiny bit when removed (+0.013) β the cost is paid in benign-FPR not in average reward. | |
|  | |
| > Reproduce: `python eval/plot_ablation_per_rubric.py`. Source: [`logs/ablation_study.json`](logs/ablation_study.json). | |
| **3. Leakage-clean slice** β re-evaluate every provider on the n=50 subset where the nearest training text has cosine similarity < 0.7 (audited with MiniLM-L6). Scripted holds within 2.4 pp; frontier-LLM providers do *not* improve on the clean slice β their failure mode is structural (no Indian-fraud priors), not memorisation by the bench. | |
|  | |
| > Reproduce: `python eval/plot_leakage_clean_slice.py`. Source: [`logs/leakage_clean_slice.json`](logs/leakage_clean_slice.json). | |
| **4. SFT vs v2-GRPO fingerprint** β same Qwen2.5-7B base, same LoRA (r=32, Ξ±=64, all linear), same training corpus. Only the algorithm changes. GRPO buys +5.6 pp on hard scenarios at the cost of -2.9 pp on novel and +3.4 pp FPR β a real, measurable algorithm trade-off, not noise. | |
|  | |
| > Reproduce: `python eval/plot_sft_vs_v2_fingerprint.py`. Sources: [`logs/eval_sft.json`](logs/eval_sft.json), [`logs/eval_v2.json`](logs/eval_v2.json). | |
| --- | |
| ## Repo Layout | |
| `chakravyuh_env/` (env + 5 agents + composable rubrics) Β· `server/` (FastAPI + Gradio demo) Β· `training/` (GRPO LoRA) Β· `eval/` (bench + bootstrap + red-team) Β· `notebooks/` (Analyzer v2 + Scammer Phase 1 training). | |
| --- | |
| ## Deployment | |
| ### Local (fastest) | |
| ```bash | |
| pip install -e . | |
| uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| ``` | |
| ### Hugging Face Space | |
| The repo is HF-Space-ready (Docker runtime): | |
| ```bash | |
| openenv push . # from OpenEnv CLI | |
| # or | |
| git remote add hf https://huggingface.co/spaces/ujjwalpardeshi/chakravyuh && git push hf main | |
| ``` | |
| > β οΈ **HF Space cold start**: A sleeping Space takes ~30β60s to boot on first request while the container starts. Subsequent requests are <1s. Use `/health` to poll readiness before submitting traffic. The `/demo/` route returns 200 once the Gradio app has mounted. | |
| ### Replay UI (for the demo) | |
| ```bash | |
| pip install -e '.[demo]' | |
| python -m server.demo_ui | |
| ``` | |
| The Gradio UI provides two tabs: | |
| 1. **Replay** β 5 curated deterministic episodes (seed-reproducible, zero inference risk) | |
| 2. **Live** β paste any suspicious message, analyzer scores it instantly | |
| | # | Story | Demonstrates | | |
| |---|---|---| | |
| | 1 | Multi-Agent Defense Wins | Analyzer + Bank Monitor cooperate, tx frozen | | |
| | 2 | Skeptical Victim Refuses | Tech-savvy user recognizes pattern, refuses | | |
| | 3 | Verification-First Behaviour | Victim calls bank to verify β ideal outcome | | |
| | 4 | Detection Too Late | Analyzer flags but victim already complied β motivates LoRA | | |
| | 5 | Scripted Rules Blind Spot | Rule-based misses subtle KYC scam β gap the LoRA closes | | |
| --- | |
| ## Hackathon Checklist (from `guidelines/`) | |
| | Requirement | Status | | |
| |---|---| | |
| | Uses OpenEnv (latest release) | β `openenv-core>=0.2.3,<0.3` | | |
| | Environment / client / server separation | β | | |
| | `openenv.yaml` manifest | β | | |
| | Gym-style `reset` / `step` / `state` | β | | |
| | No reserved MCP tool names | β `tests/test_mcp_compliance.py` | | |
| | Working training script (TRL / Unsloth, Colab) | β [`training/train_colab.ipynb`](training/train_colab.ipynb) + [`notebooks/v2_retrain_safe.ipynb`](notebooks/v2_retrain_safe.ipynb) | | |
| | Multiple independent reward functions | β 8 composable child rubrics | | |
| | Anti-reward-hacking design | β [Anti-Reward-Hacking Design](#anti-reward-hacking-design) + [`logs/analyzer_robustness.json`](logs/analyzer_robustness.json) | | |
| | Real training evidence (reward/loss plots) | β [v2 GRPO training curves (reward / loss / KL / grad-norm, 615 steps)](plots/chakravyuh_plots/training_curves_v2.png) Β· [training reward (v1)](https://raw.githubusercontent.com/UjjwalPardeshi/Chakravyuh/a9e723bf495182724845dbf1f69f8968434a9e02/docs/assets/plots/training_reward_curve.png) Β· [reward-hacking diagnostic](https://raw.githubusercontent.com/UjjwalPardeshi/Chakravyuh/a9e723bf495182724845dbf1f69f8968434a9e02/docs/assets/plots/reward_hacking_diagnostic.png) Β· [per-difficulty](https://raw.githubusercontent.com/UjjwalPardeshi/Chakravyuh/a9e723bf495182724845dbf1f69f8968434a9e02/docs/assets/plots/v2_per_difficulty_check.png) | | |
| | HF Space deployed | β [LIVE](https://huggingface.co/spaces/ujjwalpardeshi/chakravyuh) | | |
| | Mini-blog OR <2-min video (writeup) | β [`Blog.md`](Blog.md) (HF-Space-side writeup, MD separate from README per organisers) | | |
| | README links to all materials | β (see Submission Materials) | | |
| --- | |
| ## Data Sources | |
| All 144 scam-side scenarios are real-incident-grounded (RBI / NPCI / I4C / news media). The 31 benign-side scenarios include **25 synthetic legitimate-bank-SMS templates** (HDFC / ICICI / Amazon / Aadhaar / utility-bill formats) used as hard negatives for FPR estimation. We disclose because precision matters for honest reporting. | |
| - RBI Annual Report on Financial Fraud (rbi.org.in) | |
| - NPCI Safety Bulletins (npci.org.in/safety-and-awareness) | |
| - sachet.rbi.org.in | |
| - I4C β Indian Cybercrime Coordination Centre (cybercrime.gov.in) | |
| - IIT Kanpur C3i Center (security.cse.iitk.ac.in) | |
| ## Beyond UPI fraud β methodological contribution | |
| Chakravyuh is also a worked example of catching reward hacking in GRPO post-training. The asymmetric-improvement signature β detection unchanged, FPR collapses β is a diagnostic any RLHF/RLAIF pipeline can reuse. The reward-decomposition + per-rubric ablation method is portable to any composable-rubric task. We share the bench, the LoRA, the v1 trainer state, and the live red-team tab specifically so practitioners can apply this diagnostic to their own training runs. The v2 training trajectory is in [`logs/v2_trainer_state.json`](logs/v2_trainer_state.json); the v3 KL-early-stop guard is on the roadmap. | |
| ## License | |
| MIT β see `LICENSE`. Bench dataset is CC-BY-4.0; see [`DATASET_CARD.md`](DATASET_CARD.md). | |
| **Citation:** see [`CITATION.cff`](CITATION.cff). | |