# FAQ

For judges and reviewers skimming the project. Detailed answers in the linked artifacts.

## What is Chakravyuh in one sentence?

A multi-agent OpenEnv-compliant reinforcement-learning environment for Indian UPI fraud detection that diagnoses and fixes reward hacking in RLHF (v1: 100 % detection / 36 % FPR → v2: 99.3 % / 6.7 % FPR).

## What are the headline numbers and where can I verify them?

| Claim | Value | Artifact |
|---|---|---|
| v2 Analyzer detection on bench (n=174) | 99.3 % | [`logs/eval_v2.json`](logs/eval_v2.json) |
| v2 Analyzer false-positive rate (n=30 benigns) | 6.7 % | [`logs/eval_v2.json`](logs/eval_v2.json) |
| v2 vs v1 FPR drop | 36 % → 6.7 % (5×) | [`README.md`](README.md) §reward-hacking-diagnosis |
| Bootstrap CI for FPR (10k iter, percentile method) | [1.8 %, 20.7 %] | [`logs/bootstrap_v2.json`](logs/bootstrap_v2.json) |
| B.2 Scammer LoRA bypass (n=64, single-shot) | 59.4 % | [`logs/b2_phase1_scammer_eval_n64.json`](logs/b2_phase1_scammer_eval_n64.json) |
| B.2 Scammer LoRA bypass (n=64, best-of-8) | 93.75 % | [`logs/b2_phase1_scammer_eval_n64_bestof8.json`](logs/b2_phase1_scammer_eval_n64_bestof8.json) |
| B.2 held-out novel category bypass (best-of-8) | 100 % | same file, `aggregate.held_out_seeds` |
| B.2 Scammer train-vs-held-out parity (Fisher's exact, OOD generalization) | p = 0.80 (single-shot), p = 0.11 (best-of-8) | [`logs/scammer_significance.json`](logs/scammer_significance.json) |
| B.2 Scammer best-of-8 vs single-shot (McNemar exact, paired) | p ≈ 5e-7 (strictly dominant) | [`logs/scammer_significance.json`](logs/scammer_significance.json) |
| Semantic-leakage audit (cosine > 0.85 to training) | 44.8 % of bench | [`logs/semantic_leakage_audit.json`](logs/semantic_leakage_audit.json) |
| Live HF Space cold-start | 2.7 s | manual probe; reproducible from any laptop |
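
The bootstrap row is the one readers most often ask about. Here is a minimal sketch of the percentile method on the benign slice, assuming the eval log exposes a per-item list of benign verdicts (the `benign_flags` field below is hypothetical; the logged artifact's schema and exact resampling details are authoritative):

```python
import json
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical field name: 1 = benign item flagged, i.e. a false positive.
flags = np.array(json.load(open("logs/eval_v2.json"))["benign_flags"])

# Percentile method: resample the benign verdicts with replacement 10k times,
# take the FPR of each resample, then read off the 2.5th/97.5th percentiles.
boot = [rng.choice(flags, size=flags.size, replace=True).mean()
        for _ in range(10_000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"FPR = {flags.mean():.1%}, 95% CI [{lo:.1%}, {hi:.1%}]")
```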

## Why GRPO over PPO or SFT?

PPO needs a learned value head; on a 7B base with rubric-style rewards, that is an extra 100M parameters and an added source of instability. GRPO uses group-relative advantage (no value head), lines up with our composable rubric (`AnalyzerRubricV2`), and, uniquely, supports the **adversarial-Scammer training loop** in B.2: the Scammer is trained against a frozen analyzer, then the analyzer is retrained against a frozen Scammer. SFT can match GRPO's standalone v2 numbers on this bench (3.2 % vs 6.7 % FPR; see [`logs/sft_vs_grpo_comparison.json`](logs/sft_vs_grpo_comparison.json)) but cannot run the co-evolution loop.
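
For readers who want the mechanics: GRPO replaces the value head with a per-prompt group baseline. A toy illustration of the advantage computation (not the repo's training code):

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: (prompts, group_size) rubric scores for G completions
    sampled per prompt. Each completion is scored against its own group
    mean/std, so no learned value head is needed."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True) + 1e-6  # guard against zero spread
    return (rewards - mean) / std

# 2 prompts x 4 sampled completions, toy rubric scores
print(group_relative_advantages(
    np.array([[0.9, 0.2, 0.5, 0.4],
              [1.0, 1.0, 0.1, 0.7]])))
```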

## Why Qwen2.5-7B-Instruct as the base?

Multilingual coverage of all 7 languages we care about (Hindi, Tamil, Telugu, Kannada, Bengali, Marathi, English), an Apache-2.0 license, a footprint that fits a single A100 with LoRA, and an instruction-tuned base that needs less compute to converge.

## You only trained one agent, so isn't that just supervised classification?

No, **two agents are now trained**: the Analyzer (v2 LoRA on Qwen2.5-7B) and the Scammer (Phase-1 LoRA on Qwen2.5-0.5B). The Scammer was trained via TRL 0.14 GRPO against the rule-based defense and now evades it at 93.75 % best-of-8 / 100 % on held-out novel categories. Phase 2 (LoRA-vs-LoRA co-evolution) is the next move.
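
The Scammer training loop, reduced to its skeleton. A hedged sketch against TRL's `GRPOTrainer` (the real reward function and seed prompts live in the repo; `bypass_reward` and the seeds below are stand-ins):

```python
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def bypass_reward(completions, **kwargs):
    # Stand-in for the real rubric: +1 when the scripted defense does not
    # flag the generated opener. A trivial keyword check plays defense here.
    return [0.0 if "OTP" in c.upper() else 1.0 for c in completions]

train = Dataset.from_dict({"prompt": [
    "Write a UPI scam opener using a KYC-expiry pretext.",   # stand-in seeds
    "Write a UPI scam opener using a cashback pretext.",
]})

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=bypass_reward,
    args=GRPOConfig(output_dir="scammer-lora-phase1", num_generations=8),
    train_dataset=train,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```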

## How do you know v2 is not also reward-hacked?

The asymmetric improvement. v1 scored 100 % detection / 36 % FPR, a textbook reward-hacking signature. v2 scores **99.3 %** detection / **6.7 %** FPR: detection essentially unchanged while FPR collapsed 5×. That is what *learning the task* looks like; *gaming the proxy* would have moved both metrics together. The full diagnosis is in the [Reward-Hacking Incident and Fix](Blog.md#reward-hacking-incident-and-fix-main-contribution) section of Blog.md.

## How is your bench different from your training data?

We audited this with MiniLM-L6 cosine similarity ([`eval/semantic_leakage_audit.py`](eval/semantic_leakage_audit.py)): 44.8 % of bench items have cosine > 0.85 to the nearest training text, so the 100 % detection on the easy/medium/hard buckets is **partly memorization**. The v1 → v2 relative FPR fix and the scripted-baseline novel-collapse are **unaffected by leakage** (both are relative comparisons on the same bench). v3 closes the absolute gap with a held-out template-family retrain; see the Honest Limitations section in [`Blog.md`](Blog.md#honest-limitations).
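
The audit itself is a few lines. A sketch of the idea behind `eval/semantic_leakage_audit.py` (the texts below are illustrative, not the repo's data):

```python
from sentence_transformers import SentenceTransformer

train_texts = ["Your KYC expires today, share the OTP to keep UPI active."]
bench_texts = ["KYC expiring in 1 hour! Send OTP or UPI gets blocked.",
               "Lunch at 1pm? I'll book the usual place."]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
E_tr = model.encode(train_texts, normalize_embeddings=True)
E_be = model.encode(bench_texts, normalize_embeddings=True)

# With L2-normalized embeddings, cosine similarity is a plain dot product.
max_sim = (E_be @ E_tr.T).max(axis=1)
print("leaked (cos > 0.85):", (max_sim > 0.85).mean())
print("clean slice (cos < 0.70):", (max_sim < 0.70).sum(), "items")
```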

## If 44.8 % of bench is high-similarity, what's your real generalization number?

The leakage-clean subset (cosine < 0.70: 38 scams + 12 benigns = 50 scenarios) is where the honest in-distribution number lives; measuring it is the B.12 v3 work. The novel post-2024 split (n=34) gives 97.1 % detection, but per the leakage audit even the novel split has mean cosine 0.79 to training, so the real OOD number sits below that. The B.2 Scammer's 100 % held-out bypass on novel categories is *attacker-side* OOD evidence: it shows the trained Scammer learned a generalizable structure of UPI fraud, not memorized prompts.

## Why does this matter outside India?

The methodological contribution generalizes. The v1 → v2 reward-hacking-fix loop (FP penalty up, calibration weight up, format reward off on benign) is the **canonical worked example** of catching reward hacking in any RLHF post-training pipeline that uses composable-rubric rewards. The two-tier oversight architecture (chat-only Analyzer plus metadata-only Bank Monitor) maps cleanly onto the scalable-oversight literature. Domain-Indian, methodology-universal.
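
To make the fix loop concrete, here is a hypothetical composable-rubric reward showing the direction of each v1 → v2 change (weights are illustrative; the real terms live in `AnalyzerRubricV2`):

```python
def rubric_reward_v2(verdict, is_scam, well_formatted, calibration):
    """Illustrative weights only; the shipped rubric is AnalyzerRubricV2."""
    r = 0.0
    if is_scam and verdict == "scam":
        r += 1.0              # detection term (unchanged from v1)
    if not is_scam and verdict == "scam":
        r -= 2.0              # FP penalty up (v1 penalized this only mildly)
    r += 0.5 * calibration    # calibration weight up: confidence must track truth
    if well_formatted and is_scam:
        r += 0.1              # format reward gated off on benign turns, so
                              # formatting cannot subsidize a false positive
    return r
```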

## Can I run this on a phone?

The merged 7B + LoRA quantized to q4_k_m runs at ~10 tok/s on a Pixel 8 (8 GB RAM). The GGUF release will land at `ujjwalpardeshi/chakravyuh-v2-gguf` once shipped; see `serving/` for vLLM and Ollama harnesses covering the laptop, server, and phone deployment paths.
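
For the laptop/server path, a minimal probe assuming the GGUF has been pulled into a local Ollama daemon (the model tag below is hypothetical; the real harnesses are in `serving/`):

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",   # Ollama's default HTTP endpoint
    json={
        "model": "chakravyuh-v2-q4_k_m",     # hypothetical local model tag
        "prompt": "Caller says my UPI KYC expires in 1 hour and wants an OTP. Scam?",
        "stream": False,
    },
    timeout=120,
)
print(resp.json()["response"])
```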

## How do I reproduce your numbers?

```bash
git clone https://github.com/UjjwalPardeshi/Chakravyuh && cd Chakravyuh
uv sync --frozen             # pinned reproducible install from uv.lock
pytest tests/ -v             # 341 collected, 338 pass, 3 GROQ-gated skips
make smoke-test              # in-process env reset+step
make reproduce               # CHAKRAVYUH_SKIP_INFERENCE=1 for cached scores (~10 min CPU)
```

Full walk-through with expected output snippets at [`REPRODUCE.md`](REPRODUCE.md).

## Production-ready?

No. Chakravyuh is a research environment, not a deployable Indian-bank fraud module. Domain adaptation, regulatory compliance (DPDPA, RBI rules), live-data evaluation, and adversary-resistant deployment are out of scope. See the Responsible Use section in [`MODEL_CARD.md`](MODEL_CARD.md#limitations).

## What's the most surprising finding?

The semantic-leakage audit. We assumed our substring filter was sufficient to ensure bench/training disjointness. Running the full MiniLM-L6 cosine audit revealed that 44.8 % of bench items have cosine > 0.85 to a training text. We shipped this disclosure *as a feature*: the discipline of measuring and publishing it is what separates research from demoware.

## How does this compare to GPT-4o / Claude / Gemini?

We ran an open-weight frontier comparison via HuggingFace Inference Providers (paid from our HF compute credits, ~$2 total across 7 models). Numbers are from [`logs/frontier_comparison.csv`](logs/frontier_comparison.csv); frontier rows are n = 175 (the full bench file), while the v2 LoRA row is n = 174 (one row dropped on inference; see [`Blog.md`](Blog.md#honest-limitations)):

| Model | Params | Detection | FPR | F1 |
|---|---|---|---|---|
| **Chakravyuh v2 LoRA (this work)** | **7B + LoRA** | **99.3 %** | **6.7 %** | **0.990** |
| Qwen2.5-7B-Instruct (base, no LoRA) | 7B | 100 % | 16.1 % | 0.983 |
| Llama-3.3-70B-Instruct | 70B | 99.3 % | 3.2 % | 0.993 |
| Qwen2.5-72B-Instruct | 72B | 98.6 % | 6.5 % | 0.986 |
| DeepSeek-V3-0324 | 671B MoE | 100 % | **29.0 %** | 0.970 |
| gpt-oss-120b | 120B | 98.6 % | 16.1 % | 0.976 |
| gemma-3-27b-it | 27B | 100 % | **51.6 %** | 0.947 |
| DeepSeek-R1 (reasoning, parser-fix applied) | 671B MoE | 100 % | 12.9 % | 0.986 |
| Scripted baseline | β€” | 84.0 % | 9.7 % | 0.903 |

Four readouts:

1. **The GRPO + LoRA contribution is now isolated.** Same Qwen2.5-7B base, no LoRA: FPR 16.1 % / F1 0.983. After our GRPO training: FPR **6.7 %** / F1 **0.990**. **−9.4 pp FPR and +0.007 F1 from the reward-engineered training alone** (point estimate; Fisher's exact p = 0.42 at n_benign = 30, i.e. directional, not yet at α = 0.05; the B.11 benign-corpus expansion fixes that). Source: [`logs/grpo_lora_significance.json`](logs/grpo_lora_significance.json).
2. **Parameter efficiency, by pairwise Fisher's exact vs the v2 LoRA** ([`logs/frontier_significance.json`](logs/frontier_significance.json)): tied with Llama-3.3-70B (p = 0.61) and Qwen2.5-72B (p = 1.00) at 10× fewer parameters; **significantly beats DeepSeek-V3 (p = 0.043) and gemma-3-27B (p = 0.0002)**. A reproduction sketch follows this list.
3. **DeepSeek-V3 (671B) reproduces the v1 reward-hacking signature externally.** Its 100 % detection / 29 % FPR profile is structurally identical to our v1's 100 % / 36 %, and the FPR gap vs the calibrated LoRA is statistically significant. A frontier-class model independently lands in the failure mode our reward-engineering methodology diagnoses and fixes, which externally validates the diagnostic itself. gemma-3-27b-it (FPR 51.6 %, p = 0.0002 vs the LoRA) is the same story at smaller scale.
4. **Open-weight frontier ≠ guaranteed scam-spotting.** Five of the seven open frontier models we tested have FPR > 6.7 % on this bench. The calibration channel, not raw capacity, is what is actually contested.
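
The pairwise test in readout 2 is a one-liner to reproduce. A sketch with illustrative benign-slice counts derived from the reported rates (the logged contingency tables in `logs/frontier_significance.json` are authoritative and may be constructed differently):

```python
from scipy.stats import fisher_exact

# Illustrative counts only: ~2/30 benign FPs for the v2 LoRA (6.7 %) vs
# ~9/31 for DeepSeek-V3 (29.0 %). Rows: [false positives, correct benigns].
table = [[2, 28],
         [9, 22]]
odds, p = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds:.3f}, p = {p:.3f}")
```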

**Reasoning-model parser fix.** DeepSeek-R1 is a chain-of-thought model that wraps its answer in `<think>...</think>` blocks. Our original parser demanded JSON-only output, so R1's reasoning-token output failed to parse and defaulted to 0. A reasoning-aware fix shipped at [`eval/frontier_baseline.py:_strip_reasoning`](eval/frontier_baseline.py), plus a `max_tokens=4096` budget for reasoning models, with 5 unit tests at [`tests/test_frontier_baseline.py`](tests/test_frontier_baseline.py). After the fix, R1 scores **100 % / 12.9 % / F1 = 0.986**; the table above shows the corrected numbers.
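
The essence of the fix, for readers who hit the same failure with other reasoning models (a sketch of the idea, not the shipped code):

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Drop chain-of-thought blocks so the JSON verdict underneath parses."""
    return THINK_RE.sub("", text).strip()

raw = '<think>KYC pretext + OTP demand + urgency...</think>\n{"verdict": "scam"}'
print(strip_reasoning(raw))  # {"verdict": "scam"}
```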

**Proprietary frontier (GPT-4o / Claude / Gemini) deferred.** Those APIs are not covered by HF compute credits, and we did not authorize the ~$40–80 separate spend. The script supports them given the appropriate API keys (`OPENAI_API_KEY` / `ANTHROPIC_API_KEY` / `GEMINI_API_KEY`); reproduction instructions are in [`REPRODUCE.md`](REPRODUCE.md) Step 6b.

## How does your trained Scammer compare to frontier LLMs as attackers?

**Our 0.5B trained Scammer beats every untrained frontier model, including 671B DeepSeek-V3, at evading the same scripted defense.** Source: [`logs/scammer_frontier_comparison.csv`](logs/scammer_frontier_comparison.csv).

| Scammer | Params | Bypass rate (vs ScriptedAnalyzer) |
|---|---|---|
| **Chakravyuh Scammer LoRA Phase 1 (best-of-8)** | **0.5B + LoRA** | **93.75 %** |
| gpt-oss-120b (untrained) | 120B | 87.5 % |
| Llama-3.3-70B (untrained) | 70B | 68.8 % |
| Qwen2.5-7B base (untrained) | 7B | 62.5 % |
| **Chakravyuh Scammer LoRA Phase 1 (single-shot)** | **0.5B + LoRA** | **59.4 %** |
| Qwen2.5-72B (untrained) | 72B | 56.2 % |
| gemma-3-27b-it (untrained) | 27B | 43.8 % |
| DeepSeek-V3-0324 (untrained) | 671B MoE | 31.2 % |

Same parameter-efficiency story as the defender side: reward-engineered training at 0.5B beats raw capacity at 240×–1340× the parameter count. Two trained agents, both parameter-efficient against frontier baselines, on opposite sides of the fraud loop. DeepSeek-V3's low score is partly its safety training refusing scam roleplay; even adjusting for that, the trained 0.5B stays on top.
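
What "best-of-8" means operationally: a seed counts as a bypass when any of 8 independently sampled openers slips past the defense. A sketch (function names are illustrative):

```python
def best_of_n_bypass_rate(seeds, generate, defense_flags, n=8):
    """Fraction of seeds where at least one of n sampled openers evades
    the defense. `generate` samples one opener for a seed;
    `defense_flags` returns True when the defense catches it."""
    bypassed = sum(
        any(not defense_flags(generate(seed)) for _ in range(n))
        for seed in seeds
    )
    return bypassed / len(seeds)
```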

## Where can I see the live demo?

[https://ujjwalpardeshi-chakravyuh.hf.space](https://ujjwalpardeshi-chakravyuh.hf.space): `/demo/` for the Gradio UI (the red-team tab is the wow-moment), `/openapi.json` for the JSON API, and `POST /mcp` for the JSON-RPC interface.
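
A quick way to poke the JSON-RPC surface, assuming the Space speaks standard MCP (`tools/list` is the protocol's discovery method; the returned tool names are the source of truth):

```python
import requests

resp = requests.post(
    "https://ujjwalpardeshi-chakravyuh.hf.space/mcp",
    json={"jsonrpc": "2.0", "id": 1, "method": "tools/list"},
    timeout=30,
)
print(resp.json())
```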

## What's still open?

See the Honest Limitations section in [`Blog.md`](Blog.md#honest-limitations) for the comprehensive list. Top three: (1) per-row v2 logits + leakage-clean slice (B.12), (2) held-out template-family retrain (B.7), (3) B.2 Phase 2 LoRA-vs-LoRA co-evolution.