
# FAQ

For judges and reviewers skimming the project. Detailed answers in the linked artifacts.

## What is Chakravyuh in one sentence?

A multi-agent OpenEnv-compliant reinforcement-learning environment for Indian UPI fraud detection that diagnoses and fixes reward hacking in RLHF (v1 100 % detection / 36 % FPR β†’ v2 99.3 % / 6.7 % FPR).

## What are the headline numbers and where can I verify them?

| Claim | Value | Artifact |
| --- | --- | --- |
| v2 Analyzer detection on bench (n=174) | 99.3 % | logs/eval_v2.json |
| v2 Analyzer false-positive rate (n=30 benigns) | 6.7 % | logs/eval_v2.json |
| v2 vs v1 FPR drop | 36 % → 6.7 % (5×) | README.md §reward-hacking-diagnosis |
| Bootstrap CI for FPR (10k iter, percentile method) | [1.8 %, 20.7 %] | logs/bootstrap_v2.json |
| B.2 Scammer LoRA bypass (n=64, single-shot) | 59.4 % | logs/b2_phase1_scammer_eval_n64.json |
| B.2 Scammer LoRA bypass (n=64, best-of-8) | 93.75 % | logs/b2_phase1_scammer_eval_n64_bestof8.json |
| B.2 held-out novel-category bypass (best-of-8) | 100 % | same file, aggregate.held_out_seeds |
| B.2 Scammer train-vs-held-out parity (Fisher's exact, OOD generalization) | p = 0.80 (single-shot), p = 0.11 (best-of-8) | logs/scammer_significance.json |
| B.2 Scammer best-of-8 vs single-shot (McNemar exact, paired) | p ≈ 5e-7 (strictly dominant) | logs/scammer_significance.json |
| Semantic-leakage audit (cosine > 0.85 to training) | 44.8 % of bench | logs/semantic_leakage_audit.json |
| Live HF Space cold-start | 2.7 s | manual probe; reproducible from any laptop |
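The bootstrap CI row is easy to sanity-check. Below is a minimal sketch of a 10k-iteration percentile bootstrap over per-benign outcomes; the flag vector is illustrative (2/30 false positives, matching the reported 6.7 %), so the printed interval will not exactly reproduce logs/bootstrap_v2.json:

```python
# Percentile bootstrap over benign-set outcomes (1 = false positive).
import numpy as np

rng = np.random.default_rng(0)
fp_flags = np.array([1] * 2 + [0] * 28)  # illustrative: 2 FPs among n=30 benigns

# resample the benign set 10,000 times and recompute the FPR each time
boots = rng.choice(fp_flags, size=(10_000, fp_flags.size), replace=True)
fpr_samples = boots.mean(axis=1)
lo, hi = np.percentile(fpr_samples, [2.5, 97.5])
print(f"FPR = {fp_flags.mean():.1%}, 95% bootstrap CI = [{lo:.1%}, {hi:.1%}]")
```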

## Why GRPO over PPO or SFT?

PPO needs a learned value head; on a 7B base with rubric-style rewards that is an extra ~100M parameters and a source of training instability. GRPO uses group-relative advantage (no value head), lines up with our composable rubric (AnalyzerRubricV2), and — uniquely — supports the adversarial-Scammer training loop in B.2: the Scammer is trained against a frozen analyzer, then the analyzer is retrained against a frozen Scammer. SFT can match GRPO's standalone v2 numbers on this bench (3.2 % vs 6.7 % FPR — see logs/sft_vs_grpo_comparison.json) but cannot do the co-evolution loop.
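For readers who haven't seen GRPO: the value-head-free trick is to normalize rubric rewards within a group of G completions sampled for the same prompt. A minimal sketch (illustrative, not the exact TRL implementation):

```python
# Group-relative advantage: sample G completions per prompt, score each with
# the rubric, then standardize within the group -- no learned value head.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """rewards: (num_prompts, G) rubric scores for G completions per prompt."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# one prompt, G = 4 completions scored by a composable rubric
print(group_relative_advantages(torch.tensor([[0.2, 0.9, 0.4, 0.9]])))
```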

## Why Qwen2.5-7B-Instruct as the base?

Multilingual coverage of all seven languages we target (Hindi, Tamil, Telugu, Kannada, Bengali, Marathi, and English), an Apache-2.0 license, a footprint that fits a single A100 with LoRA, and an instruction-tuned base that converges with less compute.

## You only trained one agent — isn't that just supervised classification?

No, two agents are now trained: the Analyzer (v2 LoRA on Qwen2.5-7B) and the Scammer (Phase-1 LoRA on Qwen2.5-0.5B). The Scammer was trained via TRL 0.14 GRPO against the rule-based defense and now evades it at 93.75 % best-of-8 / 100 % on held-out novel categories. Phase 2 (LoRA-vs-LoRA co-evolution) is the next move.

## How do you know v2 is not also reward-hacked?

The asymmetric improvement: v1 scored 100 % detection / 36 % FPR — the textbook reward-hacking signature. v2 scored 99.3 % detection / 6.7 % FPR. Detection essentially unchanged while FPR collapsed 5× — that is what learning the task looks like; gaming the proxy would have moved both metrics together. The full diagnosis is in the Reward-Hacking Incident and Fix section of Blog.md.

## How is your bench different from your training data?

We audited this with MiniLM-L6 cosine similarity (eval/semantic_leakage_audit.py). 44.8 % of bench items have cosine > 0.85 to the nearest training text, so the 100 % detection on the easy/medium/hard buckets is partly memorization. The v1 → v2 relative FPR fix and the scripted-baseline novel-collapse are unaffected by leakage (both are relative comparisons on the same bench). v3 closes the absolute gap with a held-out template-family retrain — see the Honest Limitations section in Blog.md.
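The core check is small enough to show inline. A minimal sketch of the nearest-neighbor cosine audit, assuming sentence-transformers MiniLM-L6 (the shipped script is eval/semantic_leakage_audit.py; the two texts below are made-up examples):

```python
# Flag bench items whose nearest training text exceeds the 0.85 cosine cutoff.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

train_texts = ["Your KYC expires today, verify now via this UPI link."]
bench_texts = ["KYC expiring today! Verify immediately with this UPI link."]

train_emb = model.encode(train_texts, normalize_embeddings=True)
bench_emb = model.encode(bench_texts, normalize_embeddings=True)

sims = bench_emb @ train_emb.T        # unit-norm, so dot product = cosine
nearest = sims.max(axis=1)            # nearest training text per bench item
print(f"leaked fraction: {(nearest > 0.85).mean():.1%}, cosines: {nearest}")
```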

## If 44.8 % of bench is high-similarity, what's your real generalization number?

The leakage-clean subset (cosine < 0.70: 38 scams + 12 benigns = 50 scenarios) is where the honest in-distribution number lives; measuring it is the B.12 v3 work. The novel post-2024 split (n=34) gives 97.1 % detection, but per the leakage audit even the novel split has mean cosine 0.79 to training, so the real OOD number is below that. The B.2 Scammer's held-out 100 % bypass on novel categories is attacker-side OOD evidence — it shows the trained Scammer learned a generalizable structure of UPI fraud rather than memorizing prompts.

## Why does this matter outside India?

The methodological contribution generalizes. The v1 → v2 reward-hacking-fix loop (FP penalty up, calibration weight up, format reward off on benign inputs) is the canonical worked example of catching reward hacking in any RLHF post-training pipeline that uses composable-rubric rewards. The two-tier oversight architecture (chat-only Analyzer + metadata-only Bank Monitor) maps cleanly to the scalable-oversight literature. Domain-Indian, methodology-universal.
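To make the three knobs concrete, here is a toy scalar rubric in the spirit of the fix; the weights and term names are illustrative, not the real AnalyzerRubricV2:

```python
# Toy v2-style rubric: heavier FP penalty, heavier calibration term, and the
# format bonus disabled on benign inputs so "always emit a scam report" stops paying.
def toy_rubric_v2(pred_scam: bool, is_scam: bool, confidence: float, well_formatted: bool) -> float:
    reward = 1.0 if pred_scam == is_scam else 0.0
    if pred_scam and not is_scam:
        reward -= 2.0                                   # FP penalty up
    reward -= 1.0 * abs(confidence - float(is_scam))    # calibration weight up
    if well_formatted and is_scam:                      # format reward off on benign
        reward += 0.25
    return reward

# a confident false positive is now strongly negative
print(toy_rubric_v2(pred_scam=True, is_scam=False, confidence=0.9, well_formatted=True))  # -2.9
```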

## Can I run this on a phone?

The merged 7B + LoRA quantized to q4_k_m runs at ~10 tok/s on a Pixel 8 (8 GB RAM). The GGUF release is at ujjwalpardeshi/chakravyuh-v2-gguf (when shipped); see serving/ for vLLM and Ollama harnesses for laptop / server / phone deployment paths.
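A minimal local-inference sketch with llama-cpp-python, assuming you have downloaded the q4_k_m GGUF (the filename below is hypothetical until the release ships):

```python
# Load the quantized analyzer and classify one message; the same GGUF is what
# a phone-side llama.cpp runtime would load.
from llama_cpp import Llama

llm = Llama(model_path="chakravyuh-v2-q4_k_m.gguf", n_ctx=4096)  # hypothetical filename

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a UPI fraud analyzer. Answer in JSON."},
        {"role": "user", "content": "Your electricity bill is overdue. Pay at this UPI link now."},
    ],
    max_tokens=256,
)
print(resp["choices"][0]["message"]["content"])
```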

## How do I reproduce your numbers?

```bash
git clone https://github.com/UjjwalPardeshi/Chakravyuh && cd Chakravyuh
uv sync                      # pinned, reproducible install from uv.lock
pytest tests/ -v             # 341 collected, 338 pass, 3 GROQ-gated skips
make smoke-test              # in-process env reset + step
make reproduce               # CHAKRAVYUH_SKIP_INFERENCE=1 for cached scores (~10 min CPU)
```

Full walk-through with expected output snippets in REPRODUCE.md.

## Production-ready?

No. Chakravyuh is a research environment, not a deployable Indian-bank fraud module. Domain adaptation, regulatory compliance (DPDPA, RBI rules), live-data evaluation, and adversary-resistant deployment are out of scope. See the Responsible Use section in MODEL_CARD.md.

## What's the most surprising finding?

The semantic-leakage audit. We assumed our substring filter was sufficient to ensure bench/training disjointness. Running the full MiniLM-L6 cosine audit revealed that 44.8 % of bench items have cosine > 0.85 to a training text. We shipped this disclosure as a feature — the discipline of measuring and publishing it is what separates research from demoware.

## How does this compare to GPT-4o / Claude / Gemini?

We ran an open-weight frontier comparison via HuggingFace Inference Providers (paid from our HF compute credits, ~$2 total across 7 models). Numbers from logs/frontier_comparison.csv β€” frontier rows are n = 175 (full bench file); v2 LoRA row is n = 174 (one row dropped on inference β€” see Blog.md):

| Model | Params | Detection | FPR | F1 |
| --- | --- | --- | --- | --- |
| Chakravyuh v2 LoRA (this work) | 7B + LoRA | 99.3 % | 6.7 % | 0.990 |
| Qwen2.5-7B-Instruct (base, no LoRA) | 7B | 100 % | 16.1 % | 0.983 |
| Llama-3.3-70B-Instruct | 70B | 99.3 % | 3.2 % | 0.993 |
| Qwen2.5-72B-Instruct | 72B | 98.6 % | 6.5 % | 0.986 |
| DeepSeek-V3-0324 | 671B MoE | 100 % | 29.0 % | 0.970 |
| gpt-oss-120b | 120B | 98.6 % | 16.1 % | 0.976 |
| gemma-3-27b-it | 27B | 100 % | 51.6 % | 0.947 |
| DeepSeek-R1 (reasoning, parser fix applied) | 671B MoE | 100 % | 12.9 % | 0.986 |
| Scripted baseline | — | 84.0 % | 9.7 % | 0.903 |

Four readouts:

1. GRPO + LoRA contribution is now isolated. Same Qwen2.5-7B base, no LoRA → FPR 16.1 % / F1 0.983. After our GRPO training → FPR 6.7 % / F1 0.990. That is −9.4 pp FPR and +0.007 F1 from the reward-engineered training alone (point estimate; Fisher's exact p = 0.42 at n_benign = 30 — directional, not yet significant at α = 0.05; the B.11 benign-corpus expansion addresses that). Source: logs/grpo_lora_significance.json. A sketch of the test follows this list.
2. Parameter efficiency — pairwise Fisher's exact vs v2 LoRA (logs/frontier_significance.json): tied with Llama-3.3-70B (p = 0.61) and Qwen2.5-72B (p = 1.00) at 10× fewer parameters; significantly beats DeepSeek-V3 (p = 0.043) and gemma-3-27B (p = 0.0002).
3. DeepSeek-V3 (671B) reproduces the v1 reward-hacking signature externally — its 100 % detection / 29.0 % FPR profile is structurally identical to our v1's 100 % / 36 %, and the FPR gap vs the calibrated LoRA is statistically significant. A frontier-class model independently lands in the failure mode our reward-engineering methodology diagnoses and fixes — external validation of the diagnostic itself. gemma-3-27b-it (FPR 51.6 %, p = 0.0002 vs LoRA) is the same story at smaller scale.
4. Open-weight frontier ≠ guaranteed scam-spotting. Five of the seven open frontier models we tested have FPR > 6.7 % on the same bench. The calibration channel — not raw capacity — is what's actually contested.
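As promised in readout 1, a minimal sketch of the Fisher's exact comparison, using false-positive counts consistent with the reported rates (2/30 for the LoRA, 5/31 for the base; the authoritative counts are in logs/grpo_lora_significance.json):

```python
# 2x2 contingency table: [false positives, correctly passed benigns] per model.
from scipy.stats import fisher_exact

lora = [2, 28]   # 2/30 benigns flagged  -> 6.7 % FPR
base = [5, 26]   # 5/31 benigns flagged  -> 16.1 % FPR
odds, p = fisher_exact([lora, base])
print(f"odds ratio = {odds:.2f}, p = {p:.2f}")  # p ≈ 0.4 with these counts: directional only
```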

Reasoning-model parser fix. DeepSeek-R1 is a chain-of-thought model that wraps its answer in <think>...</think> blocks. Our original parser demanded JSON-only output; R1's reasoning-token output didn't parse and defaulted to 0. A reasoning-aware fix shipped at eval/frontier_baseline.py:_strip_reasoning, plus a max_tokens=4096 budget for reasoning models, with 5 unit tests at tests/test_frontier_baseline.py. After the fix, R1 scores 100 % / 12.9 % / F1 = 0.986 — the table above shows the corrected numbers.
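The fix itself is conceptually tiny. An illustrative regex variant of the reasoning-stripping step (the shipped version is eval/frontier_baseline.py:_strip_reasoning and may differ in detail):

```python
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)

def strip_reasoning(text: str) -> str:
    """Drop <think>...</think> blocks so the JSON answer underneath parses."""
    return THINK_BLOCK.sub("", text).strip()

raw = '<think>Urgency cue plus payment link: scam pattern.</think>\n{"is_scam": true}'
print(strip_reasoning(raw))  # -> {"is_scam": true}
```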

Proprietary frontier models (GPT-4o / Claude / Gemini) are deferred — those APIs are not covered by HF compute credits and we did not authorize the ~$40–80 of separate spend. The script supports them given the appropriate API keys (OPENAI_API_KEY / ANTHROPIC_API_KEY / GEMINI_API_KEY); reproduction instructions are in REPRODUCE.md Step 6b.

## How does your trained Scammer compare to frontier LLMs as attackers?

Our 0.5B trained Scammer beats every untrained frontier model β€” including 671B DeepSeek-V3 β€” at evading the same scripted defense. Source: logs/scammer_frontier_comparison.csv.

| Scammer | Params | Bypass rate (vs ScriptedAnalyzer) |
| --- | --- | --- |
| Chakravyuh Scammer LoRA Phase 1 (best-of-8) | 0.5B + LoRA | 93.75 % |
| gpt-oss-120b (untrained) | 120B | 87.5 % |
| Llama-3.3-70B (untrained) | 70B | 68.8 % |
| Qwen2.5-7B base (untrained) | 7B | 62.5 % |
| Chakravyuh Scammer LoRA Phase 1 (single-shot) | 0.5B + LoRA | 59.4 % |
| Qwen2.5-72B (untrained) | 72B | 56.2 % |
| gemma-3-27b-it (untrained) | 27B | 43.8 % |
| DeepSeek-V3-0324 (untrained) | 671B MoE | 31.2 % |

Same parameter-efficiency story as the defender side: reward-engineered training at 0.5B beats raw capacity at 240×–1340× the parameter count. Two trained agents, both parameter-efficient against frontier baselines, on opposite sides of the fraud loop. DeepSeek-V3's low score is partly an artifact of safety training refusing scam roleplay — even adjusting for that, the trained 0.5B is on top.
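For clarity on the two bypass metrics: a scenario counts as a best-of-8 bypass if any of 8 sampled messages evades the defense, while single-shot uses only the first sample. A toy sketch (the detector and sampler are stand-ins, not the real ScriptedAnalyzer):

```python
# Best-of-n evaluation: one scenario bypasses if any of its n samples evades.
import random

def bypassed(samples, detector) -> bool:
    return any(not detector(msg) for msg in samples)

detector = lambda msg: "upi" in msg.lower()   # toy rule: flag anything naming UPI
pool = ["Pay the pending UPI collect request", "Courier fee due, pay at this link"]

rng = random.Random(0)
scenarios = [[rng.choice(pool) for _ in range(8)] for _ in range(64)]
single = sum(bypassed(s[:1], detector) for s in scenarios) / len(scenarios)
best8 = sum(bypassed(s, detector) for s in scenarios) / len(scenarios)
print(f"single-shot: {single:.1%}, best-of-8: {best8:.1%}")
```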

## Where can I see the live demo?

https://ujjwalpardeshi-chakravyuh.hf.space β€” /demo/ for the Gradio UI (red-team tab is the wow-moment), /openapi.json for the JSON API, POST /mcp for the JSON-RPC interface.
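A minimal probe of the live Space from Python; the endpoint paths are the ones listed above, and tools/list is the standard MCP discovery call (check /openapi.json for the authoritative surface):

```python
import requests

BASE = "https://ujjwalpardeshi-chakravyuh.hf.space"

# enumerate the JSON API surface from the OpenAPI schema
spec = requests.get(f"{BASE}/openapi.json", timeout=30).json()
print(sorted(spec["paths"]))

# JSON-RPC over POST /mcp; tools/list is standard MCP discovery
rpc = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
print(requests.post(f"{BASE}/mcp", json=rpc, timeout=30).json())
```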

## What's still open?

See the Honest Limitations section in Blog.md for the comprehensive list. Top three: (1) per-row v2 logits + leakage-clean slice (B.12), (2) held-out template-family retrain (B.7), (3) B.2 Phase 2 LoRA-vs-LoRA co-evolution.