---
name: chakravyuh-bench-v0
license: cc-by-4.0
language:
  - en
  - hi
  - ta
  - te
  - kn
  - bn
  - mr
task_categories:
  - text-classification
tags:
  - fraud-detection
  - upi
  - india
  - safety
  - benchmark
  - scam-detection
  - multilingual
  - multi-turn
size_categories:
  - n<1K
pretty_name: Chakravyuh-Bench-v0 – Indian UPI Fraud Detection
---

# Chakravyuh-Bench-v0 – Dataset Card

Canonical card (with HF Datasets front-matter) lives at `data/chakravyuh-bench-v0/README.md`. This file is the repo-level mirror so it shows up alongside `MODEL_CARD.md` and `README.md`.

## Summary

`chakravyuh-bench-v0` is the public evaluation dataset for the Chakravyuh multi-agent Indian UPI fraud-detection environment: 175 hand-curated scenarios (174 scored; one skipped for empty text), of which 144 are scams across 5 fraud categories and 31 are non-scams (15 benign + 16 borderline, included to enforce false-positive discipline).

## Composition

| Slice | n |
|---|---|
| Total scenarios | 175 |
| Scored in eval (one skipped on empty text) | 174 |
| Scams | 144 |
| Non-scams | 31 |
| Multi-turn (2–4 turns) | 15 |
| Adversarial paraphrases | 10 |
| Regional-language probes (hi/ta/te/kn/bn/mr) | 5 |

| Difficulty | n |
|---|---|
| Easy | 30 |
| Medium | 78 |
| Hard | 33 |
| Novel (post-2024) | 34 |

| Category | n |
|---|---|
| Impersonation | 37 |
| Investment fraud | 32 |
| KYC fraud | 28 |
| OTP theft | 24 |
| Loan-app fraud | 23 |
| Borderline (FP discipline) | 16 |
| Benign | 15 |

## What this dataset is for

A fixed, citable test set on which any fraud detector – rule-based, LLM zero-shot, or trained LoRA – can be evaluated against the same scenarios. Every scenario carries `split: "test"` to make this explicit. Train detectors elsewhere (e.g., on Chakravyuh env rollouts or scammer-template datasets) and evaluate here.

## Sources (all public)

  • RBI Annual Report on Financial Fraud (Trend and Progress of Banking, FY22–24)
  • NPCI Safety Bulletins β€” npci.org.in/safety-and-awareness
  • I4C (Indian Cybercrime Coordination Centre) β€” cybercrime.gov.in
  • sachet.rbi.org.in β€” reported entity database
  • News: The Hindu, Times of India, Hindustan Times, MoneyControl, Inc42 β€” fraud coverage 2022–2026

Every scenario is a realistic reconstruction of publicly documented fraud patterns. They are not verbatim transcripts; those are not publicly releasable for privacy reasons. Reconstruction methodology is in the bench README.

## Published baselines (on the bench, threshold = 0.55)

| Method | Detection | FPR | F1 | Novel subset |
|---|---|---|---|---|
| ScriptedAnalyzer (rule-based) | 70.1 % | 29.0 % | 0.795 | 50.0 % |
| Chakravyuh-Qwen2.5-LoRA v1 (reward-hacked) | 100 % | 36.0 % | 0.96 | 100 % |
| Chakravyuh-Qwen2.5-LoRA v2 | 99.3 % | 6.7 % | 0.99 | 97.1 % |
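The threshold-based metrics above follow directly from per-scenario scores and labels. A minimal sketch of that computation – the field names `score` and `is_scam` are illustrative, not the bench schema:

```python
# Sketch: detection rate, FPR and F1 at a fixed decision threshold.
# Field names (score, is_scam) are illustrative, not the bench schema.
def metrics_at_threshold(rows, threshold=0.55):
    tp = sum(1 for r in rows if r["is_scam"] and r["score"] >= threshold)
    fn = sum(1 for r in rows if r["is_scam"] and r["score"] < threshold)
    fp = sum(1 for r in rows if not r["is_scam"] and r["score"] >= threshold)
    tn = sum(1 for r in rows if not r["is_scam"] and r["score"] < threshold)
    detection = tp / (tp + fn)   # recall on the 144 scams
    fpr = fp / (fp + tn)         # false alarms on the 31 non-scams
    precision = tp / (tp + fp)
    f1 = 2 * precision * detection / (precision + detection)
    return detection, fpr, f1
```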

Bootstrap 95 % CIs from `logs/bootstrap_v2.json`: detection [97.9 %, 100 %], FPR [0 %, 16.7 %], F1 [0.976, 1.000], novel detection [91.2 %, 100 %].
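The CIs above are percentile-bootstrap intervals over per-scenario outcomes. A minimal sketch of the resampling idea (the project's actual procedure backs `logs/bootstrap_v2.json`; this only illustrates it):

```python
import random

# Sketch: percentile-bootstrap 95% CI for any per-scenario statistic,
# e.g. detection rate over 0/1 hit indicators.
def bootstrap_ci(values, stat, n_boot=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    stats = sorted(
        stat([rng.choice(values) for _ in values]) for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Usage: pass a list of 0/1 detection indicators and `stat=lambda xs: sum(xs) / len(xs)` to get a CI on the detection rate.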

Frontier zero-shot baselines: the open-weight tier shipped 2026-04-26 via Hugging Face Inference Providers (paid from HF compute credits). Full results in `logs/frontier_comparison.csv`:

- Llama-3.3-70B-Instruct – 99.3 % detection / 3.2 % FPR / F1 0.993
- Qwen2.5-72B-Instruct – 98.6 % / 6.5 % / 0.986
- DeepSeek-V3-0324 – 100 % / 29.0 % / 0.970
- Qwen2.5-7B-Instruct (base) – 100 % / 16.1 % / 0.983
- gpt-oss-120b – 98.6 % / 16.1 % / 0.976
- DeepSeek-R1 – 100 % / 12.9 % / 0.986 (with reasoning-aware parser)
- gemma-3-27b-it – 100 % / 51.6 % / 0.947

Pairwise Fisher's exact significance vs the v2 LoRA is in `logs/frontier_significance.json`. Proprietary frontier models (GPT-4o / Claude / Gemini) are deferred: those APIs are not covered by HF compute credits.

## Adversarial Scammer evaluation (B.2 Phase 1, n=64 best-of-8)

The bench is also re-used as the held-out evaluation set for an adversarial Scammer LoRA (Qwen2.5-0.5B + GRPO, paired adapter to the Analyzer LoRA above). The Scammer's adversarial bypass rate is measured per category against two defenders – the rule-based ScriptedAnalyzer and the v2 Analyzer LoRA from this dataset card:

| Defender | Best-of-8 bypass (n=64) | Held-out novel (n=32) | Backing artefact |
|---|---|---|---|
| Rule-based ScriptedAnalyzer | 93.75 % | 100 % | `logs/b2_phase1_scammer_eval_n64_bestof8.json` |
| v2 Analyzer LoRA (this card) | 32.8 % | 37.5 % | `logs/b2_phase1_scammer_vs_v2_lora.json` |

The 60.9-pp gap is the bench-derived measurement of the v2 LoRA's defensive lift over rule-based detection in an adversarial setting. The Scammer adapter (`ujjwalpardeshi/chakravyuh-scammer-lora-phase1`) is published behind an HF Hub gated-access flag, with the misuse statement on its model card.
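Assuming "best-of-8" means a scenario counts as bypassed if any of 8 sampled Scammer attempts evades the defender, the metric can be sketched as follows; `attack` and `defender` are hypothetical callables, not the project's API:

```python
# Sketch of a best-of-k bypass metric: a scenario is bypassed if ANY of
# k sampled attack attempts evades the defender. The attack/defender
# callables are assumed interfaces (defender returns True = flagged).
def best_of_k_bypass_rate(scenarios, attack, defender, k=8):
    bypassed = 0
    for scenario in scenarios:
        attempts = [attack(scenario) for _ in range(k)]
        if any(not defender(msg) for msg in attempts):
            bypassed += 1
    return bypassed / len(scenarios)
```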

## Economic-loss aggregator

Each scenario carries `metadata.loss_amount_inr` (a publicly-sourced victim-loss estimate in INR; null for benigns and borderline cases). Aggregating across the 130 scams with positive loss values gives a total ₹77.95 lakh at risk in the bench. The `eval/rupee_weighted_eval.py` script joins per-row eval logs with these amounts to produce the "₹ prevented" headline number. Backing artefact: `logs/rupee_weighted_eval.json`.
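The join that script performs can be sketched in a few lines. The row fields `id` and `detected` in the eval log are assumptions about its schema; see `eval/rupee_weighted_eval.py` for the real implementation:

```python
# Sketch of the rupee-weighted join: sum loss_amount_inr over scenarios
# the detector flagged. Eval-log field names (id, detected) are assumed.
def rupees_prevented(scenarios, eval_rows):
    detected = {r["id"] for r in eval_rows if r["detected"]}
    return sum(
        s["metadata"]["loss_amount_inr"]
        for s in scenarios
        if s["id"] in detected
        and s["metadata"]["loss_amount_inr"] is not None
    )
```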

## Label quality

  • Rule-vs-expert Cohen's ΞΊ = 0.277 β€” fair agreement (Landis-Koch). The benchmark is not a keyword-matching exercise; a trained model should meaningfully exceed ΞΊ > 0.70 to claim real detection capability.
  • True human-IRR is deferred to v0.3 β€” single-annotator ground truth in v0.2.

Reproduce with `python -m eval.agreement`.
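For reference, Cohen's κ for two binary annotators (here: rule-based labels vs expert labels) is observed agreement corrected for chance agreement. A self-contained sketch, independent of the `eval.agreement` module:

```python
from collections import Counter

# Sketch of Cohen's kappa for two label sequences of equal length:
# kappa = (p_o - p_e) / (1 - p_e), observed vs chance agreement.
def cohens_kappa(a, b):
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[c] * cb[c] for c in ca) / (n * n)     # chance agreement
    return (p_o - p_e) / (1 - p_e)
```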

## Limitations (be honest about what the bench can and can't tell you)

  • n=175 is a starting benchmark; v1.0 target is 500+.
  • Test-only. No train/val split shipped.
  • English-dominant (β‰ˆ 92 %). Regional-language coverage is a 5-scenario capability probe.
  • Class imbalance (82 % scam / 18 % non-scam). Does NOT reflect real SMS base rates (~1 % scam). Precision is structurally inflated vs production; interpret F1 / recall as comparative, not absolute.
  • Single annotator. Full human IRR deferred to v0.3.
  • Synthetic reconstruction, not victim transcripts. Frontier LLMs may have seen original source reports during pretraining; the 34 novel-post-2024 scenarios partially mitigate this but not fully.
  • No voice audio β€” voice-channel scenarios are text reconstructions.
  • No cross-market validation β€” Indian UPI only, not Pix / Zelle / GoPay.
  • Multi-turn is shallow β€” capped at 4 turns; real fraud can extend over weeks.

## Licence and citation

CC-BY-4.0. Use freely with attribution.

```bibtex
@misc{chakravyuh-bench-v0,
  title  = {Chakravyuh-Bench-v0: A benchmark for Indian UPI fraud detection},
  author = {Pardeshi, Ujjwal and Kadam, Omkar},
  year   = {2026},
  howpublished = {Meta PyTorch OpenEnv Hackathon, Bangalore, April 2026},
  url    = {https://huggingface.co/datasets/ujjwalpardeshi/chakravyuh-bench-v0},
  version = {0.2.0}
}
```

## Loading

```python
from datasets import load_dataset

ds = load_dataset("ujjwalpardeshi/chakravyuh-bench-v0", split="test")
for scenario in ds:
    print(scenario["id"], scenario["ground_truth"]["category"])
```

Or directly from the repo:

```python
import json

with open("data/chakravyuh-bench-v0/scenarios.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(f"{len(rows)} scenarios")
```
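From either loading path, the per-category counts in the Composition table can be re-derived. A sketch, assuming rows expose the `ground_truth.category` field shown in the loading example (adjust if the actual schema differs):

```python
from collections import Counter

# Re-derive Composition counts from loaded rows.
# Assumes each row has ground_truth.category, as in the loading example.
def category_counts(rows):
    return Counter(r["ground_truth"]["category"] for r in rows)
```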

## Versions

  • v0.2.0 (2026-04-21) β€” 175 scenarios, paraphrases + multi-turn + regional probes, baselines.json, rule-vs-expert ΞΊ.
  • v0.1.0 (2026-04-21) β€” 135 scenarios, single-turn only.
  • Planned v0.3.0 β€” Human-IRR on 30-scenario sample, real voice transcripts, train/val split.

## Contact

Issues, PRs, and dataset extension proposals: https://github.com/UjjwalPardeshi/Chakravyuh/issues.