license: mit
language:
  - en
  - hi
  - ta
  - te
  - kn
  - bn
  - mr
base_model: Qwen/Qwen2.5-7B-Instruct
library_name: peft
pipeline_tag: text-generation
tags:
  - lora
  - peft
  - grpo
  - trl
  - unsloth
  - fraud-detection
  - upi
  - india
  - multi-agent
  - openenv
  - scalable-oversight
datasets:
  - ujjwalpardeshi/chakravyuh-bench-v0
metrics:
  - f1
  - precision
  - recall
model-index:
  - name: chakravyuh-analyzer-lora-v2
    results:
      - task:
          type: text-classification
          name: Indian UPI Fraud Detection (Chakravyuh bench-v0)
        dataset:
          name: chakravyuh-bench-v0
          type: custom
        metrics:
          - name: Detection (recall)
            type: recall
            value: 0.993
          - name: False Positive Rate
            type: fpr
            value: 0.067
          - name: Precision
            type: precision
            value: 0.986
          - name: F1
            type: f1
            value: 0.99

# Chakravyuh Analyzer: LoRA v2

LoRA adapter for Qwen/Qwen2.5-7B-Instruct, post-trained with TRL's GRPO on the Chakravyuh multi-agent Indian UPI fraud-detection environment.

The Analyzer's job: read a multi-turn dialogue between a (scripted) Scammer and Victim and output a calibrated suspicion score plus a justified explanation, in real time, on the victim's device. This adapter is the second of the two Chakravyuh-trained adapters and the honest one; see the "v1 → v2 story" below.

## Quick numbers

Full results are in `logs/eval_v2.json` of the GitHub repo.

| Metric | v1 (reward-hacked) | v2 (this adapter) |
|---|---|---|
| Detection rate | 100.0% | 99.3% |
| False positive rate | 36.0% | 6.7% (5× better) |
| F1 | 0.96 | 0.99 |
| Bench size | 135 | 174 evaluated (175 total, 1 skipped) |
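The headline v2 metrics can be cross-checked from the confusion counts the tables imply (143/144 scams flagged, 2/30 benign mis-flagged). Note these counts are inferred from the reported rates, not read out of `logs/eval_v2.json`:

```python
# Confusion counts implied by the reported rates (an inference, not logged values):
tp, fn = 143, 1   # 143/144 scams flagged  -> 99.3% detection
fp, tn = 2, 28    # 2/30 benign flagged    -> 6.7% FPR

recall = tp / (tp + fn)
fpr = fp / (fp + tn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"recall={recall:.3f} fpr={fpr:.3f} precision={precision:.3f} f1={f1:.2f}")
# recall=0.993 fpr=0.067 precision=0.986 f1=0.99
```

The derived precision (0.986) and F1 (0.99) agree with the metadata above, so the four headline numbers are mutually consistent.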

## Per-difficulty detection (scams only, n=144)

| Difficulty | n | Detection |
|---|---|---|
| Easy | 26 | 100% |
| Medium | 66 | 100% |
| Hard | 18 | 100% |
| Novel | 34 | 97% |

The dip on novel (post-2024 attack patterns) is the small honest crack that confirms the model is not collapsing to "always flag."

## v1 → v2 story (the reason this adapter exists)

v1 hit detection = 100% / FPR = 36%, a textbook reward-hacking fingerprint. The model had learned to flag everything and then defend the over-flagging with plausible-sounding reasoning. The reward components were:

- Detection (+1 correct / −0.5 wrong)
- False-positive penalty (−0.3) ← too light
- Format reward (+0.15) ← paid even when the prediction was wrong
- Calibration (×0.3 for benign) ← too weak on the benign side
- Explanation (×0.4)

After diagnosing the hack, three principled changes were applied for v2:

1. FP penalty −0.3 → −0.8: over-flagging is now expensive
2. Format reward denied on benign-flagged-as-scam: closes the "lazy over-flag" shortcut
3. Benign calibration weight 0.3 → 0.5: stronger gradient toward score ≤ 0.2 on benign

In addition, the KL anchor was tightened (β = 0.08 → 0.15) so the model cannot drift far from the base distribution under the new reward shape.
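The reshaped reward can be sketched as follows. This is a hypothetical reconstruction of the component structure described above (the function name, signature, and the exact benign-calibration form are illustrative assumptions, and the explanation-quality rubric is omitted for brevity); the actual implementation lives in the Chakravyuh source repo:

```python
# Hypothetical sketch of the v2 reward shaping; old v1 values noted in comments.
def analyzer_reward(pred_is_scam: bool, true_is_scam: bool,
                    score: float, format_ok: bool) -> float:
    r = 0.0
    if pred_is_scam == true_is_scam:
        r += 1.0                                # correct detection
    elif true_is_scam:
        r -= 0.5                                # missed-scam penalty
    else:
        r -= 0.8                                # false positive (v1: -0.3)

    over_flagged = pred_is_scam and not true_is_scam
    if format_ok and not over_flagged:
        r += 0.15                               # format reward, now denied on
                                                # benign-flagged-as-scam (v1: always paid)
    if not true_is_scam:
        r += 0.5 * max(0.0, 0.2 - score)        # push benign scores toward <= 0.2
                                                # (v1 weight: 0.3)
    return r
```

Under this shaping, a lazy over-flag on a benign case nets −0.8 with no format reward, while a confident correct call on a scam nets +1.15, which is the asymmetry the v2 changes were designed to create.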

Result: detection barely moved (100% → 99.3%), but FPR dropped 5× (36% → 6.7%). That asymmetric improvement is the signature of a model actually learning the task instead of gaming the reward.

## Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "ujjwalpardeshi/chakravyuh-analyzer-lora-v2"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

system_prompt = (
    "You are Chakravyuh's Behavioral Analyzer, a fraud detection AI deployed "
    "on-device to monitor real-time chat for Indian UPI scam patterns. "
    "Output strict JSON with `score` in [0,1], `signals` from the taxonomy, "
    "and `explanation`."
)

user_prompt = (
    "Scammer messages:\n"
    "Urgent! Your bank account will be frozen. Share OTP to verify identity.\n\n"
    "Analyze these messages. Return strict JSON only."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=160,
        do_sample=False,  # greedy decoding; temperature is irrelevant when sampling is off
        pad_token_id=tokenizer.eos_token_id,
    )
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

Expected output (JSON):

```json
{
  "score": 0.95,
  "signals": ["urgency", "info_request", "impersonation"],
  "explanation": "Asks for OTP with urgency pressure from a self-claimed bank agent; matches OTP-theft scam pattern."
}
```
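Even though the Analyzer is trained to emit strict JSON, a defensive parse is wise at inference time. This illustrative helper (not part of the repo) extracts the first JSON object from the reply and clamps `score` into [0, 1]:

```python
import json

def parse_analyzer_output(text: str) -> dict:
    """Slice out the {...} span before decoding, in case of stray surrounding text."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model output")
    obj = json.loads(text[start:end + 1])
    # Clamp the suspicion score into [0, 1] as the schema requires.
    obj["score"] = min(1.0, max(0.0, float(obj["score"])))
    return obj

reply = '{"score": 0.95, "signals": ["urgency", "info_request", "impersonation"], "explanation": "OTP-theft pattern."}'
print(parse_analyzer_output(reply)["score"])  # 0.95
```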

## Training details

- Base model: Qwen/Qwen2.5-7B-Instruct (4-bit Unsloth quantization for training, bf16 inference)
- LoRA rank: 64
- LoRA alpha: 128
- KL anchor (β): 0.15
- Training corpus: 619 examples (456 scam + 204 benign templates, soft-leakage filtered against the test set; see `training/grpo_analyzer.py:_filter_soft_leakage`)
- Algorithm: GRPO via TRL
- Steps: 619 (1 full epoch over the corpus)
- Reward function: composable 5-rubric system (detection, FP penalty, missed-scam penalty, calibration, explanation quality)
- Hardware: single A100-80GB (Colab Pro+)

trainer_state.json (full training trajectory) is at logs/v2_trainer_state.json in the source repo.
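As a rough illustration of what soft-leakage filtering means here, one common approach drops any training template whose word n-gram overlap with the test set is too high. This sketch is hypothetical (the threshold, n-gram size, and helper names are assumptions); the actual logic is in `training/grpo_analyzer.py:_filter_soft_leakage` and may differ:

```python
# Hypothetical soft-leakage filter: drop training templates whose word
# 3-gram overlap with any test-set text exceeds a threshold.
def ngrams(text: str, n: int = 3) -> set:
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def filter_soft_leakage(train_texts, test_texts, threshold: float = 0.5):
    test_grams = set().union(*(ngrams(t) for t in test_texts))
    kept = []
    for t in train_texts:
        grams = ngrams(t)
        overlap = len(grams & test_grams) / max(1, len(grams))
        if overlap < threshold:
            kept.append(t)  # low overlap with the test set -> safe to train on
    return kept

kept = filter_soft_leakage(
    ["share your otp now to verify your bank account",
     "hello how are you doing today my friend"],
    ["please share your otp now to verify your bank account immediately"],
)
print(kept)  # the OTP template overlaps the test text heavily and is dropped
```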

## Limitations

1. Small benign sample (n = 30 evaluated; 1 of 31 in the bench skipped due to empty text). The Wilson 95% CI on FPR is approximately [1.9%, 21.3%]. We stand behind the "5× FPR reduction vs v1" claim (statistically real) but not the precise "6.7%" figure as a tight estimate.
2. Single-seed training. Multi-seed retrains are deferred to v3.
3. The bench is a proxy. 175 curated scenarios do not span real-world Indian fraud diversity. Production performance will be lower.
4. One epoch over 619 templates. More data and more epochs are deferred to v3.
5. English-dominant training. Multi-language detection numbers (Tamil, Telugu, etc.) require per-language eval, not yet measured at the time of writing.
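The Wilson interval in limitation 1 can be reproduced directly from 2 false positives out of 30 benign cases. This standalone sketch matches the quoted bounds up to rounding of the lower end:

```python
import math

def wilson_interval(k: int, n: int, z: float = 1.96):
    """Wilson score 95% confidence interval for a binomial proportion k/n."""
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

lo, hi = wilson_interval(2, 30)  # 2 false positives among 30 benign cases
print(f"[{lo:.1%}, {hi:.1%}]")  # [1.8%, 21.3%]
```

The wide interval is exactly why the card hedges on the point estimate: with n = 30, an observed 6.7% FPR is statistically compatible with anything from roughly 2% to 21%.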

See docs/RESPONSIBLE_USE.md for intended use and dual-use considerations.

## Citation

```bibtex
@software{pardeshi2026chakravyuh,
  title  = {Chakravyuh: A Multi-Agent RL Environment for Indian UPI Fraud Detection},
  author = {Pardeshi, Ujjwal},
  year   = {2026},
  url    = {https://github.com/UjjwalPardeshi/Chakravyuh}
}
```

## License

MIT; see LICENSE in the source repo.