---
license: mit
language:
- en
- hi
- ta
- te
- kn
- bn
- mr
base_model: Qwen/Qwen2.5-7B-Instruct
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- peft
- grpo
- trl
- unsloth
- fraud-detection
- upi
- india
- multi-agent
- openenv
- scalable-oversight
datasets:
- ujjwalpardeshi/chakravyuh-bench-v0
metrics:
- f1
- precision
- recall
model-index:
- name: chakravyuh-analyzer-lora-v2
  results:
  - task:
      type: text-classification
      name: Indian UPI Fraud Detection (Chakravyuh bench-v0)
    dataset:
      name: chakravyuh-bench-v0
      type: custom
    metrics:
    - name: Detection (recall)
      type: recall
      value: 0.993
    - name: False Positive Rate
      type: fpr
      value: 0.067
    - name: Precision
      type: precision
      value: 0.986
    - name: F1
      type: f1
      value: 0.99
---

# Chakravyuh Analyzer – LoRA v2

LoRA adapter for **Qwen/Qwen2.5-7B-Instruct**, post-trained with TRL's GRPO on the [Chakravyuh](https://github.com/UjjwalPardeshi/Chakravyuh) multi-agent Indian UPI fraud-detection environment.

The Analyzer's job: read a multi-turn dialogue between a (scripted) Scammer and Victim and output a calibrated suspicion score plus a justified explanation, in real time, on the victim's device. This adapter is **v2 of the two** Chakravyuh-trained adapters and is the **honest one**; see the "v1 → v2 story" below.

## Quick numbers (full results in `logs/eval_v2.json` of the GitHub repo)

| Metric | v1 (reward-hacked) | **v2 (this adapter)** |
|---|---|---|
| Detection rate | 100.0% | **99.3%** |
| False positive rate | 36.0% | **6.7%** (5× better) |
| F1 | 0.96 | **0.99** |
| Bench size | 135 | 174 evaluated (175 total, 1 skipped) |
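
The F1 follows directly from the precision and recall reported in the metadata: F1 = 2 × 0.986 × 0.993 / (0.986 + 0.993) ≈ 0.99.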

### Per-difficulty detection (scams only, n=144)

| Difficulty | n | Detection |
|---|---|---|
| Easy | 26 | 100% |
| Medium | 66 | 100% |
| Hard | 18 | 100% |
| Novel | 34 | 97% |
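
Summed across rows, this recovers the headline detection rate: 26 + 66 + 18 + 33 correctly flagged out of 144 scams, i.e. 143/144 ≈ 99.3%.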

The dip on `novel` (post-2024 attack patterns) is the small honest crack that confirms the model is not collapsing to "always flag."

## v1 → v2 story (the reason this adapter exists)

v1 hit `detection=100% / FPR=36%`, a textbook **reward-hacking fingerprint**: the model had learned to flag *everything* and then defend the over-flagging with plausible-sounding reasoning. The v1 reward components were:

- Detection (+1 correct / −0.5 wrong)
- False-positive penalty (−0.3): too light
- Format reward (+0.15): paid even when the prediction was wrong
- Calibration (×0.3 for benign): too weak on the benign side
- Explanation (×0.4)

After diagnosing the hack, three principled changes were applied for v2 (sketched in code below):

1. **FP penalty −0.3 → −0.8**: over-flagging is now expensive
2. **Format reward DENIED on benign-flagged-as-scam**: closes the "lazy over-flag" shortcut
3. **Benign calibration weight 0.3 → 0.5**: stronger gradient toward score ≤ 0.2 on benign

In addition, the KL anchor was tightened (`β = 0.08 → 0.15`) so the model can't drift far from the base distribution under the new reward shape.

Result: detection barely moved (100% → 99.3%), but FPR dropped 5× (36% → 6.7%). That asymmetric improvement is the signature of a model actually learning the task instead of gaming the reward.
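
The exact implementation lives in `training/grpo_analyzer.py` in the GitHub repo; what follows is only a minimal sketch of the v2 shaping. The weights come from this card, while the missed-scam weight and the benign calibration curve are assumptions, flagged in the comments.

```python
# Minimal sketch of the v2 reward shaping; NOT the repo's implementation.
# The missed-scam weight and the benign calibration curve are ASSUMED here.
MISSED_SCAM_PENALTY = 1.0  # assumed: the card lists this rubric but not its weight

def v2_reward(pred_scam: bool, is_scam: bool, score: float,
              format_ok: bool, explanation_quality: float) -> float:
    r = 1.0 if pred_scam == is_scam else -0.5    # detection: +1 correct / -0.5 wrong
    if pred_scam and not is_scam:
        r -= 0.8                                 # change 1: FP penalty -0.3 -> -0.8
    elif is_scam and not pred_scam:
        r -= MISSED_SCAM_PENALTY                 # missed-scam rubric
    if format_ok and not (pred_scam and not is_scam):
        r += 0.15                                # change 2: format reward denied
                                                 # on benign-flagged-as-scam
    if not is_scam:
        calib = max(0.0, 1.0 - score)            # assumed shape: peaks as score -> 0
        r += 0.5 * calib                         # change 3: benign weight 0.3 -> 0.5
    r += 0.4 * explanation_quality               # explanation rubric (x0.4)
    return r
```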

## Quick start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "Qwen/Qwen2.5-7B-Instruct"
adapter_id = "ujjwalpardeshi/chakravyuh-analyzer-lora-v2"

tokenizer = AutoTokenizer.from_pretrained(base_model_id)
base = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, adapter_id)
model.eval()

system_prompt = (
    "You are Chakravyuh's Behavioral Analyzer, a fraud detection AI deployed "
    "on-device to monitor real-time chat for Indian UPI scam patterns. "
    "Output strict JSON with `score` in [0,1], `signals` from the taxonomy, "
    "and `explanation`."
)

user_prompt = (
    "Scammer messages:\n"
    "Urgent! Your bank account will be frozen. Share OTP to verify identity.\n\n"
    "Analyze these messages. Return strict JSON only."
)

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=160,
        do_sample=False,  # greedy decoding; temperature is ignored when sampling is off
        pad_token_id=tokenizer.eos_token_id,
    )
response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

Expected output (JSON):

```json
{
  "score": 0.95,
  "signals": ["urgency", "info_request", "impersonation"],
  "explanation": "Asks for OTP with urgency pressure from a self-claimed bank agent; matches OTP-theft scam pattern."
}
```
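
The adapter targets strict JSON, but downstream code should still parse defensively, since generation can wrap the object in stray text. A minimal sketch using the `response` string from the quick start; `parse_analyzer_output` is a helper invented here, not part of the repo:

```python
import json
import re

def parse_analyzer_output(response: str) -> dict:
    """Pull the first {...} block out of the model response and validate it."""
    match = re.search(r"\{.*\}", response, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in model output")
    verdict = json.loads(match.group(0))
    # Clamp the suspicion score into [0, 1], as the schema promises.
    verdict["score"] = min(1.0, max(0.0, float(verdict["score"])))
    return verdict

verdict = parse_analyzer_output(response)
print(verdict["score"], verdict["signals"])
```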

## Training details

- **Base model:** Qwen/Qwen2.5-7B-Instruct (4-bit Unsloth quantization for training, bf16 inference)
- **LoRA rank:** 64
- **LoRA alpha:** 128
- **KL anchor (β):** 0.15
- **Training corpus:** 619 examples (456 scam + 204 benign templates, soft-leakage filtered against the test set; see `training/grpo_analyzer.py:_filter_soft_leakage`)
- **Algorithm:** GRPO via TRL
- **Steps:** 619 (1 full epoch over the corpus)
- **Reward function:** Composable 5-rubric system (detection, FP penalty, missed-scam penalty, calibration, explanation quality)
- **Hardware:** Single A100-80GB (Colab Pro+)

`trainer_state.json` (full training trajectory) is at [logs/v2_trainer_state.json](https://github.com/UjjwalPardeshi/Chakravyuh/blob/main/logs/v2_trainer_state.json) in the source repo.
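
The real training script is `training/grpo_analyzer.py`; for orientation only, a TRL + PEFT configuration consistent with the hyperparameters above could look like the sketch below. The dataset row, reward function, target-module list, and batch settings are placeholders, and the Unsloth 4-bit setup is omitted.

```python
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Placeholder corpus and reward so the sketch runs standalone; the real
# 619-template corpus and 5-rubric reward live in the Chakravyuh repo.
train_dataset = Dataset.from_list([{"prompt": "Analyze these messages..."}])

def placeholder_reward(completions, **kwargs):
    # GRPO reward functions return one float per sampled completion.
    return [0.0 for _ in completions]

peft_config = LoraConfig(
    r=64,            # LoRA rank from the card
    lora_alpha=128,  # LoRA alpha from the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed subset
    task_type="CAUSAL_LM",
)

args = GRPOConfig(
    output_dir="chakravyuh-analyzer-lora-v2",
    beta=0.15,           # TRL's KL coefficient: the card's tightened anchor
    num_train_epochs=1,  # one full epoch over the corpus
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B-Instruct",
    reward_funcs=placeholder_reward,
    args=args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```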

## Limitations

1. **Small benign sample (n=30 evaluated; 1 of 31 in the bench skipped due to empty text).** The Wilson 95% CI on FPR is approximately [1.9%, 21.3%] (see the check below). We stand behind the "5× FPR reduction vs v1" claim (statistically real) but not the precise "6.7%" figure as a tight estimate.
2. **Single-seed training.** Multi-seed retrains are deferred to v3.
3. **Bench is a proxy.** 175 curated scenarios do not span real-world Indian fraud diversity. Production performance will be lower.
4. **One epoch over 619 templates.** More data and more epochs are deferred to v3.
5. **English-dominant training.** Multi-language detection numbers (Tamil, Telugu, etc.) require per-language eval; not yet measured at the time of writing.
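
The interval in point 1 is a standard Wilson score interval: an FPR of 6.7% on n=30 corresponds to 2 false positives. A quick check:

```python
from math import sqrt

def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% at z=1.96)."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

lo, hi = wilson_interval(2, 30)  # 2 false positives out of 30 benign cases
print(f"FPR 95% CI: [{lo:.1%}, {hi:.1%}]")  # -> [1.9%, 21.3%]
```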

See [docs/RESPONSIBLE_USE.md](https://github.com/UjjwalPardeshi/Chakravyuh/blob/main/docs/RESPONSIBLE_USE.md) for intended use and dual-use considerations.

## Links

- **GitHub:** <https://github.com/UjjwalPardeshi/Chakravyuh>
- **OpenEnv Space (live env):** <https://huggingface.co/spaces/ujjwalpardeshi/chakravyuh>
- **Bench dataset:** <https://huggingface.co/datasets/ujjwalpardeshi/chakravyuh-bench-v0> (release pending)
- **Hackathon:** Meta PyTorch OpenEnv Hackathon 2026, Bangalore

## Citation

```bibtex
@software{pardeshi2026chakravyuh,
  title  = {Chakravyuh: A Multi-Agent RL Environment for Indian UPI Fraud Detection},
  author = {Pardeshi, Ujjwal},
  year   = {2026},
  url    = {https://github.com/UjjwalPardeshi/Chakravyuh}
}
```

## License

MIT; see [LICENSE](https://github.com/UjjwalPardeshi/Chakravyuh/blob/main/LICENSE) in the source repo.