ujjwalpardeshi committed
Commit c4e8f35 · verified · 1 Parent(s): 2da879a

docs: replace auto-generated model card with Chakravyuh-specific one

Files changed (1):
  1. README.md +182 -48
README.md CHANGED
@@ -1,73 +1,207 @@
  ---
- base_model: unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit
  library_name: peft
- model_name: analyzer_lora_v2
- tags:
- - base_model:adapter:unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit
- - grpo
- - lora
- - transformers
- - trl
- - unsloth
- licence: license
  pipeline_tag: text-generation
  ---

- # Model Card for analyzer_lora_v2

- This model is a fine-tuned version of [unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit](https://huggingface.co/unsloth/qwen2.5-7b-instruct-unsloth-bnb-4bit).
- It has been trained using [TRL](https://github.com/huggingface/trl).

  ## Quick start

  ```python
- from transformers import pipeline

- question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
- generator = pipeline("text-generation", model="None", device="cuda")
- output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
- print(output["generated_text"])
  ```

- ## Training procedure
-
- This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).

- ### Framework versions

- - PEFT 0.18.1
- - TRL: 0.24.0
- - Transformers: 5.5.0
- - Pytorch: 2.10.0
- - Datasets: 4.3.0
- - Tokenizers: 0.22.2

- ## Citations

- Cite GRPO as:

  ```bibtex
- @article{shao2024deepseekmath,
-     title = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
-     author = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
-     year = 2024,
-     eprint = {arXiv:2402.03300},
  }
-
  ```

- Cite TRL as:
-
- ```bibtex
- @misc{vonwerra2022trl,
-     title = {{TRL: Transformer Reinforcement Learning}},
-     author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
-     year = 2020,
-     journal = {GitHub repository},
-     publisher = {GitHub},
-     howpublished = {\url{https://github.com/huggingface/trl}}
- }
- ```
 
  ---
+ license: mit
+ language:
+ - en
+ - hi
+ - ta
+ - te
+ - kn
+ - bn
+ - mr
+ base_model: Qwen/Qwen2.5-7B-Instruct
  library_name: peft
  pipeline_tag: text-generation
+ tags:
+ - lora
+ - peft
+ - grpo
+ - trl
+ - unsloth
+ - fraud-detection
+ - upi
+ - india
+ - multi-agent
+ - openenv
+ - scalable-oversight
+ datasets:
+ - ujjwalpardeshi/chakravyuh-bench-v0
+ metrics:
+ - f1
+ - precision
+ - recall
+ model-index:
+ - name: chakravyuh-analyzer-lora-v2
+   results:
+   - task:
+       type: text-classification
+       name: Indian UPI Fraud Detection (Chakravyuh bench-v0)
+     dataset:
+       name: chakravyuh-bench-v0
+       type: custom
+     metrics:
+     - name: Detection (recall)
+       type: recall
+       value: 0.993
+     - name: False Positive Rate
+       type: fpr
+       value: 0.067
+     - name: Precision
+       type: precision
+       value: 0.986
+     - name: F1
+       type: f1
+       value: 0.99
  ---

+ # Chakravyuh Analyzer: LoRA v2
+
+ LoRA adapter for **Qwen/Qwen2.5-7B-Instruct**, post-trained with TRL's GRPO on the [Chakravyuh](https://github.com/UjjwalPardeshi/Chakravyuh) multi-agent Indian UPI fraud-detection environment.
+
+ The Analyzer's job: read a multi-turn dialogue between a (scripted) Scammer and Victim and output a calibrated suspicion score plus a justified explanation, in real time, on the victim's device. This adapter is **the second of two** Chakravyuh-trained adapters and the **honest one**; see the "v1 → v2 story" below.
+
+ ## Quick numbers (full results in `logs/eval_v2.json` of the GitHub repo)
+
+ | Metric | v1 (reward-hacked) | **v2 (this adapter)** |
+ |---|---|---|
+ | Detection rate | 100.0% | **99.3%** |
+ | False positive rate | 36.0% | **6.7%** (5× better) |
+ | F1 | 0.96 | **0.99** |
+ | Bench size | 135 | 174 evaluated (175 total, 1 skipped) |
+
+ ### Per-difficulty detection (scams only, n=144)
+
+ | Difficulty | n | Detection |
+ |---|---|---|
+ | Easy | 26 | 100% |
+ | Medium | 66 | 100% |
+ | Hard | 18 | 100% |
+ | Novel | 34 | 97% |
+
+ The dip on `novel` (post-2024 attack patterns) is the small honest crack confirming that the model is not collapsing to "always flag."
+
+ ## v1 → v2 story (the reason this adapter exists)
+
+ v1 hit `detection=100% / FPR=36%`, a textbook **reward-hacking fingerprint**: the model had learned to flag *everything* and then defend the over-flagging with plausible-sounding reasoning. The v1 reward components were:
+
+ - Detection (+1 correct / −0.5 wrong)
+ - False-positive penalty (−0.3) ← too light
+ - Format reward (+0.15) ← paid even when the prediction was wrong
+ - Calibration (×0.3 for benign) ← too weak on the benign side
+ - Explanation (×0.4)
+
+ After diagnosing the hack, three principled changes were applied for v2 (sketched in code below):
+
+ 1. **FP penalty −0.3 → −0.8**: over-flagging is now expensive
+ 2. **Format reward DENIED on benign-flagged-as-scam**: closes the "lazy over-flag" shortcut
+ 3. **Benign calibration weight 0.3 → 0.5**: a stronger gradient toward score ≤ 0.2 on benign chats
+
+ In addition, the KL anchor was tightened (`β = 0.08 → 0.15`) so the model cannot drift far from the base distribution under the new reward shape.
+
+ Result: detection barely moved (100% → 99.3%), but FPR dropped 5× (36% → 6.7%). That asymmetric improvement is the signature of a model actually learning the task instead of gaming the reward.
 
102
  ## Quick start
103
 
104
  ```python
105
+ from transformers import AutoTokenizer, AutoModelForCausalLM
106
+ from peft import PeftModel
107
+ import torch
108
+
109
+ base_model_id = "Qwen/Qwen2.5-7B-Instruct"
110
+ adapter_id = "ujjwalpardeshi/chakravyuh-analyzer-lora-v2"
111
+
112
+ tokenizer = AutoTokenizer.from_pretrained(base_model_id)
113
+ base = AutoModelForCausalLM.from_pretrained(
114
+ base_model_id,
115
+ torch_dtype=torch.bfloat16,
116
+ device_map="auto",
117
+ )
118
+ model = PeftModel.from_pretrained(base, adapter_id)
119
+ model.eval()
120
+
121
+ system_prompt = (
122
+ "You are Chakravyuh's Behavioral Analyzer, a fraud detection AI deployed "
123
+ "on-device to monitor real-time chat for Indian UPI scam patterns. "
124
+ "Output strict JSON with `score` in [0,1], `signals` from the taxonomy, "
125
+ "and `explanation`."
126
+ )
127
+
128
+ user_prompt = (
129
+ "Scammer messages:\n"
130
+ "Urgent! Your bank account will be frozen. Share OTP to verify identity.\n\n"
131
+ "Analyze these messages. Return strict JSON only."
132
+ )
133
+
134
+ messages = [
135
+ {"role": "system", "content": system_prompt},
136
+ {"role": "user", "content": user_prompt},
137
+ ]
138
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
139
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
140
+
141
+ with torch.no_grad():
142
+ out = model.generate(
143
+ **inputs,
144
+ max_new_tokens=160,
145
+ do_sample=False,
146
+ temperature=0.0,
147
+ pad_token_id=tokenizer.eos_token_id,
148
+ )
149
+ response = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
150
+ print(response)
151
+ ```
152
+
153
+ Expected output (JSON):
154
 
155
+ ```json
156
+ {
157
+ "score": 0.95,
158
+ "signals": ["urgency", "info_request", "impersonation"],
159
+ "explanation": "Asks for OTP with urgency pressure from a self-claimed bank agent; matches OTP-theft scam pattern."
160
+ }
161
  ```
162
 
163
+ ## Training details
164
 
165
+ - **Base model:** Qwen/Qwen2.5-7B-Instruct (4-bit Unsloth quantization for training, bf16 inference)
166
+ - **LoRA rank:** 64
167
+ - **LoRA alpha:** 128
168
+ - **KL anchor (Ξ²):** 0.15
169
+ - **Training corpus:** 619 examples (456 scam + 204 benign templates, soft-leakage filtered against the test set; see `training/grpo_analyzer.py:_filter_soft_leakage`)
170
+ - **Algorithm:** GRPO via TRL
171
+ - **Steps:** 619 (1 full epoch over the corpus)
172
+ - **Reward function:** Composable 5-rubric system (detection, FP penalty, missed-scam penalty, calibration, explanation quality)
173
+ - **Hardware:** Single A100-80GB (Colab Pro+)
174
 
175
+ `trainer_state.json` (full training trajectory) is at [logs/v2_trainer_state.json](https://github.com/UjjwalPardeshi/Chakravyuh/blob/main/logs/v2_trainer_state.json) in the source repo.
176
 
177
+ ## Limitations
178
 
179
+ 1. **Small benign sample (n=30 evaluated, 1 of 31 in bench skipped due to empty text).** Wilson 95% CI on FPR is approximately [1.9%, 21.3%]. We stand behind the "5Γ— FPR reduction vs v1" claim (statistically real) but not the precise "6.7%" figure as a tight estimate.
180
+ 2. **Single-seed training.** Multi-seed retrains are deferred to v3.
181
+ 3. **Bench is a proxy.** 175 curated scenarios do not span real-world Indian fraud diversity. Production performance will be lower.
182
+ 4. **One epoch over 619 templates.** More data + more epochs are deferred to v3.
183
+ 5. **English-dominant training.** Multi-language detection numbers (Tamil, Telugu, etc.) require per-language eval β€” not yet measured at the time of writing.
184
 
185
+ See [docs/RESPONSIBLE_USE.md](https://github.com/UjjwalPardeshi/Chakravyuh/blob/main/docs/RESPONSIBLE_USE.md) for intended use and dual-use considerations.
 
 
 
 
 
186
 
187
+ ## Links
188
 
189
+ - **GitHub:** <https://github.com/UjjwalPardeshi/Chakravyuh>
190
+ - **OpenEnv Space (live env):** <https://huggingface.co/spaces/ujjwalpardeshi/chakravyuh>
191
+ - **Bench dataset:** <https://huggingface.co/datasets/ujjwalpardeshi/chakravyuh-bench-v0> (release pending)
192
+ - **Hackathon:** Meta PyTorch OpenEnv Hackathon 2026, Bangalore
193
+
194
+ ## Citation
195
 
196
  ```bibtex
197
+ @software{pardeshi2026chakravyuh,
198
+ title = {Chakravyuh: A Multi-Agent RL Environment for Indian UPI Fraud Detection},
199
+ author = {Pardeshi, Ujjwal},
200
+ year = {2026},
201
+ url = {https://github.com/UjjwalPardeshi/Chakravyuh}
202
  }
 
203
  ```
204
 
205
+ ## License
206
+
207
+ MIT β€” see [LICENSE](https://github.com/UjjwalPardeshi/Chakravyuh/blob/main/LICENSE) in the source repo.