# GrandgemMa — Gemma 4 Scam Detection Eval & Fine-Tune Kit

> **Goal:** Zero-shot test `google/gemma-4-E2B-it` (2B text) on scam-call transcripts.
> If accuracy < 90 % or F1(SCAM) < 85 % → fine-tune with Unsloth 4-bit LoRA.

## Quick Links

| Artifact | Path |
|---|---|
| Zero-shot eval script | `eval_zero_shot.py` |
| Unsloth SFT trainer | `train_sft_unsloth.py` |
| Dataset formatter | `format_dataset.py` |
| Decision rubric | section 2, below |
| Reference code sketches | section 6, below |

## Datasets Used

- **Primary:** [`BothBosu/scam-dialogue`](https://huggingface.co/datasets/BothBosu/scam-dialogue) — 800+ synthetic scam/legit call transcripts with `dialogue` and `label` fields (1 = SCAM, 0 = LEGIT).
- **Secondary (optional):** [`BothBosu/Scammer-Conversation`](https://huggingface.co/datasets/BothBosu/Scammer-Conversation) — additional mixed conversations.

## 1. Zero-Shot Evaluation

```bash
# Quick smoke test (100 rows)
python eval_zero_shot.py --limit 100

# Full test split (~400 rows)
python eval_zero_shot.py --limit -1

# CPU-only fallback
python eval_zero_shot.py --device cpu --dtype fp32 --limit 20
```

**Output:** `results_zero_shot.json` plus a console report with accuracy / precision / recall / F1 / confusion matrix. (A minimal sketch of the classification loop is in section 6.1.)

## 2. Decision Rubric

| Condition | Verdict | Action |
|---|---|---|
| Accuracy ≥ 90 % **and** F1(SCAM) ≥ 85 % | ✅ PASS | Base model is strong enough. Fine-tuning optional. |
| Accuracy 75–89 % **or** F1(SCAM) 70–84 % | ⚠️ MARGINAL | **Fine-tune recommended.** Expected uplift +5–15 pp. |
| Accuracy < 75 % **or** F1(SCAM) < 70 % | ❌ FAIL | **Fine-tune REQUIRED.** Run `train_sft_unsloth.py`. |

If a run matches both the MARGINAL and FAIL rows (e.g. accuracy 80 % but F1(SCAM) 60 %), treat it as FAIL. Section 6.2 sketches the rubric as code.

> **Why these thresholds?** For elder-scam defense, missing a scam (a false negative) is catastrophic. High recall on the SCAM class is mandatory.

## 3. Fine-Tuning (Unsloth SFT)

```bash
# Install deps
pip install unsloth transformers datasets trl peft accelerate

# Train & push to HF Hub
python train_sft_unsloth.py \
  --output grandgemma-scam-sft \
  --push_to_hub s23deepak/grandgemma-scam-sft
```

**Hardware:** Kaggle T4×2 (free) or any single GPU with ≥ 16 GB VRAM.
**Config:** 4-bit + LoRA r=16, 3 epochs, lr=2e-4, batch=2, grad_accum=4 (sketched in section 6.4; the dataset formatting step is sketched in section 6.3).
**Expected time:** ~3–5 min / epoch on T4×2.

## 4. Re-Eval After Fine-Tuning

After training, run the same `eval_zero_shot.py` but point `--model` at your fine-tuned checkpoint:

```bash
python eval_zero_shot.py \
  --model s23deepak/grandgemma-scam-sft \
  --limit -1
```

Compare the deltas in `accuracy`, `recall_scam`, and `f1_scam` (see the comparison sketch in section 6.5).

## 5. Team Notes

- This repo is **evaluation-only** — no app code. App code lives in your monorepo (`/android`, `/ios`, `/extensions`, `/portal`).
- Fine-tuned weights produced here should be quantized to `.litertlm` for on-device Android inference (Stream A) and converted for iOS/browser WebGPU (Stream B).
- Track all runs in a spreadsheet with columns: `run_id | model | dataset | accuracy | f1_scam | notes`.
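## 6. Appendix: Reference Sketches

These sketches are illustrative only: plausible readings of the scripts named above, not their actual contents. Any file name, split name, prompt, or parameter that does not appear earlier in this README is an assumption and is flagged in the comments.

### 6.1 Zero-shot classification loop

A minimal sketch of what `eval_zero_shot.py` could do: prompt the model to answer SCAM or LEGIT for each transcript and parse the reply. The prompt wording and the `train` split are assumptions; the model ID and the `dialogue`/`label` fields come from this README.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-4-E2B-it"  # model ID as given in this README

PROMPT = (  # prompt wording is an assumption, not the repo's actual prompt
    "You are a scam-call screener. Read the call transcript and answer "
    "with exactly one word: SCAM or LEGIT.\n\nTranscript:\n{dialogue}"
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify(dialogue: str) -> int:
    """Return 1 (SCAM) or 0 (LEGIT) from a short greedy generation."""
    input_ids = tokenizer.apply_chat_template(
        [{"role": "user", "content": PROMPT.format(dialogue=dialogue)}],
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)
    out = model.generate(input_ids, max_new_tokens=5, do_sample=False)
    reply = tokenizer.decode(out[0, input_ids.shape[-1]:], skip_special_tokens=True)
    # Crude parse; the real script presumably normalizes the reply more carefully.
    return 1 if "SCAM" in reply.upper() else 0

# Split name is an assumption; the README's full eval targets the test split.
ds = load_dataset("BothBosu/scam-dialogue", split="train").select(range(100))
preds = [classify(row["dialogue"]) for row in ds]
labels = [row["label"] for row in ds]
print("smoke-test accuracy:", sum(p == y for p, y in zip(preds, labels)) / len(labels))
```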
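### 6.2 Metrics and rubric

A sketch of the section 2 rubric as code. It assumes `results_zero_shot.json` stores parallel `labels` and `preds` lists; the real file layout may differ. The FAIL row is checked first, so a run with one metric below the floor is never upgraded to MARGINAL.

```python
import json
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Assumed layout: {"labels": [...], "preds": [...]}
with open("results_zero_shot.json") as f:
    res = json.load(f)

y_true, y_pred = res["labels"], res["preds"]
acc = accuracy_score(y_true, y_pred)
f1_scam = f1_score(y_true, y_pred, pos_label=1)  # SCAM (label 1) is the positive class
print(confusion_matrix(y_true, y_pred))

# Rubric from section 2, strictest row first.
if acc < 0.75 or f1_scam < 0.70:
    verdict = "FAIL: fine-tune required"
elif acc >= 0.90 and f1_scam >= 0.85:
    verdict = "PASS: fine-tuning optional"
else:
    verdict = "MARGINAL: fine-tune recommended"
print(f"acc={acc:.3f} f1_scam={f1_scam:.3f} -> {verdict}")
```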
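### 6.3 Dataset formatting

A sketch of what `format_dataset.py` plausibly produces: chat-style SFT examples built from the `dialogue`/`label` rows. The output filename and message schema are assumptions.

```python
from datasets import load_dataset

INSTRUCTION = (  # should match the eval prompt so train and test agree
    "You are a scam-call screener. Read the call transcript and answer "
    "with exactly one word: SCAM or LEGIT."
)

def to_chat(row: dict) -> dict:
    """Map one (dialogue, label) row to a two-turn chat example."""
    answer = "SCAM" if row["label"] == 1 else "LEGIT"
    return {
        "messages": [
            {"role": "user",
             "content": f"{INSTRUCTION}\n\nTranscript:\n{row['dialogue']}"},
            {"role": "assistant", "content": answer},
        ]
    }

ds = load_dataset("BothBosu/scam-dialogue", split="train")
ds = ds.map(to_chat, remove_columns=ds.column_names)
ds.to_json("scam_sft.jsonl")  # output filename is an assumption
```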
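### 6.4 Unsloth SFT setup

A sketch of the section 3 config (4-bit base, LoRA r=16, 3 epochs, lr 2e-4, batch 2, grad-accum 4). Exact Unsloth/TRL arguments vary by version; the target-module list, `lora_alpha`, and max sequence length are assumptions.

```python
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-E2B-it",  # model ID as given in this README
    max_seq_length=2048,                 # assumption
    load_in_4bit=True,                   # 4-bit base, per section 3
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,            # LoRA rank from section 3
    lora_alpha=16,   # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],  # common default
)

# Output of the section 6.3 formatting step (filename assumed there).
train_ds = load_dataset("json", data_files="scam_sft.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_ds,  # "messages" format; recent TRL applies the chat template
    args=SFTConfig(
        output_dir="grandgemma-scam-sft",
        num_train_epochs=3,
        learning_rate=2e-4,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
    ),
)
trainer.train()
```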
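### 6.5 Before/after comparison

A small sketch of the section 4 delta check. It assumes both runs write flat metric keys (`accuracy`, `recall_scam`, `f1_scam`, the names used in section 4) and that the fine-tuned run is saved under a separate file; `results_finetuned.json` is a hypothetical filename.

```python
import json

with open("results_zero_shot.json") as f:
    base = json.load(f)
with open("results_finetuned.json") as f:  # hypothetical filename
    tuned = json.load(f)

for key in ("accuracy", "recall_scam", "f1_scam"):
    delta = tuned[key] - base[key]
    print(f"{key}: {base[key]:.3f} -> {tuned[key]:.3f} ({delta:+.3f})")
```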