# GrandGemma: Gemma 4 Scam Detection Eval & Fine-Tune Kit
|
|
> **Goal:** Zero-shot test `google/gemma-4-E2B-it` (2B text) on real scam-call transcripts.
> If accuracy < 90 % or F1(SCAM) < 85 %, fine-tune with Unsloth 4-bit LoRA.
|
|
## Quick Links

| Artifact | Path |
|---|---|
| Zero-shot eval script | `eval_zero_shot.py` |
| Unsloth SFT trainer | `train_sft_unsloth.py` |
| Dataset formatter | `format_dataset.py` |
| Decision rubric | below |
|
|
## Datasets Used

- **Primary:** [`BothBosu/scam-dialogue`](https://huggingface.co/datasets/BothBosu/scam-dialogue): 800+ synthetic scam/legit call transcripts, each with a `dialogue` field and a binary `label` (1 = SCAM, 0 = LEGIT).
- **Secondary (optional):** [`BothBosu/Scammer-Conversation`](https://huggingface.co/datasets/BothBosu/Scammer-Conversation): extra mixed conversations.
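Before either zero-shot prompting or SFT, each `dialogue`/`label` row has to be rendered into prompt/completion text. A minimal sketch of that mapping, assuming a one-word-answer classification prompt; the template wording and the `format_example` helper are illustrative, not the actual contents of `format_dataset.py`:

```python
# Hypothetical formatter for a BothBosu/scam-dialogue row
# ({"dialogue": str, "label": int}). The prompt wording below is an
# assumption, not the repo's actual template.

PROMPT_TEMPLATE = (
    "Classify the following phone-call transcript as SCAM or LEGIT.\n\n"
    "Transcript:\n{dialogue}\n\n"
    "Answer with one word: SCAM or LEGIT."
)

def format_example(row: dict) -> dict:
    """Turn one dataset row into a prompt/completion training pair."""
    label_text = "SCAM" if row["label"] == 1 else "LEGIT"
    return {
        "prompt": PROMPT_TEMPLATE.format(dialogue=row["dialogue"]),
        "completion": label_text,
    }
```

In practice this would be applied to the whole split with `datasets.Dataset.map(format_example)`.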
|
|
## 1. Zero-Shot Evaluation

```bash
# Quick smoke test (100 rows)
python eval_zero_shot.py --limit 100

# Full test split (~400 rows)
python eval_zero_shot.py --limit -1

# CPU-only fallback
python eval_zero_shot.py --device cpu --dtype fp32 --limit 20
```
|
|
**Output:** `results_zero_shot.json` + console report with accuracy / precision / recall / F1 / confusion matrix.
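For reference, every reported metric can be derived from predictions and gold labels with no extra dependencies. A dependency-free sketch (the actual script may well use scikit-learn instead; the function name is hypothetical):

```python
def scam_metrics(preds: list[int], labels: list[int]) -> dict:
    """Accuracy, plus precision/recall/F1 for the SCAM (=1) class,
    and the 2x2 confusion-matrix counts."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    tn = sum(p == 0 and y == 0 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(labels),
        "precision_scam": precision,
        "recall_scam": recall,
        "f1_scam": f1,
        "confusion": {"tp": tp, "fp": fp, "fn": fn, "tn": tn},
    }
```

Precision/recall/F1 are computed for the positive (SCAM) class only, which matches the rubric's focus on `f1_scam`.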
|
|
## 2. Decision Rubric

| Condition | Verdict | Action |
|---|---|---|
| Accuracy ≥ 90 % **and** F1(SCAM) ≥ 85 % | ✅ PASS | Base model is strong enough. Fine-tuning optional. |
| Accuracy 75–89 % **or** F1(SCAM) 70–84 % | ⚠️ MARGINAL | **Fine-tune recommended.** Expected uplift +5–15 pp. |
| Accuracy < 75 % **or** F1(SCAM) < 70 % | ❌ FAIL | **Fine-tune REQUIRED.** Run `train_sft_unsloth.py`. |
|
|
> **Why these thresholds?** For elder-scam defense, missing a scam (a false negative) is catastrophic, so high recall on the SCAM class is mandatory.
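The rubric is mechanical, so it can be encoded directly. A sketch with the thresholds taken from the table above (the function name is hypothetical; metrics are fractions, e.g. 0.91 for 91 %):

```python
def verdict(accuracy: float, f1_scam: float) -> str:
    """Map eval metrics to the rubric verdict: either metric below its
    FAIL floor forces FAIL; both at PASS level means PASS; otherwise
    the run lands in the MARGINAL band."""
    if accuracy >= 0.90 and f1_scam >= 0.85:
        return "PASS"       # base model strong enough; fine-tuning optional
    if accuracy >= 0.75 and f1_scam >= 0.70:
        return "MARGINAL"   # fine-tune recommended
    return "FAIL"           # fine-tune required
```

Note the asymmetry: a single weak metric is enough to drop a run out of PASS, mirroring the **and**/**or** wording in the table.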
|
|
## 3. Fine-Tuning (Unsloth SFT)

```bash
# Install deps
pip install unsloth transformers datasets trl peft accelerate

# Train & push to HF Hub
python train_sft_unsloth.py \
    --output grandgemma-scam-sft \
    --push_to_hub s23deepak/grandgemma-scam-sft
```
|
|
**Hardware:** Kaggle T4×2 (free) or any single GPU with ≥ 16 GB VRAM.
**Config:** 4-bit + LoRA r=16, 3 epochs, lr=2e-4, batch=2, grad_accum=4.
**Expected time:** ~3–5 min / epoch on T4×2.

## 4. Re-Eval After Fine-Tuning

After training, run the same `eval_zero_shot.py` but point `--model` at your fine-tuned checkpoint:

```bash
python eval_zero_shot.py \
    --model s23deepak/grandgemma-scam-sft \
    --limit -1
```

Compare the delta in `accuracy`, `recall_scam`, and `f1_scam`.
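A small helper for that comparison, assuming both runs wrote the same JSON schema (the key names follow the metric names used above and are an assumption about `results_zero_shot.json`):

```python
def metric_deltas(base: dict, tuned: dict,
                  keys=("accuracy", "recall_scam", "f1_scam")) -> dict:
    """Return tuned-minus-base differences in percentage points,
    assuming both dicts hold metrics as fractions (0.0-1.0)."""
    return {k: round((tuned[k] - base[k]) * 100, 1) for k in keys}
```

For example, a jump from 0.78 to 0.93 accuracy reports as +15.0 pp, which can go straight into the run-tracking spreadsheet.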

## 5. Team Notes

- This repo is **evaluation-only**: no app code. App code lives in your monorepo (`/android`, `/ios`, `/extensions`, `/portal`).
- Fine-tuned weights produced here should be quantized to `.litertlm` for on-device Android inference (Stream A) and converted for iOS/browser WebGPU (Stream B).
- Track all runs in a spreadsheet: run_id | model | dataset | accuracy | f1_scam | notes.