# GrandgemMa: Gemma 4 Scam Detection Eval & Fine-Tune Kit
> **Goal:** Zero-shot test `google/gemma-4-E2B-it` (2B text) on real scam-call transcripts.
> If accuracy < 90% or F1(SCAM) < 85%, fine-tune with Unsloth 4-bit LoRA.
## Quick Links
| Artifact | Path |
|---|---|
| Zero-shot eval script | `eval_zero_shot.py` |
| Unsloth SFT trainer | `train_sft_unsloth.py` |
| Dataset formatter | `format_dataset.py` |
| Decision rubric | below |
## Datasets Used
- **Primary:** [`BothBosu/scam-dialogue`](https://huggingface.co/datasets/BothBosu/scam-dialogue): 800+ synthetic scam/legit call transcripts with fields `dialogue` + `label` (1 = SCAM, 0 = LEGIT).
- **Secondary (optional):** [`BothBosu/Scammer-Conversation`](https://huggingface.co/datasets/BothBosu/Scammer-Conversation): extra mixed conversations.
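The label convention above matters downstream. A minimal sketch of the mapping `format_dataset.py` has to perform (the helper name and exact output fields here are illustrative, not the script's actual internals):

```python
# Hypothetical helper: map a raw BothBosu/scam-dialogue row
# (fields `dialogue`, `label`) to a text/label-name pair.
LABEL_NAMES = {1: "SCAM", 0: "LEGIT"}  # per the dataset card

def format_row(row: dict) -> dict:
    return {
        "text": row["dialogue"].strip(),
        "label": LABEL_NAMES[row["label"]],
    }

example = {"dialogue": "Caller: Your SSN has been suspended...", "label": 1}
print(format_row(example)["label"])  # SCAM
```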
## 1. Zero-Shot Evaluation
```bash
# Quick smoke test (100 rows)
python eval_zero_shot.py --limit 100
# Full test split (~400 rows)
python eval_zero_shot.py --limit -1
# CPU-only fallback
python eval_zero_shot.py --device cpu --dtype fp32 --limit 20
```
**Output:** `results_zero_shot.json` + console report with accuracy / precision / recall / F1 / confusion matrix.
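For reference, the reported metrics are standard binary-classification quantities with SCAM as the positive class. A self-contained sketch of how they are defined (function and field names are illustrative, not the eval script's internals):

```python
# Compute accuracy / precision / recall / F1 / confusion matrix by hand,
# treating label 1 (SCAM) as the positive class.
def scam_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision_scam": precision,
        "recall_scam": recall,
        "f1_scam": f1,
        "confusion": [[tn, fp], [fn, tp]],  # rows = true class, cols = predicted
    }

m = scam_metrics([1, 1, 0, 0], [1, 0, 0, 1])
print(m["f1_scam"])
```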
## 2. Decision Rubric
| Condition | Verdict | Action |
|---|---|---|
| Accuracy ≥ 90% **and** F1(SCAM) ≥ 85% | ✅ PASS | Base model is strong enough. Fine-tuning optional. |
| Accuracy 75–89% **or** F1(SCAM) 70–84% | ⚠️ MARGINAL | **Fine-tune recommended.** Expected uplift +5–15 pp. |
| Accuracy < 75% **or** F1(SCAM) < 70% | ❌ FAIL | **Fine-tune REQUIRED.** Run `train_sft_unsloth.py`. |
> **Why these thresholds?** For elder-scam defense, missing a scam (false negative) is catastrophic. High recall on SCAM class is mandatory.
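The rubric can be expressed as a small function, useful for automating the go/no-go call in CI (a sketch for illustration, not code from this repo):

```python
# Apply the decision rubric: PASS and FAIL are checked first, so anything
# in between lands in MARGINAL.
def verdict(accuracy: float, f1_scam: float) -> str:
    if accuracy >= 0.90 and f1_scam >= 0.85:
        return "PASS"
    if accuracy < 0.75 or f1_scam < 0.70:
        return "FAIL"
    return "MARGINAL"

print(verdict(0.92, 0.88))  # PASS
print(verdict(0.80, 0.78))  # MARGINAL
print(verdict(0.74, 0.90))  # FAIL
```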
## 3. Fine-Tuning (Unsloth SFT)
```bash
# Install deps
pip install unsloth transformers datasets trl peft accelerate
# Train & push to HF Hub
python train_sft_unsloth.py \
--output grandgemma-scam-sft \
--push_to_hub s23deepak/grandgemma-scam-sft
```
**Hardware:** Kaggle T4×2 (free) or any single GPU with ≥ 16 GB VRAM.
**Config:** 4-bit quantization + LoRA r=16, 3 epochs, lr=2e-4, batch=2, grad_accum=4.
**Expected time:** ~3–5 min / epoch on T4×2.
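The config line above, written out as a dict for quick reference (key names follow common `TrainingArguments`-style conventions and are an assumption, not a dump of `train_sft_unsloth.py`). Note the effective batch size is per-device batch × gradient accumulation:

```python
# Stated hyperparameters from the Config line above (key names assumed).
CONFIG = {
    "load_in_4bit": True,
    "lora_r": 16,
    "num_train_epochs": 3,
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
}

# Effective batch per device = 2 * 4 = 8.
effective_batch = (
    CONFIG["per_device_train_batch_size"]
    * CONFIG["gradient_accumulation_steps"]
)
print(effective_batch)  # 8
```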
## 4. Re-Eval After Fine-Tuning
After training, run the same `eval_zero_shot.py` but point `--model` at your fine-tuned checkpoint:
```bash
python eval_zero_shot.py \
--model s23deepak/grandgemma-scam-sft \
--limit -1
```
Compare the delta in `accuracy`, `recall_scam`, and `f1_scam`.
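A minimal sketch of that comparison, assuming the two `results_zero_shot.json` files expose the metric names listed above (field names are an assumption about the eval script's output):

```python
# Diff two results dicts on the metrics we care about.
def metric_deltas(base: dict, tuned: dict,
                  keys=("accuracy", "recall_scam", "f1_scam")) -> dict:
    return {k: round(tuned[k] - base[k], 4) for k in keys}

base = {"accuracy": 0.78, "recall_scam": 0.70, "f1_scam": 0.74}
tuned = {"accuracy": 0.93, "recall_scam": 0.95, "f1_scam": 0.91}
print(metric_deltas(base, tuned))
```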
## 5. Team Notes
- This repo is **evaluation-only**. App code lives in your monorepo (`/android`, `/ios`, `/extensions`, `/portal`).
- Fine-tuned weights produced here should be quantized to `.litertlm` for on-device Android inference (Stream A) and converted for iOS/browser WebGPU (Stream B).
- Track all runs in a spreadsheet: run_id | model | dataset | accuracy | f1_scam | notes.
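One way to keep that run log machine-readable alongside the spreadsheet is an append-only CSV with the same columns (the file name here is illustrative):

```python
# Append one run record per eval/training run; writes a header on first use.
import csv
from pathlib import Path

LOG = Path("runs_log.csv")
FIELDS = ["run_id", "model", "dataset", "accuracy", "f1_scam", "notes"]

def log_run(row: dict) -> None:
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

log_run({"run_id": "zs-001", "model": "google/gemma-4-E2B-it",
         "dataset": "BothBosu/scam-dialogue", "accuracy": 0.78,
         "f1_scam": 0.74, "notes": "zero-shot baseline"})
```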