GrandGemma: Gemma 4 Scam Detection Eval & Fine-Tune Kit

Goal: zero-shot test google/gemma-4-E2B-it (2B, text-only) on real scam-call transcripts.
If accuracy < 90 % or F1(SCAM) < 85 % → fine-tune with Unsloth 4-bit LoRA.

Quick Links

| Artifact | Path |
| --- | --- |
| Zero-shot eval script | eval_zero_shot.py |
| Unsloth SFT trainer | train_sft_unsloth.py |
| Dataset formatter | format_dataset.py |
| Decision rubric | see section 2 below |

Datasets Used

1. Zero-Shot Evaluation

# Quick smoke test (100 rows)
python eval_zero_shot.py --limit 100

# Full test split (~400 rows)
python eval_zero_shot.py --limit -1

# CPU-only fallback
python eval_zero_shot.py --device cpu --dtype fp32 --limit 20

Output: results_zero_shot.json + console report with accuracy / precision / recall / F1 / confusion matrix.
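For orientation, here is a minimal sketch of the kind of zero-shot classification loop eval_zero_shot.py is expected to run: prompt the instruction-tuned model for a one-word verdict per transcript and parse it into a label. The prompt wording, label parsing, and helper below are illustrative assumptions, not the script's actual code.

# Illustrative only: one way to obtain a zero-shot SCAM/LEGIT verdict.
# The prompt text and label parsing are assumptions, not the contents
# of eval_zero_shot.py.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-4-E2B-it"  # base model named in this README

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def classify(transcript: str) -> str:
    """Return "SCAM" or "LEGIT" for one call transcript."""
    messages = [{
        "role": "user",
        "content": "You are a scam-call detector. Reply with exactly one word, "
                   "SCAM or LEGIT.\n\nTranscript:\n" + transcript,
    }]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(input_ids, max_new_tokens=5, do_sample=False)
    reply = tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True)
    return "SCAM" if "SCAM" in reply.upper() else "LEGIT"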

2. Decision Rubric

| Condition | Verdict | Action |
| --- | --- | --- |
| Accuracy ≥ 90 % and F1(SCAM) ≥ 85 % | ✅ PASS | Base model is strong enough. Fine-tuning optional. |
| Accuracy 75–89 % or F1(SCAM) 70–84 % | ⚠️ MARGINAL | Fine-tune recommended. Expected uplift +5–15 pp. |
| Accuracy < 75 % or F1(SCAM) < 70 % | ❌ FAIL | Fine-tune REQUIRED. Run train_sft_unsloth.py. |

Why these thresholds? For elder-scam defense, missing a scam (false negative) is catastrophic. High recall on the SCAM class is mandatory.
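To automate the verdict, the rubric maps directly onto the metrics in results_zero_shot.json. A minimal sketch, assuming accuracy and f1_scam are top-level keys in that file (the actual schema may differ):

# Rubric check sketch; the JSON keys are assumptions about what
# eval_zero_shot.py writes, not a documented schema.
import json

with open("results_zero_shot.json") as f:
    results = json.load(f)

acc, f1_scam = results["accuracy"], results["f1_scam"]  # assumed keys

if acc >= 0.90 and f1_scam >= 0.85:
    print("PASS - base model is strong enough; fine-tuning optional")
elif acc >= 0.75 and f1_scam >= 0.70:
    print("MARGINAL - fine-tuning recommended")
else:
    print("FAIL - fine-tuning required; run train_sft_unsloth.py")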

3. Fine-Tuning (Unsloth SFT)

# Install deps
pip install unsloth transformers datasets trl peft accelerate

# Train & push to HF Hub
python train_sft_unsloth.py \
  --output grandgemma-scam-sft \
  --push_to_hub s23deepak/grandgemma-scam-sft

Hardware: Kaggle T4×2 (free) or any single GPU with ≥ 16 GB VRAM.
Config: 4-bit + LoRA r=16, 3 epochs, lr=2e-4, batch=2, grad_accum=4.
Expected time: ~3–5 min / epoch on T4×2.
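For reference, a minimal sketch of what that config looks like in the Unsloth + TRL API. The dataset path, text column, and LoRA target modules below are placeholders; train_sft_unsloth.py is the source of truth.

# Sketch of a 4-bit LoRA SFT run matching the config above.
# Dataset path, text field, and target modules are placeholders.
import torch
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-4-E2B-it",   # base model named in this README
    max_seq_length=2048,
    load_in_4bit=True,                    # 4-bit quantized base weights
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                 # LoRA rank from the config line above
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # placeholder

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",            # placeholder column name
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,    # batch=2
        gradient_accumulation_steps=4,    # grad_accum=4
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=torch.cuda.is_bf16_supported(),
        output_dir="grandgemma-scam-sft",
    ),
)
trainer.train()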

4. Re-Eval After Fine-Tuning

After training, run the same eval_zero_shot.py but point --model at your fine-tuned checkpoint:

python eval_zero_shot.py \
  --model s23deepak/grandgemma-scam-sft \
  --limit -1

Compare the delta in accuracy, recall_scam, and f1_scam.
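A small before/after report can be scripted; the sketch below assumes you keep the base and fine-tuned result files under separate names and that these metrics are top-level JSON keys (both assumptions about the script's output).

# Delta report sketch; filenames and JSON keys are assumptions.
import json

base = json.load(open("results_zero_shot_base.json"))
tuned = json.load(open("results_zero_shot_sft.json"))

for metric in ("accuracy", "recall_scam", "f1_scam"):
    delta = tuned[metric] - base[metric]
    print(f"{metric}: {base[metric]:.3f} -> {tuned[metric]:.3f} ({delta:+.3f})")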

5. Team Notes

  • This repo is evaluation-only; no app code. App code lives in your monorepo (/android, /ios, /extensions, /portal).
  • Fine-tuned weights produced here should be quantized to .litertlm for on-device Android inference (Stream A) and converted for iOS/browser WebGPU (Stream B).
  • Track all runs in a spreadsheet: run_id | model | dataset | accuracy | f1_scam | notes (see the CSV sketch below).
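If a spreadsheet feels heavy, a plain CSV works too. A minimal sketch; runs.csv and the log_run helper are not part of this repo, just one way to keep the columns listed above.

# Hypothetical run logger; runs.csv and log_run are illustrative only.
import csv
from pathlib import Path

def log_run(run_id, model, dataset, accuracy, f1_scam, notes=""):
    path = Path("runs.csv")
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["run_id", "model", "dataset", "accuracy", "f1_scam", "notes"])
        writer.writerow([run_id, model, dataset, accuracy, f1_scam, notes])

# Example call with dummy values:
log_run("zs-001", "google/gemma-4-E2B-it", "test-split", 0.87, 0.81, "zero-shot baseline")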