# GrandGemma: Gemma 4 Scam Detection Eval & Fine-Tune Kit
**Goal:** Zero-shot test `google/gemma-4-E2B-it` (2B, text) on real scam-call transcripts. If accuracy < 90% or F1(SCAM) < 85%, fine-tune with Unsloth 4-bit LoRA.
## Quick Links
| Artifact | Path |
|---|---|
| Zero-shot eval script | `eval_zero_shot.py` |
| Unsloth SFT trainer | `train_sft_unsloth.py` |
| Dataset formatter | `format_dataset.py` |
| Decision rubric | below |
## Datasets Used

- **Primary:** `BothBosu/scam-dialogue`, 800+ synthetic scam/legit call transcripts with columns `dialogue` + `label` (1 = SCAM, 0 = LEGIT).
- **Secondary (optional):** `BothBosu/Scammer-Conversation`, extra mixed conversations.
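As a concrete sketch of the row schema above, a minimal normalization helper (the `to_example` name is hypothetical and only mirrors what `format_dataset.py` might do; it assumes the `dialogue` + `label` columns described for the primary dataset):

```python
# Assumed label encoding from BothBosu/scam-dialogue: 1 = SCAM, 0 = LEGIT.
LABEL_NAMES = {0: "LEGIT", 1: "SCAM"}

def to_example(row: dict) -> dict:
    """Map a raw dataset row ({'dialogue': str, 'label': int})
    into a cleaned (text, label_name) pair for evaluation."""
    return {
        "text": row["dialogue"].strip(),
        "label_name": LABEL_NAMES[int(row["label"])],
    }
```

With the real dataset you would apply this via `datasets.Dataset.map` after `load_dataset("BothBosu/scam-dialogue")`.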
## 1. Zero-Shot Evaluation

```bash
# Quick smoke test (100 rows)
python eval_zero_shot.py --limit 100

# Full test split (~400 rows)
python eval_zero_shot.py --limit -1

# CPU-only fallback
python eval_zero_shot.py --device cpu --dtype fp32 --limit 20
```
**Output:** `results_zero_shot.json` plus a console report with accuracy, precision, recall, F1, and the confusion matrix.
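If you want to sanity-check the reported numbers from paired label/prediction lists, here is a dependency-free sketch (the `binary_metrics` function is illustrative, not part of `eval_zero_shot.py`):

```python
from collections import Counter

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, plus precision/recall/F1 for the positive (SCAM=1) class,
    and a confusion matrix as {(true, pred): count}."""
    cm = Counter(zip(y_true, y_pred))
    tp = cm[(positive, positive)]
    fp = sum(c for (t, p), c in cm.items() if p == positive and t != positive)
    fn = sum(c for (t, p), c in cm.items() if t == positive and p != positive)
    acc = sum(c for (t, p), c in cm.items() if t == p) / max(len(y_true), 1)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"accuracy": acc, "precision": prec, "recall": rec,
            "f1": f1, "confusion": dict(cm)}
```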
## 2. Decision Rubric

| Condition | Verdict | Action |
|---|---|---|
| Accuracy ≥ 90% and F1(SCAM) ≥ 85% | ✅ PASS | Base model is strong enough; fine-tuning optional. |
| Accuracy 75–89% or F1(SCAM) 70–84% | ⚠️ MARGINAL | Fine-tune recommended; expected uplift +5–15 pp. |
| Accuracy < 75% or F1(SCAM) < 70% | ❌ FAIL | Fine-tune REQUIRED. Run `train_sft_unsloth.py`. |
**Why these thresholds?** For elder-scam defense, missing a scam (a false negative) is catastrophic, so high recall on the SCAM class is mandatory.
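For automated runs, the rubric can be encoded as a tiny helper; a sketch (hypothetical function, taking percentages on a 0–100 scale):

```python
def verdict(accuracy: float, f1_scam: float) -> str:
    """Encode the decision rubric: PASS needs both thresholds met,
    FAIL is triggered by either hard floor, everything else is MARGINAL."""
    if accuracy >= 90 and f1_scam >= 85:
        return "PASS"
    if accuracy >= 75 and f1_scam >= 70:
        return "MARGINAL"
    return "FAIL"
```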
## 3. Fine-Tuning (Unsloth SFT)

```bash
# Install deps
pip install unsloth transformers datasets trl peft accelerate

# Train & push to HF Hub
python train_sft_unsloth.py \
    --output grandgemma-scam-sft \
    --push_to_hub s23deepak/grandgemma-scam-sft
```
- **Hardware:** Kaggle T4×2 (free) or any single GPU with ≥ 16 GB VRAM.
- **Config:** 4-bit quantization + LoRA r=16, 3 epochs, lr=2e-4, batch=2, grad_accum=4.
- **Expected time:** ~3–5 min/epoch on T4×2.
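Before SFT, each labeled transcript must be rendered as a chat example. A minimal sketch of that formatting step, assuming the `dialogue`/`label` columns; the instruction wording and the `to_chat` name are illustrative (the actual Gemma chat template would be applied later by `tokenizer.apply_chat_template`):

```python
# Hypothetical task instruction prepended to each transcript.
INSTRUCTION = (
    "Classify the following phone-call transcript as SCAM or LEGIT. "
    "Answer with a single word."
)

def to_chat(row: dict) -> dict:
    """Turn {'dialogue': str, 'label': int} into a messages list
    suitable for chat-template-based SFT (e.g. with TRL's SFTTrainer)."""
    answer = "SCAM" if int(row["label"]) == 1 else "LEGIT"
    return {
        "messages": [
            {"role": "user", "content": f"{INSTRUCTION}\n\n{row['dialogue']}"},
            {"role": "assistant", "content": answer},
        ]
    }
```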
## 4. Re-Eval After Fine-Tuning

After training, run the same `eval_zero_shot.py`, pointing `--model` at your fine-tuned checkpoint:

```bash
python eval_zero_shot.py \
    --model s23deepak/grandgemma-scam-sft \
    --limit -1
```
Compare the deltas in `accuracy`, `recall_scam`, and `f1_scam` against the zero-shot baseline.
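A small sketch for computing those deltas from two metric dicts (the `metric_delta` helper and the second results filename are hypothetical; metrics are assumed stored as 0–100 percentages):

```python
import json

KEYS = ("accuracy", "recall_scam", "f1_scam")

def metric_delta(base: dict, tuned: dict, keys=KEYS) -> dict:
    """Per-metric improvement (tuned - base) in percentage points."""
    return {k: round(tuned[k] - base[k], 2) for k in keys}

# Hypothetical usage with two results_zero_shot.json-style files:
# base = json.load(open("results_zero_shot.json"))
# tuned = json.load(open("results_sft.json"))
# print(metric_delta(base, tuned))
```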
## 5. Team Notes

- This repo is evaluation-only; no app code. App code lives in your monorepo (`/android`, `/ios`, `/extensions`, `/portal`).
- Fine-tuned weights produced here should be quantized to `.litertlm` for on-device Android inference (Stream A) and converted for iOS/browser WebGPU (Stream B).
- Track all runs in a spreadsheet: `run_id | model | dataset | accuracy | f1_scam | notes`.