# GrandgemMa: Gemma 4 Scam Detection Eval & Fine-Tune Kit
> **Goal:** Zero-shot test `google/gemma-4-E2B-it` (2B text) on real scam-call transcripts.
> If accuracy < 90% or F1(SCAM) < 85%, fine-tune with Unsloth 4-bit LoRA.
## Quick Links
| Artifact | Path |
|---|---|
| Zero-shot eval script | `eval_zero_shot.py` |
| Unsloth SFT trainer | `train_sft_unsloth.py` |
| Dataset formatter | `format_dataset.py` |
| Decision rubric | below |
## Datasets Used
- **Primary:** [`BothBosu/scam-dialogue`](https://huggingface.co/datasets/BothBosu/scam-dialogue): 800+ synthetic scam/legit call transcripts with fields `dialogue` + `label` (1 = SCAM, 0 = LEGIT).
- **Secondary (optional):** [`BothBosu/Scammer-Conversation`](https://huggingface.co/datasets/BothBosu/Scammer-Conversation): extra mixed conversations.
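The label convention above matters downstream. A minimal sketch of the mapping `format_dataset.py` has to perform (the helper name and exact output fields here are illustrative, not the script's actual internals):

```python
# Hypothetical helper: map a raw BothBosu/scam-dialogue row
# (fields `dialogue`, `label`) to a text/label-name pair.
LABEL_NAMES = {1: "SCAM", 0: "LEGIT"}  # per the dataset card

def format_row(row: dict) -> dict:
    return {
        "text": row["dialogue"].strip(),
        "label": LABEL_NAMES[row["label"]],
    }

example = {"dialogue": "Caller: Your SSN has been suspended...", "label": 1}
print(format_row(example)["label"])  # SCAM
```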
## 1. Zero-Shot Evaluation
```bash
# Quick smoke test (100 rows)
python eval_zero_shot.py --limit 100
# Full test split (~400 rows)
python eval_zero_shot.py --limit -1
# CPU-only fallback
python eval_zero_shot.py --device cpu --dtype fp32 --limit 20
```
**Output:** `results_zero_shot.json` + console report with accuracy / precision / recall / F1 / confusion matrix.
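For reference, the reported metrics are standard binary-classification quantities with SCAM as the positive class. A self-contained sketch of how they are defined (function and field names are illustrative, not the eval script's internals):

```python
# Compute accuracy / precision / recall / F1 / confusion matrix by hand,
# treating label 1 (SCAM) as the positive class.
def scam_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision_scam": precision,
        "recall_scam": recall,
        "f1_scam": f1,
        "confusion": [[tn, fp], [fn, tp]],  # rows = true class, cols = predicted
    }

m = scam_metrics([1, 1, 0, 0], [1, 0, 0, 1])
print(m["f1_scam"])
```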
## 2. Decision Rubric
| Condition | Verdict | Action |
|---|---|---|
| Accuracy ≥ 90% **and** F1(SCAM) ≥ 85% | ✅ PASS | Base model is strong enough. Fine-tuning optional. |
| Accuracy 75–89% **or** F1(SCAM) 70–84% | ⚠️ MARGINAL | **Fine-tune recommended.** Expected uplift +5–15 pp. |
| Accuracy < 75% **or** F1(SCAM) < 70% | ❌ FAIL | **Fine-tune REQUIRED.** Run `train_sft_unsloth.py`. |
> **Why these thresholds?** For elder-scam defense, missing a scam (false negative) is catastrophic. High recall on SCAM class is mandatory.
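The rubric can be expressed as a small function, useful for automating the go/no-go call in CI (a sketch for illustration, not code from this repo):

```python
# Apply the decision rubric: PASS and FAIL are checked first, so anything
# in between lands in MARGINAL.
def verdict(accuracy: float, f1_scam: float) -> str:
    if accuracy >= 0.90 and f1_scam >= 0.85:
        return "PASS"
    if accuracy < 0.75 or f1_scam < 0.70:
        return "FAIL"
    return "MARGINAL"

print(verdict(0.92, 0.88))  # PASS
print(verdict(0.80, 0.78))  # MARGINAL
print(verdict(0.74, 0.90))  # FAIL
```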
## 3. Fine-Tuning (Unsloth SFT)
```bash
# Install deps
pip install unsloth transformers datasets trl peft accelerate
# Train & push to HF Hub
python train_sft_unsloth.py \
--output grandgemma-scam-sft \
--push_to_hub s23deepak/grandgemma-scam-sft
```
**Hardware:** Kaggle T4×2 (free) or any single GPU with ≥ 16 GB VRAM.
**Config:** 4-bit quantization + LoRA r=16, 3 epochs, lr=2e-4, batch=2, grad_accum=4.
**Expected time:** ~3–5 min / epoch on T4×2.
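The config line above, written out as a dict for quick reference (key names follow common `TrainingArguments`-style conventions and are an assumption, not a dump of `train_sft_unsloth.py`). Note the effective batch size is per-device batch × gradient accumulation:

```python
# Stated hyperparameters from the Config line above (key names assumed).
CONFIG = {
    "load_in_4bit": True,
    "lora_r": 16,
    "num_train_epochs": 3,
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 4,
}

# Effective batch per device = 2 * 4 = 8.
effective_batch = (
    CONFIG["per_device_train_batch_size"]
    * CONFIG["gradient_accumulation_steps"]
)
print(effective_batch)  # 8
```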
## 4. Re-Eval After Fine-Tuning
After training, run the same `eval_zero_shot.py` but point `--model` at your fine-tuned checkpoint:
```bash
python eval_zero_shot.py \
--model s23deepak/grandgemma-scam-sft \
--limit -1
```
Compare the delta in `accuracy`, `recall_scam`, and `f1_scam`.
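A minimal sketch of that comparison, assuming the two `results_zero_shot.json` files expose the metric names listed above (field names are an assumption about the eval script's output):

```python
# Diff two results dicts on the metrics we care about.
def metric_deltas(base: dict, tuned: dict,
                  keys=("accuracy", "recall_scam", "f1_scam")) -> dict:
    return {k: round(tuned[k] - base[k], 4) for k in keys}

base = {"accuracy": 0.78, "recall_scam": 0.70, "f1_scam": 0.74}
tuned = {"accuracy": 0.93, "recall_scam": 0.95, "f1_scam": 0.91}
print(metric_deltas(base, tuned))
```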
## 5. Team Notes
- This repo is **evaluation-only**. App code lives in your monorepo (`/android`, `/ios`, `/extensions`, `/portal`).
- Fine-tuned weights produced here should be quantized to `.litertlm` for on-device Android inference (Stream A) and converted for iOS/browser WebGPU (Stream B).
- Track all runs in a spreadsheet: run_id | model | dataset | accuracy | f1_scam | notes.
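One way to keep that run log machine-readable alongside the spreadsheet is an append-only CSV with the same columns (the file name here is illustrative):

```python
# Append one run record per eval/training run; writes a header on first use.
import csv
from pathlib import Path

LOG = Path("runs_log.csv")
FIELDS = ["run_id", "model", "dataset", "accuracy", "f1_scam", "notes"]

def log_run(row: dict) -> None:
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)

log_run({"run_id": "zs-001", "model": "google/gemma-4-E2B-it",
         "dataset": "BothBosu/scam-dialogue", "accuracy": 0.78,
         "f1_scam": 0.74, "notes": "zero-shot baseline"})
```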