asdf98 commited on
Commit
ba76afd
Β·
verified Β·
1 Parent(s): f09f7ce

Upload EthicalHacking_Gemma4_E2B_Colab.ipynb

Browse files
Files changed (1) hide show
  1. EthicalHacking_Gemma4_E2B_Colab.ipynb +166 -154
EthicalHacking_Gemma4_E2B_Colab.ipynb CHANGED
@@ -4,27 +4,26 @@
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
- "# πŸ” Ultimate Ethical Hacking LLM – Gemma 4 E2B (Colab Free Tier T4)\n",
8
  "\n",
9
  "**πŸ₯‡ Model:** [Google Gemma 4 E2B](https://huggingface.co/google/gemma-4-E2B-it) via Unsloth 4-bit \n",
10
- "**πŸ† Why this model?** Dense ~2B parameter edge model. NOT an MoE β€” all 2B params are active every forward pass. Strong reasoning for its size. \n",
11
- "**⚠️ T4 WARNING:** This is **tight on 16GB VRAM**. The 4-bit model alone uses ~7.4GB. You MUST follow the memory-optimized settings below. \n",
12
- "**πŸ“Š Datasets:** [Fenrir v2.1](https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1) + [Trendyol Cybersecurity](https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset) \n",
13
  "**⚑ Framework:** Unsloth + TRL SFTTrainer \n",
14
  "\n",
15
- "> ⚠️ **Disclaimer:** Defensive cybersecurity datasets only. Ethical hacking education.\n",
16
  "\n",
17
  "---\n",
18
  "\n",
19
- "## πŸ“‹ Why Gemma-4 E2B?\n",
20
  "\n",
21
  "| Spec | Value |\n",
22
  "|------|-------|\n",
23
  "| Parameters | ~2B (dense, NOT MoE) |\n",
24
  "| 4-bit VRAM | ~7.4 GB |\n",
25
- "| Context | Up to 256K tokens |\n",
26
  "| Batch size on T4 | **1 only** |\n",
27
- "| Max seq length | **2048 max** on T4 |\n",
28
  "| LoRA rank | **8** (save VRAM) |\n",
29
  "\n",
30
  "**Unsloth docs:** https://unsloth.ai/docs/models/gemma-4/train \n",
@@ -71,15 +70,15 @@
71
  "source": [
72
  "## 3️⃣ Load Gemma-4 E2B in 4-bit via Unsloth\n",
73
  "\n",
74
- "**⚠️ T4 MEMORY LIMITS β€” READ CAREFULLY:**\n",
75
  "\n",
76
  "| Setting | Value | Why |\n",
77
  "|---------|-------|-----|\n",
78
  "| `BATCH_SIZE` | **1** | Cannot fit >1 on T4 |\n",
79
- "| `MAX_SEQ_LENGTH` | **2048** | Longer = OOM during backprop |\n",
80
- "| `LORA_R` | **8** | Small rank = fewer adapter params |\n",
81
- "| `GRAD_ACCUM` | **8** | Effective batch still = 8 |\n",
82
- "| `PACKING` | **False** | Avoids complex memory spikes |\n",
83
  "| `optim` | `adamw_8bit` | Must use 8-bit optimizer |\n",
84
  "\n",
85
  "If you still OOM: lower `MAX_SEQ_LENGTH` to 1024, or use `use_rslora=True`."
@@ -98,27 +97,24 @@
98
  "MAX_SEQ_LENGTH = 2048 # DO NOT exceed 2048 on T4\n",
99
  "LORA_R = 8 # small rank for memory\n",
100
  "LORA_ALPHA = 8 \n",
101
- "BATCH_SIZE = 1 # MUST be 1 on T4 (model is ~7.4GB in 4-bit)\n",
102
  "GRAD_ACCUM = 8 # effective batch = 8\n",
103
  "LEARNING_RATE = 2e-4 \n",
104
- "NUM_EPOCHS = 1\n",
105
  "MAX_STEPS = 4000 \n",
106
- "WARMUP_STEPS = 100 # shorter warmup (tight memory)\n",
107
  "LOGGING_STEPS = 50 \n",
108
  "SAVE_STEPS = 500 \n",
109
- "PACKING = False # False = simpler memory profile\n",
110
  "SAMPLE_SIZE = 50000 \n",
111
- "HUB_MODEL_ID = \"your-username/cyber-gemma4-e2b-lora\" \n",
112
  "# ================================================================================\n",
113
  "\n",
114
- "# NOTE: Unsloth auto-applies 4-bit when loading Gemma-4.\n",
115
- "# If the unsloth-bnb-4bit ID doesn't exist, try the base unsloth ID with load_in_4bit=True.\n",
116
  "MODEL_NAME = \"unsloth/gemma-4-E2B-it-unsloth-bnb-4bit\" # ~7.6GB download\n",
117
  "\n",
118
  "model, tokenizer = FastLanguageModel.from_pretrained(\n",
119
  " model_name=MODEL_NAME,\n",
120
  " max_seq_length=MAX_SEQ_LENGTH,\n",
121
- " dtype=None, # auto-detect (fp16 on T4)\n",
122
  " load_in_4bit=True,\n",
123
  ")\n",
124
  "\n",
@@ -128,18 +124,18 @@
128
  " target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
129
  " \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
130
  " lora_alpha=LORA_ALPHA,\n",
131
- " lora_dropout=0, \n",
132
  " bias=\"none\",\n",
133
  " use_gradient_checkpointing=\"unsloth\", # CRITICAL for T4\n",
134
  " random_state=3407,\n",
135
- " use_rslora=False, # set True if still OOM\n",
136
  " loftq_config=None,\n",
137
  ")\n",
138
  "\n",
139
  "trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
140
  "total = sum(p.numel() for p in model.parameters())\n",
141
  "print(f\"βœ… Gemma-4 E2B loaded. Trainable params: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)\")\n",
142
- "print(f\"⚠️ This model is LARGE. Expected VRAM during training: ~12-14 GB\")\n",
143
  "print(f\" If you get OOM, lower MAX_SEQ_LENGTH to 1024 or set use_rslora=True\")"
144
  ]
145
  },
@@ -147,7 +143,19 @@
147
  "cell_type": "markdown",
148
  "metadata": {},
149
  "source": [
150
- "## 4️⃣ Load, Audit, Subsample & Merge Cybersecurity Datasets"
 
 
 
 
 
 
 
 
 
 
 
 
151
  ]
152
  },
153
  {
@@ -157,53 +165,119 @@
157
  "outputs": [],
158
  "source": [
159
  "from datasets import load_dataset, concatenate_datasets\n",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
160
  "import random\n",
161
  "\n",
162
- "# ---------- Dataset 1: Fenrir v2.1 ----------\n",
163
- "print(\"πŸ“₯ Loading Fenrir v2.1...\")\n",
164
- "ds1 = load_dataset(\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", split=\"train\")\n",
165
- "print(f\" Rows: {len(ds1)} | Columns: {ds1.column_names}\")\n",
166
- "\n",
167
- "for i in random.sample(range(len(ds1)), 2):\n",
168
- " print(f\"\\n--- Sample {i} ---\")\n",
169
- " print(f\"SYSTEM: {ds1[i]['system'][:120]}...\")\n",
170
- " print(f\"USER: {ds1[i]['user'][:120]}...\")\n",
171
- " print(f\"ASSIST: {ds1[i]['assistant'][:120]}...\")\n",
172
- "\n",
173
- "def fenrir_to_messages(example):\n",
174
- " return {\n",
175
- " \"messages\": [\n",
176
- " {\"role\": \"system\", \"content\": example[\"system\"]},\n",
177
- " {\"role\": \"user\", \"content\": example[\"user\"]},\n",
178
- " {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
179
- " ]\n",
180
- " }\n",
181
- "\n",
182
- "ds1 = ds1.map(fenrir_to_messages, remove_columns=ds1.column_names, batched=False)\n",
183
- "\n",
184
- "# ---------- Dataset 2: Trendyol ----------\n",
185
- "print(\"\\nπŸ“₯ Loading Trendyol Cybersecurity...\")\n",
186
- "ds2 = load_dataset(\"Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset\", split=\"train\")\n",
187
- "print(f\" Rows: {len(ds2)} | Columns: {ds2.column_names}\")\n",
188
- "\n",
189
- "def trendyol_to_messages(example):\n",
190
- " return {\n",
191
- " \"messages\": [\n",
192
- " {\"role\": \"system\", \"content\": example[\"system\"]},\n",
193
- " {\"role\": \"user\", \"content\": example[\"user\"]},\n",
194
- " {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
195
- " ]\n",
196
- " }\n",
197
- "\n",
198
- "ds2 = ds2.map(trendyol_to_messages, remove_columns=ds2.column_names, batched=False)\n",
199
- "\n",
200
- "# ---------- Merge & Subsample ----------\n",
201
- "train_dataset = concatenate_datasets([ds1, ds2])\n",
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
202
  "print(f\"\\nπŸ“Š COMBINED DATASET: {len(train_dataset)} rows\")\n",
203
  "\n",
 
 
 
 
204
  "if len(train_dataset) > SAMPLE_SIZE:\n",
205
  " train_dataset = train_dataset.shuffle(seed=3407).select(range(SAMPLE_SIZE))\n",
206
- " print(f\"πŸš€ SUBSAMPLED to {len(train_dataset)} rows\")\n",
207
  "\n",
208
  "print(f\" Effective batch size: {BATCH_SIZE * GRAD_ACCUM}\")\n",
209
  "print(f\" Steps per epoch: ~{len(train_dataset) // (BATCH_SIZE * GRAD_ACCUM)}\")\n",
@@ -214,7 +288,7 @@
214
  "cell_type": "markdown",
215
  "metadata": {},
216
  "source": [
217
- "## 5️⃣ Pre-process Dataset to Text (Avoid Unsloth formatting_func issues)"
218
  ]
219
  },
220
  {
@@ -226,22 +300,12 @@
226
  "def convert_messages_to_text(examples):\n",
227
  " texts = []\n",
228
  " for msgs in examples[\"messages\"]:\n",
229
- " text = tokenizer.apply_chat_template(\n",
230
- " msgs,\n",
231
- " tokenize=False,\n",
232
- " add_generation_prompt=False,\n",
233
- " )\n",
234
  " texts.append(text)\n",
235
  " return {\"text\": texts}\n",
236
  "\n",
237
  "print(\"πŸ”„ Converting messages to text...\")\n",
238
- "train_dataset = train_dataset.map(\n",
239
- " convert_messages_to_text,\n",
240
- " batched=True,\n",
241
- " remove_columns=[\"messages\"],\n",
242
- " batch_size=100,\n",
243
- ")\n",
244
- "\n",
245
  "print(f\"βœ… Dataset pre-processed. Columns: {train_dataset.column_names}\")\n",
246
  "print(f\"πŸ“„ Sample text length: {len(train_dataset[0]['text'])} chars\")"
247
  ]
@@ -250,7 +314,7 @@
250
  "cell_type": "markdown",
251
  "metadata": {},
252
  "source": [
253
- "## 6️⃣ Configure SFT Trainer (T4-Safe Memory Settings)"
254
  ]
255
  },
256
  {
@@ -269,16 +333,16 @@
269
  " dataset_text_field=\"text\",\n",
270
  " max_seq_length=MAX_SEQ_LENGTH,\n",
271
  " dataset_num_proc=2,\n",
272
- " packing=PACKING, # False = safer for T4 with large model\n",
273
  " args=TrainingArguments(\n",
274
- " per_device_train_batch_size=BATCH_SIZE, # MUST be 1\n",
275
- " gradient_accumulation_steps=GRAD_ACCUM, # effective batch = 8\n",
276
  " warmup_steps=WARMUP_STEPS,\n",
277
  " max_steps=MAX_STEPS,\n",
278
  " learning_rate=LEARNING_RATE,\n",
279
- " fp16=True, # T4 = fp16 only\n",
280
  " logging_steps=LOGGING_STEPS,\n",
281
- " optim=\"adamw_8bit\", # CRITICAL: saves ~2-3GB VRAM\n",
282
  " weight_decay=0.01,\n",
283
  " lr_scheduler_type=\"linear\",\n",
284
  " seed=3407,\n",
@@ -287,11 +351,10 @@
287
  " save_steps=SAVE_STEPS,\n",
288
  " save_total_limit=2,\n",
289
  " report_to=\"none\",\n",
290
- " # gradient_checkpointing=True, # already set via use_gradient_checkpointing in LoRA\n",
291
  " ),\n",
292
  ")\n",
293
  "\n",
294
- "print(f\"βœ… Trainer ready. Total steps: {MAX_STEPS}\")\n",
295
  "print(f\" Effective batch size: {BATCH_SIZE * GRAD_ACCUM}\")\n",
296
  "print(f\" Packing enabled: {PACKING}\")\n",
297
  "print(f\" ⚠️ Expected training VRAM: ~12-14 GB (out of 16 GB)\")\n",
@@ -302,7 +365,7 @@
302
  "cell_type": "markdown",
303
  "metadata": {},
304
  "source": [
305
- "## 7️⃣ Train πŸš€ (Watch for OOM!)"
306
  ]
307
  },
308
  {
@@ -330,7 +393,7 @@
330
  "cell_type": "markdown",
331
  "metadata": {},
332
  "source": [
333
- "## 8️⃣ Save & Push to HuggingFace Hub"
334
  ]
335
  },
336
  {
@@ -339,20 +402,16 @@
339
  "metadata": {},
340
  "outputs": [],
341
  "source": [
342
- "# 8A) Save LoRA adapter (tiny, fast)\n",
343
  "model.save_pretrained(\"./gemma4-lora-adapter\")\n",
344
  "tokenizer.save_pretrained(\"./gemma4-lora-adapter\")\n",
345
  "print(\"βœ… LoRA adapter saved\")\n",
346
  "\n",
347
- "# 8B) Merge & save full model\n",
348
- "# ⚠️ Merging may push to CPU swap on Colab. Still works but slower.\n",
349
  "print(\"\\nπŸ”„ Merging LoRA into base model...\")\n",
350
  "merged_model = model.merge_and_unload()\n",
351
  "merged_model.save_pretrained(\"./gemma4-merged\")\n",
352
  "tokenizer.save_pretrained(\"./gemma4-merged\")\n",
353
  "print(\"βœ… Merged model saved\")\n",
354
  "\n",
355
- "# 8C) Push to HF Hub (uncomment if logged in)\n",
356
  "# model.push_to_hub(HUB_MODEL_ID)\n",
357
  "# tokenizer.push_to_hub(HUB_MODEL_ID)"
358
  ]
@@ -361,7 +420,7 @@
361
  "cell_type": "markdown",
362
  "metadata": {},
363
  "source": [
364
- "## 9️⃣ Inference Demo – Responsible Pentesting"
365
  ]
366
  },
367
  {
@@ -372,90 +431,43 @@
372
  "source": [
373
  "FastLanguageModel.for_inference(model)\n",
374
  "\n",
375
- "test_prompt = \"How would you perform a responsible penetration test on a web application?\"\n",
376
  "\n",
377
  "messages = [\n",
378
- " {\"role\": \"system\", \"content\": \"You are a cybersecurity expert. Explain concepts clearly and ethically.\"},\n",
379
  " {\"role\": \"user\", \"content\": test_prompt},\n",
380
  "]\n",
381
  "\n",
382
- "inputs = tokenizer.apply_chat_template(\n",
383
- " messages,\n",
384
- " tokenize=True,\n",
385
- " add_generation_prompt=True,\n",
386
- " return_tensors=\"pt\",\n",
387
- ").to(model.device)\n",
388
- "\n",
389
- "outputs = model.generate(\n",
390
- " input_ids=inputs,\n",
391
- " max_new_tokens=512,\n",
392
- " temperature=0.7,\n",
393
- " top_p=0.9,\n",
394
- " do_sample=True,\n",
395
- " pad_token_id=tokenizer.pad_token_id,\n",
396
- " eos_token_id=tokenizer.eos_token_id,\n",
397
- ")\n",
398
  "\n",
399
  "response = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
400
  "reply = response.split(\"user\")[-1].split(\"assistant\")[-1].strip()\n",
401
  "print(reply[:800])"
402
  ]
403
  },
404
- {
405
- "cell_type": "markdown",
406
- "metadata": {},
407
- "source": [
408
- "## πŸ”Ÿ Quick Benchmark – CyberMetric Sample"
409
- ]
410
- },
411
- {
412
- "cell_type": "code",
413
- "execution_count": null,
414
- "metadata": {},
415
- "outputs": [],
416
- "source": [
417
- "benchmark_q = (\n",
418
- " \"Which of the following is the MOST effective defense against SQL injection?\\n\"\n",
419
- " \"A) Input validation only\\n\"\n",
420
- " \"B) Parameterized queries\\n\"\n",
421
- " \"C) Escaping special characters\\n\"\n",
422
- " \"D) Client-side filtering\\n\"\n",
423
- " \"Answer with the letter only.\"\n",
424
- ")\n",
425
- "\n",
426
- "bench_msgs = [\n",
427
- " {\"role\": \"system\", \"content\": \"You are a cybersecurity expert. Answer accurately.\"},\n",
428
- " {\"role\": \"user\", \"content\": benchmark_q},\n",
429
- "]\n",
430
- "\n",
431
- "inputs = tokenizer.apply_chat_template(bench_msgs, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\").to(model.device)\n",
432
- "\n",
433
- "outputs = model.generate(input_ids=inputs, max_new_tokens=64, temperature=0.1, do_sample=True,\n",
434
- " pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id)\n",
435
- "\n",
436
- "answer = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
437
- "print(\"πŸ“Š Benchmark Answer:\")\n",
438
- "print(answer.split(\"assistant\")[-1].strip())"
439
- ]
440
- },
441
  {
442
  "cell_type": "markdown",
443
  "metadata": {},
444
  "source": [
445
  "---\n",
446
- "## πŸ“š References\n",
447
  "\n",
448
  "| Resource | Link |\n",
449
  "|----------|------|\n",
450
  "| **Gemma 4 Paper** | https://storage.googleapis.com/deepmind-media/gemma/gemma-4-report.pdf |\n",
451
  "| **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |\n",
452
  "| **Unsloth Gemma-4 Train** | https://unsloth.ai/docs/models/gemma-4/train |\n",
453
- "| **Official Colab** | https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E2B)-Text.ipynb |\n",
454
- "| **Fenrir Dataset** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |\n",
455
- "| **Trendyol Dataset** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |\n",
 
 
456
  "\n",
457
  "---\n",
458
- "*Built with ❀️ for the cybersecurity community. Use responsibly.*"
459
  ]
460
  }
461
  ],
 
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
+ "# πŸ” Ultimate LLM Fine-Tuning – Gemma 4 E2B (Colab Free Tier T4)\n",
8
  "\n",
9
  "**πŸ₯‡ Model:** [Google Gemma 4 E2B](https://huggingface.co/google/gemma-4-E2B-it) via Unsloth 4-bit \n",
10
+ "**πŸ† Why this model?** Dense ~2B parameter edge model. NOT an MoE β€” all 2B params are active. \n",
11
+ "**⚠️ T4 WARNING:** This is **tight on 16GB VRAM**. The 4-bit model uses ~7.4GB. Follow memory settings strictly. \n",
12
+ "**πŸ“Š Datasets:** Your choice β€” cybersecurity, general chat, multilingual, coding, or mix them! \n",
13
  "**⚑ Framework:** Unsloth + TRL SFTTrainer \n",
14
  "\n",
15
+ "> Pick any dataset below. Default is cybersecurity. Mix datasets for hybrid tuning.\n",
16
  "\n",
17
  "---\n",
18
  "\n",
19
+ "## πŸ“‹ Gemma-4 E2B T4 Notes\n",
20
  "\n",
21
  "| Spec | Value |\n",
22
  "|------|-------|\n",
23
  "| Parameters | ~2B (dense, NOT MoE) |\n",
24
  "| 4-bit VRAM | ~7.4 GB |\n",
 
25
  "| Batch size on T4 | **1 only** |\n",
26
+ "| Max seq length | **2048 max** |\n",
27
  "| LoRA rank | **8** (save VRAM) |\n",
28
  "\n",
29
  "**Unsloth docs:** https://unsloth.ai/docs/models/gemma-4/train \n",
 
70
  "source": [
71
  "## 3️⃣ Load Gemma-4 E2B in 4-bit via Unsloth\n",
72
  "\n",
73
+ "**⚠️ MEMORY LIMITS:**\n",
74
  "\n",
75
  "| Setting | Value | Why |\n",
76
  "|---------|-------|-----|\n",
77
  "| `BATCH_SIZE` | **1** | Cannot fit >1 on T4 |\n",
78
+ "| `MAX_SEQ_LENGTH` | **2048** | Longer = OOM |\n",
79
+ "| `LORA_R` | **8** | Small rank saves VRAM |\n",
80
+ "| `GRAD_ACCUM` | **8** | Effective batch = 8 |\n",
81
+ "| `PACKING` | **False** | Safer memory profile |\n",
82
  "| `optim` | `adamw_8bit` | Must use 8-bit optimizer |\n",
83
  "\n",
84
  "If you still OOM: lower `MAX_SEQ_LENGTH` to 1024, or use `use_rslora=True`."
 
97
  "MAX_SEQ_LENGTH = 2048 # DO NOT exceed 2048 on T4\n",
98
  "LORA_R = 8 # small rank for memory\n",
99
  "LORA_ALPHA = 8 \n",
100
+ "BATCH_SIZE = 1 # MUST be 1 on T4\n",
101
  "GRAD_ACCUM = 8 # effective batch = 8\n",
102
  "LEARNING_RATE = 2e-4 \n",
 
103
  "MAX_STEPS = 4000 \n",
104
+ "WARMUP_STEPS = 100 \n",
105
  "LOGGING_STEPS = 50 \n",
106
  "SAVE_STEPS = 500 \n",
107
+ "PACKING = False # False = safer memory\n",
108
  "SAMPLE_SIZE = 50000 \n",
109
+ "HUB_MODEL_ID = \"your-username/gemma4-e2b-lora\"\n",
110
  "# ================================================================================\n",
111
  "\n",
 
 
112
  "MODEL_NAME = \"unsloth/gemma-4-E2B-it-unsloth-bnb-4bit\" # ~7.6GB download\n",
113
  "\n",
114
  "model, tokenizer = FastLanguageModel.from_pretrained(\n",
115
  " model_name=MODEL_NAME,\n",
116
  " max_seq_length=MAX_SEQ_LENGTH,\n",
117
+ " dtype=None,\n",
118
  " load_in_4bit=True,\n",
119
  ")\n",
120
  "\n",
 
124
  " target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
125
  " \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
126
  " lora_alpha=LORA_ALPHA,\n",
127
+ " lora_dropout=0,\n",
128
  " bias=\"none\",\n",
129
  " use_gradient_checkpointing=\"unsloth\", # CRITICAL for T4\n",
130
  " random_state=3407,\n",
131
+ " use_rslora=False,\n",
132
  " loftq_config=None,\n",
133
  ")\n",
134
  "\n",
135
  "trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
136
  "total = sum(p.numel() for p in model.parameters())\n",
137
  "print(f\"βœ… Gemma-4 E2B loaded. Trainable params: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)\")\n",
138
+ "print(f\"⚠️ Expected training VRAM: ~12-14 GB (out of 16 GB)\")\n",
139
  "print(f\" If you get OOM, lower MAX_SEQ_LENGTH to 1024 or set use_rslora=True\")"
140
  ]
141
  },
 
143
  "cell_type": "markdown",
144
  "metadata": {},
145
  "source": [
146
+ "## 4️⃣ 🎯 CHOOSE YOUR DATASET(S)\n",
147
+ "\n",
148
+ "Uncomment **ONE** `DATASET_CHOICE` line. Mix datasets with `custom_mix`.\n",
149
+ "\n",
150
+ "| Choice | Dataset | Size | Format | Best For |\n",
151
+ "|--------|---------|------|--------|----------|\n",
152
+ "| `\"cybersecurity\"` | Fenrir + Trendyol | 153K | system/user/assistant | **Ethical hacking education** |\n",
153
+ "| `\"ultrachat\"` | UltraChat 200K SFT | 200K | messages | General conversation |\n",
154
+ "| `\"openhermes\"` | OpenHermes 2.5 | 1M+ | conversations | Reasoning, coding |\n",
155
+ "| `\"sharegpt_en\"` | ShareGPT English | ~90K | conversations | Multi-turn dialogue |\n",
156
+ "| `\"sharegpt_de\"` | ShareGPT German | ~104K | conversations | German fine-tuning |\n",
157
+ "| `\"sharegpt_hi\"` | ShareGPT Hindi | ~153K | conversations | Hindi fine-tuning |\n",
158
+ "| `\"custom_mix\"` | Your mix | β€” | varies | Combine multiple |"
159
  ]
160
  },
161
  {
 
165
  "outputs": [],
166
  "source": [
167
  "from datasets import load_dataset, concatenate_datasets\n",
168
+ "\n",
169
+ "# ═══════════════════════════════════════════════════════════════\n",
170
+ "# SELECT YOUR DATASET β€” UNCOMMENT ONE LINE\n",
171
+ "# ═══════════════════════════════════════════════════════════════\n",
172
+ "\n",
173
+ "DATASET_CHOICE = \"cybersecurity\"\n",
174
+ "\n",
175
+ "# DATASET_CHOICE = \"ultrachat\"\n",
176
+ "# DATASET_CHOICE = \"openhermes\"\n",
177
+ "# DATASET_CHOICE = \"sharegpt_en\"\n",
178
+ "# DATASET_CHOICE = \"sharegpt_de\"\n",
179
+ "# DATASET_CHOICE = \"sharegpt_hi\"\n",
180
+ "# DATASET_CHOICE = \"custom_mix\"\n",
181
+ "\n",
182
+ "CUSTOM_DATASETS = [\n",
183
+ " (\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", \"train\", 10000, \"messages\"),\n",
184
+ " (\"HuggingFaceH4/ultrachat_200k\", \"train_sft\", 20000, \"messages\"),\n",
185
+ " (\"teknium/OpenHermes-2.5\", \"train\", 20000, \"conversations\"),\n",
186
+ "]\n",
187
+ "\n",
188
+ "print(f\"🎯 DATASET_CHOICE = {DATASET_CHOICE}\")"
189
+ ]
190
+ },
191
+ {
192
+ "cell_type": "markdown",
193
+ "metadata": {},
194
+ "source": [
195
+ "## 5️⃣ Load, Convert & Pre-process Selected Dataset"
196
+ ]
197
+ },
198
+ {
199
+ "cell_type": "code",
200
+ "execution_count": null,
201
+ "metadata": {},
202
+ "outputs": [],
203
+ "source": [
204
  "import random\n",
205
  "\n",
206
+ "def _convert_fenrir(example):\n",
207
+ " return {\"messages\": [\n",
208
+ " {\"role\": \"system\", \"content\": example[\"system\"]},\n",
209
+ " {\"role\": \"user\", \"content\": example[\"user\"]},\n",
210
+ " {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
211
+ " ]}\n",
212
+ "\n",
213
+ "def _convert_trendyol(example):\n",
214
+ " return {\"messages\": [\n",
215
+ " {\"role\": \"system\", \"content\": example[\"system\"]},\n",
216
+ " {\"role\": \"user\", \"content\": example[\"user\"]},\n",
217
+ " {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
218
+ " ]}\n",
219
+ "\n",
220
+ "def _convert_ultrachat(example):\n",
221
+ " return {\"messages\": example[\"messages\"]}\n",
222
+ "\n",
223
+ "def _convert_conversations(example):\n",
224
+ " msgs = []\n",
225
+ " system = example.get(\"system_prompt\", \"\") or example.get(\"system\", \"\")\n",
226
+ " if system:\n",
227
+ " msgs.append({\"role\": \"system\", \"content\": system})\n",
228
+ " for turn in example[\"conversations\"]:\n",
229
+ " role = \"user\" if turn[\"from\"] in (\"human\", \"user\") else \"assistant\"\n",
230
+ " msgs.append({\"role\": role, \"content\": turn[\"value\"]})\n",
231
+ " return {\"messages\": msgs}\n",
232
+ "\n",
233
+ "all_datasets = []\n",
234
+ "\n",
235
+ "if DATASET_CHOICE == \"cybersecurity\":\n",
236
+ " ds1 = load_dataset(\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", split=\"train\")\n",
237
+ " ds1 = ds1.map(_convert_fenrir, remove_columns=ds1.column_names, batched=False)\n",
238
+ " all_datasets.append(ds1)\n",
239
+ " ds2 = load_dataset(\"Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset\", split=\"train\")\n",
240
+ " ds2 = ds2.map(_convert_trendyol, remove_columns=ds2.column_names, batched=False)\n",
241
+ " all_datasets.append(ds2)\n",
242
+ "\n",
243
+ "elif DATASET_CHOICE == \"ultrachat\":\n",
244
+ " ds = load_dataset(\"HuggingFaceH4/ultrachat_200k\", split=\"train_sft\")\n",
245
+ " ds = ds.map(_convert_ultrachat, remove_columns=ds.column_names, batched=False)\n",
246
+ " all_datasets.append(ds)\n",
247
+ "\n",
248
+ "elif DATASET_CHOICE == \"openhermes\":\n",
249
+ " ds = load_dataset(\"teknium/OpenHermes-2.5\", split=\"train\")\n",
250
+ " ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
251
+ " all_datasets.append(ds)\n",
252
+ "\n",
253
+ "elif DATASET_CHOICE.startswith(\"sharegpt_\"):\n",
254
+ " split_map = {\"sharegpt_en\": \"english\", \"sharegpt_de\": \"german_4b_translated\", \"sharegpt_hi\": \"hindi_27b_translated\"}\n",
255
+ " ds = load_dataset(\"deepmage121/ShareGPT_multilingual\", split=split_map[DATASET_CHOICE])\n",
256
+ " ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
257
+ " all_datasets.append(ds)\n",
258
+ "\n",
259
+ "elif DATASET_CHOICE == \"custom_mix\":\n",
260
+ " for ds_id, split, n_rows, fmt in CUSTOM_DATASETS:\n",
261
+ " ds = load_dataset(ds_id, split=split)\n",
262
+ " if n_rows and len(ds) > n_rows:\n",
263
+ " ds = ds.shuffle(seed=3407).select(range(n_rows))\n",
264
+ " if fmt == \"messages\": ds = ds.map(_convert_ultrachat, remove_columns=ds.column_names, batched=False)\n",
265
+ " elif fmt == \"conversations\": ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
266
+ " all_datasets.append(ds)\n",
267
+ "\n",
268
+ "else:\n",
269
+ " raise ValueError(f\"Unknown DATASET_CHOICE: {DATASET_CHOICE}\")\n",
270
+ "\n",
271
+ "train_dataset = concatenate_datasets(all_datasets) if len(all_datasets) > 1 else all_datasets[0]\n",
272
  "print(f\"\\nπŸ“Š COMBINED DATASET: {len(train_dataset)} rows\")\n",
273
  "\n",
274
+ "sample = train_dataset[random.randint(0, len(train_dataset)-1)]\n",
275
+ "print(f\"Sample roles: {[m['role'] for m in sample['messages']]}\")\n",
276
+ "for m in sample[\"messages\"]: print(f\" {m['role']}: {m['content'][:80]}...\")\n",
277
+ "\n",
278
  "if len(train_dataset) > SAMPLE_SIZE:\n",
279
  " train_dataset = train_dataset.shuffle(seed=3407).select(range(SAMPLE_SIZE))\n",
280
+ " print(f\"\\nπŸš€ SUBSAMPLED to {len(train_dataset)} rows\")\n",
281
  "\n",
282
  "print(f\" Effective batch size: {BATCH_SIZE * GRAD_ACCUM}\")\n",
283
  "print(f\" Steps per epoch: ~{len(train_dataset) // (BATCH_SIZE * GRAD_ACCUM)}\")\n",
 
288
  "cell_type": "markdown",
289
  "metadata": {},
290
  "source": [
291
+ "## 6️⃣ Convert Messages β†’ Text (Chat Template)"
292
  ]
293
  },
294
  {
 
300
  "def convert_messages_to_text(examples):\n",
301
  " texts = []\n",
302
  " for msgs in examples[\"messages\"]:\n",
303
+ " text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)\n",
 
 
 
 
304
  " texts.append(text)\n",
305
  " return {\"text\": texts}\n",
306
  "\n",
307
  "print(\"πŸ”„ Converting messages to text...\")\n",
308
+ "train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=[\"messages\"], batch_size=100)\n",
 
 
 
 
 
 
309
  "print(f\"βœ… Dataset pre-processed. Columns: {train_dataset.column_names}\")\n",
310
  "print(f\"πŸ“„ Sample text length: {len(train_dataset[0]['text'])} chars\")"
311
  ]
 
314
  "cell_type": "markdown",
315
  "metadata": {},
316
  "source": [
317
+ "## 7️⃣ Configure SFT Trainer (T4-Safe Memory Settings)"
318
  ]
319
  },
320
  {
 
333
  " dataset_text_field=\"text\",\n",
334
  " max_seq_length=MAX_SEQ_LENGTH,\n",
335
  " dataset_num_proc=2,\n",
336
+ " packing=PACKING,\n",
337
  " args=TrainingArguments(\n",
338
+ " per_device_train_batch_size=BATCH_SIZE,\n",
339
+ " gradient_accumulation_steps=GRAD_ACCUM,\n",
340
  " warmup_steps=WARMUP_STEPS,\n",
341
  " max_steps=MAX_STEPS,\n",
342
  " learning_rate=LEARNING_RATE,\n",
343
+ " fp16=True,\n",
344
  " logging_steps=LOGGING_STEPS,\n",
345
+ " optim=\"adamw_8bit\",\n",
346
  " weight_decay=0.01,\n",
347
  " lr_scheduler_type=\"linear\",\n",
348
  " seed=3407,\n",
 
351
  " save_steps=SAVE_STEPS,\n",
352
  " save_total_limit=2,\n",
353
  " report_to=\"none\",\n",
 
354
  " ),\n",
355
  ")\n",
356
  "\n",
357
+ "print(f\"βœ… Trainer ready. Dataset: {DATASET_CHOICE} | Steps: {MAX_STEPS}\")\n",
358
  "print(f\" Effective batch size: {BATCH_SIZE * GRAD_ACCUM}\")\n",
359
  "print(f\" Packing enabled: {PACKING}\")\n",
360
  "print(f\" ⚠️ Expected training VRAM: ~12-14 GB (out of 16 GB)\")\n",
 
365
  "cell_type": "markdown",
366
  "metadata": {},
367
  "source": [
368
+ "## 8️⃣ Train πŸš€ (Watch for OOM!)"
369
  ]
370
  },
371
  {
 
393
  "cell_type": "markdown",
394
  "metadata": {},
395
  "source": [
396
+ "## 9️⃣ Save & Push to HuggingFace Hub"
397
  ]
398
  },
399
  {
 
402
  "metadata": {},
403
  "outputs": [],
404
  "source": [
 
405
  "model.save_pretrained(\"./gemma4-lora-adapter\")\n",
406
  "tokenizer.save_pretrained(\"./gemma4-lora-adapter\")\n",
407
  "print(\"βœ… LoRA adapter saved\")\n",
408
  "\n",
 
 
409
  "print(\"\\nπŸ”„ Merging LoRA into base model...\")\n",
410
  "merged_model = model.merge_and_unload()\n",
411
  "merged_model.save_pretrained(\"./gemma4-merged\")\n",
412
  "tokenizer.save_pretrained(\"./gemma4-merged\")\n",
413
  "print(\"βœ… Merged model saved\")\n",
414
  "\n",
 
415
  "# model.push_to_hub(HUB_MODEL_ID)\n",
416
  "# tokenizer.push_to_hub(HUB_MODEL_ID)"
417
  ]
 
420
  "cell_type": "markdown",
421
  "metadata": {},
422
  "source": [
423
+ "## πŸ”Ÿ Inference Demo"
424
  ]
425
  },
426
  {
 
431
  "source": [
432
  "FastLanguageModel.for_inference(model)\n",
433
  "\n",
434
+ "test_prompt = \"Explain how parameterized queries prevent SQL injection, with a Python example.\"\n",
435
  "\n",
436
  "messages = [\n",
437
+ " {\"role\": \"system\", \"content\": \"You are a helpful and knowledgeable assistant.\"},\n",
438
  " {\"role\": \"user\", \"content\": test_prompt},\n",
439
  "]\n",
440
  "\n",
441
+ "inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\").to(model.device)\n",
442
+ "\n",
443
+ "outputs = model.generate(input_ids=inputs, max_new_tokens=512, temperature=0.7, top_p=0.9,\n",
444
+ " do_sample=True, pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id)\n",
 
 
 
 
 
 
 
 
 
 
 
 
445
  "\n",
446
  "response = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
447
  "reply = response.split(\"user\")[-1].split(\"assistant\")[-1].strip()\n",
448
  "print(reply[:800])"
449
  ]
450
  },
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
451
  {
452
  "cell_type": "markdown",
453
  "metadata": {},
454
  "source": [
455
  "---\n",
456
+ "## πŸ“š Dataset & Model References\n",
457
  "\n",
458
  "| Resource | Link |\n",
459
  "|----------|------|\n",
460
  "| **Gemma 4 Paper** | https://storage.googleapis.com/deepmind-media/gemma/gemma-4-report.pdf |\n",
461
  "| **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |\n",
462
  "| **Unsloth Gemma-4 Train** | https://unsloth.ai/docs/models/gemma-4/train |\n",
463
+ "| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |\n",
464
+ "| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |\n",
465
+ "| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |\n",
466
+ "| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |\n",
467
+ "| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |\n",
468
  "\n",
469
  "---\n",
470
+ "*Pick any dataset. Train anything. Use responsibly.*"
471
  ]
472
  }
473
  ],