asdf98 committed on
Commit 3d6a9e6 · verified · 1 Parent(s): 145c629

Upload EthicalHacking_MultiModel_Comparison_Colab.ipynb

EthicalHacking_MultiModel_Comparison_Colab.ipynb ADDED
@@ -0,0 +1,363 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# 🔐 Multi-Model Ethical Hacking Fine-Tuning – Pick Your Model\n",
8
+ "\n",
9
+ "This notebook lets you choose between multiple models for cybersecurity fine-tuning on Google Colab Free Tier (T4 GPU, ~16GB VRAM).\n",
10
+ "\n",
11
+ "**The selectable models below are confirmed to work with Unsloth, which targets 2× faster training and 70% less VRAM.**\n",
12
+ "\n",
13
+ "---\n",
14
+ "\n",
15
+ "## 📊 Model Comparison Matrix (T4 16GB)\n",
16
+ "\n",
17
+ "| Model | 4-bit Size | T4 Fit | Coding Score | Unsloth Support | Verdict | Why |\n",
18
+ "|-------|-----------|--------|-------------|---------|------|-----|\n",
19
+ "| **Qwen3-4B-Instruct-2507** 🥇 | 3.3 GB | ✅✅✅ Excellent | LiveCodeBench 35.1 | ✅ Confirmed | ✅ **USE THIS** | Best coding/reasoning under 10B |\n",
20
+ "| Qwen3-8B | 7.0 GB | ✅✅ Good | Strong base | ✅ Confirmed | ✅ Viable | More capacity, tighter VRAM |\n",
21
+ "| Gemma-3-4B-it | ~2.5 GB | ✅✅✅ Excellent | Decent | ✅ Confirmed | ✅ Alternative | Good for multimodal tasks |\n",
22
+ "| Gemma-4-E2B-it | ~7.6 GB | ✅✅ Good | Unverified | ⚠️ Limited | ⚠️ Experimental | Very new, may have issues |\n",
23
+ "| Bonsai-4B | ~0.5 GB | ✅✅✅ Excellent | Weak (~30% MMLU) | ❌ No | ❌ **AVOID** | Ternary weights, NOT for coding |\n",
24
+ "| LFM2-2.6B | ~2.5 GB | ✅✅ Good | **Not for programming** | ❌ No | ❌ **AVOID** | Officially disclaimed by Liquid AI |\n",
25
+ "\n",
26
+ "---\n",
27
+ "\n",
28
+ "## 🎯 Quick Pick\n",
29
+ "\n",
30
+ "```python\n",
31
+ "MODEL_CHOICE = \"qwen3-4b\" # Options: qwen3-4b | qwen3-8b | gemma-3-4b\n",
32
+ "```\n",
33
+ "\n",
34
+ "> ⚠️ **Disclaimer:** This notebook trains only on **defensive cybersecurity** datasets and is intended for ethical hacking education and security research."
35
+ ]
36
+ },
37
+ {
38
+ "cell_type": "markdown",
39
+ "metadata": {},
40
+ "source": [
41
+ "## 1️⃣ Install Dependencies"
42
+ ]
43
+ },
44
+ {
45
+ "cell_type": "code",
46
+ "execution_count": null,
47
+ "metadata": {},
48
+ "outputs": [],
49
+ "source": [
50
+ "%%capture\n",
51
+ "!pip install -q unsloth trl datasets accelerate transformers bitsandbytes huggingface_hub"
52
+ ]
53
+ },
54
+ {
55
+ "cell_type": "markdown",
56
+ "metadata": {},
57
+ "source": [
58
+ "## 2️⃣ Choose Your Model (Edit This Cell)"
59
+ ]
60
+ },
61
+ {
62
+ "cell_type": "code",
63
+ "execution_count": null,
64
+ "metadata": {},
65
+ "outputs": [],
66
+ "source": [
67
+ "# ======================== PICK YOUR MODEL ========================\n",
68
+ "MODEL_CHOICE = \"qwen3-4b\" # Change this to: \"qwen3-4b\" | \"qwen3-8b\" | \"gemma-3-4b\"\n",
69
+ "# ================================================================\n",
70
+ "\n",
71
+ "MODEL_CONFIGS = {\n",
72
+ " \"qwen3-4b\": {\n",
73
+ " \"name\": \"unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit\",\n",
74
+ " \"max_seq_length\": 4096,\n",
75
+ " \"lora_r\": 64,\n",
76
+ " \"lora_alpha\": 64,\n",
77
+ " \"batch_size\": 2,\n",
78
+ " \"grad_accum\": 4,\n",
79
+ " \"description\": \"Best coding/reasoning under 10B. Massive VRAM headroom on T4.\",\n",
80
+ " },\n",
81
+ " \"qwen3-8b\": {\n",
82
+ " \"name\": \"unsloth/Qwen3-8B-unsloth-bnb-4bit\",\n",
83
+ " \"max_seq_length\": 2048,\n",
84
+ " \"lora_r\": 16,\n",
85
+ " \"lora_alpha\": 16,\n",
86
+ " \"batch_size\": 1,\n",
87
+ " \"grad_accum\": 4,\n",
88
+ " \"description\": \"More capacity for complex exploits. Tighter VRAM on T4.\",\n",
89
+ " },\n",
90
+ " \"gemma-3-4b\": {\n",
91
+ " \"name\": \"unsloth/gemma-3-4b-it-unsloth-bnb-4bit\",\n",
92
+ " \"max_seq_length\": 2048,\n",
93
+ " \"lora_r\": 32,\n",
94
+ " \"lora_alpha\": 32,\n",
95
+ " \"batch_size\": 2,\n",
96
+ " \"grad_accum\": 4,\n",
97
+ " \"description\": \"Google's Gemma 3. Good alternative with different tokenizer.\",\n",
98
+ " },\n",
99
+ "}\n",
100
+ "\n",
101
+ "cfg = MODEL_CONFIGS[MODEL_CHOICE]\n",
102
+ "print(f\"🎯 Model: {MODEL_CHOICE}\")\n",
103
+ "print(f\" HF ID: {cfg['name']}\")\n",
104
+ "print(f\" {cfg['description']}\")\n",
105
+ "print(f\" MAX_SEQ_LENGTH={cfg['max_seq_length']}, LoRA r={cfg['lora_r']}, batch={cfg['batch_size']}\")"
106
+ ]
107
+ },
108
+ {
109
+ "cell_type": "markdown",
110
+ "metadata": {},
111
+ "source": [
112
+ "## 3️⃣ Load Model with Unsloth"
113
+ ]
114
+ },
115
+ {
116
+ "cell_type": "code",
117
+ "execution_count": null,
118
+ "metadata": {},
119
+ "outputs": [],
120
+ "source": [
121
+ "from unsloth import FastLanguageModel\n",
122
+ "import torch\n",
123
+ "\n",
124
+ "MAX_SEQ_LENGTH = cfg[\"max_seq_length\"]\n",
125
+ "LORA_R = cfg[\"lora_r\"]\n",
126
+ "LORA_ALPHA = cfg[\"lora_alpha\"]\n",
127
+ "BATCH_SIZE = cfg[\"batch_size\"]\n",
128
+ "GRAD_ACCUM = cfg[\"grad_accum\"]\n",
129
+ "LEARNING_RATE = 2e-4\n",
130
+ "NUM_EPOCHS = 1\n",
131
+ "WARMUP_STEPS = 10\n",
132
+ "LOGGING_STEPS = 5\n",
133
+ "\n",
134
+ "model, tokenizer = FastLanguageModel.from_pretrained(\n",
135
+ " model_name=cfg[\"name\"],\n",
136
+ " max_seq_length=MAX_SEQ_LENGTH,\n",
137
+ " dtype=None, # auto-detect\n",
138
+ " load_in_4bit=True,\n",
139
+ ")\n",
140
+ "\n",
141
+ "model = FastLanguageModel.get_peft_model(\n",
142
+ " model,\n",
143
+ " r=LORA_R,\n",
144
+ " target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
145
+ " \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
146
+ " lora_alpha=LORA_ALPHA,\n",
147
+ " lora_dropout=0,\n",
148
+ " bias=\"none\",\n",
149
+ " use_gradient_checkpointing=\"unsloth\",\n",
150
+ " random_state=3407,\n",
151
+ " use_rslora=False,\n",
152
+ ")\n",
153
+ "\n",
154
+ "trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
155
+ "total = sum(p.numel() for p in model.parameters())\n",
156
+ "print(f\"✅ Model loaded. Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)\")"
157
+ ]
158
+ },
159
+ {
160
+ "cell_type": "markdown",
161
+ "metadata": {},
162
+ "source": [
163
+ "## 4️⃣ Load & Prepare Cybersecurity Datasets"
164
+ ]
165
+ },
166
+ {
167
+ "cell_type": "code",
168
+ "execution_count": null,
169
+ "metadata": {},
170
+ "outputs": [],
171
+ "source": [
172
+ "from datasets import load_dataset, concatenate_datasets\n",
173
+ "\n",
174
+ "ds1 = load_dataset(\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", split=\"train\")\n",
175
+ "ds2 = load_dataset(\"Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset\", split=\"train\")\n",
176
+ "\n",
177
+ "def to_messages(example):\n",
178
+ " return {\"messages\": [\n",
179
+ " {\"role\": \"system\", \"content\": example[\"system\"]},\n",
180
+ " {\"role\": \"user\", \"content\": example[\"user\"]},\n",
181
+ " {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
182
+ " ]}\n",
183
+ "\n",
184
+ "ds1 = ds1.map(to_messages, remove_columns=ds1.column_names, batched=False)\n",
185
+ "ds2 = ds2.map(to_messages, remove_columns=ds2.column_names, batched=False)\n",
186
+ "train_dataset = concatenate_datasets([ds1, ds2])\n",
187
+ "print(f\"✅ Combined dataset: {len(train_dataset)} rows\")"
188
+ ]
189
+ },
190
+ {
191
+ "cell_type": "markdown",
192
+ "metadata": {},
193
+ "source": [
194
+ "## 5️⃣ Configure SFTTrainer (with formatting_func fix for Unsloth)"
195
+ ]
196
+ },
197
+ {
198
+ "cell_type": "code",
199
+ "execution_count": null,
200
+ "metadata": {},
201
+ "outputs": [],
202
+ "source": [
203
+ "from trl import SFTTrainer, SFTConfig\n",
204
+ "\n",
205
+ "# ========== CRITICAL: formatting_func required by Unsloth ==========\n",
206
+ "def formatting_func(example):\n",
207
+ " return tokenizer.apply_chat_template(\n",
208
+ " example[\"messages\"],\n",
209
+ " tokenize=False, # MUST return text string\n",
210
+ " add_generation_prompt=False,\n",
211
+ " )\n",
212
+ "# ====================================================================\n",
213
+ "\n",
214
+ "training_args = SFTConfig(\n",
215
+ " output_dir=f\"./outputs_{MODEL_CHOICE}\",\n",
216
+ " max_length=MAX_SEQ_LENGTH,\n",
217
+ " per_device_train_batch_size=BATCH_SIZE,\n",
218
+ " gradient_accumulation_steps=GRAD_ACCUM,\n",
219
+ " warmup_steps=WARMUP_STEPS,\n",
220
+ " num_train_epochs=NUM_EPOCHS,\n",
221
+ " learning_rate=LEARNING_RATE,\n",
222
+ " fp16=True,\n",
223
+ " logging_steps=LOGGING_STEPS,\n",
224
+ " optim=\"adamw_8bit\",\n",
225
+ " weight_decay=0.01,\n",
226
+ " lr_scheduler_type=\"linear\",\n",
227
+ " seed=3407,\n",
228
+ " save_strategy=\"epoch\",\n",
229
+ " report_to=\"none\",\n",
230
+ ")\n",
231
+ "\n",
232
+ "trainer = SFTTrainer(\n",
233
+ " model=model,\n",
234
+ " tokenizer=tokenizer,\n",
235
+ " train_dataset=train_dataset,\n",
236
+ " args=training_args,\n",
237
+ " formatting_func=formatting_func, # ← REQUIRED by Unsloth!\n",
238
+ " max_seq_length=MAX_SEQ_LENGTH,\n",
239
+ " dataset_num_proc=2,\n",
240
+ " packing=False,\n",
241
+ ")\n",
242
+ "\n",
243
+ "steps_per_epoch = len(train_dataset) // (BATCH_SIZE * GRAD_ACCUM)\n",
244
+ "print(f\"✅ Trainer ready. Steps per epoch: ~{steps_per_epoch}\")"
245
+ ]
246
+ },
247
+ {
248
+ "cell_type": "markdown",
249
+ "metadata": {},
250
+ "source": [
251
+ "## 6️⃣ Train 🚀"
252
+ ]
253
+ },
254
+ {
255
+ "cell_type": "code",
256
+ "execution_count": null,
257
+ "metadata": {},
258
+ "outputs": [],
259
+ "source": [
260
+ "if torch.cuda.is_available():\n",
261
+ " print(f\"VRAM before: {torch.cuda.memory_allocated()/1e9:.2f} GB / {torch.cuda.get_device_properties(0).total_memory/1e9:.2f} GB\")\n",
262
+ "\n",
263
+ "trainer_stats = trainer.train()\n",
264
+ "print(\"\\n🎉 Training complete!\")\n",
265
+ "print(trainer_stats)\n",
266
+ "\n",
267
+ "if torch.cuda.is_available():\n",
268
+ " print(f\"VRAM after: {torch.cuda.memory_allocated()/1e9:.2f} GB\")"
269
+ ]
270
+ },
271
+ {
272
+ "cell_type": "markdown",
273
+ "metadata": {},
274
+ "source": [
275
+ "## 7️⃣ Save & Inference"
276
+ ]
277
+ },
278
+ {
279
+ "cell_type": "code",
280
+ "execution_count": null,
281
+ "metadata": {},
282
+ "outputs": [],
283
+ "source": [
284
+ "# Save LoRA adapter\n",
285
+ "save_path = f\"./cyber-lora-{MODEL_CHOICE}\"\n",
286
+ "model.save_pretrained(save_path)\n",
287
+ "tokenizer.save_pretrained(save_path)\n",
288
+ "print(f\"✅ Adapter saved to {save_path}\")\n",
289
+ "\n",
290
+ "# Quick inference test\n",
291
+ "FastLanguageModel.for_inference(model)\n",
292
+ "\n",
293
+ "test_msgs = [\n",
294
+ " {\"role\": \"system\", \"content\": \"You are a cybersecurity expert.\"},\n",
295
+ " {\"role\": \"user\", \"content\": \"List the phases of a responsible web app penetration test.\"},\n",
296
+ "]\n",
297
+ "\n",
298
+ "inputs = tokenizer.apply_chat_template(\n",
299
+ " test_msgs,\n",
300
+ " tokenize=True,\n",
301
+ " add_generation_prompt=True,\n",
302
+ " return_tensors=\"pt\",\n",
303
+ ").to(model.device)\n",
304
+ "\n",
305
+ "outputs = model.generate(\n",
306
+ " input_ids=inputs,\n",
307
+ " max_new_tokens=256,\n",
308
+ " temperature=0.7,\n",
309
+ " top_p=0.9,\n",
310
+ " do_sample=True,\n",
311
+ " pad_token_id=tokenizer.pad_token_id,\n",
312
+ " eos_token_id=tokenizer.eos_token_id,\n",
313
+ ")\n",
314
+ "\n",
315
+ "# Decode only the newly generated tokens so the prompt is not echoed back\n",
+ "response = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)\n",
316
+ "reply = response.strip()[:500]\n",
317
+ "print(f\"\\n📝 Test Response:\\n{reply}...\")"
318
+ ]
319
+ },
320
+ {
321
+ "cell_type": "markdown",
322
+ "metadata": {},
323
+ "source": [
324
+ "---\n",
325
+ "## 🔧 Model-Specific Notes\n",
326
+ "\n",
327
+ "### Qwen3-4B / Qwen3-8B\n",
328
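+ "- Qwen3-8B has an `enable_thinking=True/False` chat-template toggle for deep vs. fast reasoning; Qwen3-4B-Instruct-2507 is a non-thinking-only release and does not use it\n",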
329
+ "- Best coding scores among sub-10B models\n",
330
+ "- Apache 2.0 license\n",
331
+ "\n",
332
+ "### Gemma-3-4B\n",
333
+ "- Google's Gemma 3 series\n",
334
+ "- Different tokenizer than Qwen — results may vary\n",
335
+ "- Good multimodal capabilities (text + vision)\n",
336
+ "\n",
337
+ "### ⚠️ NOT Recommended\n",
338
+ "\n",
339
+ "| Model | Why Avoid |\n",
340
+ "|-------|-----------|\n",
341
+ "| **Bonsai** (prism-ml) | Ternary weights (1-bit), custom architecture, no Unsloth support. MMLU ~30% — too weak for cybersecurity. |\n",
342
+ "| **LFM2** (Liquid AI) | Official disclaimer: \"not recommended for programming tasks.\" No Unsloth support. |\n",
343
+ "| Gemma-4-E2B | Too new, Unsloth support unverified for small sizes. Large variants (26B+) won't fit T4. |\n",
344
+ "\n",
345
+ "---\n",
346
+ "*Built with ❤️ for the cybersecurity community. Use responsibly.*"
347
+ ]
348
+ }
349
+ ],
350
+ "metadata": {
351
+ "kernelspec": {
352
+ "display_name": "Python 3",
353
+ "language": "python",
354
+ "name": "python3"
355
+ },
356
+ "language_info": {
357
+ "name": "python",
358
+ "version": "3.10.12"
359
+ }
360
+ },
361
+ "nbformat": 4,
362
+ "nbformat_minor": 4
363
+ }
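
Review note on the step arithmetic in cell 5: the notebook estimates steps per epoch with floor division over `BATCH_SIZE * GRAD_ACCUM`, and `MODEL_CONFIGS[MODEL_CHOICE]` raises a bare `KeyError` on a typo. The sketch below is one possible way to make both more explicit. It assumes a single GPU, and it reproduces only the `batch_size` and `grad_accum` fields from the notebook. The helper names `resolve_config` and `steps_per_epoch` are hypothetical and are not part of the notebook, Unsloth, or TRL. The estimate rounds up, so the count reported by the trainer may differ by one depending on the library version.

```python
import math

# Hypothetical copy of the notebook's MODEL_CONFIGS, reduced to the two
# fields that matter for the step arithmetic.
MODEL_CONFIGS = {
    "qwen3-4b": {"batch_size": 2, "grad_accum": 4},
    "qwen3-8b": {"batch_size": 1, "grad_accum": 4},
    "gemma-3-4b": {"batch_size": 2, "grad_accum": 4},
}


def resolve_config(choice: str) -> dict:
    """Return the config for `choice`, or raise an error that lists the valid keys."""
    try:
        return MODEL_CONFIGS[choice]
    except KeyError:
        valid = ", ".join(sorted(MODEL_CONFIGS))
        raise ValueError(
            f"Unknown MODEL_CHOICE {choice!r}; expected one of: {valid}"
        ) from None


def steps_per_epoch(num_rows: int, batch_size: int, grad_accum: int) -> int:
    """Estimate optimizer steps for one pass over `num_rows` examples (single GPU)."""
    # Number of dataloader batches; the last one may be partial.
    micro_batches = math.ceil(num_rows / batch_size)
    # One optimizer step per `grad_accum` micro-batches, rounded up.
    return math.ceil(micro_batches / grad_accum)


if __name__ == "__main__":
    cfg = resolve_config("qwen3-4b")
    effective_batch = cfg["batch_size"] * cfg["grad_accum"]
    print(f"Effective batch size: {effective_batch}")
    print(f"Steps per epoch: ~{steps_per_epoch(10_000, cfg['batch_size'], cfg['grad_accum'])}")
```

With the `qwen3-4b` settings (batch 2, accumulation 4) the effective batch size is 8, so a hypothetical 10,000-row dataset works out to about 1,250 optimizer steps per epoch.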