asdf98
/

ethical-hacking-llm-colab

asdf98 committed 18 days ago

Commit

3d6a9e6

verified ·

1 Parent(s): 145c629

Upload EthicalHacking_MultiModel_Comparison_Colab.ipynb

Files changed (1)

EthicalHacking_MultiModel_Comparison_Colab.ipynb +363 -0

EthicalHacking_MultiModel_Comparison_Colab.ipynb ADDED

	@@ -0,0 +1,363 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 🔐 Multi-Model Ethical Hacking Fine-Tuning – Pick Your Model\n",
+    "\n",
+    "This notebook lets you choose between multiple models for cybersecurity fine-tuning on Google Colab Free Tier (T4 GPU, ~16GB VRAM).\n",
+    "\n",
+    "**The three selectable models (Qwen3-4B, Qwen3-8B, Gemma-3-4B) load through Unsloth, which advertises up to 2× faster training and about 70% less VRAM.**\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 📊 Model Comparison Matrix (T4 16GB)\n",
+    "\n",
+    "| Model | 4-bit Size | T4 Fit | Coding Score | Unsloth Support | Verdict | Why |\n",
+    "|-------|-----------|--------|-------------|-----------------|---------|-----|\n",
+    "| **Qwen3-4B-Instruct-2507** 🥇 | 3.3 GB | ✅✅✅ Excellent | LiveCodeBench 35.1 | ✅ Confirmed | ✅ **USE THIS** | Best coding/reasoning under 10B |\n",
+    "| Qwen3-8B | 7.0 GB | ✅✅ Good | Strong base | ✅ Confirmed | ✅ Viable | More capacity, tighter VRAM |\n",
+    "| Gemma-3-4B-it | ~2.5 GB | ✅✅✅ Excellent | Decent | ✅ Confirmed | ✅ Alternative | Good for multimodal tasks |\n",
+    "| Gemma-4-E2B-it | ~7.6 GB | ✅✅ Good | Unverified | ⚠️ Limited | ⚠️ Experimental | Very new, may have issues |\n",
+    "| Bonsai-4B | ~0.5 GB | ✅✅✅ Excellent | Weak (~30% MMLU) | ❌ No | ❌ **AVOID** | Ternary weights, NOT for coding |\n",
+    "| LFM2-2.6B | ~2.5 GB | ✅✅ Good | **Not for programming** | ❌ No | ❌ **AVOID** | Officially disclaimed by Liquid AI |\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 🎯 Quick Pick\n",
+    "\n",
+    "```python\n",
+    "MODEL_CHOICE = \"qwen3-4b\"   # Options: qwen3-4b | qwen3-8b | gemma-3-4b\n",
+    "```\n",
+    "\n",
+    "> ⚠️ **Disclaimer:** This notebook trains on **defensive cybersecurity** datasets only and is intended for ethical hacking education and security research."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1️⃣ Install Dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "!pip install -q unsloth trl datasets accelerate transformers bitsandbytes huggingface_hub"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2️⃣ Choose Your Model (Edit This Cell)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# ======================== PICK YOUR MODEL ========================\n",
+    "MODEL_CHOICE = \"qwen3-4b\"  # Change this to: \"qwen3-4b\" | \"qwen3-8b\" | \"gemma-3-4b\"\n",
+    "# ================================================================\n",
+    "\n",
+    "MODEL_CONFIGS = {\n",
+    "    \"qwen3-4b\": {\n",
+    "        \"name\": \"unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit\",\n",
+    "        \"max_seq_length\": 4096,\n",
+    "        \"lora_r\": 64,\n",
+    "        \"lora_alpha\": 64,\n",
+    "        \"batch_size\": 2,\n",
+    "        \"grad_accum\": 4,\n",
+    "        \"description\": \"Best coding/reasoning under 10B. Massive VRAM headroom on T4.\",\n",
+    "    },\n",
+    "    \"qwen3-8b\": {\n",
+    "        \"name\": \"unsloth/Qwen3-8B-unsloth-bnb-4bit\",\n",
+    "        \"max_seq_length\": 2048,\n",
+    "        \"lora_r\": 16,\n",
+    "        \"lora_alpha\": 16,\n",
+    "        \"batch_size\": 1,\n",
+    "        \"grad_accum\": 4,\n",
+    "        \"description\": \"More capacity for complex exploits. Tighter VRAM on T4.\",\n",
+    "    },\n",
+    "    \"gemma-3-4b\": {\n",
+    "        \"name\": \"unsloth/gemma-3-4b-it-unsloth-bnb-4bit\",\n",
+    "        \"max_seq_length\": 2048,\n",
+    "        \"lora_r\": 32,\n",
+    "        \"lora_alpha\": 32,\n",
+    "        \"batch_size\": 2,\n",
+    "        \"grad_accum\": 4,\n",
+    "        \"description\": \"Google's Gemma 3. Good alternative with different tokenizer.\",\n",
+    "    },\n",
+    "}\n",
+    "\n",
+    "cfg = MODEL_CONFIGS[MODEL_CHOICE]\n",
+    "print(f\"🎯 Model: {MODEL_CHOICE}\")\n",
+    "print(f\"   HF ID: {cfg['name']}\")\n",
+    "print(f\"   {cfg['description']}\")\n",
+    "print(f\"   MAX_SEQ_LENGTH={cfg['max_seq_length']}, LoRA r={cfg['lora_r']}, batch={cfg['batch_size']}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3️⃣ Load Model with Unsloth"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from unsloth import FastLanguageModel\n",
+    "import torch\n",
+    "\n",
+    "MAX_SEQ_LENGTH = cfg[\"max_seq_length\"]\n",
+    "LORA_R = cfg[\"lora_r\"]\n",
+    "LORA_ALPHA = cfg[\"lora_alpha\"]\n",
+    "BATCH_SIZE = cfg[\"batch_size\"]\n",
+    "GRAD_ACCUM = cfg[\"grad_accum\"]\n",
+    "LEARNING_RATE = 2e-4\n",
+    "NUM_EPOCHS = 1\n",
+    "WARMUP_STEPS = 10\n",
+    "LOGGING_STEPS = 5\n",
+    "\n",
+    "model, tokenizer = FastLanguageModel.from_pretrained(\n",
+    "    model_name=cfg[\"name\"],\n",
+    "    max_seq_length=MAX_SEQ_LENGTH,\n",
+    "    dtype=None,                   # auto-detect\n",
+    "    load_in_4bit=True,\n",
+    ")\n",
+    "\n",
+    "model = FastLanguageModel.get_peft_model(\n",
+    "    model,\n",
+    "    r=LORA_R,\n",
+    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
+    "                   \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
+    "    lora_alpha=LORA_ALPHA,\n",
+    "    lora_dropout=0,\n",
+    "    bias=\"none\",\n",
+    "    use_gradient_checkpointing=\"unsloth\",\n",
+    "    random_state=3407,\n",
+    "    use_rslora=False,\n",
+    ")\n",
+    "\n",
+    "trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
+    "total     = sum(p.numel() for p in model.parameters())\n",
+    "print(f\"✅ Model loaded. Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4️⃣ Load & Prepare Cybersecurity Datasets"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datasets import load_dataset, concatenate_datasets\n",
+    "\n",
+    "ds1 = load_dataset(\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", split=\"train\")\n",
+    "ds2 = load_dataset(\"Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset\", split=\"train\")\n",
+    "\n",
+    "def to_messages(example):\n",
+    "    return {\"messages\": [\n",
+    "        {\"role\": \"system\", \"content\": example[\"system\"]},\n",
+    "        {\"role\": \"user\", \"content\": example[\"user\"]},\n",
+    "        {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
+    "    ]}\n",
+    "\n",
+    "ds1 = ds1.map(to_messages, remove_columns=ds1.column_names, batched=False)\n",
+    "ds2 = ds2.map(to_messages, remove_columns=ds2.column_names, batched=False)\n",
+    "train_dataset = concatenate_datasets([ds1, ds2])\n",
+    "print(f\"✅ Combined dataset: {len(train_dataset)} rows\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5️⃣ Configure SFTTrainer (with formatting_func fix for Unsloth)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from trl import SFTTrainer, SFTConfig\n",
+    "\n",
+    "# ========== CRITICAL: formatting_func required by Unsloth ==========\n",
+    "def formatting_func(example):\n",
+    "    return tokenizer.apply_chat_template(\n",
+    "        example[\"messages\"],\n",
+    "        tokenize=False,              # MUST return text string\n",
+    "        add_generation_prompt=False,\n",
+    "    )\n",
+    "# ====================================================================\n",
+    "\n",
+    "# Sequence length, packing, and dataset workers are set once in SFTConfig\n",
+    "# rather than being repeated as SFTTrainer keyword arguments.\n",
+    "training_args = SFTConfig(\n",
+    "    output_dir=f\"./outputs_{MODEL_CHOICE}\",\n",
+    "    max_length=MAX_SEQ_LENGTH,\n",
+    "    dataset_num_proc=2,\n",
+    "    packing=False,\n",
+    "    per_device_train_batch_size=BATCH_SIZE,\n",
+    "    gradient_accumulation_steps=GRAD_ACCUM,\n",
+    "    warmup_steps=WARMUP_STEPS,\n",
+    "    num_train_epochs=NUM_EPOCHS,\n",
+    "    learning_rate=LEARNING_RATE,\n",
+    "    fp16=True,                    # T4 has no native bf16 support\n",
+    "    logging_steps=LOGGING_STEPS,\n",
+    "    optim=\"adamw_8bit\",\n",
+    "    weight_decay=0.01,\n",
+    "    lr_scheduler_type=\"linear\",\n",
+    "    seed=3407,\n",
+    "    save_strategy=\"epoch\",\n",
+    "    report_to=\"none\",\n",
+    ")\n",
+    "\n",
+    "trainer = SFTTrainer(\n",
+    "    model=model,\n",
+    "    tokenizer=tokenizer,\n",
+    "    train_dataset=train_dataset,\n",
+    "    args=training_args,\n",
+    "    formatting_func=formatting_func,      # ← REQUIRED by Unsloth!\n",
+    ")\n",
+    "\n",
+    "steps_per_epoch = len(train_dataset) // (BATCH_SIZE * GRAD_ACCUM)\n",
+    "print(f\"✅ Trainer ready. Steps per epoch: ~{steps_per_epoch}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6️⃣ Train 🚀"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if torch.cuda.is_available():\n",
+    "    print(f\"VRAM before: {torch.cuda.memory_allocated()/1e9:.2f} GB / {torch.cuda.get_device_properties(0).total_memory/1e9:.2f} GB\")\n",
+    "\n",
+    "trainer_stats = trainer.train()\n",
+    "print(\"\\n🎉 Training complete!\")\n",
+    "print(trainer_stats)\n",
+    "\n",
+    "if torch.cuda.is_available():\n",
+    "    print(f\"VRAM after: {torch.cuda.memory_allocated()/1e9:.2f} GB\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7️⃣ Save & Inference"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save LoRA adapter\n",
+    "save_path = f\"./cyber-lora-{MODEL_CHOICE}\"\n",
+    "model.save_pretrained(save_path)\n",
+    "tokenizer.save_pretrained(save_path)\n",
+    "print(f\"✅ Adapter saved to {save_path}\")\n",
+    "\n",
+    "# Quick inference test\n",
+    "FastLanguageModel.for_inference(model)\n",
+    "\n",
+    "test_msgs = [\n",
+    "    {\"role\": \"system\", \"content\": \"You are a cybersecurity expert.\"},\n",
+    "    {\"role\": \"user\", \"content\": \"List the phases of a responsible web app penetration test.\"},\n",
+    "]\n",
+    "\n",
+    "inputs = tokenizer.apply_chat_template(\n",
+    "    test_msgs,\n",
+    "    tokenize=True,\n",
+    "    add_generation_prompt=True,\n",
+    "    return_dict=True,             # return input_ids together with attention_mask\n",
+    "    return_tensors=\"pt\",\n",
+    ").to(model.device)\n",
+    "\n",
+    "outputs = model.generate(\n",
+    "    **inputs,\n",
+    "    max_new_tokens=256,\n",
+    "    temperature=0.7,\n",
+    "    top_p=0.9,\n",
+    "    do_sample=True,\n",
+    "    pad_token_id=tokenizer.pad_token_id,\n",
+    "    eos_token_id=tokenizer.eos_token_id,\n",
+    ")\n",
+    "\n",
+    "# Decode only the newly generated tokens. Splitting the full text on\n",
+    "# \"assistant\" would truncate any reply that contains that word.\n",
+    "prompt_len = inputs[\"input_ids\"].shape[1]\n",
+    "reply = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True).strip()[:500]\n",
+    "print(f\"\\n📝 Test Response:\\n{reply}...\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## 🔧 Model-Specific Notes\n",
+    "\n",
+    "### Qwen3-4B / Qwen3-8B\n",
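+    "- Qwen3-8B has an `enable_thinking=True/False` chat-template toggle for deep vs. fast reasoning; Qwen3-4B-Instruct-2507 is a non-thinking-only release and does not use it\n",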
+    "- Best coding scores among sub-10B models\n",
+    "- Apache 2.0 license\n",
+    "\n",
+    "### Gemma-3-4B\n",
+    "- Google's Gemma 3 series\n",
+    "- Uses a different tokenizer and chat template from Qwen, so results may vary\n",
+    "- Good multimodal capabilities (text + vision)\n",
+    "\n",
+    "### ⚠️ NOT Recommended\n",
+    "\n",
+    "| Model | Why Avoid |\n",
+    "|-------|-----------|\n",
+    "| **Bonsai** (prism-ml) | Ternary weights (1-bit), custom architecture, no Unsloth support. MMLU ~30% — too weak for cybersecurity. |\n",
+    "| **LFM2** (Liquid AI) | Official disclaimer: \"not recommended for programming tasks.\" No Unsloth support. |\n",
+    "| Gemma-4-E2B | Too new, Unsloth support unverified for small sizes. Large variants (26B+) won't fit T4. |\n",
+    "\n",
+    "---\n",
+    "*Built with ❤️ for the cybersecurity community. Use responsibly.*"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
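
The batch settings in `MODEL_CONFIGS` determine the effective batch size and, with it, roughly how many optimizer steps one epoch takes. The sketch below reproduces that arithmetic outside the notebook. The row count is a hypothetical placeholder, not the real size of the combined dataset, and the helper name `steps_per_epoch` is introduced here for illustration only. It rounds up to count a final partial batch, whereas the notebook's own estimate uses floor division; the exact number a given trainer version reports may differ by one.

```python
import math

# Batch settings copied from MODEL_CONFIGS in the notebook.
CONFIGS = {
    "qwen3-4b": {"batch_size": 2, "grad_accum": 4},
    "qwen3-8b": {"batch_size": 1, "grad_accum": 4},
    "gemma-3-4b": {"batch_size": 2, "grad_accum": 4},
}


def effective_batch(batch_size: int, grad_accum: int) -> int:
    """Examples consumed per optimizer step on a single GPU."""
    return batch_size * grad_accum


def steps_per_epoch(num_rows: int, batch_size: int, grad_accum: int) -> int:
    """Approximate optimizer steps per epoch, counting a final partial batch."""
    return math.ceil(num_rows / effective_batch(batch_size, grad_accum))


NUM_ROWS = 10_000  # hypothetical dataset size, for illustration only

for name, c in CONFIGS.items():
    eb = effective_batch(c["batch_size"], c["grad_accum"])
    steps = steps_per_epoch(NUM_ROWS, c["batch_size"], c["grad_accum"])
    print(f"{name}: effective batch {eb}, about {steps} steps per epoch")
```

Because `qwen3-8b` halves the per-device batch size while keeping the same accumulation, it takes about twice as many optimizer steps per epoch as the other two configurations for the same data.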