asdf98
/

ethical-hacking-llm-colab

asdf98 committed 1 day ago

Commit

2aa76c3

verified

1 Parent(s): 89f71bf

Upload EthicalHacking_Gemma4_E2B_Colab.ipynb

Files changed (1)

EthicalHacking_Gemma4_E2B_Colab.ipynb +475 -0

EthicalHacking_Gemma4_E2B_Colab.ipynb ADDED

	@@ -0,0 +1,475 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 🔐 Ultimate Ethical Hacking LLM – Gemma 4 E2B (Colab Free Tier T4)\n",
+    "\n",
+    "**🥇 Model:** [Google Gemma 4 E2B](https://huggingface.co/google/gemma-4-E2B-it) via Unsloth 4-bit  \n",
+    "**🏆 Why this model?** Dense ~2B parameter edge model. NOT an MoE — all 2B params are active every forward pass. Strong reasoning for its size.  \n",
+    "**⚠️ T4 WARNING:** This is **tight on 16GB VRAM**. The 4-bit model alone uses ~7.4GB. You MUST follow the memory-optimized settings below.  \n",
+    "**📊 Datasets:** [Fenrir v2.1](https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1) + [Trendyol Cybersecurity](https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset)  \n",
+    "**⚡ Framework:** Unsloth + TRL SFTTrainer  \n",
+    "\n",
+    "> ⚠️ **Disclaimer:** This notebook uses defensive cybersecurity datasets only and is intended for ethical hacking education.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "## 📋 Why Gemma-4 E2B?\n",
+    "\n",
+    "| Spec | Value |\n",
+    "|------|-------|\n",
+    "| Parameters | ~2B (dense, NOT MoE) |\n",
+    "| 4-bit VRAM | ~7.4 GB |\n",
+    "| Context | Up to 256K tokens |\n",
+    "| Batch size on T4 | **1 only** |\n",
+    "| Max seq length | **2048 max** on T4 |\n",
+    "| LoRA rank | **8** (save VRAM) |\n",
+    "\n",
+    "**Unsloth docs:** https://unsloth.ai/docs/models/gemma-4/train  \n",
+    "**Official notebook:** https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E2B)-Text.ipynb"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1️⃣ Install Dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture\n",
+    "!pip install -q unsloth trl datasets accelerate transformers bitsandbytes huggingface_hub"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2️⃣ (Optional) Login to HuggingFace Hub"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from huggingface_hub import login\n",
+    "# login(token=\"hf_YOUR_TOKEN\")   # ← uncomment and paste your token"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3️⃣ Load Gemma-4 E2B in 4-bit via Unsloth\n",
+    "\n",
+    "**⚠️ T4 MEMORY LIMITS — READ CAREFULLY:**\n",
+    "\n",
+    "| Setting | Value | Why |\n",
+    "|---------|-------|-----|\n",
+    "| `BATCH_SIZE` | **1** | Cannot fit >1 on T4 |\n",
+    "| `MAX_SEQ_LENGTH` | **2048** | Longer = OOM during backprop |\n",
+    "| `LORA_R` | **8** | Small rank = fewer adapter params |\n",
+    "| `GRAD_ACCUM` | **8** | Effective batch still = 8 |\n",
+    "| `PACKING` | **False** | Avoids complex memory spikes |\n",
+    "| `optim` | `adamw_8bit` | 8-bit optimizer states (a small saving, since only LoRA params are optimized) |\n",
+    "\n",
+    "If you still OOM: lower `MAX_SEQ_LENGTH` to 1024, or lower `LORA_R` to 4. (`use_rslora=True` only changes the LoRA scaling factor; it does not reduce memory.)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from unsloth import FastLanguageModel\n",
+    "import torch\n",
+    "\n",
+    "# ==================== T4-COLAB HYPERPARAMETERS (Gemma-4 E2B) ====================\n",
+    "MAX_SEQ_LENGTH = 2048          # DO NOT exceed 2048 on T4\n",
+    "LORA_R = 8                     # small rank for memory\n",
+    "LORA_ALPHA = 8                 \n",
+    "BATCH_SIZE = 1                 # MUST be 1 on T4 (model is ~7.4GB in 4-bit)\n",
+    "GRAD_ACCUM = 8                 # effective batch = 8\n",
+    "LEARNING_RATE = 2e-4           \n",
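+    "NUM_EPOCHS = 1                 # not passed to the trainer; MAX_STEPS controls run length\n",
+    "MAX_STEPS = 4000               # 4000 steps x effective batch 8 = 32,000 samples seen\n",
+    "WARMUP_STEPS = 100             # LR warmup (2.5% of MAX_STEPS); unrelated to memory\n",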
+    "LOGGING_STEPS = 50             \n",
+    "SAVE_STEPS = 500               \n",
+    "PACKING = False                # False = simpler memory profile\n",
+    "SAMPLE_SIZE = 50000            \n",
+    "HUB_MODEL_ID = \"your-username/cyber-gemma4-e2b-lora\"  \n",
+    "# ================================================================================\n",
+    "\n",
+    "# NOTE: this repo ID points at a pre-quantized bitsandbytes 4-bit checkpoint.\n",
+    "# If the unsloth-bnb-4bit ID doesn't exist, try the base Unsloth ID; load_in_4bit=True then quantizes at load time.\n",
+    "MODEL_NAME = \"unsloth/gemma-4-E2B-it-unsloth-bnb-4bit\"  # ~7.6GB download\n",
+    "\n",
+    "model, tokenizer = FastLanguageModel.from_pretrained(\n",
+    "    model_name=MODEL_NAME,\n",
+    "    max_seq_length=MAX_SEQ_LENGTH,\n",
+    "    dtype=None,                   # auto-detect (fp16 on T4)\n",
+    "    load_in_4bit=True,\n",
+    ")\n",
+    "\n",
+    "model = FastLanguageModel.get_peft_model(\n",
+    "    model,\n",
+    "    r=LORA_R,\n",
+    "    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
+    "                   \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
+    "    lora_alpha=LORA_ALPHA,\n",
+    "    lora_dropout=0,               \n",
+    "    bias=\"none\",\n",
+    "    use_gradient_checkpointing=\"unsloth\",  # CRITICAL for T4\n",
+    "    random_state=3407,\n",
+    "    use_rslora=False,             # rank-stabilized scaling (alpha/sqrt(r)); does not affect memory\n",
+    "    loftq_config=None,\n",
+    ")\n",
+    "\n",
+    "trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
+    "total     = sum(p.numel() for p in model.parameters())\n",
+    "print(f\"✅ Gemma-4 E2B loaded. Trainable params: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)\")\n",
+    "print(f\"⚠️  This model is LARGE. Expected VRAM during training: ~12-14 GB\")\n",
+    "print(\"    If you get OOM, lower MAX_SEQ_LENGTH to 1024 or LORA_R to 4\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4️⃣ Load, Audit, Subsample & Merge Cybersecurity Datasets"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datasets import load_dataset, concatenate_datasets\n",
+    "import random\n",
+    "\n",
+    "# ---------- Dataset 1: Fenrir v2.1 ----------\n",
+    "print(\"📥 Loading Fenrir v2.1...\")\n",
+    "ds1 = load_dataset(\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", split=\"train\")\n",
+    "print(f\"   Rows: {len(ds1)} | Columns: {ds1.column_names}\")\n",
+    "\n",
+    "for i in random.sample(range(len(ds1)), 2):\n",
+    "    print(f\"\\n--- Sample {i} ---\")\n",
+    "    print(f\"SYSTEM: {ds1[i]['system'][:120]}...\")\n",
+    "    print(f\"USER:   {ds1[i]['user'][:120]}...\")\n",
+    "    print(f\"ASSIST: {ds1[i]['assistant'][:120]}...\")\n",
+    "\n",
+    "def fenrir_to_messages(example):\n",
+    "    return {\n",
+    "        \"messages\": [\n",
+    "            {\"role\": \"system\",    \"content\": example[\"system\"]},\n",
+    "            {\"role\": \"user\",      \"content\": example[\"user\"]},\n",
+    "            {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
+    "        ]\n",
+    "    }\n",
+    "\n",
+    "ds1 = ds1.map(fenrir_to_messages, remove_columns=ds1.column_names, batched=False)\n",
+    "\n",
+    "# ---------- Dataset 2: Trendyol ----------\n",
+    "print(\"\\n📥 Loading Trendyol Cybersecurity...\")\n",
+    "ds2 = load_dataset(\"Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset\", split=\"train\")\n",
+    "print(f\"   Rows: {len(ds2)} | Columns: {ds2.column_names}\")\n",
+    "\n",
+    "def trendyol_to_messages(example):\n",
+    "    return {\n",
+    "        \"messages\": [\n",
+    "            {\"role\": \"system\",    \"content\": example[\"system\"]},\n",
+    "            {\"role\": \"user\",      \"content\": example[\"user\"]},\n",
+    "            {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
+    "        ]\n",
+    "    }\n",
+    "\n",
+    "ds2 = ds2.map(trendyol_to_messages, remove_columns=ds2.column_names, batched=False)\n",
+    "\n",
+    "# ---------- Merge & Subsample ----------\n",
+    "train_dataset = concatenate_datasets([ds1, ds2])\n",
+    "print(f\"\\n📊 COMBINED DATASET: {len(train_dataset)} rows\")\n",
+    "\n",
+    "if len(train_dataset) > SAMPLE_SIZE:\n",
+    "    train_dataset = train_dataset.shuffle(seed=3407).select(range(SAMPLE_SIZE))\n",
+    "    print(f\"🚀 SUBSAMPLED to {len(train_dataset)} rows\")\n",
+    "\n",
+    "print(f\"   Effective batch size: {BATCH_SIZE * GRAD_ACCUM}\")\n",
+    "print(f\"   Steps per epoch: ~{len(train_dataset) // (BATCH_SIZE * GRAD_ACCUM)}\")\n",
+    "print(f\"   Capped to MAX_STEPS: {MAX_STEPS}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5️⃣ Pre-process Dataset to Text (Avoid Unsloth formatting_func issues)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def convert_messages_to_text(examples):\n",
+    "    texts = []\n",
+    "    for msgs in examples[\"messages\"]:\n",
+    "        text = tokenizer.apply_chat_template(\n",
+    "            msgs,\n",
+    "            tokenize=False,\n",
+    "            add_generation_prompt=False,\n",
+    "        )\n",
+    "        texts.append(text)\n",
+    "    return {\"text\": texts}\n",
+    "\n",
+    "print(\"🔄 Converting messages to text...\")\n",
+    "train_dataset = train_dataset.map(\n",
+    "    convert_messages_to_text,\n",
+    "    batched=True,\n",
+    "    remove_columns=[\"messages\"],\n",
+    "    batch_size=100,\n",
+    ")\n",
+    "\n",
+    "print(f\"✅ Dataset pre-processed. Columns: {train_dataset.column_names}\")\n",
+    "print(f\"📄 Sample text length: {len(train_dataset[0]['text'])} chars\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6️⃣ Configure SFT Trainer (T4-Safe Memory Settings)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from trl import SFTTrainer\n",
+    "from transformers import TrainingArguments\n",
+    "\n",
+    "trainer = SFTTrainer(\n",
+    "    model=model,\n",
+    "    tokenizer=tokenizer,\n",
+    "    train_dataset=train_dataset,\n",
+    "    dataset_text_field=\"text\",\n",
+    "    max_seq_length=MAX_SEQ_LENGTH,\n",
+    "    dataset_num_proc=2,\n",
+    "    packing=PACKING,                # False = safer for T4 with large model\n",
+    "    args=TrainingArguments(\n",
+    "        per_device_train_batch_size=BATCH_SIZE,     # MUST be 1\n",
+    "        gradient_accumulation_steps=GRAD_ACCUM,     # effective batch = 8\n",
+    "        warmup_steps=WARMUP_STEPS,\n",
+    "        max_steps=MAX_STEPS,\n",
+    "        learning_rate=LEARNING_RATE,\n",
+    "        fp16=True,                      # T4 = fp16 only\n",
+    "        logging_steps=LOGGING_STEPS,\n",
+    "        optim=\"adamw_8bit\",         # 8-bit optimizer states; saving is small with LoRA (adapter params only)\n",
+    "        weight_decay=0.01,\n",
+    "        lr_scheduler_type=\"linear\",\n",
+    "        seed=3407,\n",
+    "        output_dir=\"./outputs_gemma4\",\n",
+    "        save_strategy=\"steps\",\n",
+    "        save_steps=SAVE_STEPS,\n",
+    "        save_total_limit=2,\n",
+    "        report_to=\"none\",\n",
+    "        # gradient_checkpointing=True,    # already set via use_gradient_checkpointing in LoRA\n",
+    "    ),\n",
+    ")\n",
+    "\n",
+    "print(f\"✅ Trainer ready. Total steps: {MAX_STEPS}\")\n",
+    "print(f\"   Effective batch size: {BATCH_SIZE * GRAD_ACCUM}\")\n",
+    "print(f\"   Packing enabled: {PACKING}\")\n",
+    "print(f\"   ⚠️  Expected training VRAM: ~12-14 GB (out of 16 GB)\")\n",
+    "print(f\"   Est. time at ~0.15 it/s: ~{MAX_STEPS * 6.7 / 3600:.1f} hours\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 7️⃣ Train 🚀 (Watch for OOM!)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if torch.cuda.is_available():\n",
+    "    total_mem = torch.cuda.get_device_properties(0).total_memory / 1e9\n",
+    "    alloc = torch.cuda.memory_allocated() / 1e9\n",
+    "    print(f\"VRAM before train: {alloc:.2f} GB / {total_mem:.2f} GB ({100*alloc/total_mem:.0f}%)\")\n",
+    "    print(\"⚠️  If usage is already above 80% before training starts, expect an OOM during backprop.\")\n",
+    "\n",
+    "trainer_stats = trainer.train()\n",
+    "\n",
+    "print(\"\\n🎉 Training complete!\")\n",
+    "print(trainer_stats)\n",
+    "\n",
+    "if torch.cuda.is_available():\n",
+    "    print(f\"VRAM after train:  {torch.cuda.memory_allocated()/1e9:.2f} GB\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 8️⃣ Save & Push to HuggingFace Hub"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# 8A) Save LoRA adapter (tiny, fast)\n",
+    "model.save_pretrained(\"./gemma4-lora-adapter\")\n",
+    "tokenizer.save_pretrained(\"./gemma4-lora-adapter\")\n",
+    "print(\"✅ LoRA adapter saved\")\n",
+    "\n",
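+    "# 8B) Merge & save full model with Unsloth's helper (merges into 16-bit weights)\n",
+    "# ⚠️ Merging is RAM-heavy and may hit CPU swap on Colab. It still works, but slowly.\n",
+    "print(\"\\n🔄 Merging LoRA into base model...\")\n",
+    "# Avoid PEFT's merge_and_unload() here: it merges in place into the 4-bit weights,\n",
+    "# so `model` would no longer hold a separate adapter to push in 8C.\n",
+    "model.save_pretrained_merged(\"./gemma4-merged\", tokenizer, save_method=\"merged_16bit\")\n",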
+    "print(\"✅ Merged model saved\")\n",
+    "\n",
+    "# 8C) Push to HF Hub (uncomment if logged in)\n",
+    "# model.push_to_hub(HUB_MODEL_ID)\n",
+    "# tokenizer.push_to_hub(HUB_MODEL_ID)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 9️⃣ Inference Demo – Responsible Pentesting"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "FastLanguageModel.for_inference(model)\n",
+    "\n",
+    "test_prompt = \"How would you perform a responsible penetration test on a web application?\"\n",
+    "\n",
+    "messages = [\n",
+    "    {\"role\": \"system\", \"content\": \"You are a cybersecurity expert. Explain concepts clearly and ethically.\"},\n",
+    "    {\"role\": \"user\",     \"content\": test_prompt},\n",
+    "]\n",
+    "\n",
+    "inputs = tokenizer.apply_chat_template(\n",
+    "    messages,\n",
+    "    tokenize=True,\n",
+    "    add_generation_prompt=True,\n",
+    "    return_tensors=\"pt\",\n",
+    ").to(model.device)\n",
+    "\n",
+    "outputs = model.generate(\n",
+    "    input_ids=inputs,\n",
+    "    max_new_tokens=512,\n",
+    "    temperature=0.7,\n",
+    "    top_p=0.9,\n",
+    "    do_sample=True,\n",
+    "    pad_token_id=tokenizer.pad_token_id,\n",
+    "    eos_token_id=tokenizer.eos_token_id,\n",
+    ")\n",
+    "\n",
+    "# Decode only the newly generated tokens; splitting the full text on \"user\" would cut replies that contain that word.\n",
+    "reply = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True).strip()\n",
+    "print(reply[:800])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 🔟 Quick Benchmark – CyberMetric Sample"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "benchmark_q = (\n",
+    "    \"Which of the following is the MOST effective defense against SQL injection?\\n\"\n",
+    "    \"A) Input validation only\\n\"\n",
+    "    \"B) Parameterized queries\\n\"\n",
+    "    \"C) Escaping special characters\\n\"\n",
+    "    \"D) Client-side filtering\\n\"\n",
+    "    \"Answer with the letter only.\"\n",
+    ")\n",
+    "\n",
+    "bench_msgs = [\n",
+    "    {\"role\": \"system\", \"content\": \"You are a cybersecurity expert. Answer accurately.\"},\n",
+    "    {\"role\": \"user\",     \"content\": benchmark_q},\n",
+    "]\n",
+    "\n",
+    "inputs = tokenizer.apply_chat_template(bench_msgs, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\").to(model.device)\n",
+    "\n",
+    "outputs = model.generate(input_ids=inputs, max_new_tokens=64, do_sample=False,  # greedy decoding for a repeatable answer\n",
+    "    pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id)\n",
+    "\n",
+    "answer = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)  # new tokens only\n",
+    "print(\"📊 Benchmark Answer:\")\n",
+    "print(answer.strip())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---\n",
+    "## 📚 References\n",
+    "\n",
+    "| Resource | Link |\n",
+    "|----------|------|\n",
+    "| **Gemma 4 Paper** | https://storage.googleapis.com/deepmind-media/gemma/gemma-4-report.pdf |\n",
+    "| **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |\n",
+    "| **Unsloth Gemma-4 Train** | https://unsloth.ai/docs/models/gemma-4/train |\n",
+    "| **Official Colab** | https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E2B)-Text.ipynb |\n",
+    "| **Fenrir Dataset** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |\n",
+    "| **Trendyol Dataset** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |\n",
+    "\n",
+    "---\n",
+    "*Built with ❤️ for the cybersecurity community. Use responsibly.*"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}