{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# šŸ” Multi-Model Ethical Hacking Fine-Tuning – Pick Your Model\n", "\n", "This notebook lets you choose between multiple models for cybersecurity fine-tuning on Google Colab Free Tier (T4 GPU, ~16GB VRAM).\n", "\n", "**All models tested with Unsloth for 2Ɨ faster training + 70% less VRAM.**\n", "\n", "---\n", "\n", "## šŸ“Š Model Comparison Matrix (T4 16GB)\n", "\n", "| Model | 4-bit Size | T4 Fit | Coding Score | Unsloth | āœ…/āŒ | Why |\n", "|-------|-----------|--------|-------------|---------|------|-----|\n", "| **Qwen3-4B-Instruct-2507** šŸ„‡ | 3.3 GB | āœ…āœ…āœ… Excellent | LiveCodeBench 35.1 | āœ… Confirmed | āœ… **USE THIS** | Best coding/reasoning under 10B |\n", "| Qwen3-8B | 7.0 GB | āœ…āœ… Good | Strong base | āœ… Confirmed | āœ… Viable | More capacity, tighter VRAM |\n", "| Gemma-3-4B-it | ~2.5 GB | āœ…āœ…āœ… Excellent | Decent | āœ… Confirmed | āœ… Alternative | Good for multimodal tasks |\n", "| Gemma-4-E2B-it | ~7.6 GB | āœ…āœ… Good | Unverified | āš ļø Limited | āš ļø Experimental | Very new, may have issues |\n", "| Bonsai-4B | ~0.5 GB | āœ…āœ…āœ… Excellent | Weak (~30% MMLU) | āŒ No | āŒ **AVOID** | Ternary weights, NOT for coding |\n", "| LFM2-2.6B | ~2.5 GB | āœ…āœ… Good | **Not for programming** | āŒ No | āŒ **AVOID** | Officially disclaimed by Liquid AI |\n", "\n", "---\n", "\n", "## šŸŽÆ Quick Pick\n", "\n", "```python\n", "MODEL_CHOICE = \"qwen3-4b\" # Options: qwen3-4b | qwen3-8b | gemma-3-4b\n", "```\n", "\n", "> āš ļø **Disclaimer:** This trains on **defensive cybersecurity** datasets only. For ethical hacking education and security research." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1ļøāƒ£ Install Dependencies" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%%capture\n", "!pip install -q unsloth trl datasets accelerate transformers bitsandbytes huggingface_hub" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2ļøāƒ£ Choose Your Model (Edit This Cell)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# ======================== PICK YOUR MODEL ========================\n", "MODEL_CHOICE = \"qwen3-4b\" # Change this to: \"qwen3-4b\" | \"qwen3-8b\" | \"gemma-3-4b\"\n", "# ================================================================\n", "\n", "MODEL_CONFIGS = {\n", " \"qwen3-4b\": {\n", " \"name\": \"unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit\",\n", " \"max_seq_length\": 4096,\n", " \"lora_r\": 64,\n", " \"lora_alpha\": 64,\n", " \"batch_size\": 2,\n", " \"grad_accum\": 4,\n", " \"description\": \"Best coding/reasoning under 10B. Massive VRAM headroom on T4.\",\n", " },\n", " \"qwen3-8b\": {\n", " \"name\": \"unsloth/Qwen3-8B-unsloth-bnb-4bit\",\n", " \"max_seq_length\": 2048,\n", " \"lora_r\": 16,\n", " \"lora_alpha\": 16,\n", " \"batch_size\": 1,\n", " \"grad_accum\": 4,\n", " \"description\": \"More capacity for complex exploits. Tighter VRAM on T4.\",\n", " },\n", " \"gemma-3-4b\": {\n", " \"name\": \"unsloth/gemma-3-4b-it-unsloth-bnb-4bit\",\n", " \"max_seq_length\": 2048,\n", " \"lora_r\": 32,\n", " \"lora_alpha\": 32,\n", " \"batch_size\": 2,\n", " \"grad_accum\": 4,\n", " \"description\": \"Google's Gemma 3. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 3ļøāƒ£ Load Model with Unsloth" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from unsloth import FastLanguageModel\n",
"import torch\n",
"\n",
"MAX_SEQ_LENGTH = cfg[\"max_seq_length\"]\n",
"LORA_R = cfg[\"lora_r\"]\n",
"LORA_ALPHA = cfg[\"lora_alpha\"]\n",
"BATCH_SIZE = cfg[\"batch_size\"]\n",
"GRAD_ACCUM = cfg[\"grad_accum\"]\n",
"LEARNING_RATE = 2e-4\n",
"NUM_EPOCHS = 1\n",
"WARMUP_STEPS = 10\n",
"LOGGING_STEPS = 5\n",
"\n",
"model, tokenizer = FastLanguageModel.from_pretrained(\n",
"    model_name=cfg[\"name\"],\n",
"    max_seq_length=MAX_SEQ_LENGTH,\n",
"    dtype=None,  # auto-detect\n",
"    load_in_4bit=True,\n",
")\n",
"\n",
"model = FastLanguageModel.get_peft_model(\n",
"    model,\n",
"    r=LORA_R,\n",
"    target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
"                    \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
"    lora_alpha=LORA_ALPHA,\n",
"    lora_dropout=0,\n",
"    bias=\"none\",\n",
"    use_gradient_checkpointing=\"unsloth\",\n",
"    random_state=3407,\n",
"    use_rslora=False,\n",
")\n",
"\n",
"trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)\n",
"total = sum(p.numel() for p in model.parameters())\n",
"print(f\"āœ… Model loaded. Trainable: {trainable:,} / {total:,} ({100*trainable/total:.2f}%)\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 4ļøāƒ£ Load & Prepare Cybersecurity Datasets\n",
"\n",
"Each row is pre-processed into a plain `text` column using the model's own chat template, so `SFTTrainer` never needs a `formatting_func` (a common source of errors when combined with Unsloth)." ] },
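{ "cell_type": "markdown", "metadata": {}, "source": [ "> *Quick illustration (a sketch added for clarity, not part of the original pipeline): what `apply_chat_template` produces for one toy conversation. Exact markers differ per model: Qwen3 uses `<|im_start|>`/`<|im_end|>`, while Gemma uses `<start_of_turn>`/`<end_of_turn>` and may fold the system prompt into the first user turn.*" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Toy example: inspect the exact string the chat template will emit for training.\n",
"demo_msgs = [\n",
"    {\"role\": \"system\", \"content\": \"You are a cybersecurity expert.\"},\n",
"    {\"role\": \"user\", \"content\": \"What is a CSRF token?\"},\n",
"    {\"role\": \"assistant\", \"content\": \"A CSRF token is a per-session secret that validates form submissions.\"},\n",
"]\n",
"print(tokenizer.apply_chat_template(demo_msgs, tokenize=False, add_generation_prompt=False))" ] },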
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from datasets import load_dataset, concatenate_datasets\n",
"\n",
"ds1 = load_dataset(\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", split=\"train\")\n",
"ds2 = load_dataset(\"Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset\", split=\"train\")\n",
"\n",
"def to_messages(example):\n",
"    # Assumes both datasets expose system/user/assistant columns; adjust the keys if a schema differs.\n",
"    return {\"messages\": [\n",
"        {\"role\": \"system\", \"content\": example[\"system\"]},\n",
"        {\"role\": \"user\", \"content\": example[\"user\"]},\n",
"        {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
"    ]}\n",
"\n",
"ds1 = ds1.map(to_messages, remove_columns=ds1.column_names, batched=False)\n",
"ds2 = ds2.map(to_messages, remove_columns=ds2.column_names, batched=False)\n",
"train_dataset = concatenate_datasets([ds1, ds2])\n",
"print(f\"āœ… Messages dataset: {len(train_dataset)} rows\")\n",
"\n",
"# ========== PRE-PROCESS: messages → text with chat template ==========\n",
"def convert_messages_to_text(examples):\n",
"    \"\"\"Convert batched messages to formatted text strings.\"\"\"\n",
"    texts = []\n",
"    for msgs in examples[\"messages\"]:\n",
"        text = tokenizer.apply_chat_template(\n",
"            msgs,\n",
"            tokenize=False,\n",
"            add_generation_prompt=False,\n",
"        )\n",
"        texts.append(text)\n",
"    return {\"text\": texts}\n",
"\n",
"print(\"šŸ”„ Converting messages to text with chat template...\")\n",
"train_dataset = train_dataset.map(\n",
"    convert_messages_to_text,\n",
"    batched=True,\n",
"    remove_columns=[\"messages\"],\n",
"    batch_size=100,\n",
")\n",
"print(f\"āœ… Dataset ready with columns: {train_dataset.column_names}\")\n",
"print(f\"šŸ“„ Sample length: {len(train_dataset[0]['text'])} chars\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 5ļøāƒ£ Configure SFTTrainer (`dataset_text_field=\"text\"` – no `formatting_func`!)" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from trl import SFTTrainer\n",
"from transformers import TrainingArguments\n",
"\n",
"trainer = SFTTrainer(\n",
"    model=model,\n",
"    tokenizer=tokenizer,\n",
"    train_dataset=train_dataset,\n",
"    dataset_text_field=\"text\",  # ← standard text column, no formatting_func needed!\n",
"    max_seq_length=MAX_SEQ_LENGTH,\n",
"    dataset_num_proc=2,\n",
"    packing=False,\n",
"    # Note: recent TRL releases moved dataset_text_field/max_seq_length into trl.SFTConfig;\n",
"    # if your TRL version rejects these kwargs, pass them via SFTConfig instead.\n",
"    args=TrainingArguments(\n",
"        output_dir=f\"./outputs_{MODEL_CHOICE}\",\n",
"        per_device_train_batch_size=BATCH_SIZE,\n",
"        gradient_accumulation_steps=GRAD_ACCUM,\n",
"        warmup_steps=WARMUP_STEPS,\n",
"        num_train_epochs=NUM_EPOCHS,\n",
"        learning_rate=LEARNING_RATE,\n",
"        fp16=True,  # T4 has no bf16 support\n",
"        logging_steps=LOGGING_STEPS,\n",
"        optim=\"adamw_8bit\",\n",
"        weight_decay=0.01,\n",
"        lr_scheduler_type=\"linear\",\n",
"        seed=3407,\n",
"        save_strategy=\"epoch\",\n",
"        report_to=\"none\",\n",
"    ),\n",
")\n",
"\n",
"steps_per_epoch = len(train_dataset) // (BATCH_SIZE * GRAD_ACCUM)\n",
"print(f\"āœ… Trainer ready. Steps per epoch: ~{steps_per_epoch}\")" ] },
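{ "cell_type": "markdown", "metadata": {}, "source": [ "> *Optional pre-flight check (a sketch, assuming the `text` column built above): counts how many sampled rows exceed `MAX_SEQ_LENGTH` and would therefore be truncated during training.*" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Sample a few hundred rows and count how many would be truncated at MAX_SEQ_LENGTH.\n",
"n = min(200, len(train_dataset))\n",
"sample_texts = train_dataset.select(range(n))[\"text\"]\n",
"token_lengths = [len(tokenizer(t).input_ids) for t in sample_texts]\n",
"over = sum(l > MAX_SEQ_LENGTH for l in token_lengths)\n",
"print(f\"Longest sample: {max(token_lengths)} tokens; truncated: {over}/{n}\")\n",
"if over > n // 10:\n",
"    print(\"āš ļø >10% of samples exceed MAX_SEQ_LENGTH; consider a larger context or filtering.\")" ] },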
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 6ļøāƒ£ Train šŸš€" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "if torch.cuda.is_available():\n",
"    print(f\"VRAM before: {torch.cuda.memory_allocated()/1e9:.2f} GB / {torch.cuda.get_device_properties(0).total_memory/1e9:.2f} GB\")\n",
"\n",
"trainer_stats = trainer.train()\n",
"print(\"\\nšŸŽ‰ Training complete!\")\n",
"print(trainer_stats)\n",
"\n",
"if torch.cuda.is_available():\n",
"    # max_memory_reserved captures the training peak; memory_allocated only shows the current value.\n",
"    print(f\"Peak VRAM: {torch.cuda.max_memory_reserved()/1e9:.2f} GB\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 7ļøāƒ£ Save & Inference" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Save LoRA adapter\n",
"save_path = f\"./cyber-lora-{MODEL_CHOICE}\"\n",
"model.save_pretrained(save_path)\n",
"tokenizer.save_pretrained(save_path)\n",
"print(f\"āœ… Adapter saved to {save_path}\")\n",
"\n",
"# Quick inference test\n",
"FastLanguageModel.for_inference(model)\n",
"\n",
"test_msgs = [\n",
"    {\"role\": \"system\", \"content\": \"You are a cybersecurity expert.\"},\n",
"    {\"role\": \"user\", \"content\": \"List the phases of a responsible web app penetration test.\"},\n",
"]\n",
"\n",
"inputs = tokenizer.apply_chat_template(\n",
"    test_msgs,\n",
"    tokenize=True,\n",
"    add_generation_prompt=True,\n",
"    return_tensors=\"pt\",\n",
").to(model.device)\n",
"\n",
"outputs = model.generate(\n",
"    input_ids=inputs,\n",
"    max_new_tokens=256,\n",
"    temperature=0.7,\n",
"    top_p=0.9,\n",
"    do_sample=True,\n",
"    pad_token_id=tokenizer.pad_token_id if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,\n",
"    eos_token_id=tokenizer.eos_token_id,\n",
")\n",
"\n",
"response = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
"# Crude extraction; the \"assistant\" marker varies by chat template, so adjust per model if needed.\n",
"reply = response.split(\"assistant\")[-1].strip()[:500]\n",
"print(f\"\\nšŸ“ Test Response:\\n{reply}...\")" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "---\n",
"## šŸ”§ Model-Specific Notes\n",
"\n",
"### Qwen3-4B / Qwen3-8B\n",
"- The original Qwen3-8B offers an `enable_thinking=True/False` toggle for deep vs. fast reasoning; the Qwen3-4B-Instruct-2507 refresh runs in non-thinking mode only\n",
"- Best coding scores among sub-10B models\n",
"- Apache 2.0 license\n",
"\n",
"### Gemma-3-4B\n",
"- Google's Gemma 3 series\n",
"- Different tokenizer than Qwen — results may vary\n",
"- Good multimodal capabilities (text + vision)\n",
"\n",
"### āš ļø NOT Recommended\n",
"\n",
"| Model | Why Avoid |\n",
"|-------|-----------|\n",
"| **Bonsai** (prism-ml) | Ternary (~1.6-bit) weights, custom architecture, no Unsloth support. MMLU ~30% — too weak for cybersecurity. |\n",
"| **LFM2** (Liquid AI) | Official disclaimer: \"not recommended for programming tasks.\" No Unsloth support. |\n",
"| Gemma-3n-E2B | Very new; Unsloth support unverified for the small sizes, and the larger Gemma models (27B) won't fit a T4 at all. |\n",
"\n",
"---\n",
"*Built with ā¤ļø for the cybersecurity community. Use responsibly.*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "name": "python", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }