---
tags:
- ml-intern
- ethical-hacking
- cybersecurity
- unsloth
- colab
---

# 🔐 Ethical Hacking LLM Fine-Tuning Collection

> **Public collection of Colab-ready notebooks for fine-tuning cybersecurity/ethical hacking LLMs on Google Colab Free Tier (T4 GPU, ~16 GB VRAM).**
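
Before training, it's worth confirming Colab actually assigned a T4. A quick sanity check in plain PyTorch:

```python
import torch

# Fail fast if no GPU is attached (Runtime -> Change runtime type -> T4 GPU).
assert torch.cuda.is_available(), "No CUDA GPU attached"

props = torch.cuda.get_device_properties(0)
print(torch.cuda.get_device_name(0))                  # expect "Tesla T4"
print(f"{props.total_memory / 1024**3:.1f} GB VRAM")  # ~15 GB usable on a T4
```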

---

## 📓 Notebooks

| File | Model | Description |
|------|-------|-------------|
| `EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb` | **Qwen3-4B-Instruct-2507** 🥇 | Best coding/reasoning under 10B. **Recommended for T4.** |
| `EthicalHacking_Qwen3-8B_Colab.ipynb` | Qwen3-8B | More capacity, tighter VRAM. Simpler notebook. |
| `EthicalHacking_MultiModel_Comparison_Colab.ipynb` | **Multi-model selector** | Pick between Qwen3-4B/8B or Gemma-3-4B in one notebook. |

---

## 🚨 CRITICAL FIX: `formatting_func` Required by Unsloth

If you get this error:

```
RuntimeError: Unsloth: You must specify a formatting_func
```

**The fix:** When using `FastLanguageModel` + `SFTTrainer`, Unsloth **requires** you to explicitly pass a `formatting_func` that converts `messages` → text string:

```python
def formatting_func(example):
    return tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,  # MUST be False!
        add_generation_prompt=False,
    )

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    formatting_func=formatting_func,  # ← REQUIRED
    ...
)
```

All notebooks in this repo now include this fix.
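
For reference, `formatting_func` assumes each dataset row carries a chat-style `messages` list. The field name matches what the notebooks expect; the roles and content below are made up for illustration:

```python
# Hypothetical row in the shape formatting_func expects.
example = {
    "messages": [
        {"role": "user", "content": "Outline the phases of a penetration test."},
        {"role": "assistant", "content": "1. Reconnaissance, 2. Scanning, ..."},
    ]
}

# apply_chat_template(..., tokenize=False) renders the list into a single
# training string using the model's own chat template.
print(formatting_func(example))
```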

---

## 📊 Model Comparison (T4 16 GB, May/June 2026)

| Model | 4-bit Size | T4 Fit | Coding Benchmarks | Unsloth | Verdict |
|-------|-----------|--------|-------------------|---------|---------|
| **Qwen3-4B-Instruct-2507** 🥇 | **3.3 GB** | ✅✅✅ Excellent | LiveCodeBench 35.1, MultiPL-E 76.8 | ✅ Confirmed | **USE THIS** |
| Qwen3-8B | 7.0 GB | ✅✅ Good | Stronger base | ✅ Confirmed | Viable |
| Gemma-3-4B | ~2.5 GB | ✅✅✅ Excellent | Decent | ✅ Confirmed | Alternative |
| Gemma-4-E2B | ~7.6 GB | ✅✅ Good | Unverified | ⚠️ Limited | Experimental |
| **Bonsai** (prism-ml) | ~0.5 GB | ✅✅✅ Excellent | Weak (MMLU ~30%) | ❌ No | **AVOID** |
| **LFM2** (Liquid AI) | ~2.5 GB | ✅✅ Good | **Not for programming** | ❌ No | **AVOID** |

### Key Datasets Used

| Dataset | Rows | Focus |
|---------|------|-------|
| [Fenrir v2.1](https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1) | 99,870 | Threat analysis, IR, offensive education |
| [Trendyol Cybersecurity](https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset) | 53,202 | C2 analysis, forensics, 200+ topics |
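
To reproduce the data mix outside the notebooks, a minimal sketch with 🤗 `datasets` is below. The `train` split name and a shared `messages` column are assumptions; check each dataset card before concatenating.

```python
from datasets import load_dataset, concatenate_datasets

# Repo IDs come from the table above; split name is an assumption.
fenrir = load_dataset(
    "AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", split="train"
)
trendyol = load_dataset(
    "Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset", split="train"
)

# concatenate_datasets requires identical columns; if the schemas differ,
# map both to a common {"messages": [...]} format first.
dataset = concatenate_datasets([fenrir, trendyol]).shuffle(seed=42)
print(len(dataset))  # ~153k rows if both load in full
```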

---

After researching the **latest small models** as of May/June 2026, here's the verdict: …

## ⚙️ T4 VRAM Optimizations Used

- `load_in_4bit=True` + LoRA (r=64 for 4B, r=16 for 8B)
- `adamw_8bit` optimizer
- `use_gradient_checkpointing="unsloth"`
- `fp16=True` (T4 has no bf16)
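
A sketch of how these four settings slot into an Unsloth training cell, reusing the `formatting_func` fix from above. The model name comes from the References table; batch size and step count are illustrative, not the notebooks' exact values.

```python
from unsloth import FastLanguageModel
from transformers import TrainingArguments
from trl import SFTTrainer

# 4-bit base model that fits comfortably in T4 VRAM.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# LoRA adapters: r=64 for the 4B model (drop to r=16 for the 8B).
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",  # recompute activations, save VRAM
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,            # e.g. the Fenrir + Trendyol mix above
    formatting_func=formatting_func,  # see the CRITICAL FIX section
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        optim="adamw_8bit",           # 8-bit optimizer states
        fp16=True,                    # T4 has no bf16
        max_steps=500,
        output_dir="outputs",
    ),
)
trainer.train()
```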

---

All datasets are **defensive/educational** (pentesting methodology, threat analysis, …).

## 📚 References

| Resource | Link |
|----------|------|
| Qwen3-4B-Instruct-2507 | https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 |
| Unsloth 4-bit | https://huggingface.co/unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit |
| Unsloth Docs | https://unsloth.ai/docs |
| TRL SFTTrainer | https://huggingface.co/docs/trl/sft_trainer |
| Fenrir Dataset | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
| Trendyol Dataset | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |
| CyberMetric Eval | https://huggingface.co/datasets/cybermetric/cybermetric-500 |

---

*Built with ❤️ for the cybersecurity community. Use responsibly.*