asdf98 committed
Commit 00c07ae · verified · 1 Parent(s): ba76afd

Upload README.md

Files changed (1): README.md (+72 −76)
README.md CHANGED
@@ -1,12 +1,8 @@
- ---
- tags:
- - ml-intern
- ---
- # 🔐 Ethical Hacking LLM Collection — Google Colab Free Tier (T4)

- A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **defensive cybersecurity / ethical hacking** tasks using **Google Colab Free Tier (T4, 16GB VRAM)**.

- > ⚠️ **All datasets are defensive/educational.** We only train on pentesting methodology, threat analysis, incident response, and CTF education — never malicious payloads or active attack instructions.

  ---
 
@@ -25,7 +21,7 @@ A curated collection of **production-ready Colab notebooks** for fine-tuning sta
25
 
26
  | Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
27
  |-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
28
- | **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning ratio. Thinking toggle. |
29
  | **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | β€” | β€” | **128K** | Fastest training. Liquid AI edge model. |
30
  | **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | β€” | β€” | 256K | Dense (not MoE). Google edge model. |
31
  | Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | β€” | β€” | 1-bit ternary. **Cannot train with Unsloth.** |
@@ -34,36 +30,63 @@ A curated collection of **production-ready Colab notebooks** for fine-tuning sta
  ---

- ## 🚀 How to Use (Any Notebook)

- 1. Open the notebook in **Google Colab** (click the notebook link above)
- 2. Runtime → Change runtime type → **T4 GPU**
- 3. Run cells top-to-bottom
- 4. (Optional) Set your HF token in cell 2 to push the LoRA adapter
- 5. The last cells show **inference demos** and a **CyberMetric benchmark**

- **Zero-config:** All hyperparameters are tuned for T4. Just click ▶️ and train.

- ---

- ## 📊 Datasets

- Both notebooks use the same **merged + subsampled** dataset:

- | Dataset | Rows | Focus |
- |---------|------|-------|
- | [Fenrir v2.1](https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1) | 99,870 | Causal reasoning, threat analysis, IR |
- | [Trendyol Cybersecurity](https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset) | 53,202 | 200+ topics, C2 analysis, forensics |
- | **Merged** | 153,072 | — |
- | **Subsampled** | **50,000** | Enough for LoRA convergence |

  ---

- ## 🔧 Key Technical Decisions

- ### Why `dataset_text_field="text"` instead of `formatting_func`

- Unsloth's `SFTTrainer` has issues with `formatting_func` when using `FastLanguageModel`. The cleanest fix used in all notebooks:

  ```python
  # Pre-convert messages → text using dataset.map(batched=True)
@@ -76,36 +99,11 @@ def convert_messages_to_text(examples):
  train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=["messages"])

- # Then pass dataset_text_field="text" to SFTTrainer — no formatting_func needed
  trainer = SFTTrainer(..., dataset_text_field="text")
  ```

- ### Speed Optimizations (Qwen3-4B v2)
-
- | Setting | v1 | v2 | Impact |
- |---------|-----|-----|--------|
- | Dataset | 153K rows | **50K rows** | 3× fewer steps |
- | Batch size | 2 | **4** | 2× throughput |
- | Grad accum | 4 | **2** | Same effective batch |
- | Packing | False | **True** | 2–3× GPU utilization |
- | Max steps | 19K (full epoch) | **4,000** | Loss already plateaus |
- | **Est. time** | ~45 hrs | **~3–4 hrs** | Same quality |
-
- ---
-
- ## 📖 Model-Specific Links
-
- ### Qwen3-4B
- - Model: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507
- - Unsloth 4-bit: https://huggingface.co/unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit
-
- ### LFM2.5
- - Docs: https://unsloth.ai/docs/models/tutorials/lfm2.5
- - Unsloth notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Liquid_LFM2_(1.2B)-Conversational.ipynb
-
- ### Gemma-4 E2B
- - Docs: https://unsloth.ai/docs/models/gemma-4/train
- - Unsloth notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E2B)-Text.ipynb

  ---
@@ -121,37 +119,35 @@ trainer = SFTTrainer(..., dataset_text_field="text")
  ---

  ## 📂 Repository Structure

  ```
  asdf98/ethical-hacking-llm-colab/
- ├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb ← Best accuracy (recommended)
  ├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb ← Fastest training
  ├── EthicalHacking_Gemma4_E2B_Colab.ipynb ← Google model (tight VRAM)
  ├── BONSAI_LIMITATIONS.md ← Why Bonsai can't be fine-tuned
  └── README.md ← This file
  ```

  ---

- *Built with ❤️ for the cybersecurity community. Use responsibly.*
-
- <!-- ml-intern-provenance -->
- ## Generated by ML Intern
-
- This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
-
- - Try ML Intern: https://smolagents-ml-intern.hf.space
- - Source code: https://github.com/huggingface/ml-intern
-
- ## Usage
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model_id = "asdf98/ethical-hacking-llm-colab"
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(model_id)
- ```
-
- For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
+ # 🔐 General-Purpose LLM Fine-Tuning Collection — Google Colab Free Tier (T4)

+ A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **any domain** using **Google Colab Free Tier (T4, 16GB VRAM)**.

+ > Pick your model, pick your dataset, click run. Zero-config fine-tuning.

  ---

  | Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
  |-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
+ | **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning. Thinking toggle. |
  | **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | — | — | **128K** | Fastest training. Liquid AI edge model. |
  | **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | — | — | 256K | Dense (not MoE). Google edge model. |
  | Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | — | — | 1-bit ternary. **Cannot train with Unsloth.** |
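The 4-bit sizes in this table come from Unsloth's prequantized bitsandbytes checkpoints. A minimal loading sketch for a T4, assuming the `unsloth` package and the Qwen3 checkpoint named elsewhere in this README (the LoRA r/alpha values are illustrative, not the notebooks' exact settings):

```python
# Sketch: load a prequantized 4-bit checkpoint with Unsloth on a T4.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit",  # ~3.3 GB in 4-bit
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters before fine-tuning (illustrative hyperparameters).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```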
  ---

+ ## 📊 Dataset Selection — 7 Built-in Choices

+ Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.

+ | Choice | Dataset | Rows | Format | Best For | Language |
+ |--------|---------|------|--------|----------|----------|
+ | `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153K→50K | system/user/assistant | Ethical hacking, pentesting education | English |
+ | `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200K→50K | messages (role/content) | General conversation, chatbot | English |
+ | `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+→50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
+ | `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90K→50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
+ | `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104K→50K | conversations (human/gpt) | German language fine-tuning | **German** |
+ | `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi) | ~153K→50K | conversations (human/gpt) | Hindi language fine-tuning | **Hindi** |
+ | `custom_mix` | Your combination | — | varies | Combine datasets for hybrid tuning | Mixed |
+ ### How to Switch Datasets (in any notebook)

+ ```python
+ # In Cell 4 — uncomment ONE line:
+
+ DATASET_CHOICE = "cybersecurity"   # ← Default (defensive security)
+ # DATASET_CHOICE = "ultrachat"     # ← General chat
+ # DATASET_CHOICE = "openhermes"    # ← Reasoning & coding
+ # DATASET_CHOICE = "sharegpt_en"   # ← English dialogue
+ # DATASET_CHOICE = "sharegpt_de"   # ← German
+ # DATASET_CHOICE = "sharegpt_hi"   # ← Hindi
+ # DATASET_CHOICE = "custom_mix"    # ← Mix multiple
+ ```
+ ### Mixing Datasets (custom_mix)

+ ```python
+ CUSTOM_DATASETS = [
+     # (dataset_id, split, num_rows, format_type)
+     ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
+     ("HuggingFaceH4/ultrachat_200k", "train_sft", 20000, "messages"),
+     ("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
+ ]
+ ```
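The loader cell that consumes this list is not shown in the diff; as a rough sketch of one way to implement it with the 🤗 `datasets` library (the `flatten` helper is hypothetical, standing in for whatever format handling the notebooks actually do):

```python
# Sketch only: load each CUSTOM_DATASETS entry, subsample it, flatten each
# row to a single "text" column, and concatenate into one training set.
from datasets import load_dataset, concatenate_datasets

def flatten(example, format_type):
    # Hypothetical helper: "messages" rows use role/content keys,
    # "conversations" rows use from/value keys (see the table above).
    key = "messages" if format_type == "messages" else "conversations"
    role_key, text_key = ("role", "content") if format_type == "messages" else ("from", "value")
    turns = example[key]
    return {"text": "\n".join(f"{t[role_key]}: {t[text_key]}" for t in turns)}

parts = []
for dataset_id, split, num_rows, format_type in CUSTOM_DATASETS:
    ds = load_dataset(dataset_id, split=split)
    ds = ds.shuffle(seed=42).select(range(min(num_rows, len(ds))))
    ds = ds.map(lambda ex, ft=format_type: flatten(ex, ft),
                remove_columns=ds.column_names)
    parts.append(ds)

train_dataset = concatenate_datasets(parts).shuffle(seed=42)
```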
  ---

+ ## 🚀 How to Use (Any Notebook)

+ 1. Open the notebook in **Google Colab** (click the notebook link above)
+ 2. Runtime → Change runtime type → **T4 GPU**
+ 3. In **Cell 4**, uncomment your desired `DATASET_CHOICE`
+ 4. Run cells top-to-bottom
+ 5. (Optional) Set your HF token in **Cell 2** to push the LoRA adapter (see the sketch below)
+ 6. The last cells show **inference demos**

+ **Zero-config:** All hyperparameters are tuned for T4. Just pick a dataset and click ▶️.
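For step 5, pushing the trained adapter needs only a token with write access; a short sketch (the repo id is a placeholder, not an existing repository):

```python
# Sketch: authenticate, then push the LoRA adapter (not the base model).
from huggingface_hub import login

login(token="hf_...")  # or set the HF_TOKEN environment variable

# `model` is the PEFT-wrapped model produced during training;
# push_to_hub on it uploads only the adapter weights.
model.push_to_hub("your-username/your-lora-adapter")
tokenizer.push_to_hub("your-username/your-lora-adapter")
```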
+
+ ---
+
+ ## 🔧 Technical: Why `dataset_text_field="text"`?
+
+ Unsloth's `SFTTrainer` has issues with `formatting_func`. The clean fix:

  ```python
  # Pre-convert messages → text using dataset.map(batched=True)

  train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=["messages"])

+ # Then pass dataset_text_field="text" to SFTTrainer
  trainer = SFTTrainer(..., dataset_text_field="text")
  ```

+ All notebooks auto-detect the incoming dataset format (Fenrir, UltraChat, OpenHermes, ShareGPT).
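The body of `convert_messages_to_text` is collapsed in this diff. One plausible reading of what the hidden lines do, assuming the tokenizer's chat template is used (a reconstruction, not the notebooks' verbatim cell):

```python
# Hypothetical reconstruction of the collapsed helper above.
# Batched map: examples["messages"] is a list of chat-message lists;
# each is rendered into one flat training string.
def convert_messages_to_text(examples):
    texts = [
        tokenizer.apply_chat_template(msgs, tokenize=False)
        for msgs in examples["messages"]
    ]
    return {"text": texts}
```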
  ---

  ---

+ ## 📖 References
+
+ | Resource | Link |
+ |----------|------|
+ | **Qwen3-4B-Instruct-2507** | https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 |
+ | **LFM2.5-1.2B-Instruct** | https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct |
+ | **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |
+ | **Unsloth Docs** | https://unsloth.ai/docs |
+ | **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
+ | **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
+ | **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
+ | **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
+ | **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |
+
+ ---
+
  ## 📂 Repository Structure

  ```
  asdf98/ethical-hacking-llm-colab/
+ ├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb ← Best accuracy
  ├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb ← Fastest training
  ├── EthicalHacking_Gemma4_E2B_Colab.ipynb ← Google model (tight VRAM)
+ ├── EthicalHacking_Qwen3-8B_Colab.ipynb ← Simpler backup (8B)
+ ├── EthicalHacking_MultiModel_Comparison_Colab.ipynb ← Compare models
  ├── BONSAI_LIMITATIONS.md ← Why Bonsai can't be fine-tuned
  └── README.md ← This file
  ```

  ---

+ *Pick any dataset. Train anything. Use responsibly.*