asdf98 commited on
Commit
059a7fd
Β·
verified Β·
1 Parent(s): 12b6652

Upload README.md

Browse files
Files changed (1) hide show
  1. README.md +91 -91
README.md CHANGED
@@ -1,133 +1,133 @@
1
- ---
2
- tags:
3
- - ml-intern
4
- - ethical-hacking
5
- - cybersecurity
6
- - unsloth
7
- - colab
8
- ---
9
 
10
- # πŸ” Ethical Hacking LLM Fine-Tuning Collection
11
 
12
- > **Public collection of Colab-ready notebooks for fine-tuning cybersecurity/ethical hacking LLMs on Google Colab Free Tier (T4 GPU, ~16GB VRAM).**
13
 
14
  ---
15
 
16
- ## πŸ“¦ What's Included
17
 
18
- | File | Model | Description |
19
- |------|-------|-------------|
20
- | `EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb` | **Qwen3-4B-Instruct-2507** πŸ₯‡ | Best coding/reasoning under 10B. **Recommended for T4.** |
21
- | `EthicalHacking_Qwen3-8B_Colab.ipynb` | Qwen3-8B | More capacity, tighter VRAM. Simpler notebook. |
22
- | `EthicalHacking_MultiModel_Comparison_Colab.ipynb` | **Multi-model selector** | Pick between Qwen3-4B/8B or Gemma-3-4B in one notebook |
 
23
 
24
  ---
25
 
26
- ## 🚨 CRITICAL FIX: `formatting_func` Required by Unsloth
27
 
28
- If you get this error:
29
- ```
30
- RuntimeError: Unsloth: You must specify a formatting_func
31
- ```
 
 
32
 
33
- **The fix:** When using `FastLanguageModel` + `SFTTrainer`, Unsloth **requires** you to explicitly pass a `formatting_func` that converts `messages` β†’ text string:
34
 
35
- ```python
36
- def formatting_func(example):
37
- return tokenizer.apply_chat_template(
38
- example["messages"],
39
- tokenize=False, # MUST be False!
40
- add_generation_prompt=False,
41
- )
42
-
43
- trainer = SFTTrainer(
44
- model=model,
45
- train_dataset=dataset,
46
- formatting_func=formatting_func, # ← REQUIRED
47
- ...
48
- )
49
- ```
50
 
51
- All notebooks in this repo now include this fix.
52
 
53
- ---
 
 
 
 
 
 
54
 
55
- ## πŸ† Model Comparison (T4 16GB, May/June 2026)
56
 
57
- | Model | 4-bit Size | T4 Fit | Coding Benchmarks | Unsloth | Verdict |
58
- |-------|-----------|--------|------------------|---------|---------|
59
- | **Qwen3-4B-Instruct-2507** πŸ₯‡ | **3.3 GB** | βœ…βœ…βœ… Excellent | LiveCodeBench 35.1, MultiPL-E 76.8 | βœ… Confirmed | **USE THIS** |
60
- | Qwen3-8B | 7.0 GB | βœ…βœ… Good | Stronger base | βœ… Confirmed | Viable |
61
- | Gemma-3-4B | ~2.5 GB | βœ…βœ…βœ… Excellent | Decent | βœ… Confirmed | Alternative |
62
- | Gemma-4-E2B | ~7.6 GB | βœ…βœ… Good | Unverified | ⚠️ Limited | Experimental |
63
- | **Bonsai** (prism-ml) | ~0.5 GB | βœ…βœ…βœ… Excellent | Weak (MMLU ~30%) | ❌ No | **AVOID** |
64
- | **LFM2** (Liquid AI) | ~2.5 GB | βœ…βœ… Good | **Not for programming** | ❌ No | **AVOID** |
65
 
66
- ### Key Datasets Used
67
 
68
  | Dataset | Rows | Focus |
69
  |---------|------|-------|
70
- | [Fenrir v2.1](https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1) | 99,870 | Threat analysis, IR, offensive education |
71
- | [Trendyol Cybersecurity](https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset) | 53,202 | C2 analysis, forensics, 200+ topics |
 
 
72
 
73
  ---
74
 
75
- ## πŸš€ Quick Start
76
 
77
- 1. Open [Google Colab](https://colab.research.google.com)
78
- 2. **Runtime β†’ Change runtime type β†’ GPU (T4)**
79
- 3. Upload the `.ipynb` file from this repo
80
- 4. **Run all cells** β€” training takes ~1.5–2.5 hours for 1 epoch
81
 
82
- ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
83
 
84
- ## βš™οΈ T4 VRAM Optimizations Used
85
 
86
- - `load_in_4bit=True` + LoRA (r=64 for 4B, r=16 for 8B)
87
- - `adamw_8bit` optimizer
88
- - `use_gradient_checkpointing="unsloth"`
89
- - `fp16=True` (T4 has no bf16)
90
- - Batch=2, Accum=4 β†’ effective batch=8
 
 
 
91
 
92
  ---
93
 
94
- ## πŸ›‘οΈ Disclaimer
95
-
96
- All datasets are **defensive/educational** (pentesting methodology, threat analysis, incident response). Intended for **ethical hacking education and security research** only.
97
 
98
- ---
 
 
99
 
100
- ## πŸ“š References
 
 
101
 
102
- | Resource | Link |
103
- |----------|------|
104
- | Qwen3-4B-Instruct-2507 | https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 |
105
- | Unsloth 4-bit | https://huggingface.co/unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit |
106
- | Unsloth Docs | https://unsloth.ai/docs |
107
- | TRL SFTTrainer | https://huggingface.co/docs/trl/sft_trainer |
108
- | Fenrir Dataset | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
109
- | Trendyol Dataset | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |
110
- | CyberMetric Eval | https://huggingface.co/datasets/cybermetric/cybermetric-500 |
111
 
112
  ---
113
- *Built with ❀️ for the cybersecurity community. Use responsibly.*
114
-
115
- <!-- ml-intern-provenance -->
116
- ## Generated by ML Intern
117
 
118
- This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
119
 
120
- - Try ML Intern: https://smolagents-ml-intern.hf.space
121
- - Source code: https://github.com/huggingface/ml-intern
 
 
 
 
 
122
 
123
- ## Usage
124
 
125
- ```python
126
- from transformers import AutoModelForCausalLM, AutoTokenizer
127
 
128
- model_id = "asdf98/ethical-hacking-llm-colab"
129
- tokenizer = AutoTokenizer.from_pretrained(model_id)
130
- model = AutoModelForCausalLM.from_pretrained(model_id)
 
 
 
 
131
  ```
132
 
133
- For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
 
 
 
1
+ # πŸ” Ethical Hacking LLM Collection β€” Google Colab Free Tier (T4)
 
 
 
 
 
 
 
2
 
3
+ A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **defensive cybersecurity / ethical hacking** tasks using **Google Colab Free Tier (T4, 16GB VRAM)**.
4
 
5
+ > ⚠️ **All datasets are defensive/educational.** We only train on pentesting methodology, threat analysis, incident response, and CTF education β€” never malicious payloads or active attack instructions.
6
 
7
  ---
8
 
9
+ ## πŸ“š Notebooks
10
 
11
+ | Notebook | Model | Size | T4 Batch | Est. Time | Status |
12
+ |----------|-------|------|----------|-----------|--------|
13
+ | [**Qwen3-4B Ultimate**](./EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb) | `unsloth/Qwen3-4B-Instruct-2507` | 3.3GB 4-bit | **4** | ~3–4 hrs | βœ… Recommended |
14
+ | [**LFM2.5 Ultimate**](./EthicalHacking_LFM2.5_Ultimate_Colab.ipynb) | `unsloth/LFM2.5-1.2B-Instruct` | ~1GB 4-bit | **8** | ~1–2 hrs | βœ… Fastest |
15
+ | [**Gemma-4 E2B**](./EthicalHacking_Gemma4_E2B_Colab.ipynb) | `unsloth/gemma-4-E2B-it` | ~7.6GB 4-bit | **1** | ~6–8 hrs | ⚠️ Tight VRAM |
16
+ | **Bonsai (PrismML)** | See [limitations](./BONSAI_LIMITATIONS.md) | ~1GB 1-bit | N/A | N/A | ❌ Not supported |
17
 
18
  ---
19
 
20
+ ## πŸ₯‡ Model Comparison (May 2026)
21
 
22
+ | Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
23
+ |-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
24
+ | **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning ratio. Thinking toggle. |
25
+ | **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | β€” | β€” | **128K** | Fastest training. Liquid AI edge model. |
26
+ | **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | β€” | β€” | 256K | Dense (not MoE). Google edge model. |
27
+ | Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | β€” | β€” | 1-bit ternary. **Cannot train with Unsloth.** |
28
 
29
+ **Recommendation:** Start with **Qwen3-4B** for best accuracy, or **LFM2.5** for fastest experimentation.
30
 
31
+ ---
 
 
 
 
 
 
 
 
 
 
 
 
 
 
32
 
33
+ ## πŸš€ How to Use (Any Notebook)
34
 
35
+ 1. Open the notebook in **Google Colab** (click the notebook link above)
36
+ 2. Runtime β†’ Change runtime type β†’ **T4 GPU**
37
+ 3. Run cells top-to-bottom
38
+ 4. (Optional) Set your HF token in cell 2 to push the LoRA adapter
39
+ 5. The last cells show **inference demos** and a **CyberMetric benchmark**
40
+
41
+ **Zero-config:** All hyperparameters are tuned for T4. Just click ▢️ and train.
42
 
43
+ ---
44
 
45
+ ## πŸ“Š Datasets
 
 
 
 
 
 
 
46
 
47
+ Both notebooks use the same **merged + subsampled** dataset:
48
 
49
  | Dataset | Rows | Focus |
50
  |---------|------|-------|
51
+ | [Fenrir v2.1](https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1) | 99,870 | Causal reasoning, threat analysis, IR |
52
+ | [Trendyol Cybersecurity](https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset) | 53,202 | 200+ topics, C2 analysis, forensics |
53
+ | **Merged** | 153,072 | β€” |
54
+ | **Subsampled** | **50,000** | Enough for LoRA convergence |
55
 
56
  ---
57
 
58
+ ## πŸ”§ Key Technical Decisions
59
 
60
+ ### Why `dataset_text_field="text"` instead of `formatting_func`
 
 
 
61
 
62
+ Unsloth's `SFTTrainer` has issues with `formatting_func` when using `FastLanguageModel`. The cleanest fix used in all notebooks:
63
+
64
+ ```python
65
+ # Pre-convert messages β†’ text using dataset.map(batched=True)
66
+ def convert_messages_to_text(examples):
67
+ texts = []
68
+ for msgs in examples["messages"]:
69
+ text = tokenizer.apply_chat_template(msgs, tokenize=False)
70
+ texts.append(text)
71
+ return {"text": texts}
72
+
73
+ train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=["messages"])
74
+
75
+ # Then pass dataset_text_field="text" to SFTTrainer β€” no formatting_func needed
76
+ trainer = SFTTrainer(..., dataset_text_field="text")
77
+ ```
78
 
79
+ ### Speed Optimizations (Qwen3-4B v2)
80
 
81
+ | Setting | v1 | v2 | Impact |
82
+ |---------|-----|-----|--------|
83
+ | Dataset | 153K rows | **50K rows** | 3Γ— fewer steps |
84
+ | Batch size | 2 | **4** | 2Γ— throughput |
85
+ | Grad accum | 4 | **2** | Same effective batch |
86
+ | Packing | False | **True** | 2–3Γ— GPU utilization |
87
+ | Max steps | 19K (full epoch) | **4,000** | Loss already plateaus |
88
+ | **Est. time** | ~45 hrs | **~3–4 hrs** | Same quality |
89
 
90
  ---
91
 
92
+ ## πŸ“– Model-Specific Links
 
 
93
 
94
+ ### Qwen3-4B
95
+ - Model: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507
96
+ - Unsloth 4-bit: https://huggingface.co/unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit
97
 
98
+ ### LFM2.5
99
+ - Docs: https://unsloth.ai/docs/models/tutorials/lfm2.5
100
+ - Unsloth notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Liquid_LFM2_(1.2B)-Conversational.ipynb
101
 
102
+ ### Gemma-4 E2B
103
+ - Docs: https://unsloth.ai/docs/models/gemma-4/train
104
+ - Unsloth notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E2B)-Text.ipynb
 
 
 
 
 
 
105
 
106
  ---
 
 
 
 
107
 
108
+ ## ⚠️ T4 VRAM Cheat-Sheet
109
 
110
+ | Symptom | Fix |
111
+ |---------|-----|
112
+ | `CUDA out of memory` | Lower `MAX_SEQ_LENGTH` to 2048; set `BATCH_SIZE=1`; set `PACKING=False` |
113
+ | Still OOM | Enable `use_rslora=True` in LoRA config |
114
+ | Training very slow | Increase `BATCH_SIZE` if VRAM allows; enable `PACKING=True` |
115
+ | Loss not decreasing | Try `LEARNING_RATE=5e-4` or train for 2 epochs |
116
+ | Can't push to Hub | Run `login(token=...)` with a **WRITE** token |
117
 
118
+ ---
119
 
120
+ ## πŸ“‚ Repository Structure
 
121
 
122
+ ```
123
+ asdf98/ethical-hacking-llm-colab/
124
+ β”œβ”€β”€ EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb ← Best accuracy (recommended)
125
+ β”œβ”€β”€ EthicalHacking_LFM2.5_Ultimate_Colab.ipynb οΏ½οΏ½οΏ½ Fastest training
126
+ β”œβ”€β”€ EthicalHacking_Gemma4_E2B_Colab.ipynb ← Google model (tight VRAM)
127
+ β”œβ”€β”€ BONSAI_LIMITATIONS.md ← Why Bonsai can't be fine-tuned
128
+ └── README.md ← This file
129
  ```
130
 
131
+ ---
132
+
133
+ *Built with ❀️ for the cybersecurity community. Use responsibly.*