asdf98 committed
Commit 00c07ae · verified · 1 Parent(s): ba76afd

Upload README.md

Files changed (1): README.md (+72 −76)
README.md CHANGED
@@ -1,12 +1,8 @@
- ---
- tags:
- - ml-intern
- ---
- # 🔐 Ethical Hacking LLM Collection — Google Colab Free Tier (T4)

- A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **defensive cybersecurity / ethical hacking** tasks using **Google Colab Free Tier (T4, 16GB VRAM)**.

- > ⚠️ **All datasets are defensive/educational.** We only train on pentesting methodology, threat analysis, incident response, and CTF education — never malicious payloads or active attack instructions.

  ---
 
@@ -25,7 +21,7 @@ A curated collection of **production-ready Colab notebooks** for fine-tuning sta
25
 
26
  | Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
27
  |-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
28
- | **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning ratio. Thinking toggle. |
29
  | **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | β€” | β€” | **128K** | Fastest training. Liquid AI edge model. |
30
  | **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | β€” | β€” | 256K | Dense (not MoE). Google edge model. |
31
  | Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | β€” | β€” | 1-bit ternary. **Cannot train with Unsloth.** |
@@ -34,36 +30,63 @@ A curated collection of **production-ready Colab notebooks** for fine-tuning sta
  ---

- ## 🚀 How to Use (Any Notebook)

- 1. Open the notebook in **Google Colab** (click the notebook link above)
- 2. Runtime → Change runtime type → **T4 GPU**
- 3. Run cells top-to-bottom
- 4. (Optional) Set your HF token in cell 2 to push the LoRA adapter
- 5. The last cells show **inference demos** and a **CyberMetric benchmark**

- **Zero-config:** All hyperparameters are tuned for T4. Just click ▶️ and train.

- ---

- ## 📊 Datasets

- Both notebooks use the same **merged + subsampled** dataset:

- | Dataset | Rows | Focus |
- |---------|------|-------|
- | [Fenrir v2.1](https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1) | 99,870 | Causal reasoning, threat analysis, IR |
- | [Trendyol Cybersecurity](https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset) | 53,202 | 200+ topics, C2 analysis, forensics |
- | **Merged** | 153,072 | — |
- | **Subsampled** | **50,000** | Enough for LoRA convergence |

  ---

- ## 🔧 Key Technical Decisions

- ### Why `dataset_text_field="text"` instead of `formatting_func`

- Unsloth's `SFTTrainer` has issues with `formatting_func` when using `FastLanguageModel`. The cleanest fix used in all notebooks:

  ```python
  # Pre-convert messages → text using dataset.map(batched=True)
@@ -76,36 +99,11 @@ def convert_messages_to_text(examples):
  train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=["messages"])

- # Then pass dataset_text_field="text" to SFTTrainer — no formatting_func needed
  trainer = SFTTrainer(..., dataset_text_field="text")
  ```

- ### Speed Optimizations (Qwen3-4B v2)
-
- | Setting | v1 | v2 | Impact |
- |---------|-----|-----|--------|
- | Dataset | 153K rows | **50K rows** | 3× fewer steps |
- | Batch size | 2 | **4** | 2× throughput |
- | Grad accum | 4 | **2** | Same effective batch |
- | Packing | False | **True** | 2–3× GPU utilization |
- | Max steps | 19K (full epoch) | **4,000** | Loss already plateaus |
- | **Est. time** | ~45 hrs | **~3–4 hrs** | Same quality |
-
- ---
-
- ## 📖 Model-Specific Links
-
- ### Qwen3-4B
- - Model: https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507
- - Unsloth 4-bit: https://huggingface.co/unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit
-
- ### LFM2.5
- - Docs: https://unsloth.ai/docs/models/tutorials/lfm2.5
- - Unsloth notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Liquid_LFM2_(1.2B)-Conversational.ipynb
-
- ### Gemma-4 E2B
- - Docs: https://unsloth.ai/docs/models/gemma-4/train
- - Unsloth notebook: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Gemma4_(E2B)-Text.ipynb

  ---
@@ -121,37 +119,35 @@ trainer = SFTTrainer(..., dataset_text_field="text")
  ---

  ## 📂 Repository Structure

  ```
  asdf98/ethical-hacking-llm-colab/
- ├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb ← Best accuracy (recommended)
  ├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb ← Fastest training
  ├── EthicalHacking_Gemma4_E2B_Colab.ipynb ← Google model (tight VRAM)
  ├── BONSAI_LIMITATIONS.md ← Why Bonsai can't be fine-tuned
  └── README.md ← This file
  ```

  ---

- *Built with ❤️ for the cybersecurity community. Use responsibly.*
-
- <!-- ml-intern-provenance -->
- ## Generated by ML Intern
-
- This model repository was generated by [ML Intern](https://github.com/huggingface/ml-intern), an agent for machine learning research and development on the Hugging Face Hub.
-
- - Try ML Intern: https://smolagents-ml-intern.hf.space
- - Source code: https://github.com/huggingface/ml-intern
-
- ## Usage
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model_id = "asdf98/ethical-hacking-llm-colab"
- tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForCausalLM.from_pretrained(model_id)
- ```
-
- For non-causal architectures, replace `AutoModelForCausalLM` with the appropriate `AutoModel` class.
+ # 🔐 General-Purpose LLM Fine-Tuning Collection — Google Colab Free Tier (T4)

+ A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **any domain** using **Google Colab Free Tier (T4, 16GB VRAM)**.

+ > Pick your model, pick your dataset, click run. Zero-config fine-tuning.

  ---

  | Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
  |-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
+ | **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning. Thinking toggle. |
  | **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | — | — | **128K** | Fastest training. Liquid AI edge model. |
  | **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | — | — | 256K | Dense (not MoE). Google edge model. |
  | Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | — | — | 1-bit ternary. **Cannot train with Unsloth.** |
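The 4-bit sizes in this table come from Unsloth's prequantized bitsandbytes checkpoints. A minimal loading sketch for a T4, assuming the `unsloth` package and the Qwen3 checkpoint named elsewhere in this README (the LoRA r/alpha values are illustrative, not the notebooks' exact settings):

```python
# Sketch: load a prequantized 4-bit checkpoint with Unsloth on a T4.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507-unsloth-bnb-4bit",  # ~3.3 GB in 4-bit
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters before fine-tuning (illustrative hyperparameters).
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```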
  ---

+ ## 📊 Dataset Selection — 7 Built-in Choices

+ Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.

+ | Choice | Dataset | Rows | Format | Best For | Language |
+ |--------|---------|------|--------|----------|----------|
+ | `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153K→50K | system/user/assistant | Ethical hacking, pentesting education | English |
+ | `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200K→50K | messages (role/content) | General conversation, chatbot | English |
+ | `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+→50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
+ | `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90K→50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
+ | `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104K→50K | conversations (human/gpt) | German language fine-tuning | **German** |
+ | `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi) | ~153K→50K | conversations (human/gpt) | Hindi language fine-tuning | **Hindi** |
+ | `custom_mix` | Your combination | — | varies | Combine datasets for hybrid tuning | Mixed |
+ ### How to Switch Datasets (in any notebook)

+ ```python
+ # In Cell 4 — uncomment ONE line:
+
+ DATASET_CHOICE = "cybersecurity"   # ← Default (defensive security)
+ # DATASET_CHOICE = "ultrachat"     # ← General chat
+ # DATASET_CHOICE = "openhermes"    # ← Reasoning & coding
+ # DATASET_CHOICE = "sharegpt_en"   # ← English dialogue
+ # DATASET_CHOICE = "sharegpt_de"   # ← German
+ # DATASET_CHOICE = "sharegpt_hi"   # ← Hindi
+ # DATASET_CHOICE = "custom_mix"    # ← Mix multiple
+ ```
+ ### Mixing Datasets (custom_mix)

+ ```python
+ CUSTOM_DATASETS = [
+     # (dataset_id, split, num_rows, format_type)
+     ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
+     ("HuggingFaceH4/ultrachat_200k", "train_sft", 20000, "messages"),
+     ("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
+ ]
+ ```
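The loader cell that consumes this list is not shown in the diff; as a rough sketch of one way to implement it with the 🤗 `datasets` library (the `flatten` helper is hypothetical, standing in for whatever format handling the notebooks actually do):

```python
# Sketch only: load each CUSTOM_DATASETS entry, subsample it, flatten each
# row to a single "text" column, and concatenate into one training set.
from datasets import load_dataset, concatenate_datasets

def flatten(example, format_type):
    # Hypothetical helper: "messages" rows use role/content keys,
    # "conversations" rows use from/value keys (see the table above).
    key = "messages" if format_type == "messages" else "conversations"
    role_key, text_key = ("role", "content") if format_type == "messages" else ("from", "value")
    turns = example[key]
    return {"text": "\n".join(f"{t[role_key]}: {t[text_key]}" for t in turns)}

parts = []
for dataset_id, split, num_rows, format_type in CUSTOM_DATASETS:
    ds = load_dataset(dataset_id, split=split)
    ds = ds.shuffle(seed=42).select(range(min(num_rows, len(ds))))
    ds = ds.map(lambda ex, ft=format_type: flatten(ex, ft),
                remove_columns=ds.column_names)
    parts.append(ds)

train_dataset = concatenate_datasets(parts).shuffle(seed=42)
```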
  ---

+ ## 🚀 How to Use (Any Notebook)

+ 1. Open the notebook in **Google Colab** (click the notebook link above)
+ 2. Runtime → Change runtime type → **T4 GPU**
+ 3. In **Cell 4**, uncomment your desired `DATASET_CHOICE`
+ 4. Run cells top-to-bottom
+ 5. (Optional) Set your HF token in **Cell 2** to push the LoRA adapter (see the sketch below)
+ 6. The last cells show **inference demos**

+ **Zero-config:** All hyperparameters are tuned for T4. Just pick a dataset and click ▶️.
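For step 5, pushing the trained adapter needs only a token with write access; a short sketch (the repo id is a placeholder, not an existing repository):

```python
# Sketch: authenticate, then push the LoRA adapter (not the base model).
from huggingface_hub import login

login(token="hf_...")  # or set the HF_TOKEN environment variable

# `model` is the PEFT-wrapped model produced during training;
# push_to_hub on it uploads only the adapter weights.
model.push_to_hub("your-username/your-lora-adapter")
tokenizer.push_to_hub("your-username/your-lora-adapter")
```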
+
+ ---
+
+ ## 🔧 Technical: Why `dataset_text_field="text"`?
+
+ Unsloth's `SFTTrainer` has issues with `formatting_func`. The clean fix:

  ```python
  # Pre-convert messages → text using dataset.map(batched=True)

  train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=["messages"])

+ # Then pass dataset_text_field="text" to SFTTrainer
  trainer = SFTTrainer(..., dataset_text_field="text")
  ```

+ All notebooks auto-detect the incoming dataset format (Fenrir, UltraChat, OpenHermes, ShareGPT).
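The body of `convert_messages_to_text` is collapsed in this diff. One plausible reading of what the hidden lines do, assuming the tokenizer's chat template is used (a reconstruction, not the notebooks' verbatim cell):

```python
# Hypothetical reconstruction of the collapsed helper above.
# Batched map: examples["messages"] is a list of chat-message lists;
# each is rendered into one flat training string.
def convert_messages_to_text(examples):
    texts = [
        tokenizer.apply_chat_template(msgs, tokenize=False)
        for msgs in examples["messages"]
    ]
    return {"text": texts}
```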
  ---

  ---

+ ## 📖 References
+
+ | Resource | Link |
+ |----------|------|
+ | **Qwen3-4B-Instruct-2507** | https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 |
+ | **LFM2.5-1.2B-Instruct** | https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct |
+ | **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |
+ | **Unsloth Docs** | https://unsloth.ai/docs |
+ | **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
+ | **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
+ | **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
+ | **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
+ | **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |
+
+ ---
+
  ## 📂 Repository Structure

  ```
  asdf98/ethical-hacking-llm-colab/
+ ├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb ← Best accuracy
  ├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb ← Fastest training
  ├── EthicalHacking_Gemma4_E2B_Colab.ipynb ← Google model (tight VRAM)
+ ├── EthicalHacking_Qwen3-8B_Colab.ipynb ← Simpler backup (8B)
+ ├── EthicalHacking_MultiModel_Comparison_Colab.ipynb ← Compare models
  ├── BONSAI_LIMITATIONS.md ← Why Bonsai can't be fine-tuned
  └── README.md ← This file
  ```

  ---

+ *Pick any dataset. Train anything. Use responsibly.*