# 🔍 General-Purpose LLM Fine-Tuning Collection – Google Colab Free Tier (T4)

A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **any domain** using **Google Colab Free Tier (T4, 16GB VRAM)**.

> Pick your model, pick your dataset, click run. Zero-config fine-tuning.

---

## 📚 Notebooks

| Notebook | Model | Size | T4 Batch | Est. Time | Status |
|----------|-------|------|----------|-----------|--------|
| [**Qwen3-4B Ultimate**](./EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb) | `unsloth/Qwen3-4B-Instruct-2507` | 3.3GB 4-bit | **4** | ~3–4 hrs | ✅ Recommended |
| [**LFM2.5 Ultimate**](./EthicalHacking_LFM2.5_Ultimate_Colab.ipynb) | `unsloth/LFM2.5-1.2B-Instruct` | ~1GB 4-bit | **8** | ~1–2 hrs | ✅ Fastest |
| [**Gemma-4 E2B**](./EthicalHacking_Gemma4_E2B_Colab.ipynb) | `unsloth/gemma-4-E2B-it` | ~7.6GB 4-bit | **1** | ~6–8 hrs | ⚠️ Tight VRAM |
| **Bonsai (PrismML)** | See [limitations](./BONSAI_LIMITATIONS.md) | ~1GB 1-bit | N/A | N/A | ❌ Not supported |

---

## 🥇 Model Comparison (May 2026)

| Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
|-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
| **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning. Thinking toggle. |
| **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | – | – | **128K** | Fastest training. Liquid AI edge model. |
| **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | – | – | 256K | Dense (not MoE). Google edge model. |
| Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | – | – | 1-bit ternary. **Cannot train with Unsloth.** |

**Recommendation:** Start with **Qwen3-4B** for best accuracy, or **LFM2.5** for fastest experimentation.
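
A minimal loading sketch, shared by all three trainable models (the `max_seq_length` value is illustrative; each notebook sets its own in the config cell):

```python
from unsloth import FastLanguageModel

# Swap model_name for any trainable checkpoint in the table above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length=4096,   # illustrative default
    load_in_4bit=True,     # 4-bit quantization is what fits these into the T4's 16GB
)
```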

---

## 📊 Dataset Selection – 8 Built-in Choices

Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.

| Choice | Dataset | Rows | Format | Best For | Language |
|--------|---------|------|--------|----------|----------|
| `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153K→50K | system/user/assistant | **Ethical hacking, pentesting education** | English |
| `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200K→50K | messages (role/content) | General conversation, chatbot tuning | English |
| `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+→50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
| `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90K→50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
| `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104K→50K | conversations (human/gpt) | German language fine-tuning | **German** |
| `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi 27B) | ~153K→50K | conversations (human/gpt) | Hindi language fine-tuning | **Hindi** |
| `code_corpus` | **[Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training)** | 240K→50K | text (code files with domain/repo/lang metadata) | **Code completion, coding assistant** | Multi (20 domains: Rust, Python, C++, Kotlin, Flutter, game engines, web frameworks, ethical hacking repos, etc.) |
| `custom_mix` | Your combination | – | varies | Combine datasets for hybrid tuning | Mixed |

### How to Switch Datasets (in any notebook)

```python
# In Cell 4 β€” uncomment ONE line:

DATASET_CHOICE = "cybersecurity"    # ← Default (defensive security)
# DATASET_CHOICE = "ultrachat"      # ← General chat
# DATASET_CHOICE = "openhermes"     # ← Reasoning & coding
# DATASET_CHOICE = "sharegpt_en"    # ← English dialogue
# DATASET_CHOICE = "sharegpt_de"    # ← German
# DATASET_CHOICE = "sharegpt_hi"    # ← Hindi
# DATASET_CHOICE = "code_corpus"    # ← Code completion (Rust, Python, C++, etc.)
# DATASET_CHOICE = "custom_mix"     # ← Mix multiple
```

### Code Corpus Dataset Details

The [Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training) dataset contains **240,378 code files** from top open-source repositories across 20 domains:

| Domain | Examples |
|--------|----------|
| `web_ui` | Web frameworks, UI components |
| `cpp` | C++ systems code |
| `kotlin_android` | Android apps |
| `rust` | Rust systems (e.g., actix-web) |
| `python` | Python libraries |
| `ethical_hacking` | Security tools, pentesting repos |
| `game_engines` | Game development |
| `ui_ux_design` | Design systems |

Each example has: `text` (the full code file), `domain`, `repo`, `language`, `file_path`, `size_chars`. The notebook converts each code snippet into a user/assistant conversation: user asks to explain/improve the code, assistant provides the code.
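
A minimal sketch of that conversion (function name and prompt wording are illustrative, not the notebooks' exact code; assumes the split is loaded as `code_dataset`):

```python
# Hypothetical converter: wraps each raw code file into a one-turn conversation.
def code_file_to_conversation(example):
    lang = example.get("language") or "code"
    return {
        "messages": [
            {"role": "user",
             "content": f"Explain and improve this {lang} file from "
                        f"{example['repo']} ({example['file_path']}):"},
            {"role": "assistant", "content": example["text"]},
        ]
    }

code_dataset = code_dataset.map(code_file_to_conversation)
```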

### Mixing Datasets (custom_mix)

```python
CUSTOM_DATASETS = [
    # (dataset_id, split, num_rows, format_type)
    # format_type: "messages" | "conversations" | "text"
    ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
    ("krystv/code-corpus-llm-training", "train", 20000, "text"),
    ("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
]
```
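
One way such a mix could be materialized with the `datasets` library (a sketch, assuming a loaded `tokenizer` and the `CUSTOM_DATASETS` list above; the notebooks may wire this differently):

```python
from datasets import load_dataset, concatenate_datasets

ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def to_text(example, format_type):
    # Normalize each schema to a single "text" column so splits can be concatenated.
    if format_type == "messages":          # [{"role": ..., "content": ...}, ...]
        msgs = example["messages"]
    elif format_type == "conversations":   # [{"from": "human"/"gpt", "value": ...}, ...]
        msgs = [{"role": ROLE_MAP.get(t["from"], "user"), "content": t["value"]}
                for t in example["conversations"]]
    else:                                  # raw code/text rows
        return {"text": example["text"]}
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

parts = []
for dataset_id, split, num_rows, format_type in CUSTOM_DATASETS:
    ds = load_dataset(dataset_id, split=split).shuffle(seed=42)
    ds = ds.select(range(min(num_rows, len(ds))))
    ds = ds.map(lambda ex: to_text(ex, format_type), remove_columns=ds.column_names)
    parts.append(ds)

train_dataset = concatenate_datasets(parts).shuffle(seed=42)
```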

---

## 🚀 How to Use (Any Notebook)

1. Open the notebook in **Google Colab** (click the notebook link above)
2. Runtime → Change runtime type → **T4 GPU**
3. In **Cell 4**, uncomment your desired `DATASET_CHOICE`
4. Run cells top-to-bottom
5. (Optional) Set your HF token in Cell 2 to push the LoRA adapter to the Hub (sketch below)
6. The last cells show **inference demos**

**Zero-config:** All hyperparameters are tuned for T4. Just pick a dataset and click ▶️.
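
A sketch of the optional Hub push from step 5 (the repo name is a placeholder; the token must have **WRITE** scope, as the cheat-sheet below notes):

```python
from huggingface_hub import login

login(token="hf_...")  # paste a WRITE-scope token

# Pushes only the small LoRA adapter, not the 4-bit base model.
model.push_to_hub("your-username/your-lora-adapter")
tokenizer.push_to_hub("your-username/your-lora-adapter")
```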

---

## 🔧 Technical: Why `dataset_text_field="text"`?

Unsloth's patched `SFTTrainer` can mishandle `formatting_func`. The clean fix is to pre-render every example into a plain `text` column:

```python
# Pre-convert messages → text using dataset.map(batched=True)
def convert_messages_to_text(examples):
    texts = []
    for msgs in examples["messages"]:
        text = tokenizer.apply_chat_template(msgs, tokenize=False)
        texts.append(text)
    return {"text": texts}

train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=["messages"])

# Then pass dataset_text_field="text" to SFTTrainer
trainer = SFTTrainer(..., dataset_text_field="text")
```

All notebooks auto-detect the incoming format (Fenrir, UltraChat, OpenHermes, ShareGPT, Code Corpus) before applying this conversion.
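
A hypothetical version of that auto-detection, keyed off the column names from the dataset table above (the notebooks' actual logic may differ):

```python
# Sniff the schema by column name and return the matching format type.
def detect_format(dataset):
    cols = dataset.column_names
    if "messages" in cols:
        return "messages"        # Fenrir, UltraChat
    if "conversations" in cols:
        return "conversations"   # OpenHermes, ShareGPT
    if "text" in cols:
        return "text"            # Code Corpus
    raise ValueError(f"Unrecognized dataset schema: {cols}")
```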

---

## ⚠️ T4 VRAM Cheat-Sheet

| Symptom | Fix |
|---------|-----|
| `CUDA out of memory` | Lower `MAX_SEQ_LENGTH` to 2048; set `BATCH_SIZE=1`; set `PACKING=False` |
| Still OOM | Enable `use_rslora=True` in LoRA config |
| Training very slow | Increase `BATCH_SIZE` if VRAM allows; enable `PACKING=True` |
| Loss not decreasing | Try `LEARNING_RATE=5e-4` or train for 2 epochs |
| Can't push to Hub | Run `login(token=...)` with a **WRITE** token |
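
Where those knobs live in code; a sketch that mirrors the notebooks' variable names and the TRL-style `SFTTrainer` call shown above (newer TRL releases move some of these into `SFTConfig`):

```python
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_rslora=True,                         # the "Still OOM" row
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,                     # OOM fix: lowered from 4096
    packing=False,                           # False on OOM; True when training is slow
    args=TrainingArguments(
        per_device_train_batch_size=1,       # BATCH_SIZE=1 on OOM
        gradient_accumulation_steps=4,       # recover effective batch size
        learning_rate=5e-4,                  # "Loss not decreasing" suggestion
        num_train_epochs=2,                  # or 1 for a quick first pass
        fp16=True,                           # the T4 has no bf16 support
        output_dir="outputs",
    ),
)
```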

---

## 📖 References

| Resource | Link |
|----------|------|
| **Qwen3-4B-Instruct-2507** | https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 |
| **LFM2.5-1.2B-Instruct** | https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct |
| **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |
| **Unsloth Docs** | https://unsloth.ai/docs |
| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
| **Code Corpus LLM Training** | https://huggingface.co/datasets/krystv/code-corpus-llm-training |
| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |

---

## 📂 Repository Structure

```
asdf98/ethical-hacking-llm-colab/
├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb   ← Best accuracy
├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb     ← Fastest training
├── EthicalHacking_Gemma4_E2B_Colab.ipynb          ← Google model (tight VRAM)
├── EthicalHacking_Qwen3-8B_Colab.ipynb            ← Simpler backup (8B)
├── EthicalHacking_MultiModel_Comparison_Colab.ipynb ← Compare models
├── BONSAI_LIMITATIONS.md                          ← Why Bonsai can't be fine-tuned
└── README.md                                      ← This file
```

---

*Pick any dataset. Train anything. Use responsibly.*