# General-Purpose LLM Fine-Tuning Collection – Google Colab Free Tier (T4)
A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **any domain** using **Google Colab Free Tier (T4, 16GB VRAM)**.
> Pick your model, pick your dataset, click run. Zero-config fine-tuning.
---
## Notebooks
| Notebook | Model | Size | T4 Batch | Est. Time | Status |
|----------|-------|------|----------|-----------|--------|
| [**Qwen3-4B Ultimate**](./EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb) | `unsloth/Qwen3-4B-Instruct-2507` | 3.3GB 4-bit | **4** | ~3–4 hrs | ✅ Recommended |
| [**LFM2.5 Ultimate**](./EthicalHacking_LFM2.5_Ultimate_Colab.ipynb) | `unsloth/LFM2.5-1.2B-Instruct` | ~1GB 4-bit | **8** | ~1–2 hrs | ✅ Fastest |
| [**Gemma-4 E2B**](./EthicalHacking_Gemma4_E2B_Colab.ipynb) | `unsloth/gemma-4-E2B-it` | ~7.6GB 4-bit | **1** | ~6–8 hrs | ⚠️ Tight VRAM |
| **Bonsai (PrismML)** | See [limitations](./BONSAI_LIMITATIONS.md) | ~1GB 1-bit | N/A | N/A | ❌ Not supported |
---
## 🔥 Model Comparison (May 2026)
| Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
|-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
| **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning. Thinking toggle. |
| **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | — | — | **128K** | Fastest training. Liquid AI edge model. |
| **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | — | — | 256K | Dense (not MoE). Google edge model. |
| Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | — | — | 1-bit ternary. **Cannot train with Unsloth.** |
**Recommendation:** Start with **Qwen3-4B** for best accuracy, or **LFM2.5** for fastest experimentation.
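The three supported models all load through the same Unsloth 4-bit path; only the model id and batch size change. A minimal sketch (the `max_seq_length` value here is illustrative, not taken from the notebooks):

```python
# Minimal 4-bit loading sketch with Unsloth; values mirror the Qwen3-4B row above.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length=4096,   # drop to 2048 if VRAM gets tight (see cheat-sheet below)
    load_in_4bit=True,     # ~3.3 GB of weights, leaving headroom on a 16GB T4
    dtype=None,            # let Unsloth auto-pick float16 on a T4
)
```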
---
## Dataset Selection – 8 Built-in Choices
Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.
| Choice | Dataset | Rows | Format | Best For | Language |
|--------|---------|------|--------|----------|----------|
| `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153K→50K | system/user/assistant | **Ethical hacking, pentesting education** | English |
| `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200K→50K | messages (role/content) | General conversation, chatbot tuning | English |
| `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+→50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
| `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90K→50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
| `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104K→50K | conversations (human/gpt) | German language fine-tuning | **German** |
| `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi) | ~153K→50K | conversations (human/gpt) | Hindi language fine-tuning | **Hindi** |
| `code_corpus` | **[Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training)** | 240K→50K | text (code files with domain/repo/lang metadata) | **Code completion, coding assistant** | Multi (20 domains: Rust, Python, C++, Kotlin, Flutter, game engines, web frameworks, ethical hacking repos, etc.) |
| `custom_mix` | Your combination | — | varies | Combine datasets for hybrid tuning | Mixed |
### How to Switch Datasets (in any notebook)
```python
# In Cell 4 – uncomment ONE line:
DATASET_CHOICE = "cybersecurity"   # ← Default (defensive security)
# DATASET_CHOICE = "ultrachat"     # ← General chat
# DATASET_CHOICE = "openhermes"    # ← Reasoning & coding
# DATASET_CHOICE = "sharegpt_en"   # ← English dialogue
# DATASET_CHOICE = "sharegpt_de"   # ← German
# DATASET_CHOICE = "sharegpt_hi"   # ← Hindi
# DATASET_CHOICE = "code_corpus"   # ← Code completion (Rust, Python, C++, etc.)
# DATASET_CHOICE = "custom_mix"    # ← Mix multiple
```
### Code Corpus Dataset Details
The [Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training) dataset contains **240,378 code files** from top open-source repositories across 20 domains, including:
| Domain | Examples |
|--------|----------|
| `web_ui` | Web frameworks, UI components |
| `cpp` | C++ systems code |
| `kotlin_android` | Android apps |
| `rust` | Rust systems (e.g., actix-web) |
| `python` | Python libraries |
| `ethical_hacking` | Security tools, pentesting repos |
| `game_engines` | Game development |
| `ui_ux_design` | Design systems |
Each example has: `text` (the full code file), `domain`, `repo`, `language`, `file_path`, and `size_chars`. The notebook converts each code file into a user/assistant conversation: the user asks to explain or improve the code, and the assistant responds with the code (sketched below).
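A sketch of that wrapping, with illustrative prompt wording (the exact phrasing in the notebooks may differ):

```python
from datasets import load_dataset

code_dataset = load_dataset("krystv/code-corpus-llm-training", split="train")

# Illustrative sketch: turn each raw code file into a user/assistant pair.
def code_row_to_messages(example):
    prompt = (
        f"Explain and, where useful, improve this {example['language']} code "
        f"from {example['repo']} ({example['file_path']}):"
    )
    return {"messages": [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": example["text"]},
    ]}

code_dataset = code_dataset.map(code_row_to_messages)
```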
### Mixing Datasets (custom_mix)
```python
CUSTOM_DATASETS = [
# (dataset_id, split, num_rows, format_type)
# format_type: "messages" | "conversations" | "text"
("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
("krystv/code-corpus-llm-training", "train", 20000, "text"),
("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
]
```
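The loading loop behind this config looks roughly like the following; `normalize_to_messages` is a hypothetical stand-in for the notebooks' format handling (see the Technical section below):

```python
from datasets import load_dataset, concatenate_datasets

parts = []
for dataset_id, split, num_rows, format_type in CUSTOM_DATASETS:
    ds = load_dataset(dataset_id, split=split)
    ds = ds.shuffle(seed=42).select(range(min(num_rows, len(ds))))
    # normalize_to_messages is a hypothetical helper: it maps "messages" /
    # "conversations" / "text" rows onto one common messages schema, so the
    # parts can be concatenated despite differing source columns.
    parts.append(normalize_to_messages(ds, format_type))

train_dataset = concatenate_datasets(parts).shuffle(seed=42)
```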
---
## How to Use (Any Notebook)
1. Open the notebook in **Google Colab** (click the notebook link above)
2. Runtime β Change runtime type β **T4 GPU**
3. In **Cell 4**, uncomment your desired `DATASET_CHOICE`
4. Run cells top-to-bottom
5. (Optional) Set your HF token in Cell 2 to push the LoRA adapter to the Hub (see the sketch below)
6. The last cells show **inference demos**
**Zero-config:** All hyperparameters are tuned for T4. Just pick a dataset and click ▶️.
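Step 5 amounts to two standard Hugging Face calls; a minimal sketch with a placeholder repo id:

```python
from huggingface_hub import login

login(token="hf_...")  # must be a WRITE token (see the cheat-sheet below)

# Push only the small LoRA adapter rather than merged weights;
# "your-username/your-adapter" is a placeholder, not a repo from this collection.
model.push_to_hub("your-username/your-adapter")
tokenizer.push_to_hub("your-username/your-adapter")
```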
---
## 🔧 Technical: Why `dataset_text_field="text"`?
Unsloth's `SFTTrainer` has known issues with `formatting_func`. The clean fix is to pre-render every conversation into a plain-text column and point the trainer at that column:
```python
# Pre-convert messages → text using dataset.map(batched=True)
def convert_messages_to_text(examples):
texts = []
for msgs in examples["messages"]:
text = tokenizer.apply_chat_template(msgs, tokenize=False)
texts.append(text)
return {"text": texts}
train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=["messages"])
# Then pass dataset_text_field="text" to SFTTrainer
trainer = SFTTrainer(..., dataset_text_field="text")
```
All notebooks auto-detect the incoming dataset format (Fenrir, UltraChat, OpenHermes, ShareGPT, Code Corpus) before running this conversion.
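Detection can be as small as a column-name check; a sketch, assuming column names alone disambiguate the formats:

```python
# Sketch: infer the conversation format from the dataset's column names.
def detect_format(dataset):
    cols = dataset.column_names
    if "messages" in cols:       # Fenrir / UltraChat: [{"role": ..., "content": ...}]
        return "messages"
    if "conversations" in cols:  # OpenHermes / ShareGPT: [{"from": "human"|"gpt", "value": ...}]
        return "conversations"
    if "text" in cols:           # Code Corpus: raw code file as one string
        return "text"
    raise ValueError(f"Unrecognized dataset columns: {cols}")
```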
---
## ⚠️ T4 VRAM Cheat-Sheet
| Symptom | Fix |
|---------|-----|
| `CUDA out of memory` | Lower `MAX_SEQ_LENGTH` to 2048; set `BATCH_SIZE=1`; set `PACKING=False` |
| Still OOM | Enable `use_rslora=True` in LoRA config |
| Training very slow | Increase `BATCH_SIZE` if VRAM allows; enable `PACKING=True` |
| Loss not decreasing | Try `LEARNING_RATE=5e-4` or train for 2 epochs |
| Can't push to Hub | Run `login(token=...)` with a **WRITE** token |
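In the notebooks' config cell, the OOM fixes from the first row map onto a handful of variables. A minimal sketch (`GRAD_ACCUM` is an assumption of mine, not a knob named in the table):

```python
# Conservative T4 settings for the "CUDA out of memory" row above.
MAX_SEQ_LENGTH = 2048   # halves activation memory vs. 4096
BATCH_SIZE = 1          # smallest per-device batch
PACKING = False         # disable sequence packing
GRAD_ACCUM = 8          # hypothetical: gradient accumulation to keep the
                        # effective batch size reasonable despite BATCH_SIZE=1
```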
---
## References
| Resource | Link |
|----------|------|
| **Qwen3-4B-Instruct-2507** | https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 |
| **LFM2.5-1.2B-Instruct** | https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct |
| **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |
| **Unsloth Docs** | https://unsloth.ai/docs |
| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
| **Code Corpus LLM Training** | https://huggingface.co/datasets/krystv/code-corpus-llm-training |
| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |
---
## Repository Structure
```
asdf98/ethical-hacking-llm-colab/
├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb       ← Best accuracy
├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb         ← Fastest training
├── EthicalHacking_Gemma4_E2B_Colab.ipynb              ← Google model (tight VRAM)
├── EthicalHacking_Qwen3-8B_Colab.ipynb                ← Simpler backup (8B)
├── EthicalHacking_MultiModel_Comparison_Colab.ipynb   ← Compare models
├── BONSAI_LIMITATIONS.md                              ← Why Bonsai can't be fine-tuned
└── README.md                                          ← This file
```
---
*Pick any dataset. Train anything. Use responsibly.*