# 🔍 General-Purpose LLM Fine-Tuning Collection – Google Colab Free Tier (T4)

A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **any domain** using **Google Colab Free Tier (T4, 16GB VRAM)**.

> Pick your model, pick your dataset, click run. Zero-config fine-tuning.

---

## 📚 Notebooks

| Notebook | Model | Size | T4 Batch | Est. Time | Status |
|----------|-------|------|----------|-----------|--------|
| [**Qwen3-4B Ultimate**](./EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb) | `unsloth/Qwen3-4B-Instruct-2507` | 3.3GB 4-bit | **4** | ~3–4 hrs | ✅ Recommended |
| [**LFM2.5 Ultimate**](./EthicalHacking_LFM2.5_Ultimate_Colab.ipynb) | `unsloth/LFM2.5-1.2B-Instruct` | ~1GB 4-bit | **8** | ~1–2 hrs | ✅ Fastest |
| [**Gemma-4 E2B**](./EthicalHacking_Gemma4_E2B_Colab.ipynb) | `unsloth/gemma-4-E2B-it` | ~7.6GB 4-bit | **1** | ~6–8 hrs | ⚠️ Tight VRAM |
| **Bonsai (PrismML)** | See [limitations](./BONSAI_LIMITATIONS.md) | ~1GB 1-bit | N/A | N/A | ❌ Not supported |

---

## 🥇 Model Comparison (May 2026)

| Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
|-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
| **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning. Thinking toggle. |
| **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | – | – | **128K** | Fastest training. Liquid AI edge model. |
| **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | – | – | 256K | Dense (not MoE). Google edge model. |
| Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | – | – | 1-bit ternary. **Cannot train with Unsloth.** |

**Recommendation:** Start with **Qwen3-4B** for best accuracy, or **LFM2.5** for fastest experimentation.
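
A minimal loading sketch, shared by all three trainable models (the `max_seq_length` value is illustrative; each notebook sets its own in the config cell):

```python
from unsloth import FastLanguageModel

# Swap model_name for any trainable checkpoint in the table above.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length=4096,   # illustrative default
    load_in_4bit=True,     # 4-bit quantization is what fits these into the T4's 16GB
)
```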

---

## 📊 Dataset Selection – 8 Built-in Choices

Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.

| Choice | Dataset | Rows | Format | Best For | Language |
|--------|---------|------|--------|----------|----------|
| `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153K→50K | system/user/assistant | **Ethical hacking, pentesting education** | English |
| `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200K→50K | messages (role/content) | General conversation, chatbot tuning | English |
| `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+→50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
| `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90K→50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
| `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104K→50K | conversations (human/gpt) | German language fine-tuning | **German** |
| `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi 27B) | ~153K→50K | conversations (human/gpt) | Hindi language fine-tuning | **Hindi** |
| `code_corpus` | **[Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training)** | 240K→50K | text (code files with domain/repo/lang metadata) | **Code completion, coding assistant** | Multi (20 domains: Rust, Python, C++, Kotlin, Flutter, game engines, web frameworks, ethical hacking repos, etc.) |
| `custom_mix` | Your combination | – | varies | Combine datasets for hybrid tuning | Mixed |

### How to Switch Datasets (in any notebook)

```python
# In Cell 4 β€” uncomment ONE line:

DATASET_CHOICE = "cybersecurity"    # ← Default (defensive security)
# DATASET_CHOICE = "ultrachat"      # ← General chat
# DATASET_CHOICE = "openhermes"     # ← Reasoning & coding
# DATASET_CHOICE = "sharegpt_en"    # ← English dialogue
# DATASET_CHOICE = "sharegpt_de"    # ← German
# DATASET_CHOICE = "sharegpt_hi"    # ← Hindi
# DATASET_CHOICE = "code_corpus"    # ← Code completion (Rust, Python, C++, etc.)
# DATASET_CHOICE = "custom_mix"     # ← Mix multiple
```

### Code Corpus Dataset Details

The [Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training) dataset contains **240,378 code files** from top open-source repositories across 20 domains:

| Domain | Examples |
|--------|----------|
| `web_ui` | Web frameworks, UI components |
| `cpp` | C++ systems code |
| `kotlin_android` | Android apps |
| `rust` | Rust systems (e.g., actix-web) |
| `python` | Python libraries |
| `ethical_hacking` | Security tools, pentesting repos |
| `game_engines` | Game development |
| `ui_ux_design` | Design systems |

Each example has: `text` (the full code file), `domain`, `repo`, `language`, `file_path`, `size_chars`. The notebook converts each code snippet into a user/assistant conversation: user asks to explain/improve the code, assistant provides the code.
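
A minimal sketch of that conversion (function name and prompt wording are illustrative, not the notebooks' exact code; assumes the split is loaded as `code_dataset`):

```python
# Hypothetical converter: wraps each raw code file into a one-turn conversation.
def code_file_to_conversation(example):
    lang = example.get("language") or "code"
    return {
        "messages": [
            {"role": "user",
             "content": f"Explain and improve this {lang} file from "
                        f"{example['repo']} ({example['file_path']}):"},
            {"role": "assistant", "content": example["text"]},
        ]
    }

code_dataset = code_dataset.map(code_file_to_conversation)
```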

### Mixing Datasets (custom_mix)

```python
CUSTOM_DATASETS = [
    # (dataset_id, split, num_rows, format_type)
    # format_type: "messages" | "conversations" | "text"
    ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
    ("krystv/code-corpus-llm-training", "train", 20000, "text"),
    ("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
]
```
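
One way such a mix could be materialized with the `datasets` library (a sketch, assuming a loaded `tokenizer` and the `CUSTOM_DATASETS` list above; the notebooks may wire this differently):

```python
from datasets import load_dataset, concatenate_datasets

ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def to_text(example, format_type):
    # Normalize each schema to a single "text" column so splits can be concatenated.
    if format_type == "messages":          # [{"role": ..., "content": ...}, ...]
        msgs = example["messages"]
    elif format_type == "conversations":   # [{"from": "human"/"gpt", "value": ...}, ...]
        msgs = [{"role": ROLE_MAP.get(t["from"], "user"), "content": t["value"]}
                for t in example["conversations"]]
    else:                                  # raw code/text rows
        return {"text": example["text"]}
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

parts = []
for dataset_id, split, num_rows, format_type in CUSTOM_DATASETS:
    ds = load_dataset(dataset_id, split=split).shuffle(seed=42)
    ds = ds.select(range(min(num_rows, len(ds))))
    ds = ds.map(lambda ex: to_text(ex, format_type), remove_columns=ds.column_names)
    parts.append(ds)

train_dataset = concatenate_datasets(parts).shuffle(seed=42)
```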

---

## 🚀 How to Use (Any Notebook)

1. Open the notebook in **Google Colab** (click the notebook link above)
2. Runtime → Change runtime type → **T4 GPU**
3. In **Cell 4**, uncomment your desired `DATASET_CHOICE`
4. Run cells top-to-bottom
5. (Optional) Set your HF token in Cell 2 to push the LoRA adapter to the Hub (sketch below)
6. The last cells show **inference demos**

**Zero-config:** All hyperparameters are tuned for T4. Just pick a dataset and click ▶️.
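
A sketch of the optional Hub push from step 5 (the repo name is a placeholder; the token must have **WRITE** scope, as the cheat-sheet below notes):

```python
from huggingface_hub import login

login(token="hf_...")  # paste a WRITE-scope token

# Pushes only the small LoRA adapter, not the 4-bit base model.
model.push_to_hub("your-username/your-lora-adapter")
tokenizer.push_to_hub("your-username/your-lora-adapter")
```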

---

## 🔧 Technical: Why `dataset_text_field="text"`?

Unsloth's patched `SFTTrainer` can mishandle `formatting_func`. The clean fix is to pre-render every example into a plain `text` column:

```python
# Pre-convert messages → text using dataset.map(batched=True)
def convert_messages_to_text(examples):
    texts = []
    for msgs in examples["messages"]:
        text = tokenizer.apply_chat_template(msgs, tokenize=False)
        texts.append(text)
    return {"text": texts}

train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=["messages"])

# Then pass dataset_text_field="text" to SFTTrainer
trainer = SFTTrainer(..., dataset_text_field="text")
```

All notebooks auto-detect the incoming format (Fenrir, UltraChat, OpenHermes, ShareGPT, Code Corpus) before applying this conversion.
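
A hypothetical version of that auto-detection, keyed off the column names from the dataset table above (the notebooks' actual logic may differ):

```python
# Sniff the schema by column name and return the matching format type.
def detect_format(dataset):
    cols = dataset.column_names
    if "messages" in cols:
        return "messages"        # Fenrir, UltraChat
    if "conversations" in cols:
        return "conversations"   # OpenHermes, ShareGPT
    if "text" in cols:
        return "text"            # Code Corpus
    raise ValueError(f"Unrecognized dataset schema: {cols}")
```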

---

## ⚠️ T4 VRAM Cheat-Sheet

| Symptom | Fix |
|---------|-----|
| `CUDA out of memory` | Lower `MAX_SEQ_LENGTH` to 2048; set `BATCH_SIZE=1`; set `PACKING=False` |
| Still OOM | Enable `use_rslora=True` in LoRA config |
| Training very slow | Increase `BATCH_SIZE` if VRAM allows; enable `PACKING=True` |
| Loss not decreasing | Try `LEARNING_RATE=5e-4` or train for 2 epochs |
| Can't push to Hub | Run `login(token=...)` with a **WRITE** token |
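
Where those knobs live in code; a sketch that mirrors the notebooks' variable names and the TRL-style `SFTTrainer` call shown above (newer TRL releases move some of these into `SFTConfig`):

```python
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_rslora=True,                         # the "Still OOM" row
)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    dataset_text_field="text",
    max_seq_length=2048,                     # OOM fix: lowered from 4096
    packing=False,                           # False on OOM; True when training is slow
    args=TrainingArguments(
        per_device_train_batch_size=1,       # BATCH_SIZE=1 on OOM
        gradient_accumulation_steps=4,       # recover effective batch size
        learning_rate=5e-4,                  # "Loss not decreasing" suggestion
        num_train_epochs=2,                  # or 1 for a quick first pass
        fp16=True,                           # the T4 has no bf16 support
        output_dir="outputs",
    ),
)
```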

---

## 📖 References

| Resource | Link |
|----------|------|
| **Qwen3-4B-Instruct-2507** | https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 |
| **LFM2.5-1.2B-Instruct** | https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct |
| **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |
| **Unsloth Docs** | https://unsloth.ai/docs |
| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
| **Code Corpus LLM Training** | https://huggingface.co/datasets/krystv/code-corpus-llm-training |
| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |

---

## 📂 Repository Structure

```
asdf98/ethical-hacking-llm-colab/
├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb   ← Best accuracy
├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb     ← Fastest training
├── EthicalHacking_Gemma4_E2B_Colab.ipynb          ← Google model (tight VRAM)
├── EthicalHacking_Qwen3-8B_Colab.ipynb            ← Simpler backup (8B)
├── EthicalHacking_MultiModel_Comparison_Colab.ipynb ← Compare models
├── BONSAI_LIMITATIONS.md                          ← Why Bonsai can't be fine-tuned
└── README.md                                      ← This file
```

---

*Pick any dataset. Train anything. Use responsibly.*