# 🔐 General-Purpose LLM Fine-Tuning Collection — Google Colab Free Tier (T4)

A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **any domain** using **Google Colab Free Tier (T4, 16GB VRAM)**.

> Pick your model, pick your dataset, click run. Zero-config fine-tuning.

---

## 📚 Notebooks

| Notebook | Model | Size | T4 Batch | Est. Time | Status |
|----------|-------|------|----------|-----------|--------|
| [**Qwen3-4B Ultimate**](./EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb) | `unsloth/Qwen3-4B-Instruct-2507` | 3.3GB 4-bit | **4** | ~3–4 hrs | ✅ Recommended |
| [**LFM2.5 Ultimate**](./EthicalHacking_LFM2.5_Ultimate_Colab.ipynb) | `unsloth/LFM2.5-1.2B-Instruct` | ~1GB 4-bit | **8** | ~1–2 hrs | ✅ Fastest |
| [**Gemma-4 E2B**](./EthicalHacking_Gemma4_E2B_Colab.ipynb) | `unsloth/gemma-4-E2B-it` | ~7.6GB 4-bit | **1** | ~6–8 hrs | ⚠️ Tight VRAM |
| **Bonsai (PrismML)** | See [limitations](./BONSAI_LIMITATIONS.md) | ~1GB 1-bit | N/A | N/A | ❌ Not supported |

---

## 🥇 Model Comparison (May 2026)

| Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
|-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
| **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning. Thinking toggle. |
| **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | — | — | **128K** | Fastest training. Liquid AI edge model. |
| **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | — | — | 256K | Dense (not MoE). Google edge model. |
| Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | — | — | 1-bit ternary. **Cannot train with Unsloth.** |

**Recommendation:** Start with **Qwen3-4B** for the best accuracy, or **LFM2.5** for the fastest experimentation.

---

## 📊 Dataset Selection — 8 Built-in Choices

Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.

| Choice | Dataset | Rows | Format | Best For | Language |
|--------|---------|------|--------|----------|----------|
| `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153K→50K | system/user/assistant | **Ethical hacking, pentesting education** | English |
| `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200K→50K | messages (role/content) | General conversation, chatbot tuning | English |
| `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+→50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
| `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90K→50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
| `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104K→50K | conversations (human/gpt) | German-language fine-tuning | **German** |
| `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi) | ~153K→50K | conversations (human/gpt) | Hindi-language fine-tuning | **Hindi** |
| `code_corpus` | **[Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training)** | 240K→50K | text (code files with domain/repo/lang metadata) | **Code completion, coding assistant** | Multi (20 domains) |
| `custom_mix` | Your combination | — | varies | Combine datasets for hybrid tuning | Mixed |

### How to Switch Datasets (in any notebook)

```python
# In Cell 4 — uncomment ONE line:
DATASET_CHOICE = "cybersecurity"   # ← Default (defensive security)
# DATASET_CHOICE = "ultrachat"     # ← General chat
# DATASET_CHOICE = "openhermes"    # ← Reasoning & coding
# DATASET_CHOICE = "sharegpt_en"   # ← English dialogue
# DATASET_CHOICE = "sharegpt_de"   # ← German
# DATASET_CHOICE = "sharegpt_hi"   # ← Hindi
# DATASET_CHOICE = "code_corpus"   # ← Code completion (Rust, Python, C++, etc.)
# DATASET_CHOICE = "custom_mix"    # ← Mix multiple
```

### Code Corpus Dataset Details

The [Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training) dataset contains **240,378 code files** from top open-source repositories across 20 domains:

| Domain | Examples |
|--------|----------|
| `web_ui` | Web frameworks, UI components |
| `cpp` | C++ systems code |
| `kotlin_android` | Android apps |
| `rust` | Rust systems (e.g., actix-web) |
| `python` | Python libraries |
| `ethical_hacking` | Security tools, pentesting repos |
| `game_engines` | Game development |
| `ui_ux_design` | Design systems |

Each example has: `text` (the full code file), `domain`, `repo`, `language`, `file_path`, `size_chars`. The notebook converts each code file into a user/assistant conversation: the user asks to explain/improve the code, and the assistant provides the code, as sketched below.
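A minimal sketch of what that conversion could look like, assuming the schema above; the helper name and prompt wording are illustrative, not the notebooks' exact code:

```python
from datasets import load_dataset

code_ds = load_dataset("krystv/code-corpus-llm-training", split="train")

def code_row_to_messages(row):
    """Wrap one code file as a single-turn chat example (illustrative)."""
    user_prompt = (
        f"Explain and, where useful, improve this {row['language']} code "
        f"from {row['repo']} ({row['file_path']}):"
    )
    return {"messages": [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": row["text"]},
    ]}

# Keep only the new "messages" column so every dataset choice
# downstream shares the same schema.
code_ds = code_ds.map(code_row_to_messages, remove_columns=code_ds.column_names)
```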
### Mixing Datasets (custom_mix)

```python
CUSTOM_DATASETS = [
    # (dataset_id, split, num_rows, format_type)
    # format_type: "messages" | "conversations" | "text"
    ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
    ("krystv/code-corpus-llm-training", "train", 20000, "text"),
    ("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
]
```

---

## 🚀 How to Use (Any Notebook)

1. Open the notebook in **Google Colab** (click the notebook link above)
2. Runtime → Change runtime type → **T4 GPU**
3. In **Cell 4**, uncomment your desired `DATASET_CHOICE`
4. Run the cells top-to-bottom
5. (Optional) Set your HF token in Cell 2 to push the LoRA adapter
6. The last cells show **inference demos**

**Zero-config:** All hyperparameters are tuned for the T4. Just pick a dataset and click ▶️.

---

## 🔧 Technical: Why `dataset_text_field="text"`?

Unsloth's `SFTTrainer` has issues with `formatting_func`. The clean fix is to pre-convert the chat data into a plain `text` column:

```python
# Pre-convert messages → text using dataset.map(batched=True)
def convert_messages_to_text(examples):
    texts = []
    for msgs in examples["messages"]:
        text = tokenizer.apply_chat_template(msgs, tokenize=False)
        texts.append(text)
    return {"text": texts}

train_dataset = train_dataset.map(
    convert_messages_to_text, batched=True, remove_columns=["messages"]
)

# Then pass dataset_text_field="text" to SFTTrainer
trainer = SFTTrainer(..., dataset_text_field="text")
```

All notebooks auto-detect the source format (Fenrir, UltraChat, OpenHermes, ShareGPT, Code Corpus) before this conversion step.
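The detection itself can be as simple as a column check. A sketch, assuming each source exposes exactly one of the three schemas from the dataset table (function names are illustrative):

```python
# Illustrative auto-detection: inspect the columns, then normalize every
# schema to role/content messages before the chat-template conversion above.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def detect_format(dataset):
    cols = dataset.column_names
    if "messages" in cols:        # UltraChat, Fenrir: role/content dicts
        return "messages"
    if "conversations" in cols:   # OpenHermes, ShareGPT: human/gpt turns
        return "conversations"
    if "text" in cols:            # Code Corpus: raw code files
        return "text"
    raise ValueError(f"Unrecognized dataset columns: {cols}")

def to_messages(example, fmt):
    if fmt == "messages":
        return {"messages": example["messages"]}
    if fmt == "conversations":
        return {"messages": [
            {"role": ROLE_MAP.get(turn["from"], "user"), "content": turn["value"]}
            for turn in example["conversations"]
        ]}
    # "text": single-turn wrap, as in the Code Corpus sketch above
    return {"messages": [
        {"role": "user", "content": "Explain this code:"},
        {"role": "assistant", "content": example["text"]},
    ]}
```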
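For orientation, here is roughly how the knobs referenced in the cheat-sheet below (`MAX_SEQ_LENGTH`, `BATCH_SIZE`, `PACKING`, `LEARNING_RATE`) would wire into the trainer. This sketch follows the older TRL-style `SFTTrainer` signature used in the snippet above; the default values are assumptions, not the notebooks' exact settings:

```python
from transformers import TrainingArguments
from trl import SFTTrainer

MAX_SEQ_LENGTH = 4096   # assumed default; lower to 2048 on CUDA OOM (see cheat-sheet)
BATCH_SIZE = 4          # per the tables above: 4 (Qwen3-4B), 8 (LFM2.5), 1 (Gemma-4 E2B)
PACKING = True          # disable on OOM; enable for speed
LEARNING_RATE = 2e-4    # assumed starting point; try 5e-4 if the loss plateaus

trainer = SFTTrainer(
    model=model,                  # Unsloth-patched LoRA model from earlier cells
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # already normalized to a "text" column
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    packing=PACKING,
    args=TrainingArguments(
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=4,
        learning_rate=LEARNING_RATE,
        num_train_epochs=1,
        fp16=True,                # the T4 has no bf16 support
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```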
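Finally, a minimal inference sketch matching step 6 of the how-to. Unsloth's `FastLanguageModel.for_inference` is its documented fast-generation switch; the prompt and generation settings here are placeholders:

```python
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # switch Unsloth to fast generation mode

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a Python port scanner stub."}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids=input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Optional (step 5): push the LoRA adapter with a WRITE token
# model.push_to_hub("your-username/your-adapter", token="hf_...")
```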
---

## ⚠️ T4 VRAM Cheat-Sheet

| Symptom | Fix |
|---------|-----|
| `CUDA out of memory` | Lower `MAX_SEQ_LENGTH` to 2048; set `BATCH_SIZE=1`; set `PACKING=False` |
| Still OOM | Lower the LoRA rank `r` and make sure gradient checkpointing is enabled |
| Training very slow | Increase `BATCH_SIZE` if VRAM allows; enable `PACKING=True` |
| Loss not decreasing | Try `LEARNING_RATE=5e-4` or train for 2 epochs |
| Can't push to Hub | Run `login(token=...)` with a **WRITE** token |

---

## 📖 References

| Resource | Link |
|----------|------|
| **Qwen3-4B-Instruct-2507** | https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 |
| **LFM2.5-1.2B-Instruct** | https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct |
| **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |
| **Unsloth Docs** | https://unsloth.ai/docs |
| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
| **Code Corpus LLM Training** | https://huggingface.co/datasets/krystv/code-corpus-llm-training |
| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |

---

## 📂 Repository Structure

```
asdf98/ethical-hacking-llm-colab/
├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb        ← Best accuracy
├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb          ← Fastest training
├── EthicalHacking_Gemma4_E2B_Colab.ipynb               ← Google model (tight VRAM)
├── EthicalHacking_Qwen3-8B_Colab.ipynb                 ← Simpler backup (8B)
├── EthicalHacking_MultiModel_Comparison_Colab.ipynb    ← Compare models
├── BONSAI_LIMITATIONS.md                               ← Why Bonsai can't be fine-tuned
└── README.md                                           ← This file
```

---

*Pick any dataset. Train anything. Use responsibly.*