# πŸ” General-Purpose LLM Fine-Tuning Collection β€” Google Colab Free Tier (T4)
A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **any domain** using **Google Colab Free Tier (T4, 16GB VRAM)**.
> Pick your model, pick your dataset, click run. Zero-config fine-tuning.
---
## 📚 Notebooks
| Notebook | Model | Size | T4 Batch | Est. Time | Status |
|----------|-------|------|----------|-----------|--------|
| [**Qwen3-4B Ultimate**](./EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb) | `unsloth/Qwen3-4B-Instruct-2507` | 3.3GB 4-bit | **4** | ~3–4 hrs | ✅ Recommended |
| [**LFM2.5 Ultimate**](./EthicalHacking_LFM2.5_Ultimate_Colab.ipynb) | `unsloth/LFM2.5-1.2B-Instruct` | ~1GB 4-bit | **8** | ~1–2 hrs | ✅ Fastest |
| [**Gemma-4 E2B**](./EthicalHacking_Gemma4_E2B_Colab.ipynb) | `unsloth/gemma-4-E2B-it` | ~7.6GB 4-bit | **1** | ~6–8 hrs | ⚠️ Tight VRAM |
| **Bonsai (PrismML)** | See [limitations](./BONSAI_LIMITATIONS.md) | ~1GB 1-bit | N/A | N/A | ❌ Not supported |
---
## 🥇 Model Comparison (May 2026)
| Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
|-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
| **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning. Thinking toggle. |
| **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | – | – | **128K** | Fastest training. Liquid AI edge model. |
| **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | – | – | 256K | Dense (not MoE). Google edge model. |
| Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | – | – | 1-bit ternary. **Cannot train with Unsloth.** |
**Recommendation:** Start with **Qwen3-4B** for best accuracy, or **LFM2.5** for fastest experimentation.
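For orientation, loading the recommended model in 4-bit on a T4 with Unsloth looks roughly like the sketch below. The sequence length and LoRA hyperparameters here are illustrative placeholders, not the notebooks' exact tuned values:

```python
# Minimal sketch (not the notebooks' exact code): load Qwen3-4B in 4-bit on a T4.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length=2048,   # illustrative; raise if VRAM allows
    load_in_4bit=True,     # keeps the 4B base model at ~3.3 GB
)

# Attach LoRA adapters so only a small fraction of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # illustrative rank; the notebooks set their own value
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```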
---
## 📊 Dataset Selection – 8 Built-in Choices
Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.
| Choice | Dataset | Rows | Format | Best For | Language |
|--------|---------|------|--------|----------|----------|
| `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153K→50K | system/user/assistant | **Ethical hacking, pentesting education** | English |
| `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200K→50K | messages (role/content) | General conversation, chatbot tuning | English |
| `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+→50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
| `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90K→50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
| `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104K→50K | conversations (human/gpt) | German language fine-tuning | **German** |
| `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi 27B) | ~153K→50K | conversations (human/gpt) | Hindi language fine-tuning | **Hindi** |
| `code_corpus` | **[Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training)** | 240K→50K | text (code files with domain/repo/lang metadata) | **Code completion, coding assistant** | Multi (20 domains: Rust, Python, C++, Kotlin, Flutter, game engines, web frameworks, ethical hacking repos, etc.) |
| `custom_mix` | Your combination | – | varies | Combine datasets for hybrid tuning | Mixed |
### How to Switch Datasets (in any notebook)
```python
# In Cell 4 β€” uncomment ONE line:
DATASET_CHOICE = "cybersecurity" # ← Default (defensive security)
# DATASET_CHOICE = "ultrachat" # ← General chat
# DATASET_CHOICE = "openhermes" # ← Reasoning & coding
# DATASET_CHOICE = "sharegpt_en" # ← English dialogue
# DATASET_CHOICE = "sharegpt_de" # ← German
# DATASET_CHOICE = "sharegpt_hi" # ← Hindi
# DATASET_CHOICE = "code_corpus" # ← Code completion (Rust, Python, C++, etc.)
# DATASET_CHOICE = "custom_mix" # ← Mix multiple
```
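Under the hood, each choice maps to a Hugging Face dataset ID plus a row cap, and the notebooks subsample down to roughly 50K rows. A minimal sketch of that mapping, assuming the `datasets` library (the dictionary, splits, and variable names below are illustrative, not the notebooks' exact code):

```python
from datasets import load_dataset

# Illustrative mapping: choice -> (dataset_id, split, max_rows).
# Splits here are assumptions based on each dataset card.
DATASET_MAP = {
    "ultrachat":   ("HuggingFaceH4/ultrachat_200k", "train_sft", 50_000),
    "openhermes":  ("teknium/OpenHermes-2.5", "train", 50_000),
    "code_corpus": ("krystv/code-corpus-llm-training", "train", 50_000),
}

dataset_id, split, max_rows = DATASET_MAP[DATASET_CHOICE]
raw = load_dataset(dataset_id, split=split)
raw = raw.shuffle(seed=42).select(range(min(max_rows, len(raw))))  # cap at ~50K rows
```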
### Code Corpus Dataset Details
The [Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training) dataset contains **240,378 code files** from top open-source repositories across 20 domains:
| Domain | Examples |
|--------|----------|
| `web_ui` | Web frameworks, UI components |
| `cpp` | C++ systems code |
| `kotlin_android` | Android apps |
| `rust` | Rust systems (e.g., actix-web) |
| `python` | Python libraries |
| `ethical_hacking` | Security tools, pentesting repos |
| `game_engines` | Game development |
| `ui_ux_design` | Design systems |
Each example contains `text` (the full code file) plus `domain`, `repo`, `language`, `file_path`, and `size_chars` metadata. The notebooks convert each code file into a user/assistant conversation: the user turn asks to explain or improve the code, and the assistant turn supplies the code.
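A sketch of that conversion, using the column names listed above (the helper name and the wording of the user turn are illustrative, not the notebooks' exact prompt):

```python
# Illustrative conversion of one Code Corpus row into a chat-style training example.
def code_row_to_messages(row):
    user_turn = (
        f"Explain and, where useful, improve this {row['language']} file "
        f"from the {row['repo']} repository ({row['domain']} domain)."
    )
    return {
        "messages": [
            {"role": "user", "content": user_turn},
            {"role": "assistant", "content": row["text"]},  # the full code file
        ]
    }

train_dataset = train_dataset.map(code_row_to_messages,
                                  remove_columns=train_dataset.column_names)
```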
### Mixing Datasets (custom_mix)
```python
CUSTOM_DATASETS = [
    # (dataset_id, split, num_rows, format_type)
    # format_type: "messages" | "conversations" | "text"
    ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
    ("krystv/code-corpus-llm-training", "train", 20000, "text"),
    ("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
]
```
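A sketch of how such a mix could be assembled with `datasets.concatenate_datasets`, assuming each source has already been normalized to the shared `text` column described in the Technical section below (the loop and variable names are illustrative):

```python
from datasets import load_dataset, concatenate_datasets

parts = []
for dataset_id, split, num_rows, format_type in CUSTOM_DATASETS:
    ds = load_dataset(dataset_id, split=split)
    ds = ds.shuffle(seed=42).select(range(min(num_rows, len(ds))))
    # Normalize each source to a single "text" column here, depending on
    # format_type ("messages" | "conversations" | "text"), before mixing.
    parts.append(ds)

train_dataset = concatenate_datasets(parts).shuffle(seed=42)
```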
---
## πŸš€ How to Use (Any Notebook)
1. Open the notebook in **Google Colab** (click the notebook link above)
2. Runtime → Change runtime type → **T4 GPU**
3. In **Cell 4**, uncomment your desired `DATASET_CHOICE`
4. Run cells top-to-bottom
5. (Optional) Set your HF token in Cell 2 to push the LoRA adapter (see the sketch below)
6. The last cells show **inference demos**
**Zero-config:** All hyperparameters are tuned for T4. Just pick a dataset and click ▶️.
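Step 5 above refers to pushing the trained LoRA adapter to the Hub. A minimal sketch of what that looks like after training, assuming a write-enabled token (the repo id is a placeholder):

```python
from huggingface_hub import login

login(token="hf_...")  # must be a WRITE token, see the cheat-sheet below

# Push only the LoRA adapter (much smaller than the base model) plus the tokenizer.
model.push_to_hub("your-username/your-lora-adapter")      # placeholder repo id
tokenizer.push_to_hub("your-username/your-lora-adapter")
```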
---
## 🔧 Technical: Why `dataset_text_field="text"`?
Unsloth's patched `SFTTrainer` has issues with `formatting_func`. The clean fix is to pre-convert the chat messages into a plain `text` column:
```python
# Pre-convert messages → text using dataset.map(batched=True)
def convert_messages_to_text(examples):
    texts = []
    for msgs in examples["messages"]:
        text = tokenizer.apply_chat_template(msgs, tokenize=False)
        texts.append(text)
    return {"text": texts}

train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=["messages"])

# Then pass dataset_text_field="text" to SFTTrainer
trainer = SFTTrainer(..., dataset_text_field="text")
```
All notebooks handle format auto-detection (Fenrir, UltraChat, OpenHermes, ShareGPT, Code Corpus) automatically.
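A simplified view of how that auto-detection can work, keyed off the column names each dataset exposes (hypothetical helper, not the notebooks' exact logic):

```python
# Hypothetical sketch: detect the dataset format from its column names.
def detect_format(dataset):
    cols = set(dataset.column_names)
    if "messages" in cols:        # Fenrir / UltraChat style: [{"role", "content"}, ...]
        return "messages"
    if "conversations" in cols:   # OpenHermes / ShareGPT style: [{"from", "value"}, ...]
        return "conversations"
    if "text" in cols:            # Code Corpus style: raw code/text
        return "text"
    raise ValueError(f"Unrecognized dataset columns: {cols}")
```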
---
## ⚠️ T4 VRAM Cheat-Sheet
| Symptom | Fix |
|---------|-----|
| `CUDA out of memory` | Lower `MAX_SEQ_LENGTH` to 2048; set `BATCH_SIZE=1`; set `PACKING=False` |
| Still OOM | Enable `use_rslora=True` in LoRA config |
| Training very slow | Increase `BATCH_SIZE` if VRAM allows; enable `PACKING=True` |
| Loss not decreasing | Try `LEARNING_RATE=5e-4` or train for 2 epochs |
| Can't push to Hub | Run `login(token=...)` with a **WRITE** token |
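Put together, a conservative T4 configuration that follows the fixes above might look like this (values are illustrative starting points, not the notebooks' tuned defaults):

```python
from transformers import TrainingArguments

MAX_SEQ_LENGTH = 2048   # lower this first if you hit CUDA OOM
BATCH_SIZE = 1          # smallest batch; compensate with gradient accumulation
PACKING = False         # disable packing when memory is tight

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=8,   # keeps the effective batch size reasonable
    learning_rate=2e-4,              # try 5e-4 if the loss plateaus (see table above)
    num_train_epochs=1,              # or 2 if the loss is not decreasing
    fp16=True,                       # the T4 has no bfloat16 support
    logging_steps=10,
)
```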
---
## 📖 References
| Resource | Link |
|----------|------|
| **Qwen3-4B-Instruct-2507** | https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 |
| **LFM2.5-1.2B-Instruct** | https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct |
| **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |
| **Unsloth Docs** | https://unsloth.ai/docs |
| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
| **Code Corpus LLM Training** | https://huggingface.co/datasets/krystv/code-corpus-llm-training |
| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |
---
## 📂 Repository Structure
```
asdf98/ethical-hacking-llm-colab/
├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb       ← Best accuracy
├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb         ← Fastest training
├── EthicalHacking_Gemma4_E2B_Colab.ipynb              ← Google model (tight VRAM)
├── EthicalHacking_Qwen3-8B_Colab.ipynb                ← Simpler backup (8B)
├── EthicalHacking_MultiModel_Comparison_Colab.ipynb   ← Compare models
├── BONSAI_LIMITATIONS.md                              ← Why Bonsai can't be fine-tuned
└── README.md                                          ← This file
```
---
*Pick any dataset. Train anything. Use responsibly.*