# 🔐 General-Purpose LLM Fine-Tuning Collection — Google Colab Free Tier (T4)

A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **any domain** using **Google Colab Free Tier (T4, 16GB VRAM)**.

> Pick your model, pick your dataset, click run. Zero-config fine-tuning.

---

## 📚 Notebooks

| Notebook | Model | Size | T4 Batch | Est. Time | Status |
|----------|-------|------|----------|-----------|--------|
| [**Qwen3-4B Ultimate**](./EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb) | `unsloth/Qwen3-4B-Instruct-2507` | 3.3GB 4-bit | **4** | ~3–4 hrs | ✅ Recommended |
| [**LFM2.5 Ultimate**](./EthicalHacking_LFM2.5_Ultimate_Colab.ipynb) | `unsloth/LFM2.5-1.2B-Instruct` | ~1GB 4-bit | **8** | ~1–2 hrs | ✅ Fastest |
| [**Gemma-4 E2B**](./EthicalHacking_Gemma4_E2B_Colab.ipynb) | `unsloth/gemma-4-E2B-it` | ~7.6GB 4-bit | **1** | ~6–8 hrs | ⚠️ Tight VRAM |
| **Bonsai (PrismML)** | See [limitations](./BONSAI_LIMITATIONS.md) | ~1GB 1-bit | N/A | N/A | ❌ Not supported |

---

## 🥇 Model Comparison (May 2026)

| Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
|-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
| **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning. Thinking toggle. |
| **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | — | — | **128K** | Fastest training. Liquid AI edge model. |
| **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | — | — | 256K | Dense (not MoE). Google edge model. |
| Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | — | — | 1-bit ternary. **Cannot train with Unsloth.** |

**Recommendation:** Start with **Qwen3-4B** for the best accuracy, or **LFM2.5** for the fastest experimentation.

---

## 📊 Dataset Selection — 8 Built-in Choices

Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.

| Choice | Dataset | Rows | Format | Best For | Language |
|--------|---------|------|--------|----------|----------|
| `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153K→50K | system/user/assistant | **Ethical hacking, pentesting education** | English |
| `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200K→50K | messages (role/content) | General conversation, chatbot tuning | English |
| `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+→50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
| `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90K→50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
| `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104K→50K | conversations (human/gpt) | German-language fine-tuning | **German** |
| `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi) | ~153K→50K | conversations (human/gpt) | Hindi-language fine-tuning | **Hindi** |
| `code_corpus` | **[Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training)** | 240K→50K | text (code files with domain/repo/lang metadata) | **Code completion, coding assistant** | Multi (20 domains) |
| `custom_mix` | Your combination | — | varies | Combine datasets for hybrid tuning | Mixed |

### How to Switch Datasets (in any notebook)

```python
# In Cell 4 — uncomment ONE line:
DATASET_CHOICE = "cybersecurity"   # ← Default (defensive security)
# DATASET_CHOICE = "ultrachat"     # ← General chat
# DATASET_CHOICE = "openhermes"    # ← Reasoning & coding
# DATASET_CHOICE = "sharegpt_en"   # ← English dialogue
# DATASET_CHOICE = "sharegpt_de"   # ← German
# DATASET_CHOICE = "sharegpt_hi"   # ← Hindi
# DATASET_CHOICE = "code_corpus"   # ← Code completion (Rust, Python, C++, etc.)
# DATASET_CHOICE = "custom_mix"    # ← Mix multiple
```

### Code Corpus Dataset Details

The [Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training) dataset contains **240,378 code files** from top open-source repositories across 20 domains:

| Domain | Examples |
|--------|----------|
| `web_ui` | Web frameworks, UI components |
| `cpp` | C++ systems code |
| `kotlin_android` | Android apps |
| `rust` | Rust systems (e.g., actix-web) |
| `python` | Python libraries |
| `ethical_hacking` | Security tools, pentesting repos |
| `game_engines` | Game development |
| `ui_ux_design` | Design systems |

Each example has: `text` (the full code file), `domain`, `repo`, `language`, `file_path`, `size_chars`. The notebook converts each code file into a user/assistant conversation: the user asks to explain/improve the code, and the assistant provides the code, as sketched below.
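A minimal sketch of what that conversion could look like, assuming the schema above; the helper name and prompt wording are illustrative, not the notebooks' exact code:

```python
from datasets import load_dataset

code_ds = load_dataset("krystv/code-corpus-llm-training", split="train")

def code_row_to_messages(row):
    """Wrap one code file as a single-turn chat example (illustrative)."""
    user_prompt = (
        f"Explain and, where useful, improve this {row['language']} code "
        f"from {row['repo']} ({row['file_path']}):"
    )
    return {"messages": [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": row["text"]},
    ]}

# Keep only the new "messages" column so every dataset choice
# downstream shares the same schema.
code_ds = code_ds.map(code_row_to_messages, remove_columns=code_ds.column_names)
```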
### Mixing Datasets (custom_mix)

```python
CUSTOM_DATASETS = [
    # (dataset_id, split, num_rows, format_type)
    # format_type: "messages" | "conversations" | "text"
    ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
    ("krystv/code-corpus-llm-training", "train", 20000, "text"),
    ("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
]
```

---

## 🚀 How to Use (Any Notebook)

1. Open the notebook in **Google Colab** (click the notebook link above)
2. Runtime → Change runtime type → **T4 GPU**
3. In **Cell 4**, uncomment your desired `DATASET_CHOICE`
4. Run the cells top-to-bottom
5. (Optional) Set your HF token in Cell 2 to push the LoRA adapter
6. The last cells show **inference demos**

**Zero-config:** All hyperparameters are tuned for the T4. Just pick a dataset and click ▶️.

---

## 🔧 Technical: Why `dataset_text_field="text"`?

Unsloth's `SFTTrainer` has issues with `formatting_func`. The clean fix is to pre-convert the chat data into a plain `text` column:

```python
# Pre-convert messages → text using dataset.map(batched=True)
def convert_messages_to_text(examples):
    texts = []
    for msgs in examples["messages"]:
        text = tokenizer.apply_chat_template(msgs, tokenize=False)
        texts.append(text)
    return {"text": texts}

train_dataset = train_dataset.map(
    convert_messages_to_text, batched=True, remove_columns=["messages"]
)

# Then pass dataset_text_field="text" to SFTTrainer
trainer = SFTTrainer(..., dataset_text_field="text")
```

All notebooks auto-detect the source format (Fenrir, UltraChat, OpenHermes, ShareGPT, Code Corpus) before this conversion step.
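The detection itself can be as simple as a column check. A sketch, assuming each source exposes exactly one of the three schemas from the dataset table (function names are illustrative):

```python
# Illustrative auto-detection: inspect the columns, then normalize every
# schema to role/content messages before the chat-template conversion above.
ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant"}

def detect_format(dataset):
    cols = dataset.column_names
    if "messages" in cols:        # UltraChat, Fenrir: role/content dicts
        return "messages"
    if "conversations" in cols:   # OpenHermes, ShareGPT: human/gpt turns
        return "conversations"
    if "text" in cols:            # Code Corpus: raw code files
        return "text"
    raise ValueError(f"Unrecognized dataset columns: {cols}")

def to_messages(example, fmt):
    if fmt == "messages":
        return {"messages": example["messages"]}
    if fmt == "conversations":
        return {"messages": [
            {"role": ROLE_MAP.get(turn["from"], "user"), "content": turn["value"]}
            for turn in example["conversations"]
        ]}
    # "text": single-turn wrap, as in the Code Corpus sketch above
    return {"messages": [
        {"role": "user", "content": "Explain this code:"},
        {"role": "assistant", "content": example["text"]},
    ]}
```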
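For orientation, here is roughly how the knobs referenced in the cheat-sheet below (`MAX_SEQ_LENGTH`, `BATCH_SIZE`, `PACKING`, `LEARNING_RATE`) would wire into the trainer. This sketch follows the older TRL-style `SFTTrainer` signature used in the snippet above; the default values are assumptions, not the notebooks' exact settings:

```python
from transformers import TrainingArguments
from trl import SFTTrainer

MAX_SEQ_LENGTH = 4096   # assumed default; lower to 2048 on CUDA OOM (see cheat-sheet)
BATCH_SIZE = 4          # per the tables above: 4 (Qwen3-4B), 8 (LFM2.5), 1 (Gemma-4 E2B)
PACKING = True          # disable on OOM; enable for speed
LEARNING_RATE = 2e-4    # assumed starting point; try 5e-4 if the loss plateaus

trainer = SFTTrainer(
    model=model,                  # Unsloth-patched LoRA model from earlier cells
    tokenizer=tokenizer,
    train_dataset=train_dataset,  # already normalized to a "text" column
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    packing=PACKING,
    args=TrainingArguments(
        per_device_train_batch_size=BATCH_SIZE,
        gradient_accumulation_steps=4,
        learning_rate=LEARNING_RATE,
        num_train_epochs=1,
        fp16=True,                # the T4 has no bf16 support
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()
```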
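Finally, a minimal inference sketch matching step 6 of the how-to. Unsloth's `FastLanguageModel.for_inference` is its documented fast-generation switch; the prompt and generation settings here are placeholders:

```python
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # switch Unsloth to fast generation mode

input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a Python port scanner stub."}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids=input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))

# Optional (step 5): push the LoRA adapter with a WRITE token
# model.push_to_hub("your-username/your-adapter", token="hf_...")
```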
---

## ⚠️ T4 VRAM Cheat-Sheet

| Symptom | Fix |
|---------|-----|
| `CUDA out of memory` | Lower `MAX_SEQ_LENGTH` to 2048; set `BATCH_SIZE=1`; set `PACKING=False` |
| Still OOM | Lower the LoRA rank `r` and make sure gradient checkpointing is enabled |
| Training very slow | Increase `BATCH_SIZE` if VRAM allows; enable `PACKING=True` |
| Loss not decreasing | Try `LEARNING_RATE=5e-4` or train for 2 epochs |
| Can't push to Hub | Run `login(token=...)` with a **WRITE** token |

---

## 📖 References

| Resource | Link |
|----------|------|
| **Qwen3-4B-Instruct-2507** | https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 |
| **LFM2.5-1.2B-Instruct** | https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct |
| **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |
| **Unsloth Docs** | https://unsloth.ai/docs |
| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
| **Code Corpus LLM Training** | https://huggingface.co/datasets/krystv/code-corpus-llm-training |
| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |

---

## 📂 Repository Structure

```
asdf98/ethical-hacking-llm-colab/
├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb        ← Best accuracy
├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb          ← Fastest training
├── EthicalHacking_Gemma4_E2B_Colab.ipynb               ← Google model (tight VRAM)
├── EthicalHacking_Qwen3-8B_Colab.ipynb                 ← Simpler backup (8B)
├── EthicalHacking_MultiModel_Comparison_Colab.ipynb    ← Compare models
├── BONSAI_LIMITATIONS.md                               ← Why Bonsai can't be fine-tuned
└── README.md                                           ← This file
```

---

*Pick any dataset. Train anything. Use responsibly.*