# General-Purpose LLM Fine-Tuning Collection for Google Colab Free Tier (T4)

A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **any domain** using **Google Colab Free Tier (T4, 16GB VRAM)**.

> Pick your model, pick your dataset, click run. Zero-config fine-tuning.

---
## Notebooks

| Notebook | Model | Size | T4 Batch | Est. Time | Status |
|----------|-------|------|----------|-----------|--------|
| [**Qwen3-4B Ultimate**](./EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb) | `unsloth/Qwen3-4B-Instruct-2507` | 3.3GB 4-bit | **4** | ~3–4 hrs | ✅ Recommended |
| [**LFM2.5 Ultimate**](./EthicalHacking_LFM2.5_Ultimate_Colab.ipynb) | `unsloth/LFM2.5-1.2B-Instruct` | ~1GB 4-bit | **8** | ~1–2 hrs | ✅ Fastest |
| [**Gemma-4 E2B**](./EthicalHacking_Gemma4_E2B_Colab.ipynb) | `unsloth/gemma-4-E2B-it` | ~7.6GB 4-bit | **1** | ~6–8 hrs | ⚠️ Tight VRAM |
| **Bonsai (PrismML)** | See [limitations](./BONSAI_LIMITATIONS.md) | ~1GB 1-bit | N/A | N/A | ❌ Not supported |

---
## Model Comparison (May 2026)

| Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
|-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
| **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning. Thinking toggle. |
| **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | – | – | **128K** | Fastest training. Liquid AI edge model. |
| **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | – | – | 256K | Dense (not MoE). Google edge model. |
| Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | – | – | 1-bit ternary. **Cannot train with Unsloth.** |

**Recommendation:** Start with **Qwen3-4B** for best accuracy, or **LFM2.5** for fastest experimentation.

---
## Dataset Selection: 8 Built-in Choices

Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.

| Choice | Dataset | Rows (total → used) | Format | Best For | Language |
|--------|---------|---------------------|--------|----------|----------|
| `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153K → 50K | system/user/assistant | **Ethical hacking, pentesting education** | English |
| `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200K → 50K | messages (role/content) | General conversation, chatbot tuning | English |
| `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+ → 50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
| `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90K → 50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
| `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104K → 50K | conversations (human/gpt) | German language fine-tuning | **German** |
| `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi) | ~153K → 50K | conversations (human/gpt) | Hindi language fine-tuning | **Hindi** |
| `code_corpus` | **[Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training)** | 240K → 50K | text (code files with domain/repo/lang metadata) | **Code completion, coding assistant** | Multi (20 domains: Rust, Python, C++, Kotlin, Flutter, game engines, web frameworks, ethical hacking repos, etc.) |
| `custom_mix` | Your combination | – | varies | Combine datasets for hybrid tuning | Mixed |

### How to Switch Datasets (in any notebook)

```python
# In Cell 4, uncomment ONE line:

DATASET_CHOICE = "cybersecurity"    # ← Default (defensive security)
# DATASET_CHOICE = "ultrachat"      # ← General chat
# DATASET_CHOICE = "openhermes"     # ← Reasoning & coding
# DATASET_CHOICE = "sharegpt_en"    # ← English dialogue
# DATASET_CHOICE = "sharegpt_de"    # ← German
# DATASET_CHOICE = "sharegpt_hi"    # ← Hindi
# DATASET_CHOICE = "code_corpus"    # ← Code completion (Rust, Python, C++, etc.)
# DATASET_CHOICE = "custom_mix"     # ← Mix multiple
```

### Code Corpus Dataset Details

The [Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training) dataset contains **240,378 code files** from top open-source repositories across 20 domains:

| Domain | Examples |
|--------|----------|
| `web_ui` | Web frameworks, UI components |
| `cpp` | C++ systems code |
| `kotlin_android` | Android apps |
| `rust` | Rust systems (e.g., actix-web) |
| `python` | Python libraries |
| `ethical_hacking` | Security tools, pentesting repos |
| `game_engines` | Game development |
| `ui_ux_design` | Design systems |

Each example has the fields `text` (the full code file), `domain`, `repo`, `language`, `file_path`, and `size_chars`. The notebook converts each code file into a user/assistant conversation: the user asks to explain or improve the code, and the assistant replies with the code.
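
A minimal sketch of that conversion (the prompt wording and the `build_code_conversation` helper below are illustrative assumptions, not the notebooks' exact code):

```python
from datasets import load_dataset

# Illustrative sketch only: the notebooks' actual prompt wording may differ.
code_dataset = load_dataset("krystv/code-corpus-llm-training", split="train")

def build_code_conversation(example):
    # Hypothetical prompt template built from the dataset's metadata fields.
    user_msg = (
        f"Explain and, where useful, improve this {example['language']} file "
        f"from the `{example['domain']}` domain (repo: {example['repo']}):"
    )
    return {
        "messages": [
            {"role": "user", "content": user_msg},
            {"role": "assistant", "content": example["text"]},
        ]
    }

code_dataset = code_dataset.map(build_code_conversation)
```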

### Mixing Datasets (custom_mix)

```python
CUSTOM_DATASETS = [
    # (dataset_id, split, num_rows, format_type)
    # format_type: "messages" | "conversations" | "text"
    ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
    ("krystv/code-corpus-llm-training", "train", 20000, "text"),
    ("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
]
```
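
Under the hood, a mix like this can be materialized with the `datasets` library. A minimal sketch, assuming a `to_text(example, format_type)` helper (hypothetical here) that renders each layout to a plain string:

```python
from datasets import load_dataset, concatenate_datasets

def load_custom_mix(custom_datasets, to_text):
    """Load each source, trim it to num_rows, normalize it to a shared
    {"text": ...} schema, and concatenate everything into one dataset."""
    parts = []
    for dataset_id, split, num_rows, format_type in custom_datasets:
        ds = load_dataset(dataset_id, split=split)
        ds = ds.select(range(min(num_rows, len(ds))))
        # Reduce every source to a single "text" column so that
        # concatenate_datasets sees identical features across all parts.
        ds = ds.map(
            lambda ex: {"text": to_text(ex, format_type)},
            remove_columns=ds.column_names,
        )
        parts.append(ds)
    return concatenate_datasets(parts)

# train_dataset = load_custom_mix(CUSTOM_DATASETS, to_text)
```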

---

## How to Use (Any Notebook)

1. Open the notebook in **Google Colab** (click the notebook link above)
2. Runtime → Change runtime type → **T4 GPU**
3. In **Cell 4**, uncomment your desired `DATASET_CHOICE`
4. Run cells top-to-bottom
5. (Optional) Set your HF token in **Cell 2** to push the LoRA adapter (see the sketch after this list)
6. The last cells show **inference demos**
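
For step 5, the push itself looks roughly like this (a sketch; `REPO_ID` is a placeholder, and `model`/`tokenizer` come from the notebook's training cells):

```python
from huggingface_hub import login

# Requires a token with WRITE scope (see the cheat-sheet's last row below).
login(token="hf_...")  # paste your token here

# Push only the small LoRA adapter, not the full base model.
REPO_ID = "your-username/your-lora-adapter"  # placeholder repo name
model.push_to_hub(REPO_ID)
tokenizer.push_to_hub(REPO_ID)
```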

**Zero-config:** All hyperparameters are tuned for T4. Just pick a dataset and click ▶️.

---

## Technical: Why `dataset_text_field="text"`?

Unsloth's `SFTTrainer` has known issues with `formatting_func`. The clean fix is to pre-render every sample into a plain `text` column:
```python
# Pre-convert messages → text using dataset.map(batched=True)
def convert_messages_to_text(examples):
    texts = []
    for msgs in examples["messages"]:
        # Render the chat into the model's prompt format as a plain string.
        text = tokenizer.apply_chat_template(msgs, tokenize=False)
        texts.append(text)
    return {"text": texts}

train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=["messages"])

# Then pass dataset_text_field="text" to SFTTrainer
trainer = SFTTrainer(..., dataset_text_field="text")
```

All notebooks auto-detect the dataset format (Fenrir, UltraChat, OpenHermes, ShareGPT, Code Corpus).
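
A minimal sketch of how such detection can work, keying off the column names from the dataset table above (the notebooks' actual logic may differ):

```python
def detect_format(dataset):
    """Guess the conversation layout from a datasets.Dataset's columns."""
    cols = set(dataset.column_names)
    if "messages" in cols:       # UltraChat/Fenrir style: [{"role": ..., "content": ...}]
        return "messages"
    if "conversations" in cols:  # OpenHermes/ShareGPT style: [{"from": "human"/"gpt", "value": ...}]
        return "conversations"
    if "text" in cols:           # Code Corpus style: raw code files
        return "text"
    raise ValueError(f"Unrecognized dataset columns: {sorted(cols)}")
```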

---

## ⚠️ T4 VRAM Cheat-Sheet

| Symptom | Fix |
|---------|-----|
| `CUDA out of memory` | Lower `MAX_SEQ_LENGTH` to 2048; set `BATCH_SIZE=1`; set `PACKING=False` |
| Still OOM | Enable `use_rslora=True` in the LoRA config |
| Training very slow | Increase `BATCH_SIZE` if VRAM allows; enable `PACKING=True` |
| Loss not decreasing | Try `LEARNING_RATE=5e-4` or train for 2 epochs |
| Can't push to Hub | Run `login(token=...)` with a **WRITE** token |
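
Applied together, the OOM rows translate to a config cell like this (variable names follow the table above; treat it as a sketch, not the notebooks' exact cell):

```python
# Fallback settings for a tight T4 (variable names as used in the notebooks).
MAX_SEQ_LENGTH = 2048   # shorter sequences shrink activation memory
BATCH_SIZE = 1          # smallest per-device batch; raise it once stable
PACKING = False         # packing concatenates samples and raises peak VRAM

# If still OOM, the table suggests rank-stabilized LoRA; with Unsloth that is
# a keyword on the LoRA setup call, e.g.:
# model = FastLanguageModel.get_peft_model(model, ..., use_rslora=True)
```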

---

## References

| Resource | Link |
|----------|------|
| **Qwen3-4B-Instruct-2507** | https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 |
| **LFM2.5-1.2B-Instruct** | https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct |
| **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |
| **Unsloth Docs** | https://unsloth.ai/docs |
| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
| **Code Corpus LLM Training** | https://huggingface.co/datasets/krystv/code-corpus-llm-training |
| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |

---
## Repository Structure

```
asdf98/ethical-hacking-llm-colab/
├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb       ← Best accuracy
├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb         ← Fastest training
├── EthicalHacking_Gemma4_E2B_Colab.ipynb              ← Google model (tight VRAM)
├── EthicalHacking_Qwen3-8B_Colab.ipynb                ← Simpler backup (8B)
├── EthicalHacking_MultiModel_Comparison_Colab.ipynb   ← Compare models
├── BONSAI_LIMITATIONS.md                              ← Why Bonsai can't be fine-tuned
└── README.md                                          ← This file
```

---

*Pick any dataset. Train anything. Use responsibly.*