# πŸ” General-Purpose LLM Fine-Tuning Collection β€” Google Colab Free Tier (T4)
A curated collection of **production-ready Colab notebooks** for fine-tuning state-of-the-art small LLMs on **any domain** using **Google Colab Free Tier (T4, 16GB VRAM)**.
> Pick your model, pick your dataset, click run. Zero-config fine-tuning.
---
## 📚 Notebooks
| Notebook | Model | Size | T4 Batch | Est. Time | Status |
|----------|-------|------|----------|-----------|--------|
| [**Qwen3-4B Ultimate**](./EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb) | `unsloth/Qwen3-4B-Instruct-2507` | 3.3GB 4-bit | **4** | ~3–4 hrs | ✅ Recommended |
| [**LFM2.5 Ultimate**](./EthicalHacking_LFM2.5_Ultimate_Colab.ipynb) | `unsloth/LFM2.5-1.2B-Instruct` | ~1GB 4-bit | **8** | ~1–2 hrs | ✅ Fastest |
| [**Gemma-4 E2B**](./EthicalHacking_Gemma4_E2B_Colab.ipynb) | `unsloth/gemma-4-E2B-it` | ~7.6GB 4-bit | **1** | ~6–8 hrs | ⚠️ Tight VRAM |
| **Bonsai (PrismML)** | See [limitations](./BONSAI_LIMITATIONS.md) | ~1GB 1-bit | N/A | N/A | ❌ Not supported |
---
## 🥇 Model Comparison (May 2026)
| Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
|-------|--------|-----------|----------|-------|----------|---------------|---------|-------|
| **Qwen3-4B** | 4B | 3.3 GB | Easy (12GB free) | 4 | 69.6 | **35.1** | 32K | Best coding/reasoning. Thinking toggle. |
| **LFM2.5-1.2B** | 1.2B | **~1 GB** | Huge headroom | 8 | – | – | **128K** | Fastest training. Liquid AI edge model. |
| **Gemma-4 E2B** | ~2B dense | 7.6 GB | Tight (8GB free) | 1 | – | – | 256K | Dense (not MoE). Google edge model. |
| Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | – | – | 1-bit ternary. **Cannot train with Unsloth.** |
**Recommendation:** Start with **Qwen3-4B** for best accuracy, or **LFM2.5** for fastest experimentation.
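For orientation, loading the recommended model in 4-bit on a T4 with Unsloth looks roughly like the sketch below. The sequence length and LoRA hyperparameters here are illustrative placeholders, not the notebooks' exact tuned values:

```python
# Minimal sketch (not the notebooks' exact code): load Qwen3-4B in 4-bit on a T4.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Instruct-2507",
    max_seq_length=2048,   # illustrative; raise if VRAM allows
    load_in_4bit=True,     # keeps the 4B base model at ~3.3 GB
)

# Attach LoRA adapters so only a small fraction of weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # illustrative rank; the notebooks set their own value
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
```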
---
## 📊 Dataset Selection – 8 Built-in Choices
Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.
| Choice | Dataset | Rows | Format | Best For | Language |
|--------|---------|------|--------|----------|----------|
| `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153K→50K | system/user/assistant | **Ethical hacking, pentesting education** | English |
| `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200K→50K | messages (role/content) | General conversation, chatbot tuning | English |
| `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+→50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
| `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90K→50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
| `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104K→50K | conversations (human/gpt) | German language fine-tuning | **German** |
| `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi 27B) | ~153K→50K | conversations (human/gpt) | Hindi language fine-tuning | **Hindi** |
| `code_corpus` | **[Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training)** | 240K→50K | text (code files with domain/repo/lang metadata) | **Code completion, coding assistant** | Multi (20 domains: Rust, Python, C++, Kotlin, Flutter, game engines, web frameworks, ethical hacking repos, etc.) |
| `custom_mix` | Your combination | – | varies | Combine datasets for hybrid tuning | Mixed |
### How to Switch Datasets (in any notebook)
```python
# In Cell 4 β€” uncomment ONE line:
DATASET_CHOICE = "cybersecurity" # ← Default (defensive security)
# DATASET_CHOICE = "ultrachat" # ← General chat
# DATASET_CHOICE = "openhermes" # ← Reasoning & coding
# DATASET_CHOICE = "sharegpt_en" # ← English dialogue
# DATASET_CHOICE = "sharegpt_de" # ← German
# DATASET_CHOICE = "sharegpt_hi" # ← Hindi
# DATASET_CHOICE = "code_corpus" # ← Code completion (Rust, Python, C++, etc.)
# DATASET_CHOICE = "custom_mix" # ← Mix multiple
```
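Under the hood, each choice maps to a Hugging Face dataset ID plus a row cap, and the notebooks subsample down to roughly 50K rows. A minimal sketch of that mapping, assuming the `datasets` library (the dictionary, splits, and variable names below are illustrative, not the notebooks' exact code):

```python
from datasets import load_dataset

# Illustrative mapping: choice -> (dataset_id, split, max_rows).
# Splits here are assumptions based on each dataset card.
DATASET_MAP = {
    "ultrachat":   ("HuggingFaceH4/ultrachat_200k", "train_sft", 50_000),
    "openhermes":  ("teknium/OpenHermes-2.5", "train", 50_000),
    "code_corpus": ("krystv/code-corpus-llm-training", "train", 50_000),
}

dataset_id, split, max_rows = DATASET_MAP[DATASET_CHOICE]
raw = load_dataset(dataset_id, split=split)
raw = raw.shuffle(seed=42).select(range(min(max_rows, len(raw))))  # cap at ~50K rows
```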
### Code Corpus Dataset Details
The [Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training) dataset contains **240,378 code files** from top open-source repositories across 20 domains:
| Domain | Examples |
|--------|----------|
| `web_ui` | Web frameworks, UI components |
| `cpp` | C++ systems code |
| `kotlin_android` | Android apps |
| `rust` | Rust systems (e.g., actix-web) |
| `python` | Python libraries |
| `ethical_hacking` | Security tools, pentesting repos |
| `game_engines` | Game development |
| `ui_ux_design` | Design systems |
Each example contains `text` (the full code file) plus `domain`, `repo`, `language`, `file_path`, and `size_chars` metadata. The notebooks convert each code file into a user/assistant conversation: the user turn asks to explain or improve the code, and the assistant turn supplies the code.
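A sketch of that conversion, using the column names listed above (the helper name and the wording of the user turn are illustrative, not the notebooks' exact prompt):

```python
# Illustrative conversion of one Code Corpus row into a chat-style training example.
def code_row_to_messages(row):
    user_turn = (
        f"Explain and, where useful, improve this {row['language']} file "
        f"from the {row['repo']} repository ({row['domain']} domain)."
    )
    return {
        "messages": [
            {"role": "user", "content": user_turn},
            {"role": "assistant", "content": row["text"]},  # the full code file
        ]
    }

train_dataset = train_dataset.map(code_row_to_messages,
                                  remove_columns=train_dataset.column_names)
```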
### Mixing Datasets (custom_mix)
```python
CUSTOM_DATASETS = [
    # (dataset_id, split, num_rows, format_type)
    # format_type: "messages" | "conversations" | "text"
    ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
    ("krystv/code-corpus-llm-training", "train", 20000, "text"),
    ("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
]
```
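A sketch of how such a mix could be assembled with `datasets.concatenate_datasets`, assuming each source has already been normalized to the shared `text` column described in the Technical section below (the loop and variable names are illustrative):

```python
from datasets import load_dataset, concatenate_datasets

parts = []
for dataset_id, split, num_rows, format_type in CUSTOM_DATASETS:
    ds = load_dataset(dataset_id, split=split)
    ds = ds.shuffle(seed=42).select(range(min(num_rows, len(ds))))
    # Normalize each source to a single "text" column here, depending on
    # format_type ("messages" | "conversations" | "text"), before mixing.
    parts.append(ds)

train_dataset = concatenate_datasets(parts).shuffle(seed=42)
```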
---
## πŸš€ How to Use (Any Notebook)
1. Open the notebook in **Google Colab** (click the notebook link above)
2. Runtime → Change runtime type → **T4 GPU**
3. In **Cell 4**, uncomment your desired `DATASET_CHOICE`
4. Run cells top-to-bottom
5. (Optional) Set your HF token in Cell 2 to push the LoRA adapter (see the sketch below)
6. The last cells show **inference demos**
**Zero-config:** All hyperparameters are tuned for T4. Just pick a dataset and click ▶️.
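Step 5 above refers to pushing the trained LoRA adapter to the Hub. A minimal sketch of what that looks like after training, assuming a write-enabled token (the repo id is a placeholder):

```python
from huggingface_hub import login

login(token="hf_...")  # must be a WRITE token, see the cheat-sheet below

# Push only the LoRA adapter (much smaller than the base model) plus the tokenizer.
model.push_to_hub("your-username/your-lora-adapter")      # placeholder repo id
tokenizer.push_to_hub("your-username/your-lora-adapter")
```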
---
## 🔧 Technical: Why `dataset_text_field="text"`?
Unsloth's patched `SFTTrainer` has issues with `formatting_func`. The clean fix is to pre-convert the chat messages into a plain `text` column:
```python
# Pre-convert messages → text using dataset.map(batched=True)
def convert_messages_to_text(examples):
    texts = []
    for msgs in examples["messages"]:
        text = tokenizer.apply_chat_template(msgs, tokenize=False)
        texts.append(text)
    return {"text": texts}

train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=["messages"])

# Then pass dataset_text_field="text" to SFTTrainer
trainer = SFTTrainer(..., dataset_text_field="text")
```
All notebooks handle format auto-detection (Fenrir, UltraChat, OpenHermes, ShareGPT, Code Corpus) automatically.
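A simplified view of how that auto-detection can work, keyed off the column names each dataset exposes (hypothetical helper, not the notebooks' exact logic):

```python
# Hypothetical sketch: detect the dataset format from its column names.
def detect_format(dataset):
    cols = set(dataset.column_names)
    if "messages" in cols:        # Fenrir / UltraChat style: [{"role", "content"}, ...]
        return "messages"
    if "conversations" in cols:   # OpenHermes / ShareGPT style: [{"from", "value"}, ...]
        return "conversations"
    if "text" in cols:            # Code Corpus style: raw code/text
        return "text"
    raise ValueError(f"Unrecognized dataset columns: {cols}")
```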
---
## ⚠️ T4 VRAM Cheat-Sheet
| Symptom | Fix |
|---------|-----|
| `CUDA out of memory` | Lower `MAX_SEQ_LENGTH` to 2048; set `BATCH_SIZE=1`; set `PACKING=False` |
| Still OOM | Enable `use_rslora=True` in LoRA config |
| Training very slow | Increase `BATCH_SIZE` if VRAM allows; enable `PACKING=True` |
| Loss not decreasing | Try `LEARNING_RATE=5e-4` or train for 2 epochs |
| Can't push to Hub | Run `login(token=...)` with a **WRITE** token |
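Put together, a conservative T4 configuration that follows the fixes above might look like this (values are illustrative starting points, not the notebooks' tuned defaults):

```python
from transformers import TrainingArguments

MAX_SEQ_LENGTH = 2048   # lower this first if you hit CUDA OOM
BATCH_SIZE = 1          # smallest batch; compensate with gradient accumulation
PACKING = False         # disable packing when memory is tight

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=8,   # keeps the effective batch size reasonable
    learning_rate=2e-4,              # try 5e-4 if the loss plateaus (see table above)
    num_train_epochs=1,              # or 2 if the loss is not decreasing
    fp16=True,                       # the T4 has no bfloat16 support
    logging_steps=10,
)
```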
---
## 📖 References
| Resource | Link |
|----------|------|
| **Qwen3-4B-Instruct-2507** | https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507 |
| **LFM2.5-1.2B-Instruct** | https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct |
| **Gemma 4 E2B** | https://huggingface.co/google/gemma-4-E2B-it |
| **Unsloth Docs** | https://unsloth.ai/docs |
| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
| **Code Corpus LLM Training** | https://huggingface.co/datasets/krystv/code-corpus-llm-training |
| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |
---
## 📂 Repository Structure
```
asdf98/ethical-hacking-llm-colab/
├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb       ← Best accuracy
├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb         ← Fastest training
├── EthicalHacking_Gemma4_E2B_Colab.ipynb              ← Google model (tight VRAM)
├── EthicalHacking_Qwen3-8B_Colab.ipynb                ← Simpler backup (8B)
├── EthicalHacking_MultiModel_Comparison_Colab.ipynb   ← Compare models
├── BONSAI_LIMITATIONS.md                              ← Why Bonsai can't be fine-tuned
└── README.md                                          ← This file
```
---
*Pick any dataset. Train anything. Use responsibly.*