
🔐 General-Purpose LLM Fine-Tuning Collection — Google Colab Free Tier (T4)

A curated collection of production-ready Colab notebooks for fine-tuning state-of-the-art small LLMs on any domain using Google Colab Free Tier (T4, 16GB VRAM).

Pick your model, pick your dataset, click run. Zero-config fine-tuning.


📚 Notebooks

| Notebook | Model | Size | T4 Batch | Est. Time | Status |
|---|---|---|---|---|---|
| Qwen3-4B Ultimate | unsloth/Qwen3-4B-Instruct-2507 | 3.3 GB 4-bit | 4 | ~3–4 hrs | ✅ Recommended |
| LFM2.5 Ultimate | unsloth/LFM2.5-1.2B-Instruct | ~1 GB 4-bit | 8 | ~1–2 hrs | ✅ Fastest |
| Gemma-4 E2B | unsloth/gemma-4-E2B-it | ~7.6 GB 4-bit | 1 | ~6–8 hrs | ⚠️ Tight VRAM |
| Bonsai (PrismML) | See limitations | ~1 GB 1-bit | N/A | N/A | ❌ Not supported |

🥇 Model Comparison (May 2026)

| Model | Params | 4-bit Size | VRAM Fit | Batch | MMLU-Pro | LiveCodeBench | Context | Notes |
|---|---|---|---|---|---|---|---|---|
| Qwen3-4B | 4B | 3.3 GB | Easy (12 GB free) | 4 | 69.6 | 35.1 | 32K | Best coding/reasoning. Thinking toggle. |
| LFM2.5-1.2B | 1.2B | ~1 GB | Huge headroom | 8 | — | — | 128K | Fastest training. Liquid AI edge model. |
| Gemma-4 E2B | ~2B dense | 7.6 GB | Tight (8 GB free) | 1 | — | — | 256K | Dense (not MoE). Google edge model. |
| Bonsai-8B | 8B | ~1 GB packed | N/A | N/A | ~30 | — | — | 1-bit ternary. Cannot train with Unsloth. |

Recommendation: Start with Qwen3-4B for best accuracy, or LFM2.5 for fastest experimentation.


📊 Dataset Selection — 8 Built-in Choices

Every notebook includes a DATASET_CHOICE variable. Just uncomment one line to pick your data.

| Choice | Dataset | Rows | Format | Best For | Language |
|---|---|---|---|---|---|
| cybersecurity | Fenrir v2.1 + Trendyol | 153K→50K | system/user/assistant | Ethical hacking, pentesting education | English |
| ultrachat | UltraChat 200K (SFT) | 200K→50K | messages (role/content) | General conversation, chatbot tuning | English |
| openhermes | OpenHermes 2.5 | 1M+→50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
| sharegpt_en | ShareGPT (English) | ~90K→50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
| sharegpt_de | ShareGPT (German) | ~104K→50K | conversations (human/gpt) | German language fine-tuning | German |
| sharegpt_hi | ShareGPT (Hindi 27B) | ~153K→50K | conversations (human/gpt) | Hindi language fine-tuning | Hindi |
| code_corpus | Code Corpus LLM Training | 240K→50K | text (code files with domain/repo/lang metadata) | Code completion, coding assistant | Multi (20 domains: Rust, Python, C++, Kotlin, Flutter, game engines, web frameworks, ethical hacking repos, etc.) |
| custom_mix | Your combination | — | varies | Combine datasets for hybrid tuning | Mixed |

How to Switch Datasets (in any notebook)

# In Cell 4 — uncomment ONE line:

DATASET_CHOICE = "cybersecurity"    # ← Default (defensive security)
# DATASET_CHOICE = "ultrachat"      # ← General chat
# DATASET_CHOICE = "openhermes"     # ← Reasoning & coding
# DATASET_CHOICE = "sharegpt_en"    # ← English dialogue
# DATASET_CHOICE = "sharegpt_de"    # ← German
# DATASET_CHOICE = "sharegpt_hi"    # ← Hindi
# DATASET_CHOICE = "code_corpus"    # ← Code completion (Rust, Python, C++, etc.)
# DATASET_CHOICE = "custom_mix"     # ← Mix multiple

Code Corpus Dataset Details

The Code Corpus LLM Training dataset contains 240,378 code files from top open-source repositories across 20 domains:

| Domain | Examples |
|---|---|
| web_ui | Web frameworks, UI components |
| cpp | C++ systems code |
| kotlin_android | Android apps |
| rust | Rust systems (e.g., actix-web) |
| python | Python libraries |
| ethical_hacking | Security tools, pentesting repos |
| game_engines | Game development |
| ui_ux_design | Design systems |

Each example has: text (the full code file), domain, repo, language, file_path, size_chars. The notebook converts each code file into a user/assistant conversation: the user asks to explain or improve the code, and the assistant responds with it.
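The conversion described above can be sketched roughly like this. Field names (text, domain, language) come from the dataset description; the prompt wording is illustrative, not the notebook's exact template:

```python
# Sketch: turn one Code Corpus row into a user/assistant conversation.
# Field names come from the dataset card; the prompt text is a
# hypothetical stand-in for the notebook's actual template.
def code_row_to_conversation(row):
    prompt = (
        f"Explain and, where possible, improve this {row['language']} code "
        f"from the {row['domain']} domain:"
    )
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": row["text"]},
        ]
    }
```

Applied via dataset.map, this yields the same messages format as the chat datasets, so the rest of the pipeline is shared.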

Mixing Datasets (custom_mix)

CUSTOM_DATASETS = [
    # (dataset_id, split, num_rows, format_type)
    # format_type: "messages" | "conversations" | "text"
    ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
    ("krystv/code-corpus-llm-training", "train", 20000, "text"),
    ("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
]
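Since each entry in CUSTOM_DATASETS carries its own format_type, the mix only works if every row is normalized to one schema first. A minimal sketch of that dispatch, assuming rows are already loaded in memory (the helper name is hypothetical; the format tags mirror the tuple values above):

```python
# Sketch: normalize a row from any of the three format_type values
# into one list of {"role": ..., "content": ...} dicts.
def convert_row(row, format_type):
    if format_type == "messages":
        # Already role/content dicts (e.g. Fenrir, UltraChat).
        return row["messages"]
    if format_type == "conversations":
        # ShareGPT/OpenHermes style: "from" = human/gpt, "value" = text.
        role_map = {"human": "user", "gpt": "assistant", "system": "system"}
        return [
            {"role": role_map[turn["from"]], "content": turn["value"]}
            for turn in row["conversations"]
        ]
    if format_type == "text":
        # Plain code/text corpus: wrap as a single exchange.
        return [
            {"role": "user", "content": "Explain this code:"},
            {"role": "assistant", "content": row["text"]},
        ]
    raise ValueError(f"Unknown format_type: {format_type}")
```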

🚀 How to Use (Any Notebook)

  1. Open the notebook in Google Colab (click the notebook link above)
  2. Runtime β†’ Change runtime type β†’ T4 GPU
  3. In Cell 4, uncomment your desired DATASET_CHOICE
  4. Run cells top-to-bottom
  5. (Optional) Set your HF token in Cell 2 to push the LoRA adapter
  6. The last cells show inference demos

Zero-config: All hyperparameters are tuned for T4. Just pick a dataset and click ▢️.


🔧 Technical: Why dataset_text_field="text"?

Unsloth's SFTTrainer does not reliably handle formatting_func, so the notebooks pre-convert chat messages into a plain text column instead:

# Pre-convert messages β†’ text using dataset.map(batched=True)
def convert_messages_to_text(examples):
    texts = []
    for msgs in examples["messages"]:
        text = tokenizer.apply_chat_template(msgs, tokenize=False)
        texts.append(text)
    return {"text": texts}

train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=["messages"])

# Then pass dataset_text_field="text" to SFTTrainer
trainer = SFTTrainer(..., dataset_text_field="text")

All notebooks handle format auto-detection (Fenrir, UltraChat, OpenHermes, ShareGPT, Code Corpus) automatically.
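One plausible way such auto-detection works is to inspect the loaded dataset's column names; this is a sketch of the idea, not the notebooks' exact logic:

```python
def detect_format(column_names):
    # Guess the dataset format from its columns (sketch only).
    if "messages" in column_names:
        return "messages"        # Fenrir, UltraChat
    if "conversations" in column_names:
        return "conversations"   # OpenHermes, ShareGPT
    if "text" in column_names:
        return "text"            # Code Corpus, plain corpora
    raise ValueError(f"Unrecognized columns: {column_names}")
```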


⚠️ T4 VRAM Cheat-Sheet

| Symptom | Fix |
|---|---|
| CUDA out of memory | Lower MAX_SEQ_LENGTH to 2048; set BATCH_SIZE=1; set PACKING=False |
| Still OOM | Enable use_rslora=True in LoRA config |
| Training very slow | Increase BATCH_SIZE if VRAM allows; enable PACKING=True |
| Loss not decreasing | Try LEARNING_RATE=5e-4 or train for 2 epochs |
| Can't push to Hub | Run login(token=...) with a WRITE token |
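Applied together, the out-of-memory fixes above amount to a settings block like this; the names mirror the notebook variables from the cheat-sheet, and GRAD_ACCUM is a hypothetical addition to keep the effective batch size up:

```python
# Low-VRAM fallback settings per the cheat-sheet above.
MAX_SEQ_LENGTH = 2048   # shorter sequences cut activation memory
BATCH_SIZE = 1          # smallest per-device batch
PACKING = False         # disable sequence packing under memory pressure
GRAD_ACCUM = 8          # hypothetical: preserves effective batch size
```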


📂 Repository Structure

asdf98/ethical-hacking-llm-colab/
├── EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb   ← Best accuracy
├── EthicalHacking_LFM2.5_Ultimate_Colab.ipynb     ← Fastest training
├── EthicalHacking_Gemma4_E2B_Colab.ipynb          ← Google model (tight VRAM)
├── EthicalHacking_Qwen3-8B_Colab.ipynb            ← Simpler backup (8B)
├── EthicalHacking_MultiModel_Comparison_Colab.ipynb ← Compare models
├── BONSAI_LIMITATIONS.md                          ← Why Bonsai can't be fine-tuned
└── README.md                                      ← This file

Pick any dataset. Train anything. Use responsibly.
