Upload EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb
Browse files
EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb
CHANGED
|
@@ -4,14 +4,14 @@
|
|
| 4 |
"cell_type": "markdown",
|
| 5 |
"metadata": {},
|
| 6 |
"source": [
|
| 7 |
-
"# π Ultimate
|
| 8 |
"\n",
|
| 9 |
"**π₯ Model:** [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) via Unsloth 4-bit \n",
|
| 10 |
"**π Why this model?** Highest coding/reasoning scores among sub-10B models (LiveCodeBench 35.1, MMLU-Pro 69.6). Only **3.3 GB** in 4-bit. \n",
|
| 11 |
-
"**π Datasets:** Your choice β
|
| 12 |
"**β‘ Framework:** Unsloth + TRL SFTTrainer β 2Γ faster, 70% less VRAM \n",
|
| 13 |
"\n",
|
| 14 |
-
"> β οΈ
|
| 15 |
"\n",
|
| 16 |
"---\n",
|
| 17 |
"\n",
|
|
@@ -127,20 +127,18 @@
|
|
| 127 |
"source": [
|
| 128 |
"## 4οΈβ£ π― CHOOSE YOUR DATASET(S)\n",
|
| 129 |
"\n",
|
| 130 |
-
"Uncomment **ONE** `DATASET_CHOICE` line to select your training data.
|
| 131 |
"\n",
|
| 132 |
-
"| Choice | Dataset |
|
| 133 |
"|--------|---------|------|--------|----------|\n",
|
| 134 |
-
"| `\"cybersecurity\"` | Fenrir v2.1 + Trendyol | 153K
|
| 135 |
-
"| `\"ultrachat\"` | UltraChat 200K
|
| 136 |
-
"| `\"openhermes\"` | OpenHermes 2.5 | 1M+
|
| 137 |
-
"| `\"sharegpt_en\"` | ShareGPT English | ~90K
|
| 138 |
-
"| `\"sharegpt_de\"` | ShareGPT German | ~104K
|
| 139 |
-
"| `\"sharegpt_hi\"` | ShareGPT Hindi
|
| 140 |
-
"| `\"
|
| 141 |
-
"\
|
| 142 |
-
"\n",
|
| 143 |
-
"**To mix datasets**, set `DATASET_CHOICE = \"custom_mix\"` and configure `CUSTOM_DATASETS` below."
|
| 144 |
]
|
| 145 |
},
|
| 146 |
{
|
|
@@ -155,33 +153,19 @@
|
|
| 155 |
"# SELECT YOUR DATASET β UNCOMMENT ONE LINE\n",
|
| 156 |
"# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
|
| 157 |
"\n",
|
| 158 |
-
"# --- Option 1: Cybersecurity (default) ---\n",
|
| 159 |
"DATASET_CHOICE = \"cybersecurity\"\n",
|
| 160 |
"\n",
|
| 161 |
-
"# --- Option 2: General-purpose chat (UltraChat) ---\n",
|
| 162 |
"# DATASET_CHOICE = \"ultrachat\"\n",
|
| 163 |
-
"\n",
|
| 164 |
-
"# --- Option 3: Reasoning & coding (OpenHermes 2.5) ---\n",
|
| 165 |
"# DATASET_CHOICE = \"openhermes\"\n",
|
| 166 |
-
"\n",
|
| 167 |
-
"# --- Option 4: Multi-turn dialogue (ShareGPT English) ---\n",
|
| 168 |
"# DATASET_CHOICE = \"sharegpt_en\"\n",
|
| 169 |
-
"\n",
|
| 170 |
-
"# --- Option 5: German language (ShareGPT German) ---\n",
|
| 171 |
"# DATASET_CHOICE = \"sharegpt_de\"\n",
|
| 172 |
-
"\n",
|
| 173 |
-
"# --- Option 6: Hindi language (ShareGPT Hindi 27B) ---\n",
|
| 174 |
"# DATASET_CHOICE = \"sharegpt_hi\"\n",
|
| 175 |
-
"\n",
|
| 176 |
-
"# --- Option 7: Mix multiple datasets ---\n",
|
| 177 |
"# DATASET_CHOICE = \"custom_mix\"\n",
|
| 178 |
"\n",
|
| 179 |
-
"# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
|
| 180 |
-
"# CUSTOM MIX CONFIG (only used if DATASET_CHOICE = \"custom_mix\")\n",
|
| 181 |
-
"# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
|
| 182 |
"CUSTOM_DATASETS = [\n",
|
| 183 |
" # (\"dataset_name_or_id\", \"split\", rows_to_take, \"format_type\")\n",
|
| 184 |
-
" # format_type: \"messages\" | \"conversations\" | \"
|
| 185 |
" (\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", \"train\", 10000, \"messages\"),\n",
|
| 186 |
" (\"HuggingFaceH4/ultrachat_200k\", \"train_sft\", 20000, \"messages\"),\n",
|
| 187 |
" (\"teknium/OpenHermes-2.5\", \"train\", 20000, \"conversations\"),\n",
|
|
@@ -196,8 +180,7 @@
|
|
| 196 |
"source": [
|
| 197 |
"## 5οΈβ£ Load, Convert & Pre-process Selected Dataset\n",
|
| 198 |
"\n",
|
| 199 |
-
"
|
| 200 |
-
"**No changes needed** β just run it after selecting your dataset above."
|
| 201 |
]
|
| 202 |
},
|
| 203 |
{
|
|
@@ -223,13 +206,11 @@
|
|
| 223 |
" ]}\n",
|
| 224 |
"\n",
|
| 225 |
"def _convert_ultrachat(example):\n",
|
| 226 |
-
" # Already in messages format with role/content\n",
|
| 227 |
" return {\"messages\": example[\"messages\"]}\n",
|
| 228 |
"\n",
|
| 229 |
"def _convert_conversations(example):\n",
|
| 230 |
-
" # OpenHermes / ShareGPT style: [{from: 'human'/'gpt', value: '...'}]\n",
|
| 231 |
" msgs = []\n",
|
| 232 |
-
" system_prompt = example.get(\"system_prompt\") or example.get(\"system\", \"\")\n",
|
| 233 |
" if system_prompt:\n",
|
| 234 |
" msgs.append({\"role\": \"system\", \"content\": system_prompt})\n",
|
| 235 |
" for turn in example[\"conversations\"]:\n",
|
|
@@ -237,40 +218,52 @@
|
|
| 237 |
" msgs.append({\"role\": role, \"content\": turn[\"value\"]})\n",
|
| 238 |
" return {\"messages\": msgs}\n",
|
| 239 |
"\n",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 240 |
"# ===================== LOAD DATASET(S) =====================\n",
|
| 241 |
"all_datasets = []\n",
|
| 242 |
"\n",
|
| 243 |
"if DATASET_CHOICE == \"cybersecurity\":\n",
|
| 244 |
-
" print(\"π₯ Loading Fenrir v2.1...\")\n",
|
| 245 |
" ds1 = load_dataset(\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", split=\"train\")\n",
|
| 246 |
" ds1 = ds1.map(_convert_fenrir, remove_columns=ds1.column_names, batched=False)\n",
|
| 247 |
" all_datasets.append(ds1)\n",
|
| 248 |
-
"\n",
|
| 249 |
-
" print(\"π₯ Loading Trendyol Cybersecurity...\")\n",
|
| 250 |
" ds2 = load_dataset(\"Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset\", split=\"train\")\n",
|
| 251 |
" ds2 = ds2.map(_convert_trendyol, remove_columns=ds2.column_names, batched=False)\n",
|
| 252 |
" all_datasets.append(ds2)\n",
|
| 253 |
"\n",
|
| 254 |
"elif DATASET_CHOICE == \"ultrachat\":\n",
|
| 255 |
-
" print(\"π₯ Loading UltraChat 200K (train_sft split)...\")\n",
|
| 256 |
" ds = load_dataset(\"HuggingFaceH4/ultrachat_200k\", split=\"train_sft\")\n",
|
| 257 |
" ds = ds.map(_convert_ultrachat, remove_columns=ds.column_names, batched=False)\n",
|
| 258 |
" all_datasets.append(ds)\n",
|
| 259 |
"\n",
|
| 260 |
"elif DATASET_CHOICE == \"openhermes\":\n",
|
| 261 |
-
" print(\"π₯ Loading OpenHermes 2.5...\")\n",
|
| 262 |
" ds = load_dataset(\"teknium/OpenHermes-2.5\", split=\"train\")\n",
|
| 263 |
" ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
|
| 264 |
" all_datasets.append(ds)\n",
|
| 265 |
"\n",
|
| 266 |
"elif DATASET_CHOICE.startswith(\"sharegpt_\"):\n",
|
| 267 |
" split_map = {\"sharegpt_en\": \"english\", \"sharegpt_de\": \"german_4b_translated\", \"sharegpt_hi\": \"hindi_27b_translated\"}\n",
|
| 268 |
-
"
|
| 269 |
-
" print(f\"π₯ Loading ShareGPT multilingual ({split_name})...\")\n",
|
| 270 |
-
" ds = load_dataset(\"deepmage121/ShareGPT_multilingual\", split=split_name)\n",
|
| 271 |
" ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
|
| 272 |
" all_datasets.append(ds)\n",
|
| 273 |
"\n",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 274 |
"elif DATASET_CHOICE == \"custom_mix\":\n",
|
| 275 |
" for ds_id, split, n_rows, fmt in CUSTOM_DATASETS:\n",
|
| 276 |
" print(f\"π₯ Loading {ds_id} ({split}, {n_rows} rows)...\")\n",
|
|
@@ -281,6 +274,8 @@
|
|
| 281 |
" ds = ds.map(_convert_ultrachat, remove_columns=ds.column_names, batched=False)\n",
|
| 282 |
" elif fmt == \"conversations\":\n",
|
| 283 |
" ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
|
|
|
|
|
|
|
| 284 |
" else:\n",
|
| 285 |
" raise ValueError(f\"Unknown format: {fmt}\")\n",
|
| 286 |
" all_datasets.append(ds)\n",
|
|
@@ -288,21 +283,13 @@
|
|
| 288 |
"else:\n",
|
| 289 |
" raise ValueError(f\"Unknown DATASET_CHOICE: {DATASET_CHOICE}\")\n",
|
| 290 |
"\n",
|
| 291 |
-
"
|
| 292 |
-
"if len(all_datasets) == 1:\n",
|
| 293 |
-
" train_dataset = all_datasets[0]\n",
|
| 294 |
-
"else:\n",
|
| 295 |
-
" train_dataset = concatenate_datasets(all_datasets)\n",
|
| 296 |
-
"\n",
|
| 297 |
"print(f\"\\nπ COMBINED DATASET: {len(train_dataset)} rows\")\n",
|
| 298 |
"\n",
|
| 299 |
-
"# Show a random sample\n",
|
| 300 |
"sample = train_dataset[random.randint(0, len(train_dataset)-1)]\n",
|
| 301 |
-
"print(f\"
|
| 302 |
-
"for m in sample[\"messages\"]:\n",
|
| 303 |
-
" print(f\" {m['role']}: {m['content'][:100]}...\")\n",
|
| 304 |
"\n",
|
| 305 |
-
"# Subsample for speed\n",
|
| 306 |
"if len(train_dataset) > SAMPLE_SIZE:\n",
|
| 307 |
" train_dataset = train_dataset.shuffle(seed=3407).select(range(SAMPLE_SIZE))\n",
|
| 308 |
" print(f\"\\nπ SUBSAMPLED to {len(train_dataset)} rows\")\n",
|
|
@@ -316,9 +303,7 @@
|
|
| 316 |
"cell_type": "markdown",
|
| 317 |
"metadata": {},
|
| 318 |
"source": [
|
| 319 |
-
"## 6οΈβ£ Convert Messages β Text (Chat Template)
|
| 320 |
-
"\n",
|
| 321 |
-
"Uses `tokenizer.apply_chat_template` to convert structured messages into training text. No `formatting_func` needed."
|
| 322 |
]
|
| 323 |
},
|
| 324 |
{
|
|
@@ -330,25 +315,14 @@
|
|
| 330 |
"def convert_messages_to_text(examples):\n",
|
| 331 |
" texts = []\n",
|
| 332 |
" for msgs in examples[\"messages\"]:\n",
|
| 333 |
-
" text = tokenizer.apply_chat_template(\n",
|
| 334 |
-
" msgs,\n",
|
| 335 |
-
" tokenize=False,\n",
|
| 336 |
-
" add_generation_prompt=False,\n",
|
| 337 |
-
" )\n",
|
| 338 |
" texts.append(text)\n",
|
| 339 |
" return {\"text\": texts}\n",
|
| 340 |
"\n",
|
| 341 |
"print(\"π Converting messages to text...\")\n",
|
| 342 |
-
"train_dataset = train_dataset.map(\n",
|
| 343 |
-
" convert_messages_to_text,\n",
|
| 344 |
-
" batched=True,\n",
|
| 345 |
-
" remove_columns=[\"messages\"],\n",
|
| 346 |
-
" batch_size=100,\n",
|
| 347 |
-
")\n",
|
| 348 |
-
"\n",
|
| 349 |
"print(f\"β
Dataset pre-processed. Columns: {train_dataset.column_names}\")\n",
|
| 350 |
-
"print(f\"π Sample text length: {len(train_dataset[0]['text'])} chars\")
|
| 351 |
-
"print(f\"π First 200 chars:\\n{train_dataset[0]['text'][:200]}...\")"
|
| 352 |
]
|
| 353 |
},
|
| 354 |
{
|
|
@@ -438,19 +412,16 @@
|
|
| 438 |
"metadata": {},
|
| 439 |
"outputs": [],
|
| 440 |
"source": [
|
| 441 |
-
"# Save LoRA adapter (tiny, ~50-100 MB)\n",
|
| 442 |
"model.save_pretrained(\"./lora-adapter\")\n",
|
| 443 |
"tokenizer.save_pretrained(\"./lora-adapter\")\n",
|
| 444 |
"print(\"β
LoRA adapter saved\")\n",
|
| 445 |
"\n",
|
| 446 |
-
"# Merge & save full 16-bit model (~8 GB)\n",
|
| 447 |
"print(\"\\nπ Merging LoRA into base model...\")\n",
|
| 448 |
"merged_model = model.merge_and_unload()\n",
|
| 449 |
"merged_model.save_pretrained(\"./merged-model\")\n",
|
| 450 |
"tokenizer.save_pretrained(\"./merged-model\")\n",
|
| 451 |
"print(\"β
Merged model saved\")\n",
|
| 452 |
"\n",
|
| 453 |
-
"# Push to HF Hub (uncomment if logged in)\n",
|
| 454 |
"# model.push_to_hub(HUB_MODEL_ID)\n",
|
| 455 |
"# tokenizer.push_to_hub(HUB_MODEL_ID)"
|
| 456 |
]
|
|
@@ -463,8 +434,8 @@
|
|
| 463 |
"\n",
|
| 464 |
"| Mode | Use Case | Speed |\n",
|
| 465 |
"|------|----------|-------|\n",
|
| 466 |
-
"| `enable_thinking=True` | Deep reasoning, analysis
|
| 467 |
-
"| `enable_thinking=False` | Quick answers, coding
|
| 468 |
]
|
| 469 |
},
|
| 470 |
{
|
|
@@ -517,6 +488,7 @@
|
|
| 517 |
"| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |\n",
|
| 518 |
"| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |\n",
|
| 519 |
"| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |\n",
|
|
|
|
| 520 |
"| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |\n",
|
| 521 |
"| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |\n",
|
| 522 |
"| **Unsloth Docs** | https://unsloth.ai/docs |\n",
|
|
|
|
| 4 |
"cell_type": "markdown",
|
| 5 |
"metadata": {},
|
| 6 |
"source": [
|
| 7 |
+
"# π Ultimate LLM Fine-Tuning β Qwen3-4B (Colab Free Tier T4)\n",
|
| 8 |
"\n",
|
| 9 |
"**π₯ Model:** [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) via Unsloth 4-bit \n",
|
| 10 |
"**π Why this model?** Highest coding/reasoning scores among sub-10B models (LiveCodeBench 35.1, MMLU-Pro 69.6). Only **3.3 GB** in 4-bit. \n",
|
| 11 |
+
"**π Datasets:** Your choice β cybersecurity, general chat, multilingual, coding, or mix them! \n",
|
| 12 |
"**β‘ Framework:** Unsloth + TRL SFTTrainer β 2Γ faster, 70% less VRAM \n",
|
| 13 |
"\n",
|
| 14 |
+
"> β οΈ Default is cybersecurity. Pick general-purpose datasets for other domains.\n",
|
| 15 |
"\n",
|
| 16 |
"---\n",
|
| 17 |
"\n",
|
|
|
|
| 127 |
"source": [
|
| 128 |
"## 4οΈβ£ π― CHOOSE YOUR DATASET(S)\n",
|
| 129 |
"\n",
|
| 130 |
+
"Uncomment **ONE** `DATASET_CHOICE` line to select your training data.\n",
|
| 131 |
"\n",
|
| 132 |
+
"| Choice | Dataset | Rows | Format | Best For |\n",
|
| 133 |
"|--------|---------|------|--------|----------|\n",
|
| 134 |
+
"| `\"cybersecurity\"` | Fenrir v2.1 + Trendyol | 153Kβ50K | system/user/assistant | Ethical hacking education |\n",
|
| 135 |
+
"| `\"ultrachat\"` | UltraChat 200K SFT | 200Kβ50K | messages (user/assistant) | General conversation |\n",
|
| 136 |
+
"| `\"openhermes\"` | OpenHermes 2.5 | 1M+β50K | conversations (human/gpt) | Reasoning, coding |\n",
|
| 137 |
+
"| `\"sharegpt_en\"` | ShareGPT English | ~90Kβ50K | conversations (human/gpt) | Multi-turn dialogue |\n",
|
| 138 |
+
"| `\"sharegpt_de\"` | ShareGPT German | ~104Kβ50K | conversations (human/gpt) | German fine-tuning |\n",
|
| 139 |
+
"| `\"sharegpt_hi\"` | ShareGPT Hindi | ~153Kβ50K | conversations (human/gpt) | Hindi fine-tuning |\n",
|
| 140 |
+
"| `\"code_corpus\"` | [Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training) | 240Kβ50K | text (code files) | **Code completion, coding assistant** |\n",
|
| 141 |
+
"| `\"custom_mix\"` | Mix of your choice | β | varies | Combine datasets |"
|
|
|
|
|
|
|
| 142 |
]
|
| 143 |
},
|
| 144 |
{
|
|
|
|
| 153 |
"# SELECT YOUR DATASET β UNCOMMENT ONE LINE\n",
|
| 154 |
"# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ\n",
|
| 155 |
"\n",
|
|
|
|
| 156 |
"DATASET_CHOICE = \"cybersecurity\"\n",
|
| 157 |
"\n",
|
|
|
|
| 158 |
"# DATASET_CHOICE = \"ultrachat\"\n",
|
|
|
|
|
|
|
| 159 |
"# DATASET_CHOICE = \"openhermes\"\n",
|
|
|
|
|
|
|
| 160 |
"# DATASET_CHOICE = \"sharegpt_en\"\n",
|
|
|
|
|
|
|
| 161 |
"# DATASET_CHOICE = \"sharegpt_de\"\n",
|
|
|
|
|
|
|
| 162 |
"# DATASET_CHOICE = \"sharegpt_hi\"\n",
|
| 163 |
+
"# DATASET_CHOICE = \"code_corpus\"\n",
|
|
|
|
| 164 |
"# DATASET_CHOICE = \"custom_mix\"\n",
|
| 165 |
"\n",
|
|
|
|
|
|
|
|
|
|
| 166 |
"CUSTOM_DATASETS = [\n",
|
| 167 |
" # (\"dataset_name_or_id\", \"split\", rows_to_take, \"format_type\")\n",
|
| 168 |
+
" # format_type: \"messages\" | \"conversations\" | \"text\"\n",
|
| 169 |
" (\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", \"train\", 10000, \"messages\"),\n",
|
| 170 |
" (\"HuggingFaceH4/ultrachat_200k\", \"train_sft\", 20000, \"messages\"),\n",
|
| 171 |
" (\"teknium/OpenHermes-2.5\", \"train\", 20000, \"conversations\"),\n",
|
|
|
|
| 180 |
"source": [
|
| 181 |
"## 5οΈβ£ Load, Convert & Pre-process Selected Dataset\n",
|
| 182 |
"\n",
|
| 183 |
+
"Auto-detects dataset format and converts everything to standard `messages` β `text`."
|
|
|
|
| 184 |
]
|
| 185 |
},
|
| 186 |
{
|
|
|
|
| 206 |
" ]}\n",
|
| 207 |
"\n",
|
| 208 |
"def _convert_ultrachat(example):\n",
|
|
|
|
| 209 |
" return {\"messages\": example[\"messages\"]}\n",
|
| 210 |
"\n",
|
| 211 |
"def _convert_conversations(example):\n",
|
|
|
|
| 212 |
" msgs = []\n",
|
| 213 |
+
" system_prompt = example.get(\"system_prompt\", \"\") or example.get(\"system\", \"\")\n",
|
| 214 |
" if system_prompt:\n",
|
| 215 |
" msgs.append({\"role\": \"system\", \"content\": system_prompt})\n",
|
| 216 |
" for turn in example[\"conversations\"]:\n",
|
|
|
|
| 218 |
" msgs.append({\"role\": role, \"content\": turn[\"value\"]})\n",
|
| 219 |
" return {\"messages\": msgs}\n",
|
| 220 |
"\n",
|
| 221 |
+
"def _convert_code_corpus(example):\n",
|
| 222 |
+
" # Code Corpus: raw code text with domain/repo metadata in a user prompt + assistant format\n",
|
| 223 |
+
" # We treat the code block as an assistant response to a user asking about that code\n",
|
| 224 |
+
" code_text = example[\"text\"]\n",
|
| 225 |
+
" domain = example.get(\"domain\", \"code\")\n",
|
| 226 |
+
" repo = example.get(\"repo\", \"unknown\")\n",
|
| 227 |
+
" lang = example.get(\"language\", \"\")\n",
|
| 228 |
+
" user_prompt = f\"Here is a code snippet from the {domain} domain (repo: {repo}, language: {lang}). Please explain or improve it.\"\n",
|
| 229 |
+
" return {\"messages\": [\n",
|
| 230 |
+
" {\"role\": \"user\", \"content\": user_prompt},\n",
|
| 231 |
+
" {\"role\": \"assistant\", \"content\": code_text},\n",
|
| 232 |
+
" ]}\n",
|
| 233 |
+
"\n",
|
| 234 |
"# ===================== LOAD DATASET(S) =====================\n",
|
| 235 |
"all_datasets = []\n",
|
| 236 |
"\n",
|
| 237 |
"if DATASET_CHOICE == \"cybersecurity\":\n",
|
|
|
|
| 238 |
" ds1 = load_dataset(\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", split=\"train\")\n",
|
| 239 |
" ds1 = ds1.map(_convert_fenrir, remove_columns=ds1.column_names, batched=False)\n",
|
| 240 |
" all_datasets.append(ds1)\n",
|
|
|
|
|
|
|
| 241 |
" ds2 = load_dataset(\"Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset\", split=\"train\")\n",
|
| 242 |
" ds2 = ds2.map(_convert_trendyol, remove_columns=ds2.column_names, batched=False)\n",
|
| 243 |
" all_datasets.append(ds2)\n",
|
| 244 |
"\n",
|
| 245 |
"elif DATASET_CHOICE == \"ultrachat\":\n",
|
|
|
|
| 246 |
" ds = load_dataset(\"HuggingFaceH4/ultrachat_200k\", split=\"train_sft\")\n",
|
| 247 |
" ds = ds.map(_convert_ultrachat, remove_columns=ds.column_names, batched=False)\n",
|
| 248 |
" all_datasets.append(ds)\n",
|
| 249 |
"\n",
|
| 250 |
"elif DATASET_CHOICE == \"openhermes\":\n",
|
|
|
|
| 251 |
" ds = load_dataset(\"teknium/OpenHermes-2.5\", split=\"train\")\n",
|
| 252 |
" ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
|
| 253 |
" all_datasets.append(ds)\n",
|
| 254 |
"\n",
|
| 255 |
"elif DATASET_CHOICE.startswith(\"sharegpt_\"):\n",
|
| 256 |
" split_map = {\"sharegpt_en\": \"english\", \"sharegpt_de\": \"german_4b_translated\", \"sharegpt_hi\": \"hindi_27b_translated\"}\n",
|
| 257 |
+
" ds = load_dataset(\"deepmage121/ShareGPT_multilingual\", split=split_map[DATASET_CHOICE])\n",
|
|
|
|
|
|
|
| 258 |
" ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
|
| 259 |
" all_datasets.append(ds)\n",
|
| 260 |
"\n",
|
| 261 |
+
"elif DATASET_CHOICE == \"code_corpus\":\n",
|
| 262 |
+
" print(\"π₯ Loading Code Corpus LLM Training (krystv)...\")\n",
|
| 263 |
+
" ds = load_dataset(\"krystv/code-corpus-llm-training\", split=\"train\")\n",
|
| 264 |
+
" ds = ds.map(_convert_code_corpus, remove_columns=ds.column_names, batched=False)\n",
|
| 265 |
+
" all_datasets.append(ds)\n",
|
| 266 |
+
"\n",
|
| 267 |
"elif DATASET_CHOICE == \"custom_mix\":\n",
|
| 268 |
" for ds_id, split, n_rows, fmt in CUSTOM_DATASETS:\n",
|
| 269 |
" print(f\"π₯ Loading {ds_id} ({split}, {n_rows} rows)...\")\n",
|
|
|
|
| 274 |
" ds = ds.map(_convert_ultrachat, remove_columns=ds.column_names, batched=False)\n",
|
| 275 |
" elif fmt == \"conversations\":\n",
|
| 276 |
" ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
|
| 277 |
+
" elif fmt == \"text\":\n",
|
| 278 |
+
" ds = ds.map(_convert_code_corpus, remove_columns=ds.column_names, batched=False)\n",
|
| 279 |
" else:\n",
|
| 280 |
" raise ValueError(f\"Unknown format: {fmt}\")\n",
|
| 281 |
" all_datasets.append(ds)\n",
|
|
|
|
| 283 |
"else:\n",
|
| 284 |
" raise ValueError(f\"Unknown DATASET_CHOICE: {DATASET_CHOICE}\")\n",
|
| 285 |
"\n",
|
| 286 |
+
"train_dataset = concatenate_datasets(all_datasets) if len(all_datasets) > 1 else all_datasets[0]\n",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 287 |
"print(f\"\\nπ COMBINED DATASET: {len(train_dataset)} rows\")\n",
|
| 288 |
"\n",
|
|
|
|
| 289 |
"sample = train_dataset[random.randint(0, len(train_dataset)-1)]\n",
|
| 290 |
+
"print(f\"Sample roles: {[m['role'] for m in sample['messages']]}\")\n",
|
| 291 |
+
"for m in sample[\"messages\"]: print(f\" {m['role']}: {m['content'][:80]}...\")\n",
|
|
|
|
| 292 |
"\n",
|
|
|
|
| 293 |
"if len(train_dataset) > SAMPLE_SIZE:\n",
|
| 294 |
" train_dataset = train_dataset.shuffle(seed=3407).select(range(SAMPLE_SIZE))\n",
|
| 295 |
" print(f\"\\nπ SUBSAMPLED to {len(train_dataset)} rows\")\n",
|
|
|
|
| 303 |
"cell_type": "markdown",
|
| 304 |
"metadata": {},
|
| 305 |
"source": [
|
| 306 |
+
"## 6οΈβ£ Convert Messages β Text (Chat Template)"
|
|
|
|
|
|
|
| 307 |
]
|
| 308 |
},
|
| 309 |
{
|
|
|
|
| 315 |
"def convert_messages_to_text(examples):\n",
|
| 316 |
" texts = []\n",
|
| 317 |
" for msgs in examples[\"messages\"]:\n",
|
| 318 |
+
" text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)\n",
|
|
|
|
|
|
|
|
|
|
|
|
|
| 319 |
" texts.append(text)\n",
|
| 320 |
" return {\"text\": texts}\n",
|
| 321 |
"\n",
|
| 322 |
"print(\"π Converting messages to text...\")\n",
|
| 323 |
+
"train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=[\"messages\"], batch_size=100)\n",
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 324 |
"print(f\"β
Dataset pre-processed. Columns: {train_dataset.column_names}\")\n",
|
| 325 |
+
"print(f\"π Sample text length: {len(train_dataset[0]['text'])} chars\")"
|
|
|
|
| 326 |
]
|
| 327 |
},
|
| 328 |
{
|
|
|
|
| 412 |
"metadata": {},
|
| 413 |
"outputs": [],
|
| 414 |
"source": [
|
|
|
|
| 415 |
"model.save_pretrained(\"./lora-adapter\")\n",
|
| 416 |
"tokenizer.save_pretrained(\"./lora-adapter\")\n",
|
| 417 |
"print(\"β
LoRA adapter saved\")\n",
|
| 418 |
"\n",
|
|
|
|
| 419 |
"print(\"\\nπ Merging LoRA into base model...\")\n",
|
| 420 |
"merged_model = model.merge_and_unload()\n",
|
| 421 |
"merged_model.save_pretrained(\"./merged-model\")\n",
|
| 422 |
"tokenizer.save_pretrained(\"./merged-model\")\n",
|
| 423 |
"print(\"β
Merged model saved\")\n",
|
| 424 |
"\n",
|
|
|
|
| 425 |
"# model.push_to_hub(HUB_MODEL_ID)\n",
|
| 426 |
"# tokenizer.push_to_hub(HUB_MODEL_ID)"
|
| 427 |
]
|
|
|
|
| 434 |
"\n",
|
| 435 |
"| Mode | Use Case | Speed |\n",
|
| 436 |
"|------|----------|-------|\n",
|
| 437 |
+
"| `enable_thinking=True` | Deep reasoning, analysis | Slower, thorough |\n",
|
| 438 |
+
"| `enable_thinking=False` | Quick answers, coding | Fast, direct |"
|
| 439 |
]
|
| 440 |
},
|
| 441 |
{
|
|
|
|
| 488 |
"| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |\n",
|
| 489 |
"| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |\n",
|
| 490 |
"| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |\n",
|
| 491 |
+
"| **Code Corpus LLM Training** | https://huggingface.co/datasets/krystv/code-corpus-llm-training |\n",
|
| 492 |
"| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |\n",
|
| 493 |
"| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |\n",
|
| 494 |
"| **Unsloth Docs** | https://unsloth.ai/docs |\n",
|