asdf98 commited on
Commit
fbc4da7
Β·
verified Β·
1 Parent(s): 00c07ae

Upload EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb

Browse files
EthicalHacking_Qwen3-4B_Ultimate_Colab.ipynb CHANGED
@@ -4,14 +4,14 @@
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
- "# πŸ” Ultimate Ethical Hacking / General-Purpose LLM – Colab Free Tier (T4)\n",
8
  "\n",
9
  "**πŸ₯‡ Model:** [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) via Unsloth 4-bit \n",
10
  "**πŸ† Why this model?** Highest coding/reasoning scores among sub-10B models (LiveCodeBench 35.1, MMLU-Pro 69.6). Only **3.3 GB** in 4-bit. \n",
11
- "**πŸ“Š Datasets:** Your choice β€” pick from cybersecurity, general chat, multilingual, coding, or mix them! \n",
12
  "**⚑ Framework:** Unsloth + TRL SFTTrainer β€” 2Γ— faster, 70% less VRAM \n",
13
  "\n",
14
- "> ⚠️ **Disclaimer:** Default datasets include **defensive cybersecurity** content (pentesting education, threat analysis, IR). Pick general-purpose datasets for other domains.\n",
15
  "\n",
16
  "---\n",
17
  "\n",
@@ -127,20 +127,18 @@
127
  "source": [
128
  "## 4️⃣ 🎯 CHOOSE YOUR DATASET(S)\n",
129
  "\n",
130
- "Uncomment **ONE** `DATASET_CHOICE` line to select your training data. You can also mix multiple datasets by setting a list.\n",
131
  "\n",
132
- "| Choice | Dataset | Size | Format | Best For |\n",
133
  "|--------|---------|------|--------|----------|\n",
134
- "| `\"cybersecurity\"` | Fenrir v2.1 + Trendyol | 153K β†’ 50K | system/user/assistant | **Ethical hacking, pentesting education** |\n",
135
- "| `\"ultrachat\"` | UltraChat 200K (SFT) | 200K β†’ 50K | messages (user/assistant) | General conversation, chatbot tuning |\n",
136
- "| `\"openhermes\"` | OpenHermes 2.5 | 1M+ β†’ 50K | conversations (human/gpt) | Reasoning, coding, instruction following |\n",
137
- "| `\"sharegpt_en\"` | ShareGPT English | ~90K β†’ 50K | conversations (human/gpt) | Multi-turn dialogue, general QA |\n",
138
- "| `\"sharegpt_de\"` | ShareGPT German | ~104K β†’ 50K | conversations (human/gpt) | German language fine-tuning |\n",
139
- "| `\"sharegpt_hi\"` | ShareGPT Hindi (27B) | ~153K β†’ 50K | conversations (human/gpt) | Hindi language fine-tuning |\n",
140
- "| `\"custom_mix\"` | Mix of your choice | β€” | varies | Combine datasets for hybrid tuning |\n",
141
- "\n",
142
- "\n",
143
- "**To mix datasets**, set `DATASET_CHOICE = \"custom_mix\"` and configure `CUSTOM_DATASETS` below."
144
  ]
145
  },
146
  {
@@ -155,33 +153,19 @@
155
  "# SELECT YOUR DATASET β€” UNCOMMENT ONE LINE\n",
156
  "# ═══════════════════════════════════════════════════════════════\n",
157
  "\n",
158
- "# --- Option 1: Cybersecurity (default) ---\n",
159
  "DATASET_CHOICE = \"cybersecurity\"\n",
160
  "\n",
161
- "# --- Option 2: General-purpose chat (UltraChat) ---\n",
162
  "# DATASET_CHOICE = \"ultrachat\"\n",
163
- "\n",
164
- "# --- Option 3: Reasoning & coding (OpenHermes 2.5) ---\n",
165
  "# DATASET_CHOICE = \"openhermes\"\n",
166
- "\n",
167
- "# --- Option 4: Multi-turn dialogue (ShareGPT English) ---\n",
168
  "# DATASET_CHOICE = \"sharegpt_en\"\n",
169
- "\n",
170
- "# --- Option 5: German language (ShareGPT German) ---\n",
171
  "# DATASET_CHOICE = \"sharegpt_de\"\n",
172
- "\n",
173
- "# --- Option 6: Hindi language (ShareGPT Hindi 27B) ---\n",
174
  "# DATASET_CHOICE = \"sharegpt_hi\"\n",
175
- "\n",
176
- "# --- Option 7: Mix multiple datasets ---\n",
177
  "# DATASET_CHOICE = \"custom_mix\"\n",
178
  "\n",
179
- "# ═══════════════════════════════════════════════════════════════\n",
180
- "# CUSTOM MIX CONFIG (only used if DATASET_CHOICE = \"custom_mix\")\n",
181
- "# ═══════════════════════════════════════════════════════════════\n",
182
  "CUSTOM_DATASETS = [\n",
183
  " # (\"dataset_name_or_id\", \"split\", rows_to_take, \"format_type\")\n",
184
- " # format_type: \"messages\" | \"conversations\" | \"instruction\"\n",
185
  " (\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", \"train\", 10000, \"messages\"),\n",
186
  " (\"HuggingFaceH4/ultrachat_200k\", \"train_sft\", 20000, \"messages\"),\n",
187
  " (\"teknium/OpenHermes-2.5\", \"train\", 20000, \"conversations\"),\n",
@@ -196,8 +180,7 @@
196
  "source": [
197
  "## 5️⃣ Load, Convert & Pre-process Selected Dataset\n",
198
  "\n",
199
- "This cell auto-detects the dataset format and converts everything to standard `messages` β†’ `text` pipeline.\n",
200
- "**No changes needed** β€” just run it after selecting your dataset above."
201
  ]
202
  },
203
  {
@@ -223,13 +206,11 @@
223
  " ]}\n",
224
  "\n",
225
  "def _convert_ultrachat(example):\n",
226
- " # Already in messages format with role/content\n",
227
  " return {\"messages\": example[\"messages\"]}\n",
228
  "\n",
229
  "def _convert_conversations(example):\n",
230
- " # OpenHermes / ShareGPT style: [{from: 'human'/'gpt', value: '...'}]\n",
231
  " msgs = []\n",
232
- " system_prompt = example.get(\"system_prompt\") or example.get(\"system\", \"\")\n",
233
  " if system_prompt:\n",
234
  " msgs.append({\"role\": \"system\", \"content\": system_prompt})\n",
235
  " for turn in example[\"conversations\"]:\n",
@@ -237,40 +218,52 @@
237
  " msgs.append({\"role\": role, \"content\": turn[\"value\"]})\n",
238
  " return {\"messages\": msgs}\n",
239
  "\n",
 
 
 
 
 
 
 
 
 
 
 
 
 
240
  "# ===================== LOAD DATASET(S) =====================\n",
241
  "all_datasets = []\n",
242
  "\n",
243
  "if DATASET_CHOICE == \"cybersecurity\":\n",
244
- " print(\"πŸ“₯ Loading Fenrir v2.1...\")\n",
245
  " ds1 = load_dataset(\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", split=\"train\")\n",
246
  " ds1 = ds1.map(_convert_fenrir, remove_columns=ds1.column_names, batched=False)\n",
247
  " all_datasets.append(ds1)\n",
248
- "\n",
249
- " print(\"πŸ“₯ Loading Trendyol Cybersecurity...\")\n",
250
  " ds2 = load_dataset(\"Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset\", split=\"train\")\n",
251
  " ds2 = ds2.map(_convert_trendyol, remove_columns=ds2.column_names, batched=False)\n",
252
  " all_datasets.append(ds2)\n",
253
  "\n",
254
  "elif DATASET_CHOICE == \"ultrachat\":\n",
255
- " print(\"πŸ“₯ Loading UltraChat 200K (train_sft split)...\")\n",
256
  " ds = load_dataset(\"HuggingFaceH4/ultrachat_200k\", split=\"train_sft\")\n",
257
  " ds = ds.map(_convert_ultrachat, remove_columns=ds.column_names, batched=False)\n",
258
  " all_datasets.append(ds)\n",
259
  "\n",
260
  "elif DATASET_CHOICE == \"openhermes\":\n",
261
- " print(\"πŸ“₯ Loading OpenHermes 2.5...\")\n",
262
  " ds = load_dataset(\"teknium/OpenHermes-2.5\", split=\"train\")\n",
263
  " ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
264
  " all_datasets.append(ds)\n",
265
  "\n",
266
  "elif DATASET_CHOICE.startswith(\"sharegpt_\"):\n",
267
  " split_map = {\"sharegpt_en\": \"english\", \"sharegpt_de\": \"german_4b_translated\", \"sharegpt_hi\": \"hindi_27b_translated\"}\n",
268
- " split_name = split_map[DATASET_CHOICE]\n",
269
- " print(f\"πŸ“₯ Loading ShareGPT multilingual ({split_name})...\")\n",
270
- " ds = load_dataset(\"deepmage121/ShareGPT_multilingual\", split=split_name)\n",
271
  " ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
272
  " all_datasets.append(ds)\n",
273
  "\n",
 
 
 
 
 
 
274
  "elif DATASET_CHOICE == \"custom_mix\":\n",
275
  " for ds_id, split, n_rows, fmt in CUSTOM_DATASETS:\n",
276
  " print(f\"πŸ“₯ Loading {ds_id} ({split}, {n_rows} rows)...\")\n",
@@ -281,6 +274,8 @@
281
  " ds = ds.map(_convert_ultrachat, remove_columns=ds.column_names, batched=False)\n",
282
  " elif fmt == \"conversations\":\n",
283
  " ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
 
 
284
  " else:\n",
285
  " raise ValueError(f\"Unknown format: {fmt}\")\n",
286
  " all_datasets.append(ds)\n",
@@ -288,21 +283,13 @@
288
  "else:\n",
289
  " raise ValueError(f\"Unknown DATASET_CHOICE: {DATASET_CHOICE}\")\n",
290
  "\n",
291
- "# Merge all loaded datasets\n",
292
- "if len(all_datasets) == 1:\n",
293
- " train_dataset = all_datasets[0]\n",
294
- "else:\n",
295
- " train_dataset = concatenate_datasets(all_datasets)\n",
296
- "\n",
297
  "print(f\"\\nπŸ“Š COMBINED DATASET: {len(train_dataset)} rows\")\n",
298
  "\n",
299
- "# Show a random sample\n",
300
  "sample = train_dataset[random.randint(0, len(train_dataset)-1)]\n",
301
- "print(f\"\\n--- Random sample roles: {[m['role'] for m in sample['messages']]} ---\")\n",
302
- "for m in sample[\"messages\"]:\n",
303
- " print(f\" {m['role']}: {m['content'][:100]}...\")\n",
304
  "\n",
305
- "# Subsample for speed\n",
306
  "if len(train_dataset) > SAMPLE_SIZE:\n",
307
  " train_dataset = train_dataset.shuffle(seed=3407).select(range(SAMPLE_SIZE))\n",
308
  " print(f\"\\nπŸš€ SUBSAMPLED to {len(train_dataset)} rows\")\n",
@@ -316,9 +303,7 @@
316
  "cell_type": "markdown",
317
  "metadata": {},
318
  "source": [
319
- "## 6️⃣ Convert Messages β†’ Text (Chat Template)\n",
320
- "\n",
321
- "Uses `tokenizer.apply_chat_template` to convert structured messages into training text. No `formatting_func` needed."
322
  ]
323
  },
324
  {
@@ -330,25 +315,14 @@
330
  "def convert_messages_to_text(examples):\n",
331
  " texts = []\n",
332
  " for msgs in examples[\"messages\"]:\n",
333
- " text = tokenizer.apply_chat_template(\n",
334
- " msgs,\n",
335
- " tokenize=False,\n",
336
- " add_generation_prompt=False,\n",
337
- " )\n",
338
  " texts.append(text)\n",
339
  " return {\"text\": texts}\n",
340
  "\n",
341
  "print(\"πŸ”„ Converting messages to text...\")\n",
342
- "train_dataset = train_dataset.map(\n",
343
- " convert_messages_to_text,\n",
344
- " batched=True,\n",
345
- " remove_columns=[\"messages\"],\n",
346
- " batch_size=100,\n",
347
- ")\n",
348
- "\n",
349
  "print(f\"βœ… Dataset pre-processed. Columns: {train_dataset.column_names}\")\n",
350
- "print(f\"πŸ“„ Sample text length: {len(train_dataset[0]['text'])} chars\")\n",
351
- "print(f\"πŸ“„ First 200 chars:\\n{train_dataset[0]['text'][:200]}...\")"
352
  ]
353
  },
354
  {
@@ -438,19 +412,16 @@
438
  "metadata": {},
439
  "outputs": [],
440
  "source": [
441
- "# Save LoRA adapter (tiny, ~50-100 MB)\n",
442
  "model.save_pretrained(\"./lora-adapter\")\n",
443
  "tokenizer.save_pretrained(\"./lora-adapter\")\n",
444
  "print(\"βœ… LoRA adapter saved\")\n",
445
  "\n",
446
- "# Merge & save full 16-bit model (~8 GB)\n",
447
  "print(\"\\nπŸ”„ Merging LoRA into base model...\")\n",
448
  "merged_model = model.merge_and_unload()\n",
449
  "merged_model.save_pretrained(\"./merged-model\")\n",
450
  "tokenizer.save_pretrained(\"./merged-model\")\n",
451
  "print(\"βœ… Merged model saved\")\n",
452
  "\n",
453
- "# Push to HF Hub (uncomment if logged in)\n",
454
  "# model.push_to_hub(HUB_MODEL_ID)\n",
455
  "# tokenizer.push_to_hub(HUB_MODEL_ID)"
456
  ]
@@ -463,8 +434,8 @@
463
  "\n",
464
  "| Mode | Use Case | Speed |\n",
465
  "|------|----------|-------|\n",
466
- "| `enable_thinking=True` | Deep reasoning, analysis, chain-of-thought | Slower, thorough |\n",
467
- "| `enable_thinking=False` | Quick answers, coding snippets, commands | Fast, direct |"
468
  ]
469
  },
470
  {
@@ -517,6 +488,7 @@
517
  "| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |\n",
518
  "| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |\n",
519
  "| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |\n",
 
520
  "| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |\n",
521
  "| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |\n",
522
  "| **Unsloth Docs** | https://unsloth.ai/docs |\n",
 
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
+ "# πŸ” Ultimate LLM Fine-Tuning – Qwen3-4B (Colab Free Tier T4)\n",
8
  "\n",
9
  "**πŸ₯‡ Model:** [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) via Unsloth 4-bit \n",
10
  "**πŸ† Why this model?** Highest coding/reasoning scores among sub-10B models (LiveCodeBench 35.1, MMLU-Pro 69.6). Only **3.3 GB** in 4-bit. \n",
11
+ "**πŸ“Š Datasets:** Your choice β€” cybersecurity, general chat, multilingual, coding, or mix them! \n",
12
  "**⚑ Framework:** Unsloth + TRL SFTTrainer β€” 2Γ— faster, 70% less VRAM \n",
13
  "\n",
14
+ "> ⚠️ Default is cybersecurity. Pick general-purpose datasets for other domains.\n",
15
  "\n",
16
  "---\n",
17
  "\n",
 
127
  "source": [
128
  "## 4️⃣ 🎯 CHOOSE YOUR DATASET(S)\n",
129
  "\n",
130
+ "Uncomment **ONE** `DATASET_CHOICE` line to select your training data.\n",
131
  "\n",
132
+ "| Choice | Dataset | Rows | Format | Best For |\n",
133
  "|--------|---------|------|--------|----------|\n",
134
+ "| `\"cybersecurity\"` | Fenrir v2.1 + Trendyol | 153K→50K | system/user/assistant | Ethical hacking education |\n",
135
+ "| `\"ultrachat\"` | UltraChat 200K SFT | 200K→50K | messages (user/assistant) | General conversation |\n",
136
+ "| `\"openhermes\"` | OpenHermes 2.5 | 1M+β†’50K | conversations (human/gpt) | Reasoning, coding |\n",
137
+ "| `\"sharegpt_en\"` | ShareGPT English | ~90K→50K | conversations (human/gpt) | Multi-turn dialogue |\n",
138
+ "| `\"sharegpt_de\"` | ShareGPT German | ~104K→50K | conversations (human/gpt) | German fine-tuning |\n",
139
+ "| `\"sharegpt_hi\"` | ShareGPT Hindi | ~153K→50K | conversations (human/gpt) | Hindi fine-tuning |\n",
140
+ "| `\"code_corpus\"` | [Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training) | 240K→50K | text (code files) | **Code completion, coding assistant** |\n",
141
+ "| `\"custom_mix\"` | Mix of your choice | β€” | varies | Combine datasets |"
 
 
142
  ]
143
  },
144
  {
 
153
  "# SELECT YOUR DATASET β€” UNCOMMENT ONE LINE\n",
154
  "# ═══════════════════════════════════════════════════════════════\n",
155
  "\n",
 
156
  "DATASET_CHOICE = \"cybersecurity\"\n",
157
  "\n",
 
158
  "# DATASET_CHOICE = \"ultrachat\"\n",
 
 
159
  "# DATASET_CHOICE = \"openhermes\"\n",
 
 
160
  "# DATASET_CHOICE = \"sharegpt_en\"\n",
 
 
161
  "# DATASET_CHOICE = \"sharegpt_de\"\n",
 
 
162
  "# DATASET_CHOICE = \"sharegpt_hi\"\n",
163
+ "# DATASET_CHOICE = \"code_corpus\"\n",
 
164
  "# DATASET_CHOICE = \"custom_mix\"\n",
165
  "\n",
 
 
 
166
  "CUSTOM_DATASETS = [\n",
167
  " # (\"dataset_name_or_id\", \"split\", rows_to_take, \"format_type\")\n",
168
+ " # format_type: \"messages\" | \"conversations\" | \"text\"\n",
169
  " (\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", \"train\", 10000, \"messages\"),\n",
170
  " (\"HuggingFaceH4/ultrachat_200k\", \"train_sft\", 20000, \"messages\"),\n",
171
  " (\"teknium/OpenHermes-2.5\", \"train\", 20000, \"conversations\"),\n",
 
180
  "source": [
181
  "## 5️⃣ Load, Convert & Pre-process Selected Dataset\n",
182
  "\n",
183
+ "Auto-detects dataset format and converts everything to standard `messages` β†’ `text`."
 
184
  ]
185
  },
186
  {
 
206
  " ]}\n",
207
  "\n",
208
  "def _convert_ultrachat(example):\n",
 
209
  " return {\"messages\": example[\"messages\"]}\n",
210
  "\n",
211
  "def _convert_conversations(example):\n",
 
212
  " msgs = []\n",
213
+ " system_prompt = example.get(\"system_prompt\", \"\") or example.get(\"system\", \"\")\n",
214
  " if system_prompt:\n",
215
  " msgs.append({\"role\": \"system\", \"content\": system_prompt})\n",
216
  " for turn in example[\"conversations\"]:\n",
 
218
  " msgs.append({\"role\": role, \"content\": turn[\"value\"]})\n",
219
  " return {\"messages\": msgs}\n",
220
  "\n",
221
+ "def _convert_code_corpus(example):\n",
222
+ " # Code Corpus: raw code text with domain/repo metadata in a user prompt + assistant format\n",
223
+ " # We treat the code block as an assistant response to a user asking about that code\n",
224
+ " code_text = example[\"text\"]\n",
225
+ " domain = example.get(\"domain\", \"code\")\n",
226
+ " repo = example.get(\"repo\", \"unknown\")\n",
227
+ " lang = example.get(\"language\", \"\")\n",
228
+ " user_prompt = f\"Here is a code snippet from the {domain} domain (repo: {repo}, language: {lang}). Please explain or improve it.\"\n",
229
+ " return {\"messages\": [\n",
230
+ " {\"role\": \"user\", \"content\": user_prompt},\n",
231
+ " {\"role\": \"assistant\", \"content\": code_text},\n",
232
+ " ]}\n",
233
+ "\n",
234
  "# ===================== LOAD DATASET(S) =====================\n",
235
  "all_datasets = []\n",
236
  "\n",
237
  "if DATASET_CHOICE == \"cybersecurity\":\n",
 
238
  " ds1 = load_dataset(\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", split=\"train\")\n",
239
  " ds1 = ds1.map(_convert_fenrir, remove_columns=ds1.column_names, batched=False)\n",
240
  " all_datasets.append(ds1)\n",
 
 
241
  " ds2 = load_dataset(\"Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset\", split=\"train\")\n",
242
  " ds2 = ds2.map(_convert_trendyol, remove_columns=ds2.column_names, batched=False)\n",
243
  " all_datasets.append(ds2)\n",
244
  "\n",
245
  "elif DATASET_CHOICE == \"ultrachat\":\n",
 
246
  " ds = load_dataset(\"HuggingFaceH4/ultrachat_200k\", split=\"train_sft\")\n",
247
  " ds = ds.map(_convert_ultrachat, remove_columns=ds.column_names, batched=False)\n",
248
  " all_datasets.append(ds)\n",
249
  "\n",
250
  "elif DATASET_CHOICE == \"openhermes\":\n",
 
251
  " ds = load_dataset(\"teknium/OpenHermes-2.5\", split=\"train\")\n",
252
  " ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
253
  " all_datasets.append(ds)\n",
254
  "\n",
255
  "elif DATASET_CHOICE.startswith(\"sharegpt_\"):\n",
256
  " split_map = {\"sharegpt_en\": \"english\", \"sharegpt_de\": \"german_4b_translated\", \"sharegpt_hi\": \"hindi_27b_translated\"}\n",
257
+ " ds = load_dataset(\"deepmage121/ShareGPT_multilingual\", split=split_map[DATASET_CHOICE])\n",
 
 
258
  " ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
259
  " all_datasets.append(ds)\n",
260
  "\n",
261
+ "elif DATASET_CHOICE == \"code_corpus\":\n",
262
+ " print(\"πŸ“₯ Loading Code Corpus LLM Training (krystv)...\")\n",
263
+ " ds = load_dataset(\"krystv/code-corpus-llm-training\", split=\"train\")\n",
264
+ " ds = ds.map(_convert_code_corpus, remove_columns=ds.column_names, batched=False)\n",
265
+ " all_datasets.append(ds)\n",
266
+ "\n",
267
  "elif DATASET_CHOICE == \"custom_mix\":\n",
268
  " for ds_id, split, n_rows, fmt in CUSTOM_DATASETS:\n",
269
  " print(f\"πŸ“₯ Loading {ds_id} ({split}, {n_rows} rows)...\")\n",
 
274
  " ds = ds.map(_convert_ultrachat, remove_columns=ds.column_names, batched=False)\n",
275
  " elif fmt == \"conversations\":\n",
276
  " ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
277
+ " elif fmt == \"text\":\n",
278
+ " ds = ds.map(_convert_code_corpus, remove_columns=ds.column_names, batched=False)\n",
279
  " else:\n",
280
  " raise ValueError(f\"Unknown format: {fmt}\")\n",
281
  " all_datasets.append(ds)\n",
 
283
  "else:\n",
284
  " raise ValueError(f\"Unknown DATASET_CHOICE: {DATASET_CHOICE}\")\n",
285
  "\n",
286
+ "train_dataset = concatenate_datasets(all_datasets) if len(all_datasets) > 1 else all_datasets[0]\n",
 
 
 
 
 
287
  "print(f\"\\nπŸ“Š COMBINED DATASET: {len(train_dataset)} rows\")\n",
288
  "\n",
 
289
  "sample = train_dataset[random.randint(0, len(train_dataset)-1)]\n",
290
+ "print(f\"Sample roles: {[m['role'] for m in sample['messages']]}\")\n",
291
+ "for m in sample[\"messages\"]: print(f\" {m['role']}: {m['content'][:80]}...\")\n",
 
292
  "\n",
 
293
  "if len(train_dataset) > SAMPLE_SIZE:\n",
294
  " train_dataset = train_dataset.shuffle(seed=3407).select(range(SAMPLE_SIZE))\n",
295
  " print(f\"\\nπŸš€ SUBSAMPLED to {len(train_dataset)} rows\")\n",
 
303
  "cell_type": "markdown",
304
  "metadata": {},
305
  "source": [
306
+ "## 6️⃣ Convert Messages β†’ Text (Chat Template)"
 
 
307
  ]
308
  },
309
  {
 
315
  "def convert_messages_to_text(examples):\n",
316
  " texts = []\n",
317
  " for msgs in examples[\"messages\"]:\n",
318
+ " text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)\n",
 
 
 
 
319
  " texts.append(text)\n",
320
  " return {\"text\": texts}\n",
321
  "\n",
322
  "print(\"πŸ”„ Converting messages to text...\")\n",
323
+ "train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=[\"messages\"], batch_size=100)\n",
 
 
 
 
 
 
324
  "print(f\"βœ… Dataset pre-processed. Columns: {train_dataset.column_names}\")\n",
325
+ "print(f\"πŸ“„ Sample text length: {len(train_dataset[0]['text'])} chars\")"
 
326
  ]
327
  },
328
  {
 
412
  "metadata": {},
413
  "outputs": [],
414
  "source": [
 
415
  "model.save_pretrained(\"./lora-adapter\")\n",
416
  "tokenizer.save_pretrained(\"./lora-adapter\")\n",
417
  "print(\"βœ… LoRA adapter saved\")\n",
418
  "\n",
 
419
  "print(\"\\nπŸ”„ Merging LoRA into base model...\")\n",
420
  "merged_model = model.merge_and_unload()\n",
421
  "merged_model.save_pretrained(\"./merged-model\")\n",
422
  "tokenizer.save_pretrained(\"./merged-model\")\n",
423
  "print(\"βœ… Merged model saved\")\n",
424
  "\n",
 
425
  "# model.push_to_hub(HUB_MODEL_ID)\n",
426
  "# tokenizer.push_to_hub(HUB_MODEL_ID)"
427
  ]
 
434
  "\n",
435
  "| Mode | Use Case | Speed |\n",
436
  "|------|----------|-------|\n",
437
+ "| `enable_thinking=True` | Deep reasoning, analysis | Slower, thorough |\n",
438
+ "| `enable_thinking=False` | Quick answers, coding | Fast, direct |"
439
  ]
440
  },
441
  {
 
488
  "| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |\n",
489
  "| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |\n",
490
  "| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |\n",
491
+ "| **Code Corpus LLM Training** | https://huggingface.co/datasets/krystv/code-corpus-llm-training |\n",
492
  "| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |\n",
493
  "| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |\n",
494
  "| **Unsloth Docs** | https://unsloth.ai/docs |\n",