asdf98 committed on
Commit
f09f7ce
·
verified ·
1 Parent(s): 8ecbd0a

Upload EthicalHacking_LFM2.5_Ultimate_Colab.ipynb

EthicalHacking_LFM2.5_Ultimate_Colab.ipynb CHANGED
@@ -4,14 +4,14 @@
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
- "# 🔐 Ultimate Ethical Hacking LLM – Liquid LFM2.5 (Colab Free Tier T4)\n",
8
  "\n",
9
  "**🥇 Model:** [Liquid LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) via Unsloth 4-bit \n",
10
- "**πŸ† Why this model?** 1.2B params, only **~1GB in 4-bit**, runs on phones. Massive T4 headroom for training. 128K context window. \n",
11
- "**📊 Datasets:** [Fenrir v2.1](https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1) + [Trendyol Cybersecurity](https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset) — 153K+ instruction pairs \n",
12
  "**⚡ Framework:** Unsloth + TRL SFTTrainer — 2× faster, 70% less VRAM \n",
13
  "\n",
14
- "> ⚠️ **Disclaimer:** This trains on **defensive cybersecurity** datasets only. Intended for ethical hacking education and security research.\n",
15
  "\n",
16
  "---\n",
17
  "\n",
@@ -22,10 +22,8 @@
22
  "| Parameters | 1.2B |\n",
23
  "| 4-bit VRAM | ~1.0 GB |\n",
24
  "| Context | 128K tokens |\n",
25
- "| VRAM for training | **~14 GB free on T4** |\n",
26
- "| Batch size | **4-8** easily |\n",
27
- "| Max seq length | 4096-8192 |\n",
28
- "| Speed | **Very fast** on T4 |\n",
29
  "\n",
30
  "**Unsloth docs:** https://unsloth.ai/docs/models/tutorials/lfm2.5 \n",
31
  "**Official notebook:** https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Liquid_LFM2_(1.2B)-Conversational.ipynb"
@@ -69,9 +67,7 @@
69
  "cell_type": "markdown",
70
  "metadata": {},
71
  "source": [
72
- "## 3️⃣ Load LFM2.5-1.2B-Instruct in 4-bit via Unsloth\n",
73
- "\n",
74
- "Uses Unsloth's pre-converted 4-bit model. Only ~1GB in memory β€” leaves massive room for LoRA training."
75
  ]
76
  },
77
  {
@@ -84,26 +80,25 @@
84
  "import torch\n",
85
  "\n",
86
  "# ==================== T4-COLAB HYPERPARAMETERS (LFM2.5) ====================\n",
87
- "MAX_SEQ_LENGTH = 4096 # 1.2B model leaves huge VRAM headroom\n",
88
- "LORA_R = 128 # higher rank possible on LFM2.5 (tiny base)\n",
89
- "LORA_ALPHA = 128 # alpha = r\n",
90
- "BATCH_SIZE = 8 # massive batch thanks to small model\n",
91
- "GRAD_ACCUM = 1 # effective batch = 8\n",
92
- "LEARNING_RATE = 2e-4 \n",
93
- "NUM_EPOCHS = 1\n",
94
- "MAX_STEPS = 4000 # cap steps for speed\n",
95
- "WARMUP_STEPS = 200 \n",
96
- "LOGGING_STEPS = 50 \n",
97
- "SAVE_STEPS = 500 \n",
98
- "PACKING = True # massive throughput boost\n",
99
- "SAMPLE_SIZE = 50000 # subsample for fast convergence\n",
100
- "HUB_MODEL_ID = \"your-username/cyber-lfm25-lora\" \n",
101
  "# ========================================================================\n",
102
  "\n",
103
  "model, tokenizer = FastLanguageModel.from_pretrained(\n",
104
  " model_name=\"unsloth/LFM2.5-1.2B-Instruct\",\n",
105
  " max_seq_length=MAX_SEQ_LENGTH,\n",
106
- " dtype=None, # auto-detect (fp16 on T4)\n",
107
  " load_in_4bit=True,\n",
108
  ")\n",
109
  "\n",
@@ -113,11 +108,11 @@
113
  " target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
114
  " \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
115
  " lora_alpha=LORA_ALPHA,\n",
116
- " lora_dropout=0, \n",
117
  " bias=\"none\",\n",
118
  " use_gradient_checkpointing=\"unsloth\",\n",
119
  " random_state=3407,\n",
120
- " use_rslora=False, \n",
121
  " loftq_config=None,\n",
122
  ")\n",
123
  "\n",
@@ -132,7 +127,19 @@
132
  "cell_type": "markdown",
133
  "metadata": {},
134
  "source": [
135
- "## 4️⃣ Load, Audit, Subsample & Merge Cybersecurity Datasets"
136
  ]
137
  },
138
  {
@@ -142,53 +149,119 @@
142
  "outputs": [],
143
  "source": [
144
  "from datasets import load_dataset, concatenate_datasets\n",
145
  "import random\n",
146
  "\n",
147
- "# ---------- Dataset 1: Fenrir v2.1 ----------\n",
148
- "print(\"📥 Loading Fenrir v2.1...\")\n",
149
- "ds1 = load_dataset(\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", split=\"train\")\n",
150
- "print(f\" Rows: {len(ds1)} | Columns: {ds1.column_names}\")\n",
151
- "\n",
152
- "for i in random.sample(range(len(ds1)), 2):\n",
153
- " print(f\"\\n--- Sample {i} ---\")\n",
154
- " print(f\"SYSTEM: {ds1[i]['system'][:120]}...\")\n",
155
- " print(f\"USER: {ds1[i]['user'][:120]}...\")\n",
156
- " print(f\"ASSIST: {ds1[i]['assistant'][:120]}...\")\n",
157
- "\n",
158
- "def fenrir_to_messages(example):\n",
159
- " return {\n",
160
- " \"messages\": [\n",
161
- " {\"role\": \"system\", \"content\": example[\"system\"]},\n",
162
- " {\"role\": \"user\", \"content\": example[\"user\"]},\n",
163
- " {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
164
- " ]\n",
165
- " }\n",
166
- "\n",
167
- "ds1 = ds1.map(fenrir_to_messages, remove_columns=ds1.column_names, batched=False)\n",
168
- "\n",
169
- "# ---------- Dataset 2: Trendyol ----------\n",
170
- "print(\"\\n📥 Loading Trendyol Cybersecurity...\")\n",
171
- "ds2 = load_dataset(\"Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset\", split=\"train\")\n",
172
- "print(f\" Rows: {len(ds2)} | Columns: {ds2.column_names}\")\n",
173
- "\n",
174
- "def trendyol_to_messages(example):\n",
175
- " return {\n",
176
- " \"messages\": [\n",
177
- " {\"role\": \"system\", \"content\": example[\"system\"]},\n",
178
- " {\"role\": \"user\", \"content\": example[\"user\"]},\n",
179
- " {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
180
- " ]\n",
181
- " }\n",
182
- "\n",
183
- "ds2 = ds2.map(trendyol_to_messages, remove_columns=ds2.column_names, batched=False)\n",
184
- "\n",
185
- "# ---------- Merge & Subsample ----------\n",
186
- "train_dataset = concatenate_datasets([ds1, ds2])\n",
187
  "print(f\"\\n📊 COMBINED DATASET: {len(train_dataset)} rows\")\n",
188
  "\n",
189
  "if len(train_dataset) > SAMPLE_SIZE:\n",
190
  " train_dataset = train_dataset.shuffle(seed=3407).select(range(SAMPLE_SIZE))\n",
191
- " print(f\"🚀 SUBSAMPLED to {len(train_dataset)} rows\")\n",
192
  "\n",
193
  "print(f\" Effective batch size: {BATCH_SIZE * GRAD_ACCUM}\")\n",
194
  "print(f\" Steps per epoch: ~{len(train_dataset) // (BATCH_SIZE * GRAD_ACCUM)}\")\n",
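As a quick sanity check of the step arithmetic printed above, the same computation can be run standalone. Values mirror the notebook defaults; note that with `PACKING = True` the real optimizer-step count is lower, since short examples are packed together.

```python
# Sanity check of the notebook's batch/step arithmetic
# (defaults assumed: BATCH_SIZE=8, GRAD_ACCUM=1, SAMPLE_SIZE=50000, MAX_STEPS=4000).
BATCH_SIZE = 8
GRAD_ACCUM = 1
SAMPLE_SIZE = 50_000
MAX_STEPS = 4_000

effective_batch = BATCH_SIZE * GRAD_ACCUM          # rows consumed per optimizer step
steps_per_epoch = SAMPLE_SIZE // effective_batch   # 6250 with the defaults
actual_steps = min(steps_per_epoch, MAX_STEPS)     # MAX_STEPS caps the run at 4000

print(effective_batch, steps_per_epoch, actual_steps)  # 8 6250 4000
```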
@@ -199,7 +272,7 @@
199
  "cell_type": "markdown",
200
  "metadata": {},
201
  "source": [
202
- "## 5️⃣ Pre-process Dataset to Text (Avoid Unsloth formatting_func issues)"
203
  ]
204
  },
205
  {
@@ -208,26 +281,15 @@
208
  "metadata": {},
209
  "outputs": [],
210
  "source": [
211
- "# ========== PRE-PROCESS: messages → text with chat template ==========\n",
212
  "def convert_messages_to_text(examples):\n",
213
  " texts = []\n",
214
  " for msgs in examples[\"messages\"]:\n",
215
- " text = tokenizer.apply_chat_template(\n",
216
- " msgs,\n",
217
- " tokenize=False,\n",
218
- " add_generation_prompt=False,\n",
219
- " )\n",
220
  " texts.append(text)\n",
221
  " return {\"text\": texts}\n",
222
  "\n",
223
  "print(\"🔄 Converting messages to text...\")\n",
224
- "train_dataset = train_dataset.map(\n",
225
- " convert_messages_to_text,\n",
226
- " batched=True,\n",
227
- " remove_columns=[\"messages\"],\n",
228
- " batch_size=100,\n",
229
- ")\n",
230
- "\n",
231
  "print(f\"✅ Dataset pre-processed. Columns: {train_dataset.column_names}\")\n",
232
  "print(f\"📄 Sample text length: {len(train_dataset[0]['text'])} chars\")"
233
  ]
@@ -236,7 +298,7 @@
236
  "cell_type": "markdown",
237
  "metadata": {},
238
  "source": [
239
- "## 6️⃣ Configure SFT Trainer (with Packing)"
240
  ]
241
  },
242
  {
@@ -276,7 +338,7 @@
276
  " ),\n",
277
  ")\n",
278
  "\n",
279
- "print(f\"✅ Trainer ready. Total steps: {MAX_STEPS}\")\n",
280
  "print(f\" Effective batch size: {BATCH_SIZE * GRAD_ACCUM}\")\n",
281
  "print(f\" Packing enabled: {PACKING}\")\n",
282
  "print(f\" Est. time at ~0.6 it/s: ~{MAX_STEPS * 1.7 / 3600:.1f} hours\")"
@@ -286,7 +348,7 @@
286
  "cell_type": "markdown",
287
  "metadata": {},
288
  "source": [
289
- "## 7️⃣ Train 🚀"
290
  ]
291
  },
292
  {
@@ -311,7 +373,7 @@
311
  "cell_type": "markdown",
312
  "metadata": {},
313
  "source": [
314
- "## 8️⃣ Save & Push to HuggingFace Hub"
315
  ]
316
  },
317
  {
@@ -320,19 +382,16 @@
320
  "metadata": {},
321
  "outputs": [],
322
  "source": [
323
- "# 8A) Save LoRA adapter\n",
324
  "model.save_pretrained(\"./lfm25-lora-adapter\")\n",
325
  "tokenizer.save_pretrained(\"./lfm25-lora-adapter\")\n",
326
  "print(\"✅ LoRA adapter saved\")\n",
327
  "\n",
328
- "# 8B) Merge & save full model\n",
329
  "print(\"\\n🔄 Merging LoRA into base model...\")\n",
330
  "merged_model = model.merge_and_unload()\n",
331
  "merged_model.save_pretrained(\"./lfm25-merged\")\n",
332
  "tokenizer.save_pretrained(\"./lfm25-merged\")\n",
333
  "print(\"✅ Merged model saved\")\n",
334
  "\n",
335
- "# 8C) Push to HF Hub (uncomment if logged in)\n",
336
  "# model.push_to_hub(HUB_MODEL_ID)\n",
337
  "# tokenizer.push_to_hub(HUB_MODEL_ID)"
338
  ]
@@ -341,7 +400,7 @@
341
  "cell_type": "markdown",
342
  "metadata": {},
343
  "source": [
344
- "## 9️⃣ Inference Demo – Responsible Pentesting"
345
  ]
346
  },
347
  {
@@ -352,90 +411,43 @@
352
  "source": [
353
  "FastLanguageModel.for_inference(model)\n",
354
  "\n",
355
- "test_prompt = \"How would you perform a responsible penetration test on a web application?\"\n",
356
  "\n",
357
  "messages = [\n",
358
- " {\"role\": \"system\", \"content\": \"You are a cybersecurity expert. Explain concepts clearly and ethically.\"},\n",
359
  " {\"role\": \"user\", \"content\": test_prompt},\n",
360
  "]\n",
361
  "\n",
362
- "inputs = tokenizer.apply_chat_template(\n",
363
- " messages,\n",
364
- " tokenize=True,\n",
365
- " add_generation_prompt=True,\n",
366
- " return_tensors=\"pt\",\n",
367
- ").to(model.device)\n",
368
- "\n",
369
- "outputs = model.generate(\n",
370
- " input_ids=inputs,\n",
371
- " max_new_tokens=512,\n",
372
- " temperature=0.7,\n",
373
- " top_p=0.9,\n",
374
- " do_sample=True,\n",
375
- " pad_token_id=tokenizer.pad_token_id,\n",
376
- " eos_token_id=tokenizer.eos_token_id,\n",
377
- ")\n",
378
  "\n",
379
  "response = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
380
  "reply = response.split(\"user\")[-1].split(\"assistant\")[-1].strip()\n",
381
  "print(reply[:800])"
382
  ]
383
  },
384
- {
385
- "cell_type": "markdown",
386
- "metadata": {},
387
- "source": [
388
- "## 🔟 Quick Benchmark – CyberMetric Sample"
389
- ]
390
- },
391
- {
392
- "cell_type": "code",
393
- "execution_count": null,
394
- "metadata": {},
395
- "outputs": [],
396
- "source": [
397
- "benchmark_q = (\n",
398
- " \"Which of the following is the MOST effective defense against SQL injection?\\n\"\n",
399
- " \"A) Input validation only\\n\"\n",
400
- " \"B) Parameterized queries\\n\"\n",
401
- " \"C) Escaping special characters\\n\"\n",
402
- " \"D) Client-side filtering\\n\"\n",
403
- " \"Answer with the letter only.\"\n",
404
- ")\n",
405
- "\n",
406
- "bench_msgs = [\n",
407
- " {\"role\": \"system\", \"content\": \"You are a cybersecurity expert. Answer accurately.\"},\n",
408
- " {\"role\": \"user\", \"content\": benchmark_q},\n",
409
- "]\n",
410
- "\n",
411
- "inputs = tokenizer.apply_chat_template(bench_msgs, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\").to(model.device)\n",
412
- "\n",
413
- "outputs = model.generate(input_ids=inputs, max_new_tokens=64, temperature=0.1, do_sample=True,\n",
414
- " pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id)\n",
415
- "\n",
416
- "answer = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
417
- "print(\"📊 Benchmark Answer:\")\n",
418
- "print(answer.split(\"assistant\")[-1].strip())"
419
- ]
420
- },
421
  {
422
  "cell_type": "markdown",
423
  "metadata": {},
424
  "source": [
425
  "---\n",
426
- "## 📚 References\n",
427
  "\n",
428
  "| Resource | Link |\n",
429
  "|----------|------|\n",
430
  "| **Liquid AI Models** | https://www.liquid.ai/models |\n",
431
  "| **LFM2.5-1.2B-Instruct** | https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct |\n",
432
  "| **Unsloth LFM2.5 Docs** | https://unsloth.ai/docs/models/tutorials/lfm2.5 |\n",
433
- "| **Official Colab** | https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Liquid_LFM2_(1.2B)-Conversational.ipynb |\n",
434
- "| **Fenrir Dataset** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |\n",
435
- "| **Trendyol Dataset** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |\n",
436
  "\n",
437
  "---\n",
438
- "*Built with ❤️ for the cybersecurity community. Use responsibly.*"
439
  ]
440
  }
441
  ],
 
4
  "cell_type": "markdown",
5
  "metadata": {},
6
  "source": [
7
+ "# 🔐 Ultimate LLM Fine-Tuning – Liquid LFM2.5 (Colab Free Tier T4)\n",
8
  "\n",
9
  "**🥇 Model:** [Liquid LFM2.5-1.2B-Instruct](https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct) via Unsloth 4-bit \n",
10
+ "**πŸ† Why this model?** 1.2B params, only **~1GB in 4-bit**, runs on phones. Massive T4 headroom for training. 128K context. \n",
11
+ "**📊 Datasets:** Your choice — cybersecurity, general chat, multilingual, coding, or mix them! \n",
12
  "**⚡ Framework:** Unsloth + TRL SFTTrainer — 2× faster, 70% less VRAM \n",
13
  "\n",
14
+ "> ⚠️ Pick any dataset below. Default is cybersecurity. Mix datasets for hybrid tuning.\n",
15
  "\n",
16
  "---\n",
17
  "\n",
 
22
  "| Parameters | 1.2B |\n",
23
  "| 4-bit VRAM | ~1.0 GB |\n",
24
  "| Context | 128K tokens |\n",
25
+ "| Batch size on T4 | **4-8** |\n",
26
+ "| Training headroom | **~14 GB free** |\n",
27
  "\n",
28
  "**Unsloth docs:** https://unsloth.ai/docs/models/tutorials/lfm2.5 \n",
29
  "**Official notebook:** https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Liquid_LFM2_(1.2B)-Conversational.ipynb"
 
67
  "cell_type": "markdown",
68
  "metadata": {},
69
  "source": [
70
+ "## 3️⃣ Load LFM2.5-1.2B-Instruct in 4-bit via Unsloth"
71
  ]
72
  },
73
  {
 
80
  "import torch\n",
81
  "\n",
82
  "# ==================== T4-COLAB HYPERPARAMETERS (LFM2.5) ====================\n",
83
+ "MAX_SEQ_LENGTH = 4096\n",
84
+ "LORA_R = 128\n",
85
+ "LORA_ALPHA = 128\n",
86
+ "BATCH_SIZE = 8\n",
87
+ "GRAD_ACCUM = 1\n",
88
+ "LEARNING_RATE = 2e-4\n",
89
+ "MAX_STEPS = 4000\n",
90
+ "WARMUP_STEPS = 200\n",
91
+ "LOGGING_STEPS = 50\n",
92
+ "SAVE_STEPS = 500\n",
93
+ "PACKING = True\n",
94
+ "SAMPLE_SIZE = 50000\n",
95
+ "HUB_MODEL_ID = \"your-username/lfm25-lora\"\n",
 
96
  "# ========================================================================\n",
97
  "\n",
98
  "model, tokenizer = FastLanguageModel.from_pretrained(\n",
99
  " model_name=\"unsloth/LFM2.5-1.2B-Instruct\",\n",
100
  " max_seq_length=MAX_SEQ_LENGTH,\n",
101
+ " dtype=None,\n",
102
  " load_in_4bit=True,\n",
103
  ")\n",
104
  "\n",
 
108
  " target_modules=[\"q_proj\", \"k_proj\", \"v_proj\", \"o_proj\",\n",
109
  " \"gate_proj\", \"up_proj\", \"down_proj\"],\n",
110
  " lora_alpha=LORA_ALPHA,\n",
111
+ " lora_dropout=0,\n",
112
  " bias=\"none\",\n",
113
  " use_gradient_checkpointing=\"unsloth\",\n",
114
  " random_state=3407,\n",
115
+ " use_rslora=False,\n",
116
  " loftq_config=None,\n",
117
  ")\n",
118
  "\n",
 
127
  "cell_type": "markdown",
128
  "metadata": {},
129
  "source": [
130
+ "## 4️⃣ 🎯 CHOOSE YOUR DATASET(S)\n",
131
+ "\n",
132
+ "Uncomment **ONE** `DATASET_CHOICE` line. Mix datasets with `custom_mix`.\n",
133
+ "\n",
134
+ "| Choice | Dataset | Size | Format | Best For |\n",
135
+ "|--------|---------|------|--------|----------|\n",
136
+ "| `\"cybersecurity\"` | Fenrir + Trendyol | 153K | system/user/assistant | **Ethical hacking education** |\n",
137
+ "| `\"ultrachat\"` | UltraChat 200K SFT | 200K | messages | General conversation |\n",
138
+ "| `\"openhermes\"` | OpenHermes 2.5 | 1M+ | conversations | Reasoning, coding |\n",
139
+ "| `\"sharegpt_en\"` | ShareGPT English | ~90K | conversations | Multi-turn dialogue |\n",
140
+ "| `\"sharegpt_de\"` | ShareGPT German | ~104K | conversations | German fine-tuning |\n",
141
+ "| `\"sharegpt_hi\"` | ShareGPT Hindi | ~153K | conversations | Hindi fine-tuning |\n",
142
+ "| `\"custom_mix\"` | Your mix | — | varies | Combine multiple |"
143
  ]
144
  },
145
  {
 
149
  "outputs": [],
150
  "source": [
151
  "from datasets import load_dataset, concatenate_datasets\n",
152
+ "\n",
153
+ "# ═══════════════════════════════════════════════════════════════\n",
154
+ "# SELECT YOUR DATASET — UNCOMMENT ONE LINE\n",
155
+ "# ═══════════════════════════════════════════════════════════════\n",
156
+ "\n",
157
+ "DATASET_CHOICE = \"cybersecurity\"\n",
158
+ "\n",
159
+ "# DATASET_CHOICE = \"ultrachat\"\n",
160
+ "# DATASET_CHOICE = \"openhermes\"\n",
161
+ "# DATASET_CHOICE = \"sharegpt_en\"\n",
162
+ "# DATASET_CHOICE = \"sharegpt_de\"\n",
163
+ "# DATASET_CHOICE = \"sharegpt_hi\"\n",
164
+ "# DATASET_CHOICE = \"custom_mix\"\n",
165
+ "\n",
166
+ "CUSTOM_DATASETS = [\n",
167
+ " (\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", \"train\", 10000, \"messages\"),\n",
168
+ " (\"HuggingFaceH4/ultrachat_200k\", \"train_sft\", 20000, \"messages\"),\n",
169
+ " (\"teknium/OpenHermes-2.5\", \"train\", 20000, \"conversations\"),\n",
170
+ "]\n",
171
+ "\n",
172
+ "print(f\"🎯 DATASET_CHOICE = {DATASET_CHOICE}\")"
173
+ ]
174
+ },
175
+ {
176
+ "cell_type": "markdown",
177
+ "metadata": {},
178
+ "source": [
179
+ "## 5️⃣ Load, Convert & Pre-process Selected Dataset"
180
+ ]
181
+ },
182
+ {
183
+ "cell_type": "code",
184
+ "execution_count": null,
185
+ "metadata": {},
186
+ "outputs": [],
187
+ "source": [
188
  "import random\n",
189
  "\n",
190
+ "def _convert_fenrir(example):\n",
191
+ " return {\"messages\": [\n",
192
+ " {\"role\": \"system\", \"content\": example[\"system\"]},\n",
193
+ " {\"role\": \"user\", \"content\": example[\"user\"]},\n",
194
+ " {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
195
+ " ]}\n",
196
+ "\n",
197
+ "def _convert_trendyol(example):\n",
198
+ " return {\"messages\": [\n",
199
+ " {\"role\": \"system\", \"content\": example[\"system\"]},\n",
200
+ " {\"role\": \"user\", \"content\": example[\"user\"]},\n",
201
+ " {\"role\": \"assistant\", \"content\": example[\"assistant\"]},\n",
202
+ " ]}\n",
203
+ "\n",
204
+ "def _convert_ultrachat(example):\n",
205
+ " return {\"messages\": example[\"messages\"]}\n",
206
+ "\n",
207
+ "def _convert_conversations(example):\n",
208
+ " msgs = []\n",
209
+ " system = example.get(\"system_prompt\", \"\") or example.get(\"system\", \"\")\n",
210
+ " if system:\n",
211
+ " msgs.append({\"role\": \"system\", \"content\": system})\n",
212
+ " for turn in example[\"conversations\"]:\n",
213
+ " role = \"user\" if turn[\"from\"] in (\"human\", \"user\") else \"assistant\"\n",
214
+ " msgs.append({\"role\": role, \"content\": turn[\"value\"]})\n",
215
+ " return {\"messages\": msgs}\n",
216
+ "\n",
217
+ "all_datasets = []\n",
218
+ "\n",
219
+ "if DATASET_CHOICE == \"cybersecurity\":\n",
220
+ " ds1 = load_dataset(\"AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1\", split=\"train\")\n",
221
+ " ds1 = ds1.map(_convert_fenrir, remove_columns=ds1.column_names, batched=False)\n",
222
+ " all_datasets.append(ds1)\n",
223
+ " ds2 = load_dataset(\"Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset\", split=\"train\")\n",
224
+ " ds2 = ds2.map(_convert_trendyol, remove_columns=ds2.column_names, batched=False)\n",
225
+ " all_datasets.append(ds2)\n",
226
+ "\n",
227
+ "elif DATASET_CHOICE == \"ultrachat\":\n",
228
+ " ds = load_dataset(\"HuggingFaceH4/ultrachat_200k\", split=\"train_sft\")\n",
229
+ " ds = ds.map(_convert_ultrachat, remove_columns=ds.column_names, batched=False)\n",
230
+ " all_datasets.append(ds)\n",
231
+ "\n",
232
+ "elif DATASET_CHOICE == \"openhermes\":\n",
233
+ " ds = load_dataset(\"teknium/OpenHermes-2.5\", split=\"train\")\n",
234
+ " ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
235
+ " all_datasets.append(ds)\n",
236
+ "\n",
237
+ "elif DATASET_CHOICE.startswith(\"sharegpt_\"):\n",
238
+ " split_map = {\"sharegpt_en\": \"english\", \"sharegpt_de\": \"german_4b_translated\", \"sharegpt_hi\": \"hindi_27b_translated\"}\n",
239
+ " ds = load_dataset(\"deepmage121/ShareGPT_multilingual\", split=split_map[DATASET_CHOICE])\n",
240
+ " ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
241
+ " all_datasets.append(ds)\n",
242
+ "\n",
243
+ "elif DATASET_CHOICE == \"custom_mix\":\n",
244
+ " for ds_id, split, n_rows, fmt in CUSTOM_DATASETS:\n",
245
+ " ds = load_dataset(ds_id, split=split)\n",
246
+ " if n_rows and len(ds) > n_rows:\n",
247
+ " ds = ds.shuffle(seed=3407).select(range(n_rows))\n",
248
+ " if fmt == \"messages\": ds = ds.map(_convert_ultrachat, remove_columns=ds.column_names, batched=False)\n",
249
+ " elif fmt == \"conversations\": ds = ds.map(_convert_conversations, remove_columns=ds.column_names, batched=False)\n",
250
+ " all_datasets.append(ds)\n",
251
+ "\n",
252
+ "else:\n",
253
+ " raise ValueError(f\"Unknown DATASET_CHOICE: {DATASET_CHOICE}\")\n",
254
+ "\n",
255
+ "train_dataset = concatenate_datasets(all_datasets) if len(all_datasets) > 1 else all_datasets[0]\n",
256
  "print(f\"\\n📊 COMBINED DATASET: {len(train_dataset)} rows\")\n",
257
  "\n",
258
+ "sample = train_dataset[random.randint(0, len(train_dataset)-1)]\n",
259
+ "print(f\"Sample roles: {[m['role'] for m in sample['messages']]}\")\n",
260
+ "for m in sample[\"messages\"]: print(f\" {m['role']}: {m['content'][:80]}...\")\n",
261
+ "\n",
262
  "if len(train_dataset) > SAMPLE_SIZE:\n",
263
  " train_dataset = train_dataset.shuffle(seed=3407).select(range(SAMPLE_SIZE))\n",
264
+ " print(f\"\\n🚀 SUBSAMPLED to {len(train_dataset)} rows\")\n",
265
  "\n",
266
  "print(f\" Effective batch size: {BATCH_SIZE * GRAD_ACCUM}\")\n",
267
  "print(f\" Steps per epoch: ~{len(train_dataset) // (BATCH_SIZE * GRAD_ACCUM)}\")\n",
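The `_convert_conversations` mapper above can be exercised in isolation on a toy ShareGPT-style record. The sample row below is invented for illustration; real rows come from the datasets listed in the choice table.

```python
# Standalone copy of the notebook's ShareGPT-style converter, run on a toy record.
def _convert_conversations(example):
    msgs = []
    # Some datasets store the system prompt under "system_prompt", others "system".
    system = example.get("system_prompt", "") or example.get("system", "")
    if system:
        msgs.append({"role": "system", "content": system})
    for turn in example["conversations"]:
        # ShareGPT marks turns with "from": "human"/"user" vs "gpt"/others.
        role = "user" if turn["from"] in ("human", "user") else "assistant"
        msgs.append({"role": role, "content": turn["value"]})
    return {"messages": msgs}

record = {  # hypothetical ShareGPT-style row
    "system": "You are a helpful assistant.",
    "conversations": [
        {"from": "human", "value": "What is a CVE?"},
        {"from": "gpt", "value": "A CVE is a public identifier for a known vulnerability."},
    ],
}
out = _convert_conversations(record)
print([m["role"] for m in out["messages"]])  # ['system', 'user', 'assistant']
```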
 
272
  "cell_type": "markdown",
273
  "metadata": {},
274
  "source": [
275
+ "## 6️⃣ Convert Messages → Text (Chat Template)"
276
  ]
277
  },
278
  {
 
281
  "metadata": {},
282
  "outputs": [],
283
  "source": [
 
284
  "def convert_messages_to_text(examples):\n",
285
  " texts = []\n",
286
  " for msgs in examples[\"messages\"]:\n",
287
+ " text = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False)\n",
288
  " texts.append(text)\n",
289
  " return {\"text\": texts}\n",
290
  "\n",
291
  "print(\"🔄 Converting messages to text...\")\n",
292
+ "train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove_columns=[\"messages\"], batch_size=100)\n",
293
  "print(f\"✅ Dataset pre-processed. Columns: {train_dataset.column_names}\")\n",
294
  "print(f\"📄 Sample text length: {len(train_dataset[0]['text'])} chars\")"
295
  ]
 
298
  "cell_type": "markdown",
299
  "metadata": {},
300
  "source": [
301
+ "## 7️⃣ Configure SFT Trainer (with Packing)"
302
  ]
303
  },
304
  {
 
338
  " ),\n",
339
  ")\n",
340
  "\n",
341
+ "print(f\"✅ Trainer ready. Dataset: {DATASET_CHOICE} | Steps: {MAX_STEPS}\")\n",
342
  "print(f\" Effective batch size: {BATCH_SIZE * GRAD_ACCUM}\")\n",
343
  "print(f\" Packing enabled: {PACKING}\")\n",
344
  "print(f\" Est. time at ~0.6 it/s: ~{MAX_STEPS * 1.7 / 3600:.1f} hours\")"
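The ETA line above treats ~0.6 it/s as roughly 1.7 seconds per step; the arithmetic can be checked standalone with the notebook's default `MAX_STEPS`:

```python
# Verify the ETA formula used in the print: MAX_STEPS * 1.7 / 3600 hours.
MAX_STEPS = 4000
seconds_per_step = 1 / 0.6        # ~1.67 s per step; the notebook rounds up to 1.7
hours = MAX_STEPS * 1.7 / 3600    # same expression as the f-string in the cell
print(f"~{hours:.1f} hours")      # ~1.9 hours
```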
 
348
  "cell_type": "markdown",
349
  "metadata": {},
350
  "source": [
351
+ "## 8️⃣ Train 🚀"
352
  ]
353
  },
354
  {
 
373
  "cell_type": "markdown",
374
  "metadata": {},
375
  "source": [
376
+ "## 9️⃣ Save & Push to HuggingFace Hub"
377
  ]
378
  },
379
  {
 
382
  "metadata": {},
383
  "outputs": [],
384
  "source": [
 
385
  "model.save_pretrained(\"./lfm25-lora-adapter\")\n",
386
  "tokenizer.save_pretrained(\"./lfm25-lora-adapter\")\n",
387
  "print(\"✅ LoRA adapter saved\")\n",
388
  "\n",
 
389
  "print(\"\\n🔄 Merging LoRA into base model...\")\n",
390
  "merged_model = model.merge_and_unload()\n",
391
  "merged_model.save_pretrained(\"./lfm25-merged\")\n",
392
  "tokenizer.save_pretrained(\"./lfm25-merged\")\n",
393
  "print(\"✅ Merged model saved\")\n",
394
  "\n",
 
395
  "# model.push_to_hub(HUB_MODEL_ID)\n",
396
  "# tokenizer.push_to_hub(HUB_MODEL_ID)"
397
  ]
 
400
  "cell_type": "markdown",
401
  "metadata": {},
402
  "source": [
403
+ "## 🔟 Inference Demo"
404
  ]
405
  },
406
  {
 
411
  "source": [
412
  "FastLanguageModel.for_inference(model)\n",
413
  "\n",
414
+ "test_prompt = \"Explain how parameterized queries prevent SQL injection, with a Python example.\"\n",
415
  "\n",
416
  "messages = [\n",
417
+ " {\"role\": \"system\", \"content\": \"You are a helpful and knowledgeable assistant.\"},\n",
418
  " {\"role\": \"user\", \"content\": test_prompt},\n",
419
  "]\n",
420
  "\n",
421
+ "inputs = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors=\"pt\").to(model.device)\n",
422
+ "\n",
423
+ "outputs = model.generate(input_ids=inputs, max_new_tokens=512, temperature=0.7, top_p=0.9,\n",
424
+ " do_sample=True, pad_token_id=tokenizer.pad_token_id, eos_token_id=tokenizer.eos_token_id)\n",
425
  "\n",
426
  "response = tokenizer.decode(outputs[0], skip_special_tokens=True)\n",
427
  "reply = response.split(\"user\")[-1].split(\"assistant\")[-1].strip()\n",
428
  "print(reply[:800])"
429
  ]
430
  },
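The `reply = response.split("user")[-1].split("assistant")[-1]` heuristic in the cell above assumes the decoded transcript still contains literal role markers. A toy illustration follows; the transcript string is invented, since the real one depends on the LFM2.5 chat template:

```python
# Toy illustration of the notebook's reply-extraction heuristic.
# The transcript below is made up; real output depends on the chat template.
response = "system You are helpful. user Explain X. assistant X works because ..."
reply = response.split("user")[-1].split("assistant")[-1].strip()
print(reply)  # X works because ...
```

If either marker is absent, `str.split` returns a one-element list, so `[-1]` falls back to the whole string and the heuristic degrades gracefully rather than raising.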
431
  {
432
  "cell_type": "markdown",
433
  "metadata": {},
434
  "source": [
435
  "---\n",
436
+ "## 📚 Dataset & Model References\n",
437
  "\n",
438
  "| Resource | Link |\n",
439
  "|----------|------|\n",
440
  "| **Liquid AI Models** | https://www.liquid.ai/models |\n",
441
  "| **LFM2.5-1.2B-Instruct** | https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct |\n",
442
  "| **Unsloth LFM2.5 Docs** | https://unsloth.ai/docs/models/tutorials/lfm2.5 |\n",
443
+ "| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |\n",
444
+ "| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |\n",
445
+ "| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |\n",
446
+ "| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |\n",
447
+ "| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |\n",
448
  "\n",
449
  "---\n",
450
+ "*Pick any dataset. Train anything. Use responsibly.*"
451
  ]
452
  }
453
  ],