Upload README.md
Browse files
README.md
CHANGED
|
@@ -30,18 +30,19 @@ A curated collection of **production-ready Colab notebooks** for fine-tuning sta
|
|
| 30 |
|
| 31 |
---
|
| 32 |
|
| 33 |
-
## π Dataset Selection β
|
| 34 |
|
| 35 |
Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.
|
| 36 |
|
| 37 |
| Choice | Dataset | Rows | Format | Best For | Language |
|
| 38 |
|--------|---------|------|--------|----------|----------|
|
| 39 |
-
| `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153Kβ50K | system/user/assistant | Ethical hacking, pentesting education | English |
|
| 40 |
-
| `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200Kβ50K | messages (role/content) | General conversation, chatbot | English |
|
| 41 |
| `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+β50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
|
| 42 |
| `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90Kβ50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
|
| 43 |
| `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104Kβ50K | conversations (human/gpt) | German language fine-tuning | **German** |
|
| 44 |
| `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi 27B) | ~153Kβ50K | conversations (human/gpt) | Hindi language fine-tuning | **Hindi** |
|
|
|
|
| 45 |
| `custom_mix` | Your combination | β | varies | Combine datasets for hybrid tuning | Mixed |
|
| 46 |
|
| 47 |
### How to Switch Datasets (in any notebook)
|
|
@@ -55,16 +56,35 @@ DATASET_CHOICE = "cybersecurity" # β Default (defensive security)
|
|
| 55 |
# DATASET_CHOICE = "sharegpt_en" # β English dialogue
|
| 56 |
# DATASET_CHOICE = "sharegpt_de" # β German
|
| 57 |
# DATASET_CHOICE = "sharegpt_hi" # β Hindi
|
|
|
|
| 58 |
# DATASET_CHOICE = "custom_mix" # β Mix multiple
|
| 59 |
```
|
| 60 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 61 |
### Mixing Datasets (custom_mix)
|
| 62 |
|
| 63 |
```python
|
| 64 |
CUSTOM_DATASETS = [
|
| 65 |
# (dataset_id, split, num_rows, format_type)
|
|
|
|
| 66 |
("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
|
| 67 |
-
("
|
| 68 |
("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
|
| 69 |
]
|
| 70 |
```
|
|
@@ -103,7 +123,7 @@ train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove
|
|
| 103 |
trainer = SFTTrainer(..., dataset_text_field="text")
|
| 104 |
```
|
| 105 |
|
| 106 |
-
All notebooks handle format auto-detection (Fenrir, UltraChat, OpenHermes, ShareGPT) automatically.
|
| 107 |
|
| 108 |
---
|
| 109 |
|
|
@@ -130,6 +150,7 @@ All notebooks handle format auto-detection (Fenrir, UltraChat, OpenHermes, Share
|
|
| 130 |
| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
|
| 131 |
| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
|
| 132 |
| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
|
|
|
|
| 133 |
| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
|
| 134 |
| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |
|
| 135 |
|
|
|
|
| 30 |
|
| 31 |
---
|
| 32 |
|
| 33 |
+
## π Dataset Selection β 8 Built-in Choices
|
| 34 |
|
| 35 |
Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.
|
| 36 |
|
| 37 |
| Choice | Dataset | Rows | Format | Best For | Language |
|
| 38 |
|--------|---------|------|--------|----------|----------|
|
| 39 |
+
| `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153Kβ50K | system/user/assistant | **Ethical hacking, pentesting education** | English |
|
| 40 |
+
| `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200Kβ50K | messages (role/content) | General conversation, chatbot tuning | English |
|
| 41 |
| `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+β50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
|
| 42 |
| `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90Kβ50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
|
| 43 |
| `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104Kβ50K | conversations (human/gpt) | German language fine-tuning | **German** |
|
| 44 |
| `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi 27B) | ~153Kβ50K | conversations (human/gpt) | Hindi language fine-tuning | **Hindi** |
|
| 45 |
+
| `code_corpus` | **[Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training)** | 240Kβ50K | text (code files with domain/repo/lang metadata) | **Code completion, coding assistant** | Multi (20 domains: Rust, Python, C++, Kotlin, Flutter, game engines, web frameworks, ethical hacking repos, etc.) |
|
| 46 |
| `custom_mix` | Your combination | β | varies | Combine datasets for hybrid tuning | Mixed |
|
| 47 |
|
| 48 |
### How to Switch Datasets (in any notebook)
|
|
|
|
| 56 |
# DATASET_CHOICE = "sharegpt_en" # β English dialogue
|
| 57 |
# DATASET_CHOICE = "sharegpt_de" # β German
|
| 58 |
# DATASET_CHOICE = "sharegpt_hi" # β Hindi
|
| 59 |
+
# DATASET_CHOICE = "code_corpus" # β Code completion (Rust, Python, C++, etc.)
|
| 60 |
# DATASET_CHOICE = "custom_mix" # β Mix multiple
|
| 61 |
```
|
| 62 |
|
| 63 |
+
### Code Corpus Dataset Details
|
| 64 |
+
|
| 65 |
+
The [Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training) dataset contains **240,378 code files** from top open-source repositories across 20 domains:
|
| 66 |
+
|
| 67 |
+
| Domain | Examples |
|
| 68 |
+
|--------|----------|
|
| 69 |
+
| `web_ui` | Web frameworks, UI components |
|
| 70 |
+
| `cpp` | C++ systems code |
|
| 71 |
+
| `kotlin_android` | Android apps |
|
| 72 |
+
| `rust` | Rust systems (e.g., actix-web) |
|
| 73 |
+
| `python` | Python libraries |
|
| 74 |
+
| `ethical_hacking` | Security tools, pentesting repos |
|
| 75 |
+
| `game_engines` | Game development |
|
| 76 |
+
| `ui_ux_design` | Design systems |
|
| 77 |
+
|
| 78 |
+
Each example has: `text` (the full code file), `domain`, `repo`, `language`, `file_path`, `size_chars`. The notebook converts each code snippet into a user/assistant conversation: user asks to explain/improve the code, assistant provides the code.
|
| 79 |
+
|
| 80 |
### Mixing Datasets (custom_mix)
|
| 81 |
|
| 82 |
```python
|
| 83 |
CUSTOM_DATASETS = [
|
| 84 |
# (dataset_id, split, num_rows, format_type)
|
| 85 |
+
# format_type: "messages" | "conversations" | "text"
|
| 86 |
("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
|
| 87 |
+
("krystv/code-corpus-llm-training", "train", 20000, "text"),
|
| 88 |
("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
|
| 89 |
]
|
| 90 |
```
|
|
|
|
| 123 |
trainer = SFTTrainer(..., dataset_text_field="text")
|
| 124 |
```
|
| 125 |
|
| 126 |
+
All notebooks handle format auto-detection (Fenrir, UltraChat, OpenHermes, ShareGPT, Code Corpus) automatically.
|
| 127 |
|
| 128 |
---
|
| 129 |
|
|
|
|
| 150 |
| **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
|
| 151 |
| **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
|
| 152 |
| **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
|
| 153 |
+
| **Code Corpus LLM Training** | https://huggingface.co/datasets/krystv/code-corpus-llm-training |
|
| 154 |
| **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
|
| 155 |
| **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |
|
| 156 |
|