asdf98 commited on
Commit
a7d1cc9
Β·
verified Β·
1 Parent(s): 2fab0ea

Upload README.md

Browse files
Files changed (1) hide show
  1. README.md +26 -5
README.md CHANGED
@@ -30,18 +30,19 @@ A curated collection of **production-ready Colab notebooks** for fine-tuning sta
30
 
31
  ---
32
 
33
- ## πŸ“Š Dataset Selection β€” 7 Built-in Choices
34
 
35
  Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.
36
 
37
  | Choice | Dataset | Rows | Format | Best For | Language |
38
  |--------|---------|------|--------|----------|----------|
39
- | `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153K→50K | system/user/assistant | Ethical hacking, pentesting education | English |
40
- | `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200K→50K | messages (role/content) | General conversation, chatbot | English |
41
  | `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+β†’50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
42
  | `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90Kβ†’50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
43
  | `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104Kβ†’50K | conversations (human/gpt) | German language fine-tuning | **German** |
44
  | `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi 27B) | ~153Kβ†’50K | conversations (human/gpt) | Hindi language fine-tuning | **Hindi** |
 
45
  | `custom_mix` | Your combination | β€” | varies | Combine datasets for hybrid tuning | Mixed |
46
 
47
  ### How to Switch Datasets (in any notebook)
@@ -55,16 +56,35 @@ DATASET_CHOICE = "cybersecurity" # ← Default (defensive security)
55
  # DATASET_CHOICE = "sharegpt_en" # ← English dialogue
56
  # DATASET_CHOICE = "sharegpt_de" # ← German
57
  # DATASET_CHOICE = "sharegpt_hi" # ← Hindi
 
58
  # DATASET_CHOICE = "custom_mix" # ← Mix multiple
59
  ```
60
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
61
  ### Mixing Datasets (custom_mix)
62
 
63
  ```python
64
  CUSTOM_DATASETS = [
65
  # (dataset_id, split, num_rows, format_type)
 
66
  ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
67
- ("HuggingFaceH4/ultrachat_200k", "train_sft", 20000, "messages"),
68
  ("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
69
  ]
70
  ```
@@ -103,7 +123,7 @@ train_dataset = train_dataset.map(convert_messages_to_text, batched=True, remove
103
  trainer = SFTTrainer(..., dataset_text_field="text")
104
  ```
105
 
106
- All notebooks handle format auto-detection (Fenrir, UltraChat, OpenHermes, ShareGPT) automatically.
107
 
108
  ---
109
 
@@ -130,6 +150,7 @@ All notebooks handle format auto-detection (Fenrir, UltraChat, OpenHermes, Share
130
  | **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
131
  | **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
132
  | **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
 
133
  | **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
134
  | **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |
135
 
 
30
 
31
  ---
32
 
33
+ ## πŸ“Š Dataset Selection β€” 8 Built-in Choices
34
 
35
  Every notebook includes a `DATASET_CHOICE` variable. **Just uncomment one line** to pick your data.
36
 
37
  | Choice | Dataset | Rows | Format | Best For | Language |
38
  |--------|---------|------|--------|----------|----------|
39
+ | `cybersecurity` | **Fenrir v2.1 + Trendyol** | 153K→50K | system/user/assistant | **Ethical hacking, pentesting education** | English |
40
+ | `ultrachat` | [UltraChat 200K](https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k) (SFT) | 200K→50K | messages (role/content) | General conversation, chatbot tuning | English |
41
  | `openhermes` | [OpenHermes 2.5](https://huggingface.co/datasets/teknium/OpenHermes-2.5) | 1M+β†’50K | conversations (human/gpt) | Reasoning, coding, instruction following | English |
42
  | `sharegpt_en` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (English) | ~90Kβ†’50K | conversations (human/gpt) | Multi-turn dialogue, general QA | English |
43
  | `sharegpt_de` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (German) | ~104Kβ†’50K | conversations (human/gpt) | German language fine-tuning | **German** |
44
  | `sharegpt_hi` | [ShareGPT](https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual) (Hindi 27B) | ~153Kβ†’50K | conversations (human/gpt) | Hindi language fine-tuning | **Hindi** |
45
+ | `code_corpus` | **[Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training)** | 240K→50K | text (code files with domain/repo/lang metadata) | **Code completion, coding assistant** | Multi (20 domains: Rust, Python, C++, Kotlin, Flutter, game engines, web frameworks, ethical hacking repos, etc.) |
46
  | `custom_mix` | Your combination | β€” | varies | Combine datasets for hybrid tuning | Mixed |
47
 
48
  ### How to Switch Datasets (in any notebook)
 
56
  # DATASET_CHOICE = "sharegpt_en" # ← English dialogue
57
  # DATASET_CHOICE = "sharegpt_de" # ← German
58
  # DATASET_CHOICE = "sharegpt_hi" # ← Hindi
59
+ # DATASET_CHOICE = "code_corpus" # ← Code completion (Rust, Python, C++, etc.)
60
  # DATASET_CHOICE = "custom_mix" # ← Mix multiple
61
  ```
62
 
63
+ ### Code Corpus Dataset Details
64
+
65
+ The [Code Corpus LLM Training](https://huggingface.co/datasets/krystv/code-corpus-llm-training) dataset contains **240,378 code files** from top open-source repositories across 20 domains:
66
+
67
+ | Domain | Examples |
68
+ |--------|----------|
69
+ | `web_ui` | Web frameworks, UI components |
70
+ | `cpp` | C++ systems code |
71
+ | `kotlin_android` | Android apps |
72
+ | `rust` | Rust systems (e.g., actix-web) |
73
+ | `python` | Python libraries |
74
+ | `ethical_hacking` | Security tools, pentesting repos |
75
+ | `game_engines` | Game development |
76
+ | `ui_ux_design` | Design systems |
77
+
78
+ Each example has: `text` (the full code file), `domain`, `repo`, `language`, `file_path`, `size_chars`. The notebook converts each code snippet into a user/assistant conversation: user asks to explain/improve the code, assistant provides the code.
79
+
80
  ### Mixing Datasets (custom_mix)
81
 
82
  ```python
83
  CUSTOM_DATASETS = [
84
  # (dataset_id, split, num_rows, format_type)
85
+ # format_type: "messages" | "conversations" | "text"
86
  ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1", "train", 10000, "messages"),
87
+ ("krystv/code-corpus-llm-training", "train", 20000, "text"),
88
  ("teknium/OpenHermes-2.5", "train", 20000, "conversations"),
89
  ]
90
  ```
 
123
  trainer = SFTTrainer(..., dataset_text_field="text")
124
  ```
125
 
126
+ All notebooks handle format auto-detection (Fenrir, UltraChat, OpenHermes, ShareGPT, Code Corpus) automatically.
127
 
128
  ---
129
 
 
150
  | **UltraChat 200K** | https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k |
151
  | **OpenHermes 2.5** | https://huggingface.co/datasets/teknium/OpenHermes-2.5 |
152
  | **ShareGPT Multilingual** | https://huggingface.co/datasets/deepmage121/ShareGPT_multilingual |
153
+ | **Code Corpus LLM Training** | https://huggingface.co/datasets/krystv/code-corpus-llm-training |
154
  | **Fenrir Cybersecurity** | https://huggingface.co/datasets/AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 |
155
  | **Trendyol Cybersecurity** | https://huggingface.co/datasets/Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset |
156