feat: integrate 16 incident-response/decision-quality datasets (37 → 53)
USER PRINCIPLE HONORED: deep SDLC knowledge first, then judgment becomes sharp.
Surrogate must be expert-level coder/architect across all 19 SDLC roles before
its decisions on critical paths are reliable.
NEW DATASETS (16):
Compliance / Governance (closes biggest gap):
+ ethanolivertroy/nist-cybersecurity-training (CC0, 530K; entire NIST corpus
covering FIPS, SP 800/1800 series, IR, CSWP; 596 docs distilled to chat)
Capped to 100K rows for first pass.
DevSecOps depth:
+ AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 (Apache, 99,870)
OWASP + MITRE ATT&CK + NIST CSF + CIS + AppSec/Cloud/DevSecOps/IR
+ Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset (Apache, 53k)
Tagged: 14.8% IR/forensics, 16.2% threat-hunting, 12.3% AI/ML security
Incident-Response playbooks:
+ darkknight25/Incident_Response_Playbook_Dataset (MIT, 175 NIST SP 800-61
structured playbooks with MITRE ATT&CK + phase-by-phase response)
NEW SCHEMA: ir-playbook
Real agent trajectories (multi-language SWE):
+ nebius/SWE-rebench-openhands-trajectories (CC-BY-4.0, 67k OpenHands traces,
64-turn avg, filterable by resolved=1), capped at 50K
+ nebius/SWE-rebench (CC-BY-4.0, 27,878 issue→patch from 3,400+ repos)
R1-distilled reasoning chains (CoT, capped per source):
+ nvidia/OpenCodeReasoning (CC-BY-4.0): 100K cap from 735K
+ open-r1/codeforces-cots (CC-BY-4.0): 50K cap from 253K
+ open-r1/OpenR1-Math-220k (Apache): 50K cap
+ open-thoughts/OpenThoughts-114k (Apache): 100K cap
Preference under pressure (DPO):
+ nvidia/HelpSteer3 (CC-BY-4.0, 40,476 with explicit reasoning field)
NEW SCHEMA: helpsteer-pref (extracts 'why this is preferred')
+ argilla/ultrafeedback-binarized-preferences-cleaned (MIT, 60,917)
+ OpenAssistant/oasst2 (Apache, 80K from 135k human-vetted)
Cloud security misconfigs + chaos:
+ darkknight25/Cloud_Vulnerabilities (MIT, 1,200 across AWS/Azure/GCP/Oracle)
NEW SCHEMA: cloud-misconfig (issue → mitigation + CIS ref)
+ AYI-NEDJIMI/cloud-security-en (Apache, 230 bilingual EN/FR)
+ ddjain/krkn-dataset (MIT, K8s chaos engineering)
Linux/Bash:
+ mecha-org/linux-command-dataset (Apache, 8,669)
NEW SCHEMAS: ir-playbook, helpsteer-pref, cloud-misconfig
EVAL-ONLY HOLDOUT: SWE-bench/SWE-bench_Verified (500 instances)
Documented but NOT added to DATASETS list (prevents contamination)
EXPECTED IMPACT:
- ~1.4M new pairs after first full pull (md5 dedup against existing axentx pairs)
- Compliance/Governance: 0 → STRONG (NIST + Fenrir + Trendyol cover the field)
- IR playbooks: 0 → ADEQUATE (175 structured + scrape-sre-postmortems daemon)
- Reasoning DPO: 1 → STRONG (HelpSteer + UF + oasst2)
- Agentic trajectories: 1 → STRONG (SWE-smith + SWE-rebench × 2)
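
The md5 dedup against existing pairs mentioned above could look roughly like the sketch below. It is not the code in bin/dataset-enrich.sh: the names `pair_hash` and `dedup_pairs`, the lowercase/whitespace normalization, and the choice to hash prompt and response together are assumptions made for illustration.

```python
import hashlib
from typing import Iterable, Iterator, Set, Tuple

Pair = Tuple[str, str]


def _normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def pair_hash(prompt: str, response: str) -> str:
    """Return the md5 hex digest of a normalized (prompt, response) pair."""
    # A NUL separator keeps ("a b", "c") distinct from ("a", "b c").
    key = _normalize(prompt) + "\x00" + _normalize(response)
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def dedup_pairs(pairs: Iterable[Pair], seen: Set[str]) -> Iterator[Pair]:
    """Yield only pairs whose hash is not yet in `seen`, updating `seen`."""
    for prompt, response in pairs:
        digest = pair_hash(prompt, response)
        if digest in seen:
            continue  # already present in the existing corpus or earlier in this pull
        seen.add(digest)
        yield prompt, response
```

Seeding `seen` with the hashes of the existing pairs before a pull means one pass both filters against the old corpus and removes duplicates inside the new data.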
- bin/dataset-enrich.sh +66 -2
@@ -87,6 +87,32 @@ DATASETS = [
     ("CohereForAI/aya_dataset", "Apache", "aya-multi", "instr-resp", 150000),
     # ── Code corpus (legal alternative to the-stack) ─────────────────────────
     ("iidai/codenet", "CDLA", "ibm-codenet", "code-only", 200000),
+    # ── NIST cybersecurity full corpus (530K CC0 - closes compliance gap) ────
+    ("ethanolivertroy/nist-cybersecurity-training", "CC0", "nist-cyber", "messages", 100000),
+    # ── DevSecOps depth (Fenrir + Trendyol explicitly tagged for IR/threat) ─
+    ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1","Apache", "fenrir-cyber", "system-user-assistant", 99870),
+    ("Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset","Apache","trendyol-cyber","instr-resp", 53202),
+    # ── Incident-Response playbooks (NIST SP 800-61 structure) ───────────────
+    ("darkknight25/Incident_Response_Playbook_Dataset","MIT", "ir-playbooks", "ir-playbook", 175),
+    # ── Real agent trajectories (OpenHands SWE-rebench) ─────────────────────
+    ("nebius/SWE-rebench-openhands-trajectories", "CC-BY-4.0", "swe-rebench-traj", "swe-trajectory", 50000),
+    ("nebius/SWE-rebench", "CC-BY-4.0", "swe-rebench-tasks", "swe-instance", 27878),
+    # ── Reasoning chains (R1-distilled CoT) ──────────────────────────────────
+    ("nvidia/OpenCodeReasoning", "CC-BY-4.0", "opencode-reasoning", "instr-resp", 100000),
+    ("open-r1/codeforces-cots", "CC-BY-4.0", "codeforces-cots", "instr-resp", 50000),
+    ("open-r1/OpenR1-Math-220k", "Apache", "openr1-math", "instr-resp", 50000),
+    ("open-thoughts/OpenThoughts-114k", "Apache", "open-thoughts", "instr-resp", 100000),
+    # ── Preference / DPO under pressure (good vs bad reasoning) ──────────────
+    ("nvidia/HelpSteer3", "CC-BY-4.0", "helpsteer3", "helpsteer-pref", 40476),
+    ("argilla/ultrafeedback-binarized-preferences-cleaned","MIT", "uf-cleaned", "chosen-rejected", 60917),
+    ("OpenAssistant/oasst2", "Apache", "oasst2", "messages", 80000),
+    # ── Cloud security misconfigs + chaos engineering ────────────────────────
+    ("darkknight25/Cloud_Vulnerabilities", "MIT", "cloud-vulns", "cloud-misconfig", 1200),
+    ("AYI-NEDJIMI/cloud-security-en", "Apache", "cloud-sec-en", "cloud-misconfig", 230),
+    ("ddjain/krkn-dataset", "MIT", "krkn-chaos", "instr-resp", 1000),
+    # ── Linux/bash command knowledge ─────────────────────────────────────────
+    ("mecha-org/linux-command-dataset", "Apache", "linux-commands", "instr-resp", 8669),
+    # NOTE: SWE-bench/SWE-bench_Verified RESERVED AS EVAL ONLY - never include here.
 ]
 
 # 1. Existing axentx hashes for dedup
@@ -226,8 +252,46 @@ with open(out_path, "w") as out:
             if len(code) < 80: continue
             prompt = f"Explain what this {lang} code does:\n```{lang}\n{code}\n```"
             response = f"[Code sample from IBM CodeNet - pending LLM-generated explanation]"
-            # Skip
-
+            continue  # Skip - placeholder responses pollute training
+        elif schema == "ir-playbook":  # NIST SP 800-61 IR playbooks
+            title = str(row.get("title") or row.get("incident_type") or row.get("name",""))
+            phases = row.get("phases") or row.get("response_phases") or {}
+            mitre = row.get("mitre_attack") or row.get("tactics") or []
+            if not title: continue
+            prompt = f"How should an incident response team handle a {title} incident? Provide a NIST SP 800-61 playbook."
+            response_parts = [f"# {title}"]
+            if mitre:
+                response_parts.append(f"\n## MITRE ATT&CK tactics: {', '.join(str(m) for m in mitre[:6])}")
+            if isinstance(phases, dict):
+                for phase, content in phases.items():
+                    response_parts.append(f"\n## {phase}\n{content}")
+            elif isinstance(phases, list):
+                for p in phases:
+                    response_parts.append(f"\n## {p}")
+            response = "\n".join(response_parts)
+        elif schema == "helpsteer-pref":  # NVIDIA HelpSteer3 preference + reasoning
+            user_msg = str(row.get("context") or row.get("prompt",""))[:4000]
+            chosen = str(row.get("response_a") or row.get("chosen",""))[:6000]
+            rejected = str(row.get("response_b") or row.get("rejected",""))[:6000]
+            pref = row.get("individual_preference", {}) or {}
+            reasoning = ""
+            if isinstance(pref, dict):
+                reasoning = str(pref.get("reasoning",""))[:2000]
+            if not user_msg or not chosen: continue
+            prompt = user_msg
+            response = chosen
+            if reasoning:
+                response += f"\n\n[Why this is preferred: {reasoning}]"
+        elif schema == "cloud-misconfig":  # darkknight25 / AYI-NEDJIMI cloud security
+            cloud = str(row.get("cloud_provider") or row.get("provider") or "Cloud")
+            issue = str(row.get("vulnerability") or row.get("misconfiguration") or row.get("issue",""))[:2000]
+            mitig = str(row.get("mitigation") or row.get("fix") or row.get("remediation",""))[:3000]
+            cis = row.get("cis_benchmark") or row.get("cis_ref","")
+            if not issue or not mitig: continue
+            prompt = f"In {cloud}, how do you remediate this misconfiguration: {issue}"
+            response = f"**Mitigation**: {mitig}"
+            if cis:
+                response += f"\n\n**CIS Benchmark reference**: {cis}"
         else:
             continue
 
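
The ir-playbook branch added in this diff can be exercised in isolation with a sketch like the one below. `first_text` and `playbook_to_pair` are hypothetical names, not functions in bin/dataset-enrich.sh. The column names are the ones the diff probes; the dataset's real columns are not confirmed here. The sketch differs from the diff in two places, both assumptions about what is wanted: a key that is present with value `None` yields an empty string instead of the literal `"None"` that `str(None)` produces, and a single string in the MITRE field is wrapped in a list before slicing.

```python
from typing import Any, Mapping, Optional, Tuple


def first_text(row: Mapping[str, Any], *keys: str) -> str:
    """Return the first truthy value among `keys` as a string, else ''."""
    for key in keys:
        value = row.get(key)
        if value:  # skips None, '' and empty containers
            return str(value)
    return ""


def playbook_to_pair(row: Mapping[str, Any]) -> Optional[Tuple[str, str]]:
    """Turn one playbook row into a (prompt, response) pair, or None to skip."""
    title = first_text(row, "title", "incident_type", "name")
    if not title:
        return None

    phases = row.get("phases") or row.get("response_phases") or {}
    mitre = row.get("mitre_attack") or row.get("tactics") or []
    if isinstance(mitre, str):
        mitre = [mitre]  # avoid slicing a bare string into characters

    parts = [f"# {title}"]
    if mitre:
        tactics = ", ".join(str(m) for m in list(mitre)[:6])
        parts.append(f"\n## MITRE ATT&CK tactics: {tactics}")
    if isinstance(phases, dict):
        for phase, content in phases.items():
            parts.append(f"\n## {phase}\n{content}")
    elif isinstance(phases, list):
        for phase in phases:
            parts.append(f"\n## {phase}")

    prompt = (
        f"How should an incident response team handle a {title} incident? "
        "Provide a NIST SP 800-61 playbook."
    )
    return prompt, "\n".join(parts)
```

Returning `None` for unusable rows plays the role of the `continue` statements in the loop, so the caller decides how to skip.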