Ashira Pitchayapakayakul committed
Commit 821074c · 1 Parent(s): e40f7ec

feat: integrate 16 incident-response/decision-quality datasets (37 → 53)

USER PRINCIPLE HONORED: deep SDLC knowledge first, then judgment becomes sharp.
Surrogate must be expert-level coder/architect across all 19 SDLC roles before
its decisions on critical paths are reliable.

NEW DATASETS (16):

Compliance / Governance (closes biggest gap):
+ ethanolivertroy/nist-cybersecurity-training (CC0, 530K; entire NIST corpus
covering FIPS, SP 800/1800 series, IR, CSWP; 596 docs distilled to chat)
Capped to 100K rows for first pass.

DevSecOps depth:
+ AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1 (Apache, 99,870)
OWASP + MITRE ATT&CK + NIST CSF + CIS + AppSec/Cloud/DevSecOps/IR
+ Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset (Apache, 53k)
Tagged: 14.8% IR/forensics, 16.2% threat-hunting, 12.3% AI/ML security

Incident-Response playbooks:
+ darkknight25/Incident_Response_Playbook_Dataset (MIT, 175 NIST SP 800-61
structured playbooks with MITRE ATT&CK + phase-by-phase response)
NEW SCHEMA: ir-playbook

Real agent trajectories (multi-language SWE):
+ nebius/SWE-rebench-openhands-trajectories (CC-BY-4.0, 67k OpenHands traces,
64-turn avg, filterable by resolved=1), capped at 50K
+ nebius/SWE-rebench (CC-BY-4.0, 27,878 issue→patch from 3,400+ repos)

R1-distilled reasoning chains (CoT, capped per source):
+ nvidia/OpenCodeReasoning (CC-BY-4.0): 100K cap from 735K
+ open-r1/codeforces-cots (CC-BY-4.0): 50K cap from 253K
+ open-r1/OpenR1-Math-220k (Apache): 50K cap
+ open-thoughts/OpenThoughts-114k (Apache): 100K cap
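
The per-source caps listed above could be applied while streaming, so that a 735K-row source is never fully materialized. A minimal sketch, assuming each source is exposed as an iterable of row dicts; `cap_rows` is a hypothetical helper and is not taken from `bin/dataset-enrich.sh`:

```python
from itertools import islice
from typing import Iterable, Iterator


def cap_rows(rows: Iterable[dict], cap: int) -> Iterator[dict]:
    """Yield at most `cap` rows from `rows` without reading past the cap."""
    if cap < 0:
        raise ValueError("cap must be non-negative")
    return islice(rows, cap)


# Stand-in for a large streamed source (synthetic rows, for illustration only).
source = ({"id": i} for i in range(735_000))
capped = list(cap_rows(source, 100_000))
print(len(capped))
```

Because `islice` stops pulling from the underlying iterator once the cap is reached, the rows beyond the cap are never generated or downloaded.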

Preference under pressure (DPO):
+ nvidia/HelpSteer3 (CC-BY-4.0, 40,476 with explicit reasoning field)
NEW SCHEMA: helpsteer-pref (extracts 'why this is preferred')
+ argilla/ultrafeedback-binarized-preferences-cleaned (MIT, 60,917)
+ OpenAssistant/oasst2 (Apache, 80K from 135k human-vetted)

Cloud security misconfigs + chaos:
+ darkknight25/Cloud_Vulnerabilities (MIT, 1,200 across AWS/Azure/GCP/Oracle)
NEW SCHEMA: cloud-misconfig (issue → mitigation + CIS ref)
+ AYI-NEDJIMI/cloud-security-en (Apache, 230 bilingual EN/FR)
+ ddjain/krkn-dataset (MIT, K8s chaos engineering)

Linux/Bash:
+ mecha-org/linux-command-dataset (Apache, 8,669)

NEW SCHEMAS: ir-playbook, helpsteer-pref, cloud-misconfig

EVAL-ONLY HOLDOUT: SWE-bench/SWE-bench_Verified (500 instances)
Documented but NOT added to DATASETS list (prevents contamination)
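
One way to enforce the eval-only rule mechanically is a guard that fails fast if a reserved dataset ID ever appears in the training list. A minimal sketch, assuming `DATASETS` entries are `(repo_id, license, slug, schema, cap)` tuples as in the diff; `EVAL_ONLY` and `assert_no_eval_leak` are hypothetical names and are not part of the script:

```python
EVAL_ONLY = frozenset({"SWE-bench/SWE-bench_Verified"})  # hypothetical constant

# Two entries copied from the commit, used here only as sample input.
DATASETS = [
    ("nebius/SWE-rebench", "CC-BY-4.0", "swe-rebench-tasks", "swe-instance", 27878),
    ("mecha-org/linux-command-dataset", "Apache", "linux-commands", "instr-resp", 8669),
]


def assert_no_eval_leak(datasets, reserved=EVAL_ONLY):
    """Raise ValueError if any reserved eval dataset is in the training list."""
    reserved_lower = {name.lower() for name in reserved}
    leaked = [repo for repo, *_ in datasets if repo.lower() in reserved_lower]
    if leaked:
        raise ValueError(f"eval-only datasets in training list: {leaked}")


assert_no_eval_leak(DATASETS)  # passes silently for the entries above
```

The comparison is case-insensitive so that a differently capitalized repo ID does not slip past the check.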

EXPECTED IMPACT:
- ~1.4M new pairs after first full pull (md5 dedup against existing axentx pairs)
- Compliance/Governance: 0 → STRONG (NIST + Fenrir + Trendyol cover the field)
- IR playbooks: 0 → ADEQUATE (175 structured + scrape-sre-postmortems daemon)
- Reasoning DPO: 1 → STRONG (HelpSteer + UF + oasst2)
- Agentic trajectories: 1 → STRONG (SWE-smith + SWE-rebench × 2)
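
The md5 dedup mentioned in the first bullet can be sketched as follows. This is an illustration under assumptions: the exact text the script hashes is not shown in this commit, so hashing the whitespace-normalized, lowercased prompt and response is a guess, and `pair_hash` and `dedup_pairs` are hypothetical names:

```python
import hashlib


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase (assumed normalization)."""
    return " ".join(text.split()).lower()


def pair_hash(prompt: str, response: str) -> str:
    """md5 hex digest of a normalized (prompt, response) pair."""
    key = _normalize(prompt) + "\x1f" + _normalize(response)
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def dedup_pairs(pairs, seen):
    """Yield pairs whose hash is not in `seen`, adding new hashes as it goes."""
    for prompt, response in pairs:
        digest = pair_hash(prompt, response)
        if digest in seen:
            continue
        seen.add(digest)
        yield prompt, response


# Hashes of pairs already in the corpus (synthetic example data).
existing = {pair_hash("What is NIST SP 800-61?", "An incident handling guide.")}
new_pairs = [
    ("what is  NIST SP 800-61?", "An incident handling guide."),  # duplicate
    ("What is CIS?", "A set of security benchmarks."),
]
kept = list(dedup_pairs(new_pairs, existing))
print(kept)
```

md5 is used here only as a fast fingerprint for exact-match dedup, not for any security purpose; near-duplicates with different wording would still pass through.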

Files changed (1)
  1. bin/dataset-enrich.sh +66 -2
bin/dataset-enrich.sh CHANGED
@@ -87,6 +87,32 @@ DATASETS = [
     ("CohereForAI/aya_dataset", "Apache", "aya-multi", "instr-resp", 150000),
     # ── Code corpus (legal alternative to the-stack) ─────────────────────────
     ("iidai/codenet", "CDLA", "ibm-codenet", "code-only", 200000),
+    # ── NIST cybersecurity full corpus (530K CC0 - closes compliance gap) ────
+    ("ethanolivertroy/nist-cybersecurity-training", "CC0", "nist-cyber", "messages", 100000),
+    # ── DevSecOps depth (Fenrir + Trendyol explicitly tagged for IR/threat) ─
+    ("AlicanKiraz0/Cybersecurity-Dataset-Fenrir-v2.1","Apache", "fenrir-cyber", "system-user-assistant", 99870),
+    ("Trendyol/Trendyol-Cybersecurity-Instruction-Tuning-Dataset","Apache","trendyol-cyber","instr-resp", 53202),
+    # ── Incident-Response playbooks (NIST SP 800-61 structure) ───────────────
+    ("darkknight25/Incident_Response_Playbook_Dataset","MIT", "ir-playbooks", "ir-playbook", 175),
+    # ── Real agent trajectories (OpenHands SWE-rebench) ─────────────────────
+    ("nebius/SWE-rebench-openhands-trajectories", "CC-BY-4.0", "swe-rebench-traj", "swe-trajectory", 50000),
+    ("nebius/SWE-rebench", "CC-BY-4.0", "swe-rebench-tasks", "swe-instance", 27878),
+    # ── Reasoning chains (R1-distilled CoT) ──────────────────────────────────
+    ("nvidia/OpenCodeReasoning", "CC-BY-4.0", "opencode-reasoning", "instr-resp", 100000),
+    ("open-r1/codeforces-cots", "CC-BY-4.0", "codeforces-cots", "instr-resp", 50000),
+    ("open-r1/OpenR1-Math-220k", "Apache", "openr1-math", "instr-resp", 50000),
+    ("open-thoughts/OpenThoughts-114k", "Apache", "open-thoughts", "instr-resp", 100000),
+    # ── Preference / DPO under pressure (good vs bad reasoning) ──────────────
+    ("nvidia/HelpSteer3", "CC-BY-4.0", "helpsteer3", "helpsteer-pref", 40476),
+    ("argilla/ultrafeedback-binarized-preferences-cleaned","MIT", "uf-cleaned", "chosen-rejected", 60917),
+    ("OpenAssistant/oasst2", "Apache", "oasst2", "messages", 80000),
+    # ── Cloud security misconfigs + chaos engineering ────────────────────────
+    ("darkknight25/Cloud_Vulnerabilities", "MIT", "cloud-vulns", "cloud-misconfig", 1200),
+    ("AYI-NEDJIMI/cloud-security-en", "Apache", "cloud-sec-en", "cloud-misconfig", 230),
+    ("ddjain/krkn-dataset", "MIT", "krkn-chaos", "instr-resp", 1000),
+    # ── Linux/bash command knowledge ─────────────────────────────────────────
+    ("mecha-org/linux-command-dataset", "Apache", "linux-commands", "instr-resp", 8669),
+    # NOTE: SWE-bench/SWE-bench_Verified RESERVED AS EVAL ONLY - never include here.
 ]
 
 # 1. Existing axentx hashes for dedup
@@ -226,8 +252,46 @@ with open(out_path, "w") as out:
             if len(code) < 80: continue
             prompt = f"Explain what this {lang} code does:\n```{lang}\n{code}\n```"
             response = f"[Code sample from IBM CodeNet - pending LLM-generated explanation]"
-            # Skip writing - placeholder responses pollute training data
-            continue
+            continue # Skip - placeholder responses pollute training
+        elif schema == "ir-playbook": # NIST SP 800-61 IR playbooks
+            title = str(row.get("title") or row.get("incident_type") or row.get("name",""))
+            phases = row.get("phases") or row.get("response_phases") or {}
+            mitre = row.get("mitre_attack") or row.get("tactics") or []
+            if not title: continue
+            prompt = f"How should an incident response team handle a {title} incident? Provide a NIST SP 800-61 playbook."
+            response_parts = [f"# {title}"]
+            if mitre:
+                response_parts.append(f"\n## MITRE ATT&CK tactics: {', '.join(str(m) for m in mitre[:6])}")
+            if isinstance(phases, dict):
+                for phase, content in phases.items():
+                    response_parts.append(f"\n## {phase}\n{content}")
+            elif isinstance(phases, list):
+                for p in phases:
+                    response_parts.append(f"\n## {p}")
+            response = "\n".join(response_parts)
+        elif schema == "helpsteer-pref": # NVIDIA HelpSteer3 preference + reasoning
+            user_msg = str(row.get("context") or row.get("prompt",""))[:4000]
+            chosen = str(row.get("response_a") or row.get("chosen",""))[:6000]
+            rejected = str(row.get("response_b") or row.get("rejected",""))[:6000]
+            pref = row.get("individual_preference", {}) or {}
+            reasoning = ""
+            if isinstance(pref, dict):
+                reasoning = str(pref.get("reasoning",""))[:2000]
+            if not user_msg or not chosen: continue
+            prompt = user_msg
+            response = chosen
+            if reasoning:
+                response += f"\n\n[Why this is preferred: {reasoning}]"
+        elif schema == "cloud-misconfig": # darkknight25 / AYI-NEDJIMI cloud security
+            cloud = str(row.get("cloud_provider") or row.get("provider") or "Cloud")
+            issue = str(row.get("vulnerability") or row.get("misconfiguration") or row.get("issue",""))[:2000]
+            mitig = str(row.get("mitigation") or row.get("fix") or row.get("remediation",""))[:3000]
+            cis = row.get("cis_benchmark") or row.get("cis_ref","")
+            if not issue or not mitig: continue
+            prompt = f"In {cloud}, how do you remediate this misconfiguration: {issue}"
+            response = f"**Mitigation**: {mitig}"
+            if cis:
+                response += f"\n\n**CIS Benchmark reference**: {cis}"
         else:
             continue
 
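
The `cloud-misconfig` branch above can be exercised in isolation by lifting it into a function. A minimal sketch, not the committed code: `convert_cloud_misconfig` is a hypothetical name, the sample row and its CIS reference are made up, and the field names are simply the fallbacks the branch already tries. One deliberate difference is that the last fallback uses `or ""`, so a key that is present with a `None` value is skipped instead of becoming the string `"None"`:

```python
from typing import Optional, Tuple


def convert_cloud_misconfig(row: dict) -> Optional[Tuple[str, str]]:
    """Return (prompt, response) for a cloud-misconfig row, or None to skip it."""
    cloud = str(row.get("cloud_provider") or row.get("provider") or "Cloud")
    # `or ""` guards against keys that exist but hold None.
    issue = str(row.get("vulnerability") or row.get("misconfiguration")
                or row.get("issue") or "")[:2000]
    mitig = str(row.get("mitigation") or row.get("fix")
                or row.get("remediation") or "")[:3000]
    cis = row.get("cis_benchmark") or row.get("cis_ref") or ""
    if not issue or not mitig:
        return None
    prompt = f"In {cloud}, how do you remediate this misconfiguration: {issue}"
    response = f"**Mitigation**: {mitig}"
    if cis:
        response += f"\n\n**CIS Benchmark reference**: {cis}"
    return prompt, response


# Made-up row for illustration; "EXAMPLE-REF-1" is a placeholder, not a real CIS ID.
sample = {
    "provider": "AWS",
    "issue": "S3 bucket allows public read",
    "fix": "Enable Block Public Access on the bucket",
    "cis_ref": "EXAMPLE-REF-1",
}
pair = convert_cloud_misconfig(sample)
print(pair)
```

Returning `None` instead of using `continue` keeps the skip decision with the caller, which makes each schema branch testable without running the full enrichment loop.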