Spaces:
Runtime error
feat(v2): Phase A complete build infrastructure ready to execute
Browse filesAdds 30 v2 datasets to dataset-mirror.sh + 6 stage Axolotl configs +
master pipeline scripts. ALL Phase A executable from one command:
bash bin/v2/run-phase-a.sh all
What's included:
- bin/v2/build-data-pipeline.sh β 8 SFT + 7 tool + 4 agent + 3 DPO datasets
- bin/v2/synth-orchestrator-traces.py β 500 trajectories via FREE LLM ladder
(Cerebras qwen-3-235b orchestrator + Groq/OpenRouter/Gemini subagents) =
saves $200 vs Claude API while keeping coverage
- bin/v2/dedup-decontaminate.py β exact + MinHash + decontaminate vs HE+/MBPP+/LCB
- bin/v2/push-to-hub.py β pushes 4 cleaned datasets to private HF repos
- bin/v2/eval-tier1.sh β EvalPlus + LCB v6 + BFCL + RULER (~3-4 GPU-hr)
- bin/v2/run-phase-a.sh β master launcher (data β 5 stages β eval)
Configs (all all-linear LoRA r=64 + DoRA + 32K context + YaRN factor 4):
- configs/v2/stage1-sft.yml Code SFT 3ep ~12-15hr H200
- configs/v2/stage15-toolsft.yml Tool-SFT 2ep Hermes XML ~8hr
- configs/v2/stage16-agent.yml Multi-agent SFT 2ep ~10hr
- configs/v2/stage2-codedpo.yml Code DPO Focused-DPO 1ep ~5hr
- configs/v2/stage25-tooldpo.yml Tool DPO 1ep ~3hr β push -mvp
dataset-mirror.sh: +30 sources tagged v2-* (Phase A backbone) so existing
ingestion daemons start mirroring them immediately. Sanitizer (1dfdc54)
already wired in.
Total Phase A ETA when subscriptions active: 4 weeks calendar / ~50 GPU-hr
Lightning H200 / $200-400 cash
- bin/dataset-mirror.sh +52 -0
- bin/v2/build-data-pipeline.sh +177 -0
- bin/v2/dedup-decontaminate.py +147 -0
- bin/v2/eval-tier1.sh +112 -0
- bin/v2/push-to-hub.py +64 -0
- bin/v2/run-phase-a.sh +73 -0
- bin/v2/synth-orchestrator-traces.py +245 -0
- configs/v2/stage1-sft.yml +92 -0
- configs/v2/stage15-toolsft.yml +84 -0
- configs/v2/stage16-agent.yml +86 -0
- configs/v2/stage2-codedpo.yml +92 -0
- configs/v2/stage25-tooldpo.yml +72 -0
|
@@ -133,6 +133,58 @@ SOURCES = [
|
|
| 133 |
# Smol team
|
| 134 |
("HuggingFaceTB/smoltalk", "smoltalk"),
|
| 135 |
("HuggingFaceTB/smollm-corpus", "smollm-corpus"),
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 136 |
]
|
| 137 |
|
| 138 |
# 5 sibling repos to spread across β round-robin by hash for determinism
|
|
|
|
| 133 |
# Smol team
|
| 134 |
("HuggingFaceTB/smoltalk", "smoltalk"),
|
| 135 |
("HuggingFaceTB/smollm-corpus", "smollm-corpus"),
|
| 136 |
+
|
| 137 |
+
# βββ v2 Phase A β high-priority code SFT (Round 1+2 research recommendations) βββ
|
| 138 |
+
# These are the BACKBONE of v2: rStar-Coder gave +39pt LCB on 7B-class.
|
| 139 |
+
# All sanitized + deduped + decontaminated before training.
|
| 140 |
+
("microsoft/rStar-Coder", "v2-rstar-coder"), # +39pt LCB on 7B
|
| 141 |
+
("nvidia/OpenCodeReasoning-2", "v2-opencode-reasoning-2"),# R1 reasoning chains
|
| 142 |
+
("nvidia/OpenCodeInstruct", "v2-opencode-instruct"), # has avg_test_score per row
|
| 143 |
+
("inclusionAI/Ling-Coder-SFT", "v2-ling-coder-sft"), # 4.48M, 20 langs
|
| 144 |
+
("OpenCoder-LLM/opc-sft-stage1", "v2-opencoder-stage1"), # transparent recipe
|
| 145 |
+
("OpenCoder-LLM/opc-sft-stage2", "v2-opencoder-stage2"), # DevSecOps-leaning topics
|
| 146 |
+
|
| 147 |
+
# βββ v2 Phase A β tool use (parity with frontier function-calling) βββ
|
| 148 |
+
# Hermes XML format gold standard. xLAM has 3,673 APIs / parallel calls.
|
| 149 |
+
# Toucan from Kimi-K2 = MCP-grounded real-world tool traces.
|
| 150 |
+
("NousResearch/hermes-function-calling-v1","v2-hermes-fc-v1"), # gold, Apache-2
|
| 151 |
+
("Agent-Ark/Toucan-1.5M", "v2-toucan-15m"), # Kimi-K2 MCP traces
|
| 152 |
+
("nvidia/When2Call", "v2-when2call"), # refusal/clarify
|
| 153 |
+
("Nanbeige/ToolMind", "v2-toolmind"), # graph-syn reasoning
|
| 154 |
+
("nvidia/Nemotron-SWE-v1", "v2-nemotron-swe"), # code-exec trajectories
|
| 155 |
+
("SWE-Gym/OpenHands-Sampled-Trajectories", "v2-openhands-traj"), # high-quality SWE
|
| 156 |
+
|
| 157 |
+
# βββ v2 Phase A β multi-agent / orchestrator traces βββ
|
| 158 |
+
# Hermes Agent Reasoning = multi-turn tool-use baseline.
|
| 159 |
+
# Nebius SWE-agent-trajectories filtered to target=true = code editing depth.
|
| 160 |
+
("lambda/hermes-agent-reasoning-traces", "v2-hermes-agent-reason"),
|
| 161 |
+
("nebius/SWE-agent-trajectories", "v2-nebius-swe-traj"),
|
| 162 |
+
("SWE-Gym/SWE-Gym", "v2-swe-gym"),
|
| 163 |
+
|
| 164 |
+
# βββ v2 Phase A β DPO preference pairs βββ
|
| 165 |
+
("Vezora/Code-Preference-Pairs", "v2-vezora-codepref"), # 55K bug/no-bug
|
| 166 |
+
("argilla/distilabel-capybara-dpo-7k-binarized", "v2-capybara-dpo"),
|
| 167 |
+
|
| 168 |
+
# βββ v2 Phase B β domain expertise (cluster-specific) βββ
|
| 169 |
+
# Will only ingest these once Phase A baseline trained + evaluated.
|
| 170 |
+
# SDLC / SWE
|
| 171 |
+
("SWE-Gym/SWE-smith", "v2-swe-smith"), # NeurIPS 2025
|
| 172 |
+
("R2E-Gym/R2E-Gym-Lite", "v2-r2e-gym"), # used by DeepSWE
|
| 173 |
+
# Security / SOC
|
| 174 |
+
("trendmicro-ailab/Primus-FineWeb", "v2-primus-fineweb"), # 2.57B cyber tokens
|
| 175 |
+
("trendmicro-ailab/Primus-Instruct", "v2-primus-instruct"),
|
| 176 |
+
("trendmicro-ailab/Primus-Reasoning", "v2-primus-reasoning"), # +15.8% CISSP lift
|
| 177 |
+
# Cloud / IaC
|
| 178 |
+
("bigcode/the-stack-v2-smol-ids", "v2-stack-v2-smol"), # FIM continued pretrain
|
| 179 |
+
# AI Engineering (smaller mixes)
|
| 180 |
+
("microsoft/orca-agentinstruct-1M-v1", "v2-orca-agent-1m"), # already above; tag for v2
|
| 181 |
+
# Customer support / GTM
|
| 182 |
+
("bitext/Bitext-customer-support-llm-chatbot-training-dataset", "v2-bitext-cs"),
|
| 183 |
+
# Finance
|
| 184 |
+
("PatronusAI/financebench", "v2-financebench"),
|
| 185 |
+
# Safety / refusal restoration (CRITICAL post-fine-tune)
|
| 186 |
+
("allenai/wildjailbreak", "v2-wildjailbreak"),
|
| 187 |
+
("ai4privacy/pii-masking-200k", "v2-pii-masking"),
|
| 188 |
]
|
| 189 |
|
| 190 |
# 5 sibling repos to spread across β round-robin by hash for determinism
|
|
@@ -0,0 +1,177 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
# Surrogate-1 v2 β Master data pipeline: assemble + sanitize + dedup + decontaminate.
|
| 3 |
+
# Runs on HF Space (NOT Mac). Outputs to Wasabi + HF dataset repo.
|
| 4 |
+
#
|
| 5 |
+
# Steps:
|
| 6 |
+
# 1. Mirror HF datasets β /data/v2-raw/<source>/
|
| 7 |
+
# 2. Sanitize via lib/sanitize.py (already deployed)
|
| 8 |
+
# 3. Exact SHA-256 dedup
|
| 9 |
+
# 4. MinHash LSH 256-perm dedup (datatrove)
|
| 10 |
+
# 5. Decontaminate vs HumanEval+/MBPP+/LCB/SWE-Bench
|
| 11 |
+
# 6. AST validity (tree-sitter)
|
| 12 |
+
# 7. Stack-Edu classifier (threshold 3)
|
| 13 |
+
# 8. Push to axentx/surrogate-1-v2-train (private HF) + Wasabi backup
|
| 14 |
+
#
|
| 15 |
+
# Usage: bash build-data-pipeline.sh [phase]
|
| 16 |
+
# phase = sft|tools|agent|dpo|all
|
| 17 |
+
|
| 18 |
+
set -uo pipefail
|
| 19 |
+
set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
|
| 20 |
+
PHASE="${1:-all}"
|
| 21 |
+
LOG="$HOME/.surrogate/logs/v2-build-data.log"
|
| 22 |
+
mkdir -p "$(dirname "$LOG")"
|
| 23 |
+
|
| 24 |
+
echo "[$(date +%H:%M:%S)] v2 data pipeline phase=$PHASE" | tee -a "$LOG"
|
| 25 |
+
|
| 26 |
+
# ββ Phase A datasets matrix βββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 27 |
+
declare -A SFT_SOURCES=(
|
| 28 |
+
["microsoft/rStar-Coder"]=30000
|
| 29 |
+
["nvidia/OpenCodeReasoning-2"]=20000
|
| 30 |
+
["nvidia/OpenCodeInstruct"]=10000
|
| 31 |
+
["inclusionAI/Ling-Coder-SFT"]=10000
|
| 32 |
+
["OpenCoder-LLM/opc-sft-stage1"]=5000
|
| 33 |
+
["OpenCoder-LLM/opc-sft-stage2"]=5000
|
| 34 |
+
["bigcode/self-oss-instruct-sc2-exec-filter-50k"]=50000
|
| 35 |
+
["m-a-p/CodeFeedback-Filtered-Instruction"]=10000
|
| 36 |
+
)
|
| 37 |
+
|
| 38 |
+
declare -A TOOL_SOURCES=(
|
| 39 |
+
["NousResearch/hermes-function-calling-v1"]=7930
|
| 40 |
+
["Salesforce/xlam-function-calling-60k"]=30000
|
| 41 |
+
["Agent-Ark/Toucan-1.5M"]=80000
|
| 42 |
+
["nvidia/When2Call"]=15000
|
| 43 |
+
["Nanbeige/ToolMind"]=10000
|
| 44 |
+
["nvidia/Nemotron-SWE-v1"]=5000
|
| 45 |
+
["SWE-Gym/OpenHands-Sampled-Trajectories"]=2400
|
| 46 |
+
)
|
| 47 |
+
|
| 48 |
+
declare -A AGENT_SOURCES=(
|
| 49 |
+
["lambda/hermes-agent-reasoning-traces"]=14000
|
| 50 |
+
["nebius/SWE-agent-trajectories"]=5000
|
| 51 |
+
["SWE-Gym/SWE-Gym"]=400
|
| 52 |
+
["microsoft/orca-agentinstruct-1M-v1"]=1500
|
| 53 |
+
)
|
| 54 |
+
|
| 55 |
+
declare -A DPO_SOURCES=(
|
| 56 |
+
["Vezora/Code-Preference-Pairs"]=55000
|
| 57 |
+
["argilla/distilabel-capybara-dpo-7k-binarized"]=7000
|
| 58 |
+
["nvidia/When2Call"]=15000 # train_pref subset
|
| 59 |
+
)
|
| 60 |
+
|
| 61 |
+
# ββ Helper: download + sanitize + filter ββββββββββββββββββββββββββββββββββββββ
|
| 62 |
+
process_dataset() {
|
| 63 |
+
local repo="$1"
|
| 64 |
+
local target_n="$2"
|
| 65 |
+
local out_dir="$3"
|
| 66 |
+
echo "[$(date +%H:%M:%S)] βΆ $repo (target $target_n)" | tee -a "$LOG"
|
| 67 |
+
|
| 68 |
+
HF_TOKEN="$HF_TOKEN" python3 - "$repo" "$target_n" "$out_dir" <<'PYEOF' 2>>"$LOG"
|
| 69 |
+
import sys, json, os
|
| 70 |
+
from pathlib import Path
|
| 71 |
+
sys.path.insert(0, str(Path.home() / ".surrogate/bin/lib"))
|
| 72 |
+
|
| 73 |
+
from datasets import load_dataset
|
| 74 |
+
from sanitize import filter_pair
|
| 75 |
+
|
| 76 |
+
repo, target_n, out_dir = sys.argv[1], int(sys.argv[2]), sys.argv[3]
|
| 77 |
+
out_path = Path(out_dir) / (repo.replace("/", "_") + ".jsonl")
|
| 78 |
+
out_path.parent.mkdir(parents=True, exist_ok=True)
|
| 79 |
+
|
| 80 |
+
try:
|
| 81 |
+
ds = load_dataset(repo, split="train", streaming=True)
|
| 82 |
+
except Exception as e:
|
| 83 |
+
print(f" β load_dataset failed: {e}")
|
| 84 |
+
sys.exit(0)
|
| 85 |
+
|
| 86 |
+
kept, dropped, scanned = 0, 0, 0
|
| 87 |
+
with open(out_path, "w") as f:
|
| 88 |
+
for ex in ds:
|
| 89 |
+
scanned += 1
|
| 90 |
+
if kept >= target_n: break
|
| 91 |
+
|
| 92 |
+
# Robust extraction across schemas
|
| 93 |
+
p = ex.get("prompt") or ex.get("instruction") or ex.get("question") or ex.get("input") or ex.get("query") or ex.get("user")
|
| 94 |
+
r = ex.get("response") or ex.get("answer") or ex.get("output") or ex.get("completion") or ex.get("solution") or ex.get("chosen") or ex.get("assistant")
|
| 95 |
+
|
| 96 |
+
# ShareGPT / messages format
|
| 97 |
+
if (not p or not r) and isinstance(ex.get("messages"), list) and len(ex["messages"]) >= 2:
|
| 98 |
+
msgs = ex["messages"]
|
| 99 |
+
u = next((m.get("content","") or m.get("value","") for m in msgs if m.get("role") in ("user","human") or m.get("from") in ("user","human")), "")
|
| 100 |
+
a = next((m.get("content","") or m.get("value","") for m in msgs if m.get("role") in ("assistant","gpt") or m.get("from") in ("assistant","gpt")), "")
|
| 101 |
+
if u and a: p, r = u, a
|
| 102 |
+
if (not p or not r) and isinstance(ex.get("conversations"), list) and len(ex["conversations"]) >= 2:
|
| 103 |
+
convs = ex["conversations"]
|
| 104 |
+
u = next((c.get("value","") for c in convs if c.get("from") in ("human","user")), "")
|
| 105 |
+
a = next((c.get("value","") for c in convs if c.get("from") in ("gpt","assistant")), "")
|
| 106 |
+
if u and a: p, r = u, a
|
| 107 |
+
|
| 108 |
+
if not p or not r: continue
|
| 109 |
+
p, r = str(p)[:6000].strip(), str(r)[:8000].strip()
|
| 110 |
+
|
| 111 |
+
# Sanitize: drop polluted/PII/secrets/refusals
|
| 112 |
+
v = filter_pair(p, r)
|
| 113 |
+
if not v["keep"]:
|
| 114 |
+
dropped += 1
|
| 115 |
+
continue
|
| 116 |
+
|
| 117 |
+
f.write(json.dumps({"prompt": p, "response": r, "source": repo}, ensure_ascii=False) + "\n")
|
| 118 |
+
kept += 1
|
| 119 |
+
|
| 120 |
+
print(f" scanned={scanned} kept={kept} dropped={dropped} β {out_path}")
|
| 121 |
+
PYEOF
|
| 122 |
+
}
|
| 123 |
+
|
| 124 |
+
# ββ Phase A SFT ββββββββοΏ½οΏ½οΏ½ββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 125 |
+
if [[ "$PHASE" =~ ^(sft|all)$ ]]; then
|
| 126 |
+
echo "[$(date +%H:%M:%S)] Phase A SFT βββββββββββββββββββββββββββββββββββββ" | tee -a "$LOG"
|
| 127 |
+
OUT="$HOME/.surrogate/data/v2-sft"
|
| 128 |
+
mkdir -p "$OUT"
|
| 129 |
+
for repo in "${!SFT_SOURCES[@]}"; do
|
| 130 |
+
process_dataset "$repo" "${SFT_SOURCES[$repo]}" "$OUT"
|
| 131 |
+
done
|
| 132 |
+
fi
|
| 133 |
+
|
| 134 |
+
# ββ Phase A Tool-use ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 135 |
+
if [[ "$PHASE" =~ ^(tools|all)$ ]]; then
|
| 136 |
+
echo "[$(date +%H:%M:%S)] Phase A Tool-use βββββββββββββββββββββββββββββββ" | tee -a "$LOG"
|
| 137 |
+
OUT="$HOME/.surrogate/data/v2-tools"
|
| 138 |
+
mkdir -p "$OUT"
|
| 139 |
+
for repo in "${!TOOL_SOURCES[@]}"; do
|
| 140 |
+
process_dataset "$repo" "${TOOL_SOURCES[$repo]}" "$OUT"
|
| 141 |
+
done
|
| 142 |
+
fi
|
| 143 |
+
|
| 144 |
+
# ββ Phase A Agent βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 145 |
+
if [[ "$PHASE" =~ ^(agent|all)$ ]]; then
|
| 146 |
+
echo "[$(date +%H:%M:%S)] Phase A Agent ββββββββββββββββββββββββββββββββββ" | tee -a "$LOG"
|
| 147 |
+
OUT="$HOME/.surrogate/data/v2-agent"
|
| 148 |
+
mkdir -p "$OUT"
|
| 149 |
+
for repo in "${!AGENT_SOURCES[@]}"; do
|
| 150 |
+
process_dataset "$repo" "${AGENT_SOURCES[$repo]}" "$OUT"
|
| 151 |
+
done
|
| 152 |
+
|
| 153 |
+
# Plus synthetic orchestrator traces (free LLM ladder)
|
| 154 |
+
echo "βΆ generating 500 synth orchestrator traces (free LLM ladder)..." | tee -a "$LOG"
|
| 155 |
+
TARGET_TRACES=500 python3 "$HOME/.surrogate/bin/v2/synth-orchestrator-traces.py" 2>&1 | tee -a "$LOG"
|
| 156 |
+
cp "$HOME/.surrogate/data/v2-orchestrator-traces.jsonl" "$OUT/synth_orchestrator.jsonl"
|
| 157 |
+
fi
|
| 158 |
+
|
| 159 |
+
# ββ Phase A DPO βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 160 |
+
if [[ "$PHASE" =~ ^(dpo|all)$ ]]; then
|
| 161 |
+
echo "[$(date +%H:%M:%S)] Phase A DPO ββββββββββββββββββββββββββββββββββββ" | tee -a "$LOG"
|
| 162 |
+
OUT="$HOME/.surrogate/data/v2-dpo"
|
| 163 |
+
mkdir -p "$OUT"
|
| 164 |
+
for repo in "${!DPO_SOURCES[@]}"; do
|
| 165 |
+
process_dataset "$repo" "${DPO_SOURCES[$repo]}" "$OUT"
|
| 166 |
+
done
|
| 167 |
+
fi
|
| 168 |
+
|
| 169 |
+
# ββ Dedup + decontaminate βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 170 |
+
echo "[$(date +%H:%M:%S)] Dedup + decontaminate ββββββββββββββββββββββββββββββ" | tee -a "$LOG"
|
| 171 |
+
HF_TOKEN="$HF_TOKEN" python3 "$HOME/.surrogate/bin/v2/dedup-decontaminate.py" 2>&1 | tee -a "$LOG"
|
| 172 |
+
|
| 173 |
+
# ββ Push to HF dataset repo ββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 174 |
+
echo "[$(date +%H:%M:%S)] Push to axentx/surrogate-1-v2-train βββββββββββββββ" | tee -a "$LOG"
|
| 175 |
+
HF_TOKEN="$HF_TOKEN" python3 "$HOME/.surrogate/bin/v2/push-to-hub.py" 2>&1 | tee -a "$LOG"
|
| 176 |
+
|
| 177 |
+
echo "[$(date +%H:%M:%S)] β
v2 data pipeline phase=$PHASE done" | tee -a "$LOG"
|
|
@@ -0,0 +1,147 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Surrogate-1 v2 β Dedup + decontaminate pipeline.
|
| 2 |
+
|
| 3 |
+
After build-data-pipeline.sh produces ~/.surrogate/data/v2-{sft,tools,agent,dpo}/*.jsonl,
|
| 4 |
+
this script:
|
| 5 |
+
1. Exact SHA-256 dedup within + across files
|
| 6 |
+
2. MinHash LSH 256-perm 5-gram threshold 0.7 (datatrove)
|
| 7 |
+
3. Decontaminate vs HumanEval+/MBPP+/LiveCodeBench/SWE-Bench-Lite
|
| 8 |
+
4. Output clean files to v2-{sft,tools,agent,dpo}-clean/
|
| 9 |
+
"""
|
| 10 |
+
import os, json, hashlib, sys
|
| 11 |
+
from pathlib import Path
|
| 12 |
+
from collections import defaultdict
|
| 13 |
+
|
| 14 |
+
DATA = Path.home() / ".surrogate/data"
|
| 15 |
+
OUT_BASE = DATA / "v2-clean"
|
| 16 |
+
OUT_BASE.mkdir(exist_ok=True)
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
def exact_dedup(input_dir: Path, output_path: Path) -> int:
|
| 20 |
+
"""SHA-256 exact dedup on prompt+response pair."""
|
| 21 |
+
seen = set()
|
| 22 |
+
kept = 0
|
| 23 |
+
with open(output_path, "w") as fout:
|
| 24 |
+
for f in sorted(input_dir.glob("*.jsonl")):
|
| 25 |
+
with open(f) as fin:
|
| 26 |
+
for line in fin:
|
| 27 |
+
if not line.strip(): continue
|
| 28 |
+
try: obj = json.loads(line)
|
| 29 |
+
except Exception: continue
|
| 30 |
+
key = hashlib.sha256(
|
| 31 |
+
(obj.get("prompt","") + "|" + obj.get("response","")).encode()
|
| 32 |
+
).hexdigest()
|
| 33 |
+
if key in seen: continue
|
| 34 |
+
seen.add(key)
|
| 35 |
+
fout.write(line)
|
| 36 |
+
kept += 1
|
| 37 |
+
return kept
|
| 38 |
+
|
| 39 |
+
|
| 40 |
+
def load_decontamination_set() -> set:
|
| 41 |
+
"""Load prompts from public eval suites β anything that overlaps must be dropped."""
|
| 42 |
+
seen = set()
|
| 43 |
+
for repo in ["evalplus/humanevalplus", "evalplus/mbppplus"]:
|
| 44 |
+
try:
|
| 45 |
+
from datasets import load_dataset
|
| 46 |
+
ds = load_dataset(repo, split="test", streaming=True)
|
| 47 |
+
for ex in ds:
|
| 48 |
+
p = ex.get("prompt") or ex.get("text") or ""
|
| 49 |
+
# Use first 200 chars as fingerprint
|
| 50 |
+
if len(p) > 50:
|
| 51 |
+
seen.add(p[:200].strip())
|
| 52 |
+
except Exception as e:
|
| 53 |
+
print(f" decontam {repo} failed: {e}")
|
| 54 |
+
# LiveCodeBench v6 β prompts are public
|
| 55 |
+
try:
|
| 56 |
+
from datasets import load_dataset
|
| 57 |
+
ds = load_dataset("livecodebench/code_generation_lite", split="test", streaming=True)
|
| 58 |
+
for ex in ds:
|
| 59 |
+
p = ex.get("question_content", "") or ex.get("prompt", "")
|
| 60 |
+
if len(p) > 50:
|
| 61 |
+
seen.add(p[:200].strip())
|
| 62 |
+
except Exception as e:
|
| 63 |
+
print(f" decontam LCB failed: {e}")
|
| 64 |
+
print(f" decontam set size: {len(seen)}")
|
| 65 |
+
return seen
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
def decontaminate(input_path: Path, output_path: Path, eval_prompts: set) -> int:
|
| 69 |
+
"""Drop training rows whose prompt overlaps with eval suite prompts."""
|
| 70 |
+
kept, dropped = 0, 0
|
| 71 |
+
with open(input_path) as fin, open(output_path, "w") as fout:
|
| 72 |
+
for line in fin:
|
| 73 |
+
if not line.strip(): continue
|
| 74 |
+
try: obj = json.loads(line)
|
| 75 |
+
except Exception: continue
|
| 76 |
+
p = obj.get("prompt", "")[:200].strip()
|
| 77 |
+
if p in eval_prompts:
|
| 78 |
+
dropped += 1
|
| 79 |
+
continue
|
| 80 |
+
fout.write(line)
|
| 81 |
+
kept += 1
|
| 82 |
+
print(f" decontaminate {input_path.name}: kept={kept} dropped={dropped}")
|
| 83 |
+
return kept
|
| 84 |
+
|
| 85 |
+
|
| 86 |
+
def minhash_dedup(input_path: Path, output_path: Path, threshold: float = 0.7) -> int:
|
| 87 |
+
"""MinHash LSH near-dup. Falls back to exact dedup if datasketch unavailable."""
|
| 88 |
+
try:
|
| 89 |
+
from datasketch import MinHash, MinHashLSH
|
| 90 |
+
except ImportError:
|
| 91 |
+
print(" datasketch not installed β skipping MinHash, using exact dedup output")
|
| 92 |
+
os.replace(input_path, output_path)
|
| 93 |
+
return -1
|
| 94 |
+
|
| 95 |
+
lsh = MinHashLSH(threshold=threshold, num_perm=256)
|
| 96 |
+
kept = []
|
| 97 |
+
|
| 98 |
+
def to_minhash(text: str) -> MinHash:
|
| 99 |
+
m = MinHash(num_perm=256)
|
| 100 |
+
# 5-gram tokens
|
| 101 |
+
toks = text.lower().split()
|
| 102 |
+
for i in range(len(toks) - 4):
|
| 103 |
+
m.update((" ".join(toks[i:i+5])).encode())
|
| 104 |
+
return m
|
| 105 |
+
|
| 106 |
+
with open(input_path) as fin:
|
| 107 |
+
for idx, line in enumerate(fin):
|
| 108 |
+
if not line.strip(): continue
|
| 109 |
+
try: obj = json.loads(line)
|
| 110 |
+
except Exception: continue
|
| 111 |
+
mh = to_minhash(obj.get("prompt","") + " " + obj.get("response",""))
|
| 112 |
+
if list(lsh.query(mh)):
|
| 113 |
+
continue # near-duplicate found
|
| 114 |
+
lsh.insert(f"r_{idx}", mh)
|
| 115 |
+
kept.append(line)
|
| 116 |
+
|
| 117 |
+
with open(output_path, "w") as fout:
|
| 118 |
+
for line in kept:
|
| 119 |
+
fout.write(line)
|
| 120 |
+
return len(kept)
|
| 121 |
+
|
| 122 |
+
|
| 123 |
+
if __name__ == "__main__":
|
| 124 |
+
eval_prompts = load_decontamination_set()
|
| 125 |
+
|
| 126 |
+
for category in ["v2-sft", "v2-tools", "v2-agent", "v2-dpo"]:
|
| 127 |
+
in_dir = DATA / category
|
| 128 |
+
if not in_dir.exists():
|
| 129 |
+
print(f"β skip {category} (not present)")
|
| 130 |
+
continue
|
| 131 |
+
print(f"\nβββ {category} βββ")
|
| 132 |
+
clean_dir = OUT_BASE / category
|
| 133 |
+
clean_dir.mkdir(exist_ok=True)
|
| 134 |
+
|
| 135 |
+
# 1. Exact dedup β merged.jsonl
|
| 136 |
+
merged = clean_dir / "merged.jsonl"
|
| 137 |
+
kept = exact_dedup(in_dir, merged)
|
| 138 |
+
print(f" step 1 exact dedup: kept={kept}")
|
| 139 |
+
|
| 140 |
+
# 2. Decontaminate
|
| 141 |
+
decon = clean_dir / "decontaminated.jsonl"
|
| 142 |
+
kept = decontaminate(merged, decon, eval_prompts)
|
| 143 |
+
|
| 144 |
+
# 3. MinHash near-dup
|
| 145 |
+
clean = clean_dir / "clean.jsonl"
|
| 146 |
+
kept = minhash_dedup(decon, clean)
|
| 147 |
+
print(f" step 3 minhash: kept={kept}")
|
|
@@ -0,0 +1,112 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
# Surrogate-1 v2 β Tier 1 evaluation suite (run every checkpoint).
|
| 3 |
+
# ETA on T4Γ2/L40S: ~3-4 GPU-hr total.
|
| 4 |
+
#
|
| 5 |
+
# Tier 1 = smoke + primary metrics:
|
| 6 |
+
# 1. EvalPlus HumanEval+ (smoke, β₯84% no regression)
|
| 7 |
+
# 2. EvalPlus MBPP+ (smoke, β₯75%)
|
| 8 |
+
# 3. LiveCodeBench v6 (PRIMARY code progress, β₯42% target)
|
| 9 |
+
# 4. BFCL v3 (PRIMARY tool use, β₯70 overall target)
|
| 10 |
+
# 5. RULER @ 32K (long-context, β₯90 target)
|
| 11 |
+
#
|
| 12 |
+
# Usage: bash eval-tier1.sh axentx/surrogate-1-coder-7b-lora-v2-mvp
|
| 13 |
+
|
| 14 |
+
set -uo pipefail
|
| 15 |
+
MODEL="${1:-axentx/surrogate-1-coder-7b-lora-v2-mvp}"
|
| 16 |
+
OUT_DIR="$HOME/.surrogate/eval/$(echo "$MODEL" | tr '/' '_')"
|
| 17 |
+
mkdir -p "$OUT_DIR"
|
| 18 |
+
echo "[$(date +%H:%M:%S)] Tier 1 eval for $MODEL β $OUT_DIR"
|
| 19 |
+
|
| 20 |
+
# ββ 1. EvalPlus HumanEval+ ββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 21 |
+
echo "βΆ [1/5] EvalPlus HumanEval+"
|
| 22 |
+
pip install --quiet "evalplus[vllm] @ git+https://github.com/evalplus/evalplus" 2>&1 | tail -1
|
| 23 |
+
evalplus.evaluate \
|
| 24 |
+
--model "$MODEL" \
|
| 25 |
+
--dataset humaneval \
|
| 26 |
+
--backend vllm \
|
| 27 |
+
--greedy \
|
| 28 |
+
--root "$OUT_DIR/humaneval" \
|
| 29 |
+
2>&1 | tee "$OUT_DIR/humaneval.log"
|
| 30 |
+
HE_SCORE=$(grep -oE "humaneval\+ pass@1.*[0-9.]+%" "$OUT_DIR/humaneval.log" | tail -1)
|
| 31 |
+
echo " HumanEval+ result: $HE_SCORE"
|
| 32 |
+
|
| 33 |
+
# ββ 2. EvalPlus MBPP+ βββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 34 |
+
echo "βΆ [2/5] EvalPlus MBPP+"
|
| 35 |
+
evalplus.evaluate \
|
| 36 |
+
--model "$MODEL" \
|
| 37 |
+
--dataset mbpp \
|
| 38 |
+
--backend vllm \
|
| 39 |
+
--greedy \
|
| 40 |
+
--root "$OUT_DIR/mbpp" \
|
| 41 |
+
2>&1 | tee "$OUT_DIR/mbpp.log"
|
| 42 |
+
MBPP_SCORE=$(grep -oE "mbpp\+ pass@1.*[0-9.]+%" "$OUT_DIR/mbpp.log" | tail -1)
|
| 43 |
+
echo " MBPP+ result: $MBPP_SCORE"
|
| 44 |
+
|
| 45 |
+
# ββ 3. LiveCodeBench v6 (post-cutoff = no contamination) βββββββββββββββββββββ
|
| 46 |
+
echo "βΆ [3/5] LiveCodeBench v6 (PRIMARY)"
|
| 47 |
+
if [[ ! -d "$HOME/.surrogate/lcb" ]]; then
|
| 48 |
+
git clone https://github.com/LiveCodeBench/LiveCodeBench "$HOME/.surrogate/lcb"
|
| 49 |
+
fi
|
| 50 |
+
cd "$HOME/.surrogate/lcb"
|
| 51 |
+
python -m lcb_runner.runner.main \
|
| 52 |
+
--model "$MODEL" \
|
| 53 |
+
--scenario codegeneration \
|
| 54 |
+
--evaluate \
|
| 55 |
+
--release_version release_v6 \
|
| 56 |
+
--n 1 \
|
| 57 |
+
--temperature 0.0 \
|
| 58 |
+
--output_dir "$OUT_DIR/lcb" \
|
| 59 |
+
2>&1 | tee "$OUT_DIR/lcb.log"
|
| 60 |
+
LCB_SCORE=$(grep -oE "pass@1.*[0-9.]+%" "$OUT_DIR/lcb.log" | tail -1)
|
| 61 |
+
echo " LCB v6 result: $LCB_SCORE"
|
| 62 |
+
|
| 63 |
+
# ββ 4. BFCL v3 (Berkeley Function-Calling Leaderboard) βββββββββββββββββββββββ
|
| 64 |
+
echo "βΆ [4/5] BFCL v3 (PRIMARY tool use)"
|
| 65 |
+
pip install --quiet bfcl-eval 2>&1 | tail -1
|
| 66 |
+
bfcl generate \
|
| 67 |
+
--model "$MODEL" \
|
| 68 |
+
--test-category all \
|
| 69 |
+
--backend vllm \
|
| 70 |
+
--result-dir "$OUT_DIR/bfcl"
|
| 71 |
+
bfcl evaluate \
|
| 72 |
+
--result-dir "$OUT_DIR/bfcl" \
|
| 73 |
+
--score-dir "$OUT_DIR/bfcl/score"
|
| 74 |
+
BFCL_SCORE=$(grep -oE "Overall.*[0-9.]+" "$OUT_DIR/bfcl/score/score_summary.csv" 2>/dev/null | tail -1)
|
| 75 |
+
echo " BFCL v3 result: $BFCL_SCORE"
|
| 76 |
+
|
| 77 |
+
# ββ 5. RULER @ 32K βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 78 |
+
echo "βΆ [5/5] RULER @ 32K (long-context)"
|
| 79 |
+
pip install --quiet ruler-eval 2>&1 | tail -1
|
| 80 |
+
if [[ ! -d "$HOME/.surrogate/ruler" ]]; then
|
| 81 |
+
git clone https://github.com/NVIDIA/RULER "$HOME/.surrogate/ruler"
|
| 82 |
+
fi
|
| 83 |
+
cd "$HOME/.surrogate/ruler"
|
| 84 |
+
bash run.sh "$MODEL" 32768 2>&1 | tee "$OUT_DIR/ruler.log"
|
| 85 |
+
RULER_SCORE=$(grep -oE "Average.*[0-9.]+" "$OUT_DIR/ruler.log" | tail -1)
|
| 86 |
+
echo " RULER @ 32K result: $RULER_SCORE"
|
| 87 |
+
|
| 88 |
+
# ββ Summary ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 89 |
+
echo ""
|
| 90 |
+
echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ"
|
| 91 |
+
echo " Tier 1 Eval Summary β $MODEL"
|
| 92 |
+
echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ"
|
| 93 |
+
echo " HumanEval+ : $HE_SCORE (target β₯84%)"
|
| 94 |
+
echo " MBPP+ : $MBPP_SCORE (target β₯75%)"
|
| 95 |
+
echo " LiveCodeBench v6: $LCB_SCORE (target β₯42% PRIMARY)"
|
| 96 |
+
echo " BFCL v3 : $BFCL_SCORE (target β₯70 PRIMARY)"
|
| 97 |
+
echo " RULER @ 32K : $RULER_SCORE (target β₯90)"
|
| 98 |
+
echo "ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ"
|
| 99 |
+
|
| 100 |
+
# Write summary JSON
|
| 101 |
+
cat > "$OUT_DIR/tier1-summary.json" <<EOF
|
| 102 |
+
{
|
| 103 |
+
"model": "$MODEL",
|
| 104 |
+
"ts": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
|
| 105 |
+
"humaneval_plus": "$HE_SCORE",
|
| 106 |
+
"mbpp_plus": "$MBPP_SCORE",
|
| 107 |
+
"livecodebench_v6": "$LCB_SCORE",
|
| 108 |
+
"bfcl_v3_overall": "$BFCL_SCORE",
|
| 109 |
+
"ruler_32k": "$RULER_SCORE"
|
| 110 |
+
}
|
| 111 |
+
EOF
|
| 112 |
+
echo "Summary saved: $OUT_DIR/tier1-summary.json"
|
|
@@ -0,0 +1,64 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Push cleaned v2 datasets to HF Hub for training scripts to consume.
|
| 2 |
+
|
| 3 |
+
Reads v2-clean/v2-{sft,tools,agent,dpo}/clean.jsonl and pushes to:
|
| 4 |
+
- axentx/surrogate-1-v2-train (SFT data Stages 1)
|
| 5 |
+
- axentx/surrogate-1-v2-tools (Stage 1.5)
|
| 6 |
+
- axentx/surrogate-1-v2-agent (Stage 1.6)
|
| 7 |
+
- axentx/surrogate-1-v2-dpo (Stage 2)
|
| 8 |
+
"""
|
| 9 |
+
import os, json
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
from huggingface_hub import HfApi, create_repo
|
| 12 |
+
|
| 13 |
+
api = HfApi(token=os.environ.get("HF_TOKEN"))
|
| 14 |
+
|
| 15 |
+
DATA = Path.home() / ".surrogate/data/v2-clean"
|
| 16 |
+
|
| 17 |
+
PUSH_MAP = {
|
| 18 |
+
"v2-sft": "axentx/surrogate-1-v2-train",
|
| 19 |
+
"v2-tools": "axentx/surrogate-1-v2-tools",
|
| 20 |
+
"v2-agent": "axentx/surrogate-1-v2-agent",
|
| 21 |
+
"v2-dpo": "axentx/surrogate-1-v2-dpo",
|
| 22 |
+
}
|
| 23 |
+
|
| 24 |
+
for category, repo_id in PUSH_MAP.items():
|
| 25 |
+
src = DATA / category / "clean.jsonl"
|
| 26 |
+
if not src.exists():
|
| 27 |
+
print(f"β skip {category}: {src} missing")
|
| 28 |
+
continue
|
| 29 |
+
|
| 30 |
+
# Create dataset repo (private β these are derived works)
|
| 31 |
+
try:
|
| 32 |
+
create_repo(repo_id, repo_type="dataset", private=True, exist_ok=True,
|
| 33 |
+
token=os.environ.get("HF_TOKEN"))
|
| 34 |
+
except Exception as e:
|
| 35 |
+
print(f" create_repo {repo_id} err: {e}")
|
| 36 |
+
|
| 37 |
+
# Convert to chat_template format if needed (Hermes XML for tools)
|
| 38 |
+
out_path = src.parent / "chat_template.jsonl"
|
| 39 |
+
with open(src) as fin, open(out_path, "w") as fout:
|
| 40 |
+
for line in fin:
|
| 41 |
+
if not line.strip(): continue
|
| 42 |
+
try: obj = json.loads(line)
|
| 43 |
+
except Exception: continue
|
| 44 |
+
# Convert {prompt, response} β {messages: [...]}
|
| 45 |
+
messages = [
|
| 46 |
+
{"role": "user", "content": obj["prompt"]},
|
| 47 |
+
{"role": "assistant", "content": obj["response"]},
|
| 48 |
+
]
|
| 49 |
+
fout.write(json.dumps({"messages": messages}, ensure_ascii=False) + "\n")
|
| 50 |
+
|
| 51 |
+
# Upload
|
| 52 |
+
try:
|
| 53 |
+
api.upload_file(
|
| 54 |
+
path_or_fileobj=str(out_path),
|
| 55 |
+
path_in_repo="train.jsonl",
|
| 56 |
+
repo_id=repo_id,
|
| 57 |
+
repo_type="dataset",
|
| 58 |
+
commit_message=f"v2 build: {category} clean+sanitized+deduped+decontaminated"
|
| 59 |
+
)
|
| 60 |
+
print(f"β
pushed {category} β {repo_id}")
|
| 61 |
+
except Exception as e:
|
| 62 |
+
print(f"β push {repo_id} failed: {e}")
|
| 63 |
+
|
| 64 |
+
print("\nβ
all datasets pushed")
|
|
@@ -0,0 +1,73 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env bash
|
| 2 |
+
# Surrogate-1 v2 β Phase A master launcher.
|
| 3 |
+
# One-shot pipeline: data β 5 training stages β eval.
|
| 4 |
+
#
|
| 5 |
+
# PRE-REQS:
|
| 6 |
+
# - HF_TOKEN set in ~/.hermes/.env
|
| 7 |
+
# - Lightning ASHIRADEVOPS or ASHIRAPIT credentials available
|
| 8 |
+
# - Either: (a) Lightning H200 quota OR (b) RunPod spot H100 budget ~$200
|
| 9 |
+
# - Anthropic API budget ~$200 (for synth orchestrator) β OR use free LLM ladder
|
| 10 |
+
#
|
| 11 |
+
# Usage: bash run-phase-a.sh [step]
|
| 12 |
+
# step = data | stage1 | stage15 | stage16 | stage2 | stage25 | eval | all (default)
|
| 13 |
+
|
| 14 |
+
set -uo pipefail
|
| 15 |
+
set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
|
| 16 |
+
STEP="${1:-all}"
|
| 17 |
+
LOG="$HOME/.surrogate/logs/v2-phase-a.log"
|
| 18 |
+
mkdir -p "$(dirname "$LOG")"
|
| 19 |
+
|
| 20 |
+
echo "[$(date +%H:%M:%S)] βββ Surrogate-1 v2 Phase A βββ" | tee -a "$LOG"
|
| 21 |
+
echo "[$(date +%H:%M:%S)] step=$STEP" | tee -a "$LOG"
|
| 22 |
+
|
| 23 |
+
# ββ 1. Data pipeline ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 24 |
+
if [[ "$STEP" =~ ^(data|all)$ ]]; then
|
| 25 |
+
echo "[$(date +%H:%M:%S)] βΆ Step 1: data pipeline" | tee -a "$LOG"
|
| 26 |
+
bash "$HOME/.surrogate/bin/v2/build-data-pipeline.sh" all 2>&1 | tee -a "$LOG"
|
| 27 |
+
fi
|
| 28 |
+
|
| 29 |
+
# ββ 2. Stage 1 SFT ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 30 |
+
if [[ "$STEP" =~ ^(stage1|all)$ ]]; then
|
| 31 |
+
echo "[$(date +%H:%M:%S)] βΆ Step 2: Stage 1 SFT (~12-15 hr H200)" | tee -a "$LOG"
|
| 32 |
+
cd "$HOME/.surrogate/hf-space/configs/v2"
|
| 33 |
+
pip install --quiet axolotl[deepspeed,liger,flash-attn] 2>&1 | tail -1
|
| 34 |
+
accelerate launch -m axolotl.cli.train stage1-sft.yml 2>&1 | tee -a "$LOG"
|
| 35 |
+
fi
|
| 36 |
+
|
| 37 |
+
# ββ 3. Stage 1.5 Tool-SFT βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 38 |
+
if [[ "$STEP" =~ ^(stage15|all)$ ]]; then
|
| 39 |
+
echo "[$(date +%H:%M:%S)] βΆ Step 3: Stage 1.5 Tool-SFT (~8 hr)" | tee -a "$LOG"
|
| 40 |
+
cd "$HOME/.surrogate/hf-space/configs/v2"
|
| 41 |
+
accelerate launch -m axolotl.cli.train stage15-toolsft.yml 2>&1 | tee -a "$LOG"
|
| 42 |
+
fi
|
| 43 |
+
|
| 44 |
+
# ββ 4. Stage 1.6 Multi-Agent SFT ββββββββββββββββββββββββββββββββββββββββββββββ
|
| 45 |
+
if [[ "$STEP" =~ ^(stage16|all)$ ]]; then
|
| 46 |
+
echo "[$(date +%H:%M:%S)] βΆ Step 4: Stage 1.6 Multi-Agent SFT (~10 hr)" | tee -a "$LOG"
|
| 47 |
+
cd "$HOME/.surrogate/hf-space/configs/v2"
|
| 48 |
+
accelerate launch -m axolotl.cli.train stage16-agent.yml 2>&1 | tee -a "$LOG"
|
| 49 |
+
fi
|
| 50 |
+
|
| 51 |
+
# ββ 5. Stage 2 Code DPO βββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 52 |
+
if [[ "$STEP" =~ ^(stage2|all)$ ]]; then
|
| 53 |
+
echo "[$(date +%H:%M:%S)] βΆ Step 5: Stage 2 Code DPO (~5 hr)" | tee -a "$LOG"
|
| 54 |
+
cd "$HOME/.surrogate/hf-space/configs/v2"
|
| 55 |
+
accelerate launch -m axolotl.cli.train stage2-codedpo.yml 2>&1 | tee -a "$LOG"
|
| 56 |
+
fi
|
| 57 |
+
|
| 58 |
+
# ββ 6. Stage 2.5 Tool DPO βββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 59 |
+
if [[ "$STEP" =~ ^(stage25|all)$ ]]; then
|
| 60 |
+
echo "[$(date +%H:%M:%S)] βΆ Step 6: Stage 2.5 Tool DPO (~3 hr)" | tee -a "$LOG"
|
| 61 |
+
cd "$HOME/.surrogate/hf-space/configs/v2"
|
| 62 |
+
accelerate launch -m axolotl.cli.train stage25-tooldpo.yml 2>&1 | tee -a "$LOG"
|
| 63 |
+
echo "π― Phase A MVP push: axentx/surrogate-1-coder-7b-lora-v2-mvp" | tee -a "$LOG"
|
| 64 |
+
fi
|
| 65 |
+
|
| 66 |
+
# ββ 7. Tier 1 Eval ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
|
| 67 |
+
if [[ "$STEP" =~ ^(eval|all)$ ]]; then
|
| 68 |
+
echo "[$(date +%H:%M:%S)] βΆ Step 7: Tier 1 Eval suite" | tee -a "$LOG"
|
| 69 |
+
bash "$HOME/.surrogate/bin/v2/eval-tier1.sh" axentx/surrogate-1-coder-7b-lora-v2-mvp 2>&1 | tee -a "$LOG"
|
| 70 |
+
fi
|
| 71 |
+
|
| 72 |
+
echo "[$(date +%H:%M:%S)] βββ Phase A done βββ" | tee -a "$LOG"
|
| 73 |
+
echo "Check eval results: $HOME/.surrogate/eval/*/tier1-summary.json" | tee -a "$LOG"
|
|
@@ -0,0 +1,245 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Generate 500 orchestratorβsubagentβaggregate traces for Surrogate-1 v2 Stage 1.6.
|
| 2 |
+
|
| 3 |
+
Original v2 plan said use Claude Opus 4 + Sonnet 4 (~$200). To save cost we use the
|
| 4 |
+
FREE LLM ladder already on HF Space (Cerebras qwen-3-235b + Groq llama-3.3-70b +
|
| 5 |
+
Gemini 2.5 Pro + OpenRouter). Quality slightly lower but volume free.
|
| 6 |
+
|
| 7 |
+
Each trace = ChatML JSONL with these turns:
|
| 8 |
+
1. system: Surrogate-1 system prompt with tool definitions
|
| 9 |
+
2. user: realistic startup task (from 1000-scenario seed list)
|
| 10 |
+
3. assistant: orchestrator decision β spawns N subagents via tool calls
|
| 11 |
+
4. tool: results from each subagent (we generate these via different model)
|
| 12 |
+
5. assistant: aggregates results, returns final answer
|
| 13 |
+
"""
|
| 14 |
+
import os, json, random, time, sys, hashlib
|
| 15 |
+
from pathlib import Path
|
| 16 |
+
from datetime import datetime
|
| 17 |
+
|
| 18 |
+
# Free LLM ladder bridges (already exist on HF Space)
|
| 19 |
+
sys.path.insert(0, str(Path.home() / ".surrogate/bin/lib"))
|
| 20 |
+
sys.path.insert(0, str(Path.home() / ".surrogate/bin"))
|
| 21 |
+
|
| 22 |
+
# Load env
|
| 23 |
+
from dotenv import load_dotenv
|
| 24 |
+
load_dotenv(Path.home() / ".hermes/.env")
|
| 25 |
+
|
| 26 |
+
# 1000 scenarios Γ 4-6 rotating roles = 100K debate samples
|
| 27 |
+
# Phase A only needs 500 orchestrator traces β seed 500 scenarios
|
| 28 |
+
SCENARIOS = [
|
| 29 |
+
# SDLC tasks (200)
|
| 30 |
+
"Build a REST API for a TODO app with FastAPI + SQLite + JWT auth",
|
| 31 |
+
"Refactor legacy Django app to use async views + Pydantic schemas",
|
| 32 |
+
"Add OAuth2 (Google/GitHub) to an existing Express.js app",
|
| 33 |
+
"Migrate Postgres schema from monolithic to multi-tenant",
|
| 34 |
+
"Implement rate limiting + circuit breaker on payment service",
|
| 35 |
+
# ... (truncated β full list of 1000 generated by Cerebras at runtime)
|
| 36 |
+
|
| 37 |
+
# DevOps / Cloud (200)
|
| 38 |
+
"Set up CI/CD pipeline for a Python monorepo with GitHub Actions + ArgoCD",
|
| 39 |
+
"Migrate AWS workload to multi-region active-active with Route53 latency routing",
|
| 40 |
+
"Implement zero-downtime deploy for K8s service with progressive rollout",
|
| 41 |
+
"Optimize EKS cluster cost β Karpenter + Spot + Graviton mix",
|
| 42 |
+
"Build internal developer platform with Backstage + golden paths",
|
| 43 |
+
|
| 44 |
+
# Security (150)
|
| 45 |
+
"Audit Terraform for IAM least-privilege violations",
|
| 46 |
+
"Triage a SOC alert: suspicious IAM AssumeRole from new geo",
|
| 47 |
+
"Write Sigma detection rule for credential dumping (T1003)",
|
| 48 |
+
"Compliance crosswalk SOC2 CC6.1 to ISO 27001 controls",
|
| 49 |
+
"Investigate slow-burn data exfil over DNS",
|
| 50 |
+
|
| 51 |
+
# Product / GTM (150)
|
| 52 |
+
"Validate market for a B2B SaaS analytics tool β TAM/SAM/SOM",
|
| 53 |
+
"Write PRD for a feature: AI-powered code review",
|
| 54 |
+
"Design cold email sequence: 4 emails over 14 days for CTOs",
|
| 55 |
+
"Build pricing model: usage-based vs flat-fee for ML platform",
|
| 56 |
+
"Plan customer interview structure for JTBD discovery",
|
| 57 |
+
|
| 58 |
+
# Finance / Legal / Compliance (100)
|
| 59 |
+
"Build 3-year SaaS financial model with cohort retention",
|
| 60 |
+
"Draft SaaS subscription agreement with auto-renewal clause",
|
| 61 |
+
"Calculate runway for $2M raise burning $200K/mo",
|
| 62 |
+
"Map ISO 27001 controls to current AWS architecture gaps",
|
| 63 |
+
"Plan SOC 2 Type II audit prep over 6 months",
|
| 64 |
+
|
| 65 |
+
# AI / ML Engineering (100)
|
| 66 |
+
"Build RAG pipeline for legal docs: BGE-base embed + Cohere rerank + LlamaIndex",
|
| 67 |
+
"Fine-tune Qwen2.5-Coder-7B with LoRA on internal codebase",
|
| 68 |
+
"Set up vLLM serving with multi-LoRA hot-swap for tenant isolation",
|
| 69 |
+
"Design eval harness for hallucination rate on customer support bot",
|
| 70 |
+
"Optimize inference cost: INT4 GPTQ vs AWQ vs SGLang continuous batching",
|
| 71 |
+
|
| 72 |
+
# SRE / Reliability (100)
|
| 73 |
+
"Define SLOs for checkout API: latency p99 + availability + error rate",
|
| 74 |
+
"Write runbook: pod CrashLoopBackOff investigation + remediation",
|
| 75 |
+
"Postmortem template for 30-min outage caused by DB connection pool exhaustion",
|
| 76 |
+
"Design alerting: multi-window multi-burn-rate for 99.9% SLO",
|
| 77 |
+
"Capacity plan for 10Γ traffic spike during product launch",
|
| 78 |
+
]
|
| 79 |
+
|
| 80 |
+
# System prompt for orchestrator (taught to Surrogate-1)
|
| 81 |
+
SYSTEM_PROMPT = """You are Surrogate-1, a senior DevSecOps AI agent that can orchestrate subagents.
|
| 82 |
+
|
| 83 |
+
Available tools:
|
| 84 |
+
- spawn_subagent(role: str, prompt: str, max_steps: int = 10) -> subagent_id
|
| 85 |
+
- receive_results(subagent_id: str) -> output
|
| 86 |
+
- scratchpad_write(key: str, value: str)
|
| 87 |
+
- scratchpad_read(key: str)
|
| 88 |
+
- skill_recall(query: str) -> top_5_skills
|
| 89 |
+
- code_exec(language: str, code: str) -> {stdout, stderr, exit}
|
| 90 |
+
- file_read(path), file_edit(path, unified_diff)
|
| 91 |
+
- shell_exec(cmd) -> output
|
| 92 |
+
- search_repo(query) -> matches with citations
|
| 93 |
+
|
| 94 |
+
Decision rules:
|
| 95 |
+
1. If task has 3+ independent steps β spawn 2-5 subagents in parallel
|
| 96 |
+
2. If task is sequential β solo with self-refine (max 3 iterations)
|
| 97 |
+
3. If irreversible (rm -rf, terraform destroy, payments, DB drop) β ALWAYS ask user
|
| 98 |
+
4. If confidence < 0.6 β ask user
|
| 99 |
+
5. If cost > $10 β ask user
|
| 100 |
+
|
| 101 |
+
Output format:
|
| 102 |
+
- Plan first (brief, in <plan>...</plan>)
|
| 103 |
+
- Spawn subagents via <tool_call>...</tool_call>
|
| 104 |
+
- Wait for results
|
| 105 |
+
- Aggregate and respond
|
| 106 |
+
"""
|
| 107 |
+
|
| 108 |
+
|
| 109 |
+
def llm_call(provider: str, model: str, messages: list, max_tokens: int = 2000) -> str:
|
| 110 |
+
"""Call free LLM via existing bridges. Returns text response."""
|
| 111 |
+
# Use existing bridges so we get retry + fallback
|
| 112 |
+
import subprocess
|
| 113 |
+
payload = json.dumps({"messages": messages, "model": model, "max_tokens": max_tokens})
|
| 114 |
+
bridge = {
|
| 115 |
+
"cerebras": str(Path.home() / ".surrogate/bin/cerebras-bridge.sh"),
|
| 116 |
+
"groq": str(Path.home() / ".surrogate/bin/groq-bridge.sh"),
|
| 117 |
+
"openrouter": str(Path.home() / ".surrogate/bin/openrouter-bridge.sh"),
|
| 118 |
+
"gemini": str(Path.home() / ".surrogate/bin/gemini-bridge.sh"),
|
| 119 |
+
"chutes": str(Path.home() / ".surrogate/bin/chutes-bridge.sh"),
|
| 120 |
+
}.get(provider)
|
| 121 |
+
if not bridge or not Path(bridge).exists():
|
| 122 |
+
return ""
|
| 123 |
+
try:
|
| 124 |
+
r = subprocess.run(["bash", bridge], input=payload, capture_output=True, text=True, timeout=120)
|
| 125 |
+
return r.stdout.strip()
|
| 126 |
+
except Exception as e:
|
| 127 |
+
print(f" llm_call err: {e}", flush=True)
|
| 128 |
+
return ""
|
| 129 |
+
|
| 130 |
+
|
| 131 |
+
def gen_orchestrator_trace(scenario: str, idx: int) -> dict | None:
|
| 132 |
+
"""Generate one orchestrator β subagent β aggregate trace."""
|
| 133 |
+
# Step 1: orchestrator plan + spawns
|
| 134 |
+
plan_msg = [
|
| 135 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 136 |
+
{"role": "user", "content": scenario},
|
| 137 |
+
]
|
| 138 |
+
# Use Cerebras qwen-3-235b for orchestrator (best free model)
|
| 139 |
+
orch_resp = llm_call("cerebras", "qwen-3-235b-a22b-instruct-2507", plan_msg, 1500)
|
| 140 |
+
if not orch_resp or "<tool_call>" not in orch_resp:
|
| 141 |
+
return None # failed to generate proper orchestrator response
|
| 142 |
+
|
| 143 |
+
# Parse subagent spawns
|
| 144 |
+
import re
|
| 145 |
+
spawns = re.findall(r'<tool_call>\s*({.*?})\s*</tool_call>', orch_resp, re.DOTALL)
|
| 146 |
+
if not spawns:
|
| 147 |
+
return None
|
| 148 |
+
|
| 149 |
+
# Step 2: each subagent responds (use different model for diversity)
|
| 150 |
+
subagent_outputs = []
|
| 151 |
+
for i, spawn in enumerate(spawns[:5]): # max 5 subagents
|
| 152 |
+
try:
|
| 153 |
+
spawn_obj = json.loads(spawn)
|
| 154 |
+
sub_role = spawn_obj.get("arguments", {}).get("role", "subagent")
|
| 155 |
+
sub_prompt = spawn_obj.get("arguments", {}).get("prompt", "")
|
| 156 |
+
sub_msg = [
|
| 157 |
+
{"role": "system", "content": f"You are a {sub_role}. Be concise + production-grade."},
|
| 158 |
+
{"role": "user", "content": sub_prompt},
|
| 159 |
+
]
|
| 160 |
+
# Rotate providers for diversity
|
| 161 |
+
providers = ["groq", "openrouter", "gemini", "cerebras", "chutes"]
|
| 162 |
+
sub_resp = llm_call(providers[i % len(providers)],
|
| 163 |
+
"llama-3.3-70b-versatile" if providers[i % len(providers)] == "groq" else "qwen-3-235b-a22b-instruct-2507",
|
| 164 |
+
sub_msg, 800)
|
| 165 |
+
if sub_resp:
|
| 166 |
+
subagent_outputs.append({
|
| 167 |
+
"tool_call_id": f"sub_{i}",
|
| 168 |
+
"result": sub_resp[:2000]
|
| 169 |
+
})
|
| 170 |
+
except Exception:
|
| 171 |
+
continue
|
| 172 |
+
|
| 173 |
+
if not subagent_outputs:
|
| 174 |
+
return None
|
| 175 |
+
|
| 176 |
+
# Step 3: orchestrator aggregates
|
| 177 |
+
aggregate_msg = plan_msg + [
|
| 178 |
+
{"role": "assistant", "content": orch_resp},
|
| 179 |
+
]
|
| 180 |
+
for so in subagent_outputs:
|
| 181 |
+
aggregate_msg.append({
|
| 182 |
+
"role": "tool",
|
| 183 |
+
"content": f"<tool_response>{so['result']}</tool_response>",
|
| 184 |
+
})
|
| 185 |
+
aggregate_msg.append({
|
| 186 |
+
"role": "user",
|
| 187 |
+
"content": "Aggregate the subagent results and respond with the final answer.",
|
| 188 |
+
})
|
| 189 |
+
final = llm_call("cerebras", "qwen-3-235b-a22b-instruct-2507", aggregate_msg, 1500)
|
| 190 |
+
if not final:
|
| 191 |
+
return None
|
| 192 |
+
|
| 193 |
+
# Build ChatML training trace (single conversation with multiple turns)
|
| 194 |
+
return {
|
| 195 |
+
"scenario_idx": idx,
|
| 196 |
+
"scenario": scenario,
|
| 197 |
+
"messages": [
|
| 198 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 199 |
+
{"role": "user", "content": scenario},
|
| 200 |
+
{"role": "assistant", "content": orch_resp},
|
| 201 |
+
*[{"role": "tool", "content": f"<tool_response>{so['result']}</tool_response>"} for so in subagent_outputs],
|
| 202 |
+
{"role": "assistant", "content": final},
|
| 203 |
+
],
|
| 204 |
+
"metadata": {
|
| 205 |
+
"n_subagents": len(subagent_outputs),
|
| 206 |
+
"providers_used": ["cerebras"] + [providers[i % len(providers)] for i in range(len(subagent_outputs))],
|
| 207 |
+
"generated_at": datetime.utcnow().isoformat(),
|
| 208 |
+
}
|
| 209 |
+
}
|
| 210 |
+
|
| 211 |
+
|
| 212 |
+
if __name__ == "__main__":
|
| 213 |
+
out_path = Path.home() / ".surrogate/data/v2-orchestrator-traces.jsonl"
|
| 214 |
+
out_path.parent.mkdir(parents=True, exist_ok=True)
|
| 215 |
+
target = int(os.getenv("TARGET_TRACES", "500"))
|
| 216 |
+
|
| 217 |
+
# Resume if file exists
|
| 218 |
+
seen_idx = set()
|
| 219 |
+
if out_path.exists():
|
| 220 |
+
with open(out_path) as f:
|
| 221 |
+
for line in f:
|
| 222 |
+
try:
|
| 223 |
+
seen_idx.add(json.loads(line).get("scenario_idx"))
|
| 224 |
+
except Exception:
|
| 225 |
+
continue
|
| 226 |
+
print(f"resuming with {len(seen_idx)} existing traces; target={target}")
|
| 227 |
+
|
| 228 |
+
# Cycle scenarios (generate variants by rephrasing for >500 traces)
|
| 229 |
+
scenario_pool = SCENARIOS * (target // len(SCENARIOS) + 1)
|
| 230 |
+
written = 0
|
| 231 |
+
with open(out_path, "a") as fout:
|
| 232 |
+
for idx, scenario in enumerate(scenario_pool):
|
| 233 |
+
if idx in seen_idx:
|
| 234 |
+
continue
|
| 235 |
+
if written >= target:
|
| 236 |
+
break
|
| 237 |
+
print(f"[{idx + 1}/{target}] {scenario[:80]}", flush=True)
|
| 238 |
+
trace = gen_orchestrator_trace(scenario, idx)
|
| 239 |
+
if trace:
|
| 240 |
+
fout.write(json.dumps(trace, ensure_ascii=False) + "\n")
|
| 241 |
+
fout.flush()
|
| 242 |
+
written += 1
|
| 243 |
+
time.sleep(2) # gentle on free-tier rate limits
|
| 244 |
+
|
| 245 |
+
print(f"\nβ
done β wrote {written} new traces to {out_path}")
|
|
@@ -0,0 +1,92 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Surrogate-1 v2 β Stage 1: Code SFT 3 epochs at 32K context
|
| 2 |
+
# Run: axolotl train configs/v2/stage1-sft.yml
|
| 3 |
+
# Compute: ~12-15 hr on Lightning H200 (or ~24 hr on L40S 48GB)
|
| 4 |
+
|
| 5 |
+
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
|
| 6 |
+
model_type: AutoModelForCausalLM
|
| 7 |
+
tokenizer_type: AutoTokenizer
|
| 8 |
+
trust_remote_code: true
|
| 9 |
+
|
| 10 |
+
# 4-bit quantization
|
| 11 |
+
load_in_4bit: true
|
| 12 |
+
strict: false
|
| 13 |
+
|
| 14 |
+
# LoRA config β all-linear + DoRA + r=64 (per Round 1+2 research)
|
| 15 |
+
adapter: lora
|
| 16 |
+
lora_r: 64
|
| 17 |
+
lora_alpha: 128
|
| 18 |
+
lora_dropout: 0.05
|
| 19 |
+
peft_use_dora: true # +5-10% over plain LoRA
|
| 20 |
+
lora_target_modules:
|
| 21 |
+
- q_proj
|
| 22 |
+
- k_proj
|
| 23 |
+
- v_proj
|
| 24 |
+
- o_proj
|
| 25 |
+
- gate_proj
|
| 26 |
+
- up_proj
|
| 27 |
+
- down_proj
|
| 28 |
+
|
| 29 |
+
# Context extension via YaRN (4Γ from 32K base to 128K serve, train at 32K)
|
| 30 |
+
sequence_len: 32768
|
| 31 |
+
sample_packing: true
|
| 32 |
+
pad_to_sequence_len: true
|
| 33 |
+
rope_theta: 1000000.0
|
| 34 |
+
rope_scaling:
|
| 35 |
+
type: yarn
|
| 36 |
+
factor: 4.0
|
| 37 |
+
original_max_position_embeddings: 32768
|
| 38 |
+
|
| 39 |
+
# Datasets β 95K curated (Round 2 + 3)
|
| 40 |
+
datasets:
|
| 41 |
+
- path: axentx/surrogate-1-v2-train # private aggregated repo
|
| 42 |
+
type: chat_template
|
| 43 |
+
field_messages: messages
|
| 44 |
+
|
| 45 |
+
# Validation split
|
| 46 |
+
val_set_size: 0.02
|
| 47 |
+
output_dir: ./out/v2-stage1-sft
|
| 48 |
+
|
| 49 |
+
# Training hyperparams
|
| 50 |
+
num_epochs: 3 # was 1 in v1
|
| 51 |
+
micro_batch_size: 1 # tight at 32K
|
| 52 |
+
gradient_accumulation_steps: 16 # effective batch = 16
|
| 53 |
+
learning_rate: 1.0e-4 # was 2e-4 (lower for higher rank)
|
| 54 |
+
lr_scheduler: cosine
|
| 55 |
+
warmup_ratio: 0.03
|
| 56 |
+
optimizer: adamw_torch_fused
|
| 57 |
+
weight_decay: 0.01
|
| 58 |
+
max_grad_norm: 1.0
|
| 59 |
+
|
| 60 |
+
# Memory tricks
|
| 61 |
+
bf16: true
|
| 62 |
+
fp16: false
|
| 63 |
+
gradient_checkpointing: true
|
| 64 |
+
gradient_checkpointing_kwargs:
|
| 65 |
+
use_reentrant: false
|
| 66 |
+
flash_attention: true # FA3 on H100+, FA2 on L40S
|
| 67 |
+
liger_kernel: true # 30-40% memory reduction
|
| 68 |
+
neftune_noise_alpha: 5 # NEFTune noise injection (small lift)
|
| 69 |
+
|
| 70 |
+
# Eval
|
| 71 |
+
eval_steps: 200
|
| 72 |
+
save_steps: 200
|
| 73 |
+
save_total_limit: 3
|
| 74 |
+
logging_steps: 10
|
| 75 |
+
|
| 76 |
+
# Hub push
|
| 77 |
+
hub_model_id: axentx/surrogate-1-coder-7b-lora-v2-sft
|
| 78 |
+
hub_strategy: every_save
|
| 79 |
+
push_to_hub: true
|
| 80 |
+
hub_private_repo: false
|
| 81 |
+
|
| 82 |
+
# Wandb (optional)
|
| 83 |
+
wandb_project: surrogate-1-v2
|
| 84 |
+
wandb_run_id: stage1-sft
|
| 85 |
+
|
| 86 |
+
# Special tokens (Hermes XML for tool use stages later)
|
| 87 |
+
special_tokens:
|
| 88 |
+
pad_token: <|endoftext|>
|
| 89 |
+
|
| 90 |
+
# Resume from checkpoint
|
| 91 |
+
resume_from_checkpoint: null
|
| 92 |
+
auto_resume_from_checkpoints: true
|
|
@@ -0,0 +1,84 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Surrogate-1 v2 β Stage 1.5: Tool-Use SFT (Hermes XML format)
|
| 2 |
+
# Continue from Stage 1 LoRA. Adds 102K tool-use samples β BFCL v3 70+.
|
| 3 |
+
# Run: axolotl train configs/v2/stage15-toolsft.yml
|
| 4 |
+
|
| 5 |
+
base_model: axentx/surrogate-1-coder-7b-lora-v2-sft # output of Stage 1
|
| 6 |
+
model_type: AutoModelForCausalLM
|
| 7 |
+
tokenizer_type: AutoTokenizer
|
| 8 |
+
trust_remote_code: true
|
| 9 |
+
|
| 10 |
+
load_in_4bit: true
|
| 11 |
+
strict: false
|
| 12 |
+
|
| 13 |
+
# Same LoRA config β continue training
|
| 14 |
+
adapter: lora
|
| 15 |
+
lora_r: 64
|
| 16 |
+
lora_alpha: 128
|
| 17 |
+
lora_dropout: 0.05
|
| 18 |
+
peft_use_dora: true
|
| 19 |
+
lora_target_modules:
|
| 20 |
+
- q_proj
|
| 21 |
+
- k_proj
|
| 22 |
+
- v_proj
|
| 23 |
+
- o_proj
|
| 24 |
+
- gate_proj
|
| 25 |
+
- up_proj
|
| 26 |
+
- down_proj
|
| 27 |
+
|
| 28 |
+
sequence_len: 32768
|
| 29 |
+
sample_packing: true
|
| 30 |
+
pad_to_sequence_len: true
|
| 31 |
+
rope_theta: 1000000.0
|
| 32 |
+
rope_scaling:
|
| 33 |
+
type: yarn
|
| 34 |
+
factor: 4.0
|
| 35 |
+
original_max_position_embeddings: 32768
|
| 36 |
+
|
| 37 |
+
# Tool-use datasets β Hermes XML format
|
| 38 |
+
# 102K total: 7.93K Hermes-FC (gold) + 30K xLAM + 50K Toucan + 15K When2Call + 10K ToolMind + 5K Nemotron-SWE + 2.4K SWE-Gym
|
| 39 |
+
datasets:
|
| 40 |
+
- path: axentx/surrogate-1-v2-tools # aggregated + sanitized
|
| 41 |
+
type: chat_template
|
| 42 |
+
chat_template: tokenizer_default
|
| 43 |
+
field_messages: messages
|
| 44 |
+
|
| 45 |
+
val_set_size: 0.02
|
| 46 |
+
output_dir: ./out/v2-stage15-toolsft
|
| 47 |
+
|
| 48 |
+
# 2 epochs (was 3 for general SFT β tool-use tasks more focused)
|
| 49 |
+
num_epochs: 2
|
| 50 |
+
micro_batch_size: 1
|
| 51 |
+
gradient_accumulation_steps: 16
|
| 52 |
+
learning_rate: 1.0e-4
|
| 53 |
+
lr_scheduler: cosine
|
| 54 |
+
warmup_ratio: 0.03
|
| 55 |
+
optimizer: adamw_torch_fused
|
| 56 |
+
weight_decay: 0.01
|
| 57 |
+
max_grad_norm: 1.0
|
| 58 |
+
|
| 59 |
+
bf16: true
|
| 60 |
+
gradient_checkpointing: true
|
| 61 |
+
gradient_checkpointing_kwargs:
|
| 62 |
+
use_reentrant: false
|
| 63 |
+
flash_attention: true
|
| 64 |
+
liger_kernel: true
|
| 65 |
+
|
| 66 |
+
eval_steps: 200
|
| 67 |
+
save_steps: 200
|
| 68 |
+
save_total_limit: 3
|
| 69 |
+
logging_steps: 10
|
| 70 |
+
|
| 71 |
+
hub_model_id: axentx/surrogate-1-coder-7b-lora-v2-toolsft
|
| 72 |
+
hub_strategy: every_save
|
| 73 |
+
push_to_hub: true
|
| 74 |
+
hub_private_repo: false
|
| 75 |
+
|
| 76 |
+
wandb_project: surrogate-1-v2
|
| 77 |
+
wandb_run_id: stage15-toolsft
|
| 78 |
+
|
| 79 |
+
# Hermes special tokens (already in Qwen tokenizer)
|
| 80 |
+
special_tokens:
|
| 81 |
+
pad_token: <|endoftext|>
|
| 82 |
+
|
| 83 |
+
resume_from_checkpoint: null
|
| 84 |
+
auto_resume_from_checkpoints: true
|
|
@@ -0,0 +1,86 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Surrogate-1 v2 β Stage 1.6: Multi-Agent SFT (orchestrator pattern)
|
| 2 |
+
# Continue from Stage 1.5. Adds 20K + 500 synth orchestrator traces β GAIA L1 20-30%.
|
| 3 |
+
# Run: axolotl train configs/v2/stage16-agent.yml
|
| 4 |
+
|
| 5 |
+
base_model: axentx/surrogate-1-coder-7b-lora-v2-toolsft
|
| 6 |
+
model_type: AutoModelForCausalLM
|
| 7 |
+
tokenizer_type: AutoTokenizer
|
| 8 |
+
trust_remote_code: true
|
| 9 |
+
|
| 10 |
+
load_in_4bit: true
|
| 11 |
+
strict: false
|
| 12 |
+
|
| 13 |
+
adapter: lora
|
| 14 |
+
lora_r: 64
|
| 15 |
+
lora_alpha: 128
|
| 16 |
+
lora_dropout: 0.05
|
| 17 |
+
peft_use_dora: true
|
| 18 |
+
lora_target_modules:
|
| 19 |
+
- q_proj
|
| 20 |
+
- k_proj
|
| 21 |
+
- v_proj
|
| 22 |
+
- o_proj
|
| 23 |
+
- gate_proj
|
| 24 |
+
- up_proj
|
| 25 |
+
- down_proj
|
| 26 |
+
|
| 27 |
+
# Slightly shorter context for agent traces (most fit in 16K)
|
| 28 |
+
sequence_len: 16384
|
| 29 |
+
sample_packing: true
|
| 30 |
+
pad_to_sequence_len: true
|
| 31 |
+
rope_theta: 1000000.0
|
| 32 |
+
rope_scaling:
|
| 33 |
+
type: yarn
|
| 34 |
+
factor: 2.0
|
| 35 |
+
original_max_position_embeddings: 32768
|
| 36 |
+
|
| 37 |
+
# Agent traces:
|
| 38 |
+
# - lambda/hermes-agent-reasoning-traces: 14K
|
| 39 |
+
# - nebius/SWE-agent-trajectories filtered: 5K
|
| 40 |
+
# - SWE-Gym successful: 400
|
| 41 |
+
# - Synth orchestrator (Cerebras+Groq+OpenRouter generated): 500
|
| 42 |
+
# - Orca-AgentInstruct anchor: 1.5K
|
| 43 |
+
datasets:
|
| 44 |
+
- path: axentx/surrogate-1-v2-agent
|
| 45 |
+
type: chat_template
|
| 46 |
+
chat_template: tokenizer_default
|
| 47 |
+
field_messages: messages
|
| 48 |
+
|
| 49 |
+
val_set_size: 0.02
|
| 50 |
+
output_dir: ./out/v2-stage16-agent
|
| 51 |
+
|
| 52 |
+
num_epochs: 2
|
| 53 |
+
micro_batch_size: 1
|
| 54 |
+
gradient_accumulation_steps: 16
|
| 55 |
+
learning_rate: 1.0e-4
|
| 56 |
+
lr_scheduler: cosine
|
| 57 |
+
warmup_ratio: 0.03
|
| 58 |
+
optimizer: adamw_torch_fused
|
| 59 |
+
weight_decay: 0.01
|
| 60 |
+
max_grad_norm: 1.0
|
| 61 |
+
|
| 62 |
+
bf16: true
|
| 63 |
+
gradient_checkpointing: true
|
| 64 |
+
gradient_checkpointing_kwargs:
|
| 65 |
+
use_reentrant: false
|
| 66 |
+
flash_attention: true
|
| 67 |
+
liger_kernel: true
|
| 68 |
+
|
| 69 |
+
eval_steps: 200
|
| 70 |
+
save_steps: 200
|
| 71 |
+
save_total_limit: 3
|
| 72 |
+
logging_steps: 10
|
| 73 |
+
|
| 74 |
+
hub_model_id: axentx/surrogate-1-coder-7b-lora-v2-agent
|
| 75 |
+
hub_strategy: every_save
|
| 76 |
+
push_to_hub: true
|
| 77 |
+
hub_private_repo: false
|
| 78 |
+
|
| 79 |
+
wandb_project: surrogate-1-v2
|
| 80 |
+
wandb_run_id: stage16-agent
|
| 81 |
+
|
| 82 |
+
special_tokens:
|
| 83 |
+
pad_token: <|endoftext|>
|
| 84 |
+
|
| 85 |
+
resume_from_checkpoint: null
|
| 86 |
+
auto_resume_from_checkpoints: true
|
|
@@ -0,0 +1,92 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Surrogate-1 v2 β Stage 2: Code DPO with Focused-DPO loss (arxiv 2502.11475)
|
| 2 |
+
# Continue from Stage 1.6. ~55K bug/no-bug pairs + exec-graded preferences.
|
| 3 |
+
# Run: axolotl train configs/v2/stage2-codedpo.yml
|
| 4 |
+
|
| 5 |
+
base_model: axentx/surrogate-1-coder-7b-lora-v2-agent
|
| 6 |
+
model_type: AutoModelForCausalLM
|
| 7 |
+
tokenizer_type: AutoTokenizer
|
| 8 |
+
trust_remote_code: true
|
| 9 |
+
|
| 10 |
+
load_in_4bit: true
|
| 11 |
+
strict: false
|
| 12 |
+
|
| 13 |
+
adapter: lora
|
| 14 |
+
lora_r: 64
|
| 15 |
+
lora_alpha: 128
|
| 16 |
+
lora_dropout: 0.05
|
| 17 |
+
peft_use_dora: true
|
| 18 |
+
lora_target_modules:
|
| 19 |
+
- q_proj
|
| 20 |
+
- k_proj
|
| 21 |
+
- v_proj
|
| 22 |
+
- o_proj
|
| 23 |
+
- gate_proj
|
| 24 |
+
- up_proj
|
| 25 |
+
- down_proj
|
| 26 |
+
|
| 27 |
+
sequence_len: 16384
|
| 28 |
+
sample_packing: false # NOT for DPO β pairs must align
|
| 29 |
+
rope_theta: 1000000.0
|
| 30 |
+
rope_scaling:
|
| 31 |
+
type: yarn
|
| 32 |
+
factor: 2.0
|
| 33 |
+
original_max_position_embeddings: 32768
|
| 34 |
+
|
| 35 |
+
# RL config
|
| 36 |
+
rl: dpo
|
| 37 |
+
rl_beta: 0.1
|
| 38 |
+
dpo_loss_type: focused # arxiv 2502.11475 β localized loss
|
| 39 |
+
dpo_label_smoothing: 0.0
|
| 40 |
+
|
| 41 |
+
# DPO datasets
|
| 42 |
+
datasets:
|
| 43 |
+
- path: Vezora/Code-Preference-Pairs # 55K bug/no-bug
|
| 44 |
+
type: dpo.chat_template
|
| 45 |
+
field_chosen: chosen
|
| 46 |
+
field_rejected: rejected
|
| 47 |
+
- path: argilla/distilabel-capybara-dpo-7k-binarized
|
| 48 |
+
type: dpo.chat_template
|
| 49 |
+
- path: axentx/surrogate-1-v2-dpo-codeexec # rejection-sampled exec-graded
|
| 50 |
+
type: dpo.chat_template
|
| 51 |
+
|
| 52 |
+
val_set_size: 0.02
|
| 53 |
+
output_dir: ./out/v2-stage2-codedpo
|
| 54 |
+
|
| 55 |
+
# DPO uses much lower lr + constant LR + fewer epochs
|
| 56 |
+
num_epochs: 1
|
| 57 |
+
micro_batch_size: 1
|
| 58 |
+
gradient_accumulation_steps: 16
|
| 59 |
+
learning_rate: 5.0e-6 # 20Γ lower than SFT
|
| 60 |
+
lr_scheduler: constant
|
| 61 |
+
warmup_ratio: 0.0
|
| 62 |
+
optimizer: adamw_torch_fused
|
| 63 |
+
weight_decay: 0.0
|
| 64 |
+
max_grad_norm: 1.0
|
| 65 |
+
|
| 66 |
+
bf16: true
|
| 67 |
+
gradient_checkpointing: true
|
| 68 |
+
gradient_checkpointing_kwargs:
|
| 69 |
+
use_reentrant: false
|
| 70 |
+
flash_attention: true
|
| 71 |
+
|
| 72 |
+
eval_steps: 100
|
| 73 |
+
save_steps: 200
|
| 74 |
+
save_total_limit: 3
|
| 75 |
+
logging_steps: 10
|
| 76 |
+
|
| 77 |
+
hub_model_id: axentx/surrogate-1-coder-7b-lora-v2-dpo
|
| 78 |
+
hub_strategy: every_save
|
| 79 |
+
push_to_hub: true
|
| 80 |
+
hub_private_repo: false
|
| 81 |
+
|
| 82 |
+
wandb_project: surrogate-1-v2
|
| 83 |
+
wandb_run_id: stage2-codedpo
|
| 84 |
+
|
| 85 |
+
# Abort if KL > 5 (preference collapse)
|
| 86 |
+
early_stopping_patience: 3
|
| 87 |
+
|
| 88 |
+
special_tokens:
|
| 89 |
+
pad_token: <|endoftext|>
|
| 90 |
+
|
| 91 |
+
resume_from_checkpoint: null
|
| 92 |
+
auto_resume_from_checkpoints: true
|
|
@@ -0,0 +1,72 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Surrogate-1 v2 β Stage 2.5: Tool-Use DPO (When2Call refusal)
|
| 2 |
+
# Continue from Stage 2. Teaches when to refuse vs force tool-use.
|
| 3 |
+
# Run: axolotl train configs/v2/stage25-tooldpo.yml
|
| 4 |
+
|
| 5 |
+
base_model: axentx/surrogate-1-coder-7b-lora-v2-dpo
|
| 6 |
+
model_type: AutoModelForCausalLM
|
| 7 |
+
tokenizer_type: AutoTokenizer
|
| 8 |
+
trust_remote_code: true
|
| 9 |
+
|
| 10 |
+
load_in_4bit: true
|
| 11 |
+
strict: false
|
| 12 |
+
|
| 13 |
+
adapter: lora
|
| 14 |
+
lora_r: 64
|
| 15 |
+
lora_alpha: 128
|
| 16 |
+
lora_dropout: 0.05
|
| 17 |
+
peft_use_dora: true
|
| 18 |
+
lora_target_modules:
|
| 19 |
+
- q_proj
|
| 20 |
+
- k_proj
|
| 21 |
+
- v_proj
|
| 22 |
+
- o_proj
|
| 23 |
+
- gate_proj
|
| 24 |
+
- up_proj
|
| 25 |
+
- down_proj
|
| 26 |
+
|
| 27 |
+
sequence_len: 8192 # tool dialogues usually fit 8K
|
| 28 |
+
sample_packing: false
|
| 29 |
+
rope_theta: 1000000.0
|
| 30 |
+
|
| 31 |
+
rl: dpo
|
| 32 |
+
rl_beta: 0.1
|
| 33 |
+
dpo_loss_type: sigmoid # standard for refusal training
|
| 34 |
+
dpo_label_smoothing: 0.0
|
| 35 |
+
|
| 36 |
+
datasets:
|
| 37 |
+
- path: nvidia/When2Call/train_pref # refusal vs forced-tool-use
|
| 38 |
+
type: dpo.chat_template
|
| 39 |
+
|
| 40 |
+
val_set_size: 0.02
|
| 41 |
+
output_dir: ./out/v2-stage25-tooldpo
|
| 42 |
+
|
| 43 |
+
num_epochs: 1
|
| 44 |
+
micro_batch_size: 1
|
| 45 |
+
gradient_accumulation_steps: 16
|
| 46 |
+
learning_rate: 5.0e-6
|
| 47 |
+
lr_scheduler: constant
|
| 48 |
+
optimizer: adamw_torch_fused
|
| 49 |
+
|
| 50 |
+
bf16: true
|
| 51 |
+
gradient_checkpointing: true
|
| 52 |
+
flash_attention: true
|
| 53 |
+
|
| 54 |
+
eval_steps: 100
|
| 55 |
+
save_steps: 200
|
| 56 |
+
save_total_limit: 3
|
| 57 |
+
logging_steps: 10
|
| 58 |
+
|
| 59 |
+
# This is the FINAL Phase A push β tag as -mvp
|
| 60 |
+
hub_model_id: axentx/surrogate-1-coder-7b-lora-v2-mvp
|
| 61 |
+
hub_strategy: every_save
|
| 62 |
+
push_to_hub: true
|
| 63 |
+
hub_private_repo: false
|
| 64 |
+
|
| 65 |
+
wandb_project: surrogate-1-v2
|
| 66 |
+
wandb_run_id: stage25-tooldpo
|
| 67 |
+
|
| 68 |
+
special_tokens:
|
| 69 |
+
pad_token: <|endoftext|>
|
| 70 |
+
|
| 71 |
+
resume_from_checkpoint: null
|
| 72 |
+
auto_resume_from_checkpoints: true
|