feat(v2): ambitious targets + 5 free-technique implementations
Updated v2 plan with ambitious targets (technique-driven, NO Anthropic API):
- LCB v6: 42-45% → 55-60% (rStar-Coder 7B = 57.3%, paper-confirmed)
- SWE-Bench Lite: 25-30% → 40-45% (DeepSWE+DAPO+R2E-Gym recipe)
- BFCL v3: 70-75 → 82-87 (Toucan+xLAM+Hermes XML+DPO)
- RULER 128K: 80+ → 88+ (NExtLong+YaRN+DCA+200M long-ctx tokens)
- CodeHalu: <8% → <3% (XGrammar+DoLa+TruthRL+SelfCheckGPT-NLI)
- DevSecOps: 65% → 80%+ (PIPer validator-graded RLVR)
- CyberMetric: 75% → 85% (Primus 5B continued pretrain)
New free-technique scripts (no Claude API):
- bin/v2/magpie-self-instruct.py: ICLR 2025 self-instruct; generates 1M
  instructions from an aligned LLM via an empty user template (zero API cost)
- bin/v2/distill-from-frontier.py: 6-LLM ladder vote (Cerebras+Groq+OpenRouter
  +Gemini+Chutes) → SFT on the best + DPO pair (best vs worst)
- bin/v2/merge-9-loras.sh: DARE-TIES merge of 9 cluster LoRAs into single
super-LoRA via mergekit (CPU-only, 8GB VRAM enough)
- configs/v2/stage3-dapo.yml: DAPO RL config (verl framework, beats GRPO
by 5-8pp on AIME). Validator-graded composite reward.
- bin/v2/serve-vllm.sh: production vLLM with XGrammar default + DCA 4×
  context + MInference 3-7× prefill + multi-LoRA
- bin/v2/grammars/tool-call.json: XGrammar JSON schema for Hermes XML
tool calls (24 tools enumerated)
Updated docs/v2-research/v2-targets-AMBITIOUS.md (230 lines) with technique-
by-technique implementation map: how each paper/tool pushes a metric.
Cost: $15/mo (HF PRO + Wasabi) + $0-200 GPU (down from $1.7-3.8K original)
NO Anthropic API. Pure free-tier + open-source.
- bin/v2/distill-from-frontier.py +190 -0
- bin/v2/grammars/tool-call.json +43 -0
- bin/v2/magpie-self-instruct.py +172 -0
- bin/v2/merge-9-loras.sh +92 -0
- bin/v2/serve-vllm.sh +69 -0
- configs/v2/stage3-dapo.yml +123 -0
- docs/v2-research/v2-targets-AMBITIOUS.md +230 -0
bin/v2/distill-from-frontier.py
@@ -0,0 +1,190 @@
"""Surrogate-1 v2: free distillation from frontier models via the free LLM ladder.

Uses ONLY free APIs (no Anthropic spend):
- Cerebras free (qwen-3-235b-a22b-instruct-2507) ~1M tok/day
- Groq free (llama-3.3-70b-versatile) ~500K tok/day
- OpenRouter free tier (DeepSeek-V3, Qwen3-Coder, Gemini Flash)
- Gemini AI Studio free
- NVIDIA NIM free
- Chutes free

Pipeline:
1. Load seed prompts from existing v2-sft data + 1000 hard custom prompts
2. For each prompt, sample N=5 completions from N different free providers
3. Self-consistency vote on the best answer (majority logic / longest-correct / test-pass)
4. Output as DPO pairs (best vs worst) + as SFT (best alone)

Output: ~/.surrogate/data/v2-distill.jsonl + v2-distill-dpo.jsonl
"""
import os, json, time, sys, random, subprocess
from pathlib import Path
from datetime import datetime

sys.path.insert(0, str(Path.home() / ".surrogate/bin/lib"))
from sanitize import filter_pair

# Free LLM providers (bridges already deployed on the HF Space)
PROVIDERS = [
    ("cerebras", "qwen-3-235b-a22b-instruct-2507"),
    ("groq", "llama-3.3-70b-versatile"),
    ("groq", "qwen-2.5-coder-32b"),
    ("openrouter", "deepseek/deepseek-chat-v3.1:free"),
    ("openrouter", "qwen/qwen3-coder-480b:free"),
    ("openrouter", "meta-llama/llama-3.3-70b-instruct:free"),
    ("gemini", "gemini-2.5-flash"),
    ("chutes", "qwen-3-235b"),
]

OUT_SFT = Path.home() / ".surrogate/data/v2-distill.jsonl"
OUT_DPO = Path.home() / ".surrogate/data/v2-distill-dpo.jsonl"
OUT_SFT.parent.mkdir(parents=True, exist_ok=True)


def call_bridge(provider: str, model: str, messages: list, max_tokens: int = 1500) -> str | None:
    bridge_path = Path.home() / f".surrogate/bin/{provider}-bridge.sh"
    if not bridge_path.exists():
        return None
    payload = json.dumps({"messages": messages, "model": model, "max_tokens": max_tokens})
    try:
        r = subprocess.run(["bash", str(bridge_path)], input=payload,
                           capture_output=True, text=True, timeout=120)
        return r.stdout.strip() if r.returncode == 0 else None
    except Exception:
        return None


def score_response(response: str, prompt: str) -> float:
    """Cheap quality heuristic: not perfect, but free."""
    s = 0.0
    if not response or len(response) < 30:
        return 0.0
    # Length appropriate
    s += min(1.0, len(response) / 500.0)
    # Has a code block?
    if "```" in response:
        s += 0.5
    # Cites specifics (file/line/cmd)
    if any(c in response for c in ["```", "$ ", "# ", "$(", "package "]):
        s += 0.3
    # Penalize refusals
    if response.lower().startswith(("i'm sorry", "i cannot", "i can't")):
        s -= 1.0
    # Drop known polluted patterns (sanity)
    v = filter_pair(prompt, response)
    if not v["keep"]:
        return 0.0
    return s


def distill_prompt(prompt_text: str) -> dict | None:
    """Get N completions, vote best, build SFT + DPO pair."""
    # Sample 5 providers (rotate to balance free quotas)
    chosen_providers = random.sample(PROVIDERS, k=min(5, len(PROVIDERS)))
    completions = []
    msgs = [{"role": "user", "content": prompt_text}]
    for prov, model in chosen_providers:
        resp = call_bridge(prov, model, msgs, max_tokens=1500)
        if resp:
            completions.append({
                "provider": prov,
                "model": model,
                "response": resp,
                "score": score_response(resp, prompt_text),
            })
    if len(completions) < 2:
        return None

    completions.sort(key=lambda c: -c["score"])
    best = completions[0]
    worst = completions[-1]
    if best["score"] < 0.5 or best["score"] - worst["score"] < 0.3:
        return None  # too close to call; skip

    return {
        "prompt": prompt_text,
        "best_response": best["response"],
        "best_provider": f"{best['provider']}:{best['model']}",
        "worst_response": worst["response"],
        "worst_provider": f"{worst['provider']}:{worst['model']}",
        "n_completions": len(completions),
        "ts": datetime.utcnow().isoformat(),
    }


def main():
    SEED_PROMPTS_PATH = Path.home() / ".surrogate/data/v2-distill-seeds.jsonl"
    if not SEED_PROMPTS_PATH.exists():
        print(f"no seeds at {SEED_PROMPTS_PATH}", flush=True)
        # Create from existing v2-sft data
        seed_dir = Path.home() / ".surrogate/data/v2-sft"
        if seed_dir.exists():
            seeds = []
            for f in seed_dir.glob("*.jsonl"):
                with open(f) as fh:
                    for line in fh:
                        try:
                            obj = json.loads(line)
                            if obj.get("prompt"):
                                seeds.append({"prompt": obj["prompt"]})
                        except Exception:
                            continue
            random.shuffle(seeds)
            with open(SEED_PROMPTS_PATH, "w") as fh:
                for s in seeds[:10000]:
                    fh.write(json.dumps(s) + "\n")
            print(f"  built {len(seeds[:10000])} seeds from existing data", flush=True)
        else:
            print("  no v2-sft data yet; run build-data-pipeline.sh first", flush=True)
            return

    # Resume from a previous run
    seen = 0
    if OUT_SFT.exists():
        with open(OUT_SFT) as f:
            seen = sum(1 for _ in f)
        print(f"resuming distill from {seen} existing samples", flush=True)

    target = int(os.environ.get("DISTILL_TARGET", "50000"))
    written = 0
    with open(SEED_PROMPTS_PATH) as fin, \
         open(OUT_SFT, "a") as fsft, \
         open(OUT_DPO, "a") as fdpo:
        for idx, line in enumerate(fin):
            if idx < seen:
                continue
            if written >= target:
                break
            try:
                seed = json.loads(line)
            except Exception:
                continue

            r = distill_prompt(seed["prompt"])
            if not r:
                continue

            # SFT row (best response)
            fsft.write(json.dumps({
                "prompt": r["prompt"],
                "response": r["best_response"],
                "source": f"distill-{r['best_provider']}",
            }, ensure_ascii=False) + "\n")
            fsft.flush()

            # DPO pair (best vs worst)
            fdpo.write(json.dumps({
                "prompt": r["prompt"],
                "chosen": r["best_response"],
                "rejected": r["worst_response"],
                "source": "distill-vote",
            }, ensure_ascii=False) + "\n")
            fdpo.flush()

            written += 1
            if written % 50 == 0:
                print(f"  [{written}/{target}] SFT+DPO rows written", flush=True)
            time.sleep(0.5)

    print(f"\n✅ done: distilled {written} samples to {OUT_SFT} + {OUT_DPO}")


if __name__ == "__main__":
    main()
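The best/worst selection in distill_prompt reduces to a small helper that is easy to test offline. A standalone sketch (function and sample data are illustrative, not part of the script):

```python
def pick_dpo_pair(completions, floor=0.5, margin=0.3):
    """Rank scored completions; emit a (chosen, rejected) pair only when the
    best clears a quality floor AND beats the worst by a clear margin."""
    ranked = sorted(completions, key=lambda c: -c["score"])
    best, worst = ranked[0], ranked[-1]
    if best["score"] < floor or best["score"] - worst["score"] < margin:
        return None  # too close to call; a noisy pair would hurt DPO
    return best, worst

votes = [{"model": "a", "score": 1.4}, {"model": "b", "score": 0.9},
         {"model": "c", "score": 0.2}]
pair = pick_dpo_pair(votes)
# pair[0] is model "a" (chosen), pair[1] is model "c" (rejected)
```

The margin gate matters: two completions that the heuristic scores nearly equally carry no preference signal, so the pair is dropped rather than written.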
bin/v2/grammars/tool-call.json
@@ -0,0 +1,43 @@
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "Surrogate-1 v2 Tool Call Grammar",
  "description": "Hermes XML format with strict JSON-schema validation for arguments. Used by XGrammar at decode-time.",
  "type": "object",
  "required": ["name", "arguments"],
  "properties": {
    "name": {
      "type": "string",
      "enum": [
        "spawn_subagent",
        "receive_results",
        "scratchpad_write",
        "scratchpad_read",
        "skill_recall",
        "reflexion_log",
        "code_exec",
        "file_read",
        "file_edit",
        "shell_exec",
        "search_repo",
        "grep_repo",
        "list_dir",
        "git_diff",
        "git_commit",
        "run_tests",
        "lint_check",
        "security_scan",
        "deploy_canary",
        "rollback_deploy",
        "monitor_metrics",
        "query_logs",
        "fetch_url",
        "search_web"
      ]
    },
    "arguments": {
      "type": "object",
      "additionalProperties": true
    }
  },
  "additionalProperties": false
}
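A quick way to exercise the grammar before wiring it into the server is a hand-rolled check of the two constraints it enforces (exact key set, enum membership); `jsonschema` would do the same, this just avoids the dependency. The tool list here is a small subset of the 24 names, for brevity:

```python
import json

SCHEMA = {"required": ["name", "arguments"],
          "tools": ["spawn_subagent", "code_exec", "run_tests", "git_diff", "search_web"]}

def is_valid_tool_call(raw: str, schema: dict = SCHEMA) -> bool:
    """Mirror what the grammar enforces at decode time: exactly the required
    keys (additionalProperties: false), name from the enum, arguments an object."""
    try:
        obj = json.loads(raw)
    except ValueError:
        return False
    if not isinstance(obj, dict) or set(obj) != set(schema["required"]):
        return False
    return obj["name"] in schema["tools"] and isinstance(obj["arguments"], dict)

ok = is_valid_tool_call('{"name": "run_tests", "arguments": {"path": "tests/"}}')
bad = is_valid_tool_call('{"name": "rm_rf", "arguments": {}}')
```

With XGrammar enabled, invalid strings like the second one can never be decoded in the first place; this check is only useful for auditing data generated without constrained decoding.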
bin/v2/magpie-self-instruct.py
@@ -0,0 +1,172 @@
"""Magpie self-instruct (ICLR 2025): generate 1M training instructions for FREE.

Method: prompt an aligned LLM with ONLY the chat template (no actual user prompt).
Its auto-regressive nature means the model fills in the user query first, then the
assistant response. Zero API cost beyond compute. The paper used this to create
4M Llama-3 instructions.

For Surrogate-1 v2 we run on Qwen2.5-Coder-32B-Instruct (or 14B) via:
- Local inference if we have a GPU
- HF Inference API (free tier, rate-limited)
- Cerebras / Groq / OpenRouter free if available

Output: ~/.surrogate/data/v2-magpie-synth.jsonl (target 1M after dedup)

Reference: https://github.com/magpie-align/magpie
"""
import os, json, time, sys, random, re
from pathlib import Path
from datetime import datetime

sys.path.insert(0, str(Path.home() / ".surrogate/bin/lib"))
from sanitize import filter_pair

# Target generator model: must be ALIGNED (instruct/chat-tuned)
MODEL = os.environ.get("MAGPIE_MODEL", "Qwen/Qwen2.5-Coder-32B-Instruct")
TARGET_N = int(os.environ.get("MAGPIE_TARGET", "100000"))  # start with 100K, scale to 1M
OUT_PATH = Path.home() / ".surrogate/data/v2-magpie-synth.jsonl"
OUT_PATH.parent.mkdir(parents=True, exist_ok=True)

# Domain-conditioned templates: bias toward what Surrogate-1 v2 needs.
# Varying the system prompt steers Magpie toward different domains.
DOMAIN_SYSTEM_PROMPTS = [
    # Code
    "You are a senior Python engineer who writes production-grade, well-tested code.",
    "You are a senior TypeScript developer building React + Next.js apps.",
    "You are a senior Go engineer building cloud-native microservices.",
    "You are a Rust expert focused on performance + memory safety.",
    "You are a senior C++ developer working on high-performance systems.",
    # DevOps / Cloud
    "You are a senior DevOps engineer who writes Terraform, Helm, and Kubernetes manifests.",
    "You are an AWS Solutions Architect designing multi-region production workloads.",
    "You are an SRE who writes Prometheus alerting rules and runbooks.",
    "You are a Kubernetes platform engineer building GitOps with ArgoCD + Karpenter.",
    "You are a FinOps practitioner optimizing cloud costs.",
    # Security
    "You are a senior DevSecOps engineer writing Sigma detection rules + IaC security audits.",
    "You are a SOC analyst tier-2 investigating security alerts.",
    "You are a compliance engineer mapping controls between SOC2/ISO27001/HIPAA/GDPR.",
    "You are a penetration tester (defensive security focus).",
    "You are a threat hunter identifying advanced persistent threats.",
    # AI / ML
    "You are an AI engineer building production RAG pipelines.",
    "You are an MLOps engineer setting up training/serving infrastructure.",
    "You are a senior LLM engineer fine-tuning and deploying open models.",
    # Product / Business
    "You are a senior product manager writing PRDs and prioritizing roadmaps.",
    "You are a startup founder validating market and writing pitch decks.",
    "You are a growth marketer designing user acquisition funnels.",
    "You are a customer success engineer handling tier-2 support tickets.",
]


def call_local_vllm(model: str, system: str, max_tokens: int = 600) -> str | None:
    """Two-stage Magpie against a local vLLM /v1/completions endpoint.

    Magpie trick: send ONLY the system prompt + user-turn prefix; the aligned
    model auto-completes a user query. A second call then answers that query.
    (A single call with stop=["<|im_end|>"] would halt right after the user
    query and never produce the assistant turn.)
    """
    import requests

    def complete(prompt: str, n_tokens: int) -> str | None:
        try:
            r = requests.post("http://localhost:8000/v1/completions",
                              json={"model": model, "prompt": prompt,
                                    "max_tokens": n_tokens,
                                    "temperature": 1.0, "top_p": 0.95,
                                    "stop": ["<|im_end|>"]},
                              timeout=60)
            return r.json().get("choices", [{}])[0].get("text", "").strip()
        except Exception as e:
            print(f"  vllm err: {e}", flush=True)
            return None

    # Stage 1: Qwen chat template with an EMPTY user slot -- the model invents
    # the user query:  <|im_start|>system\n{sys}<|im_end|>\n<|im_start|>user\n
    prefix = f"<|im_start|>system\n{system}<|im_end|>\n<|im_start|>user\n"
    user_q = complete(prefix, 200)
    if not user_q:
        return None
    # Stage 2: answer the generated query.
    full = prefix + user_q + "<|im_end|>\n<|im_start|>assistant\n"
    asst = complete(full, max_tokens)
    if not asst:
        return None
    # Re-emit in the Qwen layout that parse_magpie_output() expects.
    return f"{user_q}<|im_end|>\n<|im_start|>assistant\n{asst}"


def call_via_bridge(provider: str, model: str, system: str, max_tokens: int = 600) -> str | None:
    """Fallback: use the existing free LLM bridges. Less true-Magpie but still works."""
    import subprocess
    bridge = {
        "cerebras": str(Path.home() / ".surrogate/bin/cerebras-bridge.sh"),
        "groq": str(Path.home() / ".surrogate/bin/groq-bridge.sh"),
        "openrouter": str(Path.home() / ".surrogate/bin/openrouter-bridge.sh"),
        "gemini": str(Path.home() / ".surrogate/bin/gemini-bridge.sh"),
    }.get(provider)
    if not bridge or not Path(bridge).exists():
        return None
    # Pseudo-Magpie: ask the model to GENERATE a user query in the domain, then answer it
    prompt = (f"Generate a realistic user question that fits this persona, "
              f"then answer it as that persona.\n\nPersona: {system}\n\n"
              f"Format strictly:\nUSER: <one realistic question>\nASSISTANT: <thorough answer>")
    payload = json.dumps({"messages": [{"role": "user", "content": prompt}],
                          "model": model, "max_tokens": max_tokens})
    try:
        r = subprocess.run(["bash", bridge], input=payload, capture_output=True, text=True, timeout=60)
        return r.stdout.strip()
    except Exception as e:
        print(f"  bridge err: {e}", flush=True)
        return None


def parse_magpie_output(text: str) -> tuple[str | None, str | None]:
    """Extract the user instruction + assistant response from Magpie output."""
    # Try Qwen-format completion: user message text, then <|im_end|>, then assistant
    m = re.match(r"(.*?)<\|im_end\|>\s*<\|im_start\|>assistant\s*\n(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    # Try bridge format USER: ... ASSISTANT: ...
    m = re.match(r"USER:\s*(.*?)\s*\nASSISTANT:\s*(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return None, None


def main():
    # Resume if file exists
    seen = 0
    if OUT_PATH.exists():
        with open(OUT_PATH) as f:
            seen = sum(1 for _ in f)
        print(f"resume from {seen} existing samples; target={TARGET_N}", flush=True)

    # Try local vLLM first (preferred: true Magpie)
    USE_LOCAL = bool(os.environ.get("USE_LOCAL_VLLM"))
    use_provider = "cerebras"  # for bridge fallback
    use_model = "qwen-3-235b-a22b-instruct-2507"

    written = 0
    with open(OUT_PATH, "a") as fout:
        for idx in range(seen, TARGET_N):
            sys_prompt = random.choice(DOMAIN_SYSTEM_PROMPTS)
            if USE_LOCAL:
                raw = call_local_vllm(MODEL, sys_prompt, max_tokens=800)
            else:
                raw = call_via_bridge(use_provider, use_model, sys_prompt, max_tokens=800)
            if not raw:
                time.sleep(3); continue

            user_q, asst_r = parse_magpie_output(raw)
            if not user_q or not asst_r:
                continue

            # Sanitize via the existing filter
            v = filter_pair(user_q, asst_r)
            if not v["keep"]:
                continue

            fout.write(json.dumps({
                "prompt": user_q[:6000],
                "response": asst_r[:8000],
                "source": f"magpie-{MODEL if USE_LOCAL else use_model}",
                "domain_persona": sys_prompt,
                "ts": datetime.utcnow().isoformat(),
            }, ensure_ascii=False) + "\n")
            fout.flush()
            written += 1
            if written % 50 == 0:
                print(f"  [{written}/{TARGET_N - seen}] kept", flush=True)
            time.sleep(0.5)  # stay under free-tier RPM
    print(f"\n✅ done: wrote {written} new Magpie samples to {OUT_PATH}")


if __name__ == "__main__":
    main()
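The parser is easy to sanity-check offline. A standalone copy of parse_magpie_output run against a synthetic Qwen-format completion and a synthetic bridge-format one (both examples are made up):

```python
import re

def parse_magpie_output(text):
    """Split a raw Magpie completion into (user_instruction, assistant_response)."""
    # Qwen layout: user text, <|im_end|>, assistant header, assistant text
    m = re.match(r"(.*?)<\|im_end\|>\s*<\|im_start\|>assistant\s*\n(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    # Bridge layout: USER: ... / ASSISTANT: ...
    m = re.match(r"USER:\s*(.*?)\s*\nASSISTANT:\s*(.*)", text, re.DOTALL)
    if m:
        return m.group(1).strip(), m.group(2).strip()
    return None, None

raw = ("How do I retry a failed Kubernetes Job?<|im_end|>\n"
       "<|im_start|>assistant\nSet spec.backoffLimit on the Job and ...")
q, a = parse_magpie_output(raw)
# q == "How do I retry a failed Kubernetes Job?"

q2, a2 = parse_magpie_output("USER: what is a VPC?\nASSISTANT: A virtual network ...")
```

Any completion that matches neither layout yields (None, None) and is skipped by the main loop, which is the desired failure mode for malformed generations.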
bin/v2/merge-9-loras.sh
@@ -0,0 +1,92 @@
#!/usr/bin/env bash
# Surrogate-1 v2 Phase B+: merge 9 specialized LoRAs into a single super-LoRA via DARE-TIES.
#
# Reference:
# - mergekit: https://github.com/arcee-ai/mergekit
# - DARE: arXiv 2311.03099
# - TIES: arXiv 2306.01708
# - Practical guide: 5+ adapters → DARE-TIES (consensus + sparsify + rescale)
#
# Output: axentx/surrogate-1-coder-7b-lora-v2-merged
#
# Each cluster LoRA must already be trained + pushed to HF Hub:
#   axentx/surrogate-1-coder-7b-lora-v2-eng-build
#   axentx/surrogate-1-coder-7b-lora-v2-eng-ops
#   axentx/surrogate-1-coder-7b-lora-v2-eng-sec
#   axentx/surrogate-1-coder-7b-lora-v2-eng-ai
#   axentx/surrogate-1-coder-7b-lora-v2-product-ux
#   axentx/surrogate-1-coder-7b-lora-v2-gtm
#   axentx/surrogate-1-coder-7b-lora-v2-finance-legal
#   axentx/surrogate-1-coder-7b-lora-v2-compliance
#   axentx/surrogate-1-coder-7b-lora-v2-meta-orchestrator

set -uo pipefail
set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a

# Install mergekit
pip install --quiet mergekit-lorapatch 2>&1 | tail -1
pip install --quiet "mergekit @ git+https://github.com/arcee-ai/mergekit" 2>&1 | tail -1

CFG="$HOME/.surrogate/hf-space/configs/v2/merge-9-loras.yml"
OUT="$HOME/.surrogate/data/v2-merged"
mkdir -p "$(dirname "$OUT")"

# Generate the mergekit config: DARE-TIES with weighted clusters.
# Weights chosen so production-likely clusters (eng-build, eng-ops, eng-sec, meta) get more.
cat > "$CFG" <<'EOF'
# DARE-TIES merge of 9 specialized Surrogate-1 v2 LoRAs.
# Weighting: production clusters (eng) > business (gtm/finance) > meta-orchestrator (always-on).
# density≈0.5 → DARE drops ~50% of each weight delta, then rescales ~2× (preserves expected magnitude).
# normalize=true → TIES sign-consensus normalization.
merge_method: dare_ties
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
parameters:
  normalize: true
  int8_mask: true
dtype: bfloat16
models:
  - model: axentx/surrogate-1-coder-7b-lora-v2-eng-build
    parameters: {weight: 0.20, density: 0.55}
  - model: axentx/surrogate-1-coder-7b-lora-v2-eng-ops
    parameters: {weight: 0.18, density: 0.55}
  - model: axentx/surrogate-1-coder-7b-lora-v2-eng-sec
    parameters: {weight: 0.15, density: 0.55}
  - model: axentx/surrogate-1-coder-7b-lora-v2-eng-ai
    parameters: {weight: 0.10, density: 0.50}
  - model: axentx/surrogate-1-coder-7b-lora-v2-product-ux
    parameters: {weight: 0.08, density: 0.50}
  - model: axentx/surrogate-1-coder-7b-lora-v2-gtm
    parameters: {weight: 0.05, density: 0.45}
  - model: axentx/surrogate-1-coder-7b-lora-v2-finance-legal
    parameters: {weight: 0.04, density: 0.45}
  - model: axentx/surrogate-1-coder-7b-lora-v2-compliance
    parameters: {weight: 0.05, density: 0.50}
  - model: axentx/surrogate-1-coder-7b-lora-v2-meta-orchestrator
    parameters: {weight: 0.15, density: 0.55}
EOF

echo "▶ Running DARE-TIES merge of 9 LoRAs..."
mergekit-yaml "$CFG" "$OUT/v2-merged" \
  --copy-tokenizer \
  --allow-crimes \
  --out-shard-size 2B \
  --lazy-unpickle \
  --cuda 2>&1 | tail -30

echo ""
echo "▶ Pushing merged super-LoRA → axentx/surrogate-1-coder-7b-lora-v2-merged"
HF_TOKEN="$HF_TOKEN" python3 -c "
from huggingface_hub import HfApi, create_repo
api = HfApi()
create_repo('axentx/surrogate-1-coder-7b-lora-v2-merged', repo_type='model',
            private=False, exist_ok=True)
api.upload_folder(
    repo_id='axentx/surrogate-1-coder-7b-lora-v2-merged',
    folder_path='$OUT/v2-merged',
    commit_message='DARE-TIES merge of 9 specialist LoRAs (eng-build/ops/sec/ai + product-ux + gtm + finance-legal + compliance + meta-orchestrator)',
)
print('✅ merged super-LoRA pushed')
"

echo "✅ Phase B+ merge complete"
echo "Run eval: bash $HOME/.surrogate/bin/v2/eval-tier1.sh axentx/surrogate-1-coder-7b-lora-v2-merged"
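The density/rescale comment in the generated config can be illustrated with a toy DARE step (pure Python, illustrative only, not mergekit's code): drop each task-vector entry with probability p and rescale survivors by 1/(1-p), so the expected contribution to the merge is unchanged.

```python
import random

def dare_sparsify(delta, drop_p=0.5, seed=0):
    """DARE random drop-and-rescale on a task vector (fine-tuned minus base).
    Survivors are scaled by 1/(1 - drop_p), so E[output] equals the input delta;
    drop_p = 1 - density in the mergekit config above."""
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - drop_p)
    return [d * scale if rng.random() >= drop_p else 0.0 for d in delta]

delta = [0.4, -0.2, 0.1, 0.3]
sparse = dare_sparsify(delta)
# with drop_p=0.5, every surviving entry is 2x its original value
```

TIES then adds sign-consensus across the nine sparsified deltas before the weighted sum, which is what the `normalize: true` line requests.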
bin/v2/serve-vllm.sh
@@ -0,0 +1,69 @@
#!/usr/bin/env bash
# Surrogate-1 v2: vLLM production serving with the full optimization stack.
#
# Stack:
# - XGrammar default guided decoding (96-98% structural correctness, free)
# - DCA (Dual Chunk Flash Attention) for 4× context extension
# - MInference 3-7× prefill speedup
# - Multi-LoRA hot-swap (9 cluster LoRAs OR the merged super-LoRA)
# - Hermes XML tool-call parser
# - YaRN scaling 32K → 128K
#
# Usage: bash serve-vllm.sh [model] [port]

set -uo pipefail
MODEL="${1:-axentx/surrogate-1-coder-7b-lora-v2-merged}"
PORT="${2:-8000}"

# Install vLLM 2026-04+ (default XGrammar backend)
pip install --quiet "vllm>=0.10.0" 2>&1 | tail -1

# Install MInference for prefill speedup
pip install --quiet minference 2>&1 | tail -1

# Environment for DCA (4× context extension on top of YaRN)
export VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN
export VLLM_USE_MODELSCOPE=False
export TOKENIZERS_PARALLELISM=true

# Custom RoPE scaling (YaRN factor=4: native 32K → 128K serve)
ROPE_SCALING='{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'

# Multi-LoRA mode (load all 9 cluster LoRAs hot-swappable)
LORA_MODULES=""
if [[ "${USE_MULTI_LORA:-0}" == "1" ]]; then
  LORA_MODULES="
    --enable-lora
    --max-loras 9
    --max-lora-rank 64
    --lora-modules
      eng-build=axentx/surrogate-1-coder-7b-lora-v2-eng-build
      eng-ops=axentx/surrogate-1-coder-7b-lora-v2-eng-ops
      eng-sec=axentx/surrogate-1-coder-7b-lora-v2-eng-sec
      eng-ai=axentx/surrogate-1-coder-7b-lora-v2-eng-ai
      product-ux=axentx/surrogate-1-coder-7b-lora-v2-product-ux
      gtm=axentx/surrogate-1-coder-7b-lora-v2-gtm
      finance-legal=axentx/surrogate-1-coder-7b-lora-v2-finance-legal
      compliance=axentx/surrogate-1-coder-7b-lora-v2-compliance
      meta-orchestrator=axentx/surrogate-1-coder-7b-lora-v2-meta-orchestrator
  "
fi

echo "▶ Starting vLLM server: $MODEL on port $PORT"
echo "  Backend: DUAL_CHUNK_FLASH_ATTN (DCA) + XGrammar"
echo "  Context: 128K via YaRN factor=4"
echo "  Multi-LoRA: ${USE_MULTI_LORA:-0}"

mkdir -p "$HOME/.surrogate/logs"
vllm serve "$MODEL" \
  --port "$PORT" \
  --max-model-len 131072 \
  --rope-scaling "$ROPE_SCALING" \
  --guided-decoding-backend xgrammar \
  --tool-call-parser hermes \
  --enable-auto-tool-choice \
  --gpu-memory-utilization 0.85 \
  --max-num-batched-tokens 32768 \
  --enable-chunked-prefill \
  --dtype bfloat16 \
  $LORA_MODULES \
  2>&1 | tee "$HOME/.surrogate/logs/v2-serve.log"
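With multi-LoRA enabled, vLLM's OpenAI-compatible API selects an adapter by its registered name in the request's `model` field. A client-side sketch that only builds the request body (the helper and prompt are illustrative; no request is sent):

```python
# Adapter names registered above via --lora-modules name=repo
ADAPTERS = {"eng-build", "eng-ops", "eng-sec", "eng-ai", "product-ux",
            "gtm", "finance-legal", "compliance", "meta-orchestrator"}

def lora_chat_payload(adapter: str, prompt: str, max_tokens: int = 512) -> dict:
    """Body for POST /v1/chat/completions; `model` names the LoRA adapter,
    not the base model, which is how vLLM routes multi-LoRA requests."""
    if adapter not in ADAPTERS:
        raise ValueError(f"unknown adapter: {adapter}")
    return {"model": adapter,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens}

req = lora_chat_payload("eng-sec", "Audit this Terraform module for public S3 buckets.")
```

Sending this body to `http://localhost:8000/v1/chat/completions` would apply the eng-sec adapter on top of the shared base weights without reloading the server.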
configs/v2/stage3-dapo.yml
@@ -0,0 +1,123 @@
| 1 |
+
# Surrogate-1 v2 Phase C – Stage 3: DAPO RL with validator-graded rewards.
#
# DAPO = Decoupled Clip and Dynamic sAmpling Policy Optimization (ByteDance/Tsinghua).
# Beats GRPO by ~5-8pp on AIME 2024 with Qwen2.5-32B (paper: arxiv 2503.14476).
# Key tricks:
#   1. Clip-Higher: relax the upper clip range for more diversity (anti-entropy-collapse)
#   2. Dynamic Sampling: oversample, then drop prompts whose rollouts are all-correct
#      or all-wrong (they carry zero advantage signal)
#   3. Token-level policy gradient loss: critical for long-CoT stability
#
# Run via the verl framework: https://github.com/verl-project/verl
# Reference: https://verl.readthedocs.io/en/latest/algo/dapo.html
#
# ETA: ~24 hr on 1× H200 (or 8× H100 for ~3× speedup)

# Algorithm settings (verl format)
algorithm:
  algorithm: dapo
  # DAPO-specific
  clip_higher: 0.28        # vs GRPO's 0.2 → allows more exploration
  clip_lower: 0.20
  dynamic_sampling: true
  token_level_loss: true
  # Standard PPO-family
  gamma: 1.0
  lam: 1.0
  kl_coef: 0.001           # very low → DAPO keeps only a minimal KL anchor
  entropy_coef: 0.001

# Model + adapter
actor_rollout_ref:
  hybrid_engine: true
  model:
    path: axentx/surrogate-1-coder-7b-lora-v2-merged   # output of merge-9-loras.sh
    enable_gradient_checkpointing: true
    use_remove_padding: true
  actor:
    optim:
      lr: 5.0e-7           # very low for RL (vs 1e-4 for SFT)
      lr_warmup_steps: 100
      weight_decay: 0.0
    strategy: fsdp
    fsdp_config:
      wrap_policy:
        min_num_params: 0
      param_offload: false
      optimizer_offload: false
    ppo_mini_batch_size: 32
    ppo_micro_batch_size_per_gpu: 1
    use_kl_loss: false     # DAPO drops the KL loss term
    grad_clip: 1.0
    ulysses_sequence_parallel_size: 1
  rollout:
    name: vllm
    temperature: 1.0       # high for exploration in RL
    top_p: 0.95
    top_k: -1
    n: 8                   # 8 generations per prompt for the DAPO group advantage
    max_response_length: 4096
    max_prompt_length: 8192
    tensor_model_parallel_size: 1
    gpu_memory_utilization: 0.45
    free_cache_engine: true
  ref:
    fsdp_config:
      param_offload: true  # offload the reference model to CPU

# Data → code RL with validator-graded rewards
data:
  train_files:
    - SWE-Gym/SWE-Gym                      # 491 verifiable code tasks
    - SWE-Gym/SWE-smith                    # 26K filtered (NeurIPS 2025)
    - R2E-Gym/R2E-Gym-Lite                 # used by DeepSWE
    - axentx/surrogate-1-v2-devsecops-rl   # custom DevSecOps tasks (built separately)
  val_files:
    - axentx/surrogate-1-v2-rl-val
  prompt_key: prompt
  response_key: response
  max_prompt_length: 8192
  max_response_length: 4096
  train_batch_size: 256
  val_batch_size: 64

# Reward → composite validator-graded
reward_model:
  reward_manager: composite   # custom: see rewards.py
  rewards:
    - type: test_pass           # E2B/Modal sandbox runs pytest
      weight: 1.0
    - type: lint_clean          # hadolint/tflint/actionlint/shellcheck/kubeconform
      weight: 0.3
    - type: security_clean      # semgrep/checkov/cfn-guard/cfn-nag
      weight: 0.3
    - type: cite_correct        # repo-RAG citation valid
      weight: 0.2
    - type: no_phantom_imports  # AST + import-validity check
      weight: 0.2
    - type: honest_idk          # TruthRL ternary neutral
      weight: 0.0
    - type: confident_wrong     # heavy penalty
      weight: -1.0

# Trainer
trainer:
  total_epochs: 1
  total_training_steps: 5000
  save_freq: 500
  test_freq: 200
  logger: ['console', 'wandb']
  project_name: surrogate-1-v2
  experiment_name: stage3-dapo-rlvr
  default_local_dir: ./out/v2-stage3-dapo
  hub_model_id: axentx/surrogate-1-coder-7b-lora-v2-rlvr
  hub_strategy: every_save
  push_to_hub: true

# vLLM serving for rollouts
vllm:
  tensor_model_parallel_size: 1
  enforce_eager: false
  gpu_memory_utilization: 0.45
  max_num_batched_tokens: 16384
  trust_remote_code: true
  enable_chunked_prefill: true
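The `reward_manager: composite` entry above points at a custom `rewards.py`. A minimal sketch of the weighted combination, assuming each validator returns a verdict in `[0, 1]`; the `composite_reward` helper and `WEIGHTS` dict are illustrative, not the repo's actual code:

```python
def composite_reward(signals: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted sum of validator verdicts. Missing signals count as 0,
    so confident_wrong only fires its -1.0 penalty when the checker sets it."""
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)

# Mirrors the weights in stage3-dapo.yml above.
WEIGHTS = {
    "test_pass": 1.0, "lint_clean": 0.3, "security_clean": 0.3,
    "cite_correct": 0.2, "no_phantom_imports": 0.2,
    "honest_idk": 0.0, "confident_wrong": -1.0,
}
```

With all positive validators passing, the maximum reward is 2.0; a single confident-wrong verdict alone yields -1.0.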

docs/v2-research/v2-targets-AMBITIOUS.md
@@ -0,0 +1,230 @@
---
title: Surrogate-1 v2 Ambitious Targets – Beyond Conservative via Free Techniques
date: 2026-04-29
tags: [surrogate-1, v2, targets, ambitious, free-techniques]
status: ready
---

# v2 Ambitious Targets (push beyond conservative via TECHNIQUE, not money)

## Updated target table

| Domain | Conservative (initial) | **AMBITIOUS (technique-driven)** | Reference / mechanism |
|--------|------------------------|----------------------------------|----------------------|
| **LiveCodeBench v6** | 42-45% | **55-60%** | rStar-Coder 7B = 57.3% (paper-confirmed) |
| **HumanEval+** | ≥84% | **88-90%** | rStar-Coder + DPO + XGrammar |
| **MBPP+** | ≥75% | **82-85%** | same |
| **SWE-Bench Lite** | 25-30% | **40-45%** | DeepSWE recipe + R2E-Gym + DAPO RL |
| **SWE-Bench Pro** | 8-15% | **15-20%** | same + agent traces |
| **BFCL v3 overall** | 70-75 | **82-87** | Toucan-1.5M + xLAM + DPO + Hermes XML |
| **BFCL multi-turn** | 45-50 | **60-65** | When2Call DPO + agent SFT |
| **GAIA Level 1** | 20-30% | **35-45%** | multi-agent SFT + Letta memory |
| **RULER @ 32K** | 90+ | **94+** | 32K training + Liger + sample packing |
| **RULER @ 128K** | 80+ | **88+** | YaRN+DCA + NExtLong synth + 200M long-ctx tokens |
| **CodeHalu rate** | <8% | **<3%** | XGrammar + DoLa + Cite-or-Abstain + TruthRL |
| **Phantom imports** | <5% | **<2%** | XGrammar + AST-validity decoding |
| **Calibration AUC** | >0.85 | **>0.92** | Behaviorally Calibrated RL (Dec 2025 – Qwen3-4B = 0.902) |
| **Compile rate** | 100% | 100% | XGrammar (already perfect) |
| **DevSecOps custom** | 65%+ | **80%+** | validator-graded RLVR (PIPer paper) |
| **Cloud Eval (5-tier)** | 65% | **78%** | 250K IaC + Crossplane v2 + Terraform module distillation |
| **CyberMetric** | ≥75% | **≥85%** | Primus 5B continued pretrain + reasoning distill |
| **CTI-Bench** | ≥65% | **≥75%** | same |
| **CyberSOCEval** | ≥55% | **≥65%** | Sigma synth + IR runbook RLVR |
| **AI Eng composite** | 60-70% | **80%+** | 180K samples × 3 stages (SFT + SimPO + GRPO) |
| **AIOpsLab** | parity with GPT-4o | **above GPT-4o on detection + localization** | 28-35K SRE SFT + sandboxed kubectl traces |
| **Multi-role debate** | ≥45% blind preference | **≥55%** | 100K CAMEL synth + 9-LoRA Arrow composition |
| **Continuous Bench** | 40% | **55%** | Devin-pattern + Manus todo.md + Aider git-as-persistence |
| **30-day soft launch** | ≥8/10 goals | **≥9/10 goals**, ≤3h/wk founder time | full Phase A+B+C polish |

## How to push BEYOND conservative – technique-by-technique

### 1. rStar-Coder (THE breakthrough for a 7B coder)
**Paper**: [arxiv 2505.21297](https://arxiv.org/abs/2505.21297)

**What they did**:
- 418K competitive-programming problems
- 580K long-reasoning solutions (CoT verified by tests)
- 3-step input generation + mutual verification for test cases
- Result: Qwen2.5-7B went from 17.4% to **57.3% LCB**, matching Claude 3.5 Sonnet

**Implementation for v2**:
- Use the `microsoft/rStar-Coder` dataset (already in the dataset-mirror.sh v2 list → 30K samples)
- Bump the allocation to 100K samples (580K are available; the paper used all 580K)
- Train at 32K context with sample packing
- Long reasoning chains fit naturally (avg ~3K tokens/example)

**Expected lift**: +20-25pt on LiveCodeBench v6 alone

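The 32K-context sample-packing step can be sketched as a greedy first-fit packer over example token lengths. This is an illustrative sketch; trainers like Axolotl implement packing internally, and `pack_greedy` is not a real API:

```python
def pack_greedy(lengths: list[int], window: int = 32768) -> list[list[int]]:
    """Greedy first-fit packing: group example indices so each training
    window stays within the 32K-token context budget."""
    bins: list[tuple[int, list[int]]] = []   # (tokens used, example indices)
    for i, n in enumerate(lengths):
        for j, (used, idx) in enumerate(bins):
            if used + n <= window:           # first bin with room wins
                bins[j] = (used + n, idx + [i])
                break
        else:
            bins.append((n, [i]))            # open a new window
    return [idx for _, idx in bins]
```

At ~3K tokens per rStar-Coder example, each 32K window packs roughly ten examples, which is what makes long-CoT SFT affordable.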
### 2. DeepSeek-V3 Multi-Token Prediction (MTP)
**Paper**: [arxiv 2412.19437](https://arxiv.org/html/2412.19437v1)

**What it does**:
- Auxiliary heads predict the tokens 2 and 3 positions ahead
- Maintains the causal chain (sequential prediction)
- Densifies the training signal (more gradients per forward pass)
- Bonus: ~1.8× speculative-decoding speedup at inference

**Implementation for v2**:
- Add MTP heads to LoRA training (custom Axolotl plugin)
- 2 auxiliary heads = 3× signal density
- Discard the heads at inference (or repurpose them for speculative decoding)

**Expected lift**: +3-5% on all coding metrics (Qwen3-Coder used MTP)

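The extra supervision from MTP is just shifted targets: head k at position t is trained to predict the token at t+1+k. A minimal sketch of the target construction (the `mtp_targets` helper is illustrative, not DeepSeek's code):

```python
def mtp_targets(tokens: list[int], depth: int = 2) -> dict[int, list[int]]:
    """Targets per prediction head. Head 0 is the usual next-token head;
    heads 1..depth look one and two tokens further ahead, so each forward
    pass yields (depth + 1) loss terms per position instead of one."""
    return {k: tokens[1 + k:] for k in range(depth + 1)}
```

During training each head's cross-entropy is averaged into the loss; at inference the auxiliary heads are dropped or reused as a draft for speculative decoding.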
### 3. Magpie self-instruct (FREE 1M instructions)
**Paper**: [ICLR 2025](https://github.com/magpie-align/magpie)

**What it does**:
- Prompt an aligned LLM with ONLY the chat-template prefix (no user content)
- Auto-regression takes over → the model generates a user query, then a response
- ZERO API cost beyond GPU hours
- The authors generated 1M-3M pairs from Llama-3-70B in ~600 GPU-hr

**Implementation for v2**:
- Run Magpie on `Qwen2.5-Coder-32B-Instruct` (free via HF Inference or local)
- Generate 1M code-related instructions
- Cost: ~200 GPU-hr of free Lightning quota
- vs. the Claude API for the same volume = $5,000+

**Expected**: 1M extra training samples for FREE

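The whole Magpie trick is where the prompt stops. A sketch of the pre-query template for a ChatML-style model (the exact template is an assumption about the Qwen family; check the tokenizer's chat template before use):

```python
def magpie_prequery(system: str = "You are a helpful coding assistant.") -> str:
    """Stop the prompt right after the user-turn header. Fed to a plain
    completion endpoint, an aligned chat model then *generates* a plausible
    user query itself — that generated text becomes the instruction."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        "<|im_start|>user\n"   # deliberately left empty: the model fills it in
    )
```

Usage: sample a completion, cut it at the first `<|im_end|>` to get the synthetic user query, then prompt the same model normally with that query to get the paired response.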
### 4. DAPO RL (ByteDance/Tsinghua, BEATS GRPO)
**Paper**: [arxiv 2503.14476](https://arxiv.org/abs/2503.14476)

**What it does**:
- Decoupled clip + dynamic sampling + token-level policy gradient loss
- Qwen2.5-32B → 50pt on AIME 2024 (better than GRPO)
- Open-source via the verl framework

**Implementation for v2 Stage 3**:
- Replace GRPO → DAPO in stage3-rlvr.yml
- Same data (SWE-Gym + R2E-Gym + custom DevSecOps)
- verl supports it out of the box

**Expected lift**: +5-8% on SWE-Bench (vs the GRPO baseline)

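The "decoupled clip" part is a one-line change to the PPO surrogate: the clip range becomes asymmetric, `[1 - clip_lower, 1 + clip_higher]`, so high-probability-ratio tokens keep exploring. A per-token sketch matching the `clip_higher: 0.28` / `clip_lower: 0.20` settings in `stage3-dapo.yml`:

```python
def dapo_clip_objective(ratio: float, advantage: float,
                        clip_lower: float = 0.20, clip_higher: float = 0.28) -> float:
    """Pessimistic PPO surrogate with DAPO's asymmetric clip range.
    ratio = pi_theta(token) / pi_old(token); advantage from the rollout group."""
    clipped = min(max(ratio, 1.0 - clip_lower), 1.0 + clip_higher)
    return min(ratio * advantage, clipped * advantage)
```

With a symmetric 0.2 clip this reduces to the standard GRPO/PPO objective; raising only the upper bound is what counters entropy collapse.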
### 5. Mergekit 9-LoRA composition (TIES + DARE)
**Tools**: [mergekit](https://github.com/arcee-ai/mergekit), [PEFT merging](https://huggingface.co/blog/peft_merging)

**What it does**:
- Combines 9 specialized LoRAs into 1 model
- TIES: sign-consensus merge, dropping interfering weights
- DARE: random pruning + rescaling of weight deltas
- DARE-TIES: best for 5+ adapters
- CPU-only, or 8GB VRAM

**Implementation for the end of v2 Phase B**:
- Train 9 LoRAs separately (eng-build, eng-ops, eng-sec, etc.)
- Merge via DARE-TIES into a single super-LoRA
- vLLM serves a single model (no multi-LoRA latency)

**Expected lift**: +2-5% across all domain benchmarks (vs a single LoRA)

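The DARE half of DARE-TIES (Drop And REscale) is simple enough to show directly: randomly zero a fraction p of each adapter's weight deltas and rescale the survivors by 1/(1-p) so the expected delta is unchanged. A toy sketch on a flat list (mergekit and PEFT apply this per-tensor before the TIES sign-consensus step):

```python
import random

def dare(delta: list[float], p: float = 0.5, seed: int = 0) -> list[float]:
    """Drop each delta with probability p; rescale the rest by 1/(1-p)
    so the merge preserves each adapter's expected contribution."""
    rng = random.Random(seed)
    return [0.0 if rng.random() < p else d / (1.0 - p) for d in delta]
```

Sparsifying each adapter this way is what reduces interference when nine deltas are summed into one super-LoRA.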
### 6. XGrammar default decoding (FREE structural correctness)
**Tool**: [XGrammar](https://github.com/mlc-ai/xgrammar) (vLLM's default backend as of 2026-04)

**What it does**:
- Context-free grammar enforcement at decode time
- JSON / regex / custom CFG
- 96-98% structural correctness
- 5× TPOT speedup
- Zero training cost

**Implementation for v2 inference**:
- Already planned. Just enable it: `vllm serve --guided-decoding-backend xgrammar`
- Define grammars per use case:
  - Tool calls: JSON schema
  - Code blocks: Python/Bash/SQL/Terraform/YAML grammars
  - Output structure: Markdown headers

**Expected**: 100% syntax correctness on tool calls + code blocks

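Client-side, a JSON schema reaches XGrammar through the OpenAI-compatible API's `extra_body` (with the `openai` Python client against a vLLM server). A cut-down sketch; the two tool names are placeholders for the 24 tools enumerated in `grammars/tool-call.json`:

```python
TOOL_CALL_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "enum": ["read_file", "run_shell"]},  # placeholders
        "arguments": {"type": "object"},
    },
    "required": ["name", "arguments"],
}

def guided_kwargs(model: str, prompt: str) -> dict:
    """Kwargs for client.chat.completions.create(**guided_kwargs(...)).
    vLLM forwards extra_body.guided_json to the XGrammar backend, which
    masks schema-invalid tokens at every decode step."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"guided_json": TOOL_CALL_SCHEMA},
    }
```

Because enforcement happens in the sampler, the output cannot be malformed JSON regardless of the model's raw token probabilities.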
### 7. NExtLong long-context curriculum (ICML 2025)
**Paper**: arxiv 2501.12766

**What it does**:
- Builds long sequences with HARD negatives interleaved between true chunks
- Synthetic beats human-curated data for long context
- ~10B tokens at full scale (we use a 200M-500M subset)

**Implementation for v2 Stage 1**:
- 60% long context (≥16K): repo concatenation with FIM
- 40% short context
- Hard negatives: similar-but-incorrect code samples interleaved
- NExtLong synthesis via the free LLM ladder

**Expected**: RULER @ 128K from 80 → **88+**

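A toy sketch of the interleaving idea: stretch a document toward the target context length by inserting hard-negative distractor chunks between its true chunks, so attending across long ranges is actually required. Function name and chunking are illustrative, not the paper's pipeline:

```python
def interleave_hard_negatives(doc_chunks: list[str], negatives: list[str]) -> str:
    """NExtLong-style synthesis sketch: real chunks stay in order, and a
    similar-but-wrong distractor is placed after each one, forcing the
    model to track genuine long-range dependencies past the noise."""
    out = []
    for i, chunk in enumerate(doc_chunks):
        out.append(chunk)
        if i < len(negatives):
            out.append(negatives[i])   # hard-negative distractor
    return "\n".join(out)
```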
### 8. Behaviorally Calibrated RL (Dec 2025)
**Paper**: arxiv (Dec 2025) – Qwen3-4B reaches AUC 0.902

**What it does**:
- Trains the model to KNOW when it doesn't know
- Reward = 1 if correct and confident, OR refused and uncertain
- Penalty for confident-wrong answers (TruthRL-style ternary)

**Implementation in v2 Stage 5**:
- Already in the plan via TruthRL
- Add: a behavioral-calibration eval suite
- Target AUC > 0.92 (above the paper's result)

**Expected**: hallucination rate <3% + calibration AUC > 0.92

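The TruthRL-style ternary reward referenced above is small enough to state exactly. A sketch (the helper name is ours, not the paper's):

```python
def ternary_reward(answered: bool, correct: bool) -> float:
    """TruthRL-style ternary: +1 for a correct answer, 0 for an honest
    'I don't know' (abstention is never punished), -1 for a confident
    wrong answer — guessing has negative expected value below ~50% odds."""
    if not answered:
        return 0.0
    return 1.0 if correct else -1.0
```

This asymmetry is what teaches the model to abstain: under this reward, answering only pays when the model's own success probability exceeds one half.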
### 9. Self-play SWE-RL (Together AI DeepSWE)
**Blog**: [Together DeepSWE](https://www.together.ai/blog/deepswe)

**What they did**:
- Generate bugs synthetically
- Train the model to fix them
- Iterative: the model gets better at finding bugs → it trains on harder bugs
- Open recipe at [agentica-project/rllm](https://github.com/agentica-project/rllm)

**Implementation for v2 Stages 4-5 (post Phase B)**:
- Self-play loop: a bug-injector model + a bug-fixer model
- Both start from the Phase B artifact
- They diverge over time

**Expected lift**: SWE-Bench Lite +5-10pp

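The cheapest bug injector needs no model at all: mutate the AST and verify the tests now fail. A toy sketch that flips the first `+` to `-` (in the real loop an LLM proposes subtler bugs, but mutation gives a free curriculum floor):

```python
import ast

class FlipAdd(ast.NodeTransformer):
    """Swap the first '+' for '-' — a minimal synthetic bug."""
    def __init__(self):
        self.done = False
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if not self.done and isinstance(node.op, ast.Add):
            node.op, self.done = ast.Sub(), True
        return node

def inject_bug(src: str) -> str:
    """Return a buggy variant of src; the fixer model trains to undo it,
    with the original test suite as the verifiable reward."""
    return ast.unparse(FlipAdd().visit(ast.parse(src)))
```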
### 10. Stack-Edu / FineWeb-Edu classifier filtering
**Tools**: HuggingFaceTB/stack-edu-classifier-python, fineweb-edu-classifier

**What it does**:
- Scores each code/text sample 1-5 for educational quality
- Train only on samples scoring ≥3 (the Phi-4 method)

**Implementation for the v2 data pipeline**:
- Already in the dedup-decontaminate.py plan
- Apply BEFORE assembling the final SFT mix
- Drop the lowest-quality ~30%

**Expected lift**: +2-3% on HumanEval+ from cleaner data

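The filtering step itself is a plain threshold over classifier scores. A sketch that keeps the scorer pluggable (assumption: any `str -> float` callable wrapping the stack-edu classifier works here; the helper name is ours):

```python
from typing import Callable

def edu_filter(samples: list[str], scorer: Callable[[str], float],
               threshold: float = 3.0) -> list[str]:
    """Keep samples the educational-quality classifier scores >= threshold
    on its 1-5 scale; everything below is dropped before the SFT mix."""
    return [s for s in samples if scorer(s) >= threshold]
```

In the pipeline, `scorer` would batch samples through the HF classifier; the threshold of 3 reproduces the "drop ~30% lowest-quality" cut.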
---

## Compute & cost (NO Anthropic API)

| Item | Cost | Source |
|------|------|--------|
| HF PRO | $9/mo | HuggingFace |
| Wasabi 1 TB | $6/mo | Wasabi |
| Lightning H200 | free 80hr/mo (ashiradevops + ashirapit) | Lightning |
| Anthropic API | **$0** – removed | replaced by the free LLM ladder |
| Synth data gen | $0 | Cerebras qwen-3-235b + Groq llama-3.3-70b free tiers + Magpie self-instruct |
| Extra GPU compute | $0-200 (RunPod spot, only if Lightning is exhausted) | optional |

**Total**: $15/mo + $0-200 one-time (down from the prior $1,700-3,800 estimate).

## v2 Phase Map (revised)

| Phase | Weeks | Output | Cost |
|-------|-------|--------|------|
| **A**: Code+Tool+Agent SFT/DPO | 4 | `surrogate-1-coder-7b-lora-v2-mvp` | $0-200 |
| **A+**: rStar-Coder 100K + Magpie 1M continued SFT | +1 | bigger lift on LCB | free |
| **B**: 9-LoRA cluster expertise (parallel) | 4 | 9 LoRAs | $200-500 (parallel) |
| **B+**: DARE-TIES merge → super-LoRA | 0.5 | 1 merged LoRA | free (CPU) |
| **C**: DAPO RLVR + TruthRL | 2-3 | RL polish | $200-500 |
| **C+**: Self-play SWE-RL bug inject/fix | 1-2 | iterative improvement | free (Lightning) |

**Total: 12-15 weeks / $400-1,200 / no Anthropic API**