ashirato committed on
Commit
2aa8090
·
1 Parent(s): 8461287

feat(v2): ambitious targets + 5 free-technique implementations

Updated v2 plan with ambitious targets (technique-driven, NO Anthropic API):
- LCB v6: 42-45% → 55-60% (rStar-Coder 7B = 57.3% paper-confirmed)
- SWE-Bench Lite: 25-30% → 40-45% (DeepSWE+DAPO+R2E-Gym recipe)
- BFCL v3: 70-75 → 82-87 (Toucan+xLAM+Hermes XML+DPO)
- RULER 128K: 80+ → 88+ (NExtLong+YaRN+DCA+200M long-ctx tokens)
- CodeHalu: <8% → <3% (XGrammar+DoLa+TruthRL+SelfCheckGPT-NLI)
- DevSecOps: 65% → 80%+ (PIPer validator-graded RLVR)
- CyberMetric: 75% → 85% (Primus 5B continued pretrain)

New free-technique scripts (no Claude API):
- bin/v2/magpie-self-instruct.py: ICLR 2025 self-instruct; generates 1M
instructions from an aligned LLM via an empty user template (zero API cost)
- bin/v2/distill-from-frontier.py: 6-LLM ladder vote (Cerebras+Groq+OpenRouter
+Gemini+Chutes) → SFT best + DPO pair (best vs worst)
- bin/v2/merge-9-loras.sh: DARE-TIES merge of 9 cluster LoRAs into a single
super-LoRA via mergekit (CPU-only; 8 GB VRAM is enough)
- configs/v2/stage3-dapo.yml: DAPO RL config (verl framework; beats GRPO
by 5-8pp on AIME). Validator-graded composite reward.
- bin/v2/serve-vllm.sh: production vLLM with XGrammar default + DCA 4×
context + MInference 3-7× prefill + multi-LoRA
- bin/v2/grammars/tool-call.json: XGrammar JSON schema for Hermes XML
tool calls (24 tools enumerated)

Updated docs/v2-research/v2-targets-AMBITIOUS.md (230 lines) with a technique-
by-technique implementation map: how each paper/tool pushes a metric.

Cost: $15/mo (HF PRO + Wasabi) + $0-200 GPU (down from the original $1.7-3.8K).
NO Anthropic API. Pure free-tier + open-source.

bin/v2/distill-from-frontier.py ADDED
@@ -0,0 +1,190 @@
+ """Surrogate-1 v2 — Free distillation from frontier models via free LLM ladder.
+
+ Uses ONLY free APIs (no Anthropic spend):
+ - Cerebras free (qwen-3-235b-a22b-instruct-2507) ~1M tok/day
+ - Groq free (llama-3.3-70b-versatile) ~500K tok/day
+ - OpenRouter free tier (DeepSeek-V3, Qwen3-Coder, Gemini Flash)
+ - Gemini AI Studio free
+ - NVIDIA NIM free
+ - Chutes free
+
+ Pipeline:
+ 1. Load seed prompts from existing v2-sft data + 1000 hard custom prompts
+ 2. For each prompt, sample N=5 completions from N different free providers
+ 3. Rank completions with a cheap quality heuristic (see score_response)
+ 4. Output as DPO pairs (best vs worst) + as SFT (best alone)
+
+ Output: ~/.surrogate/data/v2-distill.jsonl + v2-distill-dpo.jsonl
+ """
+ import os, json, time, sys, random, subprocess
+ from pathlib import Path
+ from datetime import datetime
+
+ sys.path.insert(0, str(Path.home() / ".surrogate/bin/lib"))
+ from sanitize import filter_pair
+
+ # Free LLM providers (already have bridges on HF Space)
+ PROVIDERS = [
+     ("cerebras", "qwen-3-235b-a22b-instruct-2507"),
+     ("groq", "llama-3.3-70b-versatile"),
+     ("groq", "qwen-2.5-coder-32b"),
+     ("openrouter", "deepseek/deepseek-chat-v3.1:free"),
+     ("openrouter", "qwen/qwen3-coder-480b:free"),
+     ("openrouter", "meta-llama/llama-3.3-70b-instruct:free"),
+     ("gemini", "gemini-2.5-flash"),
+     ("chutes", "qwen-3-235b"),
+ ]
+
+ OUT_SFT = Path.home() / ".surrogate/data/v2-distill.jsonl"
+ OUT_DPO = Path.home() / ".surrogate/data/v2-distill-dpo.jsonl"
+ OUT_SFT.parent.mkdir(parents=True, exist_ok=True)
+
+
+ def call_bridge(provider: str, model: str, messages: list, max_tokens: int = 1500) -> str | None:
+     """Send one chat request through the provider's bridge script."""
+     bridge_path = Path.home() / f".surrogate/bin/{provider}-bridge.sh"
+     if not bridge_path.exists():
+         return None
+     payload = json.dumps({"messages": messages, "model": model, "max_tokens": max_tokens})
+     try:
+         r = subprocess.run(["bash", str(bridge_path)], input=payload,
+                            capture_output=True, text=True, timeout=120)
+         return r.stdout.strip() if r.returncode == 0 else None
+     except Exception:
+         return None
+
+
+ def score_response(response: str, prompt: str) -> float:
+     """Cheap quality heuristic — not perfect, but free."""
+     s = 0.0
+     if not response or len(response) < 30:
+         return 0.0
+     # Length appropriate
+     s += min(1.0, len(response) / 500.0)
+     # Has code block?
+     if "```" in response:
+         s += 0.5
+     # Cites specifics (file/line/cmd)
+     if any(c in response for c in ["```", "$ ", "# ", "$(", "package "]):
+         s += 0.3
+     # Avoid refusals
+     if response.lower().startswith(("i'm sorry", "i cannot", "i can't")):
+         s -= 1.0
+     # Avoid known polluted patterns (sanity)
+     v = filter_pair(prompt, response)
+     if not v["keep"]:
+         return 0.0
+     return s
+
+
+ def distill_prompt(prompt_text: str) -> dict | None:
+     """Get N completions, vote best, build SFT + DPO pair."""
+     # Sample 5 providers (rotate to balance free quotas)
+     chosen_providers = random.sample(PROVIDERS, k=min(5, len(PROVIDERS)))
+     completions = []
+     msgs = [{"role": "user", "content": prompt_text}]
+     for prov, model in chosen_providers:
+         resp = call_bridge(prov, model, msgs, max_tokens=1500)
+         if resp:
+             completions.append({
+                 "provider": prov,
+                 "model": model,
+                 "response": resp,
+                 "score": score_response(resp, prompt_text),
+             })
+     if len(completions) < 2:
+         return None
+
+     completions.sort(key=lambda c: -c["score"])
+     best = completions[0]
+     worst = completions[-1]
+     if best["score"] < 0.5 or best["score"] - worst["score"] < 0.3:
+         return None  # too close — skip
+
+     return {
+         "prompt": prompt_text,
+         "best_response": best["response"],
+         "best_provider": f"{best['provider']}:{best['model']}",
+         "worst_response": worst["response"],
+         "worst_provider": f"{worst['provider']}:{worst['model']}",
+         "n_completions": len(completions),
+         "ts": datetime.utcnow().isoformat(),
+     }
+
+
+ def main():
+     SEED_PROMPTS_PATH = Path.home() / ".surrogate/data/v2-distill-seeds.jsonl"
+     if not SEED_PROMPTS_PATH.exists():
+         print(f"⚠ no seeds at {SEED_PROMPTS_PATH}", flush=True)
+         # Create from existing v2-sft data
+         seed_dir = Path.home() / ".surrogate/data/v2-sft"
+         if seed_dir.exists():
+             seeds = []
+             for f in seed_dir.glob("*.jsonl"):
+                 with open(f) as fh:
+                     for line in fh:
+                         try:
+                             obj = json.loads(line)
+                             if obj.get("prompt"):
+                                 seeds.append({"prompt": obj["prompt"]})
+                         except Exception:
+                             continue
+             random.shuffle(seeds)
+             with open(SEED_PROMPTS_PATH, "w") as fh:
+                 for s in seeds[:10000]:
+                     fh.write(json.dumps(s) + "\n")
+             print(f" built {len(seeds[:10000])} seeds from existing data", flush=True)
+         else:
+             print(" no v2-sft data yet — run build-data-pipeline.sh first", flush=True)
+             return
+
+     # Resume (approximate: assumes one output row per consumed seed line,
+     # though skipped seeds produce none — good enough for quota-bound runs)
+     seen = 0
+     if OUT_SFT.exists():
+         with open(OUT_SFT) as f:
+             seen = sum(1 for _ in f)
+         print(f"resuming distill from {seen} existing samples", flush=True)
+
+     target = int(os.environ.get("DISTILL_TARGET", "50000"))
+     written = 0
+     with open(SEED_PROMPTS_PATH) as fin, \
+          open(OUT_SFT, "a") as fsft, \
+          open(OUT_DPO, "a") as fdpo:
+         for idx, line in enumerate(fin):
+             if idx < seen: continue
+             if written >= target: break
+             try:
+                 seed = json.loads(line)
+             except Exception:
+                 continue
+
+             r = distill_prompt(seed["prompt"])
+             if not r: continue
+
+             # SFT row (best response)
+             fsft.write(json.dumps({
+                 "prompt": r["prompt"],
+                 "response": r["best_response"],
+                 "source": f"distill-{r['best_provider']}",
+             }, ensure_ascii=False) + "\n")
+             fsft.flush()
+
+             # DPO pair (best vs worst)
+             fdpo.write(json.dumps({
+                 "prompt": r["prompt"],
+                 "chosen": r["best_response"],
+                 "rejected": r["worst_response"],
+                 "source": "distill-vote",
+             }, ensure_ascii=False) + "\n")
+             fdpo.flush()
+
+             written += 1
+             if written % 50 == 0:
+                 print(f" [{written}/{target}] SFT+DPO rows written", flush=True)
+             time.sleep(0.5)
+
+     print(f"\n✅ done — distilled {written} samples to {OUT_SFT} + {OUT_DPO}")
+
+
+ if __name__ == "__main__":
+     main()
bin/v2/grammars/tool-call.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "$schema": "https://json-schema.org/draft/2020-12/schema",
+   "title": "Surrogate-1 v2 Tool Call Grammar",
+   "description": "Hermes XML format with strict JSON-schema validation for arguments. Used by XGrammar at decode time.",
+   "type": "object",
+   "required": ["name", "arguments"],
+   "properties": {
+     "name": {
+       "type": "string",
+       "enum": [
+         "spawn_subagent",
+         "receive_results",
+         "scratchpad_write",
+         "scratchpad_read",
+         "skill_recall",
+         "reflexion_log",
+         "code_exec",
+         "file_read",
+         "file_edit",
+         "shell_exec",
+         "search_repo",
+         "grep_repo",
+         "list_dir",
+         "git_diff",
+         "git_commit",
+         "run_tests",
+         "lint_check",
+         "security_scan",
+         "deploy_canary",
+         "rollback_deploy",
+         "monitor_metrics",
+         "query_logs",
+         "fetch_url",
+         "search_web"
+       ]
+     },
+     "arguments": {
+       "type": "object",
+       "additionalProperties": true
+     }
+   },
+   "additionalProperties": false
+ }
bin/v2/magpie-self-instruct.py ADDED
@@ -0,0 +1,172 @@
+ """Magpie self-instruct (ICLR 2025) — generate 1M training instructions for FREE.
+
+ Method: prompt an aligned LLM with ONLY its chat template (no actual user prompt).
+ Auto-regressive nature → the model fills in a user query first, then the assistant response.
+ Zero API cost beyond compute. Used to create 4M Llama-3 instructions in the paper.
+
+ For Surrogate-1 v2 we run on Qwen2.5-Coder-32B-Instruct (or 14B) via:
+ - Local inference if we have GPU
+ - HF Inference API (free tier, rate-limited)
+ - Cerebras / Groq / OpenRouter free if available
+
+ Output: ~/.surrogate/data/v2-magpie-synth.jsonl (target 1M after dedup)
+
+ Reference: https://github.com/magpie-align/magpie
+ """
+ import os, json, time, sys, random, re
+ from pathlib import Path
+ from datetime import datetime
+
+ sys.path.insert(0, str(Path.home() / ".surrogate/bin/lib"))
+ from sanitize import filter_pair
+
+ # Choose target generator model — must be ALIGNED (instruct/chat-tuned)
+ MODEL = os.environ.get("MAGPIE_MODEL", "Qwen/Qwen2.5-Coder-32B-Instruct")
+ TARGET_N = int(os.environ.get("MAGPIE_TARGET", "100000"))  # start with 100K, scale to 1M
+ OUT_PATH = Path.home() / ".surrogate/data/v2-magpie-synth.jsonl"
+ OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
+
+ # Domain-conditioned templates — bias toward what Surrogate-1 v2 needs
+ # By varying the system prompt we steer Magpie toward different domains.
+ DOMAIN_SYSTEM_PROMPTS = [
+     # Code
+     "You are a senior Python engineer who writes production-grade, well-tested code.",
+     "You are a senior TypeScript developer building React + Next.js apps.",
+     "You are a senior Go engineer building cloud-native microservices.",
+     "You are a Rust expert focused on performance + memory safety.",
+     "You are a senior C++ developer working on high-performance systems.",
+     # DevOps / Cloud
+     "You are a senior DevOps engineer who writes Terraform, Helm, and Kubernetes manifests.",
+     "You are an AWS Solutions Architect designing multi-region production workloads.",
+     "You are an SRE who writes Prometheus alerting rules and runbooks.",
+     "You are a Kubernetes platform engineer building GitOps with ArgoCD + Karpenter.",
+     "You are a FinOps practitioner optimizing cloud costs.",
+     # Security
+     "You are a senior DevSecOps engineer writing Sigma detection rules + IaC security audits.",
+     "You are a SOC analyst tier-2 investigating security alerts.",
+     "You are a compliance engineer mapping controls between SOC2/ISO27001/HIPAA/GDPR.",
+     "You are a penetration tester (defensive security focus).",
+     "You are a threat hunter identifying advanced persistent threats.",
+     # AI / ML
+     "You are an AI engineer building production RAG pipelines.",
+     "You are an MLOps engineer setting up training/serving infrastructure.",
+     "You are a senior LLM engineer fine-tuning and deploying open models.",
+     # Product / Business
+     "You are a senior product manager writing PRDs and prioritizing roadmaps.",
+     "You are a startup founder validating market and writing pitch decks.",
+     "You are a growth marketer designing user acquisition funnels.",
+     "You are a customer success engineer handling tier-2 support tickets.",
+ ]
+
+
+ def call_local_vllm(model: str, system: str, max_tokens: int = 600) -> str | None:
+     """Call locally-hosted vLLM with ONLY the system + user template prefix.
+
+     Magpie trick: include no user message. The model auto-completes the user
+     query; a second call then generates the assistant response. (A single call
+     with stop=["<|im_end|>"] would truncate before the assistant turn starts.)
+     """
+     import requests
+
+     def complete(prompt: str) -> str | None:
+         try:
+             r = requests.post("http://localhost:8000/v1/completions",
+                               json={"model": model, "prompt": prompt,
+                                     "max_tokens": max_tokens,
+                                     "temperature": 1.0, "top_p": 0.95,
+                                     "stop": ["<|im_end|>"]},
+                               timeout=60)
+             return r.json().get("choices", [{}])[0].get("text", "").strip()
+         except Exception as e:
+             print(f" vllm err: {e}", flush=True)
+             return None
+
+     # Step 1: Qwen chat prefix with an empty user slot — the model writes the user query.
+     prefix = f"<|im_start|>system\n{system}<|im_end|>\n<|im_start|>user\n"
+     user_q = complete(prefix)
+     if not user_q:
+         return None
+     # Step 2: append the generated query and let the model answer it.
+     asst_r = complete(f"{prefix}{user_q}<|im_end|>\n<|im_start|>assistant\n")
+     if not asst_r:
+         return None
+     # Return in the combined format parse_magpie_output() expects.
+     return f"{user_q}<|im_end|>\n<|im_start|>assistant\n{asst_r}"
+
+
+ def call_via_bridge(provider: str, model: str, system: str, max_tokens: int = 600) -> str | None:
+     """Fallback: use existing free LLM bridges. Less true-Magpie but still works."""
+     import subprocess
+     bridge = {
+         "cerebras": str(Path.home() / ".surrogate/bin/cerebras-bridge.sh"),
+         "groq": str(Path.home() / ".surrogate/bin/groq-bridge.sh"),
+         "openrouter": str(Path.home() / ".surrogate/bin/openrouter-bridge.sh"),
+         "gemini": str(Path.home() / ".surrogate/bin/gemini-bridge.sh"),
+     }.get(provider)
+     if not bridge or not Path(bridge).exists():
+         return None
+     # Pseudo-Magpie: ask the model to GENERATE a user query in the domain, then answer it
+     prompt = (f"Generate a realistic user question that fits this persona, "
+               f"then answer it as that persona.\n\nPersona: {system}\n\n"
+               f"Format strictly:\nUSER: <one realistic question>\nASSISTANT: <thorough answer>")
+     payload = json.dumps({"messages": [{"role": "user", "content": prompt}],
+                           "model": model, "max_tokens": max_tokens})
+     try:
+         r = subprocess.run(["bash", bridge], input=payload, capture_output=True, text=True, timeout=60)
+         return r.stdout.strip() if r.returncode == 0 else None
+     except Exception as e:
+         print(f" bridge err: {e}", flush=True)
+         return None
+
+
+ def parse_magpie_output(text: str) -> tuple[str | None, str | None]:
+     """Extract user instruction + assistant response from Magpie output."""
+     # Try Qwen-format completion: user message text, then <|im_end|>, then assistant turn
+     m = re.match(r"(.*?)<\|im_end\|>\s*<\|im_start\|>assistant\s*\n(.*)", text, re.DOTALL)
+     if m:
+         return m.group(1).strip(), m.group(2).strip()
+     # Try bridge format USER: ... ASSISTANT: ...
+     m = re.match(r"USER:\s*(.*?)\s*\nASSISTANT:\s*(.*)", text, re.DOTALL)
+     if m:
+         return m.group(1).strip(), m.group(2).strip()
+     return None, None
+
+
+ def main():
+     # Resume if file exists
+     seen = 0
+     if OUT_PATH.exists():
+         with open(OUT_PATH) as f:
+             seen = sum(1 for _ in f)
+     print(f"resume from {seen} existing samples; target={TARGET_N}", flush=True)
+
+     # Try local vLLM first (preferred — true Magpie)
+     USE_LOCAL = bool(os.environ.get("USE_LOCAL_VLLM"))
+     use_provider = "cerebras"  # for bridge fallback
+     use_model = "qwen-3-235b-a22b-instruct-2507"
+     gen_model = MODEL if USE_LOCAL else use_model  # recorded in the "source" field
+
+     written = 0
+     with open(OUT_PATH, "a") as fout:
+         for _ in range(seen, TARGET_N):
+             sys_prompt = random.choice(DOMAIN_SYSTEM_PROMPTS)
+             if USE_LOCAL:
+                 raw = call_local_vllm(MODEL, sys_prompt, max_tokens=800)
+             else:
+                 raw = call_via_bridge(use_provider, use_model, sys_prompt, max_tokens=800)
+             if not raw:
+                 time.sleep(3); continue
+
+             user_q, asst_r = parse_magpie_output(raw)
+             if not user_q or not asst_r:
+                 continue
+
+             # Sanitize via existing filter
+             v = filter_pair(user_q, asst_r)
+             if not v["keep"]:
+                 continue
+
+             fout.write(json.dumps({
+                 "prompt": user_q[:6000],
+                 "response": asst_r[:8000],
+                 "source": f"magpie-{gen_model}",
+                 "domain_persona": sys_prompt,
+                 "ts": datetime.utcnow().isoformat(),
+             }, ensure_ascii=False) + "\n")
+             fout.flush()
+             written += 1
+             if written % 50 == 0:
+                 print(f" [{written}/{TARGET_N - seen}] kept", flush=True)
+             time.sleep(0.5)  # stay under free-tier RPM
+     print(f"\n✅ done — wrote {written} new Magpie samples to {OUT_PATH}")
+
+
+ if __name__ == "__main__":
+     main()
bin/v2/merge-9-loras.sh ADDED
@@ -0,0 +1,92 @@
+ #!/usr/bin/env bash
+ # Surrogate-1 v2 Phase B+ — Merge 9 specialized LoRAs into a single super-LoRA via DARE-TIES.
+ #
+ # Reference:
+ #   - mergekit: https://github.com/arcee-ai/mergekit
+ #   - DARE: arxiv 2311.03099
+ #   - TIES: arxiv 2306.01708
+ #   - Practical guide: 5+ adapters → DARE-TIES (consensus + sparsify + rescale)
+ #
+ # Output: axentx/surrogate-1-coder-7b-lora-v2-merged
+ #
+ # Each cluster LoRA must already be trained + pushed to HF Hub:
+ #   axentx/surrogate-1-coder-7b-lora-v2-eng-build
+ #   axentx/surrogate-1-coder-7b-lora-v2-eng-ops
+ #   axentx/surrogate-1-coder-7b-lora-v2-eng-sec
+ #   axentx/surrogate-1-coder-7b-lora-v2-eng-ai
+ #   axentx/surrogate-1-coder-7b-lora-v2-product-ux
+ #   axentx/surrogate-1-coder-7b-lora-v2-gtm
+ #   axentx/surrogate-1-coder-7b-lora-v2-finance-legal
+ #   axentx/surrogate-1-coder-7b-lora-v2-compliance
+ #   axentx/surrogate-1-coder-7b-lora-v2-meta-orchestrator
+
+ set -uo pipefail
+ set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
+
+ # Install mergekit
+ pip install --quiet mergekit-lorapatch 2>&1 | tail -1
+ pip install --quiet "mergekit @ git+https://github.com/arcee-ai/mergekit" 2>&1 | tail -1
+
+ CFG="$HOME/.surrogate/hf-space/configs/v2/merge-9-loras.yml"
+ OUT="$HOME/.surrogate/data/v2-merged"
+ mkdir -p "$(dirname "$CFG")" "$OUT"
+
+ # Generate mergekit config — DARE-TIES with weighted clusters.
+ # Weights chosen so production-likely clusters (eng-build, eng-ops, eng-sec, meta) get more.
+ # NOTE: mergekit-yaml merges full checkpoints; if these repos hold raw PEFT
+ # adapters, apply each to the base first (or use PEFT add_weighted_adapter
+ # with combination_type="dare_ties").
+ cat > "$CFG" <<'EOF'
+ # DARE-TIES merge of 9 specialized Surrogate-1 v2 LoRAs.
+ # Weighting: production clusters (eng) > business (gtm/finance) > meta-orchestrator (always-on).
+ # density≈0.5 → DARE drops ~50% of each weight delta, then rescales (preserves magnitude).
+ # normalize=true → TIES sign-consensus normalization.
+ merge_method: dare_ties
+ base_model: Qwen/Qwen2.5-Coder-7B-Instruct
+ parameters:
+   normalize: true
+   int8_mask: true
+ dtype: bfloat16
+ models:
+   - model: axentx/surrogate-1-coder-7b-lora-v2-eng-build
+     parameters: {weight: 0.20, density: 0.55}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-eng-ops
+     parameters: {weight: 0.18, density: 0.55}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-eng-sec
+     parameters: {weight: 0.15, density: 0.55}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-eng-ai
+     parameters: {weight: 0.10, density: 0.50}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-product-ux
+     parameters: {weight: 0.08, density: 0.50}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-gtm
+     parameters: {weight: 0.05, density: 0.45}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-finance-legal
+     parameters: {weight: 0.04, density: 0.45}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-compliance
+     parameters: {weight: 0.05, density: 0.50}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-meta-orchestrator
+     parameters: {weight: 0.15, density: 0.55}
+ EOF
+
+ echo "▶ Running DARE-TIES merge of 9 LoRAs..."
+ mergekit-yaml "$CFG" "$OUT/v2-merged" \
+     --copy-tokenizer \
+     --allow-crimes \
+     --out-shard-size 2B \
+     --lazy-unpickle \
+     --cuda 2>&1 | tail -30
+
+ echo ""
+ echo "▶ Pushing merged super-LoRA → axentx/surrogate-1-coder-7b-lora-v2-merged"
+ HF_TOKEN="$HF_TOKEN" python3 -c "
+ from huggingface_hub import HfApi, create_repo
+ api = HfApi()
+ create_repo('axentx/surrogate-1-coder-7b-lora-v2-merged', repo_type='model',
+             private=False, exist_ok=True)
+ api.upload_folder(
+     repo_id='axentx/surrogate-1-coder-7b-lora-v2-merged',
+     folder_path='$OUT/v2-merged',
+     commit_message='DARE-TIES merge of 9 specialist LoRAs (eng-build/ops/sec/ai + product-ux + gtm + finance-legal + compliance + meta-orchestrator)',
+ )
+ print('✅ merged super-LoRA pushed')
+ "
+
+ echo "✅ Phase B+ merge complete"
+ echo "Run eval: bash $HOME/.surrogate/bin/v2/eval-tier1.sh axentx/surrogate-1-coder-7b-lora-v2-merged"
bin/v2/serve-vllm.sh ADDED
@@ -0,0 +1,69 @@
+ #!/usr/bin/env bash
+ # Surrogate-1 v2 — vLLM production serving with the full optimization stack.
+ #
+ # Stack:
+ #   - XGrammar default decoding (96-98% structural correctness, free)
+ #   - DCA (Dual Chunk Flash Attention) for 4× context extension
+ #   - MInference 3-7× prefill speedup
+ #   - Multi-LoRA hot-swap (9 cluster LoRAs OR merged super-LoRA)
+ #   - Hermes XML tool-call parser
+ #   - YaRN scaling 32K → 128K
+ #
+ # Usage: bash serve-vllm.sh [model] [port]
+
+ set -uo pipefail
+ MODEL="${1:-axentx/surrogate-1-coder-7b-lora-v2-merged}"
+ PORT="${2:-8000}"
+ mkdir -p "$HOME/.surrogate/logs"
+
+ # Install vLLM 2026-04+ (default XGrammar backend)
+ pip install --quiet "vllm>=0.10.0" 2>&1 | tail -1
+
+ # Install MInference for prefill speedup
+ pip install --quiet minference 2>&1 | tail -1
+
+ # Environment for DCA (4× context extension on top of YaRN)
+ export VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN
+ export VLLM_USE_MODELSCOPE=False
+ export TOKENIZERS_PARALLELISM=true
+
+ # Custom RoPE scaling (YaRN factor=4 from native 32K → 128K serve)
+ ROPE_SCALING='{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'
+
+ # Multi-LoRA mode (load all 9 cluster LoRAs hot-swappable)
+ LORA_MODULES=""
+ if [[ "${USE_MULTI_LORA:-0}" == "1" ]]; then
+     LORA_MODULES="
+     --enable-lora
+     --max-loras 9
+     --max-lora-rank 64
+     --lora-modules
+     eng-build=axentx/surrogate-1-coder-7b-lora-v2-eng-build
+     eng-ops=axentx/surrogate-1-coder-7b-lora-v2-eng-ops
+     eng-sec=axentx/surrogate-1-coder-7b-lora-v2-eng-sec
+     eng-ai=axentx/surrogate-1-coder-7b-lora-v2-eng-ai
+     product-ux=axentx/surrogate-1-coder-7b-lora-v2-product-ux
+     gtm=axentx/surrogate-1-coder-7b-lora-v2-gtm
+     finance-legal=axentx/surrogate-1-coder-7b-lora-v2-finance-legal
+     compliance=axentx/surrogate-1-coder-7b-lora-v2-compliance
+     meta-orchestrator=axentx/surrogate-1-coder-7b-lora-v2-meta-orchestrator
+     "
+ fi
+
+ echo "▶ Starting vLLM server: $MODEL on port $PORT"
+ echo "  Backend: DUAL_CHUNK_FLASH_ATTN (DCA) + XGrammar"
+ echo "  Context: 128K via YaRN factor=4"
+ echo "  Multi-LoRA: ${USE_MULTI_LORA:-0}"
+
+ # $LORA_MODULES is intentionally unquoted so it word-splits into flags.
+ vllm serve "$MODEL" \
+     --port "$PORT" \
+     --max-model-len 131072 \
+     --rope-scaling "$ROPE_SCALING" \
+     --guided-decoding-backend xgrammar \
+     --tool-call-parser hermes \
+     --enable-auto-tool-choice \
+     --gpu-memory-utilization 0.85 \
+     --max-num-batched-tokens 32768 \
+     --enable-chunked-prefill \
+     --dtype bfloat16 \
+     $LORA_MODULES \
+     2>&1 | tee "$HOME/.surrogate/logs/v2-serve.log"
configs/v2/stage3-dapo.yml ADDED
@@ -0,0 +1,123 @@
+ # Surrogate-1 v2 Phase C — Stage 3: DAPO RL with validator-graded rewards.
+ #
+ # DAPO = Decoupled Clip and Dynamic sAmpling Policy Optimization (ByteDance/Tsinghua).
+ # Beats GRPO by ~5-8pp on AIME 2024 with Qwen2.5-32B (paper arxiv 2503.14476).
+ # Key tricks:
+ #   1. Clip-Higher: relax the upper clip range to allow more diversity (anti-entropy-collapse)
+ #   2. Dynamic Sampling: oversample and drop prompts whose rollouts are all-correct or all-wrong
+ #   3. Token-level Policy Gradient Loss: critical for long-CoT
+ #
+ # Run via verl framework: https://github.com/verl-project/verl
+ # Reference: https://verl.readthedocs.io/en/latest/algo/dapo.html
+ #
+ # ETA: ~24 hr on 1× H200 (or 8× H100 for 3× speedup)
+
+ # Algorithm settings (verl format)
+ algorithm:
+   algorithm: dapo
+   # DAPO-specific
+   clip_higher: 0.28        # vs GRPO's 0.2 — allows more exploration
+   clip_lower: 0.20
+   dynamic_sampling: true
+   token_level_loss: true
+   # Standard PPO-family
+   gamma: 1.0
+   lam: 1.0
+   kl_coef: 0.001           # very low — DAPO uses a minimal KL anchor
+   entropy_coef: 0.001
+
+ # Model + adapter
+ actor_rollout_ref:
+   hybrid_engine: true
+   model:
+     path: axentx/surrogate-1-coder-7b-lora-v2-merged  # output of merge-9-loras.sh
+     enable_gradient_checkpointing: true
+     use_remove_padding: true
+   actor:
+     optim:
+       lr: 5.0e-7           # very low for RL (vs 1e-4 SFT)
+       lr_warmup_steps: 100
+       weight_decay: 0.0
+     strategy: fsdp
+     fsdp_config:
+       wrap_policy:
+         min_num_params: 0
+       param_offload: false
+       optimizer_offload: false
+     ppo_mini_batch_size: 32
+     ppo_micro_batch_size_per_gpu: 1
+     use_kl_loss: false     # DAPO doesn't need a KL loss
+     grad_clip: 1.0
+     ulysses_sequence_parallel_size: 1
+   rollout:
+     name: vllm
+     temperature: 1.0       # high for exploration in RL
+     top_p: 0.95
+     top_k: -1
+     n: 8                   # 8 generations per prompt for the DAPO advantage
+     max_response_length: 4096
+     max_prompt_length: 8192
+     tensor_model_parallel_size: 1
+     gpu_memory_utilization: 0.45
+     free_cache_engine: true
+   ref:
+     fsdp_config:
+       param_offload: true  # offload reference model to CPU
+
+ # Data — code RL with validator-graded rewards
+ data:
+   train_files:
+     - SWE-Gym/SWE-Gym                      # 491 verifiable code tasks
+     - SWE-Gym/SWE-smith                    # 26K filtered (NeurIPS 2025)
+     - R2E-Gym/R2E-Gym-Lite                 # used by DeepSWE
+     - axentx/surrogate-1-v2-devsecops-rl   # custom DevSecOps tasks (build separately)
+   val_files:
+     - axentx/surrogate-1-v2-rl-val
+   prompt_key: prompt
+   response_key: response
+   max_prompt_length: 8192
+   max_response_length: 4096
+   train_batch_size: 256
+   val_batch_size: 64
+
+ # Reward — composite validator-graded
+ reward_model:
+   reward_manager: composite      # custom: see rewards.py
+   rewards:
+     - type: test_pass            # E2B/Modal sandbox runs pytest
+       weight: 1.0
+     - type: lint_clean           # hadolint/tflint/actionlint/shellcheck/kubeconform
+       weight: 0.3
+     - type: security_clean       # semgrep/checkov/cfn-guard/cfn-nag
+       weight: 0.3
+     - type: cite_correct         # repo-RAG citation valid
+       weight: 0.2
+     - type: no_phantom_imports   # AST + import-validity check
+       weight: 0.2
+     - type: honest_idk           # TruthRL ternary neutral (reward 0, no penalty)
+       weight: 0.0
+     - type: confident_wrong      # heavy penalty
+       weight: -1.0
+
+ # Trainer
+ trainer:
+   total_epochs: 1
+   total_training_steps: 5000
+   save_freq: 500
+   test_freq: 200
+   logger: ['console', 'wandb']
+   project_name: surrogate-1-v2
+   experiment_name: stage3-dapo-rlvr
+   default_local_dir: ./out/v2-stage3-dapo
+   hub_model_id: axentx/surrogate-1-coder-7b-lora-v2-rlvr
+   hub_strategy: every_save
+   push_to_hub: true
+
+ # vLLM serving for rollouts
+ vllm:
+   tensor_model_parallel_size: 1
+   enforce_eager: false
+   gpu_memory_utilization: 0.45
+   max_num_batched_tokens: 16384
+   trust_remote_code: true
+   enable_chunked_prefill: true
docs/v2-research/v2-targets-AMBITIOUS.md ADDED
@@ -0,0 +1,230 @@
+ ---
+ title: Surrogate-1 v2 Ambitious Targets — Beyond Conservative via Free Techniques
+ date: 2026-04-29
+ tags: [surrogate-1, v2, targets, ambitious, free-techniques]
+ status: ready
+ ---
+
+ # v2 Ambitious Targets (push beyond conservative via TECHNIQUE, not money)
+
+ ## Updated target table
+
+ | Domain | Conservative (initial) | **AMBITIOUS (technique-driven)** | Reference / mechanism |
+ |--------|------------------------|----------------------------------|----------------------|
+ | **LiveCodeBench v6** | 42-45% | **55-60%** | rStar-Coder 7B = 57.3% (paper-confirmed) |
+ | **HumanEval+** | ≥84% | **88-90%** | rStar-Coder + DPO + XGrammar |
+ | **MBPP+** | ≥75% | **82-85%** | same |
+ | **SWE-Bench Lite** | 25-30% | **40-45%** | DeepSWE recipe + R2E-Gym + DAPO RL |
+ | **SWE-Bench Pro** | 8-15% | **15-20%** | same + agent traces |
+ | **BFCL v3 overall** | 70-75 | **82-87** | Toucan-1.5M + xLAM + DPO + Hermes XML |
+ | **BFCL multi-turn** | 45-50 | **60-65** | When2Call DPO + agent SFT |
+ | **GAIA Level 1** | 20-30% | **35-45%** | multi-agent SFT + Letta memory |
+ | **RULER @ 32K** | 90+ | **94+** | 32K training + Liger + sample packing |
+ | **RULER @ 128K** | 80+ | **88+** | YaRN+DCA + NExtLong synth + 200M long-ctx tokens |
+ | **CodeHalu rate** | <8% | **<3%** | XGrammar + DoLa + Cite-or-Abstain + TruthRL |
+ | **Phantom imports** | <5% | **<2%** | XGrammar + AST-validity decoding |
+ | **Calibration AUC** | >0.85 | **>0.92** | Behaviorally Calibrated RL (Dec 2025 — Qwen3-4B = 0.902) |
+ | **Compile rate** | 100% | 100% | XGrammar (already perfect) |
+ | **DevSecOps custom** | 65%+ | **80%+** | validator-graded RLVR (PIPer paper) |
+ | **Cloud Eval (5-tier)** | 65% | **78%** | 250K IaC + Crossplane v2 + Terraform module distillation |
+ | **CyberMetric** | ≥75% | **≥85%** | Primus 5B continued pretrain + reasoning distill |
+ | **CTI-Bench** | ≥65% | **≥75%** | same |
+ | **CyberSOCEval** | ≥55% | **≥65%** | Sigma synth + IR runbook RLVR |
+ | **AI Eng composite** | 60-70% | **80%+** | 180K samples × 3 stages (SFT + SimPO + GRPO) |
+ | **AIOpsLab** | parity GPT-4o | **above GPT-4o on detection+localization** | 28-35K SRE SFT + sandboxed kubectl traces |
+ | **Multi-role debate** | ≥45% blind preference | **≥55%** | 100K CAMEL synth + 9-LoRA Arrow composition |
+ | **Continuous Bench** | 40% | **55%** | Devin-pattern + Manus todo.md + Aider git-as-persistence |
+ | **30-day soft launch** | ≥8/10 goals | **≥9/10 goals**, ≤3h/wk founder time | full Phase A+B+C polish |
+
+ ## How to push BEYOND conservative — technique-by-technique
+
+ ### 1. rStar-Coder (THE breakthrough for 7B coder)
+ **Paper**: [arxiv 2505.21297](https://arxiv.org/abs/2505.21297)
+
+ **What they did**:
+ - 418K competitive programming problems
+ - 580K long-reasoning solutions (CoT verified by tests)
+ - 3-step input generation + mutual verification for test cases
+ - Result: Qwen2.5-7B 17.4% → **57.3% LCB**, matches Claude 3.5 Sonnet
+
+ **Implementation for v2** (data-prep sketch below):
+ - Use the `microsoft/rStar-Coder` dataset (already in the dataset-mirror.sh v2 list — 30K samples)
+ - BUMP the allocation to 100K samples (the full set is 580K — the paper used all of it!)
+ - Train at 32K context with sample packing
+ - Long reasoning chains naturally fit (avg ~3K tokens/example)
+
+ **Expected lift**: +20-25pt on LiveCodeBench v6 alone
+
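+ A minimal data-prep sketch (it assumes the hub dataset exposes `question`/`response`-style columns and loads without a config name — verify both against the dataset card):
+
+ ```python
+ # Sketch: stream 100K rStar-Coder rows and greedily pack them into ~32K-token
+ # sequences for long-context SFT. Column names are assumptions.
+ from datasets import load_dataset
+ from transformers import AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
+ ds = load_dataset("microsoft/rStar-Coder", split="train", streaming=True)
+
+ packed, buf, used, BUDGET = [], [], 0, 32768
+ for i, row in enumerate(ds):
+     if i >= 100_000:
+         break
+     text = f"{row['question']}\n\n{row['response']}"
+     n = len(tok(text).input_ids)
+     if used + n > BUDGET and buf:          # flush a full pack
+         packed.append("\n\n".join(buf))
+         buf, used = [], 0
+     buf.append(text); used += n
+ ```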
+ ### 2. DeepSeek-V3 Multi-Token Prediction (MTP)
+ **Paper**: [arxiv 2412.19437](https://arxiv.org/html/2412.19437v1)
+
+ **What it does**:
+ - Auxiliary heads predict tokens 2, 3 positions ahead
+ - Maintains the causal chain (sequential prediction)
+ - Densifies the training signal (more gradients per forward pass)
+ - Bonus: speculative decoding, 1.8× speedup at inference
+
+ **Implementation for v2** (loss sketch below):
+ - Add MTP heads to LoRA training (custom Axolotl plugin)
+ - 2 auxiliary heads = 3× signal density
+ - Discard heads at inference (or repurpose for spec-decoding)
+
+ **Expected lift**: +3-5% on all coding metrics (Qwen3-Coder used MTP)
+
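+ A simplified loss sketch (DeepSeek-V3 chains small sequential modules to keep the causal chain; the parallel linear heads here are a cheaper approximation that just shows the densified objective, and the 0.3 aux weight is a guess):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ class MTPHeads(torch.nn.Module):
+     """Extra heads over trunk hidden states predicting tokens t+2 and t+3."""
+     def __init__(self, hidden: int, vocab: int, offsets=(2, 3)):
+         super().__init__()
+         self.offsets = offsets
+         self.heads = torch.nn.ModuleList(
+             torch.nn.Linear(hidden, vocab, bias=False) for _ in offsets)
+
+     def loss(self, hidden_states, labels):          # [B,T,H], [B,T]
+         total = 0.0
+         for off, head in zip(self.offsets, self.heads):
+             logits = head(hidden_states[:, :-off])  # position i predicts i+off
+             target = labels[:, off:]
+             total = total + F.cross_entropy(
+                 logits.reshape(-1, logits.size(-1)),
+                 target.reshape(-1), ignore_index=-100)
+         return total / len(self.offsets)
+
+ # total = lm_loss + 0.3 * mtp_heads.loss(hidden_states, labels)
+ ```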
+ ### 3. Magpie self-instruct (FREE 1M instructions)
+ **Paper**: [ICLR 2025](https://github.com/magpie-align/magpie)
+
+ **What it does**:
+ - Prompt an aligned LLM with ONLY its chat template (no actual prompt)
+ - Auto-regressive nature → the model generates the user query + response
+ - ZERO API cost beyond GPU hours
+ - Generated 1M-3M instructions from Llama-3-70B in ~600 GPU-hr
+
+ **Implementation for v2** (full script: `bin/v2/magpie-self-instruct.py` in this commit):
+ - Run Magpie on `Qwen2.5-Coder-32B-Instruct` (free via HF Inference or local)
+ - Generate 1M code-related instructions
+ - Cost: ~200 GPU-hr of free Lightning quota
+ - vs. Claude API for the same volume = $5,000+
+
+ **Expected**: 1M extra training samples for FREE
+
+ ### 4. DAPO RL (ByteDance/Tsinghua, BEATS GRPO)
+ **Paper**: [arxiv 2503.14476](https://arxiv.org/abs/2503.14476)
+
+ **What it does**:
+ - Decoupled clip + dynamic sampling + token-level policy gradient loss
+ - Qwen2.5-32B → 50pt AIME 2024 (better than GRPO)
+ - Open-source via the verl framework
+
+ **Implementation for v2 Stage 3** (loss sketch below; full config: `configs/v2/stage3-dapo.yml`):
+ - Replace GRPO → DAPO in stage3-rlvr.yml
+ - Same data (SWE-Gym + R2E-Gym + custom DevSecOps)
+ - The verl framework supports it out of the box
+
+ **Expected lift**: +5-8% on SWE-Bench (vs GRPO baseline)
+
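+ A sketch of the core objective (the decoupled clip range plus token-level aggregation; dynamic sampling lives in the rollout sampler, which drops prompt groups whose rollouts are all-correct or all-wrong):
+
+ ```python
+ import torch
+
+ def dapo_loss(logp, logp_old, adv, mask, eps_low=0.20, eps_high=0.28):
+     """Token-level DAPO surrogate. All args are [B, T] tensors; adv is the
+     group-relative advantage broadcast over each rollout's tokens."""
+     ratio = torch.exp(logp - logp_old)
+     # Clip-Higher: a wider upper bound keeps low-probability tokens explorable
+     clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
+     per_tok = torch.minimum(ratio * adv, clipped * adv)
+     # Token-level mean (not per-sequence): long correct chains count more
+     return -(per_tok * mask).sum() / mask.sum().clamp(min=1)
+ ```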
+ ### 5. Mergekit 9-LoRA composition (TIES + DARE)
+ **Tools**: [mergekit](https://github.com/arcee-ai/mergekit), [PEFT merging](https://huggingface.co/blog/peft_merging)
+
+ **What it does**:
+ - Combines 9 specialized LoRAs into 1 model
+ - TIES: sign consensus, drop interfering weights
+ - DARE: random prune + rescale
+ - DARE-TIES: best for 5+ adapters
+ - CPU-only, or 8 GB VRAM
+
+ **Implementation for v2 Phase B end** (full script: `bin/v2/merge-9-loras.sh` in this commit):
+ - Train 9 LoRAs separately (eng-build, eng-ops, eng-sec, etc.)
+ - Merge via DARE-TIES into a single super-LoRA
+ - vLLM serves a single model (no multi-LoRA latency)
+
+ **Expected lift**: +2-5% across all domain benchmarks (vs single LoRA)
+
+ ### 6. XGrammar default decoding (FREE structural correctness)
+ **Tool**: [XGrammar](https://github.com/mlc-ai/xgrammar) (default in vLLM 2026-04+)
+
+ **What it does**:
+ - Context-free grammar enforcement at decode time
+ - JSON / regex / custom CFG
+ - 96-98% structural correctness
+ - 5× TPOT speedup
+ - Zero training cost
+
+ **Implementation for v2 inference** (request sketch below):
+ - Already planned. Just enable: `vllm serve --guided-decoding-backend xgrammar`
+ - Define grammars per use case:
+   - Tool calls: JSON schema
+   - Code blocks: Python/Bash/SQL/Terraform/YAML grammars
+   - Output structure: Markdown headers
+
+ **Expected**: 100% syntax correctness on tool calls + code blocks
+
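+ A client-side sketch against the server from `bin/v2/serve-vllm.sh` (vLLM's OpenAI-compatible endpoint accepts a `guided_json` schema via `extra_body`; the model name and port assume that script's defaults):
+
+ ```python
+ import json
+ from pathlib import Path
+ from openai import OpenAI
+
+ # Constrain decoding with the tool-call grammar shipped in this commit.
+ schema = json.loads(Path("bin/v2/grammars/tool-call.json").read_text())
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")
+
+ resp = client.chat.completions.create(
+     model="axentx/surrogate-1-coder-7b-lora-v2-merged",
+     messages=[{"role": "user", "content": "Run the repo's test suite."}],
+     extra_body={"guided_json": schema},   # XGrammar enforces this at decode
+ )
+ print(resp.choices[0].message.content)    # schema-valid JSON by construction
+ ```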
+ ### 7. NExtLong long-context curriculum (ICML 2025)
+ **Paper**: arxiv 2501.12766
+
+ **What it does**:
+ - Long sequences with HARD negatives interleaved
+ - Synthetic > human-curated for long context
+ - ~10B tokens needed (we use a 200M-500M subset)
+
+ **Implementation for v2 Stage 1** (interleaving sketch below):
+ - 60% long context (≥16K) repo-concat with FIM
+ - 40% short context
+ - Hard negatives: similar-but-incorrect code samples interleaved
+ - NExtLong synth via the free LLM ladder
+
+ **Expected**: RULER @ 128K from 80 → **88+**
+
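+ The core document-synthesis step, sketched (retrieval of the hard negatives, e.g. embedding nearest-neighbors, is assumed to happen upstream in the pipeline):
+
+ ```python
+ import random
+
+ def interleave_hard_negatives(doc_chunks: list[str],
+                               negatives: list[str],
+                               p: float = 0.5) -> str:
+     """Splice similar-but-wrong chunks between a real document's chunks so
+     long-range attention must discriminate, not just retrieve."""
+     out = []
+     for chunk in doc_chunks:
+         out.append(chunk)
+         if negatives and random.random() < p:
+             out.append(negatives.pop())
+     return "\n\n".join(out)
+ ```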
+ ### 8. Behaviorally Calibrated RL (Dec 2025)
+ **Paper**: arxiv (Dec 2025) — Qwen3-4B AUC 0.902
+
+ **What it does**:
+ - Trains the model to KNOW when it doesn't know
+ - Reward = 1 for correct + confident; 0 for refusing when uncertain
+ - Penalty for confident-wrong (TruthRL-style ternary)
+
+ **Implementation in v2 Stage 5** (reward sketch below):
+ - Already in the plan via TruthRL
+ - Add: behavioral calibration eval suite
+ - Target AUC > 0.92 (above the paper)
+
+ **Expected**: hallucination rate <3% + calibration AUC > 0.92
+
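+ A minimal sketch of the ternary reward (abstention detection here is a naive keyword check; the real grader should be a verifier, matching the `honest_idk` / `confident_wrong` entries in stage3-dapo.yml):
+
+ ```python
+ ABSTAIN_MARKERS = ("i don't know", "i'm not sure", "cannot determine")
+
+ def ternary_reward(answer: str, is_correct: bool) -> float:
+     """TruthRL-style: +1 correct, 0 honest abstention, -1 confident-wrong."""
+     if is_correct:
+         return 1.0
+     if any(m in answer.lower() for m in ABSTAIN_MARKERS):
+         return 0.0    # honest uncertainty is neutral, never punished
+     return -1.0       # heavy penalty for a confident wrong answer
+ ```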
+ ### 9. Self-Play SWE-RL (Together AI DeepSWE)
+ **Blog**: [Together DeepSWE](https://www.together.ai/blog/deepswe)
+
+ **What they did**:
+ - Generate bugs synthetically
+ - Train the model to fix them
+ - Iterative: the model gets better at finding bugs → trains on harder bugs
+ - Open recipe at [agentica-project/rllm](https://github.com/agentica-project/rllm)
+
+ **Implementation for v2 Stage 4-5 (post Phase B)** (pair-generation sketch below):
+ - Self-play loop: bug-injector model + bug-fixer model
+ - Both start from the Phase B artifact
+ - They diverge over time
+
+ **Expected lift**: SWE-Bench Lite +5-10pp
+
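+ How one verifiable (buggy, fixed) pair might be produced (`propose_bug` stands in for an injector-model call; keeping only mutations that actually break the tests is what makes the reward verifiable):
+
+ ```python
+ import pathlib
+ import subprocess
+
+ def make_pair(repo: str, rel_path: str, propose_bug) -> dict | None:
+     src = pathlib.Path(repo, rel_path)
+     clean = src.read_text()
+     buggy = propose_bug(clean)            # injector model mutates the file
+     src.write_text(buggy)
+     failing = subprocess.run(["pytest", "-x", "-q"], cwd=repo,
+                              capture_output=True).returncode != 0
+     src.write_text(clean)                 # always restore the checkout
+     if not failing:
+         return None                       # no-op mutation — discard
+     return {"prompt": f"Fix the failing tests in {rel_path}:\n{buggy}",
+             "target": clean}
+ ```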
+ ### 10. Stack-Edu / FineWeb-Edu classifier filtering
+ **Tools**: HuggingFaceTB/stack-edu-classifier-python, fineweb-edu-classifier
+
+ **What it does**:
+ - Scores each code/text sample 1-5 for educational quality
+ - Train only on threshold ≥3 (the Phi-4 method)
+
+ **Implementation for v2 data pipeline** (scoring sketch below):
+ - Already in the dedup-decontaminate.py plan
+ - Apply BEFORE the final SFT mix
+ - Drop the ~30% lowest-quality samples
+
+ **Expected lift**: +2-3% on HumanEval+ from cleaner data
+
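+ A scoring sketch with the FineWeb-Edu classifier (a single-logit regression head per its model card; the Stack-Edu code classifiers follow the same pattern):
+
+ ```python
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ NAME = "HuggingFaceTB/fineweb-edu-classifier"
+ tok = AutoTokenizer.from_pretrained(NAME)
+ clf = AutoModelForSequenceClassification.from_pretrained(NAME)
+
+ def edu_score(text: str) -> float:
+     """Educational-quality score, roughly 0-5; keep samples scoring >= 3."""
+     inputs = tok(text, return_tensors="pt", truncation=True)
+     with torch.no_grad():
+         return clf(**inputs).logits.squeeze(-1).item()
+
+ # kept = [s for s in mix if edu_score(s["text"]) >= 3.0]
+ ```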
+ ---
+
+ ## Compute & cost (NO Anthropic API)
+
+ | Item | Cost | Source |
+ |------|------|--------|
+ | HF PRO | $9/mo | HuggingFace |
+ | Wasabi 1 TB | $6/mo | Wasabi |
+ | Lightning H200 | free 80hr/mo (ashiradevops + ashirapit) | Lightning |
+ | Anthropic API | **$0** ❌ removed | replaced by the free LLM ladder |
+ | Synth data gen | $0 | Cerebras qwen-3-235b + Groq llama-3.3-70b free + Magpie self-instruct |
+ | Extra GPU compute | $0-200 (RunPod spot, only if Lightning is exhausted) | optional |
+
+ **Total**: $15/mo + $0-200 one-time (down from the prior $1,700-3,800 estimate)
+
+ ## v2 Phase Map (revised)
+
+ | Phase | Weeks | Output | Cost |
+ |-------|-------|--------|------|
+ | **A**: Code+Tool+Agent SFT/DPO | 4 | `surrogate-1-coder-7b-lora-v2-mvp` | $0-200 |
+ | **A+**: rStar-Coder 100K + Magpie 1M continued SFT | +1 | bigger lift on LCB | free |
+ | **B**: 9-LoRA cluster expertise (parallel) | 4 | 9 LoRAs | $200-500 (parallel) |
+ | **B+**: DARE-TIES merge → super-LoRA | 0.5 | 1 merged LoRA | free (CPU) |
+ | **C**: DAPO RLVR + TruthRL | 2-3 | RL polish | $200-500 |
+ | **C+**: Self-Play SWE-RL bug inject/fix | 1-2 | iterative improvement | free (Lightning) |
+
+ **Total: 12-15 weeks / $400-1,200 / no Anthropic API**
+