ashirato committed on
Commit
2aa8090
·
1 Parent(s): 8461287

feat(v2): ambitious targets + 5 free-technique implementations

Updated v2 plan with ambitious targets (technique-driven, NO Anthropic API):
- LCB v6: 42-45% → 55-60% (rStar-Coder 7B = 57.3% paper-confirmed)
- SWE-Bench Lite: 25-30% → 40-45% (DeepSWE+DAPO+R2E-Gym recipe)
- BFCL v3: 70-75 → 82-87 (Toucan+xLAM+Hermes XML+DPO)
- RULER 128K: 80+ → 88+ (NExtLong+YaRN+DCA+200M long-ctx tokens)
- CodeHalu: <8% → <3% (XGrammar+DoLa+TruthRL+SelfCheckGPT-NLI)
- DevSecOps: 65% → 80%+ (PIPer validator-graded RLVR)
- CyberMetric: 75% → 85% (Primus 5B continued pretrain)

New free-technique scripts (no Claude API):
- bin/v2/magpie-self-instruct.py: ICLR 2025 self-instruct; generates 1M
instructions from an aligned LLM via an empty user template (zero API cost)
- bin/v2/distill-from-frontier.py: 6-LLM ladder vote (Cerebras+Groq+OpenRouter
+Gemini+Chutes) → SFT best + DPO pair (best vs worst)
- bin/v2/merge-9-loras.sh: DARE-TIES merge of 9 cluster LoRAs into a single
super-LoRA via mergekit (CPU-only; 8 GB VRAM is enough)
- configs/v2/stage3-dapo.yml: DAPO RL config (verl framework; beats GRPO
by 5-8pp on AIME). Validator-graded composite reward.
- bin/v2/serve-vllm.sh: production vLLM with XGrammar default + DCA 4×
context + MInference 3-7× prefill + multi-LoRA
- bin/v2/grammars/tool-call.json: XGrammar JSON schema for Hermes XML
tool calls (24 tools enumerated)

Updated docs/v2-research/v2-targets-AMBITIOUS.md (230 lines) with a technique-
by-technique implementation map: how each paper/tool pushes a metric.

Cost: $15/mo (HF PRO + Wasabi) + $0-200 GPU (down from the original $1.7-3.8K).
NO Anthropic API. Pure free-tier + open-source.

bin/v2/distill-from-frontier.py ADDED
@@ -0,0 +1,190 @@
+ """Surrogate-1 v2 — Free distillation from frontier models via free LLM ladder.
+
+ Uses ONLY free APIs (no Anthropic spend):
+ - Cerebras free (qwen-3-235b-a22b-instruct-2507) ~1M tok/day
+ - Groq free (llama-3.3-70b-versatile) ~500K tok/day
+ - OpenRouter free tier (DeepSeek-V3, Qwen3-Coder, Gemini Flash)
+ - Gemini AI Studio free
+ - NVIDIA NIM free
+ - Chutes free
+
+ Pipeline:
+ 1. Load seed prompts from existing v2-sft data + 1000 hard custom prompts
+ 2. For each prompt, sample N=5 completions from N different free providers
+ 3. Rank completions with a cheap quality heuristic (see score_response)
+ 4. Output as DPO pairs (best vs worst) + as SFT (best alone)
+
+ Output: ~/.surrogate/data/v2-distill.jsonl + v2-distill-dpo.jsonl
+ """
+ import os, json, time, sys, random, subprocess
+ from pathlib import Path
+ from datetime import datetime
+
+ sys.path.insert(0, str(Path.home() / ".surrogate/bin/lib"))
+ from sanitize import filter_pair
+
+ # Free LLM providers (already have bridges on HF Space)
+ PROVIDERS = [
+     ("cerebras", "qwen-3-235b-a22b-instruct-2507"),
+     ("groq", "llama-3.3-70b-versatile"),
+     ("groq", "qwen-2.5-coder-32b"),
+     ("openrouter", "deepseek/deepseek-chat-v3.1:free"),
+     ("openrouter", "qwen/qwen3-coder-480b:free"),
+     ("openrouter", "meta-llama/llama-3.3-70b-instruct:free"),
+     ("gemini", "gemini-2.5-flash"),
+     ("chutes", "qwen-3-235b"),
+ ]
+
+ OUT_SFT = Path.home() / ".surrogate/data/v2-distill.jsonl"
+ OUT_DPO = Path.home() / ".surrogate/data/v2-distill-dpo.jsonl"
+ OUT_SFT.parent.mkdir(parents=True, exist_ok=True)
+
+
+ def call_bridge(provider: str, model: str, messages: list, max_tokens: int = 1500) -> str | None:
+     """Send one chat request through the provider's bridge script."""
+     bridge_path = Path.home() / f".surrogate/bin/{provider}-bridge.sh"
+     if not bridge_path.exists():
+         return None
+     payload = json.dumps({"messages": messages, "model": model, "max_tokens": max_tokens})
+     try:
+         r = subprocess.run(["bash", str(bridge_path)], input=payload,
+                            capture_output=True, text=True, timeout=120)
+         return r.stdout.strip() if r.returncode == 0 else None
+     except Exception:
+         return None
+
+
+ def score_response(response: str, prompt: str) -> float:
+     """Cheap quality heuristic — not perfect, but free."""
+     s = 0.0
+     if not response or len(response) < 30:
+         return 0.0
+     # Length appropriate
+     s += min(1.0, len(response) / 500.0)
+     # Has code block?
+     if "```" in response:
+         s += 0.5
+     # Cites specifics (file/line/cmd)
+     if any(c in response for c in ["```", "$ ", "# ", "$(", "package "]):
+         s += 0.3
+     # Avoid refusals
+     if response.lower().startswith(("i'm sorry", "i cannot", "i can't")):
+         s -= 1.0
+     # Avoid known polluted patterns (sanity)
+     v = filter_pair(prompt, response)
+     if not v["keep"]:
+         return 0.0
+     return s
+
+
+ def distill_prompt(prompt_text: str) -> dict | None:
+     """Get N completions, vote best, build SFT + DPO pair."""
+     # Sample 5 providers (rotate to balance free quotas)
+     chosen_providers = random.sample(PROVIDERS, k=min(5, len(PROVIDERS)))
+     completions = []
+     msgs = [{"role": "user", "content": prompt_text}]
+     for prov, model in chosen_providers:
+         resp = call_bridge(prov, model, msgs, max_tokens=1500)
+         if resp:
+             completions.append({
+                 "provider": prov,
+                 "model": model,
+                 "response": resp,
+                 "score": score_response(resp, prompt_text),
+             })
+     if len(completions) < 2:
+         return None
+
+     completions.sort(key=lambda c: -c["score"])
+     best = completions[0]
+     worst = completions[-1]
+     if best["score"] < 0.5 or best["score"] - worst["score"] < 0.3:
+         return None  # too close — skip
+
+     return {
+         "prompt": prompt_text,
+         "best_response": best["response"],
+         "best_provider": f"{best['provider']}:{best['model']}",
+         "worst_response": worst["response"],
+         "worst_provider": f"{worst['provider']}:{worst['model']}",
+         "n_completions": len(completions),
+         "ts": datetime.utcnow().isoformat(),
+     }
+
+
+ def main():
+     SEED_PROMPTS_PATH = Path.home() / ".surrogate/data/v2-distill-seeds.jsonl"
+     if not SEED_PROMPTS_PATH.exists():
+         print(f"⚠ no seeds at {SEED_PROMPTS_PATH}", flush=True)
+         # Create from existing v2-sft data
+         seed_dir = Path.home() / ".surrogate/data/v2-sft"
+         if seed_dir.exists():
+             seeds = []
+             for f in seed_dir.glob("*.jsonl"):
+                 with open(f) as fh:
+                     for line in fh:
+                         try:
+                             obj = json.loads(line)
+                             if obj.get("prompt"):
+                                 seeds.append({"prompt": obj["prompt"]})
+                         except Exception:
+                             continue
+             random.shuffle(seeds)
+             with open(SEED_PROMPTS_PATH, "w") as fh:
+                 for s in seeds[:10000]:
+                     fh.write(json.dumps(s) + "\n")
+             print(f" built {len(seeds[:10000])} seeds from existing data", flush=True)
+         else:
+             print(" no v2-sft data yet — run build-data-pipeline.sh first", flush=True)
+             return
+
+     # Resume (approximate: assumes one output row per consumed seed line,
+     # though skipped seeds produce none — good enough for quota-bound runs)
+     seen = 0
+     if OUT_SFT.exists():
+         with open(OUT_SFT) as f:
+             seen = sum(1 for _ in f)
+         print(f"resuming distill from {seen} existing samples", flush=True)
+
+     target = int(os.environ.get("DISTILL_TARGET", "50000"))
+     written = 0
+     with open(SEED_PROMPTS_PATH) as fin, \
+          open(OUT_SFT, "a") as fsft, \
+          open(OUT_DPO, "a") as fdpo:
+         for idx, line in enumerate(fin):
+             if idx < seen: continue
+             if written >= target: break
+             try:
+                 seed = json.loads(line)
+             except Exception:
+                 continue
+
+             r = distill_prompt(seed["prompt"])
+             if not r: continue
+
+             # SFT row (best response)
+             fsft.write(json.dumps({
+                 "prompt": r["prompt"],
+                 "response": r["best_response"],
+                 "source": f"distill-{r['best_provider']}",
+             }, ensure_ascii=False) + "\n")
+             fsft.flush()
+
+             # DPO pair (best vs worst)
+             fdpo.write(json.dumps({
+                 "prompt": r["prompt"],
+                 "chosen": r["best_response"],
+                 "rejected": r["worst_response"],
+                 "source": "distill-vote",
+             }, ensure_ascii=False) + "\n")
+             fdpo.flush()
+
+             written += 1
+             if written % 50 == 0:
+                 print(f" [{written}/{target}] SFT+DPO rows written", flush=True)
+             time.sleep(0.5)
+
+     print(f"\n✅ done — distilled {written} samples to {OUT_SFT} + {OUT_DPO}")
+
+
+ if __name__ == "__main__":
+     main()
bin/v2/grammars/tool-call.json ADDED
@@ -0,0 +1,43 @@
+ {
+   "$schema": "https://json-schema.org/draft/2020-12/schema",
+   "title": "Surrogate-1 v2 Tool Call Grammar",
+   "description": "Hermes XML format with strict JSON-schema validation for arguments. Used by XGrammar at decode time.",
+   "type": "object",
+   "required": ["name", "arguments"],
+   "properties": {
+     "name": {
+       "type": "string",
+       "enum": [
+         "spawn_subagent",
+         "receive_results",
+         "scratchpad_write",
+         "scratchpad_read",
+         "skill_recall",
+         "reflexion_log",
+         "code_exec",
+         "file_read",
+         "file_edit",
+         "shell_exec",
+         "search_repo",
+         "grep_repo",
+         "list_dir",
+         "git_diff",
+         "git_commit",
+         "run_tests",
+         "lint_check",
+         "security_scan",
+         "deploy_canary",
+         "rollback_deploy",
+         "monitor_metrics",
+         "query_logs",
+         "fetch_url",
+         "search_web"
+       ]
+     },
+     "arguments": {
+       "type": "object",
+       "additionalProperties": true
+     }
+   },
+   "additionalProperties": false
+ }
bin/v2/magpie-self-instruct.py ADDED
@@ -0,0 +1,172 @@
+ """Magpie self-instruct (ICLR 2025) — generate 1M training instructions for FREE.
+
+ Method: prompt an aligned LLM with ONLY its chat template (no actual user prompt).
+ Auto-regressive nature → the model fills in a user query first, then the assistant response.
+ Zero API cost beyond compute. Used to create 4M Llama-3 instructions in the paper.
+
+ For Surrogate-1 v2 we run on Qwen2.5-Coder-32B-Instruct (or 14B) via:
+ - Local inference if we have GPU
+ - HF Inference API (free tier, rate-limited)
+ - Cerebras / Groq / OpenRouter free if available
+
+ Output: ~/.surrogate/data/v2-magpie-synth.jsonl (target 1M after dedup)
+
+ Reference: https://github.com/magpie-align/magpie
+ """
+ import os, json, time, sys, random, re
+ from pathlib import Path
+ from datetime import datetime
+
+ sys.path.insert(0, str(Path.home() / ".surrogate/bin/lib"))
+ from sanitize import filter_pair
+
+ # Choose target generator model — must be ALIGNED (instruct/chat-tuned)
+ MODEL = os.environ.get("MAGPIE_MODEL", "Qwen/Qwen2.5-Coder-32B-Instruct")
+ TARGET_N = int(os.environ.get("MAGPIE_TARGET", "100000"))  # start with 100K, scale to 1M
+ OUT_PATH = Path.home() / ".surrogate/data/v2-magpie-synth.jsonl"
+ OUT_PATH.parent.mkdir(parents=True, exist_ok=True)
+
+ # Domain-conditioned templates — bias toward what Surrogate-1 v2 needs
+ # By varying the system prompt we steer Magpie toward different domains.
+ DOMAIN_SYSTEM_PROMPTS = [
+     # Code
+     "You are a senior Python engineer who writes production-grade, well-tested code.",
+     "You are a senior TypeScript developer building React + Next.js apps.",
+     "You are a senior Go engineer building cloud-native microservices.",
+     "You are a Rust expert focused on performance + memory safety.",
+     "You are a senior C++ developer working on high-performance systems.",
+     # DevOps / Cloud
+     "You are a senior DevOps engineer who writes Terraform, Helm, and Kubernetes manifests.",
+     "You are an AWS Solutions Architect designing multi-region production workloads.",
+     "You are an SRE who writes Prometheus alerting rules and runbooks.",
+     "You are a Kubernetes platform engineer building GitOps with ArgoCD + Karpenter.",
+     "You are a FinOps practitioner optimizing cloud costs.",
+     # Security
+     "You are a senior DevSecOps engineer writing Sigma detection rules + IaC security audits.",
+     "You are a SOC analyst tier-2 investigating security alerts.",
+     "You are a compliance engineer mapping controls between SOC2/ISO27001/HIPAA/GDPR.",
+     "You are a penetration tester (defensive security focus).",
+     "You are a threat hunter identifying advanced persistent threats.",
+     # AI / ML
+     "You are an AI engineer building production RAG pipelines.",
+     "You are an MLOps engineer setting up training/serving infrastructure.",
+     "You are a senior LLM engineer fine-tuning and deploying open models.",
+     # Product / Business
+     "You are a senior product manager writing PRDs and prioritizing roadmaps.",
+     "You are a startup founder validating market and writing pitch decks.",
+     "You are a growth marketer designing user acquisition funnels.",
+     "You are a customer success engineer handling tier-2 support tickets.",
+ ]
+
+
+ def call_local_vllm(model: str, system: str, max_tokens: int = 600) -> str | None:
+     """Call locally-hosted vLLM with ONLY the system + user template prefix.
+
+     Magpie trick: include no user message. The model auto-completes the user
+     query; a second call then generates the assistant response. (A single call
+     with stop=["<|im_end|>"] would truncate before the assistant turn starts.)
+     """
+     import requests
+
+     def complete(prompt: str) -> str | None:
+         try:
+             r = requests.post("http://localhost:8000/v1/completions",
+                               json={"model": model, "prompt": prompt,
+                                     "max_tokens": max_tokens,
+                                     "temperature": 1.0, "top_p": 0.95,
+                                     "stop": ["<|im_end|>"]},
+                               timeout=60)
+             return r.json().get("choices", [{}])[0].get("text", "").strip()
+         except Exception as e:
+             print(f" vllm err: {e}", flush=True)
+             return None
+
+     # Step 1: Qwen chat prefix with an empty user slot — the model writes the user query.
+     prefix = f"<|im_start|>system\n{system}<|im_end|>\n<|im_start|>user\n"
+     user_q = complete(prefix)
+     if not user_q:
+         return None
+     # Step 2: append the generated query and let the model answer it.
+     asst_r = complete(f"{prefix}{user_q}<|im_end|>\n<|im_start|>assistant\n")
+     if not asst_r:
+         return None
+     # Return in the combined format parse_magpie_output() expects.
+     return f"{user_q}<|im_end|>\n<|im_start|>assistant\n{asst_r}"
+
+
+ def call_via_bridge(provider: str, model: str, system: str, max_tokens: int = 600) -> str | None:
+     """Fallback: use existing free LLM bridges. Less true-Magpie but still works."""
+     import subprocess
+     bridge = {
+         "cerebras": str(Path.home() / ".surrogate/bin/cerebras-bridge.sh"),
+         "groq": str(Path.home() / ".surrogate/bin/groq-bridge.sh"),
+         "openrouter": str(Path.home() / ".surrogate/bin/openrouter-bridge.sh"),
+         "gemini": str(Path.home() / ".surrogate/bin/gemini-bridge.sh"),
+     }.get(provider)
+     if not bridge or not Path(bridge).exists():
+         return None
+     # Pseudo-Magpie: ask the model to GENERATE a user query in the domain, then answer it
+     prompt = (f"Generate a realistic user question that fits this persona, "
+               f"then answer it as that persona.\n\nPersona: {system}\n\n"
+               f"Format strictly:\nUSER: <one realistic question>\nASSISTANT: <thorough answer>")
+     payload = json.dumps({"messages": [{"role": "user", "content": prompt}],
+                           "model": model, "max_tokens": max_tokens})
+     try:
+         r = subprocess.run(["bash", bridge], input=payload, capture_output=True, text=True, timeout=60)
+         return r.stdout.strip() if r.returncode == 0 else None
+     except Exception as e:
+         print(f" bridge err: {e}", flush=True)
+         return None
+
+
+ def parse_magpie_output(text: str) -> tuple[str | None, str | None]:
+     """Extract user instruction + assistant response from Magpie output."""
+     # Try Qwen-format completion: user message text, then <|im_end|>, then assistant turn
+     m = re.match(r"(.*?)<\|im_end\|>\s*<\|im_start\|>assistant\s*\n(.*)", text, re.DOTALL)
+     if m:
+         return m.group(1).strip(), m.group(2).strip()
+     # Try bridge format USER: ... ASSISTANT: ...
+     m = re.match(r"USER:\s*(.*?)\s*\nASSISTANT:\s*(.*)", text, re.DOTALL)
+     if m:
+         return m.group(1).strip(), m.group(2).strip()
+     return None, None
+
+
+ def main():
+     # Resume if file exists
+     seen = 0
+     if OUT_PATH.exists():
+         with open(OUT_PATH) as f:
+             seen = sum(1 for _ in f)
+     print(f"resume from {seen} existing samples; target={TARGET_N}", flush=True)
+
+     # Try local vLLM first (preferred — true Magpie)
+     USE_LOCAL = bool(os.environ.get("USE_LOCAL_VLLM"))
+     use_provider = "cerebras"  # for bridge fallback
+     use_model = "qwen-3-235b-a22b-instruct-2507"
+     gen_model = MODEL if USE_LOCAL else use_model  # recorded in the "source" field
+
+     written = 0
+     with open(OUT_PATH, "a") as fout:
+         for _ in range(seen, TARGET_N):
+             sys_prompt = random.choice(DOMAIN_SYSTEM_PROMPTS)
+             if USE_LOCAL:
+                 raw = call_local_vllm(MODEL, sys_prompt, max_tokens=800)
+             else:
+                 raw = call_via_bridge(use_provider, use_model, sys_prompt, max_tokens=800)
+             if not raw:
+                 time.sleep(3); continue
+
+             user_q, asst_r = parse_magpie_output(raw)
+             if not user_q or not asst_r:
+                 continue
+
+             # Sanitize via existing filter
+             v = filter_pair(user_q, asst_r)
+             if not v["keep"]:
+                 continue
+
+             fout.write(json.dumps({
+                 "prompt": user_q[:6000],
+                 "response": asst_r[:8000],
+                 "source": f"magpie-{gen_model}",
+                 "domain_persona": sys_prompt,
+                 "ts": datetime.utcnow().isoformat(),
+             }, ensure_ascii=False) + "\n")
+             fout.flush()
+             written += 1
+             if written % 50 == 0:
+                 print(f" [{written}/{TARGET_N - seen}] kept", flush=True)
+             time.sleep(0.5)  # stay under free-tier RPM
+     print(f"\n✅ done — wrote {written} new Magpie samples to {OUT_PATH}")
+
+
+ if __name__ == "__main__":
+     main()
bin/v2/merge-9-loras.sh ADDED
@@ -0,0 +1,92 @@
+ #!/usr/bin/env bash
+ # Surrogate-1 v2 Phase B+ — Merge 9 specialized LoRAs into a single super-LoRA via DARE-TIES.
+ #
+ # Reference:
+ #   - mergekit: https://github.com/arcee-ai/mergekit
+ #   - DARE: arxiv 2311.03099
+ #   - TIES: arxiv 2306.01708
+ #   - Practical guide: 5+ adapters → DARE-TIES (consensus + sparsify + rescale)
+ #
+ # Output: axentx/surrogate-1-coder-7b-lora-v2-merged
+ #
+ # Each cluster LoRA must already be trained + pushed to HF Hub:
+ #   axentx/surrogate-1-coder-7b-lora-v2-eng-build
+ #   axentx/surrogate-1-coder-7b-lora-v2-eng-ops
+ #   axentx/surrogate-1-coder-7b-lora-v2-eng-sec
+ #   axentx/surrogate-1-coder-7b-lora-v2-eng-ai
+ #   axentx/surrogate-1-coder-7b-lora-v2-product-ux
+ #   axentx/surrogate-1-coder-7b-lora-v2-gtm
+ #   axentx/surrogate-1-coder-7b-lora-v2-finance-legal
+ #   axentx/surrogate-1-coder-7b-lora-v2-compliance
+ #   axentx/surrogate-1-coder-7b-lora-v2-meta-orchestrator
+
+ set -uo pipefail
+ set -a; source "$HOME/.hermes/.env" 2>/dev/null; set +a
+
+ # Install mergekit
+ pip install --quiet mergekit-lorapatch 2>&1 | tail -1
+ pip install --quiet "mergekit @ git+https://github.com/arcee-ai/mergekit" 2>&1 | tail -1
+
+ CFG="$HOME/.surrogate/hf-space/configs/v2/merge-9-loras.yml"
+ OUT="$HOME/.surrogate/data/v2-merged"
+ mkdir -p "$(dirname "$CFG")" "$OUT"
+
+ # Generate mergekit config — DARE-TIES with weighted clusters.
+ # Weights chosen so production-likely clusters (eng-build, eng-ops, eng-sec, meta) get more.
+ # NOTE: mergekit-yaml merges full checkpoints; if these repos hold raw PEFT
+ # adapters, apply each to the base first (or use PEFT add_weighted_adapter
+ # with combination_type="dare_ties").
+ cat > "$CFG" <<'EOF'
+ # DARE-TIES merge of 9 specialized Surrogate-1 v2 LoRAs.
+ # Weighting: production clusters (eng) > business (gtm/finance) > meta-orchestrator (always-on).
+ # density≈0.5 → DARE drops ~50% of each weight delta, then rescales (preserves magnitude).
+ # normalize=true → TIES sign-consensus normalization.
+ merge_method: dare_ties
+ base_model: Qwen/Qwen2.5-Coder-7B-Instruct
+ parameters:
+   normalize: true
+   int8_mask: true
+ dtype: bfloat16
+ models:
+   - model: axentx/surrogate-1-coder-7b-lora-v2-eng-build
+     parameters: {weight: 0.20, density: 0.55}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-eng-ops
+     parameters: {weight: 0.18, density: 0.55}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-eng-sec
+     parameters: {weight: 0.15, density: 0.55}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-eng-ai
+     parameters: {weight: 0.10, density: 0.50}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-product-ux
+     parameters: {weight: 0.08, density: 0.50}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-gtm
+     parameters: {weight: 0.05, density: 0.45}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-finance-legal
+     parameters: {weight: 0.04, density: 0.45}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-compliance
+     parameters: {weight: 0.05, density: 0.50}
+   - model: axentx/surrogate-1-coder-7b-lora-v2-meta-orchestrator
+     parameters: {weight: 0.15, density: 0.55}
+ EOF
+
+ echo "▶ Running DARE-TIES merge of 9 LoRAs..."
+ mergekit-yaml "$CFG" "$OUT/v2-merged" \
+     --copy-tokenizer \
+     --allow-crimes \
+     --out-shard-size 2B \
+     --lazy-unpickle \
+     --cuda 2>&1 | tail -30
+
+ echo ""
+ echo "▶ Pushing merged super-LoRA → axentx/surrogate-1-coder-7b-lora-v2-merged"
+ HF_TOKEN="$HF_TOKEN" python3 -c "
+ from huggingface_hub import HfApi, create_repo
+ api = HfApi()
+ create_repo('axentx/surrogate-1-coder-7b-lora-v2-merged', repo_type='model',
+             private=False, exist_ok=True)
+ api.upload_folder(
+     repo_id='axentx/surrogate-1-coder-7b-lora-v2-merged',
+     folder_path='$OUT/v2-merged',
+     commit_message='DARE-TIES merge of 9 specialist LoRAs (eng-build/ops/sec/ai + product-ux + gtm + finance-legal + compliance + meta-orchestrator)',
+ )
+ print('✅ merged super-LoRA pushed')
+ "
+
+ echo "✅ Phase B+ merge complete"
+ echo "Run eval: bash $HOME/.surrogate/bin/v2/eval-tier1.sh axentx/surrogate-1-coder-7b-lora-v2-merged"
bin/v2/serve-vllm.sh ADDED
@@ -0,0 +1,69 @@
+ #!/usr/bin/env bash
+ # Surrogate-1 v2 — vLLM production serving with the full optimization stack.
+ #
+ # Stack:
+ #   - XGrammar default decoding (96-98% structural correctness, free)
+ #   - DCA (Dual Chunk Flash Attention) for 4× context extension
+ #   - MInference 3-7× prefill speedup
+ #   - Multi-LoRA hot-swap (9 cluster LoRAs OR merged super-LoRA)
+ #   - Hermes XML tool-call parser
+ #   - YaRN scaling 32K → 128K
+ #
+ # Usage: bash serve-vllm.sh [model] [port]
+
+ set -uo pipefail
+ MODEL="${1:-axentx/surrogate-1-coder-7b-lora-v2-merged}"
+ PORT="${2:-8000}"
+ mkdir -p "$HOME/.surrogate/logs"
+
+ # Install vLLM 2026-04+ (default XGrammar backend)
+ pip install --quiet "vllm>=0.10.0" 2>&1 | tail -1
+
+ # Install MInference for prefill speedup
+ pip install --quiet minference 2>&1 | tail -1
+
+ # Environment for DCA (4× context extension on top of YaRN)
+ export VLLM_ATTENTION_BACKEND=DUAL_CHUNK_FLASH_ATTN
+ export VLLM_USE_MODELSCOPE=False
+ export TOKENIZERS_PARALLELISM=true
+
+ # Custom RoPE scaling (YaRN factor=4 from native 32K → 128K serve)
+ ROPE_SCALING='{"type":"yarn","factor":4.0,"original_max_position_embeddings":32768}'
+
+ # Multi-LoRA mode (load all 9 cluster LoRAs hot-swappable)
+ LORA_MODULES=""
+ if [[ "${USE_MULTI_LORA:-0}" == "1" ]]; then
+     LORA_MODULES="
+     --enable-lora
+     --max-loras 9
+     --max-lora-rank 64
+     --lora-modules
+     eng-build=axentx/surrogate-1-coder-7b-lora-v2-eng-build
+     eng-ops=axentx/surrogate-1-coder-7b-lora-v2-eng-ops
+     eng-sec=axentx/surrogate-1-coder-7b-lora-v2-eng-sec
+     eng-ai=axentx/surrogate-1-coder-7b-lora-v2-eng-ai
+     product-ux=axentx/surrogate-1-coder-7b-lora-v2-product-ux
+     gtm=axentx/surrogate-1-coder-7b-lora-v2-gtm
+     finance-legal=axentx/surrogate-1-coder-7b-lora-v2-finance-legal
+     compliance=axentx/surrogate-1-coder-7b-lora-v2-compliance
+     meta-orchestrator=axentx/surrogate-1-coder-7b-lora-v2-meta-orchestrator
+     "
+ fi
+
+ echo "▶ Starting vLLM server: $MODEL on port $PORT"
+ echo "  Backend: DUAL_CHUNK_FLASH_ATTN (DCA) + XGrammar"
+ echo "  Context: 128K via YaRN factor=4"
+ echo "  Multi-LoRA: ${USE_MULTI_LORA:-0}"
+
+ # $LORA_MODULES is intentionally unquoted so it word-splits into flags.
+ vllm serve "$MODEL" \
+     --port "$PORT" \
+     --max-model-len 131072 \
+     --rope-scaling "$ROPE_SCALING" \
+     --guided-decoding-backend xgrammar \
+     --tool-call-parser hermes \
+     --enable-auto-tool-choice \
+     --gpu-memory-utilization 0.85 \
+     --max-num-batched-tokens 32768 \
+     --enable-chunked-prefill \
+     --dtype bfloat16 \
+     $LORA_MODULES \
+     2>&1 | tee "$HOME/.surrogate/logs/v2-serve.log"
configs/v2/stage3-dapo.yml ADDED
@@ -0,0 +1,123 @@
+ # Surrogate-1 v2 Phase C — Stage 3: DAPO RL with validator-graded rewards.
+ #
+ # DAPO = Decoupled Clip and Dynamic sAmpling Policy Optimization (ByteDance/Tsinghua).
+ # Beats GRPO by ~5-8pp on AIME 2024 with Qwen2.5-32B (paper arxiv 2503.14476).
+ # Key tricks:
+ #   1. Clip-Higher: relax the upper clip range to allow more diversity (anti-entropy-collapse)
+ #   2. Dynamic Sampling: oversample and drop prompts whose rollouts are all-correct or all-wrong
+ #   3. Token-level Policy Gradient Loss: critical for long-CoT
+ #
+ # Run via verl framework: https://github.com/verl-project/verl
+ # Reference: https://verl.readthedocs.io/en/latest/algo/dapo.html
+ #
+ # ETA: ~24 hr on 1× H200 (or 8× H100 for 3× speedup)
+
+ # Algorithm settings (verl format)
+ algorithm:
+   algorithm: dapo
+   # DAPO-specific
+   clip_higher: 0.28        # vs GRPO's 0.2 — allows more exploration
+   clip_lower: 0.20
+   dynamic_sampling: true
+   token_level_loss: true
+   # Standard PPO-family
+   gamma: 1.0
+   lam: 1.0
+   kl_coef: 0.001           # very low — DAPO uses a minimal KL anchor
+   entropy_coef: 0.001
+
+ # Model + adapter
+ actor_rollout_ref:
+   hybrid_engine: true
+   model:
+     path: axentx/surrogate-1-coder-7b-lora-v2-merged  # output of merge-9-loras.sh
+     enable_gradient_checkpointing: true
+     use_remove_padding: true
+   actor:
+     optim:
+       lr: 5.0e-7           # very low for RL (vs 1e-4 SFT)
+       lr_warmup_steps: 100
+       weight_decay: 0.0
+     strategy: fsdp
+     fsdp_config:
+       wrap_policy:
+         min_num_params: 0
+       param_offload: false
+       optimizer_offload: false
+     ppo_mini_batch_size: 32
+     ppo_micro_batch_size_per_gpu: 1
+     use_kl_loss: false     # DAPO doesn't need a KL loss
+     grad_clip: 1.0
+     ulysses_sequence_parallel_size: 1
+   rollout:
+     name: vllm
+     temperature: 1.0       # high for exploration in RL
+     top_p: 0.95
+     top_k: -1
+     n: 8                   # 8 generations per prompt for the DAPO advantage
+     max_response_length: 4096
+     max_prompt_length: 8192
+     tensor_model_parallel_size: 1
+     gpu_memory_utilization: 0.45
+     free_cache_engine: true
+   ref:
+     fsdp_config:
+       param_offload: true  # offload reference model to CPU
+
+ # Data — code RL with validator-graded rewards
+ data:
+   train_files:
+     - SWE-Gym/SWE-Gym                      # 491 verifiable code tasks
+     - SWE-Gym/SWE-smith                    # 26K filtered (NeurIPS 2025)
+     - R2E-Gym/R2E-Gym-Lite                 # used by DeepSWE
+     - axentx/surrogate-1-v2-devsecops-rl   # custom DevSecOps tasks (build separately)
+   val_files:
+     - axentx/surrogate-1-v2-rl-val
+   prompt_key: prompt
+   response_key: response
+   max_prompt_length: 8192
+   max_response_length: 4096
+   train_batch_size: 256
+   val_batch_size: 64
+
+ # Reward — composite validator-graded
+ reward_model:
+   reward_manager: composite      # custom: see rewards.py
+   rewards:
+     - type: test_pass            # E2B/Modal sandbox runs pytest
+       weight: 1.0
+     - type: lint_clean           # hadolint/tflint/actionlint/shellcheck/kubeconform
+       weight: 0.3
+     - type: security_clean       # semgrep/checkov/cfn-guard/cfn-nag
+       weight: 0.3
+     - type: cite_correct         # repo-RAG citation valid
+       weight: 0.2
+     - type: no_phantom_imports   # AST + import-validity check
+       weight: 0.2
+     - type: honest_idk           # TruthRL ternary neutral (reward 0, no penalty)
+       weight: 0.0
+     - type: confident_wrong      # heavy penalty
+       weight: -1.0
+
+ # Trainer
+ trainer:
+   total_epochs: 1
+   total_training_steps: 5000
+   save_freq: 500
+   test_freq: 200
+   logger: ['console', 'wandb']
+   project_name: surrogate-1-v2
+   experiment_name: stage3-dapo-rlvr
+   default_local_dir: ./out/v2-stage3-dapo
+   hub_model_id: axentx/surrogate-1-coder-7b-lora-v2-rlvr
+   hub_strategy: every_save
+   push_to_hub: true
+
+ # vLLM serving for rollouts
+ vllm:
+   tensor_model_parallel_size: 1
+   enforce_eager: false
+   gpu_memory_utilization: 0.45
+   max_num_batched_tokens: 16384
+   trust_remote_code: true
+   enable_chunked_prefill: true
docs/v2-research/v2-targets-AMBITIOUS.md ADDED
@@ -0,0 +1,230 @@
+ ---
+ title: Surrogate-1 v2 Ambitious Targets — Beyond Conservative via Free Techniques
+ date: 2026-04-29
+ tags: [surrogate-1, v2, targets, ambitious, free-techniques]
+ status: ready
+ ---
+
+ # v2 Ambitious Targets (push beyond conservative via TECHNIQUE, not money)
+
+ ## Updated target table
+
+ | Domain | Conservative (initial) | **AMBITIOUS (technique-driven)** | Reference / mechanism |
+ |--------|------------------------|----------------------------------|----------------------|
+ | **LiveCodeBench v6** | 42-45% | **55-60%** | rStar-Coder 7B = 57.3% (paper-confirmed) |
+ | **HumanEval+** | ≥84% | **88-90%** | rStar-Coder + DPO + XGrammar |
+ | **MBPP+** | ≥75% | **82-85%** | same |
+ | **SWE-Bench Lite** | 25-30% | **40-45%** | DeepSWE recipe + R2E-Gym + DAPO RL |
+ | **SWE-Bench Pro** | 8-15% | **15-20%** | same + agent traces |
+ | **BFCL v3 overall** | 70-75 | **82-87** | Toucan-1.5M + xLAM + DPO + Hermes XML |
+ | **BFCL multi-turn** | 45-50 | **60-65** | When2Call DPO + agent SFT |
+ | **GAIA Level 1** | 20-30% | **35-45%** | multi-agent SFT + Letta memory |
+ | **RULER @ 32K** | 90+ | **94+** | 32K training + Liger + sample packing |
+ | **RULER @ 128K** | 80+ | **88+** | YaRN+DCA + NExtLong synth + 200M long-ctx tokens |
+ | **CodeHalu rate** | <8% | **<3%** | XGrammar + DoLa + Cite-or-Abstain + TruthRL |
+ | **Phantom imports** | <5% | **<2%** | XGrammar + AST-validity decoding |
+ | **Calibration AUC** | >0.85 | **>0.92** | Behaviorally Calibrated RL (Dec 2025 — Qwen3-4B = 0.902) |
+ | **Compile rate** | 100% | 100% | XGrammar (already perfect) |
+ | **DevSecOps custom** | 65%+ | **80%+** | validator-graded RLVR (PIPer paper) |
+ | **Cloud Eval (5-tier)** | 65% | **78%** | 250K IaC + Crossplane v2 + Terraform module distillation |
+ | **CyberMetric** | ≥75% | **≥85%** | Primus 5B continued pretrain + reasoning distill |
+ | **CTI-Bench** | ≥65% | **≥75%** | same |
+ | **CyberSOCEval** | ≥55% | **≥65%** | Sigma synth + IR runbook RLVR |
+ | **AI Eng composite** | 60-70% | **80%+** | 180K samples × 3 stages (SFT + SimPO + GRPO) |
+ | **AIOpsLab** | parity GPT-4o | **above GPT-4o on detection+localization** | 28-35K SRE SFT + sandboxed kubectl traces |
+ | **Multi-role debate** | ≥45% blind preference | **≥55%** | 100K CAMEL synth + 9-LoRA Arrow composition |
+ | **Continuous Bench** | 40% | **55%** | Devin-pattern + Manus todo.md + Aider git-as-persistence |
+ | **30-day soft launch** | ≥8/10 goals | **≥9/10 goals**, ≤3h/wk founder time | full Phase A+B+C polish |
+
+ ## How to push BEYOND conservative — technique-by-technique
+
+ ### 1. rStar-Coder (THE breakthrough for 7B coder)
+ **Paper**: [arxiv 2505.21297](https://arxiv.org/abs/2505.21297)
+
+ **What they did**:
+ - 418K competitive programming problems
+ - 580K long-reasoning solutions (CoT verified by tests)
+ - 3-step input generation + mutual verification for test cases
+ - Result: Qwen2.5-7B 17.4% → **57.3% LCB**, matches Claude 3.5 Sonnet
+
+ **Implementation for v2** (data-prep sketch below):
+ - Use the `microsoft/rStar-Coder` dataset (already in the dataset-mirror.sh v2 list — 30K samples)
+ - BUMP the allocation to 100K samples (the full set is 580K — the paper used all of it!)
+ - Train at 32K context with sample packing
+ - Long reasoning chains naturally fit (avg ~3K tokens/example)
+
+ **Expected lift**: +20-25pt on LiveCodeBench v6 alone
+
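+ A minimal data-prep sketch (it assumes the hub dataset exposes `question`/`response`-style columns and loads without a config name — verify both against the dataset card):
+
+ ```python
+ # Sketch: stream 100K rStar-Coder rows and greedily pack them into ~32K-token
+ # sequences for long-context SFT. Column names are assumptions.
+ from datasets import load_dataset
+ from transformers import AutoTokenizer
+
+ tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-7B-Instruct")
+ ds = load_dataset("microsoft/rStar-Coder", split="train", streaming=True)
+
+ packed, buf, used, BUDGET = [], [], 0, 32768
+ for i, row in enumerate(ds):
+     if i >= 100_000:
+         break
+     text = f"{row['question']}\n\n{row['response']}"
+     n = len(tok(text).input_ids)
+     if used + n > BUDGET and buf:          # flush a full pack
+         packed.append("\n\n".join(buf))
+         buf, used = [], 0
+     buf.append(text); used += n
+ ```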
+ ### 2. DeepSeek-V3 Multi-Token Prediction (MTP)
+ **Paper**: [arxiv 2412.19437](https://arxiv.org/html/2412.19437v1)
+
+ **What it does**:
+ - Auxiliary heads predict tokens 2, 3 positions ahead
+ - Maintains the causal chain (sequential prediction)
+ - Densifies the training signal (more gradients per forward pass)
+ - Bonus: speculative decoding, 1.8× speedup at inference
+
+ **Implementation for v2** (loss sketch below):
+ - Add MTP heads to LoRA training (custom Axolotl plugin)
+ - 2 auxiliary heads = 3× signal density
+ - Discard heads at inference (or repurpose for spec-decoding)
+
+ **Expected lift**: +3-5% on all coding metrics (Qwen3-Coder used MTP)
+
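+ A simplified loss sketch (DeepSeek-V3 chains small sequential modules to keep the causal chain; the parallel linear heads here are a cheaper approximation that just shows the densified objective, and the 0.3 aux weight is a guess):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ class MTPHeads(torch.nn.Module):
+     """Extra heads over trunk hidden states predicting tokens t+2 and t+3."""
+     def __init__(self, hidden: int, vocab: int, offsets=(2, 3)):
+         super().__init__()
+         self.offsets = offsets
+         self.heads = torch.nn.ModuleList(
+             torch.nn.Linear(hidden, vocab, bias=False) for _ in offsets)
+
+     def loss(self, hidden_states, labels):          # [B,T,H], [B,T]
+         total = 0.0
+         for off, head in zip(self.offsets, self.heads):
+             logits = head(hidden_states[:, :-off])  # position i predicts i+off
+             target = labels[:, off:]
+             total = total + F.cross_entropy(
+                 logits.reshape(-1, logits.size(-1)),
+                 target.reshape(-1), ignore_index=-100)
+         return total / len(self.offsets)
+
+ # total = lm_loss + 0.3 * mtp_heads.loss(hidden_states, labels)
+ ```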
+ ### 3. Magpie self-instruct (FREE 1M instructions)
+ **Paper**: [ICLR 2025](https://github.com/magpie-align/magpie)
+
+ **What it does**:
+ - Prompt an aligned LLM with ONLY its chat template (no actual prompt)
+ - Auto-regressive nature → the model generates the user query + response
+ - ZERO API cost beyond GPU hours
+ - Generated 1M-3M instructions from Llama-3-70B in ~600 GPU-hr
+
+ **Implementation for v2** (full script: `bin/v2/magpie-self-instruct.py` in this commit):
+ - Run Magpie on `Qwen2.5-Coder-32B-Instruct` (free via HF Inference or local)
+ - Generate 1M code-related instructions
+ - Cost: ~200 GPU-hr of free Lightning quota
+ - vs. Claude API for the same volume = $5,000+
+
+ **Expected**: 1M extra training samples for FREE
+
+ ### 4. DAPO RL (ByteDance/Tsinghua, BEATS GRPO)
+ **Paper**: [arxiv 2503.14476](https://arxiv.org/abs/2503.14476)
+
+ **What it does**:
+ - Decoupled clip + dynamic sampling + token-level policy gradient loss
+ - Qwen2.5-32B → 50pt AIME 2024 (better than GRPO)
+ - Open-source via the verl framework
+
+ **Implementation for v2 Stage 3** (loss sketch below; full config: `configs/v2/stage3-dapo.yml`):
+ - Replace GRPO → DAPO in stage3-rlvr.yml
+ - Same data (SWE-Gym + R2E-Gym + custom DevSecOps)
+ - The verl framework supports it out of the box
+
+ **Expected lift**: +5-8% on SWE-Bench (vs GRPO baseline)
+
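+ A sketch of the core objective (the decoupled clip range plus token-level aggregation; dynamic sampling lives in the rollout sampler, which drops prompt groups whose rollouts are all-correct or all-wrong):
+
+ ```python
+ import torch
+
+ def dapo_loss(logp, logp_old, adv, mask, eps_low=0.20, eps_high=0.28):
+     """Token-level DAPO surrogate. All args are [B, T] tensors; adv is the
+     group-relative advantage broadcast over each rollout's tokens."""
+     ratio = torch.exp(logp - logp_old)
+     # Clip-Higher: a wider upper bound keeps low-probability tokens explorable
+     clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
+     per_tok = torch.minimum(ratio * adv, clipped * adv)
+     # Token-level mean (not per-sequence): long correct chains count more
+     return -(per_tok * mask).sum() / mask.sum().clamp(min=1)
+ ```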
+ ### 5. Mergekit 9-LoRA composition (TIES + DARE)
+ **Tools**: [mergekit](https://github.com/arcee-ai/mergekit), [PEFT merging](https://huggingface.co/blog/peft_merging)
+
+ **What it does**:
+ - Combines 9 specialized LoRAs into 1 model
+ - TIES: sign consensus, drop interfering weights
+ - DARE: random prune + rescale
+ - DARE-TIES: best for 5+ adapters
+ - CPU-only, or 8 GB VRAM
+
+ **Implementation for v2 Phase B end** (full script: `bin/v2/merge-9-loras.sh` in this commit):
+ - Train 9 LoRAs separately (eng-build, eng-ops, eng-sec, etc.)
+ - Merge via DARE-TIES into a single super-LoRA
+ - vLLM serves a single model (no multi-LoRA latency)
+
+ **Expected lift**: +2-5% across all domain benchmarks (vs single LoRA)
+
+ ### 6. XGrammar default decoding (FREE structural correctness)
+ **Tool**: [XGrammar](https://github.com/mlc-ai/xgrammar) (default in vLLM 2026-04+)
+
+ **What it does**:
+ - Context-free grammar enforcement at decode time
+ - JSON / regex / custom CFG
+ - 96-98% structural correctness
+ - 5× TPOT speedup
+ - Zero training cost
+
+ **Implementation for v2 inference** (request sketch below):
+ - Already planned. Just enable: `vllm serve --guided-decoding-backend xgrammar`
+ - Define grammars per use case:
+   - Tool calls: JSON schema
+   - Code blocks: Python/Bash/SQL/Terraform/YAML grammars
+   - Output structure: Markdown headers
+
+ **Expected**: 100% syntax correctness on tool calls + code blocks
+
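+ A client-side sketch against the server from `bin/v2/serve-vllm.sh` (vLLM's OpenAI-compatible endpoint accepts a `guided_json` schema via `extra_body`; the model name and port assume that script's defaults):
+
+ ```python
+ import json
+ from pathlib import Path
+ from openai import OpenAI
+
+ # Constrain decoding with the tool-call grammar shipped in this commit.
+ schema = json.loads(Path("bin/v2/grammars/tool-call.json").read_text())
+ client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")
+
+ resp = client.chat.completions.create(
+     model="axentx/surrogate-1-coder-7b-lora-v2-merged",
+     messages=[{"role": "user", "content": "Run the repo's test suite."}],
+     extra_body={"guided_json": schema},   # XGrammar enforces this at decode
+ )
+ print(resp.choices[0].message.content)    # schema-valid JSON by construction
+ ```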
+ ### 7. NExtLong long-context curriculum (ICML 2025)
+ **Paper**: arxiv 2501.12766
+
+ **What it does**:
+ - Long sequences with HARD negatives interleaved
+ - Synthetic > human-curated for long context
+ - ~10B tokens needed (we use a 200M-500M subset)
+
+ **Implementation for v2 Stage 1** (interleaving sketch below):
+ - 60% long context (≥16K) repo-concat with FIM
+ - 40% short context
+ - Hard negatives: similar-but-incorrect code samples interleaved
+ - NExtLong synth via the free LLM ladder
+
+ **Expected**: RULER @ 128K from 80 → **88+**
+
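+ The core document-synthesis step, sketched (retrieval of the hard negatives, e.g. embedding nearest-neighbors, is assumed to happen upstream in the pipeline):
+
+ ```python
+ import random
+
+ def interleave_hard_negatives(doc_chunks: list[str],
+                               negatives: list[str],
+                               p: float = 0.5) -> str:
+     """Splice similar-but-wrong chunks between a real document's chunks so
+     long-range attention must discriminate, not just retrieve."""
+     out = []
+     for chunk in doc_chunks:
+         out.append(chunk)
+         if negatives and random.random() < p:
+             out.append(negatives.pop())
+     return "\n\n".join(out)
+ ```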
+ ### 8. Behaviorally Calibrated RL (Dec 2025)
+ **Paper**: arxiv (Dec 2025) — Qwen3-4B AUC 0.902
+
+ **What it does**:
+ - Trains the model to KNOW when it doesn't know
+ - Reward = 1 for correct + confident; 0 for refusing when uncertain
+ - Penalty for confident-wrong (TruthRL-style ternary)
+
+ **Implementation in v2 Stage 5** (reward sketch below):
+ - Already in the plan via TruthRL
+ - Add: behavioral calibration eval suite
+ - Target AUC > 0.92 (above the paper)
+
+ **Expected**: hallucination rate <3% + calibration AUC > 0.92
+
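+ A minimal sketch of the ternary reward (abstention detection here is a naive keyword check; the real grader should be a verifier, matching the `honest_idk` / `confident_wrong` entries in stage3-dapo.yml):
+
+ ```python
+ ABSTAIN_MARKERS = ("i don't know", "i'm not sure", "cannot determine")
+
+ def ternary_reward(answer: str, is_correct: bool) -> float:
+     """TruthRL-style: +1 correct, 0 honest abstention, -1 confident-wrong."""
+     if is_correct:
+         return 1.0
+     if any(m in answer.lower() for m in ABSTAIN_MARKERS):
+         return 0.0    # honest uncertainty is neutral, never punished
+     return -1.0       # heavy penalty for a confident wrong answer
+ ```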
+ ### 9. Self-Play SWE-RL (Together AI DeepSWE)
+ **Blog**: [Together DeepSWE](https://www.together.ai/blog/deepswe)
+
+ **What they did**:
+ - Generate bugs synthetically
+ - Train the model to fix them
+ - Iterative: the model gets better at finding bugs → trains on harder bugs
+ - Open recipe at [agentica-project/rllm](https://github.com/agentica-project/rllm)
+
+ **Implementation for v2 Stage 4-5 (post Phase B)** (pair-generation sketch below):
+ - Self-play loop: bug-injector model + bug-fixer model
+ - Both start from the Phase B artifact
+ - They diverge over time
+
+ **Expected lift**: SWE-Bench Lite +5-10pp
+
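+ How one verifiable (buggy, fixed) pair might be produced (`propose_bug` stands in for an injector-model call; keeping only mutations that actually break the tests is what makes the reward verifiable):
+
+ ```python
+ import pathlib
+ import subprocess
+
+ def make_pair(repo: str, rel_path: str, propose_bug) -> dict | None:
+     src = pathlib.Path(repo, rel_path)
+     clean = src.read_text()
+     buggy = propose_bug(clean)            # injector model mutates the file
+     src.write_text(buggy)
+     failing = subprocess.run(["pytest", "-x", "-q"], cwd=repo,
+                              capture_output=True).returncode != 0
+     src.write_text(clean)                 # always restore the checkout
+     if not failing:
+         return None                       # no-op mutation — discard
+     return {"prompt": f"Fix the failing tests in {rel_path}:\n{buggy}",
+             "target": clean}
+ ```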
+ ### 10. Stack-Edu / FineWeb-Edu classifier filtering
+ **Tools**: HuggingFaceTB/stack-edu-classifier-python, fineweb-edu-classifier
+
+ **What it does**:
+ - Scores each code/text sample 1-5 for educational quality
+ - Train only on threshold ≥3 (the Phi-4 method)
+
+ **Implementation for v2 data pipeline** (scoring sketch below):
+ - Already in the dedup-decontaminate.py plan
+ - Apply BEFORE the final SFT mix
+ - Drop the ~30% lowest-quality samples
+
+ **Expected lift**: +2-3% on HumanEval+ from cleaner data
+
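+ A scoring sketch with the FineWeb-Edu classifier (a single-logit regression head per its model card; the Stack-Edu code classifiers follow the same pattern):
+
+ ```python
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ NAME = "HuggingFaceTB/fineweb-edu-classifier"
+ tok = AutoTokenizer.from_pretrained(NAME)
+ clf = AutoModelForSequenceClassification.from_pretrained(NAME)
+
+ def edu_score(text: str) -> float:
+     """Educational-quality score, roughly 0-5; keep samples scoring >= 3."""
+     inputs = tok(text, return_tensors="pt", truncation=True)
+     with torch.no_grad():
+         return clf(**inputs).logits.squeeze(-1).item()
+
+ # kept = [s for s in mix if edu_score(s["text"]) >= 3.0]
+ ```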
+ ---
+
+ ## Compute & cost (NO Anthropic API)
+
+ | Item | Cost | Source |
+ |------|------|--------|
+ | HF PRO | $9/mo | HuggingFace |
+ | Wasabi 1 TB | $6/mo | Wasabi |
+ | Lightning H200 | free 80hr/mo (ashiradevops + ashirapit) | Lightning |
+ | Anthropic API | **$0** ❌ removed | replaced by the free LLM ladder |
+ | Synth data gen | $0 | Cerebras qwen-3-235b + Groq llama-3.3-70b free + Magpie self-instruct |
+ | Extra GPU compute | $0-200 (RunPod spot, only if Lightning is exhausted) | optional |
+
+ **Total**: $15/mo + $0-200 one-time (down from the prior $1,700-3,800 estimate)
+
+ ## v2 Phase Map (revised)
+
+ | Phase | Weeks | Output | Cost |
+ |-------|-------|--------|------|
+ | **A**: Code+Tool+Agent SFT/DPO | 4 | `surrogate-1-coder-7b-lora-v2-mvp` | $0-200 |
+ | **A+**: rStar-Coder 100K + Magpie 1M continued SFT | +1 | bigger lift on LCB | free |
+ | **B**: 9-LoRA cluster expertise (parallel) | 4 | 9 LoRAs | $200-500 (parallel) |
+ | **B+**: DARE-TIES merge → super-LoRA | 0.5 | 1 merged LoRA | free (CPU) |
+ | **C**: DAPO RLVR + TruthRL | 2-3 | RL polish | $200-500 |
+ | **C+**: Self-Play SWE-RL bug inject/fix | 1-2 | iterative improvement | free (Lightning) |
+
+ **Total: 12-15 weeks / $400-1,200 / no Anthropic API**
+