# Surrogate-1 v2 – Phase A Runbook (Ready to Execute)

**Goal**: Ship `axentx/surrogate-1-coder-7b-v2-mvp` in 4 weeks, $400 cash, free Lightning H200.

---

## Pre-flight Checklist

### Account/billing

- [ ] HF PRO subscribed (`surrogate1` user, $9/mo)
- [ ] Wasabi S3-compatible bucket created (`axentx-surrogate-1-data`, ~$6/mo)
- [ ] Lightning ashirapit verified (web onboarding) OR ashiradevops quota refreshed (next month)
- [ ] Anthropic API credit ≥$300 (for synth orchestrator traces)

### Infrastructure

- [ ] Sanitizer deployed (commit 1dfdc54 – DONE)
- [ ] Sanitizer also runs on the Mac for any local data prep
- [ ] Wasabi access keys saved in `~/.hermes/.env` as `WASABI_ACCESS_KEY` + `WASABI_SECRET`
- [ ] HF dataset repo `axentx/surrogate-1-v2-train` created (private)

---
## Day-by-Day Execution

### Week 1 – Data Pipeline

**Day 1**: Mirror datasets → Wasabi
```bash
# On the HF Space (NOT on the Mac), use the existing dataset-mirror.sh + the new sanitizer:
SOURCES=(
  "microsoft/rStar-Coder|30000"
  "nvidia/OpenCodeReasoning-2|20000"
  "nvidia/OpenCodeInstruct|10000"                       # filter avg_test_score>=0.7
  "inclusionAI/Ling-Coder-SFT|10000"
  "OpenCoder-LLM/opc-sft-stage1|5000"
  "OpenCoder-LLM/opc-sft-stage2|5000"
  "bigcode/self-oss-instruct-sc2-exec-filter-50k|50000"
  "m-a-p/CodeFeedback-Filtered-Instruction|10000"
)
# Run on the HF Space because the mirror needs the HF API, which gets rate-limited from the Mac.
# Output: Wasabi bucket axentx-surrogate-1-data/raw/<source>/
```
**Day 2-3**: Tool-use datasets

```bash
TOOL_SOURCES=(
  "NousResearch/hermes-function-calling-v1|7930"
  "Salesforce/xlam-function-calling-60k|30000"
  "Agent-Ark/Toucan-1.5M|80000"                        # Kimi-K2 subset
  "nvidia/When2Call|15000"
  "Nanbeige/ToolMind|10000"
  "nvidia/Nemotron-SWE-v1|5000"
  "SWE-Gym/OpenHands-Sampled-Trajectories|2400"
)
# Convert all sources to the Hermes XML format (chat_template <tool_call>...</tool_call>)
```
**Day 4**: Multi-agent datasets

```bash
AGENT_SOURCES=(
  "lambda/hermes-agent-reasoning-traces|14000"
  "nebius/SWE-agent-trajectories|5000"
  "SWE-Gym/SWE-Gym|400"                                # successful only
  "microsoft/orca-agentinstruct-1M-v1|1500"
)
```
**Day 5**: Synthesize 500 orchestrator→subagent traces

```python
# Use the Anthropic API (Claude Opus 4 as orchestrator, Sonnet 4 as subagents) – ~$200.
# load_scenarios / parse_subagents / save_chatml are project-specific helpers.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
scenarios = load_scenarios()    # 500 startup tasks

for scenario in scenarios:
    # Step 1: Opus generates the orchestrator plan with subagent spawns
    plan = client.messages.create(...)                 # model="claude-opus-4-..."
    # Step 2: Sonnet plays each subagent with its own context
    subagent_outputs = [client.messages.create(...)    # model="claude-sonnet-4-..."
                        for s in parse_subagents(plan)]
    # Step 3: Opus aggregates the subagent results
    final = client.messages.create(...)
    # Save the full trajectory in ChatML format
    save_chatml(scenario, plan, subagent_outputs, final)
```
**Day 6**: DPO data construction

```bash
DPO_SOURCES=(
  "Vezora/Code-Preference-Pairs|55000"                 # bug/no-bug
  "argilla/distilabel-capybara-dpo-7k-binarized|7000"
  "nvidia/When2Call|15000"                             # train_pref
)
# Plus: rejection-sampled, execution-graded pairs.
# Sample 4 completions per prompt @ temp=1.0 from the base model; run pytest + lint + security checks.
# Pairs = (passing, failing) where applicable.
```
**Day 7**: Sanitize + Dedup + Decontaminate

```bash
# Pipeline:
# 1. SHA-256 exact dedup
# 2. MinHash LSH, 256 perms, 5-grams, threshold 0.7 (datatrove)
# 3. Decontaminate vs HumanEval+/MBPP+/LiveCodeBench/SWE-Bench
# 4. Apply sanitize.py (10 categories + PII NER + secrets)
# 5. AST validity (tree-sitter)
# 6. Stack-Edu classifier, threshold 3
# 7. OpenCoder heuristics, ~100 rules
# Push the final set to axentx/surrogate-1-v2-train (private HF) + Wasabi backup
```
**Day 7 deliverable**: ~250K curated training samples, sanitized and decontaminated.
---

### Week 2 – Stage 1 SFT + Tool-SFT + Multi-Agent SFT

**Day 8-9**: Stage 1 – Code SFT

```yaml
# Lightning H200 if quota allows; else RunPod spot, ~$80
# /tmp/v2-stage1.yaml
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
load_in_4bit: true
adapter: lora
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
peft_use_dora: true
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

sequence_len: 32768
sample_packing: true
rope_theta: 1000000.0
rope_scaling:
  type: yarn
  factor: 4.0
  original_max_position_embeddings: 32768

datasets:
  - path: axentx/surrogate-1-v2-train
    type: chat_template
    field_messages: messages

num_epochs: 3
micro_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 1.0e-4
lr_scheduler: cosine
warmup_ratio: 0.03
optimizer: adamw_torch_fused
bf16: true
gradient_checkpointing: true
flash_attention: true
liger_kernel: true  # 30-40% memory reduction

hub_model_id: axentx/surrogate-1-coder-7b-v2-sft
hub_strategy: every_save
push_to_hub: true
```
**ETA**: ~12-15 hr on the H200 → push `axentx/surrogate-1-coder-7b-v2-sft`

**Day 10**: Stage 1.5 – Tool-Use SFT (continue from the Stage 1 LoRA)

```yaml
# /tmp/v2-stage15.yaml
base_model: axentx/surrogate-1-coder-7b-v2-sft
adapter: lora  # continue the same LoRA
# Same r=64, all-linear targets, DoRA
datasets:
  - path: axentx/surrogate-1-v2-tools  # 102K Hermes-XML-formatted samples
    type: chat_template
    field_messages: messages
num_epochs: 2
learning_rate: 1.0e-4
hub_model_id: axentx/surrogate-1-coder-7b-v2-toolsft
```

**ETA**: ~8 hr → push `axentx/surrogate-1-coder-7b-v2-toolsft`
**Day 11-12**: Stage 1.6 – Multi-Agent SFT

```yaml
# /tmp/v2-stage16.yaml
base_model: axentx/surrogate-1-coder-7b-v2-toolsft
adapter: lora
# Tools to teach via the system prompt
system_prompt: |
  You are Surrogate-1. You have these tools available:
  - spawn_subagent(role: str, prompt: str, max_steps: int = 10)
  - receive_results(subagent_id: str)
  - scratchpad_write(key: str, value: str)
  - scratchpad_read(key: str)
  - skill_recall(query: str) -> top_5_skills
  - reflexion_log(error_type, root_cause, prevention)
  - code_exec(language, code)
  - file_read, file_edit (unified diff)
  - shell_exec, search_repo
  Decision rules:
  - If the task has 3+ independent steps -> spawn 2-5 subagents in parallel
  - If the task is sequential -> solo with self-refine
  - If the action is irreversible (rm -rf, terraform destroy, payments) -> ALWAYS ask
  - If confidence < 0.6 -> ask the user
datasets:
  - path: axentx/surrogate-1-v2-agent  # 20K samples + 500 synth orchestrator traces
    type: chat_template
num_epochs: 2
learning_rate: 1.0e-4
hub_model_id: axentx/surrogate-1-coder-7b-v2-agent
```
**ETA**: ~10 hr → push `axentx/surrogate-1-coder-7b-v2-agent`

**Day 13**: Eval Tier 1 smoke tests

```bash
# Don't regress the base model
evalplus.evaluate --model axentx/surrogate-1-coder-7b-v2-agent --dataset humaneval --backend vllm --greedy
# Target ≥84%
evalplus.evaluate --model axentx/surrogate-1-coder-7b-v2-agent --dataset mbpp --backend vllm --greedy
# Target ≥75%

# Primary metric
python -m lcb_runner.runner.main --model axentx/... --release_version release_v6
# Target ≥42%

# Tool use
gorilla-cli bfcl --model axentx/... --test-category all --backend vllm
# Target overall ≥70

# Long context
ruler eval --model axentx/... --max-len 32768 --tasks all
# Target ≥90
```

**Day 14**: Iterate or proceed

- If smoke tests pass → proceed to Stage 2 (Day 15)
- If there is a regression → diagnose (data quality? hyperparams? overtraining?)

---
### Week 3 – Stage 2 + 2.5 DPO

**Day 15-16**: Build exec-graded preference data

```python
# For 5K hard prompts from the training set:
# - Sample 4 completions each from the agent model @ temp=1.0
# - Run pytest, hadolint, tflint, semgrep, prowler
# - Pairs = (highest-validator-score, lowest-validator-score)
# Output: 20K-50K pairs to axentx/surrogate-1-v2-dpo-codeexec
```
**Day 17**: Stage 2 – Code DPO

```yaml
# /tmp/v2-stage2.yaml
base_model: axentx/surrogate-1-coder-7b-v2-agent
adapter: lora  # continue
rl: dpo
rl_beta: 0.1
dpo_loss_type: focused  # arXiv:2502.11475
dpo_label_smoothing: 0.0
datasets:
  - path: Vezora/Code-Preference-Pairs
    type: dpo
  - path: argilla/distilabel-capybara-dpo-7k-binarized
    type: dpo
  - path: axentx/surrogate-1-v2-dpo-codeexec
    type: dpo
learning_rate: 5e-6
lr_scheduler: constant
num_epochs: 1
hub_model_id: axentx/surrogate-1-coder-7b-v2-dpo
```
**ETA**: ~5 hr → push `axentx/surrogate-1-coder-7b-v2-dpo`

**Day 18**: Stage 2.5 – Tool DPO

```yaml
# /tmp/v2-stage25.yaml
base_model: axentx/surrogate-1-coder-7b-v2-dpo
rl: dpo
datasets:
  - path: nvidia/When2Call/train_pref  # refusal vs forced tool use
    type: dpo
learning_rate: 5e-6
num_epochs: 1
hub_model_id: axentx/surrogate-1-coder-7b-v2-mvp  # final MVP push
```

**ETA**: ~3 hr → final push

**Day 19-21**: Full Tier-1 + Tier-2 evals

- Tier 1 (every checkpoint):
  - EvalPlus HumanEval+ and MBPP+
  - LiveCodeBench v6
  - BFCL v3
  - RULER at 32K and 128K
- Tier 2 (monthly):
  - SWE-Bench Lite
  - Custom DevSecOps eval (Dockerfile/K8s/TF/Bash/CVE × 280 tasks)
  - GAIA Level 1
  - 100-sample no-leak spot check
---

### Week 4 – Iterate or Phase B

**Day 22-28**: Triage + decide

**If targets hit (LCB ≥42% + SWE-Bench Lite ≥25% + BFCL ≥70 + no leaks)**:

- Tag `axentx/surrogate-1-coder-7b-v2-mvp` as the official MVP
- Update outcome.md + knowledge_index.md
- Plan Phase B (cluster expertise) → start Week 5

**If targets missed**:

- Identify the weakest area (data quality? hyperparams? not enough epochs?)
- Re-train the specific stage with an adjustment
- Don't blast all stages – pinpoint the fix

---

## Phase A Total Resource Estimate

| Item | Cost | Free? |
|------|------|-------|
| H200 compute, ~50-60 hr | $0-200 | Lightning quota |
| Synth orchestrator (Claude API) | $200 | no |
| Wasabi storage | $6/mo | no |
| HF PRO | $9/mo | no |
| **Total cash** | **~$400 + $15/mo** | |

---
## Risk Register

| Risk | Probability | Mitigation |
|------|-------------|------------|
| Free Lightning quota runs out mid-training | Med | RunPod spot H100 backup, ~$2/hr |
| HF API rate limit during data load | Med | Use the Wasabi mirror; HF PRO's 20× limits help |
| LCB v6 doesn't improve | Med | Re-curate data; check for over-training |
| Multi-agent SFT destabilizes prior capabilities | Low-Med | Eval gate; roll back if regression >5% |
| Synth orchestrator data quality poor | Med | Sample-inspect 50 traces; require dual validation (Opus + Sonnet) |
| Tool use trains but BFCL stays low | Med | Verify the Hermes XML round-trip; check the chat_template |
| Sanitizer over-filters | Low | Collect stats on the first 100K rows; tune the patterns |
---

## Success Definition (Phase A done)

- [ ] `axentx/surrogate-1-coder-7b-v2-mvp` pushed to the HF Hub
- [ ] HumanEval+ ≥84%, MBPP+ ≥75% (no regression)
- [ ] **LiveCodeBench v6 ≥42%** (primary)
- [ ] **SWE-Bench Lite ≥25%** (primary)
- [ ] **BFCL v3 overall ≥70** (primary)
- [ ] RULER @ 32K ≥90
- [ ] No data leakage in 100 random inference samples
- [ ] outcome.md + knowledge_index.md updated

Then → Phase B (cluster expertise) starts Week 5.