File size: 10,733 Bytes
19be69c
 
b772ad8
19be69c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b772ad8
19be69c
 
 
b772ad8
19be69c
 
 
 
b772ad8
19be69c
 
 
 
 
 
 
 
 
 
b772ad8
19be69c
 
 
 
 
 
b772ad8
19be69c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b772ad8
19be69c
 
 
 
 
 
b772ad8
19be69c
b772ad8
19be69c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b772ad8
19be69c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b772ad8
19be69c
 
 
 
 
 
b772ad8
19be69c
 
 
 
 
 
 
 
b772ad8
19be69c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b772ad8
19be69c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b772ad8
19be69c
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
# Surrogate-1 v2 β€” Phase A Runbook (Ready to Execute)

**Goal**: Ship `axentx/surrogate-1-coder-7b-v2-mvp` in 4 weeks, $400 cash, free Lightning H200.

---

## Pre-flight Checklist

### Account/billing
- [ ] HF PRO subscribed (`surrogate1` user, $9/mo)
- [ ] Wasabi S3-compatible bucket created (`axentx-surrogate-1-data`, ~$6/mo)
- [ ] Lightning ashirapit verified (web onboarding) OR ashiradevops quota refreshed (next month)
- [ ] Anthropic API credit β‰₯$300 (for synth orchestrator traces)

### Infrastructure
- [ ] Sanitizer deployed (commit 1dfdc54 β€” DONE)
- [ ] Sanitizer also runs on Mac for any local data prep
- [ ] Wasabi access keys saved in ~/.hermes/.env as `WASABI_ACCESS_KEY` + `WASABI_SECRET`
- [ ] HF dataset repo `axentx/surrogate-1-v2-train` created (private)

---

## Day-by-Day Execution

### Week 1 β€” Data Pipeline

**Day 1**: Mirror datasets β†’ Wasabi
```bash
# On HF Space (NOT on Mac), use existing dataset-mirror.sh + new sanitizer:
SOURCES=(
    "microsoft/rStar-Coder|30000"
    "nvidia/OpenCodeReasoning-2|20000"
    "nvidia/OpenCodeInstruct|10000"        # filter avg_test_score>=0.7
    "inclusionAI/Ling-Coder-SFT|10000"
    "OpenCoder-LLM/opc-sft-stage1|5000"
    "OpenCoder-LLM/opc-sft-stage2|5000"
    "bigcode/self-oss-instruct-sc2-exec-filter-50k|50000"
    "m-a-p/CodeFeedback-Filtered-Instruction|10000"
)
# Use HF Space because mirror needs HF API which gets rate limited from Mac
# Output: Wasabi bucket axentx-surrogate-1-data/raw/<source>/
```

**Day 2-3**: Tool-use datasets
```bash
TOOL_SOURCES=(
    "NousResearch/hermes-function-calling-v1|7930"
    "Salesforce/xlam-function-calling-60k|30000"
    "Agent-Ark/Toucan-1.5M|80000"           # Kimi-K2 subset
    "nvidia/When2Call|15000"
    "Nanbeige/ToolMind|10000"
    "nvidia/Nemotron-SWE-v1|5000"
    "SWE-Gym/OpenHands-Sampled-Trajectories|2400"
)
# Convert all β†’ Hermes XML format (chat_template <tool_call>...</tool_call>)
```

**Day 4**: Multi-agent datasets
```bash
AGENT_SOURCES=(
    "lambda/hermes-agent-reasoning-traces|14000"
    "nebius/SWE-agent-trajectories|5000"
    "SWE-Gym/SWE-Gym|400"                   # successful only
    "microsoft/orca-agentinstruct-1M-v1|1500"
)
```

**Day 5**: Synthesize 500 orchestrator→subagent traces
```python
# Use Anthropic API (Claude Opus 4 + Sonnet 4) β€” ~$200
import anthropic
client = anthropic.Anthropic()
SCENARIOS = [load 500 startup tasks]
for scenario in SCENARIOS:
    # Step 1: Opus generates orchestrator plan with subagent spawns
    plan = opus.create(...)
    # Step 2: Sonnet plays each subagent with own context
    subagent_outputs = [sonnet.create(...) for s in plan.subagents]
    # Step 3: Opus aggregates results
    final = opus.create(...)
    # Save trajectory in ChatML format
```

**Day 6**: DPO data construction
```bash
DPO_SOURCES=(
    "Vezora/Code-Preference-Pairs|55000"           # bug/no-bug
    "argilla/distilabel-capybara-dpo-7k-binarized|7000"
    "nvidia/When2Call|15000"                       # train_pref
)
# Plus: rejection-sampled exec-graded
# Sample 4 completions per prompt @ temp=1.0 from base, run pytest+lint+security
# Pairs = (passing, failing) where applicable
```

**Day 7**: Sanitize + Dedup + Decontaminate
```bash
# Pipeline
1. SHA-256 exact dedup
2. MinHash LSH 256-perm 5-gram threshold 0.7 (datatrove)
3. Decontaminate vs HumanEval+/MBPP+/LiveCodeBench/SWE-Bench
4. Apply sanitize.py (10 categories + PII NER + secrets)
5. AST validity (tree-sitter)
6. Stack-Edu classifier threshold 3
7. OpenCoder heuristics ~100 rules
# Push final β†’ axentx/surrogate-1-v2-train (private HF) + Wasabi backup
```

**Day 7 deliverable**: ~250K curated training samples, sanitized, decontaminated.

---

### Week 2 β€” Stage 1 SFT + Tool-SFT + Multi-Agent SFT

**Day 8-9**: Stage 1 β€” Code SFT
```yaml
# Lightning H200 if quota; else RunPod spot ~$80
# /tmp/v2-stage1.yaml
base_model: Qwen/Qwen2.5-Coder-7B-Instruct
load_in_4bit: true
adapter: lora
lora_r: 64
lora_alpha: 128
lora_dropout: 0.05
peft_use_dora: true
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

sequence_len: 32768
sample_packing: true
rope_theta: 1000000.0
rope_scaling:
  type: yarn
  factor: 4.0
  original_max_position_embeddings: 32768

datasets:
  - path: axentx/surrogate-1-v2-train
    type: chat_template
    field_messages: messages

num_epochs: 3
micro_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 1.0e-4
lr_scheduler: cosine
warmup_ratio: 0.03
optimizer: adamw_torch_fused
bf16: true
gradient_checkpointing: true
flash_attention: true
liger_kernel: true     # 30-40% memory reduction

hub_model_id: axentx/surrogate-1-coder-7b-v2-sft
hub_strategy: every_save
push_to_hub: true
```
**ETA**: ~12-15 hr H200 β†’ push `axentx/surrogate-1-coder-7b-v2-sft`

**Day 10**: Stage 1.5 β€” Tool-Use SFT (continue from Stage 1 LoRA)
```yaml
# /tmp/v2-stage15.yaml
base_model: axentx/surrogate-1-coder-7b-v2-sft
adapter: lora    # continue same LoRA
# Same r=64, all-linear, DoRA

datasets:
  - path: axentx/surrogate-1-v2-tools  # 102K Hermes-XML formatted
    type: chat_template
    field_messages: messages

num_epochs: 2
learning_rate: 1.0e-4
hub_model_id: axentx/surrogate-1-coder-7b-v2-toolsft
```
**ETA**: ~8 hr β†’ push toolsft

**Day 11-12**: Stage 1.6 β€” Multi-Agent SFT
```yaml
# /tmp/v2-stage16.yaml
base_model: axentx/surrogate-1-coder-7b-v2-toolsft
adapter: lora

# Tools to teach via system prompt
system_prompt: |
  You are Surrogate-1. You have these tools available:
  - spawn_subagent(role: str, prompt: str, max_steps: int = 10)
  - receive_results(subagent_id: str)
  - scratchpad_write(key: str, value: str)
  - scratchpad_read(key: str)
  - skill_recall(query: str) -> top_5_skills
  - reflexion_log(error_type, root_cause, prevention)
  - code_exec(language, code)
  - file_read, file_edit (unified diff)
  - shell_exec, search_repo
  
  Decision rules:
  - If task has 3+ independent steps β†’ spawn 2-5 subagents in parallel
  - If task is sequential β†’ solo with self-refine
  - If irreversible action (rm -rf, terraform destroy, payments) β†’ ALWAYS ask
  - If confidence < 0.6 β†’ ask user

datasets:
  - path: axentx/surrogate-1-v2-agent     # 20K + 500 synth orchestrator
    type: chat_template

num_epochs: 2
learning_rate: 1.0e-4
hub_model_id: axentx/surrogate-1-coder-7b-v2-agent
```
**ETA**: ~10 hr β†’ push agent

**Day 13**: Eval Tier 1 smoke tests
```bash
# Don't regress base
evalplus.evaluate --model axentx/surrogate-1-coder-7b-v2-agent --dataset humaneval --backend vllm --greedy
# Target β‰₯84%
evalplus.evaluate --model axentx/surrogate-1-coder-7b-v2-agent --dataset mbpp --backend vllm --greedy
# Target β‰₯75%

# Primary metric
python -m lcb_runner.runner.main --model axentx/... --release_version release_v6
# Target β‰₯42%

# Tool use
gorilla-cli bfcl --model axentx/... --test-category all --backend vllm
# Target overall β‰₯70

# Long context
ruler eval --model axentx/... --max-len 32768 --tasks all
# Target β‰₯90
```

**Day 14**: Iterate or proceed
- If smoke tests pass β†’ proceed to Stage 2 (Day 15)
- If regression β†’ diagnose (data quality? hyperparam? overtraining?)

---

### Week 3 β€” Stage 2 + 2.5 DPO

**Day 15-16**: Build exec-graded preference data
```python
# For 5K hard prompts from training set:
# - Sample 4 completions each from agent model @ temp=1.0
# - Run pytest, hadolint, tflint, semgrep, prowler
# - Pairs = (highest-validator-score, lowest-validator-score)
# Output: 20K-50K pairs to axentx/surrogate-1-v2-dpo-codeexec
```

**Day 17**: Stage 2 β€” Code DPO
```yaml
# /tmp/v2-stage2.yaml
base_model: axentx/surrogate-1-coder-7b-v2-agent
adapter: lora    # continue

rl: dpo
rl_beta: 0.1
dpo_loss_type: focused           # 2502.11475
dpo_label_smoothing: 0.0

datasets:
  - path: Vezora/Code-Preference-Pairs
    type: dpo
  - path: argilla/distilabel-capybara-dpo-7k-binarized
    type: dpo
  - path: axentx/surrogate-1-v2-dpo-codeexec
    type: dpo

learning_rate: 5e-6
lr_scheduler: constant
num_epochs: 1
hub_model_id: axentx/surrogate-1-coder-7b-v2-dpo
```
**ETA**: ~5 hr β†’ push dpo

**Day 18**: Stage 2.5 β€” Tool DPO
```yaml
# /tmp/v2-stage25.yaml
base_model: axentx/surrogate-1-coder-7b-v2-dpo
rl: dpo

datasets:
  - path: nvidia/When2Call/train_pref  # refusal vs forced-tool-use
    type: dpo

learning_rate: 5e-6
num_epochs: 1
hub_model_id: axentx/surrogate-1-coder-7b-v2-mvp     # final MVP push
```
**ETA**: ~3 hr β†’ final push

**Day 19-21**: Full Tier-1 + Tier-2 evals
```bash
# Tier 1 (every checkpoint)
- EvalPlus HumanEval+ + MBPP+
- LiveCodeBench v6
- BFCL v3
- RULER 32K + 128K

# Tier 2 (monthly)
- SWE-Bench Lite
- Custom DevSecOps eval (Dockerfile/K8s/TF/Bash/CVE Γ— 280 tasks)
- GAIA Level 1
- 100-sample no-leak spot check
```

---

### Week 4 β€” Iterate or Phase B

**Day 22-28**: Triage + decide

**If targets hit (LCB β‰₯42% + SWE-Bench Lite β‰₯25% + BFCL β‰₯70 + no leaks)**:
- Tag `axentx/surrogate-1-coder-7b-v2-mvp` as official MVP
- Update outcome.md + knowledge_index.md
- Plan Phase B (cluster expertise) β€” start Week 5

**If targets miss**:
- Identify weakest area (data quality? hyperparam? not enough epochs?)
- Re-train specific stage with adjustment
- Don't blast all stages β€” pinpoint fix

---

## Phase A Total Resource Estimate

| Item | Cost | Free? |
|------|------|-------|
| H200 compute ~50-60 hr | $0-200 | Lightning quota |
| Synth orchestrator (Claude API) | $200 | no |
| Wasabi storage | $6/mo | no |
| HF PRO | $9/mo | no |
| **Total cash** | **~$400 + $15/mo** | |

---

## Risk Register

| Risk | Probability | Mitigation |
|------|-------------|------------|
| Free Lightning quota runs out mid-training | Med | RunPod spot H100 backup ~$2/hr |
| HF API rate limit during data load | Med | Use Wasabi mirror; HF PRO 20Γ— helps |
| LCB v6 doesn't improve | Med | Re-curate data; check for over-training |
| Multi-agent SFT destabilizes prior capabilities | Low-Med | Eval gate; rollback if regression >5% |
| Synth orchestrator data quality poor | Med | Sample inspect 50 traces; require dual-validator (Opus + Sonnet) |
| Tool-use trains but BFCL low | Med | Verify Hermes XML format roundtrip; check chat_template |
| Sanitizer over-filters | Low | Stats from first 100K rows; tune patterns |

---

## Success Definition (Phase A done)

- [ ] `axentx/surrogate-1-coder-7b-v2-mvp` pushed to HF Hub
- [ ] HumanEval+ β‰₯84%, MBPP+ β‰₯75% (no regression)
- [ ] **LiveCodeBench v6 β‰₯42%** (primary)
- [ ] **SWE-Bench Lite β‰₯25%** (primary)
- [ ] **BFCL v3 overall β‰₯70** (primary)
- [ ] RULER @ 32K β‰₯90
- [ ] No data leakage in 100 random inference samples
- [ ] outcome.md + knowledge_index.md updated

Then β†’ Phase B (cluster expertise) starts Week 5.