# ACO Roadmap

## Completed (v1-v11)

- [x] Normalized trace schema
- [x] Synthetic trace generator (10K traces)
- [x] Cost telemetry collector
- [x] Task cost classifier
- [x] Model cascade router (XGBoost per-tier)
- [x] Context budgeter
- [x] Cache-aware prompt layout
- [x] Tool-use cost gate
- [x] Verifier budgeter
- [x] Retry/recovery optimizer
- [x] Meta-tool miner
- [x] Early termination detector
- [x] Execution-feedback router (entropy cascade)
- [x] Per-step routing
- [x] Real benchmark evaluation (SWE-bench, BFCL)
- [x] Ablation study on real data
- [x] Literature review
- [x] Deployment guide
- [x] Technical blog post
- [x] Final report
- [x] Model cards
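The execution-feedback router above escalates through the model cascade when the cheap model's token-level uncertainty is high. A minimal sketch of that idea — the `(answer, top-k logprobs)` model interface and the threshold value are illustrative assumptions, not the repository's actual API:

```python
import math

def mean_token_entropy(topk_logprobs):
    """Average per-token entropy (nats), computed from top-k logprobs.

    topk_logprobs: one list of top-k log-probabilities per generated token,
    the shape most provider APIs return. The top-k mass is renormalized.
    """
    entropies = []
    for lps in topk_logprobs:
        probs = [math.exp(lp) for lp in lps]
        z = sum(probs)  # renormalize over the retained top-k mass
        entropies.append(-sum((p / z) * math.log(p / z) for p in probs))
    return sum(entropies) / len(entropies)

def entropy_cascade(task, tiers, threshold=1.0):
    """Run tiers cheapest-first; accept the first answer whose mean token
    entropy is at or below the threshold, else fall through to the most
    capable tier's answer."""
    for model in tiers:
        answer, topk_logprobs = model(task)
        if mean_token_entropy(topk_logprobs) <= threshold:
            return answer
    return answer
```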

## In Progress

- [ ] Fine-tuned DistilBERT router (cloud job training on SPROUT)
- [ ] Gradio dashboard with real benchmark numbers

## Next Priority (CPU-friendly)

- [ ] Conformal calibration of escalation thresholds
- [ ] Cost-quality Pareto frontier visualization
- [ ] JSON schema validation for traces
- [ ] Unit tests for all 11 modules
- [ ] Integration test suite
- [ ] Example notebooks
- [ ] Provider adapter examples (OpenAI, Anthropic, local)
- [ ] Config file validator
- [ ] CLI improvements (batch routing, cost estimation)

## Next Priority (GPU needed)

- [ ] Execution-feedback with real model logprobs
- [ ] Best-of-N cheap sampling with reward model
- [ ] Fine-tuned BERT per-step router
- [ ] Process reward model for selective verification
- [ ] Real agent benchmarks (SWE-bench Live, WebArena)
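The best-of-N item above can be sketched as follows; `cheap_sample`, `reward_model`, and `escalate` are hypothetical callables standing in for the real components:

```python
def best_of_n(task, cheap_sample, reward_model, escalate, n=8, min_reward=0.5):
    """Draw n samples from the cheap model, keep the candidate the reward
    model scores highest, and escalate to a stronger model only when even
    the best candidate scores below min_reward."""
    candidates = [cheap_sample(task) for _ in range(n)]
    best = max(candidates, key=lambda c: reward_model(task, c))
    if reward_model(task, best) >= min_reward:
        return best
    return escalate(task)
```

In practice `cheap_sample` would call the cheap model at a nonzero temperature so the n candidates actually differ, which is why this item needs GPU-backed sampling.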

## Long-term

- [ ] Learned context selector (vs heuristic budgeter)
- [ ] Workflow mining from real traces
- [ ] Online learning from new traces
- [ ] Multi-agent cost optimization
- [ ] Provider-aware routing (cost/latency/availability)
- [ ] Budget-constrained decoding
- [ ] Cross-task transfer learning

## Known Limitations

- Router trained on SPROUT + SWE-Router only (need more domains)
- Execution feedback uses simulated logprobs (need real model outputs)
- No conformal guarantees on quality (hand-tuned thresholds)
- Per-step routing not yet integrated with v11 XGBoost
- Cache-aware layout not benchmarked with real providers
- No real agent harness integration tested end-to-end

## Headroom

An oracle router on SWE-bench shows an 80.3% cost reduction is achievable; v11 achieves 36.9%. The remaining 43.4 percentage points are expected to come from (rough estimates):
- Better per-step routing (~10%)
- Real execution feedback (~10%)
- Best-of-N cheap sampling (~8%)
- Conformal calibration (~5%)
- More training data from more domains (~10%)
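As a quick arithmetic check, the per-component estimates above roughly account for the oracle gap:

```python
oracle_reduction = 80.3   # oracle cost reduction on SWE-bench (%)
v11_reduction = 36.9      # cost reduction v11 achieves (%)
gap = oracle_reduction - v11_reduction  # 43.4 points of headroom

# Rough per-component estimates from the list above (percentage points).
estimates = {
    "per-step routing": 10,
    "real execution feedback": 10,
    "best-of-N cheap sampling": 8,
    "conformal calibration": 5,
    "more training domains": 10,
}
assert round(gap, 1) == 43.4
assert sum(estimates.values()) == 43  # ~43 of the 43.4 points accounted for
```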