| # ACO: Agent Cost Optimizer |
|
|
| A universal control layer that reduces autonomous agent cost while preserving task quality. |
|
|
| ## Quick Results (SWE-bench, 500 coding tasks, 8 real models) |
|
|
| Policy | Success | Cost/Task | Cost Reduction |
| |--------|---------|-----------|---------| |
| | Oracle | 87.0% | $0.062 | 80.3% | |
| | **v10+feedback** | **84.8%** | **$0.201** | **36.4%** | |
| | v10 direct | 76.6% | $0.188 | 40.7% | |
| | Always frontier | 78.2% | $0.317 | baseline | |
| | Always cheap | 63.2% | $0.014 | 95.5% | |
|
|
**Key finding: v10+feedback strictly dominates always-frontier** – lower cost AND higher quality. This is not a cost-quality tradeoff.
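The Cost Reduction column can be reproduced from the per-task costs. A minimal check, assuming Cost Reduction = 1 − Cost/Task ÷ always-frontier Cost/Task (displayed costs are rounded, so recomputed values can differ by a few tenths of a point):

```python
# Reproduce the Cost Reduction column from per-task costs.
# Assumption: reduction is measured relative to the always-frontier baseline.
FRONTIER_COST = 0.317  # always-frontier $/task (the "baseline" row)

def cost_reduction(cost_per_task: float) -> float:
    """Fractional cost reduction versus the always-frontier baseline."""
    return 1.0 - cost_per_task / FRONTIER_COST

for policy, cost in [("oracle", 0.062), ("v10+feedback", 0.201), ("v10 direct", 0.188)]:
    print(f"{policy}: {cost_reduction(cost):.1%}")
```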
|
|
| ## BERT Router Results |
|
|
DistilBERT was fine-tuned on SPROUT for binary classification. The binary classifier fails for tier routing – it ignores tier prefixes and predicts P(success) ≈ 89.5% for all tiers, routing everything to the cheapest model.
|
|
| A 5-class retraining is in progress (job `69fd8cccaff1cd33e8f30714`). |
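The failure mode is easy to reproduce in miniature. A cascade that picks the cheapest tier whose predicted P(success) clears a threshold degenerates to "always cheapest" when the classifier ignores the tier prefix and emits a constant probability. A hypothetical sketch (tier names, prices, and probabilities are illustrative, not the actual SPROUT tiers):

```python
# Why a tier-blind binary classifier breaks cascade routing.
# Hypothetical tiers, cheapest first; prices are illustrative only.
TIERS = [("cheap", 0.014), ("mid", 0.080), ("frontier", 0.317)]
THRESHOLD = 0.8  # route to the cheapest tier clearing this P(success)

def route(p_success) -> str:
    """Pick the cheapest tier whose predicted success probability clears THRESHOLD."""
    for tier, _cost in TIERS:
        if p_success(tier) >= THRESHOLD:
            return tier
    return TIERS[-1][0]  # nothing clears the bar: fall back to the frontier tier

# Broken: the binary classifier ignores the tier prefix entirely.
tier_blind = lambda tier: 0.895  # constant P(success) for every tier
# Working: a tier-aware (multi-class) head actually conditions on the tier.
tier_aware = lambda tier: {"cheap": 0.55, "mid": 0.82, "frontier": 0.97}[tier]

print(route(tier_blind))  # "cheap" for every task, regardless of difficulty
print(route(tier_aware))  # "mid": the cheapest tier that clears the bar
```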
|
|
| ## 11 Modules |
|
|
1. Cost Telemetry Collector – `aco/telemetry.py`
2. Task Cost Classifier – `aco/classifier.py`
3. Model Cascade Router (XGBoost + isotonic) – `aco/router_v10.py`
4. Execution-Feedback Router (entropy cascade) – `aco/execution_feedback.py`
5. Context Budgeter – `aco/context_budgeter.py`
6. Cache-Aware Prompt Layout – `aco/cache_layout.py`
7. Tool-Use Cost Gate – `aco/tool_gate.py`
8. Verifier Budgeter – `aco/verifier_budgeter.py`
9. Retry/Recovery Optimizer – `aco/retry_optimizer.py`
10. Meta-Tool Miner – `aco/meta_tool_miner.py`
11. Doom Detector – `aco/doom_detector.py`
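To make the entropy cascade (module 4) concrete: run the cheap model first, measure how uncertain its output tokens are, and escalate only when uncertainty exceeds a budgeted threshold. The sketch below is an illustration under assumed names and thresholds, not the actual `aco/execution_feedback.py` implementation:

```python
import math

# Entropy-gated cascade sketch, in the spirit of module 4.
# Function names and the 0.5-nat threshold are illustrative assumptions.

def token_entropy(probs: list[float]) -> float:
    """Shannon entropy (nats) of one token's probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_escalate(per_token_probs: list[list[float]],
                    max_mean_entropy: float = 0.5) -> bool:
    """Escalate to a stronger model when the cheap model's mean per-token
    entropy exceeds the budgeted threshold (i.e., the model is unsure)."""
    mean_h = sum(token_entropy(p) for p in per_token_probs) / len(per_token_probs)
    return mean_h > max_mean_entropy

confident = [[0.97, 0.02, 0.01], [0.99, 0.005, 0.005]]
unsure = [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20]]
print(should_escalate(confident))  # False: keep the cheap model's answer
print(should_escalate(unsure))     # True: re-run on a stronger tier
```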
|
|
| ## New Modules (this session) |
|
|
- **Conformal Calibration** – `aco/conformal.py` – RouteNLP-style distribution-free escalation guarantees
- **Pareto Frontier** – `aco/pareto.py` – RouterBench NDCH + RouteLLM CPT/APGR metrics
- **Integration Test** – `tests/test_integration.py` – Full pipeline test
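A distribution-free escalation guarantee can be sketched with split conformal prediction: calibrate a confidence cutoff on held-out failed tasks so that, with probability at least 1 − α, a task that would fail on the cheap tier gets escalated. This is an assumption-laden illustration with synthetic scores, not the `aco/conformal.py` or RouteNLP procedure verbatim:

```python
import math

# Split-conformal escalation cutoff, in the spirit of aco/conformal.py.
# Calibration scores below are synthetic and purely illustrative.

def conformal_threshold(scores: list[float], alpha: float = 0.1) -> float:
    """Cutoff tau such that a future (exchangeable) failing task has
    score > tau with probability <= alpha. `scores` are confidence scores
    of calibration tasks that FAILED on the cheap tier."""
    n = len(scores)
    # Conservative finite-sample order statistic: ceil((n + 1) * (1 - alpha))
    k = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[min(k, n) - 1]

# Cheap-tier confidence scores of calibration tasks that actually failed:
failed_scores = [0.12, 0.31, 0.40, 0.45, 0.52, 0.58, 0.61, 0.66, 0.70, 0.83]
tau = conformal_threshold(failed_scores, alpha=0.1)

def keep_on_cheap(confidence: float) -> bool:
    """Stay on the cheap tier only above the calibrated cutoff;
    otherwise escalate, bounding the miss rate by alpha."""
    return confidence > tau
```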
|
|
| ## Key Takeaway |
|
|
Training on real execution data matters more than architecture: v8, trained on synthetic data, *increased* cost by 11.6%, while v10, trained on 500 real SWE-Router outcomes, *saved* 36.4%. Same XGBoost, same features.
|
|
| ## Documentation |
|
|
| - [Final Report](docs/final_report_v2.md) |
| - [Pareto Frontier Report](docs/pareto_frontier_report.md) |
| - [Conformal Calibration Report](docs/conformal_report.md) |
| - [BERT Eval Report](docs/bert_eval_report.md) |
| - [Literature Review](docs/literature_review.md) |
| - [Deployment Guide](docs/deployment_guide.md) |
| - [Technical Blog](docs/technical_blog.md) |
| - [Roadmap](docs/ROADMAP.md) |
|
|
| ## Links |
|
|
| - **Model**: [narcolepticchicken/agent-cost-optimizer](https://huggingface.co/narcolepticchicken/agent-cost-optimizer) |
| - **Dataset**: [narcolepticchicken/agent-cost-traces](https://huggingface.co/datasets/narcolepticchicken/agent-cost-traces) |
| - **Dashboard**: [narcolepticchicken/aco-dashboard](https://huggingface.co/spaces/narcolepticchicken/aco-dashboard) |
|
|