Report on META-HARNESS
End-to-End Optimization of Model Harnesses
Authors: Yoonho Lee · Roshen Nair · Qizheng Zhang · Kangwook Lee · Omar Khattab · Chelsea Finn
Affiliations: Stanford University · MIT · KRAFTON
Paper: arXiv:2603.28052 — Published March 30, 2026
Project page: yoonholee.com/meta-harness
Code: github.com/stanford-iris-lab/meta-harness-tbench2-artifact
Table of Contents
- Executive Summary
- Introduction & Motivation
- The Meta-Harness Method
- Experiments & Results
- Related Work & Positioning
- Significance & Implications
- Code & Reproducibility
- Conclusion
- References
1. Executive Summary
Meta-Harness is the first system to automate harness engineering end-to-end for LLM applications. By giving a coding agent (Claude Code) unrestricted filesystem access to all prior search history — source code, execution traces, and evaluation scores — it enables optimization at a scale (up to 10M tokens per step) orders of magnitude beyond any prior text optimization method.
The core thesis is simple but powerful: a "harness" — the code that wraps an LLM and determines what information to store, retrieve, and show — often matters as much as the model weights themselves. Yet harness design has remained largely manual. Meta-Harness automates this.
Key Results at a Glance
| Domain | Result | vs. Best Baseline |
|---|---|---|
| Online Text Classification | 48.6% accuracy (GPT-OSS-120B, 3 datasets) | +7.7 pts over ACE using 4× less context |
| IMO-level Math Reasoning | +4.7 pts avg. over 5 unseen models | Surpasses BM25 & Dense Retrieval on avg. |
| TerminalBench-2 (Haiku 4.5) | 37.6% pass rate | #1 among ALL Haiku 4.5 agents |
| TerminalBench-2 (Opus 4.6) | 76.4% pass rate | #2 among ALL Opus 4.6 agents |
2. Introduction & Motivation
2.1 What Is a Harness?
A harness is the code that surrounds a language model and governs its behavior in a deployed system. It is not just a system prompt — it is a complete stateful program that determines:
- What information to retrieve from memory or external sources
- How to format and present context to the model at each step
- How to update the model's state after each interaction
- Which tools to expose, and when to invoke them
- How to structure tool outputs, intermediate reasoning, and final responses
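The responsibilities above can be captured in a small stateful interface. The sketch below is purely illustrative — the class and method names are our own, not from the paper — but it shows the retrieve/format/update cycle a harness implements:

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """Minimal stateful harness sketch (names are illustrative, not the paper's API).

    A harness decides what context to show the model, how to format it,
    and how to fold each interaction back into its own state.
    """
    memory: list = field(default_factory=list)

    def build_context(self, query: str, k: int = 4) -> str:
        # Retrieve: pick the k most recent memory entries for the prompt.
        recent = self.memory[-k:]
        examples = "\n".join(f"- {m}" for m in recent)
        return f"Relevant history:\n{examples}\n\nQuery: {query}"

    def update(self, query: str, response: str) -> None:
        # State update: store the interaction for future retrieval.
        self.memory.append(f"{query} -> {response}")

h = Harness()
h.update("2+2", "4")
prompt = h.build_context("3+3")
```

Everything outside the model call — retrieval policy, formatting, state updates — lives in code like this, which is exactly the search space Meta-Harness optimizes over.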
"Changing the harness around a fixed LLM can produce a 6× performance gap on the same benchmark." — Lee et al. (2026)
Despite this outsized impact, harnesses have been designed largely by hand. Engineers inspect failure cases, adjust heuristics, and iterate across a small number of candidates. This process is slow, expensive, and expert-dependent.
2.2 Why Can't Existing Methods Do This?
Text optimization methods like OPRO, TextGrad, AlphaEvolve, and GEPA all attempt to iteratively improve prompts or code artifacts using LLM feedback. They are poorly matched to harness engineering for a fundamental reason:
- They compress feedback too aggressively — scalar scores, sliding windows, or LLM-generated summaries
- Harnesses act over long horizons: a decision about when to retrieve context may cause failure many steps later
- Compressed summaries lose diagnostic precision needed to trace failures to specific harness decisions
- The relevant information — raw execution traces, tool call sequences, intermediate state — is distributed and large
This is quantified clearly in Table 1 of the paper: prior methods operate with 100 to 30,000 tokens of context per optimization step. Meta-Harness operates with up to 10,000,000.
Context Budget Comparison (Log Scale)
Method | Mtok/iter | History type
----------------|------------|-------------------------------
Self-Refine | 0.001M | Last output + self-critique
OPRO | 0.002M | Window of (solution, score) pairs
MIPRO | 0.003M | Bootstrapped program traces
GEPA | 0.008M | Summarized rollout traces
TextGrad | 0.015M | LLM textual gradient
Feedback Descent| 0.012M | Pairwise comparison + feedback
AlphaEvolve | 0.022M | Program database + eval scores
TTT-Discover | 0.026M | Prev. solution fragments
Meta-Harness ★ | 10.000M | Full: all logs, code, scores
★ ~400× more context than the next best method (TTT-Discover at 26K tokens)
3. The Meta-Harness Method
3.1 Core Architecture
Meta-Harness is itself a harness — a harness for optimizing harnesses (hence the name). It consists of three components in a closed loop:
┌─────────────────────────────────────────────────────────────────┐
│ META-HARNESS SEARCH LOOP │
│ │
│ ┌─────────────────┐ ① reads filesystem │
│ │ PROPOSER AGENT │─────────────────────────►┌─────────────┐ │
│ │ Claude Code │ │ FILESYSTEM D│ │
│ │ (Opus 4.6) │◄──────────────────────── │ code+traces │ │
│ └────────┬────────┘ ③ store all logs │ +scores │ │
│ │ └─────────────┘ │
│ │ ② proposes new harness │
│ ▼ │
│ ┌─────────────────┐ │
│ │ HARNESS H │──────────────────────────►┌───────────┐ │
│ │ Python program │ evaluate │ EVALUATOR │ │
│ │ (prompt/mem/ │◄──────────────────────── │ task set │ │
│ │ retrieval) │ return score └───────────┘ │
│ └─────────────────┘ │
│ │
│ ⚡ Key: up to 10M tokens/iter vs. ≤26K for all prior methods │
└─────────────────────────────────────────────────────────────────┘
| Component | Description |
|---|---|
| Proposer | Claude Code (Opus 4.6) — reads filesystem via grep/cat, reasons about prior failures, proposes new harness code |
| Filesystem D | Growing directory of all prior candidates: source code, evaluation scores, full execution traces (prompts, tool calls, model outputs, state updates) |
| Evaluator | Runs each proposed harness on a held-out search set and returns reward; logs all results to the filesystem. Test sets remain held out until the very end. |
3.2 The Search Algorithm
Algorithm 1 — Meta-Harness outer loop
Input: tasks X, LLM M, proposer P, iterations N
Init: population H₀ (seed harnesses)
      filesystem D ← ∅
for H in H₀:
    E_H ← Evaluate(H, M, X)
    D ← D ∪ {(H, E_H)}
for t = 1 … N:
    P inspects filesystem D — prior code, scores, traces
P proposes k new harnesses {H₁ … Hₖ}
for H in {H₁ … Hₖ}:
if H passes interface validation:
D ← D ∪ {(H, Evaluate(H, M, X))}
return Pareto frontier of harnesses in D
Key design insight: No parent-selection rule is imposed. The proposer freely inspects any prior harness and its traces. This lets it identify confounded edits, trace causal chains of failure, and shift strategies when a class of edits regresses. In practice it reads a median of 82 files per iteration, referencing 20+ prior candidates per step.
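Algorithm 1 can be sketched as a short Python loop. The `propose` and `evaluate` callables below are stand-ins (our assumption, not the paper's actual interfaces): `propose(history, k)` plays the role of the coding-agent proposer reading the filesystem, and `evaluate(h)` plays the role of running a harness on the search set:

```python
def meta_harness_search(seeds, propose, evaluate, iterations, k=2):
    """Sketch of the Algorithm 1 outer loop (propose/evaluate are stubs)."""
    history = []  # plays the role of the filesystem D
    for h in seeds:
        history.append({"harness": h, "score": evaluate(h)})
    for _ in range(iterations):
        for h in propose(history, k):  # proposer may inspect the full history
            history.append({"harness": h, "score": evaluate(h)})
    # Return the single best candidate; the paper keeps a Pareto frontier
    # over accuracy and context cost instead.
    return max(history, key=lambda rec: rec["score"])

# Toy usage: "harnesses" are integers and the score is the value itself.
best = meta_harness_search(
    seeds=[1, 2],
    propose=lambda hist, k: [max(r["harness"] for r in hist) + 1 for _ in range(k)],
    evaluate=lambda h: h,
    iterations=3,
)
```

The key point the sketch preserves: nothing restricts which prior entries the proposer consults — the whole history is passed in, mirroring the unrestricted filesystem access.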
3.3 Why Code-Space Search?
Harness optimization targets executable Python programs, not just text templates:
- Small changes to retrieval or memory logic can affect behavior many steps later — local heuristics are ill-suited
- Coding models have strong inductive biases toward coherent algorithms rather than brittle hard-coded solutions
- The proposer can make structural changes — full rewrites, new retrieval strategies, different memory schemas — not just fill templates
- The action space aligns with the read-write-execute workflows on which frontier coding models are trained
Important: The proposer never sees test-set results. All feedback comes exclusively from the search set and execution traces — critical for avoiding overfitting.
4. Experiments & Results
4.1 Online Text Classification
Setup
An LLM receives labeled examples one at a time, updates its memory, and is evaluated on a held-out test set (following the ACE setup).
| Dataset | Task | Classes | Difficulty |
|---|---|---|---|
| LawBench (Law) | Predict criminal charges from case descriptions | 215 | Very Hard |
| Symptom2Disease (S2D) | Predict disease from symptom descriptions | 22 | Moderate |
| USPTO-50k | Predict precursor reactants from product molecules | 180 | Hard |
Model: GPT-OSS-120B
Search: 20 evolution iterations × 2 candidates = 40 total harness evaluations
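The online protocol above — observe one labeled example at a time, update memory, then score on a held-out set — can be sketched as follows. The harness here is a deliberately trivial stand-in (our illustration, not the ACE or Meta-Harness code):

```python
def run_online(harness, stream, test_set):
    """Sketch of the online classification protocol (illustrative stand-in)."""
    for text, label in stream:          # online phase: observe and update memory
        harness.update(text, label)
    # Held-out evaluation after the stream is exhausted.
    correct = sum(harness.classify(text) == label for text, label in test_set)
    return correct / len(test_set)

class MajorityHarness:
    """Trivial baseline harness: remember label counts, always predict the mode."""
    def __init__(self):
        self.counts = {}
    def update(self, text, label):
        self.counts[label] = self.counts.get(label, 0) + 1
    def classify(self, text):
        return max(self.counts, key=self.counts.get)

acc = run_online(
    MajorityHarness(),
    stream=[("a", "pos"), ("b", "pos"), ("c", "neg")],
    test_set=[("d", "pos"), ("e", "neg")],
)
```

Real candidate harnesses differ only in what `update` stores and what `classify` shows the model; the evaluation loop stays fixed.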
Main Results
| Harness | USPTO ↑ | S2D ↑ | Law ↑ | Avg Acc ↑ | Context (K chars) ↓ |
|---|---|---|---|---|---|
| Zero-Shot | 12.0 | 63.2 | 7.0 | 27.4 | 0 |
| Few-Shot (N=8) | 14.0 | 67.9 | 21.0 | 34.3 | 8.0 |
| Few-Shot (N=32) | 13.0 | 72.2 | 21.0 | 35.4 | 31.5 |
| Few-Shot (all) | 15.0 | 78.3 | 29.0 | 40.8 | 49.3 |
| MCE | 14.0 | 83.0 | 23.0 | 40.0 | 114.0 |
| ACE | 16.0 | 77.8 | 29.0 | 40.9 | 203.0 |
| Meta-Harness (Ours) | 14.0 | 86.8 | 45.0 | 48.6 | 45.5 |
Meta-Harness achieves +7.7 pts over ACE while using less than a quarter of ACE's context. Gains concentrate on large label spaces: LawBench +16 pts, S2D +9 pts.
Comparison vs. Text Optimizers (same budget)
| Method | Median Acc | Best Acc |
|---|---|---|
| GEPA | 32.6 | 40.2 |
| Best-of-N | 34.0 | 44.2 |
| TTT-Discover | 34.1 | 45.6 |
| OpenEvolve | 39.1 | 43.3 |
| Meta-Harness | 50.0 | 56.7 |
Meta-Harness matches the best prior text optimizer's final accuracy in 10× fewer evaluations, and its final accuracy surpasses theirs by more than 10 points.
Ablation: What Information Matters?
| Method | Scores | Code | Summaries | Traces | Median Acc | Best Acc |
|---|---|---|---|---|---|---|
| Scores Only | ✓ | ✓ | ✗ | ✗ | 34.6% | 41.3% |
| Scores + Summary | ✓ | ✓ | ✓ | ✗ | 34.9% | 38.7% |
| Meta-Harness (Full) | ✓ | ✓ | — | ✓ | 50.0% | 56.7% |
Critical finding: adding LLM-generated summaries barely moves the median (34.9% vs. 34.6% for Scores Only) and lowers the best accuracy (38.7% vs. 41.3%), suggesting summaries compress away diagnostically useful details. Raw execution traces are the key ingredient.
Out-of-Distribution Generalization (9 unseen datasets)
| Harness | Avg Accuracy | Context (K chars) |
|---|---|---|
| Zero-shot | 67.0% | — |
| Few-shot (32) | 69.6% | 5.2 |
| ACE | 70.2% | 11.7 |
| Meta-Harness | 73.1% | 7.3 |
Meta-Harness achieves best accuracy on 6/9 unseen datasets — confirming it captures generally effective strategies, not search-set-specific tricks.
4.2 Retrieval-Augmented Math Reasoning
Setup
A non-standard but well-motivated setup: augmenting LLMs with a retrieval corpus (≥500K problems from 8 open-source datasets) for IMO-level olympiad math. Meta-Harness searches over arbitrary retrieval programs that can filter, branch, and format using corpus metadata and BM25 scores.
- Search set: 250 problems
- Evaluation set: 200 held-out IMO-level problems
- Transfer: same harness tested on 5 models unseen during search
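A retrieval program in this search space can filter, rank, and format — not just fetch top-k. The sketch below is our illustration of that shape, with plain term overlap standing in for BM25 scores (the paper's actual discovered harness is not shown here):

```python
def retrieve(query, corpus, k=2, min_overlap=1):
    """Sketch of a filter/rank/format retrieval program (illustrative)."""
    q_terms = set(query.lower().split())
    scored = []
    for doc in corpus:
        # Term overlap as a stand-in for a BM25 score.
        overlap = len(q_terms & set(doc["problem"].lower().split()))
        if overlap >= min_overlap:           # filter step: drop weak matches
            scored.append((overlap, doc))
    scored.sort(key=lambda t: t[0], reverse=True)
    top = [doc for _, doc in scored[:k]]
    # Format step: present retrieved problems as worked examples.
    return "\n\n".join(
        f"Example ({d['source']}):\n{d['problem']}\nSolution: {d['solution']}"
        for d in top
    )

corpus = [
    {"source": "aops", "problem": "prove the inequality holds", "solution": "AM-GM"},
    {"source": "imo",  "problem": "count lattice points", "solution": "Pick"},
]
context = retrieve("prove an inequality", corpus)
```

The filter threshold matters: as the Random Few-shot row below shows, indiscriminately injecting examples can hurt, so a searched program that drops weak matches has room to beat naive retrieval.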
Results
| Method | GPT-5.4n | GPT-5.4m | Gem-3.1FL | Gem-3F | GPT-20B | Avg |
|---|---|---|---|---|---|---|
| No Retriever | 23.0 | 28.8 | 28.6 | 42.6 | 47.6 | 34.1 |
| Random Few-shot | 23.1 (+0.1) | 24.5 (−4.3) | 31.0 (+2.4) | 40.4 (−2.2) | 41.8 (−5.8) | 32.2 (−1.9) |
| BM25 Retrieval | 30.2 (+7.2) | 29.2 (+0.4) | 32.8 (+4.2) | 46.6 (+4.0) | 48.9 (+1.3) | 37.5 (+3.4) |
| Meta-Harness | 31.7 (+8.7) | 30.4 (+1.6) | 34.9 (+6.3) | 46.3 (+3.7) | 50.6 (+3.0) | 38.8 (+4.7) |
Notable: Random few-shot retrieval hurts performance (32.2% vs. 34.1% baseline), confirming that naive retrieval fails on reasoning-intensive benchmarks. The discovered Meta-Harness retrieval strategy improves all 5 models, including those not seen during search — a strong generalization signal.
4.3 Agentic Coding: TerminalBench-2
Benchmark
TerminalBench-2 evaluates LLM agents on 89 Dockerized real-world tasks spanning:
- Code translation
- Distributed ML setup
- Systems programming
- Bioinformatics
- Cryptanalysis
Grading is binary (pass/fail) with 5 independent trials per task. Tasks require long-horizon autonomous execution, handling complex software dependencies, truncated terminal outputs, and deep domain knowledge.
Meta-Harness Approach
Search initializes from Terminus 2 and Terminus-KIRA (two strong open baselines) and evolves the full coding harness:
- System prompts and task framing
- Tool definitions and invocation policies
- Completion-checking logic
- Context management under long execution horizons
The proposer reads per-task execution traces (command logs, error messages, timeout behavior) to diagnose failure modes and propose targeted fixes.
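That diagnosis step amounts to scanning raw logs for recurring failure signatures. The sketch below illustrates the idea with made-up error patterns (ours, not the paper's); the point is that raw traces, unlike summaries, let the proposer count and localize concrete failures across tasks:

```python
import re
from collections import Counter

def diagnose(trace_texts):
    """Sketch of failure-mode triage over raw execution traces (illustrative)."""
    patterns = {
        "timeout": re.compile(r"timed? ?out", re.I),
        "missing_dep": re.compile(r"ModuleNotFoundError|command not found"),
        "truncated": re.compile(r"\[output truncated\]"),
    }
    tally = Counter()
    for text in trace_texts:
        for name, pat in patterns.items():
            if pat.search(text):
                tally[name] += 1
    return tally

tally = diagnose([
    "step 12: pip install ... ModuleNotFoundError: no module named torch",
    "step 40: command timed out after 300s",
    "step 3: cmd timed out",
])
```

A tally like this turns "the agent failed" into "timeouts dominate failures", which points directly at a harness-level fix such as adjusting timeout handling or command batching.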
Leaderboard Results
Claude Haiku 4.5
| Agent | Pass Rate |
|---|---|
| OpenHands | 13.9% |
| Claude Code | 27.5% |
| Terminus 2 | 28.3% |
| Mini-SWE-Agent | 29.8% |
| Terminus-KIRA | 33.7% |
| Goose | 35.5% |
| Meta-Harness (Ours) | 37.6% 🥇 #1 |
Claude Opus 4.6
| Agent | Pass Rate |
|---|---|
| Claude Code | 58.0% |
| Terminus 2 | 62.9% |
| Mux | 66.5% |
| Factory Droid | 69.9% |
| TongAgents | 71.9% |
| MAYA-V2 | 72.1% |
| Terminus-KIRA | 74.7% |
| Capy | 75.3% |
| Meta-Harness (Ours) | 76.4% 🥈 #2 |
| ForgeCode | 81.8% |
Meta-Harness is the best-performing open harness for both model sizes on TerminalBench-2.
5. Related Work & Positioning
5.1 Text Optimization Methods
| Method | History / Log Content | Core Limitation | Mtok/iter |
|---|---|---|---|
| OPRO | Window of (solution, score) pairs | Scalar scores only, no traces | 0.002M |
| TextGrad | Last textual gradient (self-critique) | Only current candidate, no history | 0.015M |
| AlphaEvolve | Window of program database + eval scores | Score-heavy, compressed programs | 0.022M |
| GEPA | Summarized reflective feedback from rollouts | Summary loses causal chain | 0.008M |
| TTT-Discover | Window of prev. solution fragments | Structured, limited proposer input | 0.026M |
| Meta-Harness | Full: all logs, code, scores via filesystem | — | 10.0M |
5.2 Connections to Other Research
Retrieval-Augmented Generation (RAG)
Meta-Harness applies similar adaptive access patterns but in the harder regime of improving the retrieval mechanism itself, not just using it.
Meta-Learning
Assigns credit at the harness level rather than via gradient-based weight updates; uses experience from prior rollouts to rewrite future external behavior — a form of meta-learning without backpropagation.
Program Synthesis / Evolutionary Code Search
Methods like OpenEvolve evolve designated functions within fixed scaffolds. Meta-Harness evolves full stateful programs with unrestricted code changes and no pre-defined mutation operators.
Memory-Augmented Agents
Related to memory designs for continual-learning agents, but Meta-Harness targets per-task harnesses that reset between tasks (not persistent cross-task memory).
6. Significance & Implications
6.1 For AI Engineering Practice
- The discovered "Label-Primed Query" harness achieves higher text classification accuracy with less context than all hand-designed alternatives — without any additional LLM calls
- The math retrieval harness transfers to 5 models not seen during search, suggesting generalizable logic rather than model-specific tricks
- The TerminalBench-2 artifact is publicly released as a concrete starting point for practitioners
Practical takeaway: For any LLM application where harness design is a bottleneck, Meta-Harness offers a concrete, automatable alternative to manual iteration. The implementation uses Claude Code as the proposer, which is publicly accessible.
6.2 Theoretical Contributions
- Formalizes harness optimization as an objective: H* = argmax_H E[r(τ, x)], where τ is the trajectory produced by running model M under harness H on task x
- Identifies compression as the core failure mode of prior text optimizers for this setting
- Demonstrates empirically that execution traces — not summaries — are the key ingredient (Table 3 ablation)
- Shows that filesystem-based memory is an effective alternative to hand-designed search loops (no parent-selection rules, archive structures, or diversity mechanisms required)
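The Pareto selection mentioned above (keep candidates that are not beaten on both accuracy and context cost simultaneously) can be sketched directly; the field names are our own:

```python
def pareto_frontier(candidates):
    """Sketch of Pareto selection over accuracy (higher better) and
    context cost (lower better). Field names are illustrative."""
    def dominates(a, b):
        # a dominates b if a is at least as good on both axes
        # and strictly better on at least one.
        return (a["acc"] >= b["acc"] and a["cost"] <= b["cost"]
                and (a["acc"] > b["acc"] or a["cost"] < b["cost"]))
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates)]

frontier = pareto_frontier([
    {"name": "A", "acc": 48.6, "cost": 45.5},
    {"name": "B", "acc": 40.9, "cost": 203.0},   # dominated by A on both axes
    {"name": "C", "acc": 27.4, "cost": 0.0},     # cheapest, so it survives
])
```

Keeping a frontier rather than a single argmax preserves cheap-but-weaker harnesses that may be preferable under tight context budgets.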
6.3 Limitations & Future Work
- Compute cost: each run evaluates ~60 harnesses over 20 iterations; evaluation is the main bottleneck
- Proposer dependence: capability scales with the quality of the underlying coding agent — results will improve as agents improve
- Single-file harnesses: current experiments use single-file Python programs; multi-file or multi-agent architectures are future work
- The authors acknowledge this workflow "only became practical recently, following major improvements in coding-agent capabilities around early 2026" — the paradigm is tightly coupled to current frontier capabilities
7. Code & Reproducibility
| Resource | Link |
|---|---|
| Optimized TerminalBench-2 harness | https://github.com/stanford-iris-lab/meta-harness-tbench2-artifact |
| Project page + interactive demo | https://yoonholee.com/meta-harness/ |
| Paper | https://arxiv.org/abs/2603.28052 |
Implementation Details
| Component | Implementation |
|---|---|
| Proposer | Claude Code with Opus 4.6 (max reasoning mode) |
| Harness format | Single-file Python program (prompting + retrieval + memory + orchestration) |
| Search budget | ~60 harness evaluations over 20 iterations (2 candidates/iteration) |
| Filesystem access | grep, cat, standard terminal tools; proposer reads median 82 files/iteration |
| Validation | Interface validation before evaluation; invalid harnesses are discarded |
| Selection | Pareto frontier over accuracy and context cost; test set held out until final eval |
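The interface-validation step in the table can be as simple as a duck-typing check before any costly evaluation runs. A minimal sketch, assuming (our assumption) that the evaluator needs a couple of named callables:

```python
def validate_interface(harness, required=("build_context", "update")):
    """Sketch of pre-evaluation interface validation (illustrative names).

    A proposed harness is discarded unless it exposes every callable the
    evaluator needs, so malformed proposals never consume evaluation budget.
    """
    return all(callable(getattr(harness, m, None)) for m in required)

class Good:
    def build_context(self, query): return query
    def update(self, query, response): pass

class Bad:
    pass  # missing both required methods

ok = validate_interface(Good())
bad = validate_interface(Bad())
```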
The interactive demo on the project page steps through a concrete search run on TerminalBench-2 (hard 19-task subset), showing the proposer's reasoning at each iteration: counterfactual diagnosis, identification of failure modes from raw logs, and targeted fixes grounded in concrete evidence from prior runs.
8. Conclusion
Meta-Harness represents a significant step toward automated engineering of LLM application harnesses. Its core contribution is the identification of a critical mismatch between prior text optimization methods and the requirements of harness engineering: prior methods compress feedback too aggressively, discarding the execution traces needed to trace failures to specific harness decisions.
The solution — exposing a complete filesystem of prior experience to a coding-agent proposer — is conceptually simple but empirically powerful. By giving the proposer unrestricted access to source code, scores, and full execution traces, Meta-Harness enables diagnosis and repair at a qualitatively different level than summary-based methods.
The results across three very different domains — online text classification, IMO-level math reasoning, and agentic coding — demonstrate that richer access to prior experience consistently translates to better harnesses.
"This workflow only became practical recently, following major improvements in coding-agent capabilities around early 2026." — Lee et al. (2026)
The implication is clear: automated harness engineering is now a viable paradigm, and its ceiling has not yet been reached. As coding-agent capabilities continue to improve, Meta-Harness's performance will scale accordingly.
9. References
Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv:2603.28052.
Key cited works:
- [ACE] Zhang et al. (2025). Agentic Context Engineering.
- [MCE] Ye et al. (2025). Meta Context Engineering.
- [OPRO] Yang et al. (2024). Large Language Models as Optimizers.
- [TextGrad] Yuksekgonul et al. (2024). TextGrad: Automatic Differentiation via Text.
- [AlphaEvolve] DeepMind (2025). AlphaEvolve: Evolving Better Algorithms.
- [GEPA] Anokhin et al. (2025). GEPA: Goal-directed Evolution of Prompt Agents.
- [TTT-Discover] Anonymous (2025). Test-Time Training Discover.
- [Feedback Descent] Ye et al. (2025). Feedback Descent for LLM Optimization.
Report generated from arXiv:2603.28052 · Stanford IRIS Lab · 2026

