Report on META-HARNESS
End-to-End Optimization of Model Harnesses
Authors: Yoonho Lee · Roshen Nair · Qizheng Zhang · Kangwook Lee · Omar Khattab · Chelsea Finn
Affiliations: Stanford University · MIT · KRAFTON
Paper: arXiv:2603.28052 — Published March 30, 2026
Project page: yoonholee.com/meta-harness
Code: github.com/stanford-iris-lab/meta-harness-tbench2-artifact
Table of Contents
- Executive Summary
- Introduction & Motivation
- The Meta-Harness Method
- Experiments & Results
- Related Work & Positioning
- Significance & Implications
- Code & Reproducibility
- Conclusion
- References
1. Executive Summary
Meta-Harness is the first system to automate harness engineering end-to-end for LLM applications. By giving a coding agent (Claude Code) unrestricted filesystem access to all prior search history — source code, execution traces, and evaluation scores — it enables optimization at a scale (up to 10M tokens per step) orders of magnitude beyond any prior text optimization method.
The core thesis is simple but powerful: a "harness" — the code that wraps an LLM and determines what information to store, retrieve, and show — often matters as much as the model weights themselves. Yet harness design has remained largely manual. Meta-Harness automates this.
Key Results at a Glance
| Domain | Result | vs. Best Baseline |
|---|---|---|
| Online Text Classification | 48.6% accuracy (GPT-OSS-120B, 3 datasets) | +7.7 pts over ACE using 4× less context |
| IMO-level Math Reasoning | +4.7 pts avg. over 5 unseen models | Surpasses BM25 & Dense Retrieval on avg. |
| TerminalBench-2 (Haiku 4.5) | 37.6% pass rate | #1 among ALL Haiku 4.5 agents |
| TerminalBench-2 (Opus 4.6) | 76.4% pass rate | #2 among ALL Opus 4.6 agents |
2. Introduction & Motivation
2.1 What Is a Harness?
A harness is the code that surrounds a language model and governs its behavior in a deployed system. It is not just a system prompt — it is a complete stateful program that determines:
- What information to retrieve from memory or external sources
- How to format and present context to the model at each step
- How to update the model's state after each interaction
- Which tools to expose, and when to invoke them
- How to structure tool outputs, intermediate reasoning, and final responses
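The responsibilities above can be captured in a small stateful interface. The sketch below is purely illustrative — the class and method names are our own, not from the paper — but it shows the retrieve/format/update cycle a harness implements:

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """Minimal stateful harness sketch (names are illustrative, not the paper's API).

    A harness decides what context to show the model, how to format it,
    and how to fold each interaction back into its own state.
    """
    memory: list = field(default_factory=list)

    def build_context(self, query: str, k: int = 4) -> str:
        # Retrieve: pick the k most recent memory entries for the prompt.
        recent = self.memory[-k:]
        examples = "\n".join(f"- {m}" for m in recent)
        return f"Relevant history:\n{examples}\n\nQuery: {query}"

    def update(self, query: str, response: str) -> None:
        # State update: store the interaction for future retrieval.
        self.memory.append(f"{query} -> {response}")

h = Harness()
h.update("2+2", "4")
prompt = h.build_context("3+3")
```

Everything outside the model call — retrieval policy, formatting, state updates — lives in code like this, which is exactly the search space Meta-Harness optimizes over.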
"Changing the harness around a fixed LLM can produce a 6× performance gap on the same benchmark." — Lee et al. (2026)
Despite this outsized impact, harnesses have been designed largely by hand. Engineers inspect failure cases, adjust heuristics, and iterate across a small number of candidates. This process is slow, expensive, and expert-dependent.
2.2 Why Can't Existing Methods Do This?
Text optimization methods like OPRO, TextGrad, AlphaEvolve, and GEPA all attempt to iteratively improve prompts or code artifacts using LLM feedback. They are poorly matched to harness engineering for a fundamental reason:
- They compress feedback too aggressively — scalar scores, sliding windows, or LLM-generated summaries
- Harnesses act over long horizons: a decision about when to retrieve context may cause failure many steps later
- Compressed summaries lose diagnostic precision needed to trace failures to specific harness decisions
- The relevant information — raw execution traces, tool call sequences, intermediate state — is distributed and large
This is quantified clearly in Table 1 of the paper: prior methods operate with 100 to 30,000 tokens of context per optimization step. Meta-Harness operates with up to 10,000,000.
Context Budget Comparison (Log Scale)
Method | Mtok/iter | History type
----------------|------------|-------------------------------
Self-Refine | 0.001M | Last output + self-critique
OPRO | 0.002M | Window of (solution, score) pairs
MIPRO | 0.003M | Bootstrapped program traces
GEPA | 0.008M | Summarized rollout traces
TextGrad | 0.015M | LLM textual gradient
Feedback Descent| 0.012M | Pairwise comparison + feedback
AlphaEvolve | 0.022M | Program database + eval scores
TTT-Discover | 0.026M | Prev. solution fragments
Meta-Harness ★ | 10.000M | Full: all logs, code, scores
★ ~400× more context than the next best method (TTT-Discover at 26K tokens)
3. The Meta-Harness Method
3.1 Core Architecture
Meta-Harness is itself a harness — a harness for optimizing harnesses (hence the name). It consists of three components in a closed loop:
┌─────────────────────────────────────────────────────────────────┐
│ META-HARNESS SEARCH LOOP │
│ │
│ ┌─────────────────┐ ① reads filesystem │
│ │ PROPOSER AGENT │─────────────────────────►┌─────────────┐ │
│ │ Claude Code │ │ FILESYSTEM D│ │
│ │ (Opus 4.6) │◄──────────────────────── │ code+traces │ │
│ └────────┬────────┘ ③ store all logs │ +scores │ │
│ │ └─────────────┘ │
│ │ ② proposes new harness │
│ ▼ │
│ ┌─────────────────┐ │
│ │ HARNESS H │──────────────────────────►┌───────────┐ │
│ │ Python program │ evaluate │ EVALUATOR │ │
│ │ (prompt/mem/ │◄──────────────────────── │ task set │ │
│ │ retrieval) │ return score └───────────┘ │
│ └─────────────────┘ │
│ │
│ ⚡ Key: up to 10M tokens/iter vs. ≤26K for all prior methods │
└─────────────────────────────────────────────────────────────────┘
| Component | Description |
|---|---|
| Proposer | Claude Code (Opus 4.6) — reads filesystem via grep/cat, reasons about prior failures, proposes new harness code |
| Filesystem D | Growing directory of all prior candidates: source code, evaluation scores, full execution traces (prompts, tool calls, model outputs, state updates) |
| Evaluator | Runs each proposed harness on a held-out search set and returns reward; logs all results to the filesystem. Test sets remain held out until the very end. |
3.2 The Search Algorithm
Algorithm 1 — Meta-Harness outer loop
Input: tasks X, LLM M, proposer P, iterations N
Init: population H₀ (seed harnesses)
      filesystem D ← ∅
for H in H₀:
    E_H ← Evaluate(H, M, X)
    D ← D ∪ {(H, E_H)}
for t = 1 … N:
    P inspects filesystem D — prior code, scores, traces
P proposes k new harnesses {H₁ … Hₖ}
for H in {H₁ … Hₖ}:
if H passes interface validation:
D ← D ∪ {(H, Evaluate(H, M, X))}
return Pareto frontier of harnesses in D
Key design insight: No parent-selection rule is imposed. The proposer freely inspects any prior harness and its traces. This lets it identify confounded edits, trace causal chains of failure, and shift strategies when a class of edits regresses. In practice it reads a median of 82 files per iteration, referencing 20+ prior candidates per step.
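Algorithm 1 can be sketched as a short Python loop. The `propose` and `evaluate` callables below are stand-ins (our assumption, not the paper's actual interfaces): `propose(history, k)` plays the role of the coding-agent proposer reading the filesystem, and `evaluate(h)` plays the role of running a harness on the search set:

```python
def meta_harness_search(seeds, propose, evaluate, iterations, k=2):
    """Sketch of the Algorithm 1 outer loop (propose/evaluate are stubs)."""
    history = []  # plays the role of the filesystem D
    for h in seeds:
        history.append({"harness": h, "score": evaluate(h)})
    for _ in range(iterations):
        for h in propose(history, k):  # proposer may inspect the full history
            history.append({"harness": h, "score": evaluate(h)})
    # Return the single best candidate; the paper keeps a Pareto frontier
    # over accuracy and context cost instead.
    return max(history, key=lambda rec: rec["score"])

# Toy usage: "harnesses" are integers and the score is the value itself.
best = meta_harness_search(
    seeds=[1, 2],
    propose=lambda hist, k: [max(r["harness"] for r in hist) + 1 for _ in range(k)],
    evaluate=lambda h: h,
    iterations=3,
)
```

The key point the sketch preserves: nothing restricts which prior entries the proposer consults — the whole history is passed in, mirroring the unrestricted filesystem access.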
3.3 Why Code-Space Search?
Harness optimization targets executable Python programs, not just text templates:
- Small changes to retrieval or memory logic can affect behavior many steps later — local heuristics are ill-suited
- Coding models have strong inductive biases toward coherent algorithms rather than brittle hard-coded solutions
- The proposer can make structural changes — full rewrites, new retrieval strategies, different memory schemas — not just fill templates
- The action space aligns with the read-write-execute workflows on which frontier coding models are trained
Important: The proposer never sees test-set results. All feedback comes exclusively from the search set and execution traces — critical for avoiding overfitting.
4. Experiments & Results
4.1 Online Text Classification
Setup
An LLM receives labeled examples one at a time, updates its memory, and is evaluated on a held-out test set (following the ACE setup).
| Dataset | Task | Classes | Difficulty |
|---|---|---|---|
| LawBench (Law) | Predict criminal charges from case descriptions | 215 | Very Hard |
| Symptom2Disease (S2D) | Predict disease from symptom descriptions | 22 | Moderate |
| USPTO-50k | Predict precursor reactants from product molecules | 180 | Hard |
Model: GPT-OSS-120B
Search: 20 evolution iterations × 2 candidates = 40 total harness evaluations
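The online protocol above — observe one labeled example at a time, update memory, then score on a held-out set — can be sketched as follows. The harness here is a deliberately trivial stand-in (our illustration, not the ACE or Meta-Harness code):

```python
def run_online(harness, stream, test_set):
    """Sketch of the online classification protocol (illustrative stand-in)."""
    for text, label in stream:          # online phase: observe and update memory
        harness.update(text, label)
    # Held-out evaluation after the stream is exhausted.
    correct = sum(harness.classify(text) == label for text, label in test_set)
    return correct / len(test_set)

class MajorityHarness:
    """Trivial baseline harness: remember label counts, always predict the mode."""
    def __init__(self):
        self.counts = {}
    def update(self, text, label):
        self.counts[label] = self.counts.get(label, 0) + 1
    def classify(self, text):
        return max(self.counts, key=self.counts.get)

acc = run_online(
    MajorityHarness(),
    stream=[("a", "pos"), ("b", "pos"), ("c", "neg")],
    test_set=[("d", "pos"), ("e", "neg")],
)
```

Real candidate harnesses differ only in what `update` stores and what `classify` shows the model; the evaluation loop stays fixed.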
Main Results
| Harness | USPTO ↑ | S2D ↑ | Law ↑ | Avg Acc ↑ | Context (K chars) ↓ |
|---|---|---|---|---|---|
| Zero-Shot | 12.0 | 63.2 | 7.0 | 27.4 | 0 |
| Few-Shot (N=8) | 14.0 | 67.9 | 21.0 | 34.3 | 8.0 |
| Few-Shot (N=32) | 13.0 | 72.2 | 21.0 | 35.4 | 31.5 |
| Few-Shot (all) | 15.0 | 78.3 | 29.0 | 40.8 | 49.3 |
| MCE | 14.0 | 83.0 | 23.0 | 40.0 | 114.0 |
| ACE | 16.0 | 77.8 | 29.0 | 40.9 | 203.0 |
| Meta-Harness (Ours) | 14.0 | 86.8 | 45.0 | 48.6 | 45.5 |
Meta-Harness achieves +7.7 pts over ACE while using less than a quarter of ACE's context. Gains concentrate on large label spaces: LawBench +16 pts, S2D +9 pts.
Comparison vs. Text Optimizers (same budget)
| Method | Median Acc | Best Acc |
|---|---|---|
| GEPA | 32.6 | 40.2 |
| Best-of-N | 34.0 | 44.2 |
| TTT-Discover | 34.1 | 45.6 |
| OpenEvolve | 39.1 | 43.3 |
| Meta-Harness | 50.0 | 56.7 |
Meta-Harness matches the best prior text optimizer's final accuracy in 10× fewer evaluations, and its final accuracy surpasses theirs by more than 10 points.
Ablation: What Information Matters?
| Method | Scores | Code | Summaries | Traces | Median Acc | Best Acc |
|---|---|---|---|---|---|---|
| Scores Only | ✓ | ✓ | ✗ | ✗ | 34.6% | 41.3% |
| Scores + Summary | ✓ | ✓ | ✓ | ✗ | 34.9% | 38.7% |
| Meta-Harness (Full) | ✓ | ✓ | — | ✓ | 50.0% | 56.7% |
Critical finding: adding LLM-generated summaries barely moves the median (34.9% vs. 34.6% for Scores Only) and lowers the best accuracy (38.7% vs. 41.3%), suggesting summaries compress away diagnostically useful details. Raw execution traces are the key ingredient.
Out-of-Distribution Generalization (9 unseen datasets)
| Harness | Avg Accuracy | Context (K chars) |
|---|---|---|
| Zero-shot | 67.0% | — |
| Few-shot (32) | 69.6% | 5.2 |
| ACE | 70.2% | 11.7 |
| Meta-Harness | 73.1% | 7.3 |
Meta-Harness achieves best accuracy on 6/9 unseen datasets — confirming it captures generally effective strategies, not search-set-specific tricks.
4.2 Retrieval-Augmented Math Reasoning
Setup
A non-standard but well-motivated setup: augmenting LLMs with a retrieval corpus (≥500K problems from 8 open-source datasets) for IMO-level olympiad math. Meta-Harness searches over arbitrary retrieval programs that can filter, branch, and format using corpus metadata and BM25 scores.
- Search set: 250 problems
- Evaluation set: 200 held-out IMO-level problems
- Transfer: same harness tested on 5 models unseen during search
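A retrieval program in this search space can filter, rank, and format — not just fetch top-k. The sketch below is our illustration of that shape, with plain term overlap standing in for BM25 scores (the paper's actual discovered harness is not shown here):

```python
def retrieve(query, corpus, k=2, min_overlap=1):
    """Sketch of a filter/rank/format retrieval program (illustrative)."""
    q_terms = set(query.lower().split())
    scored = []
    for doc in corpus:
        # Term overlap as a stand-in for a BM25 score.
        overlap = len(q_terms & set(doc["problem"].lower().split()))
        if overlap >= min_overlap:           # filter step: drop weak matches
            scored.append((overlap, doc))
    scored.sort(key=lambda t: t[0], reverse=True)
    top = [doc for _, doc in scored[:k]]
    # Format step: present retrieved problems as worked examples.
    return "\n\n".join(
        f"Example ({d['source']}):\n{d['problem']}\nSolution: {d['solution']}"
        for d in top
    )

corpus = [
    {"source": "aops", "problem": "prove the inequality holds", "solution": "AM-GM"},
    {"source": "imo",  "problem": "count lattice points", "solution": "Pick"},
]
context = retrieve("prove an inequality", corpus)
```

The filter threshold matters: as the Random Few-shot row below shows, indiscriminately injecting examples can hurt, so a searched program that drops weak matches has room to beat naive retrieval.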
Results
| Method | GPT-5.4n | GPT-5.4m | Gem-3.1FL | Gem-3F | GPT-20B | Avg |
|---|---|---|---|---|---|---|
| No Retriever | 23.0 | 28.8 | 28.6 | 42.6 | 47.6 | 34.1 |
| Random Few-shot | 23.1 (+0.1) | 24.5 (−4.3) | 31.0 (+2.4) | 40.4 (−2.2) | 41.8 (−5.8) | 32.2 (−1.9) |
| BM25 Retrieval | 30.2 (+7.2) | 29.2 (+0.4) | 32.8 (+4.2) | 46.6 (+4.0) | 48.9 (+1.3) | 37.5 (+3.4) |
| Meta-Harness | 31.7 (+8.7) | 30.4 (+1.6) | 34.9 (+6.3) | 46.3 (+3.7) | 50.6 (+3.0) | 38.8 (+4.7) |
Notable: Random few-shot retrieval hurts performance (32.2% vs. 34.1% baseline), confirming that naive retrieval fails on reasoning-intensive benchmarks. The discovered Meta-Harness retrieval strategy improves all 5 models, including those not seen during search — a strong generalization signal.
4.3 Agentic Coding: TerminalBench-2
Benchmark
TerminalBench-2 evaluates LLM agents on 89 Dockerized real-world tasks spanning:
- Code translation
- Distributed ML setup
- Systems programming
- Bioinformatics
- Cryptanalysis
Grading is binary (pass/fail) with 5 independent trials per task. Tasks require long-horizon autonomous execution, handling complex software dependencies, truncated terminal outputs, and deep domain knowledge.
Meta-Harness Approach
Search initializes from Terminus 2 and Terminus-KIRA (two strong open baselines) and evolves the full coding harness:
- System prompts and task framing
- Tool definitions and invocation policies
- Completion-checking logic
- Context management under long execution horizons
The proposer reads per-task execution traces (command logs, error messages, timeout behavior) to diagnose failure modes and propose targeted fixes.
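That diagnosis step amounts to scanning raw logs for recurring failure signatures. The sketch below illustrates the idea with made-up error patterns (ours, not the paper's); the point is that raw traces, unlike summaries, let the proposer count and localize concrete failures across tasks:

```python
import re
from collections import Counter

def diagnose(trace_texts):
    """Sketch of failure-mode triage over raw execution traces (illustrative)."""
    patterns = {
        "timeout": re.compile(r"timed? ?out", re.I),
        "missing_dep": re.compile(r"ModuleNotFoundError|command not found"),
        "truncated": re.compile(r"\[output truncated\]"),
    }
    tally = Counter()
    for text in trace_texts:
        for name, pat in patterns.items():
            if pat.search(text):
                tally[name] += 1
    return tally

tally = diagnose([
    "step 12: pip install ... ModuleNotFoundError: no module named torch",
    "step 40: command timed out after 300s",
    "step 3: cmd timed out",
])
```

A tally like this turns "the agent failed" into "timeouts dominate failures", which points directly at a harness-level fix such as adjusting timeout handling or command batching.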
Leaderboard Results
Claude Haiku 4.5
| Agent | Pass Rate |
|---|---|
| OpenHands | 13.9% |
| Claude Code | 27.5% |
| Terminus 2 | 28.3% |
| Mini-SWE-Agent | 29.8% |
| Terminus-KIRA | 33.7% |
| Goose | 35.5% |
| Meta-Harness (Ours) | 37.6% 🥇 #1 |
Claude Opus 4.6
| Agent | Pass Rate |
|---|---|
| Claude Code | 58.0% |
| Terminus 2 | 62.9% |
| Mux | 66.5% |
| Factory Droid | 69.9% |
| TongAgents | 71.9% |
| MAYA-V2 | 72.1% |
| Terminus-KIRA | 74.7% |
| Capy | 75.3% |
| Meta-Harness (Ours) | 76.4% 🥈 #2 |
| ForgeCode | 81.8% |
Meta-Harness is the best-performing open harness for both model sizes on TerminalBench-2.
5. Related Work & Positioning
5.1 Text Optimization Methods
| Method | History / Log Content | Core Limitation | Mtok/iter |
|---|---|---|---|
| OPRO | Window of (solution, score) pairs | Scalar scores only, no traces | 0.002M |
| TextGrad | Last textual gradient (self-critique) | Only current candidate, no history | 0.015M |
| AlphaEvolve | Window of program database + eval scores | Score-heavy, compressed programs | 0.022M |
| GEPA | Summarized reflective feedback from rollouts | Summary loses causal chain | 0.008M |
| TTT-Discover | Window of prev. solution fragments | Structured, limited proposer input | 0.026M |
| Meta-Harness | Full: all logs, code, scores via filesystem | — | 10.0M |
5.2 Connections to Other Research
Retrieval-Augmented Generation (RAG)
Meta-Harness applies similar adaptive access patterns but in the harder regime of improving the retrieval mechanism itself, not just using it.
Meta-Learning
Assigns credit at the harness level rather than via gradient-based weight updates; uses experience from prior rollouts to rewrite future external behavior — a form of meta-learning without backpropagation.
Program Synthesis / Evolutionary Code Search
Methods like OpenEvolve evolve designated functions within fixed scaffolds. Meta-Harness evolves full stateful programs with unrestricted code changes and no pre-defined mutation operators.
Memory-Augmented Agents
Related to memory designs for continual-learning agents, but Meta-Harness targets per-task harnesses that reset between tasks (not persistent cross-task memory).
6. Significance & Implications
6.1 For AI Engineering Practice
- The discovered "Label-Primed Query" harness achieves higher text classification accuracy with less context than all hand-designed alternatives — without any additional LLM calls
- The math retrieval harness transfers to 5 models not seen during search, suggesting generalizable logic rather than model-specific tricks
- The TerminalBench-2 artifact is publicly released as a concrete starting point for practitioners
Practical takeaway: For any LLM application where harness design is a bottleneck, Meta-Harness offers a concrete, automatable alternative to manual iteration. The implementation uses Claude Code as the proposer, which is publicly accessible.
6.2 Theoretical Contributions
- Formalizes harness optimization as an objective: H* = argmax_H E[r(τ, x)], where τ is the trajectory produced by running model M under harness H on task x
- Identifies compression as the core failure mode of prior text optimizers for this setting
- Demonstrates empirically that execution traces — not summaries — are the key ingredient (Table 3 ablation)
- Shows that filesystem-based memory is an effective alternative to hand-designed search loops (no parent-selection rules, archive structures, or diversity mechanisms required)
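The Pareto selection mentioned above (keep candidates that are not beaten on both accuracy and context cost simultaneously) can be sketched directly; the field names are our own:

```python
def pareto_frontier(candidates):
    """Sketch of Pareto selection over accuracy (higher better) and
    context cost (lower better). Field names are illustrative."""
    def dominates(a, b):
        # a dominates b if a is at least as good on both axes
        # and strictly better on at least one.
        return (a["acc"] >= b["acc"] and a["cost"] <= b["cost"]
                and (a["acc"] > b["acc"] or a["cost"] < b["cost"]))
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates)]

frontier = pareto_frontier([
    {"name": "A", "acc": 48.6, "cost": 45.5},
    {"name": "B", "acc": 40.9, "cost": 203.0},   # dominated by A on both axes
    {"name": "C", "acc": 27.4, "cost": 0.0},     # cheapest, so it survives
])
```

Keeping a frontier rather than a single argmax preserves cheap-but-weaker harnesses that may be preferable under tight context budgets.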
6.3 Limitations & Future Work
- Compute cost: each run evaluates ~60 harnesses over 20 iterations; evaluation is the main bottleneck
- Proposer dependence: capability scales with the quality of the underlying coding agent — results will improve as agents improve
- Single-file harnesses: current experiments use single-file Python programs; multi-file or multi-agent architectures are future work
- The authors acknowledge this workflow "only became practical recently, following major improvements in coding-agent capabilities around early 2026" — the paradigm is tightly coupled to current frontier capabilities
7. Code & Reproducibility
| Resource | Link |
|---|---|
| Optimized TerminalBench-2 harness | https://github.com/stanford-iris-lab/meta-harness-tbench2-artifact |
| Project page + interactive demo | https://yoonholee.com/meta-harness/ |
| Paper | https://arxiv.org/abs/2603.28052 |
Implementation Details
| Component | Implementation |
|---|---|
| Proposer | Claude Code with Opus 4.6 (max reasoning mode) |
| Harness format | Single-file Python program (prompting + retrieval + memory + orchestration) |
| Search budget | ~60 harness evaluations over 20 iterations (2 candidates/iteration) |
| Filesystem access | grep, cat, standard terminal tools; proposer reads median 82 files/iteration |
| Validation | Interface validation before evaluation; invalid harnesses are discarded |
| Selection | Pareto frontier over accuracy and context cost; test set held out until final eval |
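The interface-validation step in the table can be as simple as a duck-typing check before any costly evaluation runs. A minimal sketch, assuming (our assumption) that the evaluator needs a couple of named callables:

```python
def validate_interface(harness, required=("build_context", "update")):
    """Sketch of pre-evaluation interface validation (illustrative names).

    A proposed harness is discarded unless it exposes every callable the
    evaluator needs, so malformed proposals never consume evaluation budget.
    """
    return all(callable(getattr(harness, m, None)) for m in required)

class Good:
    def build_context(self, query): return query
    def update(self, query, response): pass

class Bad:
    pass  # missing both required methods

ok = validate_interface(Good())
bad = validate_interface(Bad())
```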
The interactive demo on the project page steps through a concrete search run on TerminalBench-2 (hard 19-task subset), showing the proposer's reasoning at each iteration: counterfactual diagnosis, identification of failure modes from raw logs, and targeted fixes grounded in concrete evidence from prior runs.
8. Conclusion
Meta-Harness represents a significant step toward automated engineering of LLM application harnesses. Its core contribution is the identification of a critical mismatch between prior text optimization methods and the requirements of harness engineering: prior methods compress feedback too aggressively, discarding the execution traces needed to trace failures to specific harness decisions.
The solution — exposing a complete filesystem of prior experience to a coding-agent proposer — is conceptually simple but empirically powerful. By giving the proposer unrestricted access to source code, scores, and full execution traces, Meta-Harness enables diagnosis and repair at a qualitatively different level than summary-based methods.
The results across three very different domains — online text classification, IMO-level math reasoning, and agentic coding — demonstrate that richer access to prior experience consistently translates to better harnesses.
"This workflow only became practical recently, following major improvements in coding-agent capabilities around early 2026." — Lee et al. (2026)
The implication is clear: automated harness engineering is now a viable paradigm, and its ceiling has not yet been reached. As coding-agent capabilities continue to improve, Meta-Harness's performance will scale accordingly.
9. References
Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv:2603.28052.
Key cited works:
- [ACE] Zhang et al. (2025). Agentic Context Engineering.
- [MCE] Ye et al. (2025). Meta Context Engineering.
- [OPRO] Yang et al. (2024). Large Language Models as Optimizers.
- [TextGrad] Yuksekgonul et al. (2024). TextGrad: Automatic Differentiation via Text.
- [AlphaEvolve] DeepMind (2025). AlphaEvolve: Evolving Better Algorithms.
- [GEPA] Anokhin et al. (2025). GEPA: Goal-directed Evolution of Prompt Agents.
- [TTT-Discover] Anonymous (2025). Test-Time Training Discover.
- [Feedback Descent] Ye et al. (2025). Feedback Descent for LLM Optimization.
Report generated from arXiv:2603.28052 · Stanford IRIS Lab · 2026

