Report on META-HARNESS

Community Article Published April 8, 2026

End-to-End Optimization of Model Harnesses

Authors: Yoonho Lee · Roshen Nair · Qizheng Zhang · Kangwook Lee · Omar Khattab · Chelsea Finn
Affiliations: Stanford University · MIT · KRAFTON
Paper: arXiv:2603.28052 — Published March 30, 2026
Project page: yoonholee.com/meta-harness
Code: github.com/stanford-iris-lab/meta-harness-tbench2-artifact


Table of Contents

  1. Executive Summary
  2. Introduction & Motivation
  3. The Meta-Harness Method
  4. Experiments & Results
  5. Related Work & Positioning
  6. Significance & Implications
  7. Code & Reproducibility
  8. Conclusion
  9. References

1. Executive Summary

Meta-Harness is the first system to automate harness engineering end-to-end for LLM applications. By giving a coding agent (Claude Code) unrestricted filesystem access to all prior search history — source code, execution traces, and evaluation scores — it enables optimization at a scale (up to 10M tokens per step) orders of magnitude beyond any prior text optimization method.

The core thesis is simple but powerful: a "harness" — the code that wraps an LLM and determines what information to store, retrieve, and show — often matters as much as the model weights themselves. Yet harness design has remained largely manual. Meta-Harness automates this.

Key Results at a Glance

Domain                      | Result                                    | vs. Best Baseline
----------------------------|-------------------------------------------|------------------------------------------
Online Text Classification  | 48.6% accuracy (GPT-OSS-120B, 3 datasets) | +7.7 pts over ACE using 4× less context
IMO-level Math Reasoning    | +4.7 pts avg. over 5 unseen models        | Surpasses BM25 & dense retrieval on avg.
TerminalBench-2 (Haiku 4.5) | 37.6% pass rate                           | #1 among all Haiku 4.5 agents
TerminalBench-2 (Opus 4.6)  | 76.4% pass rate                           | #2 among all Opus 4.6 agents

2. Introduction & Motivation

2.1 What Is a Harness?

A harness is the code that surrounds a language model and governs its behavior in a deployed system. It is not just a system prompt — it is a complete stateful program that determines:

  • What information to retrieve from memory or external sources
  • How to format and present context to the model at each step
  • How to update the model's state after each interaction
  • Which tools to expose, and when to invoke them
  • How to structure tool outputs, intermediate reasoning, and final responses
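To make the abstraction concrete, here is a minimal sketch of such a stateful wrapper in Python. The class and method names are illustrative, not an API from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class Harness:
    """Minimal stateful harness sketch (illustrative names, not the paper's
    API). It owns memory, context construction, and state updates around a
    fixed LLM."""
    memory: list = field(default_factory=list)

    def build_context(self, query: str) -> str:
        # Retrieval policy: show stored entries sharing a word with the query.
        words = set(query.split())
        retrieved = [m for m in self.memory if words & set(m.split())]
        return "\n".join(retrieved + [f"Query: {query}"])

    def update(self, query: str, response: str) -> None:
        # Memory policy: store every (query, response) interaction verbatim.
        self.memory.append(f"{query} -> {response}")

h = Harness()
h.update("capital of France", "Paris")
context = h.build_context("capital of France")
```

A real harness would also handle tool exposure and output structuring; the point is that each method is an explicit, searchable design decision rather than a fixed prompt.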

"Changing the harness around a fixed LLM can produce a 6× performance gap on the same benchmark." — Lee et al. (2026)

Despite this outsized impact, harnesses have been designed largely by hand. Engineers inspect failure cases, adjust heuristics, and iterate across a small number of candidates. This process is slow, expensive, and expert-dependent.

2.2 Why Can't Existing Methods Do This?

Text optimization methods like OPRO, TextGrad, AlphaEvolve, and GEPA all attempt to iteratively improve prompts or code artifacts using LLM feedback. They are poorly matched to harness engineering for several fundamental reasons:

  • They compress feedback too aggressively — scalar scores, sliding windows, or LLM-generated summaries
  • Harnesses act over long horizons: a decision about when to retrieve context may cause failure many steps later
  • Compressed summaries lose diagnostic precision needed to trace failures to specific harness decisions
  • The relevant information — raw execution traces, tool call sequences, intermediate state — is distributed and large

This is quantified clearly in Table 1 of the paper: prior methods operate with 100 to 30,000 tokens of context per optimization step. Meta-Harness operates with up to 10,000,000.

Context Budget Comparison (Log Scale)

Method          | Mtok/iter  | History type
----------------|------------|-------------------------------
Self-Refine     | 0.001M     | Last output + self-critique
OPRO            | 0.002M     | Window of (solution, score) pairs
MIPRO           | 0.003M     | Bootstrapped program traces
GEPA            | 0.008M     | Summarized rollout traces
TextGrad        | 0.015M     | LLM textual gradient
Feedback Descent| 0.012M     | Pairwise comparison + feedback
AlphaEvolve     | 0.022M     | Program database + eval scores
TTT-Discover    | 0.026M     | Prev. solution fragments
Meta-Harness ★  | 10.000M    | Full: all logs, code, scores

~400× more context than the next best method (TTT-Discover at 26K tokens)


3. The Meta-Harness Method

3.1 Core Architecture

Meta-Harness is itself a harness — a harness for optimizing harnesses (hence the name). It consists of three components in a closed loop:

┌─────────────────────────────────────────────────────────────────┐
│                    META-HARNESS SEARCH LOOP                     │
│                                                                 │
│   ┌─────────────────┐    ① reads filesystem                    │
│   │  PROPOSER AGENT │─────────────────────────►┌─────────────┐ │
│   │  Claude Code    │                           │ FILESYSTEM D│ │
│   │  (Opus 4.6)     │◄────────────────────────  │ code+traces │ │
│   └────────┬────────┘    ③ store all logs       │ +scores     │ │
│            │                                    └─────────────┘ │
│            │ ② proposes new harness                             │
│            ▼                                                    │
│   ┌─────────────────┐                                          │
│   │   HARNESS  H    │──────────────────────────►┌───────────┐  │
│   │  Python program │     evaluate              │ EVALUATOR │  │
│   │  (prompt/mem/   │◄──────────────────────── │ task set  │  │
│   │   retrieval)    │     return score           └───────────┘  │
│   └─────────────────┘                                          │
│                                                                 │
│  ⚡ Key: up to 10M tokens/iter vs. ≤26K for all prior methods  │
└─────────────────────────────────────────────────────────────────┘
Component    | Description
-------------|--------------------------------------------------------------------------
Proposer     | Claude Code (Opus 4.6) — reads filesystem via grep/cat, reasons about prior failures, proposes new harness code
Filesystem D | Growing directory of all prior candidates: source code, evaluation scores, full execution traces (prompts, tool calls, model outputs, state updates)
Evaluator    | Runs each proposed harness on a held-out search set and returns a reward; logs all results to the filesystem. Test sets remain held out until the very end.

3.2 The Search Algorithm

Algorithm 1 — Meta-Harness outer loop

Input:  tasks X, LLM M, proposer P, iterations N
Init:   seed population H₀ = {H¹, …, Hᵐ}
        filesystem D ← ∅

for H in H₀:
    E_H ← Evaluate(H, M, X)
    D   ← D ∪ {(H, E_H)}

for t = 1 … N:
    P queries filesystem D          ← inspects prior code, scores, traces
    P proposes k new harnesses {H₁ … Hₖ}
    for H in {H₁ … Hₖ}:
        if H passes interface validation:
            D ← D ∪ {(H, Evaluate(H, M, X))}

return Pareto frontier of harnesses in D

Key design insight: No parent-selection rule is imposed. The proposer freely inspects any prior harness and its traces. This lets it identify confounded edits, trace causal chains of failure, and shift strategies when a class of edits regresses. In practice it reads a median of 82 files per iteration, referencing 20+ prior candidates per step.
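The outer loop above can be sketched as ordinary Python. Here `evaluate` and `propose` are stand-ins for the evaluator and the Claude Code proposer, and the file layout is our assumption, not the released implementation:

```python
import json
import pathlib
import tempfile

def meta_harness_search(seeds, evaluate, propose, iterations, k=2):
    """Sketch of Algorithm 1 (names and file layout are ours). `evaluate`
    returns a scalar score for a harness source; `propose` may freely read
    directory D and returns up to k candidate sources."""
    D = pathlib.Path(tempfile.mkdtemp())          # filesystem of all history
    def log(idx, src, score):
        (D / f"harness_{idx:03d}.json").write_text(
            json.dumps({"source": src, "score": score}))
    for i, src in enumerate(seeds):               # score the seed population
        log(i, src, evaluate(src))
    n = len(seeds)
    for _ in range(iterations):
        for src in propose(D, k):                 # proposer inspects D freely
            try:
                score = evaluate(src)             # invalid harnesses raise...
            except Exception:
                continue                          # ...and are discarded
            log(n, src, score)
            n += 1
    records = [json.loads(p.read_text()) for p in sorted(D.glob("*.json"))]
    # The paper returns a Pareto frontier; a single best is enough here.
    return max(records, key=lambda r: r["score"])
```

Note that the proposer receives the directory itself, not a curated summary of it; that is the design decision the paper argues is decisive.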

3.3 Why Code-Space Search?

Harness optimization targets executable Python programs, not just text templates:

  • Small changes to retrieval or memory logic can affect behavior many steps later — local heuristics are ill-suited
  • Coding models have strong inductive biases toward coherent algorithms rather than brittle hard-coded solutions
  • The proposer can make structural changes — full rewrites, new retrieval strategies, different memory schemas — not just fill templates
  • The action space aligns with the read-write-execute workflows on which frontier coding models are trained

Important: The proposer never sees test-set results. All feedback comes exclusively from the search set and execution traces — critical for avoiding overfitting.


4. Experiments & Results

4.1 Online Text Classification

Setup

An LLM receives labeled examples one at a time, updates its memory, and is evaluated on a held-out test set (following the ACE setup).

Dataset               | Task                                             | Classes | Difficulty
----------------------|--------------------------------------------------|---------|-----------
LawBench (Law)        | Predict criminal charges from case descriptions  | 215     | Very Hard
Symptom2Disease (S2D) | Predict disease from symptom descriptions        | 22      | Moderate
USPTO-50k             | Predict precursor reactants from product molecules | 180   | Hard

Model: GPT-OSS-120B
Search: 20 evolution iterations × 2 candidates = 40 total harness evaluations
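A minimal sketch of this online protocol, with a toy word-overlap `predict` standing in for the LLM call (all names here are ours, not the paper's):

```python
def online_classification(predict, stream, test_set):
    """Sketch of the online protocol (ACE setup): labeled examples arrive
    one at a time and are folded into memory; accuracy is then measured on
    a held-out test set."""
    memory = []                               # the harness's running memory
    for text, label in stream:                # one pass over the labeled stream
        memory.append((text, label))          # simplest possible update rule
    hits = sum(predict(memory, text) == label for text, label in test_set)
    return hits / len(test_set)

def word_overlap_predict(memory, text):
    # Toy stand-in: return the label of the stored example sharing most words.
    overlap = lambda a, b: len(set(a.split()) & set(b.split()))
    return max(memory, key=lambda m: overlap(m[0], text))[1]
```

Harness search then amounts to rewriting the update rule and the context shown to `predict`, which is where the discovered harnesses diverge from this store-everything baseline.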

Main Results

Harness             | USPTO ↑ | S2D ↑ | Law ↑ | Avg Acc ↑ | Context (K chars) ↓
--------------------|---------|-------|-------|-----------|--------------------
Zero-Shot           | 12.0    | 63.2  | 7.0   | 27.4      | 0
Few-Shot (N=8)      | 14.0    | 67.9  | 21.0  | 34.3      | 8.0
Few-Shot (N=32)     | 13.0    | 72.2  | 21.0  | 35.4      | 31.5
Few-Shot (all)      | 15.0    | 78.3  | 29.0  | 40.8      | 49.3
MCE                 | 14.0    | 83.0  | 23.0  | 40.0      | 114.0
ACE                 | 16.0    | 77.8  | 29.0  | 40.9      | 203.0
Meta-Harness (Ours) | 14.0    | 86.8  | 45.0  | 48.6      | 45.5

Meta-Harness achieves +7.7 pts over ACE while using less than a quarter of ACE's context. Gains concentrate on large label spaces: LawBench +16 pts, S2D +9 pts.

Comparison vs. Text Optimizers (same budget)

Method       | Median Acc | Best Acc
-------------|------------|---------
GEPA         | 32.6       | 40.2
Best-of-N    | 34.0       | 44.2
TTT-Discover | 34.1       | 45.6
OpenEvolve   | 39.1       | 43.3
Meta-Harness | 50.0       | 56.7

Meta-Harness reaches the best prior text optimizer's final accuracy in 10× fewer evaluations, and its own final result surpasses theirs by more than 10 points.

Ablation: What Information Matters?

Method              | Scores | Code | Summaries | Traces | Median Acc | Best Acc
--------------------|--------|------|-----------|--------|------------|---------
Scores Only         | ✓      |      |           |        | 34.6%      | 41.3%
Scores + Summary    | ✓      |      | ✓         |        | 34.9%      | 38.7%
Meta-Harness (Full) | ✓      | ✓    |           | ✓      | 50.0%      | 56.7%

Critical finding: adding LLM-generated summaries to scores barely moves the median (34.9% vs. 34.6%) and lowers the best accuracy (38.7% vs. 41.3%), suggesting summaries compress away diagnostically useful details. Raw execution traces are the key ingredient.

Out-of-Distribution Generalization (9 unseen datasets)

Harness       | Avg Accuracy | Context (K chars)
--------------|--------------|------------------
Zero-shot     | 67.0%        | 0
Few-shot (32) | 69.6%        | 5.2
ACE           | 70.2%        | 11.7
Meta-Harness  | 73.1%        | 7.3

Meta-Harness achieves best accuracy on 6/9 unseen datasets — confirming it captures generally effective strategies, not search-set-specific tricks.


4.2 Retrieval-Augmented Math Reasoning

Setup

A non-standard but well-motivated setup: augmenting LLMs with a retrieval corpus (≥500K problems from 8 open-source datasets) for IMO-level olympiad math. Meta-Harness searches over arbitrary retrieval programs that can filter, branch, and format using corpus metadata and BM25 scores.

  • Search set: 250 problems
  • Evaluation set: 200 held-out IMO-level problems
  • Transfer: same harness tested on 5 models unseen during search
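The discovered retrieval programs filter and branch on top of BM25 scores, so the baseline scoring function is worth stating concretely. A plain BM25 sketch (standard Okapi form; parameter defaults are conventional, not taken from the paper):

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score each whitespace-tokenized document in `corpus` against `query`
    with Okapi BM25. The retrieval harnesses described above compose on top
    of scores like these, plus corpus metadata."""
    docs = [doc.split() for doc in corpus]
    avgdl = sum(len(d) for d in docs) / len(docs)     # average doc length
    df = Counter(w for d in docs for w in set(d))     # document frequency
    N = len(docs)
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for w in query.split():
            if w not in tf:
                continue
            idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5) + 1)
            s += idf * tf[w] * (k1 + 1) / (
                tf[w] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

The search space is everything a program can do with these scores: thresholding, branching on problem metadata, and reformatting retrieved problems before they reach the model.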

Results

Method          | GPT-5.4n    | GPT-5.4m    | Gem-3.1FL   | Gem-3F      | GPT-20B     | Avg
----------------|-------------|-------------|-------------|-------------|-------------|------------
No Retriever    | 23.0        | 28.8        | 28.6        | 42.6        | 47.6        | 34.1
Random Few-shot | 23.1 (+0.1) | 24.5 (−4.3) | 31.0 (+2.4) | 40.4 (−2.2) | 41.8 (−5.8) | 32.2 (−1.9)
BM25 Retrieval  | 30.2 (+7.2) | 29.2 (+0.4) | 32.8 (+4.2) | 46.6 (+4.0) | 48.9 (+1.3) | 37.5 (+3.4)
Meta-Harness    | 31.7 (+8.7) | 30.4 (+1.6) | 34.9 (+6.3) | 46.3 (+3.7) | 50.6 (+3.0) | 38.8 (+4.7)

Notable: Random few-shot retrieval hurts performance (32.2% vs. 34.1% baseline), confirming that naive retrieval fails on reasoning-intensive benchmarks. The discovered Meta-Harness retrieval strategy improves all 5 models, including those not seen during search — a strong generalization signal.


4.3 Agentic Coding: TerminalBench-2

Benchmark

TerminalBench-2 evaluates LLM agents on 89 Dockerized real-world tasks spanning:

  • Code translation
  • Distributed ML setup
  • Systems programming
  • Bioinformatics
  • Cryptanalysis

Grading is binary (pass/fail) with 5 independent trials per task. Tasks require long-horizon autonomous execution, handling complex software dependencies, truncated terminal outputs, and deep domain knowledge.
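Under this grading scheme, an overall pass rate can be computed as the mean over tasks of each task's trial pass fraction. This aggregation is our assumption for illustration; the official leaderboard may aggregate differently:

```python
def pass_rate(trial_results):
    """Mean pass rate under binary grading with repeated trials:
    average within each task's 5 trials, then across tasks.
    `trial_results` maps task name -> list of booleans (names ours)."""
    per_task = [sum(trials) / len(trials) for trials in trial_results.values()]
    return sum(per_task) / len(per_task)
```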

Meta-Harness Approach

Search initializes from Terminus 2 and Terminus-KIRA (two strong open baselines) and evolves the full coding harness:

  • System prompts and task framing
  • Tool definitions and invocation policies
  • Completion-checking logic
  • Context management under long execution horizons

The proposer reads per-task execution traces (command logs, error messages, timeout behavior) to diagnose failure modes and propose targeted fixes.
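An illustrative fragment of this kind of trace triage: scan per-task logs for error markers and surface the first diagnostic lines. The file layout ("*.log") and markers here are assumptions, not the format of the released artifact:

```python
import pathlib

def failing_tasks(trace_dir):
    """Collect the first diagnostic lines from per-task execution logs,
    the kind of filesystem inspection the proposer performs via grep/cat
    (layout and markers are illustrative assumptions)."""
    failures = {}
    for log in sorted(pathlib.Path(trace_dir).glob("*.log")):
        bad = [line for line in log.read_text().splitlines()
               if "ERROR" in line or "Traceback" in line]
        if bad:
            failures[log.name] = bad[:3]     # first few lines per task
    return failures
```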

Leaderboard Results

Claude Haiku 4.5

Agent               | Pass Rate
--------------------|------------
OpenHands           | 13.9%
Claude Code         | 27.5%
Terminus 2          | 28.3%
Mini-SWE-Agent      | 29.8%
Terminus-KIRA       | 33.7%
Goose               | 35.5%
Meta-Harness (Ours) | 37.6% 🥇 #1

Claude Opus 4.6

Agent               | Pass Rate
--------------------|------------
Claude Code         | 58.0%
Terminus 2          | 62.9%
Mux                 | 66.5%
Factory Droid       | 69.9%
TongAgents          | 71.9%
MAYA-V2             | 72.1%
Terminus-KIRA       | 74.7%
Capy                | 75.3%
Meta-Harness (Ours) | 76.4% 🥈 #2
ForgeCode           | 81.8%

Meta-Harness is the best-performing open harness for both model sizes on TerminalBench-2.


5. Related Work & Positioning

5.1 Text Optimization Methods

Method       | History / Log Content                        | Core Limitation                     | Mtok/iter
-------------|----------------------------------------------|-------------------------------------|----------
OPRO         | Window of (solution, score) pairs            | Scalar scores only, no traces       | 0.002M
TextGrad     | Last textual gradient (self-critique)        | Only current candidate, no history  | 0.015M
AlphaEvolve  | Window of program database + eval scores     | Score-heavy, compressed programs    | 0.022M
GEPA         | Summarized reflective feedback from rollouts | Summary loses causal chain          | 0.008M
TTT-Discover | Window of prev. solution fragments           | Structured, limited proposer input  | 0.026M
Meta-Harness | Full: all logs, code, scores via filesystem  | —                                   | 10.0M

5.2 Connections to Other Research

Retrieval-Augmented Generation (RAG)
Meta-Harness applies similar adaptive access patterns but in the harder regime of improving the retrieval mechanism itself, not just using it.

Meta-Learning
Assigns credit at the harness level rather than via gradient-based weight updates; uses experience from prior rollouts to rewrite future external behavior — a form of meta-learning without backpropagation.

Program Synthesis / Evolutionary Code Search
Methods like OpenEvolve evolve designated functions within fixed scaffolds. Meta-Harness evolves full stateful programs with unrestricted code changes and no pre-defined mutation operators.

Memory-Augmented Agents
Related to memory designs for continual-learning agents, but Meta-Harness targets per-task harnesses that reset between tasks (not persistent cross-task memory).


6. Significance & Implications

6.1 For AI Engineering Practice

  • The discovered "Label-Primed Query" harness achieves higher text classification accuracy with less context than all hand-designed alternatives — without any additional LLM calls
  • The math retrieval harness transfers to 5 models not seen during search, suggesting generalizable logic rather than model-specific tricks
  • The TerminalBench-2 artifact is publicly released as a concrete starting point for practitioners

Practical takeaway: For any LLM application where harness design is a bottleneck, Meta-Harness offers a concrete, automatable alternative to manual iteration. The implementation uses Claude Code as the proposer, which is publicly accessible.

6.2 Theoretical Contributions

  • Formalizes harness optimization as an objective: H* = argmax_H E[r(τ, x)]
  • Identifies compression as the core failure mode of prior text optimizers for this setting
  • Demonstrates empirically that execution traces — not summaries — are the key ingredient (Table 3 ablation)
  • Shows that filesystem-based memory is an effective alternative to hand-designed search loops (no parent-selection rules, archive structures, or diversity mechanisms required)
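Written out with explicit expectations, the objective in the first bullet reads as follows (the notation is ours: τ is the trajectory produced by running harness H with model M on a task x drawn from the task distribution 𝒳, and r is the task reward):

```latex
H^{*} \;=\; \arg\max_{H} \; \mathbb{E}_{\,x \sim \mathcal{X},\; \tau \sim H(M,\, x)}\!\left[\, r(\tau, x) \,\right]
```

The search loop approximates this expectation on the held-out search set, never the test set.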

6.3 Limitations & Future Work

  • Compute cost: each run evaluates ~60 harnesses over 20 iterations; evaluation is the main bottleneck
  • Proposer dependence: capability scales with the quality of the underlying coding agent — results will improve as agents improve
  • Single-file harnesses: current experiments use single-file Python programs; multi-file or multi-agent architectures are future work
  • The authors acknowledge this workflow "only became practical recently, following major improvements in coding-agent capabilities around early 2026" — the paradigm is tightly coupled to current frontier capabilities

7. Code & Reproducibility

Resource                          | Link
----------------------------------|--------------------------------------------------------------------
Optimized TerminalBench-2 harness | https://github.com/stanford-iris-lab/meta-harness-tbench2-artifact
Project page + interactive demo   | https://yoonholee.com/meta-harness/
Paper                             | https://arxiv.org/abs/2603.28052

Implementation Details

Component         | Implementation
------------------|-----------------------------------------------------------------------------
Proposer          | Claude Code with Opus 4.6 (max reasoning mode)
Harness format    | Single-file Python program (prompting + retrieval + memory + orchestration)
Search budget     | ~60 harness evaluations over 20 iterations (2 candidates/iteration)
Filesystem access | grep, cat, standard terminal tools; proposer reads a median of 82 files/iteration
Validation        | Interface validation before evaluation; invalid harnesses are discarded
Selection         | Pareto frontier over accuracy and context cost; test set held out until final eval
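The Pareto selection in the last row can be sketched directly. Candidates are (name, accuracy, context_cost) tuples; the names and numbers below are illustrative:

```python
def pareto_frontier(candidates):
    """Keep candidates not dominated on (accuracy up, context cost down):
    a candidate is dropped if another is at least as good on both axes
    and strictly better on one."""
    frontier = []
    for name, acc, cost in candidates:
        dominated = any(
            a >= acc and c <= cost and (a > acc or c < cost)
            for _, a, c in candidates)
        if not dominated:
            frontier.append(name)
    return frontier
```

Returning a frontier rather than a single winner lets the final choice trade accuracy against context cost per deployment.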

The interactive demo on the project page steps through a concrete search run on TerminalBench-2 (hard 19-task subset), showing the proposer's reasoning at each iteration: counterfactual diagnosis, identification of failure modes from raw logs, and targeted fixes grounded in concrete evidence from prior runs.


8. Conclusion

Meta-Harness represents a significant step toward automated engineering of LLM application harnesses. Its core contribution is the identification of a critical mismatch between prior text optimization methods and the requirements of harness engineering: prior methods compress feedback too aggressively, discarding the execution traces needed to trace failures to specific harness decisions.

The solution — exposing a complete filesystem of prior experience to a coding-agent proposer — is conceptually simple but empirically powerful. By giving the proposer unrestricted access to source code, scores, and full execution traces, Meta-Harness enables diagnosis and repair at a qualitatively different level than summary-based methods.

The results across three very different domains — online text classification, IMO-level math reasoning, and agentic coding — demonstrate that richer access to prior experience consistently translates to better harnesses.

"This workflow only became practical recently, following major improvements in coding-agent capabilities around early 2026." — Lee et al. (2026)

The implication is clear: automated harness engineering is now a viable paradigm, and its ceiling has not yet been reached. As coding-agent capabilities continue to improve, Meta-Harness's performance will scale accordingly.


9. References

Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., & Finn, C. (2026). Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv:2603.28052.

Key cited works:

  • [ACE] Zhang et al. (2025). Agentic Context Engineering.
  • [MCE] Ye et al. (2025). Meta Context Engineering.
  • [OPRO] Yang et al. (2024). Large Language Models as Optimizers.
  • [TextGrad] Yuksekgonul et al. (2024). TextGrad: Automatic Differentiation via Text.
  • [AlphaEvolve] DeepMind (2025). AlphaEvolve: Evolving Better Algorithms.
  • [GEPA] Anokhin et al. (2025). GEPA: Goal-directed Evolution of Prompt Agents.
  • [TTT-Discover] Anonymous (2025). Test-Time Training Discover.
  • [Feedback Descent] Ye et al. (2025). Feedback Descent for LLM Optimization.

Report generated from arXiv:2603.28052 · Stanford IRIS Lab · 2026
