AI Scientist v3: An Agent-Native Refactor -- Scaling from 1-Hour to 24-Hour Runs with a Reviewer Agent

Community Article Published March 1, 2026

February 24, 2026 | GitHub | Live Dashboard

The original AI Scientist v2 was held together by hardcoded workflow management -- a 4-stage pipeline with explicit breadth-first search over research strategies, manual parallelism, and rigid completion criteria. It worked and produced an ICLR workshop paper, but it felt like building hand-crafted rules around a model.

I refactored it from two convictions:

  • Agents like Claude should orchestrate themselves. A frontier model with code execution doesn't need a Python script telling it when to run experiments vs. write the paper. The conversation history is the search tree.
  • We learn from natural language feedback. Researchers grow through peer review -- the feedback varies in effort and quality, but the loop of review, rebuttal, and re-experiment is how science actually works. Agents can work the same way.

AI Scientist v3 replaced ~5,000 lines of orchestration code with a CLAUDE.md instructions file and a single skill for literature search.

The agent does everything else natively. The rest of the codebase handles infrastructure (Harbor/GitLab) so you can scale out to many concurrent jobs, running locally or via a GPU provider like Modal with per-job Docker isolation, with GitLab storing code and a viewer web app for monitoring.

The Architecture

After initially decomposing v2 into skills like run-experiment, write-paper, plot-results (initial commit), I kept finding unnecessary ones and deleting them. You don't need to teach a frontier model things like "Show key metrics during execution so the log captures them" or "Use \ref{fig:label} -- make sure labels match." They already know this, and likely have better built-in taste than any skill prompt.

What remains:

  1. A workspace -- an experiment folder with an Overleaf-style ICLR workshop LaTeX template, organized into baselines/, main/, ablations/, plotting/, and cloned_repos/
  2. One skill -- search-papers, which teaches the agent how to query Semantic Scholar, OpenAlex, OpenReview, and CrossRef for literature. Claude's native web fetch was not as reliable, so this is a net gain.

The search-papers skill itself is 177 lines of API reference -- endpoint URLs, rate limits, gotchas (Semantic Scholar abstracts contain control characters that break JSON; arXiv enforces a 3-second rate limit between downloads; OpenReview V2 wraps every value in a .value field). This is the kind of knowledge a model genuinely can't derive from first principles. Everything else -- experiment design, statistical rigor, LaTeX conventions -- it already knows.
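
As a flavor of what those gotchas look like in code, here is a minimal sketch of two of them -- stripping control characters from Semantic Scholar abstracts before JSON parsing, and unwrapping OpenReview V2's .value fields. The function names are mine, not the skill's:

```python
import json
import re

def sanitize(raw: str) -> str:
    """Semantic Scholar abstracts can contain control characters that break
    json.loads -- strip everything below U+0020 except \t, \n, \r."""
    return re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", raw)

def unwrap(obj):
    """OpenReview V2 wraps every field as {"value": ...}; flatten recursively."""
    if isinstance(obj, dict):
        if set(obj) == {"value"}:
            return unwrap(obj["value"])
        return {k: unwrap(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [unwrap(x) for x in obj]
    return obj

raw = '{"title": "TabPFN\x01 study"}'       # raw API body with a control char
print(json.loads(sanitize(raw)))             # parses cleanly after sanitizing
note = {"content": {"title": {"value": "A Paper"}, "rating": {"value": 6}}}
print(unwrap(note))                          # flat dict without .value wrappers
```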

What Ideas Look Like

An idea.json can be as structured as a full experiment plan with benchmarks, baselines, and code references, or as casual as a paragraph-length hypothesis. The instruction template tells the agent upfront: "Your idea may not always be the best idea right out of the gate -- treat it as a seed, navigate and steer towards impactful, novel work as you experiment, literature review, and interact with the reviewer."

A sample idea from the list of ideas:

{
  "Name": "tabpfn_vs_boosting",
  "Title": "When Does TabPFN Beat Gradient Boosting? A Systematic Study on Small Tabular Datasets",
  "Short Hypothesis": "TabPFN v2's in-context learning approach likely has a crossover point against XGBoost/LightGBM that depends on dataset size, feature count, and feature types — but the exact boundary is poorly characterized.",
  "Related Work": "Hollmann et al. (2023) introduced TabPFN at ICLR. PriorLabs released TabPFN v2 (Nature, 2025). McElfresh et al. (2023) asked 'When Do Neural Nets Outperform Boosted Trees on Tabular Data?' but predates TabPFN v2. No systematic study of v2 against tuned gradient boosting on the OpenML CC18 benchmark exists."
}

So far the system has run 15+ distinct research ideas across 8 domains: video QA, tool-augmented image generation, LLM-as-a-judge evaluation, prompt injection defense, vector search hubness repair, memory conflict resolution, chart QA robustness, and tabular ML. Most are designed for API-only execution (no GPU required), with a few leveraging a local RTX 5080 or cloud GPUs via Modal for training.

How a Job Runs

# Launch a research job
./run.sh ideas/idea_tabulartransformer.json                    # CPU
./run.sh ideas/idea_tabulartransformer.json --gpus 1           # GPU
./run.sh ideas/idea_tabulartransformer.json --agent gemini-cli # Gemini instead of Claude

Under the hood:

  1. run.sh injects idea.json into an instruction template and builds a Docker container (CPU-slim or GPU+CUDA)
  2. The agent (Claude Code or Gemini CLI) starts with a pre-initialized git repo, a pre-structured workspace, and a persistent /data mount for cached datasets
  3. A patched agent wrapper syncs artifacts every 180 seconds -- if the job times out, partial work is saved
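
Step 1's injection amounts to a simple template substitution; a sketch, where the {{IDEA_JSON}} placeholder and filenames are my assumptions rather than the actual template syntax:

```python
import json

def render_instructions(template: str, idea: dict) -> str:
    # Substitute the idea into the instruction file the agent reads at startup
    return template.replace("{{IDEA_JSON}}", json.dumps(idea, indent=2))

idea = {"Name": "tabpfn_vs_boosting", "Short Hypothesis": "..."}
template = "# Research Task\nYour idea:\n{{IDEA_JSON}}\nTreat it as a seed."
print(render_instructions(template, idea))
```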

Jobs can be resumed with previous artifacts and human feedback:

./run.sh idea.json --resume-from jobs/prev_job/ --feedback "Add error bars to Figure 3"

The Viewer

At aiscientist.lishengzhi.com (or your own hosted version) you can see a breakdown of all jobs -- their status, model, duration, token usage, and cost:


Each job has a Submission Dossier with the rendered PDF, reviewer comments, and the agent's rebuttal -- the full review conversation, versioned:


Trajectories capture every tool call the agent makes. We typically see hundreds of tool calls, and as many as 4,000, as the agent works through literature review, experiments, paper writing, and multiple rounds of reviewer rebuttal:


Some of the most interesting tool calls are literature-search related -- the agent querying Semantic Scholar, downloading papers, reading them, and incorporating findings:


The Reviewer Agent

Earlier versions used a custom ICLR-trained reviewer model I helped train that only looked at the final PDF. But many issues in long-running research jobs can't be diagnosed from the paper alone -- experiment tracking, code quality, statistical soundness, whether the agent actually ran the experiments it claims.

When we switched to a reviewer agent that spends ~10 minutes working through the entire workspace, it found far more issues. The reviewer is a subagent with full file access (reviewer prompt):

  • Phase 1: Read the paper, evaluate claims, methodology, writing quality
  • Phase 2: Audit experiment code -- check organization, self-contained scripts, result files present. Specifically flags throwaway scripts (fix_*.py, debug_*.py) and versioned copies (*_v2.py) left in the codebase
  • Phase 3: Verify results -- cross-check numbers in the paper against actual JSON/CSV result files. The prompt is explicit: "If the paper says 'we achieve X% improvement', find the actual numbers in result files"
  • Phase 4: Inspect every figure -- visually read each PNG, verify axis labels and readability, and check that each figure matches the paper's claims
  • Phase 5: Independent literature search to check novelty and missing citations
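
Phase 3's cross-check can be pictured as a grep for claimed numbers; a toy sketch, not the reviewer's actual implementation:

```python
import re
from pathlib import Path

def claimed_numbers(paper_text: str) -> set[str]:
    """Pull decimal numbers out of the paper's prose (e.g. '87.3% accuracy')."""
    return set(re.findall(r"\d+\.\d+", paper_text))

def recorded_numbers(result_dir: Path) -> set[str]:
    """Collect every decimal number appearing in the JSON result files."""
    found: set[str] = set()
    for f in result_dir.glob("**/*.json"):
        found |= set(re.findall(r"\d+\.\d+", f.read_text()))
    return found

def unsupported_claims(paper_text: str, result_dir: Path) -> set[str]:
    """Numbers the paper claims that no result file backs up."""
    return claimed_numbers(paper_text) - recorded_numbers(result_dir)
```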

The reviewer agent makes ~113 tool calls per review, spending 7-12 minutes actively scanning the workspace.

There are two reviewer modes. API mode (~30 seconds) calls an external reviewer endpoint for quick feedback. Subagent mode (2-20 minutes) spawns a full agent -- for Claude, it runs claude -p --agent reviewer; for Gemini, it extracts the system prompt and calls gemini --yolo.
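
The subagent dispatch is essentially per-agent command construction; a hedged sketch (for Gemini, the real system first extracts the reviewer system prompt, a step elided here):

```python
def reviewer_command(agent: str) -> list[str]:
    """Build the CLI invocation for the reviewer subagent."""
    if agent == "claude":
        return ["claude", "-p", "--agent", "reviewer"]
    if agent == "gemini-cli":
        return ["gemini", "--yolo"]
    raise ValueError(f"unsupported agent: {agent}")
```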

The research agent then receives the review, writes a rebuttal, runs additional experiments if needed, and resubmits with the rebuttal.

Git as a Natural Artifact Layer

When an agent finishes a research job, you need a durable, browsable record of what it did -- the trajectory, the paper, the figures, the reviewer conversation. Git is a natural fit: it already handles versioning, diffing, and branch-based organization.


Each research idea gets its own private GitLab repo (e.g., selective-self-reference-judge), created on-demand via the API. Each agent run becomes a branch named {agent}-{timestamp} -- so claude-2026-02-25-13-29 and gemini-2026-02-25-18-00 are two runs of the same idea with different models, both visible as sibling branches in the same repo. This makes it trivial to compare runs: same idea, different agents or hyperparameters, all browsable side-by-side.

After a job completes, push_to_gitlab.py stages a sanitized snapshot of the artifacts into a flat layout:

agent_trace/              # sanitized trajectory + pre-computed summary + metadata
reviewer_trace/           # version_log, response.md, reviewer JSONL per submission
config.json               # task config (sanitized)
result.json               # task result (sanitized)
idea.json                 # original idea input
paper.pdf                 # latest compiled paper
figures/                  # publication-quality PNGs

The viewer (aiscientist.lishengzhi.com) reads directly from the GitLab API for completed jobs -- fetching trajectory.json, idea.json, and reviewer files on demand. For running jobs, it falls back to local disk.

Before any artifact reaches GitLab, it passes through a three-layer sanitization pipeline that strips secrets -- .env keys, {"api_key": "value"} entries, and the like -- so nothing sensitive is pushed.
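
One layer of that sanitization can be sketched as a recursive redaction pass. The key patterns here are illustrative; the real pipeline has three layers:

```python
import re

# Key names that look secret-bearing (illustrative pattern, not the real one)
SECRET_KEY = re.compile(r"(api[_-]?key|token|secret|password)", re.IGNORECASE)

def redact(obj):
    """Replace values of secret-looking keys before anything is committed."""
    if isinstance(obj, dict):
        return {k: ("[REDACTED]" if SECRET_KEY.search(k) else redact(v))
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(x) for x in obj]
    return obj

print(redact({"api_key": "sk-123", "config": {"model": "opus", "hf_token": "abc"}}))
```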

Each run becomes a branch, and previous runs stay visible. We can then instruct the agent to learn from its own history across sessions and from other runs -- git diff and git log may be more natural to coding agents than reading a markdown memory.md.

Lessons Learned

Practice heavy restraint with "vibe skills"

The instinct when building agent systems is to add more instructions: tell the model about every edge case, every convention, every trick. Resist this. Frontier models already have good taste about code organization, LaTeX formatting, and experiment design. Every unnecessary instruction is noise. Would you hand a frontier model the kind of hyper-prescriptive onboarding doc you'd write for an intern?

I went from dozens of skills to one. The only skill that survived is literature search -- because the agent genuinely can't discover API endpoints for Semantic Scholar, OpenAlex, and CrossRef on its own.

Scaling the reviewer beyond reading the PDF like a human reviewer

Initially I used a reviewer model trained with RLHF that beats Gemini/GPT on critical answer generation -- part of this project's motivation was to see how useful such a reviewer model is in agent loops.

But I came to realize that a reviewer who only reads the final paper is like a code reviewer who only reads the README. The reviewer agent catches things like:

  • Experiment scripts that don't set random seeds
  • Results in the paper that don't match the JSON files
  • Missing baselines or ablations
  • Figures with unlabeled axes
  • Claimed datasets that were never actually downloaded

Review stagnation

These projects start from an initial idea -- sometimes carefully structured, sometimes just vibed. As the agent experiments, it can get stuck. The reviewer feedback plateaus and no longer provides enough "steering strength" to escape the local minimum.

Here's a real trajectory from one project:

| Version | Score | Soundness | Presentation | Contribution | Decision |
|---------|-------|-----------|--------------|--------------|----------|
| v1 | 4/10 | 2/4 | 2/4 | 2/4 | Rejected |
| v2 | 5/10 | 2/4 | 3/4 | 2/4 | Rejected |
| v3-v6 | 5/10 | 2/4 | 3/4 | 2/4 | Rejected |
| v7 | 4/10 | 2/4 | 3/4 | 2/4 | Rejected |
| v8 | 5/10 | 3/4 | 3/4 | 2/4 | Borderline |
| v9 | 5/10 | 2/4 | 3/4 | 2/4 | Rejected |
| v10-26 | 5/10 | 3/4 | 3/4 | 2/4 | Rejected |
| v27 | 5/10 | 3/4 | 4/4 | 2/4 | Rejected |
| v28 | 5/10 | 3/4 | 3/4 | 2/4 | Rejected |
| v29 | 6/10 | 3/4 | 3/4 | 2/4 | Accepted |
| v30 | 5/10 | 3/4 | 3/4 | 2/4 | Rejected |
| v31-32 | 5/10 | 3/4 | 3/4 | 2/4 | Rejected |
| v33 | 6/10 | 3/4 | 4/4 | 2/4 | Accepted |

33 versions. The score barely moves. The agent gets stuck in a 5/10 basin for 20+ versions, briefly escapes to 6/10 acceptance at v29, falls back, and has to grind its way out again at v33. The contribution score (2/4) never improves -- the fundamental idea has a ceiling that the reviewer loop alone can't break through.

This suggests the review-rebuttal loop is effective at polishing (presentation eventually hits 4/4) but insufficient for genuine novelty. The agent needs something more -- external idea steering, or the ability to abandon and pivot to a fundamentally different approach.
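
A plateau detector along these lines could trigger such a pivot; a toy sketch over a score trajectory like the one above, where the patience threshold is my choice:

```python
def should_pivot(scores: list[int], patience: int = 5) -> bool:
    """Pivot when the last `patience` reviews produce no new best score."""
    if len(scores) <= patience:
        return False
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) <= best_before
```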

Opus 4.6 vs Gemini 3.1 Pro

I was excited to try Gemini 3.1 Pro, but in an apples-to-apples comparison with Opus 4.6 it is not as strong at long-running tasks. You can see average run time and iteration counts in the dashboard.

The System in Summary

| Metric | Value |
|--------|-------|
| Research ideas run so far | 15+ across 8 domains |
| Reviewer tool calls per review | ~113 |
| Reviewer time per review | 7-12 min (subagent), ~30s (API) |
| Max tool calls observed in a single job | ~4,000 |
| Max tokens consumed in a single job | >100M |
| Max paper versions in a single job | 33 (rebuttal between agent and reviewer) |
| Supported agents | Claude Code (Opus 4.6), Gemini CLI (3.1 Pro), any agent supported by Harbor |
| Container options | CPU-slim (Python + LaTeX), GPU (PyTorch + CUDA) |
| Default job timeout | 4 hours |

What's Next

Possible directions:

  • Idea pivoting: let the agent abandon its current direction when the reviewer score plateaus for N versions -- or simply early-stop
  • External idea injection: add a channel for humans to chat ideas into the otherwise closed loop
  • Cross-agent pollination: let agents read papers from other agents' runs (the GitLab layout already supports this) and incorporate findings
  • Stronger reviewers: scaling the reviewer is likely the bottleneck to this kind of intelligence. We could use agent RL to deliberately increase the reviewer agent's test-time compute

Try it yourself: github.com/findalexli/ai-scientist-v3 | Live dashboard: aiscientist.lishengzhi.com
