Reviewer Two (but it's an OpenEnv)
If you've ever been through the joys and sorrows of peer review, you know the drill. Reviewer One is helpful and constructive. Reviewer Three suggests minor revisions. And then there's Reviewer Two: the one who asks for complete redesigns, questions your fundamental assumptions and somehow finds every possible weakness in your methodology. It's become such a running joke in academia that "Reviewer Two" is practically synonymous with "unnecessarily difficult." (Why yes, I'm usually Reviewer Two.)
But here's the thing: what if we could train AI agents to be the good version of Reviewer Two? Not the obstructionist one, but an AI that works iteratively with other agents, training them to take feedback, learn from mistakes and progressively improve their research plans through guided interaction?
That's exactly what Reviewer Two does. It's a reinforcement learning environment built on Meta's OpenEnv framework and submitted as a Green Agent to Berkeley's AgentBeats evaluation platform.
What Are Green Agents?
Traditional AI benchmarks work like exams: you have a static test, and agents try to pass it. But as discussed in the OpenEnv paper, this is rather unwieldy. And once an agent learns to game the benchmark, we're stuck creating new tests. AgentBeats is an interesting paradigm (wow do I hate that word) developed at Berkeley RDI to solve this problem. In AgentBeats, a Green Agent is itself an automated evaluator that tests Purple Agents (the agents being evaluated). Rather than adapting agents to benchmarks, the benchmark becomes an agent that can interact, provide feedback, and evaluate dynamically.
This evaluation protocol creates a more realistic testing environment. Instead of static question-answer pairs, Purple Agents must engage in multi-turn conversations, respond to feedback and demonstrate they can learn from guidance.
OpenEnv: The Foundation
OpenEnv is Meta's open-source framework for creating standardised RL environments specifically designed for agentic AI systems. Unlike traditional RL environments designed for games or robotics, OpenEnv environments are built for language-based agents performing knowledge work.
What makes OpenEnv particularly powerful for this use case is its dual(ish) interface:
- A standard RL API for training agents with traditional RL algorithms
- A web-based human evaluation interface for interactive testing and debugging
This meant I could build Reviewer Two as both a training environment for automated RL and as a Green Agent for AgentBeats evaluation. Two interfaces, one codebase.
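To make the "two interfaces, one codebase" idea concrete, here is a rough sketch of how a shared environment core can back both front-ends. The class and method names below are illustrative stand-ins, not the real OpenEnv API; the scoring is a placeholder.

```python
# Hypothetical sketch: one environment core, two front-ends.
# Class and method names are illustrative, NOT the real OpenEnv API.

class ReviewerTwoCore:
    """Shared logic: holds the research goal, hidden rubric, and scoring."""

    def __init__(self, goal, rubric):
        self.goal = goal
        self.rubric = rubric
        self.attempts = 0

    def reset(self):
        """Start a fresh episode; the agent only ever sees the goal."""
        self.attempts = 0
        return {"observation": self.goal}

    def step(self, plan_text):
        """One submission attempt; returns a reward and a done flag."""
        self.attempts += 1
        score = self._score(plan_text)
        done = self.attempts >= 10  # assumed attempt budget
        return {"reward": score, "done": done}

    def _score(self, plan_text):
        # Placeholder: the real scoring is described later in the post.
        return min(len(plan_text.split()) / 1500, 1.0)


core = ReviewerTwoCore("Represent Gini deviation...", rubric=[])
obs = core.reset()
result = core.step("a first draft plan with a handful of words")
```

The RL API wraps `reset`/`step` directly, while the web interface just calls the same methods on behalf of a human typing plans into a form.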
The task: research plan generation
The environment is built on Meta's incredible facebook/research-plan-gen dataset, which contains research goals paired with rubric criteria for what constitutes a good research plan. The dataset spans three domains: machine learning, arXiv papers and PubMed publications.
Here's what makes this interesting: the rubric criteria are hidden from the agent. An agent receives a research goal like this:
You are tasked with representing a risk measure known as Gini deviation (GD) in a way that facilitates the derivation of a policy gradient algorithm to minimize it. GD is defined as half the expected absolute difference between two independent copies of a random variable. Your goal is to find an alternative representation of GD that makes it amenable to gradient-based optimization.
Somewhere behind the scenes, there's a rubric:
[
"The alternative representation of GD is based on the signed Choquet integral.",
"The distortion function used in the signed Choquet integral is correctly identified as $h(\\alpha) = -\\alpha^2 + \\alpha$.",
"The quantile representation of GD is derived using the signed Choquet integral.",
"The quantile representation involves integrating the quantile function with respect to a specific measure.",
"The measure used in the quantile representation is related to the distortion function $h$.",
"The quantile representation is used to derive a policy gradient algorithm.",
"The policy gradient algorithm is designed to minimize GD.",
"The algorithm takes into account the gradient of the quantile function with respect to the policy parameters.",
"The assumptions made are realistic in the context of reinforcement learning.",
"The policy gradient algorithm is capable of handling continuous random variables."
]
The goal of the agent is to come up with something that hits as many of these as possible.
Multi-Turn Adaptively Penalised Disclosure Guidance
The core innovation in Reviewer Two is what I call multi-turn adaptively penalised disclosure guidance (I should be banned from naming things). How it works:
Free throws: The agent gets two free attempts to submit a research plan. These provide full evaluation feedback but no penalties and no hints about the hidden rubric.
Progressive hint reveal: After the free attempts, the environment starts revealing hints about the rubric criteria. These aren't the exact criteria (that would make it too easy), but rather vague hints/rephrasings of the rubric generated by the same LLM that assesses the agent's responses.
Compliance penalties: this is the key feature -- once a hint is revealed, the agent is expected to address it in subsequent attempts. Knowledge isn't free. It comes with consequences, and if the agent ignores a revealed hint (as measured by low rubric scores on that criterion), it receives a double penalty. This teaches the agent that there are stakes to the game.
Attempt penalties: Each attempt after the first incurs an exponentially growing penalty. This encourages agents to learn efficiently rather than brute-forcing their way through unlimited attempts.
The result is an environment that rewards both exploration (those initial free attempts) and exploitation of revealed information (addressing hints in later attempts). Agents must learn to balance trying new approaches with incorporating specific feedback.
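The penalty scheme above can be sketched in a few lines. The constants (free attempt count, penalty base, compliance multiplier, "poor score" threshold) are assumptions for illustration, not the environment's actual values.

```python
# Illustrative sketch of the penalty scheme; all constants are assumed.

FREE_ATTEMPTS = 2            # the two "free throws"
PENALTY_BASE = 0.05          # assumed base for the attempt penalty
COMPLIANCE_MULTIPLIER = 2.0  # the "double penalty" for ignored hints


def attempt_penalty(attempt_number):
    """Exponentially growing cost after the free attempts."""
    if attempt_number <= FREE_ATTEMPTS:
        return 0.0
    paid = attempt_number - FREE_ATTEMPTS
    return PENALTY_BASE * (2 ** (paid - 1))


def compliance_penalty(revealed_hints, criterion_scores, threshold=0.5):
    """Double-penalise revealed criteria the agent still scores poorly on."""
    penalty = 0.0
    for criterion in revealed_hints:
        if criterion_scores.get(criterion, 0.0) < threshold:
            penalty += COMPLIANCE_MULTIPLIER * PENALTY_BASE
    return penalty
```

With these assumed constants, attempts one and two cost nothing, attempt three costs 0.05, and attempt five already costs 0.2 before any compliance penalties kick in.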
Technical implementation
The reward calculation uses a weighted combination of three factors:
Rubric Coverage (60%): Evaluated using `google/flan-t5-small` with semantic similarity checks via `sentence-transformers/all-MiniLM-L6-v2`. For each criterion, the model rates how well the plan addresses it (excellent/good/partial/poor), weighted by semantic relevance between the plan and the criterion. In theory, you can use a stronger model, but this has to run on Hugging Face Spaces without ZeroGPU, so potato models it is.

Length Score (20%): Optimal range is 400-1500 words. Too short and the plan lacks detail, too long and it becomes unfocused. This is pure heuristic-based scoring.
Format Score (20%): Checks for structural elements like paragraphs, section headers and bullet points. Good research plans should be organized, not walls of text. I am somewhat on the fence about this, because submissions do not always get parsed properly. The web frontend is particularly bad at this. I will run a few trials and if it does not improve, I will yeet this test with maximum prejudice.
The environment also includes coherence checking to detect and penalize nonsense submissions -- agents that try to game the system by generating keyword soup get flagged immediately.
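The 60/20/20 weighted combination can be sketched as follows. The linear taper outside the optimal length range is a simplified stand-in for the environment's actual heuristics.

```python
# Sketch of the weighted reward combination described above.
# The length taper is an assumed simplification, not the real heuristic.

def length_score(word_count, low=400, high=1500):
    """1.0 inside the optimal range, tapering linearly outside it."""
    if low <= word_count <= high:
        return 1.0
    if word_count < low:
        return max(word_count / low, 0.0)
    return max(1.0 - (word_count - high) / high, 0.0)


def total_reward(rubric_coverage, word_count, format_score):
    """Weighted combination: rubric 60%, length 20%, format 20%."""
    return (0.6 * rubric_coverage
            + 0.2 * length_score(word_count)
            + 0.2 * format_score)
```

For example, a plan covering half the rubric at 800 well-formatted words would score 0.6 · 0.5 + 0.2 · 1.0 + 0.2 · 1.0 = 0.7 under this sketch.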
Why this matters (and the question of AI Co-Scientists)
A human being should be able to change a diaper, plan an invasion, butcher a hog, conn a ship, design a building, write a sonnet, balance accounts, build a wall, set a bone, comfort the dying, take orders, give orders, cooperate, act alone, solve equations, analyze a new problem, pitch manure, program a computer, cook a tasty meal, fight efficiently, die gallantly. Specialization is for insects.
-- Robert A. Heinlein
If we want AI agents that can meaningfully collaborate on research, they need to do more than generate plausible-sounding text. They need to:
- Understand iterative refinement: Real research involves multiple drafts based on feedback. It's a collective endeavour. Reviewer Two is a pain in the rear, but also an indispensable part of Good Science.
- Learn from specific guidance: When a human says "you need to address X," the agent should actually address X.
- Balance exploration and direction: Know when to try new approaches versus when to focus on specific requirements.
- Work under constraints: Real research has resource limits (some would argue it has little else, but I digress). So you need to be able to pivot effectively, not just pirouette around the topic wildly.
Reviewer Two creates an environment where these skills can be learned through reinforcement learning. A Purple Agent that successfully navigates this environment has demonstrated it can:
- Generate coherent research plans from abstract goals
- Incorporate progressive feedback over multiple turns
- Prioritise addressing specific requirements when revealed
- Work efficiently within attempt budgets
Try it yourself
The environment is live on Hugging Face Spaces: chrisvoncsefalvay/reviewer-two-env
My AgentBeats submission is here: https://agentbeats.dev/chrisvoncsefalvay/reviewer-two
The full code is open source: github.com/chrisvoncsefalvay/reviewer-two-env
To build a Purple Agent that competes against Reviewer Two, you'll need to:
- Implement the A2A protocol using the `a2a-sdk`
- Accept text messages containing research goals
- Return comprehensive research plans (400-1500 words, well-structured)
- Handle multi-turn feedback for iterative improvement
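The overall shape of a Purple Agent's episode loop looks roughly like this. In practice `a2a-sdk` handles the wire format; `receive` and `send` below are hypothetical stand-ins for that transport layer, and the planner is a stub where a real agent would call an LLM.

```python
# Sketch of the Purple Agent's side of the interaction. `receive`/`send`
# are hypothetical stand-ins for the A2A transport; the message shapes
# are assumed for illustration.

def make_plan(goal, feedback_history):
    """Stub planner: a real agent would call an LLM here."""
    return (
        f"# Research plan\n\nGoal: {goal}\n\n"
        + "\n".join(f"- Address: {hint}" for hint in feedback_history)
    )


def run_episode(receive, send, max_attempts=10):
    goal = receive()  # first message: the research goal
    hints = []
    for _ in range(max_attempts):
        send(make_plan(goal, hints))
        feedback = receive()  # evaluation feedback, possibly with a hint
        if feedback.get("done"):
            return feedback["reward"]
        if "hint" in feedback:
            hints.append(feedback["hint"])  # revealed hints carry penalties if ignored
```

The key design point is the `hints` list: once the environment reveals a hint, every subsequent plan must incorporate it, or the compliance penalty bites.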
I will be configuring a leaderboard as soon as I figure out, uh, how to.
The interesting challenge isn't just generating a good initial plan -- it's learning to incorporate the progressive hints without ignoring them. Agents that try to just keep generating variations without addressing revealed requirements get penalised heavily.
The road ahead
This is an early experiment in creating RL environments for training collaborative research agents. Some directions I'm excited about:
- Curriculum learning: Start with simpler research domains and progressively increase difficulty
- Multi-agent collaboration: Extend to scenarios where multiple agents work together on research plans
- Real paper writing: Move beyond planning to actual paper generation with proper literature review
The ultimate goal is to turn our agents into diligent graduate students with good research habits and no need for ramen. If that requires us to be Reviewer Two every once in a while... so be it.
References
- The original OpenEnv paper: Yin et al. 2024, "OpenEnv: A Unified Interface for Language Agent Environments"
- OpenEnv tutorial: Colab Notebook
- AgentBeats platform: Berkeley RDI AgentX
- Dataset: `facebook/research-plan-gen`
