PR Review Negotiation Environment β V2 Learnings
This document summarizes the critical challenges identified during V1 development and the specific architectural improvements implemented in V2 to satisfy the Hackathon's OpenEnv Benchmark requirements.
π Critical Learnings & Problems Identified
1. The "Surface vs. Root" Problem
- The Problem: Many AI agents provide generic feedback like "This code has a bug" or "Fix the syntax." In V1, these agents would pass simply by identifying the location of the error.
- V2 Fix: We implemented a Root-Cause Grader. Agents now get penalized (-0.05) if they only describe the symptom and are only rewarded (+0.25) if they identify the underlying logic flaw or security vector.
2. The "Fake Fix" Deception
- The Problem: In iterative negotiation, a developer (author) might push a cosmetic "fix" (e.g., adding
.strip()to a SQL injection) that looks correct but leaves the vulnerability active. - V2 Fix: Added specific
false_fix_keywordsdetection. V2 agents are now tested on their ability to resist "social engineering" from the author and persist with the review until a genuine parameterized fix is provided.
3. Escalate vs. Request Changes
- The Problem: Critical security failures (like hardcoded JWT secrets) are often treated as normal "Request Changes" items, wasting precious time.
- V2 Fix: Implemented Escalation Logic. V2 requires the agent to correctly identify High-Severity breaches and transition the decision to
escalate, rewarding immediate reporting over standard iterative feedback.
4. State Persistence & UI Synchronization
- The Problem: Frontend state inconsistently mapped to the backend's "done" flag, leading to "nothing visible" errors when reviews concluded.
- V2 Fix: Moved to a Stateless Component Architecture driven entirely by the
PRObservationreturned from the API, ensuring the UI 1:1 reflects the environment's internal truth.
π Hackathon Compliance Improvements
- OpenEnv 0.2.1 Schema: V2 uses strictly validated Pydantic models for every endpoint.
- Root-Level
inference.py: Automated benchmarks can now locate and execute the agent's reasoning loop directly. - Docker SDK SDK for Hugging Face: Containerized deployment on port 7860 with an Nginx reverse proxy for high availability.