Graheet committed on
Commit
00cf35f
·
1 Parent(s): 40128b8

Refactor semantic cleaning evaluator and improve API docs UX.


Align environment, scoring, inference, and task metadata with strict step-based semantic actions, add robust uncertainty/hallucination handling, and polish Swagger docs with readable themed cards.

Made-with: Cursor

Files changed (6)
  1. README.md +213 -245
  2. env.py +367 -632
  3. grader.py +128 -563
  4. inference.py +156 -156
  5. server/app.py +145 -4
  6. task.py +28 -5
README.md CHANGED
@@ -1,7 +1,7 @@
1
  ---
2
 
3
  title: Dataops Env
4
- emoji: 📊
5
  colorFrom: indigo
6
  colorTo: gray
7
  sdk: docker
@@ -10,335 +10,303 @@ pinned: false
10
 
11
  ---
12
 
13
- # `dataops-env`
14
 
15
- `dataops-env` is an OpenEnv benchmark for training and evaluating agents on
16
- multi-step data operations work. Instead of a single obvious cleanup action, an
17
- agent must inspect messy business tables, choose corrective actions in the right
18
- order, preserve valid-but-unusual records, and know when the table is truly
19
- ready for validation.
20
 
21
- It exposes the standard `reset()`, `step(action)`, and `state()` interface,
22
- ships with a production-ready FastAPI server and Docker image, and includes a
23
- reproducible OpenAI-compatible baseline runner.
24
 
25
- ## Benchmark Purpose
26
 
27
- Many toy data-cleaning tasks reward shallow pattern matching. Real operational
28
- data work is harder:
29
 
30
- - duplicates may be safe to remove, but conflicting rows require judgment
31
- - some malformed values should be normalized, while unusual valid values must be preserved
32
- - deletion is often the riskiest action, not the default fix
33
- - agents need partial credit for progress, but strong penalties for repeated mistakes
34
 
35
- `dataops-env` is designed to capture those decisions in a compact benchmark that
36
- is still easy to run, validate, and deploy in the OpenEnv ecosystem.
37
 
38
- ## Why It Feels Real
 
 
 
 
 
39
 
40
- The environment models common enterprise data quality problems:
41
 
42
- - exact duplicates in customer or vendor master data
43
- - missing required fields
44
- - inconsistent casing in names and locations
45
- - invalid email and phone formats
46
- - conflicting records for the same real-world entity
47
- - uniqueness constraints such as shared-email violations
48
- - trap rows that look suspicious but are actually valid
49
 
50
- Agents are rewarded for minimal corrective behavior and punished for destructive
51
- or repetitive actions. That makes the environment useful for both learning and
52
- evaluation.
53
 
54
- ## Task Families
55
 
56
- The benchmark keeps the hackathon-friendly `easy`, `medium`, and `hard` task
57
- structure, while each family now contains deterministic variants so policies
58
- cannot overfit a single table.
 
 
59
 
60
- 1. `easy`
61
- Remove duplicates and fill missing required fields.
62
- 2. `medium`
63
- Remove duplicates, normalize casing, and repair invalid emails.
64
- 3. `hard`
65
- Resolve conflicts, enforce unique-email constraints, fix invalid formats,
66
- and preserve valid trap rows.
67
 
68
- Each task definition includes:
69
 
70
- - `goal`
71
- - `difficulty`
72
- - `variant_id`
73
- - `required_columns`
74
- - `hidden_issues`
75
- - `constraints`
76
- - `expected_outcome`
77
- - `max_steps`
78
 
79
- ## Learning Signals
80
 
81
- The environment provides both dense rewards and a deterministic final score:
82
 
83
- - partial rewards for duplicate removal, normalization, and filling missing values
84
- - step costs and no-progress penalties to discourage random actions
85
- - escalating penalties for repeated mistakes
86
- - destructive-action penalties for harmful deletions
87
- - proactive hints after recurring failures
88
- - final task scoring on a strict `0.0` to `1.0` scale
89
 
90
- The final task score and the visible validation failures are produced from the
91
- same explicit rule set, reducing mismatch between what the agent sees and how it
92
- is ultimately judged.
 
 
 
 
 
 
93
 
94
- ## Action Space
95
 
96
- Agents interact with the environment through a typed `Action` object.
97
 
98
- Supported action types:
99
 
100
- - `remove_duplicate`
101
- Remove one row from an exact duplicate group. Can be called with an explicit
102
- `row_id`, or the environment can choose the default duplicate target.
103
- - `fill_missing`
104
- Fill a missing field on a target row. Requires `column` and `value`, and may
105
- also include `row_id`.
106
- - `normalize_column`
107
- Apply deterministic normalization to a supported column such as `name`,
108
- `city`, `email`, or `phone`.
109
- - `delete_row`
110
- Delete a row when doing so resolves a structural issue like a conflict or a
111
- uniqueness violation. Requires `row_id`.
112
- - `validate`
113
- Signal that the agent believes the table is ready for completion.
114
- - `noop`
115
- Explicitly take no action. This is allowed but penalized when unresolved
116
- issues remain.
117
-
118
- Typed action schema:
119
 
120
- - `action_id: Optional[str]`
121
- - `action_type: Literal["remove_duplicate", "fill_missing", "normalize_column", "delete_row", "validate", "noop"]`
122
- - `column: Optional[str]`
123
- - `row_id: Optional[int]`
124
- - `value: Optional[str]`
125
-
126
- Validation rules:
127
-
128
- - `delete_row` requires `row_id`
129
- - `normalize_column` requires `column`
130
- - `fill_missing` requires `column` and `value`
131
-
132
- Example actions:
133
 
134
- ```json
135
- {"action_id":"step-001","action_type":"remove_duplicate","row_id":33}
136
- {"action_id":"step-002","action_type":"fill_missing","row_id":35,"column":"email","value":"peak.systems@example.com"}
137
- {"action_id":"step-003","action_type":"normalize_column","column":"email"}
138
- {"action_id":"step-004","action_type":"validate"}
139
- ```
 
140
 
141
- ## Observation Space
142
 
143
- The environment returns a typed `Observation` object after `reset()` and each
144
- call to `step()`.
145
 
146
- Observation fields:
147
 
148
- - `goal: str`
149
- Natural-language description of what the agent should accomplish.
150
- - `table: List[Dict[str, Any]]`
151
- Current JSON-serializable table snapshot.
152
- - `issues: List[str]`
153
- Human-readable unresolved issues and validation failures.
154
- - `history: List[str]`
155
- Ordered record of previous actions/events in the current episode.
156
- - `mistakes: Dict[str, int]`
157
- Counts of repeated mistake categories tracked during the episode.
158
- - `hints: List[str]`
159
- Proactive or reactive guidance derived from issue state and prior failures.
160
- - `progress: float`
161
- Normalized progress estimate in `[0.0, 1.0]`.
162
- - `steps_remaining: int`
163
- Number of remaining actions before the episode terminates.
164
 
165
- Example observation shape:
 
 
 
166
 
167
- ```json
168
- {
169
- "goal": "Normalize the dataset by fixing casing, removing duplicates, and correcting invalid email formats.",
170
- "table": [
171
- {"row_id": 10, "customer_id": "C100", "name": "jane miller", "city": "new york", "email": "jane.miller@example.com"}
172
- ],
173
- "issues": [
174
- "Rows 11 and 13 are duplicates and only one should remain."
175
- ],
176
- "history": [],
177
- "mistakes": {},
178
- "hints": [],
179
- "progress": 0.0,
180
- "steps_remaining": 9
181
- }
182
- ```
183
 
184
- ## Expected Agent Behavior
 
 
 
185
 
186
- A strong agent should behave roughly like this:
 
 
187
 
188
- 1. inspect the visible table and unresolved issues
189
- 2. remove safe duplicates first
190
- 3. repair missing or malformed values without over-editing valid rows
191
- 4. resolve structural conflicts carefully, especially in hard tasks
192
- 5. validate only when the remaining issue list is empty
193
 
194
- Example successful baseline trace:
 
 
195
 
196
  ```text
197
- [START] task=medium env=dataops-env model=your-model
198
- [STEP] step=1 action=remove_duplicate(row_id=13) reward=0.37 done=false error=null
199
- [STEP] step=2 action=normalize_column(column='email') reward=0.27 done=false error=null
200
- [STEP] step=3 action=normalize_column(column='name') reward=0.24 done=false error=null
201
- [STEP] step=4 action=normalize_column(column='city') reward=0.44 done=true error=null
202
- [END] success=true steps=4 rewards=0.37,0.27,0.24,0.44
203
  ```
204
 
205
- ## Project Layout
206
 
207
- - `env.py`: core `DataOpsEnv` implementation
208
- - `task.py`: task families and deterministic variants
209
- - `models.py`: typed `Action`, `Observation`, and `Reward` contracts
210
- - `grader.py`: dense rewards, explicit validation checks, and final task scoring
211
- - `server/app.py`: FastAPI runtime API
212
- - `inference.py`: hybrid heuristic/model baseline runner
213
- - `openenv.yaml`: OpenEnv metadata and task registration
214
- - `pyproject.toml`: package metadata and server script entry point
215
- - `Dockerfile`: production container image
216
 
217
- ## Local Setup
 
 
 
 
218
 
219
- ```bash
220
- pip install -r requirements.txt
221
- openenv validate
222
- ```
223
 
224
- Run the FastAPI server:
 
225
 
226
- ```bash
227
- python -m server.app
228
- ```
229
 
230
- By default, the local server runs on port `8000`.
231
 
232
- Or use the packaged entry point:
 
 
233
 
234
- ```bash
235
- server
236
  ```
237
 
238
- ## API
 
 
239
 
240
- Health check:
 
 
241
 
242
  ```bash
243
- curl http://localhost:8000/health
244
  ```
245
 
246
- Create a session with an optional seed and task selection:
 
 
247
 
248
  ```bash
249
- curl -X POST http://localhost:8000/reset \
250
- -H "Content-Type: application/json" \
251
- -d '{"seed": 0, "task_name": "easy"}'
252
  ```
253
 
254
- Step the environment:
 
 
255
 
256
  ```bash
257
- curl -X POST "http://localhost:8000/step" \
258
- -H "Content-Type: application/json" \
259
- -d '{"action_id":"step-001","action_type":"validate"}'
260
  ```
261
 
262
- Read internal state:
263
 
264
- ```bash
265
- curl "http://localhost:8000/state"
 
 
 
 
266
  ```
267
 
268
- ## Baseline Inference
 
 
 
 
 
 
 
 
 
 
 
269
 
270
- The baseline runner now combines deterministic local planning with optional
271
- model arbitration. The local planner proposes ranked candidate actions from the
272
- visible table state, and the model is constrained to choose only from those
273
- candidates. This avoids many common failure modes such as invalid actions,
274
- repeated no-op loops, and reckless deletion choices.
275
 
276
- Run it with an OpenAI-compatible endpoint:
277
 
278
  ```bash
279
- set HF_TOKEN=your_token
280
- set MODEL_NAME=your_model
281
- set API_BASE_URL=https://router.huggingface.co/v1
282
- python inference.py
283
  ```
284
 
285
- Key properties:
286
 
287
- - strict `[START]`, `[STEP]`, and `[END]` output formatting
288
- - fixed task ordering for reproducibility
289
- - retry logic for invalid or blocked model suggestions
290
- - strong heuristic fallback when the model is unavailable
291
- - action filtering based on prior no-progress or errorful behavior
292
 
293
- ## Docker
 
 
 
 
294
 
295
- Build:
296
 
297
- ```bash
298
- docker build -t dataops-env .
299
- ```
300
 
301
- Run locally:
 
 
 
 
302
 
303
- ```bash
304
- docker run -p 8000:8000 dataops-env
305
- ```
306
 
307
- ## Hugging Face Spaces Notes
308
 
309
- For Hugging Face `Docker` Spaces, the container should normally listen on port
310
- `7860`, or the Space must be explicitly configured to expect a different
311
- internal port.
 
 
312
 
313
- If you keep the current container on port `8000`, make sure your Space is
314
- configured with:
315
 
316
- ```yaml
317
- app_port: 8000
318
- ```
319
 
320
- If you want the simplest Hugging Face Spaces setup, change the container to use
321
- port `7860` instead:
322
 
323
- ```dockerfile
324
- EXPOSE 7860
325
- CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
326
- ```
327
 
328
- Then local Docker testing would become:
 
 
 
 
329
 
330
- ```bash
331
- docker run -p 7860:7860 dataops-env
332
- curl http://localhost:7860/health
333
- ```
334
 
335
- ## Submission Notes
336
 
337
- - `openenv validate` passes
338
- - the server and Docker image run successfully
339
- - the packaged benchmark supports multi-mode deployment
340
- - the default baseline now completes the public task families deterministically
341
 
342
- Leaderboard performance will still depend on the quality of the external model,
343
- but the repository is now structured and documented like a serious benchmark
344
- submission rather than a starter scaffold.
 
1
  ---
2
 
3
  title: Dataops Env
4
+ emoji: 🧼
5
  colorFrom: indigo
6
  colorTo: gray
7
  sdk: docker
 
10
 
11
  ---
12
 
13
+ # ✨ DataOps Gym
14
 
15
+ ### The First Hallucination-Aware Data Cleaning Environment
 
 
 
 
16
 
17
+ > Most systems ask: *“Did you fix the data?”*
18
+ > We ask: *“Did you think before fixing?”*
 
19
 
20
+ ---
21
 
22
+ # 🚨 THE PROBLEM
 
23
 
24
+ **By oft-cited estimates, 60–80% of a data scientist’s time is spent cleaning data.**
 
 
 
25
 
26
+ But current systems:
 
27
 
28
+ * blindly fix values
29
+ * hallucinate corrections
30
+ * ignore contradictions
31
+ * break real-world logic
32
+
33
+ ---
34
 
35
+ > 💡 **Wrong data is worse than missing data.**
36
 
37
+ ---
 
 
 
 
 
 
38
 
39
+ # 🧠 WHAT THIS PROJECT DOES
 
 
40
 
41
+ DataOps Gym is a **step-based OpenEnv environment** where an AI agent:
42
 
43
+ 1. Detects semantic inconsistencies
44
+ 2. Fixes data **only when confident**
45
+ 3. Outputs **"cannot determine"** when uncertain
46
+ 4. Maintains **cross-record consistency**
47
+ 5. Learns through **reward-based feedback**
48
 
49
+ ---
 
 
 
 
 
 
50
 
51
+ Each step teaches the agent:
52
 
53
+ * when to fix ✅
54
+ * when to abstain ⚠️
55
+ * when to say “I don’t know” 🧠
 
 
 
 
 
56
 
57
+ ---
58
 
59
+ # 🧩 ACTION SPACE
60
 
61
+ All actions must follow strict JSON format:
 
 
 
 
 
62
 
63
+ ```json
64
+ {
65
+ "action_type": "detect_issue | fix_value | cannot_determine | skip",
66
+ "record_id": "string",
67
+ "field": "string",
68
+ "value": "string",
69
+ "confidence": 0.0
70
+ }
71
+ ```
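As a sketch of how a client might construct and sanity-check one of these actions before sending it (the field names follow the schema above, but this helper is illustrative and not part of the environment):

```python
import json

# Hypothetical client-side helper; validates against the schema shown above.
ALLOWED_ACTIONS = {"detect_issue", "fix_value", "cannot_determine", "skip"}

def make_action(action_type, record_id, field, value="", confidence=0.0):
    if action_type not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action_type: {action_type}")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0.0, 1.0]")
    return json.dumps({
        "action_type": action_type,
        "record_id": record_id,
        "field": field,
        "value": value,
        "confidence": confidence,
    })

payload = make_action("cannot_determine", "T3", "email", confidence=0.4)
```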
72
 
73
+ ---
74
 
75
+ ## 🔥 Key Innovation
76
 
77
+ 👉 `cannot_determine` is a **first-class action**
78
 
79
+ ---
80
 
81
+ # 🧠 WHY THIS IS DIFFERENT
 
 
 
 
 
 
 
 
 
 
 
 
82
 
83
+ | Traditional Systems | DataOps Gym |
84
+ | ------------------- | ---------------------- |
85
+ | Fix everything | Fix only when safe |
86
+ | Always answer | Can abstain |
87
+ | Ignore confidence | Confidence-aware |
88
+ | Single-row logic | Cross-record reasoning |
89
+ | Output-based | Behavior-based |
90
 
91
+ ---
92
 
93
+ # 💰 REWARD SYSTEM
 
94
 
95
+ ---
96
 
97
+ ## Rewards
 
98
 
99
+ * correct reasoning
100
+ * safe corrections
101
+ * correct uncertainty
102
+ * consistency across records
103
 
104
+ ---
105
+
106
+ ## Penalties
 
107
 
108
+ * hallucinated fixes 🚫
109
+ * overconfidence 🚫
110
+ * over-correction 🚫
111
+ * inconsistency 🚫
112
 
113
+ ---
114
+
115
+ ### 🔥 Core Principle
116
 
117
+ > **“Better to not fix than to fix incorrectly.”**
 
 
 
 
118
 
119
+ ---
120
+
121
+ # 📊 FINAL SCORING (0–1)
122
 
123
  ```text
124
+ task_score =
125
+ 0.5 * normalized_record_score
126
+ + 0.2 * (1 - hallucination_rate)
127
+ + 0.15 * uncertainty_accuracy
128
+ + 0.15 * consistency_score
 
129
  ```
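The weights above sum to 1.0, so the blend stays on the 0–1 scale as long as each component does. A minimal transcription of the same formula (assuming all four inputs are already normalized to [0, 1]):

```python
def task_score(normalized_record_score, hallucination_rate,
               uncertainty_accuracy, consistency_score):
    # Direct transcription of the weighted blend above; all inputs in [0.0, 1.0].
    return (0.5 * normalized_record_score
            + 0.2 * (1 - hallucination_rate)
            + 0.15 * uncertainty_accuracy
            + 0.15 * consistency_score)
```

A policy that hallucinates on every fix forfeits the full 0.2 hallucination term, so even an otherwise perfect episode caps out at 0.8.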
130
 
131
+ ---
132
 
133
+ # 📉 METRICS
 
 
 
 
 
 
 
 
134
 
135
+ | Metric | Description |
136
+ | ----------------------- | ---------------------- |
137
+ | 🧠 Hallucination Rate | Wrong invented fixes |
138
+ | ⚖️ Uncertainty Accuracy | Correct abstentions |
139
+ | 🔗 Consistency Score | Cross-record reasoning |
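The first two metrics fall out of a per-step log directly; a hedged sketch (the per-step record layout here is illustrative, not the environment's actual trace schema):

```python
def episode_metrics(steps):
    # Each step is a dict with "action", "was_correct", and "was_answerable"
    # (an assumed logging schema, for illustration only).
    fixes = [s for s in steps if s["action"] == "fix_value"]
    abstains = [s for s in steps if s["action"] == "cannot_determine"]
    hallucination_rate = (
        sum(1 for s in fixes if not s["was_correct"]) / len(fixes)
        if fixes else 0.0
    )
    # An abstention counts as correct when the field truly was not determinable.
    uncertainty_accuracy = (
        sum(1 for s in abstains if not s["was_answerable"]) / len(abstains)
        if abstains else 0.0
    )
    return hallucination_rate, uncertainty_accuracy
```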
140
 
141
+ ---
 
 
 
142
 
143
+ # 🧪 TASKS
144
+ > ⚡ Each task is carefully designed to evaluate **reasoning, restraint, and reliability** — not just accuracy.
145
 
146
+ ---
 
 
147
 
148
+ ## 🟢 EASY — *Foundational Data Hygiene*
149
 
150
+ <p align="left">
151
+ <b>“Can the agent fix obvious issues without breaking anything?”</b>
152
+ </p>
153
 
154
+ * Basic inconsistencies
155
+ * Missing values
156
+ * Duplicate records
157
+
158
+ ---
159
+
160
+ ## 🟡 MEDIUM — *Contextual Reasoning & Ambiguity*
161
+
162
+ <p align="left">
163
+ <b>“Can the agent reason across records and handle uncertainty?”</b>
164
+ </p>
165
+
166
+ * Cross-table inconsistencies
167
+ * Identity ambiguity
168
+ * Data normalization
169
+
170
+ ---
171
+
172
+ ## 🔴 HARD — *Real-World Data Chaos*
173
+
174
+ <p align="left">
175
+ <b>“Can the agent survive contradictions, missing context, and unsolvable data?”</b>
176
+ </p>
177
+
178
+ * Multi-table conflicts
179
+ * Temporal inconsistencies
180
+ * Non-fixable contradictions
181
+
182
+ ---
183
+
184
+ > 🔥 **Difficulty is not about complexity — it's about uncertainty.**
185
+
186
+ | Level | Focus |
187
+ |--------|------|
188
+ | 🟢 Easy | Precision on clear signals |
189
+ | 🟡 Medium | Reasoning under ambiguity |
190
+ | 🔴 Hard | Decision-making under uncertainty |
191
+
192
+ ---
193
+
194
+ # 🧪 EXAMPLE FAILURE LOG
195
+
196
+ ```json
197
+ {
198
+ "record_id": "T3",
199
+ "error_type": "hallucination",
200
+ "details": "assigned value without evidence",
201
+ "confidence": 0.9
202
+ }
203
  ```
204
 
205
+ ---
206
+
207
+ # 🚀 QUICK START
208
 
209
+ ---
210
+
211
+ ## Install
212
 
213
  ```bash
214
+ pip install -r requirements.txt
215
  ```
216
 
217
+ ---
218
+
219
+ ## Run Server
220
 
221
  ```bash
222
+ python -m server.app
 
 
223
  ```
224
 
225
+ ---
226
+
227
+ ## Run Baseline
228
 
229
  ```bash
230
+ python inference.py
 
 
231
  ```
232
 
233
+ ---
234
 
235
+ ## Example Output
236
+
237
+ ```text
238
+ easy → 0.73
239
+ medium → 0.55
240
+ hard → 0.38
241
  ```
242
 
243
+ > ⚠️ Replace with your actual results
244
+
245
+ ---
246
+
247
+ # 🌐 API ENDPOINTS
248
+
249
+ | Endpoint | Description |
250
+ | --------- | ----------------- |
251
+ | `/reset` | Start new episode |
252
+ | `/step` | Take action |
253
+ | `/state` | Get current state |
254
+ | `/health` | Health check |
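A minimal client loop over these endpoints, using only the standard library (the port follows the Docker section below; the request and response field names are assumptions based on this README, not a verified contract):

```python
import json
import urllib.request

BASE = "http://localhost:7860"  # assumed port, matching the Docker run command

def post_json(path, payload):
    # Small JSON-over-HTTP helper for /reset and /step.
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_one_abstention(task_name="easy"):
    # Start an episode, then abstain once; field names are assumptions.
    observation = post_json("/reset", {"task_name": task_name})
    step = post_json("/step", {
        "action_type": "cannot_determine",
        "record_id": "T1",
        "field": "email",
        "value": "",
        "confidence": 0.3,
    })
    return observation, step
```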
255
 
256
+ ---
 
 
 
 
257
 
258
+ # 🐳 DOCKER
259
 
260
  ```bash
261
+ docker build -t dataops-gym .
262
+ docker run -p 7860:7860 dataops-gym
 
 
263
  ```
264
 
265
+ ---
266
 
267
+ # 🧠 DESIGN PRINCIPLES
 
 
 
 
268
 
269
+ 1. Prefer uncertainty over hallucination
270
+ 2. Penalize confident mistakes
271
+ 3. Avoid over-correction
272
+ 4. Enforce cross-record consistency
273
+ 5. Reward safe reasoning
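Principles 1 and 2 are naturally expressed as confidence-weighted reward shaping; a sketch with illustrative constants (not the environment's actual reward values):

```python
def shaped_reward(was_correct, confidence, base_reward=1.0):
    # Correct fixes earn more when stated confidently; wrong fixes cost
    # more when stated confidently, so overconfidence is the worst case.
    if was_correct:
        return base_reward * (0.5 + 0.5 * confidence)
    return -base_reward * confidence
```

Under this shaping, a confidently wrong fix (-1.0) is strictly worse than a hesitant wrong one, which is exactly the "penalize confident mistakes" principle.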
274
 
275
+ ---
276
 
277
+ # 🏆 BENCHMARK (EXPECTED)
 
 
278
 
279
+ | Task | Score |
280
+ | ------ | ----------- |
281
+ | Easy | 0.65 – 0.85 |
282
+ | Medium | 0.45 – 0.65 |
283
+ | Hard | 0.05 – 0.40 |
284
 
285
+ ---
 
 
286
 
287
+ # 📌 USE CASES
288
 
289
+ * AI data pipelines
290
+ * automated ETL validation
291
+ * financial data cleaning
292
+ * healthcare record validation
293
+ * LLM safety benchmarking
294
 
295
+ ---
 
296
 
297
+ # 🏁 FINAL TAKEAWAY
 
 
298
 
299
+ > 🧠 **The future of AI is not about answering everything.**
300
+ > **It’s about knowing when NOT to answer.**
301
 
302
+ ---
 
 
 
303
 
304
+ # 🔥 TAGLINE
305
+
306
+ > **“We built a system that teaches AI when NOT to change data.”**
307
+
308
+ ---
309
env.py CHANGED
@@ -1,36 +1,20 @@
1
- """OpenEnv environment entrypoint for ``dataops-gym``.
2
-
3
- This module is responsible for declaring top-level environment metadata,
4
- configuration wiring, and lifecycle integration points for the OpenEnv runtime.
5
- """
6
 
7
  from __future__ import annotations
8
 
9
  from copy import deepcopy
10
  import random
11
- import re
12
- from typing import Any, Dict, Iterable, List, Mapping, MutableMapping, Optional, Tuple
13
 
14
- from grader import grade_step_details, grade_task_result, task_failure_messages
15
  from models import Action, Observation
16
- from task import (
17
- HiddenIssue,
18
- TaskDefinition,
19
- easy_cleaning_task,
20
- hard_conflict_resolution_task,
21
- medium_normalization_task,
22
- )
23
-
24
-
25
- EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
26
 
27
 
28
  class DataOpsEnv:
29
- """Deterministic multi-step data-cleaning environment for OpenEnv."""
30
 
31
  def __init__(self, seed: int = 0, task_name: Optional[str] = None) -> None:
32
- """Initialize the environment with deterministic task sampling."""
33
-
34
  self._seed = seed
35
  self._rng = random.Random(seed)
36
  self._task_registry: List[Tuple[str, Any]] = [
@@ -39,127 +23,109 @@ class DataOpsEnv:
39
  ("hard", hard_conflict_resolution_task),
40
  ]
41
  self._fixed_task_name = task_name
42
- self._global_mistake_memory: Dict[str, int] = {}
43
  self._state_data: Dict[str, Any] = {}
44
 
45
  def reset(self) -> Observation:
46
- """Load a random task, initialize episode state, and return an observation."""
47
-
48
  task_name, task_factory = self._select_task_factory()
49
  variant_count = max(1, int(getattr(task_factory, "variant_count", 1)))
50
- variant_index = self._rng.randrange(variant_count)
51
- task_definition = deepcopy(task_factory(variant=variant_index))
52
  initial_table = deepcopy(task_definition["initial_table"])
53
- initial_table_by_row_id = self._table_by_row_id(initial_table)
54
-
55
  self._state_data = {
56
  "seed": self._seed,
57
  "task_name": task_name,
58
- "task_variant": task_definition.get("variant_id", f"{task_name}_variant_{variant_index}"),
59
  "task": task_definition,
60
- "table": initial_table,
61
- "history": [],
62
- "mistakes": {},
63
- "mistake_memory": [],
64
- "hints": [],
 
 
65
  "steps_taken": 0,
66
  "steps_remaining": task_definition["max_steps"],
67
  "done": False,
68
- "last_reward_components": {},
69
- "last_info": {},
70
- "last_task_score": 0.0,
71
- "initial_issue_count": 1,
72
- "initial_table_by_row_id": initial_table_by_row_id,
 
 
 
 
 
 
 
 
 
73
  }
74
- initial_issue_count = len(self._current_issue_messages(initial_table, task_definition))
75
- self._state_data["initial_issue_count"] = max(1, initial_issue_count)
76
  return self._build_observation()
77
 
78
- def step(
79
- self, action: Action | Mapping[str, Any]
80
- ) -> Tuple[Observation, float, bool, Dict[str, Any]]:
81
- """Apply one action, score it, update state, and return a gym-style step tuple."""
82
-
83
  if not self._state_data:
84
  raise RuntimeError("Environment must be reset before calling step().")
85
- if self._state_data.get("done", False):
86
  raise RuntimeError("Episode is finished. Call reset() before stepping again.")
87
 
88
- parsed_action, action_error = self._coerce_action(action)
89
- task_definition: TaskDefinition = self._state_data["task"]
90
- table_before = deepcopy(self._state_data["table"])
91
- issues_before = self._current_issue_messages(table_before, task_definition)
92
-
93
- result: Dict[str, Any] = {
94
- "mistake_keys": [],
95
- "error_type": "general",
96
- }
97
-
98
- if action_error is not None:
99
- parsed_action = Action(action_type="noop")
100
- result["noop"] = True
101
- result["unnecessary_action"] = True
102
- result["error_type"] = "invalid_action"
103
- result["mistake_keys"].append("invalid_action:general")
104
- history_entry = f"invalid_action({action_error})"
105
- else:
106
- history_entry = self._apply_action(parsed_action, result)
107
 
108
- self._state_data["history"].append(history_entry)
109
  self._state_data["steps_taken"] += 1
110
  self._state_data["steps_remaining"] = max(
111
- 0, task_definition["max_steps"] - self._state_data["steps_taken"]
112
  )
113
 
114
- table_after = deepcopy(self._state_data["table"])
115
- issues_after = self._current_issue_messages(table_after, task_definition)
116
- self._populate_result_signals(
117
- parsed_action,
118
- table_before,
119
- table_after,
120
- issues_before,
121
- issues_after,
122
- result,
123
  )
124
-
125
- reward, components = grade_step_details(
126
  self._state_data, parsed_action.model_dump(), result
127
  )
128
- self._record_mistake_memory(parsed_action, result)
129
- self._update_hints(result, issues_after)
130
-
131
- done = not issues_after or self._state_data["steps_remaining"] <= 0
132
- self._state_data["done"] = done
 
 
 
 
 
 
 
 
 
 
 
 
133
  task_score = grade_task_result(
134
- task_definition, self._state_data["table"], self._state_data
135
  )
136
- self._state_data["last_task_score"] = task_score
137
 
138
- observation = self._build_observation()
 
139
  info = {
140
- "task_name": self._state_data["task_name"],
141
- "task_variant": self._state_data["task_variant"],
142
- "difficulty": task_definition["difficulty"],
143
- "reward_components": components,
144
- "mistakes": deepcopy(self._state_data["mistakes"]),
145
- "hints": list(self._state_data["hints"]),
146
- "issues_remaining": len(issues_after),
147
- "done_reason": "resolved" if not issues_after else "max_steps" if done else None,
148
- "task_score": task_score,
149
- "result": deepcopy(result),
 
 
150
  }
151
- self._state_data["last_reward_components"] = deepcopy(components)
152
- self._state_data["last_info"] = deepcopy(info)
153
- return observation, reward, done, info
154
 
155
  def state(self) -> Dict[str, Any]:
156
- """Return a deep copy of the internal environment state."""
157
-
158
  return deepcopy(self._state_data)
159
 
160
  def close(self) -> None:
161
- """Release environment state for callers using explicit lifecycle cleanup."""
162
-
163
  self._state_data = {}
164
 
165
  def _select_task_factory(self) -> Tuple[str, Any]:
@@ -174,578 +140,347 @@ class DataOpsEnv:
174
 
175
  raise ValueError(f"Unknown task_name: {self._fixed_task_name}")
176
 
177
- def _coerce_action(
178
- self, action: Action | Mapping[str, Any]
179
- ) -> Tuple[Optional[Action], Optional[str]]:
180
- """Convert user input into an ``Action`` model without raising outward."""
181
-
182
- if isinstance(action, Action):
183
- return action, None
184
-
185
- try:
186
- return Action(**dict(action)), None
187
- except Exception as exc: # pragma: no cover - defensive runtime boundary
188
- return None, str(exc)
189
-
190
- def _apply_action(self, action: Action, result: MutableMapping[str, Any]) -> str:
191
- """Apply a single action to the current table and capture side effects."""
192
-
193
- if action.action_type == "noop":
194
- result["noop"] = True
195
- result["mistake_keys"].append(f"{action.action_type}:noop")
196
- return self._format_history(action)
197
-
198
- if action.action_type == "remove_duplicate":
199
- self._remove_duplicate(action, result)
200
- return self._format_history(action)
201
-
202
- if action.action_type == "delete_row":
203
- self._delete_row(action, result)
204
- return self._format_history(action)
205
-
206
- if action.action_type == "fill_missing":
207
- self._fill_missing(action, result)
208
- return self._format_history(action)
209
-
210
- if action.action_type == "normalize_column":
211
- self._normalize_column(action, result)
212
- return self._format_history(action)
213
-
214
- if action.action_type == "validate":
215
- return self._format_history(action)
216
-
217
- result["unnecessary_action"] = True
218
- result["error_type"] = "unsupported_action"
219
- result["mistake_keys"].append(f"{action.action_type}:unsupported_action")
220
- return self._format_history(action)
 
221
 
222
- def _remove_duplicate(
223
- self, action: Action, result: MutableMapping[str, Any]
224
  ) -> None:
225
- """Remove a duplicate row when the target belongs to a duplicate issue."""
226
-
227
- duplicate_groups = [
228
- issue
229
- for issue in self._state_data["task"]["hidden_issues"]
230
- if issue["type"] == "duplicate" and self._is_issue_unresolved(issue, self._state_data["table"])
231
- ]
232
- if not duplicate_groups:
233
- result["unnecessary_action"] = True
234
- result["error_type"] = "no_duplicate_available"
235
  return
236
 
237
- candidate_rows = set(duplicate_groups[0].get("rows", []))
238
- target_row_id = action.row_id or max(candidate_rows)
239
-
240
- if target_row_id not in candidate_rows:
241
- result["unnecessary_action"] = True
242
- result["error_type"] = "invalid_duplicate_target"
243
- return
244
-
245
- removed = self._remove_row_by_id(target_row_id)
246
- if not removed:
247
- result["unnecessary_action"] = True
248
- result["error_type"] = "missing_row"
249
-
250
- def _delete_row(self, action: Action, result: MutableMapping[str, Any]) -> None:
251
- """Delete a row and mark destructive behavior when the target is unsafe."""
252
 
253
- target_row = self._get_row_by_id(action.row_id)
254
- if target_row is None:
255
- result["unnecessary_action"] = True
256
- result["error_type"] = "missing_row"
257
  return
258
-
259
- if self._row_is_protected(action.row_id):
260
- result["wrong_deletion"] = True
261
- result["destructive_action"] = True
262
- result["error_type"] = "protected_row"
263
- result["mistake_keys"].append(f"{action.action_type}:protected_row")
264
- elif not self._row_belongs_to_removable_issue(action.row_id):
265
- result["wrong_deletion"] = True
266
- result["destructive_action"] = True
267
- result["error_type"] = "wrong_deletion"
268
- result["mistake_keys"].append(f"{action.action_type}:wrong_deletion")
269
-
270
- self._remove_row_by_id(action.row_id)
271
-
272
- def _fill_missing(self, action: Action, result: MutableMapping[str, Any]) -> None:
273
- """Fill a missing field on the target row or the first matching missing cell."""
274
-
275
- target_row = self._resolve_missing_target_row(action.row_id, action.column)
276
- if target_row is None or action.column is None:
277
- result["unnecessary_action"] = True
278
- result["error_type"] = "missing_target"
279
  return
280
 
281
- if not self._is_missing_value(target_row.get(action.column)):
282
- result["unnecessary_action"] = True
283
- result["error_type"] = "cell_not_missing"
284
  return
285
-
286
- target_row[action.column] = action.value
287
-
288
- def _normalize_column(self, action: Action, result: MutableMapping[str, Any]) -> None:
289
- """Normalize a supported column using deterministic, minimal edits."""
290
-
291
- if action.column is None:
292
- result["unnecessary_action"] = True
293
- result["error_type"] = "missing_column"
294
  return
 
 
 
 
 
 
 
 
295
 
296
- changed_rows = 0
297
- for row in self._state_data["table"]:
298
- original = row.get(action.column)
299
- normalized = self._normalized_value(action.column, original)
300
- if normalized is None or normalized == original:
301
- continue
302
-
303
- # Keep trap rows stable unless the value is actually invalid.
304
- if self._row_is_protected(row.get("row_id")) and self._value_is_valid(
305
- action.column, original
 
 
 
 
306
  ):
307
- continue
308
-
309
- row[action.column] = normalized
310
- changed_rows += 1
311
-
312
- if changed_rows == 0:
313
- result["unnecessary_action"] = True
314
- result["error_type"] = "no_normalization_needed"
315
 
316
- def _populate_result_signals(
317
  self,
318
- action: Action,
319
- table_before: List[Dict[str, Any]],
320
- table_after: List[Dict[str, Any]],
321
- issues_before: List[str],
322
- issues_after: List[str],
323
- result: MutableMapping[str, Any],
324
- ) -> None:
325
- """Derive reward signals from before/after state transitions."""
326
-
327
- task_definition: TaskDefinition = self._state_data["task"]
328
- hidden_before = self._issue_type_counts(table_before, task_definition)
329
- hidden_after = self._issue_type_counts(table_after, task_definition)
330
-
331
- if hidden_after.get("duplicate", 0) < hidden_before.get("duplicate", 0):
332
- result["correct_duplicate_removal"] = True
333
-
334
- if hidden_after.get("missing_value", 0) < hidden_before.get("missing_value", 0):
335
- result["fixed_missing_value"] = True
336
-
337
- normalization_before = hidden_before.get("inconsistent_casing", 0) + hidden_before.get(
338
- "invalid_format", 0
339
- )
340
- normalization_after = hidden_after.get("inconsistent_casing", 0) + hidden_after.get(
341
- "invalid_format", 0
342
- )
343
- if (
344
- action.action_type == "normalize_column"
345
- and normalization_after < normalization_before
346
- ):
347
- result["correct_normalization"] = True
348
-
349
- if action.action_type == "validate" and not issues_after:
350
- result["validation_success"] = True
351
- result["task_completed"] = True
352
-
353
- if not issues_after:
354
- result["task_completed"] = True
355
-
356
- issue_delta = max(0, len(issues_before) - len(issues_after))
357
- result["progress_delta"] = round(
358
- issue_delta / float(self._state_data["initial_issue_count"]),
359
- 4,
360
- )
361
-
362
- if issue_delta > 0 and any(self._state_data["mistakes"].values()):
363
- result["corrected_previous_mistake"] = True
364
-
365
- if action.action_type == "noop" and issues_after:
366
- result["unnecessary_action"] = True
367
- result["error_type"] = result.get("error_type", "noop")
368
-
369
- def _build_observation(self) -> Observation:
370
- """Construct the typed observation returned to callers."""
371
-
372
- task_definition: TaskDefinition = self._state_data["task"]
373
- issue_messages = self._current_issue_messages(self._state_data["table"], task_definition)
374
- progress = self._compute_progress(issue_messages)
375
- return Observation(
376
- goal=task_definition["goal"],
377
- table=deepcopy(self._state_data["table"]),
378
- issues=issue_messages,
379
- history=list(self._state_data["history"]),
380
- mistakes=deepcopy(self._state_data["mistakes"]),
381
- hints=list(self._state_data["hints"]),
382
- progress=progress,
383
- steps_remaining=int(self._state_data["steps_remaining"]),
384
- )
385
-
386
- def _compute_progress(self, issue_messages: List[str]) -> float:
387
- """Estimate progress from the current unresolved issue count."""
388
-
389
- baseline = float(self._state_data["initial_issue_count"])
390
- remaining = min(len(issue_messages), self._state_data["initial_issue_count"])
391
- resolved_fraction = 1.0 - (remaining / baseline)
392
- return round(max(0.0, min(1.0, resolved_fraction)), 4)
393
-
394
- def _current_issue_messages(
395
- self, table: List[Dict[str, Any]], task_definition: TaskDefinition
396
- ) -> List[str]:
397
- """Return unresolved issue descriptions plus validation-rule failures."""
398
-
399
- messages: List[str] = []
400
- for issue in task_definition["hidden_issues"]:
401
- if self._is_issue_unresolved(issue, table):
402
- description = issue.get("description")
403
- if description:
404
- messages.append(description)
405
-
406
- messages.extend(self._validation_failures(table, task_definition))
407
- return messages
408
-
409
- def _validation_failures(
410
- self, table: List[Dict[str, Any]], task_definition: TaskDefinition
411
- ) -> List[str]:
412
- """Evaluate rule-based outcome constraints beyond the hidden issue list."""
413
-
414
- return task_failure_messages(task_definition, table, self._state_data)
415
-
416
- def _issue_type_counts(
417
- self, table: List[Dict[str, Any]], task_definition: TaskDefinition
418
- ) -> Dict[str, int]:
419
- """Count unresolved hidden issues by type."""
420
-
421
- counts: Dict[str, int] = {}
422
- for issue in task_definition["hidden_issues"]:
423
- if self._is_issue_unresolved(issue, table):
424
- issue_type = issue["type"]
425
- counts[issue_type] = counts.get(issue_type, 0) + 1
426
- return counts
427
-
428
- def _is_issue_unresolved(self, issue: HiddenIssue, table: List[Dict[str, Any]]) -> bool:
429
- """Determine whether a hidden issue is still unresolved."""
430
 
431
- issue_type = issue["type"]
432
- table_by_row_id = self._table_by_row_id(table)
433
 
434
- if issue_type == "valid_trap":
435
  return False
436
 
437
- if issue_type in {"duplicate", "conflict"}:
438
- rows = issue.get("rows", [])
439
- return all(row_id in table_by_row_id for row_id in rows)
 
 
 
 
440
 
441
  if issue_type == "missing_value":
442
- row = table_by_row_id.get(issue.get("row"))
443
- column = issue.get("column")
444
- return row is not None and column is not None and self._is_missing_value(row.get(column))
445
-
446
- if issue_type == "inconsistent_casing":
447
- column = issue.get("column")
448
- return any(
449
- row_id in table_by_row_id
450
- and self._needs_title_case(str(table_by_row_id[row_id].get(column, "")))
451
- for row_id in issue.get("rows", [])
452
- )
453
 
454
  if issue_type == "invalid_format":
455
- row = table_by_row_id.get(issue.get("row"))
456
- column = issue.get("column")
457
- return row is not None and column is not None and not self._value_is_valid(
458
- column, row.get(column)
459
- )
 
 
 
 
 
 
460
 
461
- if issue_type == "constraint_violation" and issue.get("constraint") == "unique_email":
462
- rows = issue.get("rows", [])
463
- emails = [
464
- table_by_row_id[row_id].get("email")
465
- for row_id in rows
466
- if row_id in table_by_row_id
467
- ]
468
- return len(emails) != len(set(emails))
469
-
470
- return False
471
-
472
- def _update_hints(self, result: Mapping[str, Any], issues_after: List[str]) -> None:
473
- """Add deterministic hints when the agent stalls or accumulates mistakes."""
474
 
475
- if not issues_after:
476
- return
477
 
478
- global_wrong_deletion_count = sum(
479
- count
480
- for key, count in self._global_mistake_memory.items()
481
- if key == "wrong_deletion" or key.endswith(":wrong_deletion")
482
  )
483
- if global_wrong_deletion_count >= 3:
484
- hint = (
485
- "You are repeatedly deleting valid rows. Try resolving issues "
486
- "instead of deleting."
487
- )
488
- if hint not in self._state_data["hints"]:
489
- self._state_data["hints"].append(hint)
490
-
491
- total_mistakes = sum(self._state_data["mistakes"].values())
492
- should_hint = bool(result.get("unnecessary_action")) or bool(
493
- result.get("wrong_deletion")
494
- ) or total_mistakes >= 2 or float(result.get("progress_delta", 0.0)) == 0.0
495
-
496
- if not should_hint:
497
- return
498
-
499
- next_hint = self._build_hint(issues_after[0])
500
- if next_hint not in self._state_data["hints"]:
501
- self._state_data["hints"].append(next_hint)
502
-
503
- def _build_hint(self, issue_message: str) -> str:
504
- """Map unresolved issue descriptions to small, actionable hints."""
505
-
506
- lowered = issue_message.lower()
507
- if "duplicate" in lowered:
508
- return "Look for rows that describe the same entity and keep only one representative record."
509
- if "missing" in lowered:
510
- return "A required field is still empty. Fill the missing value instead of deleting the row."
511
- if "email" in lowered and "format" in lowered:
512
- return "Normalize only the invalid email values; valid addresses should be preserved."
513
- if "phone" in lowered:
514
- return "Repair only phone values that are actually malformed."
515
- if "title-case" in lowered or "casing" in lowered:
516
- return "Normalize text columns to a consistent title-case style."
517
- if "unchanged" in lowered:
518
- return "Some unusual-looking rows are valid traps and should be preserved."
519
- return "Focus on the first unresolved issue and prefer minimal corrective actions."
520
-
521
- def _record_mistake_memory(
522
- self, action: Action, result: Mapping[str, Any]
523
- ) -> None:
524
- """Persist mistake events so hinting can look at prior failures."""
525
 
526
- for key, count in self._state_data["mistakes"].items():
527
- if count <= 0:
 
 
 
 
528
  continue
529
- if action.action_id:
530
- memory_entry = f"{action.action_id}:{key}:{count}"
531
- else:
532
- memory_entry = f"{action.action_type}:{key}:{count}"
533
- if memory_entry not in self._state_data["mistake_memory"]:
534
- self._state_data["mistake_memory"].append(memory_entry)
535
-
536
- self._global_mistake_memory[key] = (
537
- self._global_mistake_memory.get(key, 0) + 1
538
- )
539
- category_key = key.split(":")[-1]
540
- self._global_mistake_memory[category_key] = (
541
- self._global_mistake_memory.get(category_key, 0) + 1
542
- )
543
 
544
- if result.get("destructive_action"):
545
- entry = f"{action.action_type}:destructive_action"
546
- if entry not in self._state_data["mistake_memory"]:
547
- self._state_data["mistake_memory"].append(entry)
548
-
549
- def _resolve_missing_target_row(
550
- self, row_id: Optional[int], column: Optional[str]
551
- ) -> Optional[Dict[str, Any]]:
552
- """Choose the requested row or the first matching missing-value row."""
553
-
554
- if row_id is not None:
555
- return self._get_row_by_id(row_id)
556
-
557
- if column is None:
558
  return None
559
-
560
- for row in self._state_data["table"]:
561
- if self._is_missing_value(row.get(column)):
562
- return row
563
- return None
564
-
565
- def _normalized_value(self, column: str, value: Any) -> Any:
566
- """Return a normalized value for supported columns."""
567
-
568
- if not isinstance(value, str):
569
- return value
570
-
571
- if column in {"name", "city"}:
572
- return value.title()
573
-
574
- if column == "email" and not self._is_valid_email(value):
575
- normalized = value.strip().lower()
576
- normalized = normalized.replace("[at]", "@").replace(" at ", "@")
577
- if "@" not in normalized and normalized.endswith(".example.com"):
578
- normalized = normalized.replace(".example.com", "@example.com", 1)
579
- if "@" in normalized and "." not in normalized.split("@", 1)[1]:
580
- normalized = normalized + ".com"
581
- return normalized
582
-
583
- if column == "phone" and not self._is_valid_phone(value):
584
- digits = re.sub(r"\D", "", value)
585
- if len(digits) == 11 and digits.startswith("1"):
586
- digits = digits[1:]
587
- if len(digits) == 10:
588
- return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"
589
- return value
590
-
591
- def _value_is_valid(self, column: str, value: Any) -> bool:
592
- """Validate known column types used by the tasks."""
593
-
594
- if value is None:
595
- return False
596
- if column == "email":
597
- return self._is_valid_email(str(value))
598
- if column == "phone":
599
- return self._is_valid_phone(str(value))
600
- if column in {"name", "city"}:
601
- return not self._needs_title_case(str(value))
602
- return True
603
-
604
- def _is_valid_email(self, value: str) -> bool:
605
- """Return whether the supplied email string looks valid."""
606
-
607
- return bool(EMAIL_PATTERN.match(value.strip()))
608
-
609
- def _is_valid_phone(self, value: str) -> bool:
610
- """Return whether the supplied phone value is valid for this environment."""
611
-
612
- digits = re.sub(r"\D", "", value)
613
- return len(digits) == 10 or (len(digits) == 11 and digits.startswith("1"))
614
-
615
- def _needs_title_case(self, value: str) -> bool:
616
- """Detect whether a string still needs title-case normalization."""
617
-
618
- cleaned = value.strip()
619
- return bool(cleaned) and cleaned != cleaned.title()
620
-
621
- def _has_missing_required_values(
622
- self, table: Iterable[Dict[str, Any]], required_columns: Iterable[str]
623
  ) -> bool:
624
- """Check whether any required field remains missing."""
625
-
626
- for row in table:
627
- for column in required_columns:
628
- if self._is_missing_value(row.get(column)):
 
 
 
 
629
  return True
630
- return False
631
 
632
- def _has_duplicates(self, table: Iterable[Dict[str, Any]], column: str) -> bool:
633
- """Check whether a column contains duplicate non-empty values."""
634
-
635
- values = [row.get(column) for row in table if row.get(column) not in (None, "")]
636
- return len(values) != len(set(values))
 
637
 
638
- def _column_has_invalid_email(
639
- self, table: Iterable[Dict[str, Any]], column: str
640
- ) -> bool:
641
- """Check whether any remaining email value is invalid."""
642
 
643
- return any(
644
- row.get(column) not in (None, "") and not self._is_valid_email(str(row.get(column)))
645
- for row in table
 
 
 
 
 
 
 
 
646
  )
647
 
648
- def _column_has_invalid_phone(
649
- self, table: Iterable[Dict[str, Any]], column: str
650
- ) -> bool:
651
- """Check whether any remaining phone value is invalid."""
652
-
653
- return any(
654
- row.get(column) not in (None, "") and not self._is_valid_phone(str(row.get(column)))
655
- for row in table
656
  )
657
-
658
- def _column_needs_title_case(
659
- self, table: Iterable[Dict[str, Any]], column: str
660
- ) -> bool:
661
- """Check whether any remaining value still violates title-case normalization."""
662
-
663
- return any(
664
- isinstance(row.get(column), str) and self._needs_title_case(str(row.get(column)))
665
- for row in table
666
  )
667
 
668
- def _row_has_changed_from_initial(
669
- self, row_id: int, current_table: List[Dict[str, Any]]
670
- ) -> bool:
671
- """Check whether a protected row has changed relative to the task start."""
672
-
673
- current_row = self._table_by_row_id(current_table).get(row_id)
674
- initial_row = self._state_data["initial_table_by_row_id"].get(row_id)
675
- if current_row is None or initial_row is None:
676
- return True
677
- return current_row != initial_row
678
-
679
- def _row_is_protected(self, row_id: Optional[int]) -> bool:
680
- """Return whether a row is marked as a valid trap in the current task."""
681
-
682
- if row_id is None:
683
- return False
684
- for issue in self._state_data["task"]["hidden_issues"]:
685
- if issue["type"] == "valid_trap" and issue.get("row") == row_id:
686
- return True
687
- return False
688
-
689
- def _row_belongs_to_removable_issue(self, row_id: Optional[int]) -> bool:
690
- """Return whether deleting a row could plausibly resolve a structural issue."""
691
-
692
- if row_id is None:
693
- return False
694
- for issue in self._state_data["task"]["hidden_issues"]:
695
- if issue["type"] in {"duplicate", "conflict", "constraint_violation"} and row_id in issue.get(
696
- "rows", []
697
- ):
698
- return True
699
- return False
700
-
701
- def _remove_row_by_id(self, row_id: Optional[int]) -> bool:
702
- """Remove a row by id and report whether a row was deleted."""
703
-
704
- if row_id is None:
705
- return False
706
- table = self._state_data["table"]
707
- for index, row in enumerate(table):
708
- if row.get("row_id") == row_id:
709
- del table[index]
710
- return True
711
- return False
712
-
713
- def _get_row_by_id(self, row_id: Optional[int]) -> Optional[Dict[str, Any]]:
714
- """Return a mutable row reference by id."""
715
 
716
- if row_id is None:
717
- return None
718
- for row in self._state_data["table"]:
719
- if row.get("row_id") == row_id:
720
  return row
721
  return None
722
 
723
- def _table_by_row_id(self, table: List[Dict[str, Any]]) -> Dict[int, Dict[str, Any]]:
724
- """Index a table by row id."""
725
-
726
- return {
727
- int(row["row_id"]): deepcopy(row)
728
- for row in table
729
- if row.get("row_id") is not None
730
- }
731
-
732
- def _is_missing_value(self, value: Any) -> bool:
733
- """Return whether a cell should be treated as missing."""
734
-
735
- return value is None or value == ""
736
-
737
- def _format_history(self, action: Action) -> str:
738
- """Return a compact history entry for the applied action."""
739
-
740
- details = []
741
- if action.row_id is not None:
742
- details.append(f"row_id={action.row_id}")
743
- if action.column is not None:
744
- details.append(f"column={action.column}")
745
- if action.value is not None:
746
- details.append(f"value={action.value}")
747
- detail_text = ", ".join(details)
748
- return f"{action.action_type}({detail_text})" if detail_text else action.action_type
749
 
750
 
751
 class DataOpsGymEnv(DataOpsEnv):
+"""Semantic data-cleaning evaluation environment."""
 
 from __future__ import annotations
 
 from copy import deepcopy
 import random
+from typing import Any, Dict, List, Mapping, Optional, Tuple
 
+from grader import grade_step_details, grade_task_result
 from models import Action, Observation
+from task import easy_cleaning_task, hard_conflict_resolution_task, medium_normalization_task
 
 
 class DataOpsEnv:
+    """Step-based semantic evaluator with strict action protocol."""
 
     def __init__(self, seed: int = 0, task_name: Optional[str] = None) -> None:
         self._seed = seed
         self._rng = random.Random(seed)
         self._task_registry: List[Tuple[str, Any]] = [
             ("hard", hard_conflict_resolution_task),
         ]
         self._fixed_task_name = task_name
         self._state_data: Dict[str, Any] = {}
 
     def reset(self) -> Observation:
         task_name, task_factory = self._select_task_factory()
         variant_count = max(1, int(getattr(task_factory, "variant_count", 1)))
+        task_definition = deepcopy(task_factory(variant=self._rng.randrange(variant_count)))
         initial_table = deepcopy(task_definition["initial_table"])
         self._state_data = {
             "seed": self._seed,
             "task_name": task_name,
+            "task_variant": task_definition.get("variant_id", task_name),
             "task": task_definition,
+            "dataset_original": initial_table,
+            "dataset_modified": deepcopy(initial_table),
+            "action_history": [],
+            "per_record_scores": {},
+            "current_iteration_score": 0.0,
+            "previous_iteration_score": 0.0,
+            "failure_logs": [],
             "steps_taken": 0,
             "steps_remaining": task_definition["max_steps"],
             "done": False,
+            "totals": {
+                "total_fixes": 0,
+                "hallucinated_fixes": 0,
+                "total_cannot_determine": 0,
+                "correct_cannot_determine": 0,
+                "total_related_cases": 0,
+                "consistent_decisions": 0,
+            },
+            "related_decisions": {},
+            "detected_unresolved_issues": {},
+            "detected_issues": {},
+            "hallucination_rate": 0.0,
+            "uncertainty_accuracy": 0.0,
+            "consistency_score": 1.0,
         }
         return self._build_observation()
 
+    def step(self, action: Action | Mapping[str, Any]) -> Tuple[Observation, float, bool, Dict[str, Any]]:
         if not self._state_data:
             raise RuntimeError("Environment must be reset before calling step().")
+        if self._state_data["done"]:
             raise RuntimeError("Episode is finished. Call reset() before stepping again.")
 
+        parsed_action = action if isinstance(action, Action) else Action(**dict(action))
+        result = self._evaluate_action(parsed_action)
 
+        self._state_data["action_history"].append(parsed_action.model_dump())
         self._state_data["steps_taken"] += 1
         self._state_data["steps_remaining"] = max(
+            0, self._state_data["task"]["max_steps"] - self._state_data["steps_taken"]
         )
 
+        self._state_data["previous_iteration_score"] = float(
+            self._state_data["current_iteration_score"]
         )
+        reward, reward_components = grade_step_details(
             self._state_data, parsed_action.model_dump(), result
         )
+        rid = parsed_action.record_id
+        self._state_data["per_record_scores"][rid] = float(
+            self._state_data["per_record_scores"].get(rid, 0.0)
+        ) + reward
+        self._state_data["current_iteration_score"] = sum(
+            float(v) for v in self._state_data["per_record_scores"].values()
+        )
+        prev = self._state_data["previous_iteration_score"]
+        curr = self._state_data["current_iteration_score"]
+        if curr > prev:
+            reward += 0.1
+            reward_components["iteration_improvement"] = 0.1
+        elif curr < prev:
+            reward -= 0.1
+            reward_components["iteration_improvement"] = -0.1
+
+        self._update_metrics()
         task_score = grade_task_result(
+            self._state_data["task"], self._state_data["dataset_modified"], self._state_data
         )
 
+        done = self._state_data["steps_remaining"] <= 0
+        self._state_data["done"] = done
         info = {
+            "actions_taken": deepcopy(self._state_data["action_history"]),
+            "updated_dataset": deepcopy(self._state_data["dataset_modified"]),
+            "per_record_scores": deepcopy(self._state_data["per_record_scores"]),
+            "final_task_score": task_score,
+            "metrics": {
+                "hallucination_rate": self._state_data["hallucination_rate"],
+                "uncertainty_accuracy": self._state_data["uncertainty_accuracy"],
+                "consistency_score": self._state_data["consistency_score"],
+            },
+            "failure_logs": deepcopy(self._state_data["failure_logs"]),
+            "reward_components": reward_components,
+            "result": result,
         }
+        return self._build_observation(), reward, done, info
 
     def state(self) -> Dict[str, Any]:
         return deepcopy(self._state_data)
 
     def close(self) -> None:
         self._state_data = {}
 
     def _select_task_factory(self) -> Tuple[str, Any]:
         raise ValueError(f"Unknown task_name: {self._fixed_task_name}")
 
+    def _evaluate_action(self, action: Action) -> Dict[str, Any]:
+        table = self._state_data["dataset_modified"]
+        issue = self._matching_issue(action.record_id, action.field)
+        issue_key = self._issue_key(issue)
+        result: Dict[str, Any] = {"extra_fields_modified": 0}
+        self._apply_related_consistency(action, issue, result)
+        self._apply_follow_up_requirement(action, issue_key, result)
+
+        if action.action_type == "skip":
+            if issue is not None:
+                result["missed_issue"] = True
+                result["passive_penalty"] = True
+                if issue_key is not None:
+                    self._state_data["detected_unresolved_issues"][issue_key] = True
+                self._append_failure(action, "missed_issue", "Issue exists but action was skip.")
+            return result
+
+        if action.action_type == "detect_issue":
+            if issue is not None:
+                result["classification_correct"] = True
+                result["correct_issue_detected"] = True
+                result["passive_penalty"] = True
+                if issue_key is not None:
+                    if issue_key in self._state_data["detected_issues"]:
+                        result["repeated_detection"] = True
+                    self._state_data["detected_issues"][issue_key] = True
+                    self._state_data["detected_unresolved_issues"][issue_key] = True
+            else:
+                result["classification_incorrect"] = True
+                result["false_issue"] = True
+            return result
+
+        if action.action_type == "cannot_determine":
+            self._state_data["totals"]["total_cannot_determine"] += 1
+            if issue is None:
+                result["wrong_cannot_determine"] = True
+                self._append_failure(
+                    action, "wrong_fix", "cannot_determine used without any supporting issue."
+                )
+            elif issue.get("fixable", True) is False:
+                result["correct_cannot_determine"] = True
+                self._state_data["totals"]["correct_cannot_determine"] += 1
+                if issue_key is not None:
+                    self._state_data["detected_unresolved_issues"].pop(issue_key, None)
+                    if issue_key in self._state_data["detected_issues"]:
+                        result["resolved_detected_issue"] = True
+            else:
+                result["wrong_cannot_determine"] = True
+                self._append_failure(
+                    action, "wrong_fix", "cannot_determine used when evidence was sufficient."
+                )
+            return result
+
+        # fix_value
+        self._state_data["totals"]["total_fixes"] += 1
+        if issue is None:
+            related_issue_count = self._count_issues_for_record(action.record_id)
+            if related_issue_count > 0:
+                result["extra_fields_modified"] += 1
+
+        row = self._find_record(action.record_id, table)
+        if row is None or action.field not in row:
+            result["hallucinated_fix"] = True
+            self._state_data["totals"]["hallucinated_fixes"] += 1
+            self._append_failure(action, "hallucination", "Attempted fix with no evidence.")
+            return result
+
+        if issue is None:
+            result["hallucinated_fix"] = True
+            self._state_data["totals"]["hallucinated_fixes"] += 1
+            self._append_failure(action, "hallucination", "Field had no target issue.")
+            return result
+
+        if self._issue_resolved(issue, table):
+            result["hallucinated_fix"] = True
+            self._state_data["totals"]["hallucinated_fixes"] += 1
+            self._append_failure(action, "hallucination", "Field is already correct.")
+            return result
+
+        old_value = row.get(action.field)
+        before_row = deepcopy(row)
+        row[action.field] = action.value
+        if self._introduces_inconsistency(row, action.field, table):
+            result["hallucinated_fix"] = True
+            self._state_data["totals"]["hallucinated_fixes"] += 1
+            row[action.field] = old_value
+            self._append_failure(
+                action, "hallucination", "Fix introduces cross-record or temporal inconsistency."
+            )
+            return result
+
+        if self.validate_fix(issue, before_row, row, table):
+            result["correct_fix"] = True
+            result["classification_correct"] = True
+            if issue_key is not None:
+                if issue_key in self._state_data["detected_issues"]:
+                    result["resolved_detected_issue"] = True
+                self._state_data["detected_unresolved_issues"].pop(issue_key, None)
+        else:
+            row[action.field] = old_value
+            result["wrong_fix"] = True
+            self._append_failure(action, "wrong_fix", "Fix does not resolve the identified issue.")
+        return result
 
+    def _apply_follow_up_requirement(
+        self, action: Action, issue_key: Optional[str], result: Dict[str, Any]
     ) -> None:
+        unresolved = self._state_data.get("detected_unresolved_issues", {})
+        if not unresolved:
             return
 
+        # Follow-up action types are fix/cannot_determine against a detected issue.
+        is_follow_up = (
+            action.action_type in {"fix_value", "cannot_determine"}
+            and issue_key is not None
+            and issue_key in unresolved
+        )
+        if not is_follow_up:
+            result["passive_penalty"] = True
 
+    def _apply_related_consistency(
+        self, action: Action, issue: Optional[Dict[str, Any]], result: Dict[str, Any]
+    ) -> None:
+        if issue is None:
             return
+        issue_type = issue.get("type")
+        if issue_type not in {"duplicate", "conflict"}:
             return
 
+        rows = issue.get("rows", [])
+        if not rows:
             return
+        key = f"{issue_type}:{','.join(str(v) for v in sorted(rows))}"
+        self._state_data["totals"]["total_related_cases"] += 1
+        seen = self._state_data["related_decisions"]
+        decision = action.action_type
+        if key not in seen:
+            seen[key] = decision
+            result["consistent_handling"] = True
+            self._state_data["totals"]["consistent_decisions"] += 1
             return
+        if seen[key] == decision:
+            result["consistent_handling"] = True
+            self._state_data["totals"]["consistent_decisions"] += 1
+        else:
+            result["inconsistent_handling"] = True
+            self._append_failure(
+                action, "inconsistency", "Related records were handled inconsistently."
+            )
 
+    def _matching_issue(self, record_id: str, field: str) -> Optional[Dict[str, Any]]:
+        rid = self._parse_record_id(record_id)
+        for issue in self._state_data["task"]["hidden_issues"]:
+            issue_type = issue.get("type")
+            if issue_type == "missing_value" and issue.get("row") == rid and issue.get("column") == field:
+                return issue
+            if issue_type == "invalid_format" and issue.get("row") == rid and issue.get("column") == field:
+                return issue
+            if issue_type == "inconsistent_casing" and field == issue.get("column") and rid in issue.get("rows", []):
+                return issue
+            if (
+                issue_type in {"duplicate", "conflict", "constraint_violation"}
+                and (field in {"row", "record"} or field == issue.get("field"))
+                and rid in issue.get("rows", [])
             ):
+                ambiguous = issue_type in {"conflict", "constraint_violation"}
+                c = dict(issue)
+                c["ambiguous"] = ambiguous
+                return c
+        return None
 
+    def _issue_resolved(self, issue: Mapping[str, Any], table: List[Dict[str, Any]]) -> bool:
+        if issue.get("type") in {"duplicate", "conflict", "constraint_violation"}:
+            return False
+        rid = int(issue.get("row", -1))
+        field = issue.get("column")
+        row = self._find_record(str(rid), table)
+        if row is None:
+            return True
+        if issue.get("type") == "missing_value":
+            return row.get(field) not in (None, "", "unknown", "9999")
+        if issue.get("type") == "invalid_format":
+            value = str(row.get(field, ""))
+            if field == "email":
+                return "@" in value and "." in value.split("@")[-1]
+            if field == "phone":
+                digits = "".join(ch for ch in value if ch.isdigit())
+                return len(digits) in {10, 11}
+            if field in {"start_date", "end_date"}:
+                start = row.get("start_date")
+                end = row.get("end_date")
+                return not (start and end and str(end) < str(start))
+        return row.get(field) not in (None, "", "unknown", "9999")
+
+    def validate_fix(
         self,
+        issue: Mapping[str, Any],
+        before_row: Mapping[str, Any],
+        after_row: Mapping[str, Any],
+        table: List[Dict[str, Any]],
+    ) -> bool:
+        """Ground-truth validator for semantic fixes."""
 
+        issue_type = str(issue.get("type", ""))
+        field = str(issue.get("column") or issue.get("field") or "")
 
+        if field and before_row.get(field) == after_row.get(field):
             return False
 
+        if field == "age":
+            try:
+                age = int(after_row.get("age"))
+            except Exception:
+                return False
+            if age < 0 or age > 120:
+                return False
 
         if issue_type == "missing_value":
+            return after_row.get(field) not in (None, "", "unknown", "9999")
 
         if issue_type == "invalid_format":
+            value = str(after_row.get(field, ""))
+            if field == "email":
+                return "@" in value and "." in value.split("@")[-1]
+            if field == "phone":
+                digits = "".join(ch for ch in value if ch.isdigit())
+                return len(digits) in {10, 11}
+            if field in {"start_date", "end_date"}:
+                start = after_row.get("start_date")
+                end = after_row.get("end_date")
+                return not (start and end and str(end) < str(start))
+            return value not in ("", "unknown", "9999")
 
+        if issue_type == "inconsistent_casing":
+            value = after_row.get(field)
+            return isinstance(value, str) and value == value.title()
 
+        if issue_type in {"duplicate", "conflict", "constraint_violation"}:
+            return False
 
+        return not self._introduces_inconsistency(dict(after_row), field, table) and self._issue_resolved(
+            issue, table
         )
 
+ def _count_issues_for_record(self, record_id: str) -> int:
388
+ rid = self._parse_record_id(record_id)
389
+ count = 0
390
+ for issue in self._state_data["task"]["hidden_issues"]:
391
+ if issue.get("row") == rid:
392
+ count += 1
393
  continue
394
+ if rid in issue.get("rows", []):
395
+ count += 1
396
+ return count
 
 
 
 
 
 
 
 
 
 
 
397
 
398
+ def _issue_key(self, issue: Optional[Dict[str, Any]]) -> Optional[str]:
399
+ if issue is None:
 
 
 
 
 
 
 
 
 
 
 
 
400
  return None
401
+ issue_type = issue.get("type", "unknown")
402
+ if "row" in issue and "column" in issue:
403
+ return f"{issue_type}:row={issue.get('row')}:col={issue.get('column')}"
404
+ if "rows" in issue:
405
+ rows = ",".join(str(v) for v in sorted(issue.get("rows", [])))
406
+ field = issue.get("field", "record")
407
+ return f"{issue_type}:rows={rows}:field={field}"
408
+ return f"{issue_type}:generic"
409 +
410 +    def _introduces_inconsistency(
411 +        self, row: Dict[str, Any], field: str, table: List[Dict[str, Any]]
412      ) -> bool:
413 +        # Unique email consistency check across records.
414 +        if field == "email":
415 +            email = row.get("email")
416 +            if email not in (None, ""):
417 +                duplicates = [
418 +                    r for r in table
419 +                    if r is not row and str(r.get("email", "")).strip() == str(email).strip()
420 +                ]
421 +                if duplicates:
422                      return True
423 
424 +        # Temporal consistency check where both fields are present.
425 +        if field in {"start_date", "end_date"}:
426 +            start = row.get("start_date")
427 +            end = row.get("end_date")
428 +            if start and end and str(end) < str(start):
429 +                return True
430 
431 +        return False
432 
433 +    def _build_observation(self) -> Observation:
434 +        return Observation(
435 +            dataset={
436 +                "original": deepcopy(self._state_data["dataset_original"]),
437 +                "modified": deepcopy(self._state_data["dataset_modified"]),
438 +            },
439 +            action_history=deepcopy(self._state_data["action_history"]),
440 +            per_record_scores=deepcopy(self._state_data["per_record_scores"]),
441 +            current_iteration_score=float(self._state_data["current_iteration_score"]),
442 +            previous_iteration_score=float(self._state_data["previous_iteration_score"]),
443 +            steps_remaining=int(self._state_data["steps_remaining"]),
444          )
445 
446 +    def _update_metrics(self) -> None:
447 +        totals = self._state_data["totals"]
448 +        total_fixes = int(totals["total_fixes"])
449 +        self._state_data["hallucination_rate"] = (
450 +            0.0 if total_fixes == 0 else float(totals["hallucinated_fixes"]) / total_fixes
451          )
452 +        total_cd = int(totals["total_cannot_determine"])
453 +        self._state_data["uncertainty_accuracy"] = (
454 +            0.0 if total_cd == 0 else float(totals["correct_cannot_determine"]) / total_cd
455 +        )
456 +        total_related = int(totals["total_related_cases"])
457 +        self._state_data["consistency_score"] = (
458 +            1.0 if total_related == 0 else float(totals["consistent_decisions"]) / total_related
459          )
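With hypothetical running totals (the `totals` keys match the diff; the counts here are invented), the three metrics maintained by `_update_metrics` work out as follows. Note the asymmetric defaults: no fixes means a 0.0 hallucination rate, but no related cases means a perfect 1.0 consistency score.

```python
# Invented episode totals, aggregated the same way _update_metrics does.
totals = {
    "total_fixes": 4, "hallucinated_fixes": 1,
    "total_cannot_determine": 2, "correct_cannot_determine": 2,
    "total_related_cases": 0, "consistent_decisions": 0,
}

# Fraction of fixes that invented values not supported by the data.
hallucination_rate = (
    0.0 if totals["total_fixes"] == 0
    else totals["hallucinated_fixes"] / totals["total_fixes"]
)
# Fraction of cannot_determine calls that were actually warranted.
uncertainty_accuracy = (
    0.0 if totals["total_cannot_determine"] == 0
    else totals["correct_cannot_determine"] / totals["total_cannot_determine"]
)
# Consistency defaults to perfect when no related cases were seen.
consistency_score = (
    1.0 if totals["total_related_cases"] == 0
    else totals["consistent_decisions"] / totals["total_related_cases"]
)

print(hallucination_rate, uncertainty_accuracy, consistency_score)  # → 0.25 1.0 1.0
```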
460 
461 +    def _parse_record_id(self, record_id: str) -> int:
462 +        digits = "".join(ch for ch in str(record_id) if ch.isdigit())
463 +        return int(digits) if digits else -1
464 
465 +    def _find_record(self, record_id: str, table: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
466 +        rid = self._parse_record_id(record_id)
467 +        for row in table:
468 +            if int(row.get("row_id", -1)) == rid:
469                  return row
470          return None
471 
472 +    def _append_failure(self, action: Action, error_type: str, details: str) -> None:
473 +        mapped = error_type
474 +        if error_type == "wrong_fix":
475 +            mapped = "wrong_fix"
476 +        self._state_data["failure_logs"].append(
477 +            {
478 +                "record_id": action.record_id,
479 +                "error_type": mapped,
480 +                "details": details,
481 +                "confidence": float(action.confidence),
482 +            }
483 +        )
484 
485 
486  class DataOpsGymEnv(DataOpsEnv):
grader.py CHANGED
@@ -1,438 +1,8 @@
1
- """Evaluation and grading interfaces for ``dataops-gym``.
2
-
3
- This module is responsible for validating outputs, scoring task results, and
4
- capturing assessment metadata independently from task execution logic.
5
- """
6
 
7
  from __future__ import annotations
8
 
9
- import re
10
- from typing import Any, Dict, Iterable, Mapping, MutableMapping, Optional, Tuple
11
-
12
-
13
- # Dense reward values are intentionally small and additive so the agent receives
14
- # feedback for intermediate progress without requiring full task completion.
15
- CORRECT_DUPLICATE_REMOVAL_REWARD = 0.3
16
- CORRECT_NORMALIZATION_REWARD = 0.2
17
- FIX_MISSING_VALUE_REWARD = 0.2
18
- VALIDATION_SUCCESS_REWARD = 0.2
19
- EFFICIENCY_BONUS = 0.2
20
- RECOVERY_BONUS = 0.25
21
- STEP_PENALTY = -0.02
22
- PROGRESS_REWARD_SCALE = 0.3
23
-
24
- # Penalties are split into:
25
- # 1. a direct penalty for the current bad action, and
26
- # 2. an escalating repetition penalty if the same mistake keeps happening.
27
- WRONG_DELETION_PENALTY = -0.3
28
- UNNECESSARY_ACTION_PENALTY = -0.1
29
- NOOP_PENALTY = -0.05
30
- DESTRUCTIVE_ACTION_PENALTY = -0.4
31
-
32
- FIRST_REPEAT_PENALTY = -0.1
33
- SECOND_REPEAT_PENALTY = -0.2
34
- THIRD_OR_MORE_REPEAT_PENALTY = -0.4
35
- EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
36
-
37
-
38
- def detect_repeated_mistake(mistakes: Mapping[str, int], mistake_key: str) -> int:
39
- """Return how many times a mistake has already occurred before this step."""
40
-
41
- return int(mistakes.get(mistake_key, 0))
42
-
43
-
44
- def track_mistake(state: MutableMapping[str, Any], mistake_key: str) -> int:
45
- """Update the mistake counter in state and return the new occurrence count."""
46
-
47
- mistakes = state.setdefault("mistakes", {})
48
- if not isinstance(mistakes, dict):
49
- raise ValueError("state['mistakes'] must be a dictionary for mistake tracking")
50
-
51
- current_count = int(mistakes.get(mistake_key, 0))
52
- new_count = current_count + 1
53
- mistakes[mistake_key] = new_count
54
- return new_count
55
-
56
-
57
- def repeated_mistake_penalty(occurrence_count: int) -> float:
58
- """Return the escalating penalty for repeated mistakes."""
59
-
60
- if occurrence_count <= 1:
61
- return FIRST_REPEAT_PENALTY
62
- if occurrence_count == 2:
63
- return SECOND_REPEAT_PENALTY
64
- return THIRD_OR_MORE_REPEAT_PENALTY
65
-
66
-
67
- def _to_bool(mapping: Mapping[str, Any], key: str) -> bool:
68
- """Normalize truthy result flags into deterministic boolean checks."""
69
-
70
- return bool(mapping.get(key, False))
71
-
72
-
73
- def _mistake_key(
74
- action: Mapping[str, Any],
75
- result: Mapping[str, Any],
76
- fallback_key: str,
77
- ) -> str:
78
- """Build an action-specific mistake key with a safe fallback."""
79
-
80
- action_type = action.get("action_type")
81
- error_type = result.get("error_type", "general")
82
-
83
- if action_type:
84
- return f"{action_type}:{error_type}"
85
- return fallback_key
86
-
87
-
88
- def _clamp_reward(value: float) -> float:
89
- """Keep rewards in the required [-1.0, 1.0] range."""
90
-
91
- return max(-1.0, min(1.0, round(value, 4)))
92
-
93
-
94
- def _clamp_score(value: float) -> float:
95
- """Keep task-level scores in the required [0.0, 1.0] range."""
96
-
97
- return max(0.0, min(1.0, round(value, 4)))
98
-
99
-
100
- def _is_missing_value(value: Any) -> bool:
101
- """Return whether a cell should be considered missing."""
102
-
103
- return value is None or value == ""
104
-
105
-
106
- def _is_valid_email(value: str) -> bool:
107
- """Validate email formatting used by task graders."""
108
-
109
- return bool(EMAIL_PATTERN.match(value.strip()))
110
-
111
-
112
- def _is_valid_phone(value: str) -> bool:
113
- """Validate phone formatting used by task graders."""
114
-
115
- digits = re.sub(r"\D", "", value)
116
- return len(digits) == 10 or (len(digits) == 11 and digits.startswith("1"))
117
-
118
-
119
- def _needs_title_case(value: str) -> bool:
120
- """Return whether text still violates title-case normalization."""
121
-
122
- cleaned = value.strip()
123
- return bool(cleaned) and cleaned != cleaned.title()
124
-
125
-
126
- def _has_duplicates(table: Iterable[Dict[str, Any]], column: str) -> bool:
127
- """Check whether a column contains duplicate non-empty values."""
128
-
129
- values = [row.get(column) for row in table if row.get(column) not in (None, "")]
130
- return len(values) != len(set(values))
131
-
132
-
133
- def _table_by_row_id(table: Iterable[Dict[str, Any]]) -> Dict[int, Dict[str, Any]]:
134
- """Index a table by ``row_id`` for deterministic issue evaluation."""
135
-
136
- return {
137
- int(row["row_id"]): dict(row)
138
- for row in table
139
- if row.get("row_id") is not None
140
- }
141
-
142
-
143
- def _is_issue_resolved(issue: Mapping[str, Any], table_by_row_id: Dict[int, Dict[str, Any]]) -> bool:
144
- """Return whether a structured hidden issue has been resolved."""
145
-
146
- issue_type = issue.get("type")
147
-
148
- if issue_type == "valid_trap":
149
- return True
150
-
151
- if issue_type in {"duplicate", "conflict"}:
152
- rows = issue.get("rows", [])
153
- return not all(row_id in table_by_row_id for row_id in rows)
154
-
155
- if issue_type == "missing_value":
156
- row = table_by_row_id.get(issue.get("row"))
157
- column = issue.get("column")
158
- return row is None or column is None or not _is_missing_value(row.get(column))
159
-
160
- if issue_type == "inconsistent_casing":
161
- column = issue.get("column")
162
- rows = issue.get("rows", [])
163
- return not any(
164
- row_id in table_by_row_id
165
- and isinstance(table_by_row_id[row_id].get(column), str)
166
- and _needs_title_case(str(table_by_row_id[row_id].get(column)))
167
- for row_id in rows
168
- )
169
-
170
- if issue_type == "invalid_format":
171
- row = table_by_row_id.get(issue.get("row"))
172
- column = issue.get("column")
173
- if row is None or column is None:
174
- return True
175
- value = row.get(column)
176
- if column == "email":
177
- return _is_valid_email(str(value))
178
- if column == "phone":
179
- return _is_valid_phone(str(value))
180
- return True
181
-
182
- if issue_type == "constraint_violation" and issue.get("constraint") == "unique_email":
183
- rows = issue.get("rows", [])
184
- emails = [
185
- table_by_row_id[row_id].get("email")
186
- for row_id in rows
187
- if row_id in table_by_row_id
188
- ]
189
- return len(emails) == len(set(emails))
190
-
191
- return True
192
-
193
-
194
- def _task_check_results(
195
- task_definition: Mapping[str, Any],
196
- table: Iterable[Dict[str, Any]],
197
- state: Optional[Mapping[str, Any]] = None,
198
- ) -> list[Dict[str, Any]]:
199
- """Build explicit pass/fail checks for final grading and validation."""
200
-
201
- rows = [dict(row) for row in table]
202
- table_by_row_id = _table_by_row_id(rows)
203
- expected_outcome = dict(task_definition.get("expected_outcome", {}))
204
- checks: list[Dict[str, Any]] = []
205
-
206
- expected_row_count = expected_outcome.get("expected_row_count")
207
- if expected_row_count is not None:
208
- checks.append(
209
- {
210
- "name": "expected_row_count",
211
- "passed": len(rows) == expected_row_count,
212
- "message": f"Expected exactly {expected_row_count} rows in the cleaned table.",
213
- }
214
- )
215
-
216
- expected_row_range = expected_outcome.get("expected_row_count_range")
217
- if expected_row_range is not None:
218
- checks.append(
219
- {
220
- "name": "expected_row_count_range",
221
- "passed": expected_row_range["min"] <= len(rows) <= expected_row_range["max"],
222
- "message": (
223
- "Expected the cleaned table to contain between "
224
- f"{expected_row_range['min']} and {expected_row_range['max']} rows."
225
- ),
226
- }
227
- )
228
-
229
- required_columns = expected_outcome.get(
230
- "required_non_null_columns", task_definition.get("required_columns", [])
231
- )
232
- if required_columns:
233
- checks.append(
234
- {
235
- "name": "required_non_null_columns",
236
- "passed": not any(
237
- _is_missing_value(row.get(column))
238
- for row in rows
239
- for column in required_columns
240
- ),
241
- "message": "Required columns must be populated for all remaining rows.",
242
- }
243
- )
244
-
245
- for unique_column in expected_outcome.get("unique_by", []):
246
- checks.append(
247
- {
248
- "name": f"unique_by:{unique_column}",
249
- "passed": not _has_duplicates(rows, unique_column),
250
- "message": f"Values in '{unique_column}' must remain unique.",
251
- }
252
- )
253
-
254
- for column, rule in expected_outcome.get("normalized_columns", {}).items():
255
- if rule == "title_case":
256
- checks.append(
257
- {
258
- "name": f"normalized_column:{column}",
259
- "passed": not any(
260
- isinstance(row.get(column), str)
261
- and _needs_title_case(str(row.get(column)))
262
- for row in rows
263
- ),
264
- "message": f"Column '{column}' should use a consistent title-case style.",
265
- }
266
- )
267
-
268
- for column, rule in expected_outcome.get("format_rules", {}).items():
269
- if rule == "valid_email":
270
- checks.append(
271
- {
272
- "name": f"valid_email:{column}",
273
- "passed": not any(
274
- row.get(column) not in (None, "")
275
- and not _is_valid_email(str(row.get(column)))
276
- for row in rows
277
- ),
278
- "message": "All remaining email values must use a valid email format.",
279
- }
280
- )
281
- if rule == "normalized_phone":
282
- checks.append(
283
- {
284
- "name": f"normalized_phone:{column}",
285
- "passed": not any(
286
- row.get(column) not in (None, "")
287
- and not _is_valid_phone(str(row.get(column)))
288
- for row in rows
289
- ),
290
- "message": "All remaining phone values must use a consistent valid format.",
291
- }
292
- )
293
-
294
- initial_rows = {}
295
- if state is not None:
296
- initial_rows = dict(state.get("initial_table_by_row_id", {}))
297
-
298
- for row_id in expected_outcome.get("must_preserve_valid_rows", []):
299
- current_row = table_by_row_id.get(row_id)
300
- checks.append(
301
- {
302
- "name": f"preserve_valid_row:{row_id}",
303
- "passed": current_row is not None and current_row == initial_rows.get(row_id),
304
- "message": f"Valid row {row_id} should remain logically unchanged.",
305
- }
306
- )
307
-
308
- for row_group in expected_outcome.get("exactly_one_of_rows", []):
309
- surviving = [row_id for row_id in row_group if row_id in table_by_row_id]
310
- checks.append(
311
- {
312
- "name": f"exactly_one_of_rows:{','.join(str(row_id) for row_id in row_group)}",
313
- "passed": len(surviving) == 1,
314
- "message": f"Exactly one of rows {row_group} should remain in the cleaned table.",
315
- }
316
- )
317
-
318
- for row_id in expected_outcome.get("rows_must_survive", []):
319
- checks.append(
320
- {
321
- "name": f"rows_must_survive:{row_id}",
322
- "passed": row_id in table_by_row_id,
323
- "message": f"Row {row_id} must still be present in the cleaned table.",
324
- }
325
- )
326
-
327
- for row_id in expected_outcome.get("rows_must_be_removed", []):
328
- checks.append(
329
- {
330
- "name": f"rows_must_be_removed:{row_id}",
331
- "passed": row_id not in table_by_row_id,
332
- "message": f"Row {row_id} should not remain in the cleaned table.",
333
- }
334
- )
335
-
336
- for issue in task_definition.get("hidden_issues", []):
337
- if issue.get("type") == "valid_trap":
338
- continue
339
- message = issue.get("description") or f"Issue '{issue.get('type')}' must be resolved."
340
- checks.append(
341
- {
342
- "name": f"hidden_issue:{issue.get('type')}",
343
- "passed": _is_issue_resolved(issue, table_by_row_id),
344
- "message": message,
345
- }
346
- )
347
-
348
- return checks
349
-
350
-
351
- def _calculate_reward(
352
- state: MutableMapping[str, Any],
353
- action: Mapping[str, Any],
354
- result: MutableMapping[str, Any],
355
- ) -> float:
356
- """Compute the deterministic scalar reward for a single environment step."""
357
-
358
- reward = 0.0
359
-
360
- # Every step incurs a small cost so the agent is encouraged to solve the
361
- # task quickly instead of exploring indefinitely.
362
- reward += STEP_PENALTY
363
-
364
- # Intermediate rewards encourage the agent to make progress even when the
365
- # dataset is not fully clean yet.
366
- if _to_bool(result, "correct_duplicate_removal"):
367
- reward += CORRECT_DUPLICATE_REMOVAL_REWARD
368
-
369
- if _to_bool(result, "correct_normalization"):
370
- reward += CORRECT_NORMALIZATION_REWARD
371
-
372
- if _to_bool(result, "fixed_missing_value") or _to_bool(
373
- result, "fixing_missing_values"
374
- ):
375
- reward += FIX_MISSING_VALUE_REWARD
376
-
377
- if _to_bool(result, "validation_success"):
378
- reward += VALIDATION_SUCCESS_REWARD
379
-
380
- if _to_bool(result, "corrected_previous_mistake"):
381
- reward += RECOVERY_BONUS
382
-
383
- if _to_bool(result, "noop"):
384
- reward += NOOP_PENALTY
385
-
386
- if _to_bool(result, "destructive_action"):
387
- reward += DESTRUCTIVE_ACTION_PENALTY
388
-
389
- # Progress-based shaping provides a smoother learning signal for partial
390
- # improvement, even when a step does not fully resolve a visible issue.
391
- progress_delta = float(result.get("progress_delta", 0.0))
392
- progress_delta = max(0.0, min(1.0, progress_delta))
393
- reward += progress_delta * PROGRESS_REWARD_SCALE
394
-
395
- # Explicitly penalize steps that fail to improve task progress so agents do
396
- # not learn that random but harmless actions are equivalent to useful ones.
397
- if progress_delta == 0.0:
398
- reward -= 0.05
399
-
400
- # Direct penalties handle obviously harmful moves. Repetition is tracked
401
- # separately so the same bad behavior becomes more expensive over time.
402
- if _to_bool(result, "wrong_deletion"):
403
- reward += WRONG_DELETION_PENALTY
404
- mistake_key = _mistake_key(action, result, "wrong_deletion")
405
- occurrence_count = track_mistake(state, mistake_key)
406
- reward += repeated_mistake_penalty(occurrence_count)
407
-
408
- if _to_bool(result, "unnecessary_action"):
409
- reward += UNNECESSARY_ACTION_PENALTY
410
- mistake_key = _mistake_key(action, result, "unnecessary_action")
411
- occurrence_count = track_mistake(state, mistake_key)
412
- reward += repeated_mistake_penalty(occurrence_count)
413
-
414
- # Support arbitrary custom mistake keys in addition to the built-in ones.
415
- for mistake_key in result.get("mistake_keys", []):
416
- if mistake_key not in {"wrong_deletion", "unnecessary_action"}:
417
- occurrence_count = track_mistake(state, str(mistake_key))
418
- reward += repeated_mistake_penalty(occurrence_count)
419
-
420
- # Reward early completion only when the task finishes with steps still
421
- # available. This creates a simple deterministic efficiency incentive.
422
- if _to_bool(result, "task_completed") and int(state.get("steps_remaining", 0)) > 0:
423
- reward += EFFICIENCY_BONUS
424
-
425
- return _clamp_reward(reward)
426
-
427
-
428
- def grade_step(
429
- state: MutableMapping[str, Any],
430
- action: Mapping[str, Any],
431
- result: MutableMapping[str, Any],
432
- ) -> float:
433
- """Compute a deterministic dense reward for a single environment step."""
434
-
435
- return _calculate_reward(state, action, result)
436
 
437
 
438
  def grade_step_details(
@@ -440,153 +10,148 @@ def grade_step_details(
440
  action: Mapping[str, Any],
441
  result: MutableMapping[str, Any],
442
  ) -> Tuple[float, Dict[str, Any]]:
443
- """Compute reward plus a structured component breakdown for debugging."""
444
-
445
- previous_mistakes = {
446
- key: int(value)
447
- for key, value in state.get("mistakes", {}).items()
448
- }
449
- reward = grade_step(state, action, result)
450
-
451
- wrong_deletion_repeat_penalty = 0.0
452
- if result.get("wrong_deletion"):
453
- mistake_key = _mistake_key(action, result, "wrong_deletion")
454
- occurrence_count = int(state.get("mistakes", {}).get(mistake_key, 0))
455
- if occurrence_count > int(previous_mistakes.get(mistake_key, 0)):
456
- wrong_deletion_repeat_penalty = repeated_mistake_penalty(occurrence_count)
457
-
458
- unnecessary_repeat_penalty = 0.0
459
- if result.get("unnecessary_action"):
460
- mistake_key = _mistake_key(action, result, "unnecessary_action")
461
- occurrence_count = int(state.get("mistakes", {}).get(mistake_key, 0))
462
- if occurrence_count > int(previous_mistakes.get(mistake_key, 0)):
463
- unnecessary_repeat_penalty = repeated_mistake_penalty(occurrence_count)
464
 
465
- components: Dict[str, Any] = {
466
- "step_penalty": STEP_PENALTY,
467
- "duplicate_reward": (
468
- CORRECT_DUPLICATE_REMOVAL_REWARD
469
- if result.get("correct_duplicate_removal")
470
- else 0.0
471
- ),
472
- "normalization_reward": (
473
- CORRECT_NORMALIZATION_REWARD
474
- if result.get("correct_normalization")
475
- else 0.0
476
- ),
477
- "missing_value_reward": (
478
- FIX_MISSING_VALUE_REWARD if result.get("fixed_missing_value") else 0.0
479
- ),
480
- "validation_reward": (
481
- VALIDATION_SUCCESS_REWARD if result.get("validation_success") else 0.0
482
- ),
483
- "penalties": {
484
- "wrong_deletion": (
485
- WRONG_DELETION_PENALTY if result.get("wrong_deletion") else 0.0
486
- ),
487
- "unnecessary_action": (
488
- UNNECESSARY_ACTION_PENALTY if result.get("unnecessary_action") else 0.0
489
- ),
490
- "wrong_deletion_repeat": wrong_deletion_repeat_penalty,
491
- "unnecessary_action_repeat": unnecessary_repeat_penalty,
492
- "noop": NOOP_PENALTY if result.get("noop") else 0.0,
493
- "destructive_action": (
494
- DESTRUCTIVE_ACTION_PENALTY
495
- if result.get("destructive_action")
496
- else 0.0
497
- ),
498
- },
499
- "progress_reward": round(
500
- max(0.0, min(1.0, float(result.get("progress_delta", 0.0))))
501
- * PROGRESS_REWARD_SCALE,
502
- 4,
503
- ),
504
- "recovery_bonus": (
505
- RECOVERY_BONUS if result.get("corrected_previous_mistake") else 0.0
506
- ),
507
- "efficiency_bonus": (
508
- EFFICIENCY_BONUS
509
- if result.get("task_completed") and int(state.get("steps_remaining", 0)) > 0
510
- else 0.0
511
- ),
512
- }
513
 
514
- if float(result.get("progress_delta", 0.0)) == 0.0:
515
- components["no_progress_penalty"] = -0.05
 
516
 
517
- result["reward_components"] = components
518
- result["reward_total"] = reward
519
- return reward, components
520
 
521
 
522
  def grade_task_result(
523
  task_definition: Mapping[str, Any],
524
- table: Iterable[Dict[str, Any]],
525
  state: Optional[Mapping[str, Any]] = None,
526
  ) -> float:
527
- """Compute a deterministic final task score between 0.0 and 1.0."""
528
-
529
- checks = _task_check_results(task_definition, table, state)
530
- if not checks:
531
- return 0.0
532
- return _clamp_score(
533
- sum(1.0 for check in checks if check["passed"]) / len(checks)
534
  )
 
535
 
536
 
537
  def task_failure_messages(
538
  task_definition: Mapping[str, Any],
539
- table: Iterable[Dict[str, Any]],
540
  state: Optional[Mapping[str, Any]] = None,
541
  ) -> list[str]:
542
- """Return explicit failure messages for unresolved outcome checks."""
543
-
544
- return [
545
- str(check["message"])
546
- for check in _task_check_results(task_definition, table, state)
547
- if not bool(check["passed"])
548
- ]
549
-
550
-
551
- def grade_easy_cleaning_task(
552
- task_definition: Mapping[str, Any],
553
- table: Iterable[Dict[str, Any]],
554
- state: Optional[Mapping[str, Any]] = None,
555
- ) -> float:
556
- """Grade the easy cleaning task on a 0.0–1.0 scale."""
557
-
558
- return grade_task_result(task_definition, table, state)
559
-
560
-
561
- def grade_medium_normalization_task(
562
- task_definition: Mapping[str, Any],
563
- table: Iterable[Dict[str, Any]],
564
- state: Optional[Mapping[str, Any]] = None,
565
- ) -> float:
566
- """Grade the medium normalization task on a 0.0–1.0 scale."""
567
-
568
- return grade_task_result(task_definition, table, state)
569
-
570
-
571
- def grade_hard_conflict_resolution_task(
572
- task_definition: Mapping[str, Any],
573
- table: Iterable[Dict[str, Any]],
574
- state: Optional[Mapping[str, Any]] = None,
575
- ) -> float:
576
- """Grade the hard conflict-resolution task on a 0.0–1.0 scale."""
577
 
578
- return grade_task_result(task_definition, table, state)
 
 
 
 
579
 
580
 
581
- __all__ = [
582
- "detect_repeated_mistake",
583
- "grade_step",
584
- "grade_step_details",
585
- "grade_task_result",
586
- "task_failure_messages",
587
- "grade_easy_cleaning_task",
588
- "grade_medium_normalization_task",
589
- "grade_hard_conflict_resolution_task",
590
- "repeated_mistake_penalty",
591
- "track_mistake",
592
- ]
 
1 + """Strict semantic evaluation math for ``dataops-gym``."""
2 
3   from __future__ import annotations
4 
5 + from typing import Any, Dict, Mapping, MutableMapping, Optional, Tuple
6 
7 
8   def grade_step_details(
9       state: MutableMapping[str, Any],
10      action: Mapping[str, Any],
11      result: MutableMapping[str, Any],
12  ) -> Tuple[float, Dict[str, Any]]:
13 +     """Apply the exact per-step reward rules with no score clamping."""
14 +
15 +     score = 0.0
16 +     components: Dict[str, float] = {}
17 +     confidence = float(action.get("confidence", 0.0))
18 +
19 +     action_type = str(action.get("action_type", ""))
20 +     if result.get("classification_correct"):
21 +         # Detect is intentionally lower value than fix/cannot_determine.
22 +         if action_type == "detect_issue":
23 +             score += 0.1
24 +             components["classification"] = 0.1
25 +         else:
26 +             score += 0.2
27 +             components["classification"] = 0.2
28 +     elif result.get("classification_incorrect"):
29 +         score -= 0.2
30 +         components["classification"] = -0.2
31 +
32 +     if result.get("correct_issue_detected"):
33 +         if action_type == "detect_issue":
34 +             score += 0.05
35 +             components["issue_detection"] = 0.05
36 +         else:
37 +             score += 0.15
38 +             components["issue_detection"] = 0.15
39 +     elif result.get("missed_issue"):
40 +         score -= 0.15
41 +         components["issue_detection"] = -0.15
42 +     elif result.get("false_issue"):
43 +         score -= 0.05
44 +         components["issue_detection"] = -0.05
45 +
46 +     if result.get("correct_fix"):
47 +         score += 0.25
48 +         components["decision"] = 0.25
49 +     elif result.get("correct_cannot_determine"):
50 +         score += 0.25
51 +         components["decision"] = 0.25
52 +     elif result.get("hallucinated_fix"):
53 +         score -= 0.5
54 +         components["decision"] = -0.5
55 +     elif result.get("wrong_fix"):
56 +         score -= 0.4
57 +         components["decision"] = -0.4
58 +     elif result.get("wrong_cannot_determine"):
59 +         score -= 0.2
60 +         components["decision"] = -0.2
61 +
62 +     if result.get("passive_penalty"):
63 +         score -= 0.05
64 +         components["passive_penalty"] = -0.05
65 +
66 +     if result.get("repeated_detection"):
67 +         score -= 0.1
68 +         components["repeated_detection_penalty"] = -0.1
69 +
70 +     extra_mods = int(result.get("extra_fields_modified", 0))
71 +     if extra_mods > 0:
72 +         over = -0.05 * extra_mods
73 +         score += over
74 +         components["overcorrection"] = over
75 +
76 +     if result.get("consistent_handling"):
77 +         score += 0.2
78 +         components["cross_record_consistency"] = 0.2
79 +     elif result.get("inconsistent_handling"):
80 +         score -= 0.3
81 +         components["cross_record_consistency"] = -0.3
82 +
83 +     is_correct = bool(
84 +         result.get("classification_correct")
85 +         or result.get("correct_fix")
86 +         or result.get("correct_cannot_determine")
87 +         or result.get("correct_issue_detected")
88 +     )
89 +     is_wrong = bool(
90 +         result.get("classification_incorrect")
91 +         or result.get("wrong_fix")
92 +         or result.get("hallucinated_fix")
93 +         or result.get("wrong_cannot_determine")
94 +         or result.get("false_issue")
95 +     )
96 +     if confidence > 0.7 and is_correct:
97 +         score += 0.05
98 +         components["confidence"] = 0.05
99 +     elif confidence > 0.7 and is_wrong:
100 +         score -= 0.1
101 +         components["confidence"] = -0.1
102 
103 +     if result.get("hallucinated_fix") and confidence > 0.8:
104 +         score -= 0.2
105 +         components["confident_hallucination_amplification"] = -0.2
106 
107 +     if result.get("resolved_detected_issue"):
108 +         score += 0.15
109 +         components["resolution_reward"] = 0.15
110 
111 +     return score, components
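As a sanity check on the reward table above, a minimal standalone sketch (not the module itself; it re-implements only the fix/hallucination rules and the confidence adjustments from this diff) composes a score for one hypothetical step:

```python
# Illustration only: a few of the per-step rules from grade_step_details.
def sketch_step_score(action: dict, result: dict) -> float:
    score = 0.0
    confidence = float(action.get("confidence", 0.0))
    if result.get("correct_fix"):
        score += 0.25                 # correct fix reward
    elif result.get("hallucinated_fix"):
        score -= 0.5                  # hallucinated fix penalty
        if confidence > 0.8:
            score -= 0.2              # confident-hallucination amplification
    if confidence > 0.7 and result.get("correct_fix"):
        score += 0.05                 # confidence bonus on correct steps
    return score

# Confident correct fix: 0.25 + 0.05
print(round(sketch_step_score({"confidence": 0.9}, {"correct_fix": True}), 2))   # → 0.3
# Confident hallucinated fix: -0.5 - 0.2
print(round(sketch_step_score({"confidence": 0.9}, {"hallucinated_fix": True}), 2))  # → -0.7
```

The asymmetry is deliberate: a confident hallucination costs more than two correct fixes earn, so high confidence only pays off when it is justified.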
 
 
112 
113 
114  def grade_task_result(
115      task_definition: Mapping[str, Any],
116 +    table: Any,
117      state: Optional[Mapping[str, Any]] = None,
118  ) -> float:
119 +     """Compute final task score in [0, 1] using required formula."""
120 +
121 +     _ = task_definition
122 +     _ = table
123 +     state = state or {}
124 +     per_record_scores = dict(state.get("per_record_scores", {}))
125 +     n = max(1, len(per_record_scores))
126 +     avg_record_score = sum(float(v) for v in per_record_scores.values()) / n
127 +     normalized_record_score = (avg_record_score + 1.0) / 2.0
128 +     normalized_record_score = max(0.0, min(1.0, normalized_record_score))
129 +
130 +     hallucination_rate = float(state.get("hallucination_rate", 0.0))
131 +     uncertainty_accuracy = float(state.get("uncertainty_accuracy", 0.0))
132 +     consistency_score = float(state.get("consistency_score", 1.0))
133 +
134 +     task_score = (
135 +         0.5 * normalized_record_score
136 +         + 0.2 * (1.0 - hallucination_rate)
137 +         + 0.15 * uncertainty_accuracy
138 +         + 0.15 * consistency_score
139      )
140 +     return max(0.0, min(1.0, task_score))
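Plugging hypothetical episode aggregates into the weighted formula above (all values here are invented for illustration) gives:

```python
# Invented per-record step-score sums; real values come from the env state.
per_record_scores = {"rec-001": 0.6, "rec-002": -0.2}
avg = sum(per_record_scores.values()) / len(per_record_scores)   # 0.2
normalized = (avg + 1.0) / 2.0                                   # maps [-1, 1] → [0, 1], here 0.6

hallucination_rate = 0.25
uncertainty_accuracy = 0.5
consistency_score = 1.0

# Same weights as grade_task_result: 0.5 / 0.2 / 0.15 / 0.15.
task_score = (
    0.5 * normalized
    + 0.2 * (1.0 - hallucination_rate)
    + 0.15 * uncertainty_accuracy
    + 0.15 * consistency_score
)
print(round(task_score, 3))  # → 0.675
```

Because record quality carries only half the weight, an agent that fixes everything but hallucinates freely cannot reach a high task score.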
141 
142 
143  def task_failure_messages(
144      task_definition: Mapping[str, Any],
145 +    table: Any,
146      state: Optional[Mapping[str, Any]] = None,
147  ) -> list[str]:
148 +     """Return lightweight failure reasons collected during stepping."""
149 +
150 +     _ = task_definition
151 +     _ = table
152 +     state = state or {}
153 +     failures = state.get("failure_logs", [])
154 +     return [str(f.get("details", "")) for f in failures if f.get("details")]
155 
156 
157 + __all__ = ["grade_step_details", "grade_task_result", "task_failure_messages"]
inference.py CHANGED
@@ -23,15 +23,16 @@ from env import DataOpsEnv
23   API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
24   MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen3-VL-30B-A3B-Instruct:novita")
25   HF_TOKEN = os.getenv("HF_TOKEN")
26 - BENCHMARK = os.getenv("OPENENV_BENCHMARK", "dataops-env")
27   MAX_STEPS = 10
28   TEMPERATURE = 0.0
29   MAX_TOKENS = 160
30   MODEL_RETRIES = 2
31 - FALLBACK_ACTION = "noop()"
32   ACTION_PREFIX_RE = re.compile(r"^(action|next action)\s*[:\-]\s*", re.IGNORECASE)
33   EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
34 - TASK_ORDER = ["easy", "medium", "hard"]
35   IDENTIFIER_COLUMNS = ("customer_id", "vendor_id", "partner_id")
36   POLICY_CACHE_PATH = os.getenv("POLICY_CACHE_PATH", ".dataops_policy_cache.json")
37   POLICY_CACHE_VERSION = 1
@@ -209,12 +210,12 @@ def log_step(
209      )
210 
211 
212 - def log_end(success: bool, steps: int, rewards: List[float]) -> None:
213      """Emit the required episode end line."""
214 
215      rewards_text = ",".join(f"{reward:.2f}" for reward in rewards)
216      print(
217 -        f"[END] success={str(success).lower()} steps={steps} rewards={rewards_text}",
218          flush=True,
219      )
220 
@@ -293,8 +294,13 @@ def build_memory_keys(
293  ) -> Tuple[str, str]:
294      """Build exact-state and generalized problem-pattern keys."""
295 
296 -    table = list(observation.get("table", []))
297 -    normalized_issues = sorted(_normalize_issue_text(str(issue)) for issue in observation.get("issues", []))
298      state_key = _hash_key(
299          {
300              "task_name": task_name,
@@ -304,7 +310,7 @@
304              {key: row.get(key) for key in sorted(row.keys())}
305              for row in sorted(table, key=lambda row: int(row.get("row_id", 0)))
306          ],
307 -        "issues": normalized_issues,
308          }
309      )
310      pattern_key = _hash_key(
@@ -312,7 +318,7 @@
312          "task_name": task_name,
313          "goal": goal,
314          "summary": _table_summary(table),
315 -        "issues": normalized_issues,
316          }
317      )
318      return state_key, pattern_key
@@ -432,7 +438,7 @@ def _build_action_string(payload: Mapping[str, Any]) -> str:
432 
433      action_type = str(payload["action_type"])
434      args: List[str] = []
435 -    for key in ("row_id", "column", "value"):
436          if key not in payload or payload[key] is None:
437              continue
438          value = payload[key]
@@ -474,25 +480,22 @@ def action_string_to_payload(action_str: str, step_number: int) -> Tuple[str, Di
      try:
          expression = ast.parse(action_str, mode="eval").body
      except SyntaxError:
-         return FALLBACK_ACTION, {"action_id": f"step-{step_number:03d}", "action_type": "noop"}

      if not isinstance(expression, ast.Call) or not isinstance(expression.func, ast.Name):
-         return FALLBACK_ACTION, {"action_id": f"step-{step_number:03d}", "action_type": "noop"}

      allowed_actions = {
-         "remove_duplicate",
-         "fill_missing",
-         "normalize_column",
-         "delete_row",
-         "validate",
-         "noop",
      }
      action_type = expression.func.id
      if action_type not in allowed_actions:
-         return FALLBACK_ACTION, {"action_id": f"step-{step_number:03d}", "action_type": "noop"}

      payload: Dict[str, Any] = {
-         "action_id": f"step-{step_number:03d}",
          "action_type": action_type,
      }
      try:
@@ -501,7 +504,11 @@ def action_string_to_payload(action_str: str, step_number: int) -> Tuple[str, Di
              continue
          payload[keyword.arg] = ast.literal_eval(keyword.value)
      except (SyntaxError, ValueError, TypeError):
-         return FALLBACK_ACTION, {"action_id": f"step-{step_number:03d}", "action_type": "noop"}

      return _build_action_string(payload), payload
@@ -536,7 +543,7 @@ def _table_preview(table: Sequence[Mapping[str, Any]], limit: int = 6) -> str:
      summary = ", ".join(
          f"{key}={value}"
          for key, value in row.items()
-         if key in {"row_id", "name", "city", "email", "phone", "status", "customer_id", "vendor_id", "partner_id"}
      )
      preview_lines.append(f"- {summary}")
      return "\n".join(preview_lines) if preview_lines else "- None"
@@ -553,10 +560,8 @@ def build_user_prompt(
  ) -> str:
      """Construct a compact prompt that constrains the model to useful actions."""

-     issues = observation.get("issues", [])
-     hints = observation.get("hints", [])
-     issues_text = "\n".join(f"- {issue}" for issue in issues[:6]) if issues else "- None"
-     hints_text = "\n".join(f"- {hint}" for hint in hints[:3]) if hints else "- None"
      candidates_text = "\n".join(f"- {action}" for action in candidate_actions)
      blocked_text = "\n".join(f"- {action}" for action in blocked_actions[:5]) if blocked_actions else "- None"

@@ -565,13 +570,11 @@ def build_user_prompt(
  Step: {step}
  Goal: {goal}
  Steps remaining: {observation.get("steps_remaining")}
- Progress: {observation.get("progress")}
- Current issues:
- {issues_text}
- Current hints:
- {hints_text}
  Table preview:
- {_table_preview(observation.get("table", []))}
  Recent history:
  {build_history_lines(history)}
  Last action error: {last_error or "null"}
@@ -594,124 +597,74 @@ def _prefer_action(
      action_text = _build_action_string(candidate)
      if action_text not in blocked_actions:
          return dict(candidate)
-     return {"action_type": "validate"}


- def _exact_duplicate_candidates(table: Sequence[Mapping[str, Any]]) -> List[Dict[str, Any]]:
-     """Generate explicit remove-duplicate actions for exact duplicate rows."""
-
-     groups: Dict[Tuple[Tuple[str, Any], ...], List[int]] = defaultdict(list)
-     for row in table:
-         row_id = row.get("row_id")
-         if row_id is None:
-             continue
-         groups[_row_signature(row)].append(int(row_id))
-
-     actions: List[Dict[str, Any]] = []
-     for row_ids in groups.values():
-         if len(row_ids) > 1:
-             actions.append({"action_type": "remove_duplicate", "row_id": max(row_ids)})
-     return actions
-
-
- def _group_by_identifier(table: Sequence[Mapping[str, Any]]) -> Dict[Tuple[str, str], List[Dict[str, Any]]]:
-     """Group rows by likely business identifiers."""
-
-     groups: Dict[Tuple[str, str], List[Dict[str, Any]]] = defaultdict(list)
-     for row in table:
-         for key in IDENTIFIER_COLUMNS:
-             value = row.get(key)
-             if value not in (None, ""):
-                 groups[(key, str(value))].append(dict(row))
-     return groups
-
-
- def _row_quality_score(row: Mapping[str, Any]) -> int:
-     """Score a row so lower-quality conflict rows can be removed first."""
-
-     score = 0
-     if _is_valid_email(row.get("email")):
-         score += 3
-     if _is_valid_phone(row.get("phone")) or row.get("phone") in (None, ""):
-         score += 2
-     if isinstance(row.get("status"), str) and row.get("status") == "active":
-         score += 1
-     if isinstance(row.get("name"), str) and row.get("name").strip():
-         score += 1
-     return score
-
-
- def _structural_delete_candidates(table: Sequence[Mapping[str, Any]]) -> List[Dict[str, Any]]:
-     """Generate delete actions for non-exact structural conflicts."""

      actions: List[Dict[str, Any]] = []
-     for rows in _group_by_identifier(table).values():
-         if len(rows) < 2:
-             continue
-         signatures = {_row_signature(row) for row in rows}
-         if len(signatures) == 1:
-             continue
-         worst_row = sorted(
-             rows,
-             key=lambda row: (_row_quality_score(row), int(row.get("row_id", 0))),
-         )[0]
-         actions.append({"action_type": "delete_row", "row_id": int(worst_row["row_id"])})
-
-     email_groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
      for row in table:
-         email = row.get("email")
-         if email not in (None, ""):
-             email_groups[str(email)].append(dict(row))
-     for rows in email_groups.values():
-         if len(rows) < 2:
-             continue
-         worst_row = sorted(
-             rows,
-             key=lambda row: (_row_quality_score(row), int(row.get("row_id", 0))),
-         )[0]
-         action = {"action_type": "delete_row", "row_id": int(worst_row["row_id"])}
-         if action not in actions:
-             actions.append(action)
-     return actions
-
-
- def _missing_value_candidates(table: Sequence[Mapping[str, Any]]) -> List[Dict[str, Any]]:
-     """Generate candidate fill actions for visible missing values."""
-
-     present_columns = {key for row in table for key in row.keys()}
-     priorities = [
-         column
-         for column in ["email", "city", "phone", "status", "name"]
-         if column in present_columns
-     ]
-     actions: List[Dict[str, Any]] = []
-     for column in priorities:
-         for row in table:
-             if _is_missing(row.get(column)):
                  actions.append(
                      {
-                         "action_type": "fill_missing",
-                         "row_id": int(row["row_id"]),
-                         "column": column,
-                         "value": _infer_fill_value(row, column, table),
                      }
                  )
      return actions


- def _normalization_candidates(table: Sequence[Mapping[str, Any]]) -> List[Dict[str, Any]]:
-     """Generate candidate column-normalization actions."""
-
-     candidates: List[Dict[str, Any]] = []
-     if any(row.get("email") not in (None, "") and not _is_valid_email(row.get("email")) for row in table):
-         candidates.append({"action_type": "normalize_column", "column": "email"})
-     if any(row.get("phone") not in (None, "") and not _is_valid_phone(row.get("phone")) for row in table):
-         candidates.append({"action_type": "normalize_column", "column": "phone"})
-     if any(_needs_title_case(row.get("name")) for row in table):
-         candidates.append({"action_type": "normalize_column", "column": "name"})
-     if any(_needs_title_case(row.get("city")) for row in table):
-         candidates.append({"action_type": "normalize_column", "column": "city"})
-     return candidates


  def propose_candidate_actions(
720
  ) -> List[Dict[str, Any]]:
721
  """Generate ranked candidate actions from visible table state."""
722
 
723
- table = list(observation.get("table", []))
724
- candidates = (
725
- _exact_duplicate_candidates(table)
726
- + _structural_delete_candidates(table)
727
- + _missing_value_candidates(table)
728
- + _normalization_candidates(table)
729
- + [{"action_type": "validate"}]
730
- + [{"action_type": "noop"}]
731
- )
 
 
 
 
 
 
732
 
733
  unique_candidates: List[Dict[str, Any]] = []
734
  seen: set[str] = set()
@@ -746,7 +705,7 @@ def propose_candidate_actions(
746
  for candidate in unique_candidates
747
  if _build_action_string(candidate) != preferred_text
748
  ]
749
- return ordered[:8]
750
 
751
 
752
  def _order_candidates_with_memory(
@@ -754,15 +713,26 @@ def _order_candidates_with_memory(
      memory: PolicyMemory,
      state_key: str,
      pattern_key: str,
  ) -> List[Dict[str, Any]]:
      """Re-rank candidates using persistent cross-episode memory."""

      scored = []
      for index, candidate in enumerate(candidates):
          action_text = _build_action_string(candidate)
          scored.append(
              (
-                 -memory.score_action(state_key, pattern_key, action_text),
                  index,
                  dict(candidate),
              )
@@ -847,7 +817,9 @@ def choose_action(
      memory_blocked = memory.blocked_actions(state_key, pattern_key)
      combined_blocked = set(blocked_actions) | set(memory_blocked)
      candidates = propose_candidate_actions(observation, combined_blocked)
-     candidates = _order_candidates_with_memory(candidates, memory, state_key, pattern_key)
      heuristic_candidate = candidates[0]
      heuristic_text = _build_action_string(heuristic_candidate)
      candidate_texts = [_build_action_string(candidate) for candidate in candidates]
@@ -864,6 +836,8 @@ def choose_action(
          blocked_actions=sorted(combined_blocked),
      )

      chosen_text = model_text or heuristic_text
      normalized_text, payload = action_string_to_payload(chosen_text, step_number)
      if normalized_text in combined_blocked:
@@ -889,6 +863,8 @@ def run_episode(
      last_error: Optional[str] = None
      final_score = 0.0
      task_variant = "unknown"

      try:
          log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
@@ -903,7 +879,7 @@ def run_episode(
              task_name=task_name,
              task_variant=task_variant,
              observation=observation,
-             goal=observation_model.goal,
              step_number=step_number,
              history=history,
              last_error=last_error,
@@ -911,12 +887,23 @@ def run_episode(
          )

          try:
              observation_model, reward, done, info = env.step(action_payload)
              observation = observation_model.model_dump()
              result = info.get("result", {})
-             progress_delta = float(result.get("progress_delta", 0.0))
-             error_value = result.get("error_type") or info.get("error") or None
-             final_score = float(info.get("task_score", 0.0))
              if error_value == "general":
                  error_value = None
              memory.update(
@@ -931,6 +918,15 @@ def run_episode(
              )
              if error_value or progress_delta == 0.0 or reward <= 0.0:
                  blocked_actions.add(action_text)
          except Exception as exc:  # noqa: BLE001
              reward = 0.0
              done = True
@@ -965,24 +961,28 @@ def run_episode(
          )

          if done:
-             success = bool(final_score >= 0.95 and error_value is None)
              break
  finally:
      memory.save()
      close_method = getattr(env, "close", None)
      if callable(close_method):
          close_method()
-     log_end(success=success, steps=steps_taken, rewards=rewards)
      return final_score


  def main() -> None:
-     """Run all benchmark tasks with deterministic ordering and stdout formatting."""

      client = create_client()
      memory = PolicyMemory(POLICY_CACHE_PATH)
-     for task_index, task_name in enumerate(TASK_ORDER):
-         run_episode(client=client, memory=memory, task_name=task_name, seed=task_index)


  if __name__ == "__main__":
 
  API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
  MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen3-VL-30B-A3B-Instruct:novita")
  HF_TOKEN = os.getenv("HF_TOKEN")
+ BENCHMARK = os.getenv("BROWSERGYM_BENCHMARK", "dataops-env")
+ TASK_NAME = os.getenv("BROWSERGYM_TASK_NAME", "all")
+ TASK_ORDER = ["easy", "medium", "hard"]
  MAX_STEPS = 10
  TEMPERATURE = 0.0
  MAX_TOKENS = 160
  MODEL_RETRIES = 2
+ FALLBACK_ACTION = "skip(record_id='0', field='record', confidence=0.0)"
  ACTION_PREFIX_RE = re.compile(r"^(action|next action)\s*[:\-]\s*", re.IGNORECASE)
  EMAIL_PATTERN = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
  IDENTIFIER_COLUMNS = ("customer_id", "vendor_id", "partner_id")
  POLICY_CACHE_PATH = os.getenv("POLICY_CACHE_PATH", ".dataops_policy_cache.json")
  POLICY_CACHE_VERSION = 1
 
      )


+ def log_end(success: bool, steps: int, rewards: List[float], final_score: float) -> None:
      """Emit the required episode end line."""

      rewards_text = ",".join(f"{reward:.2f}" for reward in rewards)
      print(
+         f"[END] success={str(success).lower()} steps={steps} rewards={rewards_text} final_score={final_score:.4f}",
          flush=True,
      )
294
  ) -> Tuple[str, str]:
295
  """Build exact-state and generalized problem-pattern keys."""
296
 
297
+ dataset = observation.get("dataset", {}) if isinstance(observation, dict) else {}
298
+ table = list(dataset.get("modified", []))
299
+ normalized_issues = [
300
+ f"rows={len(table)}",
301
+ f"history={len(observation.get('action_history', []))}",
302
+ f"iter={observation.get('current_iteration_score', 0.0)}",
303
+ ]
304
  state_key = _hash_key(
305
  {
306
  "task_name": task_name,
 
310
  {key: row.get(key) for key in sorted(row.keys())}
311
  for row in sorted(table, key=lambda row: int(row.get("row_id", 0)))
312
  ],
313
+ "issues": sorted(normalized_issues),
314
  }
315
  )
316
  pattern_key = _hash_key(
 
318
  "task_name": task_name,
319
  "goal": goal,
320
  "summary": _table_summary(table),
321
+ "issues": sorted(normalized_issues),
322
  }
323
  )
324
  return state_key, pattern_key
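
> Note: `_hash_key` itself is not shown in this diff. As a hedged sketch of what a stable
> key builder like it typically looks like (names and digest choice are assumptions, not
> the module's actual code), canonical JSON plus a cryptographic digest gives
> order-insensitive, collision-resistant keys for the memory cache:

```python
import hashlib
import json

def hash_key(payload):
    """Stable digest of a JSON-serializable payload, usable as a cache key.

    sort_keys=True makes dict ordering irrelevant; default=str keeps
    non-JSON values (e.g. dates) from raising instead of hashing.
    """
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()
```

Because the keys are content-addressed, two episodes that reach the same normalized state map to the same memory entry regardless of dict insertion order.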
 

      action_type = str(payload["action_type"])
      args: List[str] = []
+     for key in ("record_id", "field", "value", "confidence"):
          if key not in payload or payload[key] is None:
              continue
          value = payload[key]
 
      try:
          expression = ast.parse(action_str, mode="eval").body
      except SyntaxError:
+         return FALLBACK_ACTION, {"action_type": "skip", "record_id": "0", "field": "record", "confidence": 0.0}

      if not isinstance(expression, ast.Call) or not isinstance(expression.func, ast.Name):
+         return FALLBACK_ACTION, {"action_type": "skip", "record_id": "0", "field": "record", "confidence": 0.0}

      allowed_actions = {
+         "detect_issue",
+         "fix_value",
+         "cannot_determine",
+         "skip",
      }
      action_type = expression.func.id
      if action_type not in allowed_actions:
+         return FALLBACK_ACTION, {"action_type": "skip", "record_id": "0", "field": "record", "confidence": 0.0}

      payload: Dict[str, Any] = {
          "action_type": action_type,
      }
      try:
              continue
          payload[keyword.arg] = ast.literal_eval(keyword.value)
      except (SyntaxError, ValueError, TypeError):
+         return FALLBACK_ACTION, {"action_type": "skip", "record_id": "0", "field": "record", "confidence": 0.0}
+
+     payload.setdefault("record_id", "0")
+     payload.setdefault("field", "record")
+     payload.setdefault("confidence", 0.6 if action_type != "skip" else 0.0)

      return _build_action_string(payload), payload
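
> The `ast`-based parsing idea in this hunk can be exercised standalone. A minimal sketch
> of the same approach (function and constant names here are illustrative, not the
> module's own): parse the action string as a Python expression, accept only a whitelisted
> call by name, and recover keyword arguments with `ast.literal_eval` so no code executes.

```python
import ast

ALLOWED = {"detect_issue", "fix_value", "cannot_determine", "skip"}
FALLBACK = ("skip", {"action_type": "skip", "record_id": "0", "field": "record", "confidence": 0.0})

def parse_action(action_str):
    """Parse a call-style action string into (action_type, payload), falling back to skip."""
    try:
        expr = ast.parse(action_str, mode="eval").body
    except SyntaxError:
        return FALLBACK
    # Only bare calls like fix_value(...) are accepted; attributes/lambdas are rejected.
    if not isinstance(expr, ast.Call) or not isinstance(expr.func, ast.Name):
        return FALLBACK
    if expr.func.id not in ALLOWED:
        return FALLBACK
    payload = {"action_type": expr.func.id}
    try:
        for kw in expr.keywords:
            if kw.arg is None:  # ignore **kwargs splats
                continue
            # literal_eval only evaluates constants, never arbitrary expressions.
            payload[kw.arg] = ast.literal_eval(kw.value)
    except (SyntaxError, ValueError, TypeError):
        return FALLBACK
    payload.setdefault("record_id", "0")
    payload.setdefault("field", "record")
    return expr.func.id, payload
```

Using `ast.literal_eval` on keyword values (rather than `eval`) is what makes this safe against model outputs that embed arbitrary code.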

      summary = ", ".join(
          f"{key}={value}"
          for key, value in row.items()
+         if key in {"row_id", "name", "city", "email", "phone", "status", "customer_id", "vendor_id", "partner_id", "age", "start_date", "end_date"}
      )
      preview_lines.append(f"- {summary}")
      return "\n".join(preview_lines) if preview_lines else "- None"
560
  ) -> str:
561
  """Construct a compact prompt that constrains the model to useful actions."""
562
 
563
+ dataset = observation.get("dataset", {})
564
+ modified = dataset.get("modified", [])
 
 
565
  candidates_text = "\n".join(f"- {action}" for action in candidate_actions)
566
  blocked_text = "\n".join(f"- {action}" for action in blocked_actions[:5]) if blocked_actions else "- None"
567
 
 
570
  Step: {step}
571
  Goal: {goal}
572
  Steps remaining: {observation.get("steps_remaining")}
573
+ Current iteration score: {observation.get("current_iteration_score")}
574
+ Previous iteration score: {observation.get("previous_iteration_score")}
575
+ Per-record scores: {observation.get("per_record_scores")}
 
 
576
  Table preview:
577
+ {_table_preview(modified)}
578
  Recent history:
579
  {build_history_lines(history)}
580
  Last action error: {last_error or "null"}
 
      action_text = _build_action_string(candidate)
      if action_text not in blocked_actions:
          return dict(candidate)
+     return {"action_type": "skip", "record_id": "0", "field": "record", "confidence": 0.0}


+ def _record_id(row: Mapping[str, Any]) -> str:
+     rid = row.get("row_id")
+     return str(rid) if rid is not None else "0"


+ def _issue_like_candidates(table: Sequence[Mapping[str, Any]]) -> List[Dict[str, Any]]:
+     """Generate issue detection/fix candidates for new semantic action schema."""

      actions: List[Dict[str, Any]] = []
      for row in table:
+         rid = _record_id(row)
+         for field, value in row.items():
+             if field == "row_id":
+                 continue
+             if _is_missing(value) or str(value).strip().lower() in {"unknown", "9999"}:
+                 actions.append(
+                     {"action_type": "detect_issue", "record_id": rid, "field": field, "confidence": 0.85}
+                 )
                  actions.append(
                      {
+                         "action_type": "fix_value",
+                         "record_id": rid,
+                         "field": field,
+                         "value": _infer_fill_value(row, field, table),
+                         "confidence": 0.75,
                      }
                  )
+             elif field == "email" and not _is_valid_email(value):
+                 fixed = str(value).replace("[at]", "@").replace(" at ", "@").replace(" ", "")
+                 if "@" in fixed and "." not in fixed.split("@")[-1]:
+                     fixed += ".com"
+                 actions.append({"action_type": "detect_issue", "record_id": rid, "field": field, "confidence": 0.85})
+                 actions.append({"action_type": "fix_value", "record_id": rid, "field": field, "value": fixed, "confidence": 0.8})
+             elif field == "phone" and not _is_valid_phone(value):
+                 digits = re.sub(r"\D", "", str(value))
+                 if len(digits) == 10:
+                     fixed = f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"
+                     actions.append({"action_type": "detect_issue", "record_id": rid, "field": field, "confidence": 0.8})
+                     actions.append({"action_type": "fix_value", "record_id": rid, "field": field, "value": fixed, "confidence": 0.75})
+             elif field in {"start_date", "end_date"}:
+                 start = row.get("start_date")
+                 end = row.get("end_date")
+                 if start and end and str(end) < str(start):
+                     actions.append({"action_type": "detect_issue", "record_id": rid, "field": field, "confidence": 0.8})
+                     actions.append({"action_type": "cannot_determine", "record_id": rid, "field": field, "confidence": 0.7})
+             elif field == "age":
+                 try:
+                     age = int(value)
+                 except Exception:
+                     age = -1
+                 if age < 0 or age > 120:
+                     actions.append({"action_type": "detect_issue", "record_id": rid, "field": field, "confidence": 0.9})
+                     actions.append({"action_type": "cannot_determine", "record_id": rid, "field": field, "confidence": 0.8})
      return actions
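
> The phone branch above normalizes by stripping non-digits and re-grouping. A minimal
> standalone sketch of that transformation (the helper name is illustrative; the hunk
> inlines the same logic):

```python
import re

def normalize_phone(raw):
    """Format a 10-digit phone number as XXX-XXX-XXXX; return None if not recoverable."""
    digits = re.sub(r"\D", "", str(raw))  # drop parentheses, spaces, dashes, etc.
    if len(digits) != 10:
        return None  # too few/many digits: abstain rather than guess
    return f"{digits[0:3]}-{digits[3:6]}-{digits[6:10]}"
```

Returning `None` for anything that is not exactly 10 digits mirrors the benchmark's abstention stance: an 11-digit or truncated number becomes a `cannot_determine` case instead of a fabricated fix.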


+ def _detected_keys_from_history(action_history: Sequence[Mapping[str, Any]]) -> set[str]:
+     """Extract previously detected issue keys from observation history."""
+
+     keys: set[str] = set()
+     for action in action_history:
+         if action.get("action_type") != "detect_issue":
+             continue
+         keys.add(f"{action.get('record_id')}::{action.get('field')}")
+     return keys


  def propose_candidate_actions(
 
  ) -> List[Dict[str, Any]]:
      """Generate ranked candidate actions from visible table state."""

+     dataset = observation.get("dataset", {}) if isinstance(observation, dict) else {}
+     table = list(dataset.get("modified", []))
+     detected_keys = _detected_keys_from_history(observation.get("action_history", []))
+     raw_candidates = _issue_like_candidates(table)
+     candidates: List[Dict[str, Any]] = []
+     for candidate in raw_candidates:
+         if candidate.get("action_type") == "detect_issue":
+             key = f"{candidate.get('record_id')}::{candidate.get('field')}"
+             # Detect once; then prefer follow-up actions.
+             if key in detected_keys:
+                 continue
+         candidates.append(candidate)
+     candidates += [
+         {"action_type": "skip", "record_id": "0", "field": "record", "confidence": 0.0}
+     ]

      unique_candidates: List[Dict[str, Any]] = []
      seen: set[str] = set()
          for candidate in unique_candidates
          if _build_action_string(candidate) != preferred_text
      ]
+     return ordered[:12]


  def _order_candidates_with_memory(
 
      memory: PolicyMemory,
      state_key: str,
      pattern_key: str,
+     recent_history: Sequence[str],
  ) -> List[Dict[str, Any]]:
      """Re-rank candidates using persistent cross-episode memory."""

      scored = []
+     recent_action_counts = Counter()
+     for item in recent_history[-5:]:
+         try:
+             parsed = item.split(" action=", 1)[1].split(" reward=", 1)[0].strip()
+             if parsed:
+                 recent_action_counts[parsed] += 1
+         except Exception:
+             continue
+
      for index, candidate in enumerate(candidates):
          action_text = _build_action_string(candidate)
+         repeat_penalty = recent_action_counts.get(action_text, 0) * 2.0
          scored.append(
              (
+                 -memory.score_action(state_key, pattern_key, action_text) + repeat_penalty,
                  index,
                  dict(candidate),
              )
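
> The re-ranking change above subtracts a penalty for actions chosen in the last few
> steps so the agent stops looping on a no-op fix. A self-contained sketch of the same
> scheme (names and the `penalty=2.0` weight mirror the hunk, but this is an
> illustration, not the module's `_order_candidates_with_memory`):

```python
from collections import Counter

def rerank(candidates, base_scores, recent_actions, penalty=2.0):
    """Sort action strings by memory score minus a repetition penalty.

    candidates: list of action strings to order.
    base_scores: action -> learned score (higher is better).
    recent_actions: last chosen actions; only the final 5 are counted.
    """
    counts = Counter(recent_actions[-5:])

    def key(action):
        # Negate so that sorted() puts the best (highest effective score) first.
        return -(base_scores.get(action, 0.0) - counts[action] * penalty)

    return sorted(candidates, key=key)
```

With `penalty=2.0`, a single repetition outweighs most learned score differences, which is the intended behavior: one wasted step is enough evidence to try something else.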
 
      memory_blocked = memory.blocked_actions(state_key, pattern_key)
      combined_blocked = set(blocked_actions) | set(memory_blocked)
      candidates = propose_candidate_actions(observation, combined_blocked)
+     candidates = _order_candidates_with_memory(
+         candidates, memory, state_key, pattern_key, history
+     )
      heuristic_candidate = candidates[0]
      heuristic_text = _build_action_string(heuristic_candidate)
      candidate_texts = [_build_action_string(candidate) for candidate in candidates]
          blocked_actions=sorted(combined_blocked),
      )

+     if model_text not in candidate_texts:
+         model_text = None
      chosen_text = model_text or heuristic_text
      normalized_text, payload = action_string_to_payload(chosen_text, step_number)
      if normalized_text in combined_blocked:
 
      last_error: Optional[str] = None
      final_score = 0.0
      task_variant = "unknown"
+     action_repeat_counts: Dict[str, int] = defaultdict(int)
+     no_change_counts: Dict[str, int] = defaultdict(int)

      try:
          log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)
              task_name=task_name,
              task_variant=task_variant,
              observation=observation,
+             goal=str(env.state().get("task", {}).get("goal", "")),
              step_number=step_number,
              history=history,
              last_error=last_error,
          )

          try:
+             before_dataset = observation.get("dataset", {}) if isinstance(observation, dict) else {}
+             before_modified = before_dataset.get("modified", [])
              observation_model, reward, done, info = env.step(action_payload)
              observation = observation_model.model_dump()
+             after_dataset = observation.get("dataset", {}) if isinstance(observation, dict) else {}
+             after_modified = after_dataset.get("modified", [])
              result = info.get("result", {})
+             curr_iter = float(observation.get("current_iteration_score", 0.0))
+             prev_iter = float(observation.get("previous_iteration_score", 0.0))
+             progress_delta = max(0.0, curr_iter - prev_iter)
+             error_value = "step_error" if (
+                 result.get("wrong_fix")
+                 or result.get("hallucinated_fix")
+                 or result.get("wrong_cannot_determine")
+                 or result.get("classification_incorrect")
+             ) else None
+             final_score = float(info.get("final_task_score", 0.0))
              if error_value == "general":
                  error_value = None
              memory.update(
              )
              if error_value or progress_delta == 0.0 or reward <= 0.0:
                  blocked_actions.add(action_text)
+                 action_repeat_counts[action_text] += 1
+                 if action_repeat_counts[action_text] > 2:
+                     blocked_actions.add(action_text)
+             if _stable_json(before_modified) == _stable_json(after_modified):
+                 no_change_counts[action_text] += 1
+                 if no_change_counts[action_text] >= 2:
+                     blocked_actions.add(action_text)
+             else:
+                 no_change_counts[action_text] = 0
          except Exception as exc:  # noqa: BLE001
              reward = 0.0
              done = True
          )

          if done:
+             success = bool(final_score > 0.0)
              break
  finally:
      memory.save()
      close_method = getattr(env, "close", None)
      if callable(close_method):
          close_method()
+     log_end(success=success, steps=steps_taken, rewards=rewards, final_score=final_score)
      return final_score
  def main() -> None:
+     """Run one configured task or all tasks in deterministic order."""

      client = create_client()
      memory = PolicyMemory(POLICY_CACHE_PATH)
+     task_name = str(TASK_NAME).strip().lower()
+     if task_name in {"all", "*"}:
+         for task_index, ordered_task in enumerate(TASK_ORDER):
+             run_episode(client=client, memory=memory, task_name=ordered_task, seed=task_index)
+         return
+     run_episode(client=client, memory=memory, task_name=task_name, seed=0)


  if __name__ == "__main__":
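
> The new `main()` dispatches on a task-name setting: `"all"` or `"*"` runs every task in
> order, anything else runs a single named episode. A minimal sketch of just that
> dispatch rule (the helper name `select_tasks` is illustrative; the hunk inlines the
> logic inside `main()`):

```python
from typing import List, Optional

TASK_ORDER = ["easy", "medium", "hard"]

def select_tasks(raw: Optional[str]) -> List[str]:
    """Resolve a task-name setting into the ordered list of tasks to run."""
    name = (raw or "all").strip().lower()  # unset behaves like "all"
    if name in {"all", "*"}:
        return list(TASK_ORDER)
    return [name]
```

Normalizing with `.strip().lower()` first means env-var values like `" Medium "` still select the intended task instead of silently failing.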
server/app.py CHANGED
@@ -6,6 +6,7 @@ deployment-facing application setup for the environment.

  from __future__ import annotations

  import logging
  import os
  from pathlib import Path
@@ -14,7 +15,8 @@ from threading import RLock
  from typing import Any, Dict, Optional

  from fastapi import FastAPI, HTTPException, Request
- from fastapi.responses import JSONResponse, RedirectResponse
  from pydantic import BaseModel, Field
  import uvicorn

@@ -30,7 +32,28 @@ from models import Action
  logging.basicConfig(level=logging.INFO)
  logger = logging.getLogger(__name__)

- app = FastAPI(title="dataops-env", version="1.0.0")
  active_env: Optional[DataOpsEnv] = None
  active_env_lock = RLock()

@@ -39,12 +62,20 @@ class ResetRequest(BaseModel):
      """Optional reset controls for reproducible task selection."""

      seed: int = Field(default=0, description="Deterministic seed for task sampling.")
-     task_name: str | None = Field(
          default=None,
          description="Optional fixed task name: easy, medium, or hard.",
      )


  @app.exception_handler(Exception)
  async def unhandled_exception_handler(
      request: Request, exc: Exception
@@ -65,6 +96,111 @@ def root() -> RedirectResponse:
      return RedirectResponse(url="/docs")


  @app.get("/health")
  def health() -> Dict[str, str]:
      """Return a lightweight deployment health signal."""
@@ -78,7 +214,10 @@ def reset(payload: ResetRequest | None = None) -> Dict[str, Any]:

      try:
          request = payload or ResetRequest()
-         env = DataOpsEnv(seed=request.seed, task_name=request.task_name)
          observation = env.reset()

          global active_env
@@ -90,6 +229,8 @@ def reset(payload: ResetRequest | None = None) -> Dict[str, Any]:
          "task_name": env.state().get("task_name"),
          "observation": observation.model_dump(),
      }
      except Exception as exc:
          logger.exception("Failed to reset environment", exc_info=exc)
          raise HTTPException(status_code=500, detail="Failed to reset environment") from exc
 
  from __future__ import annotations

+ from enum import Enum
  import logging
  import os
  from pathlib import Path
  from typing import Any, Dict, Optional

  from fastapi import FastAPI, HTTPException, Request
+ from fastapi.openapi.docs import get_swagger_ui_html
+ from fastapi.responses import HTMLResponse, JSONResponse, RedirectResponse
  from pydantic import BaseModel, Field
  import uvicorn

  logging.basicConfig(level=logging.INFO)
  logger = logging.getLogger(__name__)

+ app = FastAPI(
+     title="dataops-env",
+     version="1.0.0",
+     summary="Reasoning-first semantic data cleaning benchmark.",
+     description=(
+         "### DataOps Gym: Clean Data, Keep Truth\n"
+         "A step-based evaluation environment for testing whether agents can detect issues, "
+         "fix only with evidence, abstain with `cannot_determine` under ambiguity, and stay "
+         "consistent across related records.\n\n"
+         "**Tagline:** *Fix data without fabricating reality.*\n\n"
+         "#### Why this API matters\n"
+         "- Strict JSON action schema (no free-form outputs)\n"
+         "- Reward shaping that penalizes hallucinations and over-correction\n"
+         "- Cross-record consistency and uncertainty-aware scoring\n"
+     ),
+     contact={
+         "name": "DataOps Gym",
+         "url": "https://github.com/graheetphartyal23/Dataops--GYM",
+     },
+     docs_url=None,
+ )
  active_env: Optional[DataOpsEnv] = None
  active_env_lock = RLock()

      """Optional reset controls for reproducible task selection."""

      seed: int = Field(default=0, description="Deterministic seed for task sampling.")
+     task_name: "TaskName | None" = Field(
          default=None,
          description="Optional fixed task name: easy, medium, or hard.",
      )


+ class TaskName(str, Enum):
+     """Allowed benchmark task names."""
+
+     EASY = "easy"
+     MEDIUM = "medium"
+     HARD = "hard"


  @app.exception_handler(Exception)
  async def unhandled_exception_handler(
      request: Request, exc: Exception

      return RedirectResponse(url="/docs")


+ @app.get("/docs", include_in_schema=False)
+ def custom_docs() -> HTMLResponse:
+     """Serve Swagger UI with a dark theme override."""
+
+     swagger = get_swagger_ui_html(
+         openapi_url=app.openapi_url,
+         title=f"{app.title} - API Docs",
+         swagger_ui_parameters={
+             "syntaxHighlight.theme": "obsidian",
+             "displayRequestDuration": True,
+         },
+     )
+     dark_css = """
+     <style>
+     html, body { background: #0b1020 !important; color: #e5e7eb !important; }
+     .swagger-ui, .swagger-ui .topbar { background: #0b1020 !important; }
+     .swagger-ui .topbar { border-bottom: 1px solid #1f2937 !important; }
+     .swagger-ui .topbar a, .swagger-ui .topbar span { color: #e5e7eb !important; }
+
+     /* Keep top API details readable: white card + black text */
+     .swagger-ui .info {
+         background: #ffffff !important;
+         color: #111827 !important;
+         border: 1px solid #e5e7eb !important;
+         border-radius: 12px !important;
+         padding: 18px !important;
+         margin: 18px 0 24px 0 !important;
+     }
+     .swagger-ui .info .title, .swagger-ui .info h1, .swagger-ui .info h2,
+     .swagger-ui .info h3, .swagger-ui .info p, .swagger-ui .info li,
+     .swagger-ui .info a, .swagger-ui .info .base-url, .swagger-ui .info .version {
+         color: #111827 !important;
+     }
+     .swagger-ui .info ul { margin: 10px 0 0 18px !important; }
+
+     /* Default + Schemas sections as white cards with black text */
+     .swagger-ui .opblock-tag {
+         background: #ffffff !important;
+         color: #111827 !important;
+         border: 1px solid #e5e7eb !important;
+         border-radius: 10px !important;
+         padding: 10px 12px !important;
+         margin-bottom: 12px !important;
+     }
+     .swagger-ui .opblock {
+         background: #ffffff !important;
+         border: 1px solid #e5e7eb !important;
+         border-radius: 10px !important;
+         margin: 0 0 14px 0 !important;
+         box-shadow: 0 2px 10px rgba(0, 0, 0, 0.25) !important;
+     }
+     .swagger-ui .opblock .opblock-summary {
+         background: #ffffff !important;
+         border-bottom: 1px solid #e5e7eb !important;
+     }
+     .swagger-ui .opblock .opblock-summary-method,
+     .swagger-ui .opblock .opblock-summary-path,
+     .swagger-ui .opblock .opblock-summary-path__deprecated,
+     .swagger-ui .opblock .opblock-summary-description {
+         color: #111827 !important;
+         fill: #111827 !important;
+     }
+     .swagger-ui .opblock-section-header,
+     .swagger-ui .responses-inner h4,
+     .swagger-ui .responses-inner h5,
+     .swagger-ui .tab li,
+     .swagger-ui .parameter__type,
+     .swagger-ui .model-title,
+     .swagger-ui .models h4 {
+         color: #111827 !important;
+     }
+     .swagger-ui .models {
+         background: #ffffff !important;
+         border: 1px solid #e5e7eb !important;
+         border-radius: 10px !important;
+         padding: 8px !important;
+     }
+     .swagger-ui .model-container, .swagger-ui .model-box {
+         background: #ffffff !important;
+         color: #111827 !important;
+         border-color: #e5e7eb !important;
+     }
+     .swagger-ui .model, .swagger-ui .prop-name, .swagger-ui .prop-type, .swagger-ui .prop-format {
+         color: #111827 !important;
183
+ }
184
+ .swagger-ui .response-col_status, .swagger-ui .response-col_description,
185
+ .swagger-ui label, .swagger-ui .parameter__name,
186
+ .swagger-ui table tbody tr td, .swagger-ui .responses-table, .swagger-ui .parameters-col_description {
187
+ color: #111827 !important;
188
+ background: #ffffff !important;
189
+ border-color: #e5e7eb !important;
190
+ }
191
+ .swagger-ui input, .swagger-ui textarea, .swagger-ui select {
192
+ background: #0f172a !important;
193
+ color: #e5e7eb !important;
194
+ border-color: #374151 !important;
195
+ }
196
+ .swagger-ui .btn.execute { background: #2563eb !important; color: white !important; }
197
+ .swagger-ui .btn { border-color: #4b5563 !important; }
198
+ </style>
199
+ """
200
+ html = swagger.body.decode("utf-8").replace("</head>", f"{dark_css}</head>")
201
+ return HTMLResponse(content=html, status_code=200)
202
+
203
+
204
  @app.get("/health")
205
  def health() -> Dict[str, str]:
206
  """Return a lightweight deployment health signal."""
 
214
 
215
  try:
216
  request = payload or ResetRequest()
217
+ env = DataOpsEnv(
218
+ seed=request.seed,
219
+ task_name=request.task_name.value if request.task_name is not None else None,
220
+ )
221
  observation = env.reset()
222
 
223
  global active_env
 
229
  "task_name": env.state().get("task_name"),
230
  "observation": observation.model_dump(),
231
  }
232
+ except ValueError as exc:
233
+ raise HTTPException(status_code=422, detail=str(exc)) from exc
234
  except Exception as exc:
235
  logger.exception("Failed to reset environment", exc_info=exc)
236
  raise HTTPException(status_code=500, detail="Failed to reset environment") from exc
task.py CHANGED
@@ -8,6 +8,7 @@ broader and less gameable.
 
 from __future__ import annotations
 
+from copy import deepcopy
 from typing import Any, Dict, List, TypedDict
 
 
@@ -43,7 +44,23 @@ def _pick_variant(variant: int | None, variants: List[TaskDefinition]) -> TaskDe
     """Select a deterministic task variant with a stable default."""
 
     index = 0 if variant is None else max(0, min(len(variants) - 1, int(variant)))
-    return variants[index]
+    selected = deepcopy(variants[index])
+    selected["hidden_issues"] = _with_fixable_flags(selected.get("hidden_issues", []))
+    return selected
+
+
+def _with_fixable_flags(hidden_issues: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
+    """Ensure each hidden issue carries an explicit ``fixable`` flag."""
+
+    enriched: List[Dict[str, Any]] = []
+    for issue in hidden_issues:
+        issue_copy = dict(issue)
+        if "fixable" not in issue_copy:
+            issue_type = issue_copy.get("type")
+            # Structural conflicts usually require judgment across rows.
+            issue_copy["fixable"] = issue_type not in {"duplicate", "conflict", "constraint_violation"}
+        enriched.append(issue_copy)
+    return enriched
 
 
 def easy_cleaning_task(variant: int | None = None) -> TaskDefinition:
@@ -320,8 +337,8 @@ def hard_conflict_resolution_task(variant: int | None = None) -> TaskDefinition:
             "initial_table": [
                 {"row_id": 21, "customer_id": "C200", "name": "Nina Patel", "email": "nina.patel@example.com", "phone": "206-555-0101", "status": "active"},
                 {"row_id": 22, "customer_id": "C200", "name": "Nina Patel", "email": "nina.patel@example.com", "phone": "206-555-0101", "status": "active"},
-                {"row_id": 23, "customer_id": "C201", "name": "Evan Cole", "email": "evan.cole@example", "phone": "4155550102", "status": "active"},
-                {"row_id": 24, "customer_id": "C201", "name": "Evan Cole", "email": "evan.cole@example.com", "phone": "(415) 555-0102", "status": "inactive"},
+                {"row_id": 23, "customer_id": "C201", "name": "Evan Cole", "email": "evan.cole@example", "phone": "4155550102", "status": "active", "age": 250},
+                {"row_id": 24, "customer_id": "C201", "name": "Evan Cole", "email": "evan.cole@example.com", "phone": "(415) 555-0102", "status": "inactive", "age": 45},
                 {"row_id": 25, "customer_id": "C202", "name": "A. J. Brown", "email": "aj.brown@example.com", "phone": "+1-312-555-0103", "status": "active"},
                 {"row_id": 26, "customer_id": "C203", "name": "Marta Silva", "email": "shared@example.com", "phone": "646-555-0104", "status": "active"},
                 {"row_id": 27, "customer_id": "C204", "name": "Martin Silva", "email": "shared@example.com", "phone": "646-555-0105", "status": "active"},
@@ -336,6 +353,9 @@ def hard_conflict_resolution_task(variant: int | None = None) -> TaskDefinition:
                 {
                     "type": "conflict",
                     "rows": [23, 24],
+                    "field": "age",
+                    "values": [250, 45],
+                    "fixable": False,
                     "description": "Rows 23 and 24 conflict for the same customer and must be reconciled into one trustworthy record.",
                 },
                 {
@@ -399,8 +419,8 @@ def hard_conflict_resolution_task(variant: int | None = None) -> TaskDefinition:
             "initial_table": [
                 {"row_id": 51, "customer_id": "A900", "name": "Lena Brooks", "email": "lena.brooks@example.com", "phone": "212-555-0111", "status": "active"},
                 {"row_id": 52, "customer_id": "A900", "name": "Lena Brooks", "email": "lena.brooks@example.com", "phone": "212-555-0111", "status": "active"},
-                {"row_id": 53, "customer_id": "A901", "name": "Ravi Shah", "email": "ravi.shah example.com", "phone": "6465550112", "status": "active"},
-                {"row_id": 54, "customer_id": "A901", "name": "Ravi Shah", "email": "ravi.shah@example.com", "phone": "646-555-0112", "status": "inactive"},
+                {"row_id": 53, "customer_id": "A901", "name": "Ravi Shah", "email": "ravi.shah example.com", "phone": "6465550112", "status": "active", "age": 250},
+                {"row_id": 54, "customer_id": "A901", "name": "Ravi Shah", "email": "ravi.shah@example.com", "phone": "646-555-0112", "status": "inactive", "age": 45},
                 {"row_id": 55, "customer_id": "A902", "name": "M. E. Klein", "email": "mek@example.com", "phone": "+1-303-555-0113", "status": "active"},
                 {"row_id": 56, "customer_id": "A903", "name": "Sana Noor", "email": "ops@example.com", "phone": "718-555-0114", "status": "active"},
                 {"row_id": 57, "customer_id": "A904", "name": "Sana N.", "email": "ops@example.com", "phone": "718-555-0115", "status": "active"},
@@ -415,6 +435,9 @@ def hard_conflict_resolution_task(variant: int | None = None) -> TaskDefinition:
                 {
                     "type": "conflict",
                     "rows": [53, 54],
+                    "field": "age",
+                    "values": [250, 45],
+                    "fixable": False,
                    "description": "Rows 53 and 54 conflict for the same customer and must be reconciled into one trustworthy record.",
                 },
                 {