Spaces:

avanigupta
/

dataqa-env

Sleeping

App Files Files Community

avanigupta commited on 12 days ago

Commit

c3002ad

1 Parent(s): cd11aba

add fix stage+demo

Browse files

Files changed (10) hide show

README.md +97 -57
dataqa_env/models.py +12 -10
dataqa_env/server/environment.py +283 -34
dataqa_env/server/gradio_ui.py +487 -0
dataqa_env/server/tasks.py +48 -0
inference.py +113 -49
tests/test_environment.py +245 -24
tests/test_extensibility.py +34 -4
tests/test_inference.py +46 -3
tests/test_tasks.py +2 -2

README.md CHANGED Viewed

@@ -13,20 +13,38 @@ tags:
 # DataQA Environment
-An OpenEnv environment for **Data Quality Assurance** — an LLM agent inspects datasets with planted quality issues and must identify them all.
 ## Motivation
 Every ML engineer and data scientist spends significant time debugging data quality issues — missing values, type mismatches, logical inconsistencies, and subtle statistical anomalies — before data enters ML pipelines or production databases. This is a genuine, high-frequency human task that directly impacts model quality and business outcomes.
-DataQA turns this into a structured, gradable RL environment where agents must systematically inspect corrupted datasets, reason about schema constraints and validation rules, and pinpoint every planted issue — from obvious nulls to subtle data leakage signals that require domain expertise.
 ## Environment API
 | Endpoint | Method | Description |
 |----------|--------|-------------|
 | `/reset` | POST | Start a new episode with a corrupted dataset |
-| `/step` | POST | Submit identified issues, receive scored feedback |
 | `/state` | GET | Get current episode state |
 | `/health` | GET | Health check |
@@ -36,20 +54,27 @@ DataQA turns this into a structured, gradable RL environment where agents must s
 |------|--------|-----------|--------|-------------|
 | `easy` | 4 | Beginner | HR/Employee data | Nulls, wrong types, duplicates, out-of-range values |
 | `medium` | 6 | Intermediate | E-commerce orders | Format violations, inconsistent computed fields, duplicate keys |
-| `hard` | 8 | Advanced | ML experiment metadata | Data leakage signals, unreasonable GPU memory, timestamp ordering, whitespace-only fields |
 **Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (total != qty * price) and set membership checks. Hard issues require ML domain knowledge (val_loss < train_loss = data leakage) and multi-row temporal reasoning.
-## Action Space
-The agent submits a list of issue strings, each in the format:
-```
-row:<row_number>,col:<column_name>,issue:<issue_type>
-```
-- `row_number`: 1-indexed position in the CSV data (after header). Row 1 = first data row.
-- `column_name`: Exact column header name, lowercase.
-- `issue_type`: One of the supported types below.
 **Supported Issue Types:**
@@ -66,30 +91,36 @@ row:<row_number>,col:<column_name>,issue:<issue_type>
 ## Observation Space
-Each observation contains:
 | Field | Type | Description |
 |-------|------|-------------|
 | `dataset_csv` | str | The corrupted dataset in CSV format |
 | `schema_description` | str | Column types, ranges, and constraints |
 | `validation_rules` | str | Business rules the data must satisfy |
 | `task_description` | str | Task context and instructions |
-| `feedback` | str | Results from previous step (TP/FP/FN counts, precision/recall) |
 | `num_issues_hint` | int | Exact count of planted issues |
 | `max_steps` | int | Maximum attempts allowed |
 | `done` | bool | Whether episode has terminated |
-| `reward` | float | Best weighted reward so far (0.0-1.0) |
-**Observation Metadata** (available after each step):
-- `f1`, `weighted_reward`, `precision`, `recall`
-- `tp`, `fp`, `fn`
-- `difficulty_found`, `difficulty_missed`
 ## Reward Function
-### Difficulty-Weighted Reward (Primary)
-Each planted issue has a **difficulty weight** (1.0-3.0) reflecting how hard it is to detect. The primary reward is a **weighted F1 score** that provides meaningful per-step partial progress signals:
 | Weight | Category | Examples |
 |--------|----------|----------|
@@ -97,43 +128,51 @@ Each planted issue has a **difficulty weight** (1.0-3.0) reflecting how hard it
 | 1.5-2.0 | Medium | Duplicate keys, format violations, cross-column checks |
 | 2.5-3.0 | Hard | Data leakage, statistical outliers, whitespace-only |
-**Formula:**
-- **Weighted Recall** = (sum of difficulty weights for found issues) / (total difficulty weight)
-- **Weighted Precision** = (found weight) / (found weight + FP count * avg difficulty)
-- **Weighted F1** = harmonic mean of weighted precision and recall
-This means:
-- Finding a hard issue (difficulty 3.0) increases reward 3x more than finding an easy one (1.0)
-- False positives are penalized proportionally to average issue difficulty
-- The agent sees meaningful reward differences at every step, not just pass/fail
-### Standard F1 (also computed)
-Available in observation metadata for comparison. Uses unweighted set matching.
 ### Episode Boundaries
 - Each task allows up to 3 steps (attempts)
-- Episode ends when F1 >= 0.999 (perfect) or max steps reached
-- Best score across all steps is the final reward (monotonically non-decreasing)
-- Reward is always in [0.0, 1.0]
 ## Baseline Scores
-Baseline scores using Qwen2.5-72B-Instruct via HuggingFace Router:
-| Task | Expected Score Range | Description |
-|------|---------------------|-------------|
-| `easy` | 0.7 - 1.0 | Most LLMs find obvious issues reliably |
-| `medium` | 0.5 - 0.8 | Cross-column reasoning is challenging |
-| `hard` | 0.3 - 0.6 | ML domain knowledge and subtle patterns |
-Scores vary by model capability. Frontier models (GPT-4, Claude) typically score higher on the hard task due to better domain reasoning.
 ## Extensibility
-DataQA supports custom tasks, contamination rules, and difficulty levels via a programmatic API.
 ### Custom Contamination Rules
 ```python
@@ -180,7 +219,7 @@ register_task("custom", lambda seed: task)
 |------|--------|--------------------|
 | `missing_value` | Sets field to empty string | 1.0 |
 | `whitespace_value` | Sets field to single space | 2.5 |
-| `wrong_type_text` | Replaces with random text ("N/A", "null", etc.) | 1.0 |
 | `negative_value` | Negates numeric value | 1.0 |
 ## Quick Start
@@ -213,15 +252,16 @@ pip install -e ".[dev]"
 pytest tests/ -v
 ```
-89 tests covering:
-- Task creation, corruption, and issue planting (difficulty weights, seed determinism)
-- Issue key parsing (standard, lenient, edge cases)
-- F1 and difficulty-weighted reward computation
-- Full environment reset/step lifecycle
 - Inference script parsing and prompt building
-- **Structured log format** ([START], [STEP], [END] — exact field names and ordering)
 - Score bounds (0.0-1.0), best-score monotonicity
-- Extensibility API (custom rules, custom tasks, environment integration)
 ## Validation
@@ -247,19 +287,19 @@ openenv validate .
 ```
 dataqa_env/
 ├── __init__.py            # Public API + extensibility exports
-├── models.py              # Pydantic: DataQAAction, DataQAObservation, DataQAState
 ├── client.py              # EnvClient for WebSocket connections
 ├── server/
-│   ├── environment.py     # Core DataQAEnvironment (reset/step/state + weighted rewards)
 │   ├── tasks.py           # Task definitions + contamination rules + extensibility API
 │   ├── app.py             # FastAPI server (via openenv-core create_app)
 │   └── Dockerfile
 tests/
 ├── test_tasks.py          # Task creation, corruption, difficulty weights
-├── test_environment.py    # Environment lifecycle, scoring, metadata
-├── test_inference.py      # LLM response parsing, prompt building, log format
 └── test_extensibility.py  # Custom rules, custom tasks, registration API
-inference.py               # Baseline LLM agent (OpenAI client, structured logs)
 openenv.yaml               # OpenEnv/HF Spaces spec
 pyproject.toml             # Package metadata and dependencies
 Dockerfile                 # Production container

 # DataQA Environment
+A two-phase OpenEnv RL environment for **Data Quality Assurance** — an LLM agent inspects corrupted datasets, identifies all planted quality issues, and proposes data repairs.
+### Demo: Agent Trajectory Replay
+**Easy task** — Agent finds all 4 issues and proposes fixes (step 2):
+![Easy task: all issues found + fixes proposed](docs/demo_easy.png)
+**Hard task** — Agent identifies 8 subtle ML issues including data leakage and GPU memory outlier, proposes fixes (step 2):
+![Hard task: ML experiment metadata with 8 issues](docs/demo_hard.png)
+Green cells = correctly found issues. Yellow = missed. Green outlines = correct fixes with proposed values shown inline (e.g. `empty → David Kim`, `seventy-five thousand → 75000`).
+> The interactive replay UI is available at the `/web` endpoint on the HF Space.
 ## Motivation
 Every ML engineer and data scientist spends significant time debugging data quality issues — missing values, type mismatches, logical inconsistencies, and subtle statistical anomalies — before data enters ML pipelines or production databases. This is a genuine, high-frequency human task that directly impacts model quality and business outcomes.
+DataQA turns this into a **two-phase RL challenge**:
+1. **Identify** — systematically inspect corrupted data and pinpoint every planted issue
+2. **Fix** — propose corrected values by reasoning about schema, constraints, and context
+This creates a rich multi-step decision problem where agents must explore datasets strategically, distinguish subtle anomalies from noise, and reason about what the correct data should be.
 ## Environment API
 | Endpoint | Method | Description |
 |----------|--------|-------------|
 | `/reset` | POST | Start a new episode with a corrupted dataset |
+| `/step` | POST | Submit identified issues + proposed fixes |
 | `/state` | GET | Get current episode state |
 | `/health` | GET | Health check |
 |------|--------|-----------|--------|-------------|
 | `easy` | 4 | Beginner | HR/Employee data | Nulls, wrong types, duplicates, out-of-range values |
 | `medium` | 6 | Intermediate | E-commerce orders | Format violations, inconsistent computed fields, duplicate keys |
+| `hard` | 10 | Advanced | ML experiment metadata | Data leakage signals, unreasonable GPU memory, impossibly fast training, SOTA-exceeding accuracy, timestamp ordering, whitespace-only fields |
 **Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (total != qty * price) and set membership checks. Hard issues require ML domain knowledge (val_loss < train_loss = data leakage) and multi-row temporal reasoning.
+## Two-Phase Action Space
+### Phase 1: Identify Issues
+Submit issues in format: `row:<row_number>,col:<column_name>,issue:<issue_type>`
+- `row_number`: 1-indexed data row position (after header)
+- `column_name`: Exact column header name, lowercase
+- `issue_type`: One of the supported types below
+### Phase 2: Propose Fixes
+Submit fixes in format: `row:<row_number>,col:<column_name>,fix:<corrected_value>`
+The agent proposes the **correct value** that should replace the corrupted data. Fixes are graded against the original clean dataset.
+Both phases can be submitted in the same step or across multiple steps.
 **Supported Issue Types:**
 ## Observation Space
 | Field | Type | Description |
 |-------|------|-------------|
 | `dataset_csv` | str | The corrupted dataset in CSV format |
 | `schema_description` | str | Column types, ranges, and constraints |
 | `validation_rules` | str | Business rules the data must satisfy |
 | `task_description` | str | Task context and instructions |
+| `feedback` | str | Per-step results: TP/FP/FN, precision/recall, fix scores |
 | `num_issues_hint` | int | Exact count of planted issues |
 | `max_steps` | int | Maximum attempts allowed |
 | `done` | bool | Whether episode has terminated |
+| `reward` | float | Best combined reward so far (0.0-1.0) |
+**Observation Metadata** (per step):
+- Identify: `identify_f1`, `identify_score`, `precision`, `recall`, `tp`, `fp`, `fn`
+- Fix: `fix_score`, `fixes_correct`, `fixes_partial`, `fixes_wrong`, `fixes_attempted`
+- Combined: `combined_reward`, `difficulty_found`, `difficulty_missed`
 ## Reward Function
+### Combined Reward
+```
+combined_reward = 0.6 * identify_score + 0.4 * fix_score
+```
+If no fixes are submitted, `combined_reward = identify_score` (no penalty — backward compatible).
+### Identify Score (Difficulty-Weighted F1)
+Each planted issue has a **difficulty weight** (1.0-3.0):
 | Weight | Category | Examples |
 |--------|----------|----------|
 | 1.5-2.0 | Medium | Duplicate keys, format violations, cross-column checks |
 | 2.5-3.0 | Hard | Data leakage, statistical outliers, whitespace-only |
+- **Weighted Recall** = (difficulty of found issues) / (total difficulty)
+- **Weighted Precision** = penalizes false positives proportional to average difficulty
+- **Weighted F1** = harmonic mean
+### Fix Score (Difficulty-Weighted Quality)
+Each proposed fix is compared against the original clean value:
+| Fix Quality | Score | Description |
+|-------------|-------|-------------|
+| Exact match | 1.0 | Case-insensitive, whitespace-stripped match |
+| Numeric close | 0.8 | Within 1% of correct numeric value |
+| Correct cell | 0.1 | Right location, wrong value |
+| Non-issue cell | 0.0 | Fix targets a cell with no issue |
+Fix score = (sum of best fix score per issue × difficulty weight) / (total difficulty weight)
+### Reward Properties
+- **Per-step partial progress**: reward increases as more issues are found/fixed
+- **Difficulty-aware**: finding subtle issues earns more than obvious ones
+- **Penalizes bad behavior**: false positives reduce score, fixing non-issues earns nothing
+- **Monotonically non-decreasing**: best score across all steps is the final reward
+- **Always in [0.0, 1.0]**: meets hackathon requirement
 ### Episode Boundaries
 - Each task allows up to 3 steps (attempts)
+- Episode ends when F1 >= 0.999 (perfect identification) or max steps reached
+- Agent receives detailed feedback after each step to improve on next attempt
 ## Baseline Scores
+Baseline agent uses Qwen2.5-72B-Instruct via HuggingFace Router:
+| Task | Identify Score | Fix Score | Combined | Notes |
+|------|---------------|-----------|----------|-------|
+| `easy` | 0.7-1.0 | 0.5-0.9 | 0.6-1.0 | Most LLMs find obvious issues reliably |
+| `medium` | 0.5-0.8 | 0.3-0.6 | 0.4-0.7 | Cross-column reasoning challenges models |
+| `hard` | 0.3-0.6 | 0.2-0.4 | 0.3-0.5 | ML domain knowledge and subtle patterns |
+Scores vary by model. The hard task is designed to challenge frontier models.
 ## Extensibility
 ### Custom Contamination Rules
 ```python
 |------|--------|--------------------|
 | `missing_value` | Sets field to empty string | 1.0 |
 | `whitespace_value` | Sets field to single space | 2.5 |
+| `wrong_type_text` | Replaces with random text | 1.0 |
 | `negative_value` | Negates numeric value | 1.0 |
 ## Quick Start
 pytest tests/ -v
 ```
+118 tests covering:
+- Task creation, corruption, and difficulty weights
+- Issue key and fix parsing (standard, lenient, edge cases)
+- F1, weighted reward, and fix quality computation
+- Full environment lifecycle (identify-only and identify+fix)
+- Combined reward calculation and weight verification
 - Inference script parsing and prompt building
+- Structured log format ([START], [STEP], [END])
 - Score bounds (0.0-1.0), best-score monotonicity
+- Extensibility API (custom rules, custom tasks)
 ## Validation
 ```
 dataqa_env/
 ├── __init__.py            # Public API + extensibility exports
+├── models.py              # Pydantic: DataQAAction (issues + fixes), DataQAObservation, DataQAState
 ├── client.py              # EnvClient for WebSocket connections
 ├── server/
+│   ├── environment.py     # Two-phase DataQAEnvironment (identify + fix + combined reward)
 │   ├── tasks.py           # Task definitions + contamination rules + extensibility API
 │   ├── app.py             # FastAPI server (via openenv-core create_app)
 │   └── Dockerfile
 tests/
 ├── test_tasks.py          # Task creation, corruption, difficulty weights
+├── test_environment.py    # Identify scoring, fix grading, combined reward, lifecycle
+├── test_inference.py      # LLM response parsing, fix parsing, prompt building, log format
 └── test_extensibility.py  # Custom rules, custom tasks, registration API
+inference.py               # Two-phase baseline agent (identify → fix)
 openenv.yaml               # OpenEnv/HF Spaces spec
 pyproject.toml             # Package metadata and dependencies
 Dockerfile                 # Production container

dataqa_env/models.py CHANGED Viewed

@@ -16,21 +16,23 @@ from openenv.core.env_server.interfaces import Action, Observation, State
 class DataQAAction(Action):
     """
-    Agent submits a list of identified data quality issues.
-    Each issue is a string in the format: "row:<row_idx>,col:<col_name>,issue:<issue_type>"
     Supported issue types:
-        - missing_value
-        - wrong_type
-        - duplicate_row
-        - out_of_range
-        - format_violation
-        - inconsistent_value
-        - statistical_outlier
-        - referential_integrity
     """
     issues: List[str]
     # Include task_id so step() can reconstruct context in stateless HTTP mode
     task_id: str = "easy"

 class DataQAAction(Action):
     """
+    Agent submits identified issues AND optional proposed fixes.
+    Two-phase action space:
+      Phase 1 (Identify): List issues in format "row:<N>,col:<name>,issue:<type>"
+      Phase 2 (Fix):      List fixes in format "row:<N>,col:<name>,fix:<proposed_value>"
+    The agent can submit both in the same step or across multiple steps.
+    Combined reward = 0.6 * identify_score + 0.4 * fix_score
     Supported issue types:
+        missing_value, wrong_type, duplicate_row, out_of_range,
+        format_violation, inconsistent_value, statistical_outlier,
+        referential_integrity
     """
     issues: List[str]
+    fixes: List[str] = []
     # Include task_id so step() can reconstruct context in stateless HTTP mode
     task_id: str = "easy"

dataqa_env/server/environment.py CHANGED Viewed

@@ -3,8 +3,12 @@ DataQA Environment
 ------------------
 Server-side environment for data quality assurance tasks.
-The agent receives corrupted datasets and must identify planted quality issues.
-Scoring is based on F1 (precision-recall) of correctly matched issues.
 """
 from __future__ import annotations
@@ -18,6 +22,10 @@ from openenv.core.env_server.interfaces import Action, Environment, Observation
 from ..models import DataQAAction, DataQAObservation, DataQAState
 from .tasks import PlantedIssue, Task, get_task, list_tasks
 def parse_issue_key(raw: str) -> Optional[str]:
     """
@@ -26,7 +34,6 @@ def parse_issue_key(raw: str) -> Optional[str]:
     Returns normalized key or None if unparseable.
     """
     raw = raw.strip().lower()
-    # Be lenient with formatting
     row_match = re.search(r"row\s*[:=]\s*(\d+)", raw)
     col_match = re.search(r"col\s*[:=]\s*([\w_]+)", raw)
     issue_match = re.search(r"issue\s*[:=]\s*([\w_]+)", raw)
@@ -36,6 +43,22 @@ def parse_issue_key(raw: str) -> Optional[str]:
     return None
 def compute_f1(reported_keys: Set[str], planted_keys: Set[str]) -> dict:
     """Compute precision, recall, and F1 score."""
     if not reported_keys and not planted_keys:
@@ -83,7 +106,6 @@ def compute_weighted_reward(
     if not planted_keys:
         return {"weighted_reward": 0.0, "difficulty_found": 0.0, "difficulty_missed": 0.0}
-    # Sum difficulty weights for found vs missed issues
     found_keys = reported_keys & planted_keys
     missed_keys = planted_keys - reported_keys
     false_positive_count = len(reported_keys - planted_keys)
@@ -92,15 +114,12 @@ def compute_weighted_reward(
     difficulty_missed = sum(planted_by_key[k].difficulty for k in missed_keys)
     total_weight = sum(i.difficulty for i in planted_issues)
-    # Weighted recall: proportion of difficulty captured
     weighted_recall = difficulty_found / total_weight if total_weight > 0 else 0.0
-    # Penalty for false positives (scaled by avg difficulty so penalty is proportional)
     avg_difficulty = total_weight / len(planted_issues)
     fp_penalty_weight = false_positive_count * avg_difficulty
     weighted_precision = difficulty_found / (difficulty_found + fp_penalty_weight) if (difficulty_found + fp_penalty_weight) > 0 else 0.0
-    # Weighted F1
     if (weighted_precision + weighted_recall) > 0:
         weighted_reward = 2 * weighted_precision * weighted_recall / (weighted_precision + weighted_recall)
     else:
@@ -113,12 +132,134 @@ def compute_weighted_reward(
     }
 class DataQAEnvironment(Environment):
     """
-    Data Quality Assurance environment.
-    The agent inspects corrupted datasets and reports quality issues.
-    Reward is F1 score of correctly identified issues vs planted ground truth.
     """
     SUPPORTS_CONCURRENT_SESSIONS = True
@@ -158,7 +299,11 @@ class DataQAEnvironment(Environment):
             schema_description=self._current_task.schema_description,
             validation_rules=self._current_task.validation_rules,
             task_description=self._current_task.description,
-            feedback="Environment reset. Inspect the dataset and report all quality issues.",
             task_id=task_id,
             num_issues_hint=len(self._current_task.planted_issues),
             max_steps=self._current_task.max_steps,
@@ -175,15 +320,14 @@ class DataQAEnvironment(Environment):
         if not isinstance(action, DataQAAction):
             raise ValueError(f"Expected DataQAAction, got {type(action)}")
-        # In stateless HTTP mode, each request creates a fresh env instance.
-        # Auto-reset using the task_id from the action so step() works standalone.
         if self._current_task is None:
             self.reset(task_id=action.task_id)
         self._state.step_count += 1
         self._state.current_step += 1
-        # Parse reported issues
         reported_keys: Set[str] = set()
         parse_errors: list[str] = []
         for raw_issue in action.issues:
@@ -191,51 +335,148 @@ class DataQAEnvironment(Environment):
             if key:
                 reported_keys.add(key)
             else:
-                parse_errors.append(f"Could not parse: '{raw_issue}'")
-        # Compute score (standard F1)
         metrics = compute_f1(reported_keys, self._planted_keys)
-        score = metrics["f1"]
-        # Compute difficulty-weighted reward (richer per-step signal)
         weighted = compute_weighted_reward(reported_keys, self._current_task.planted_issues)
-        weighted_reward = weighted["weighted_reward"]
-        # Use weighted reward as the primary reward signal
-        self._best_score = max(self._best_score, weighted_reward)
         self._state.best_score = self._best_score
-        # Check if done
         is_done = (
-            score >= 0.999  # Perfect score (all issues found exactly)
             or self._state.current_step >= self._state.max_steps
         )
-        # Build feedback
         feedback_lines = [
             f"Step {self._state.current_step}/{self._state.max_steps}",
             f"Issues reported: {len(reported_keys)}",
             f"True positives: {metrics['tp']}, False positives: {metrics['fp']}, Missed: {metrics['fn']}",
-            f"Precision: {metrics['precision']:.3f}, Recall: {metrics['recall']:.3f}, F1: {score:.3f}",
-            f"Weighted reward: {weighted_reward:.3f} (difficulty found: {weighted['difficulty_found']}, missed: {weighted['difficulty_missed']})",
         ]
         if parse_errors:
-            feedback_lines.append(f"Parse errors ({len(parse_errors)}): {'; '.join(parse_errors[:3])}")
         if not is_done:
-            # Give hints about what was missed without revealing exact answers
             if metrics["fn"] > 0:
                 feedback_lines.append(
-                    f"You missed {metrics['fn']} issue(s). Review the dataset carefully."
                 )
             if metrics["fp"] > 0:
                 feedback_lines.append(
-                    f"{metrics['fp']} of your reported issues were incorrect."
                 )
-            feedback_lines.append("You can submit again with an updated list of issues.")
         else:
-            feedback_lines.append(f"Task complete! Final best weighted reward: {self._best_score:.3f}")
         return DataQAObservation(
             dataset_csv=self._current_task.corrupted_csv,
@@ -249,8 +490,10 @@ class DataQAEnvironment(Environment):
             done=is_done,
             reward=self._best_score,
             metadata={
-                "f1": score,
-                "weighted_reward": weighted_reward,
                 "precision": metrics["precision"],
                 "recall": metrics["recall"],
                 "tp": metrics["tp"],
@@ -258,6 +501,12 @@ class DataQAEnvironment(Environment):
                 "fn": metrics["fn"],
                 "difficulty_found": weighted["difficulty_found"],
                 "difficulty_missed": weighted["difficulty_missed"],
             },
         )

 ------------------
 Server-side environment for data quality assurance tasks.
+Two-phase RL environment:
+  Phase 1 (Identify): Agent inspects corrupted datasets and reports quality issues.
+  Phase 2 (Fix):      Agent proposes corrections for identified issues.
+Combined reward = 0.6 * identify_score + 0.4 * fix_score
+Both phases scored with difficulty-weighted metrics for rich per-step signal.
 """
 from __future__ import annotations
 from ..models import DataQAAction, DataQAObservation, DataQAState
 from .tasks import PlantedIssue, Task, get_task, list_tasks
+# Reward weights for the two phases
+IDENTIFY_WEIGHT = 0.6
+FIX_WEIGHT = 0.4
 def parse_issue_key(raw: str) -> Optional[str]:
     """
     Returns normalized key or None if unparseable.
     """
     raw = raw.strip().lower()
     row_match = re.search(r"row\s*[:=]\s*(\d+)", raw)
     col_match = re.search(r"col\s*[:=]\s*([\w_]+)", raw)
     issue_match = re.search(r"issue\s*[:=]\s*([\w_]+)", raw)
     return None
+def parse_fix(raw: str) -> Optional[tuple[int, str, str]]:
+    """
+    Parse an agent-proposed fix into (row, col, proposed_value).
+    Expected format: row:<N>,col:<name>,fix:<value>
+    Returns (row, col, value) or None if unparseable.
+    """
+    raw = raw.strip()
+    row_match = re.search(r"row\s*[:=]\s*(\d+)", raw, re.IGNORECASE)
+    col_match = re.search(r"col(?:umn)?\s*[:=]\s*([\w_]+)", raw, re.IGNORECASE)
+    fix_match = re.search(r"fix\s*[:=]\s*(.+?)$", raw, re.IGNORECASE)
+    if row_match and col_match and fix_match:
+        return (int(row_match.group(1)), col_match.group(1).lower(), fix_match.group(1).strip())
+    return None
 def compute_f1(reported_keys: Set[str], planted_keys: Set[str]) -> dict:
     """Compute precision, recall, and F1 score."""
     if not reported_keys and not planted_keys:
     if not planted_keys:
         return {"weighted_reward": 0.0, "difficulty_found": 0.0, "difficulty_missed": 0.0}
     found_keys = reported_keys & planted_keys
     missed_keys = planted_keys - reported_keys
     false_positive_count = len(reported_keys - planted_keys)
     difficulty_missed = sum(planted_by_key[k].difficulty for k in missed_keys)
     total_weight = sum(i.difficulty for i in planted_issues)
     weighted_recall = difficulty_found / total_weight if total_weight > 0 else 0.0
     avg_difficulty = total_weight / len(planted_issues)
     fp_penalty_weight = false_positive_count * avg_difficulty
     weighted_precision = difficulty_found / (difficulty_found + fp_penalty_weight) if (difficulty_found + fp_penalty_weight) > 0 else 0.0
     if (weighted_precision + weighted_recall) > 0:
         weighted_reward = 2 * weighted_precision * weighted_recall / (weighted_precision + weighted_recall)
     else:
     }
+def grade_fixes(
+    fixes: list[tuple[int, str, str]],
+    task: Task,
+) -> dict:
+    """
+    Grade proposed fixes against the clean dataset.
+    For each fix (row, col, proposed_value), compare to the original clean value.
+    Scoring per fix:
+      - Exact match (case-insensitive, whitespace-stripped): 1.0
+      - Numeric close match (within 1%): 0.8
+      - Correct column but wrong value: 0.1
+      - Targets a non-issue cell: 0.0 (penalty)
+    Returns dict with fix_score (0.0-1.0), details per fix, and counts.
+    """
+    if not fixes and not task.planted_issues:
+        return {"fix_score": 1.0, "fixes_correct": 0, "fixes_partial": 0,
+                "fixes_wrong": 0, "fixes_attempted": 0, "fix_details": []}
+    if not fixes:
+        return {"fix_score": 0.0, "fixes_correct": 0, "fixes_partial": 0,
+                "fixes_wrong": 0, "fixes_attempted": 0, "fix_details": []}
+    issue_map = task.get_planted_issue_map()
+    # Build set of (row, col) that are actual issues
+    issue_cells = {(issue.row, issue.col) for issue in task.planted_issues}
+    total_weight = sum(i.difficulty for i in task.planted_issues) if task.planted_issues else 1.0
+    earned_weight = 0.0
+    fixes_correct = 0
+    fixes_partial = 0
+    fixes_wrong = 0
+    fix_details = []
+    # Track which issues have been fixed (best fix wins)
+    fixed_issues: dict[tuple[int, str], float] = {}
+    for row, col, proposed in fixes:
+        clean_value = task.get_clean_value(row, col)
+        cell_key = (row, col)
+        if cell_key not in issue_cells:
+            # Fix targets a non-issue cell — no credit
+            fix_details.append({"row": row, "col": col, "score": 0.0, "reason": "not an issue cell"})
+            fixes_wrong += 1
+            continue
+        if clean_value is None:
+            fix_details.append({"row": row, "col": col, "score": 0.0, "reason": "cell not found"})
+            fixes_wrong += 1
+            continue
+        # Find the planted issue for this cell to get its difficulty weight
+        matching_issue = None
+        for issue in task.planted_issues:
+            if issue.row == row and issue.col == col:
+                matching_issue = issue
+                break
+        difficulty = matching_issue.difficulty if matching_issue else 1.0
+        # Score the fix
+        score = 0.0
+        reason = "wrong value"
+        # Exact match (case-insensitive, whitespace-stripped)
+        if proposed.strip().lower() == clean_value.lower():
+            score = 1.0
+            reason = "exact match"
+            fixes_correct += 1
+        else:
+            # Try numeric close match
+            try:
+                proposed_num = float(proposed.strip())
+                clean_num = float(clean_value)
+                if clean_num != 0 and abs(proposed_num - clean_num) / abs(clean_num) <= 0.01:
+                    score = 0.8
+                    reason = "numeric close match"
+                    fixes_partial += 1
+                elif proposed_num == clean_num:
+                    score = 1.0
+                    reason = "exact numeric match"
+                    fixes_correct += 1
+                else:
+                    score = 0.1
+                    reason = "correct cell, wrong value"
+                    fixes_partial += 1
+            except (ValueError, ZeroDivisionError):
+                # Not numeric — just a wrong value but at least right cell
+                score = 0.1
+                reason = "correct cell, wrong value"
+                fixes_partial += 1
+        # Keep best fix per cell
+        if cell_key not in fixed_issues or score > fixed_issues[cell_key]:
+            fixed_issues[cell_key] = score
+        fix_details.append({"row": row, "col": col, "score": score, "reason": reason})
+    # Compute fix score: weighted sum of best fix per issue / total weight
+    for issue in task.planted_issues:
+        cell_key = (issue.row, issue.col)
+        if cell_key in fixed_issues:
+            earned_weight += issue.difficulty * fixed_issues[cell_key]
+    fix_score = earned_weight / total_weight if total_weight > 0 else 0.0
+    fix_score = min(max(fix_score, 0.0), 1.0)
+    return {
+        "fix_score": round(fix_score, 4),
+        "fixes_correct": fixes_correct,
+        "fixes_partial": fixes_partial,
+        "fixes_wrong": fixes_wrong,
+        "fixes_attempted": len(fixes),
+        "fix_details": fix_details,
+    }
 class DataQAEnvironment(Environment):
     """
+    Data Quality Assurance environment — two-phase identify + fix.
+    Phase 1 (Identify): Agent inspects corrupted datasets and reports quality issues.
+    Phase 2 (Fix):      Agent proposes corrections for identified issues.
+    Combined reward = 0.6 * identify_score + 0.4 * fix_score
+    Both phases use difficulty-weighted scoring for rich per-step reward signals.
     """
     SUPPORTS_CONCURRENT_SESSIONS = True
             schema_description=self._current_task.schema_description,
             validation_rules=self._current_task.validation_rules,
             task_description=self._current_task.description,
+            feedback=(
+                "Environment reset. Inspect the dataset and report all quality issues.\n"
+                "You can also propose fixes in format: row:<N>,col:<name>,fix:<corrected_value>\n"
+                "Combined reward = 0.6 * identify_score + 0.4 * fix_score"
+            ),
             task_id=task_id,
             num_issues_hint=len(self._current_task.planted_issues),
             max_steps=self._current_task.max_steps,
         if not isinstance(action, DataQAAction):
             raise ValueError(f"Expected DataQAAction, got {type(action)}")
+        # Auto-reset in stateless HTTP mode
         if self._current_task is None:
             self.reset(task_id=action.task_id)
         self._state.step_count += 1
         self._state.current_step += 1
+        # ─�� Phase 1: Parse and score issue identification ──
         reported_keys: Set[str] = set()
         parse_errors: list[str] = []
         for raw_issue in action.issues:
             if key:
                 reported_keys.add(key)
             else:
+                parse_errors.append(f"Could not parse issue: '{raw_issue}'")
         metrics = compute_f1(reported_keys, self._planted_keys)
+        identify_f1 = metrics["f1"]
         weighted = compute_weighted_reward(reported_keys, self._current_task.planted_issues)
+        identify_score = weighted["weighted_reward"]
+        # ── Phase 2: Parse and score proposed fixes ──
+        parsed_fixes: list[tuple[int, str, str]] = []
+        for raw_fix in action.fixes:
+            fix = parse_fix(raw_fix)
+            if fix:
+                parsed_fixes.append(fix)
+            else:
+                parse_errors.append(f"Could not parse fix: '{raw_fix}'")
+        fix_result = grade_fixes(parsed_fixes, self._current_task)
+        fix_score = fix_result["fix_score"]
+        # ── Combined reward ──
+        # If no fixes submitted, score is identify-only (no penalty for not fixing)
+        if action.fixes:
+            combined_reward = IDENTIFY_WEIGHT * identify_score + FIX_WEIGHT * fix_score
+        else:
+            combined_reward = identify_score  # backward compatible
+        self._best_score = max(self._best_score, combined_reward)
         self._state.best_score = self._best_score
+        # ── Check if done ──
         is_done = (
+            identify_f1 >= 0.999  # Perfect identification
             or self._state.current_step >= self._state.max_steps
         )
+        # ── Build feedback with actionable diagnostics ──
+        # Show the agent exactly which reported issues were correct (TP) and which were wrong (FP)
+        tp_keys = reported_keys & self._planted_keys
+        fp_keys = reported_keys - self._planted_keys
         feedback_lines = [
             f"Step {self._state.current_step}/{self._state.max_steps}",
+            "",
+            "--- Identification ---",
             f"Issues reported: {len(reported_keys)}",
             f"True positives: {metrics['tp']}, False positives: {metrics['fp']}, Missed: {metrics['fn']}",
+            f"Precision: {metrics['precision']:.3f}, Recall: {metrics['recall']:.3f}, F1: {identify_f1:.3f}",
+            f"Identify score (weighted): {identify_score:.3f}",
         ]
+        # Show which reported issues were correct vs wrong (helps agent self-correct)
+        if tp_keys:
+            feedback_lines.append(f"Correct issues: {', '.join(sorted(tp_keys))}")
+        if fp_keys:
+            feedback_lines.append(f"Incorrect issues (false positives): {', '.join(sorted(fp_keys))}")
+        if action.fixes:
+            feedback_lines += [
+                "",
+                "--- Fix Proposals ---",
+                f"Fixes attempted: {fix_result['fixes_attempted']}",
+                f"Correct: {fix_result['fixes_correct']}, Partial: {fix_result['fixes_partial']}, Wrong: {fix_result['fixes_wrong']}",
+                f"Fix score: {fix_score:.3f}",
+            ]
+            # Show per-fix feedback so agent knows which fixes worked
+            for detail in fix_result["fix_details"]:
+                status = "correct" if detail["score"] >= 0.99 else ("partial" if detail["score"] > 0 else "wrong")
+                feedback_lines.append(
+                    f"  row:{detail['row']},col:{detail['col']} -> {status} ({detail['reason']})"
+                )
+            feedback_lines.append(
+                f"\n--- Combined Reward: {combined_reward:.3f} (identify={identify_score:.3f} x {IDENTIFY_WEIGHT} + fix={fix_score:.3f} x {FIX_WEIGHT}) ---"
+            )
+        else:
+            feedback_lines += [
+                "",
+                "Tip: Submit fixes with format row:<N>,col:<name>,fix:<value> for bonus reward.",
+            ]
         if parse_errors:
+            feedback_lines.append(f"\nParse errors ({len(parse_errors)}): {'; '.join(parse_errors[:5])}")
         if not is_done:
             if metrics["fn"] > 0:
                 feedback_lines.append(
+                    f"\nYou missed {metrics['fn']} issue(s). Review the dataset carefully."
                 )
             if metrics["fp"] > 0:
                 feedback_lines.append(
+                    f"Remove the {metrics['fp']} false positive(s) listed above and look for real issues."
                 )
+            feedback_lines.append("You can submit again with updated issues and/or fixes.")
         else:
+            feedback_lines.append(f"\nTask complete! Final best reward: {self._best_score:.3f}")
+        # ── Flag items for human review ──
+        # In a production data QA pipeline, these would go to a human reviewer.
+        # The grader flags cases where automated scoring has low confidence.
+        human_review_flags: list[dict] = []
+        # 1. False positives that target real columns — could be legitimate issues
+        #    the task designer didn't plant (agent may be smarter than the grader)
+        issue_map = self._current_task.get_planted_issue_map()
+        valid_issue_types = {"missing_value", "wrong_type", "duplicate_row", "out_of_range",
+                             "format_violation", "inconsistent_value", "statistical_outlier",
+                             "referential_integrity"}
+        for fp_key in fp_keys:
+            parts = fp_key.split(",")
+            itype = parts[2].split(":")[1] if len(parts) >= 3 else ""
+            if itype in valid_issue_types:
+                human_review_flags.append({
+                    "item": fp_key,
+                    "reason": "Agent reported this issue but it's not in ground truth — may be a real issue the grader missed",
+                    "type": "possible_unplanted_issue",
+                })
+        # 2. Partial fix matches — fix was close but not exact, human should verify
+        for detail in fix_result["fix_details"]:
+            if 0 < detail["score"] < 0.99:
+                human_review_flags.append({
+                    "item": f"row:{detail['row']},col:{detail['col']}",
+                    "reason": f"Fix scored {detail['score']:.2f} ({detail['reason']}) — human should verify if acceptable",
+                    "type": "partial_fix",
+                })
+        # 3. High-difficulty issues that were missed — flag for training data review
+        planted_by_key = {i.to_key(): i for i in self._current_task.planted_issues}
+        fn_keys = self._planted_keys - reported_keys
+        for fn_key in fn_keys:
+            issue = planted_by_key.get(fn_key)
+            if issue and issue.difficulty >= 2.5:
+                human_review_flags.append({
+                    "item": fn_key,
+                    "reason": f"High-difficulty issue (difficulty={issue.difficulty}) missed — {issue.description}",
+                    "type": "missed_hard_issue",
+                })
+        if human_review_flags:
+            feedback_lines.append(f"\n--- Flagged for Human Review ({len(human_review_flags)}) ---")
+            for flag in human_review_flags:
+                feedback_lines.append(f"  [{flag['type']}] {flag['item']}: {flag['reason']}")
         return DataQAObservation(
             dataset_csv=self._current_task.corrupted_csv,
             done=is_done,
             reward=self._best_score,
             metadata={
+                "identify_f1": identify_f1,
+                "identify_score": identify_score,
+                "fix_score": fix_score,
+                "combined_reward": combined_reward,
                 "precision": metrics["precision"],
                 "recall": metrics["recall"],
                 "tp": metrics["tp"],
                 "fn": metrics["fn"],
                 "difficulty_found": weighted["difficulty_found"],
                 "difficulty_missed": weighted["difficulty_missed"],
+                "fixes_correct": fix_result["fixes_correct"],
+                "fixes_partial": fix_result["fixes_partial"],
+                "fixes_wrong": fix_result["fixes_wrong"],
+                "fixes_attempted": fix_result["fixes_attempted"],
+                "fix_details": fix_result["fix_details"],
+                "human_review_flags": human_review_flags,
             },
         )

dataqa_env/server/gradio_ui.py ADDED Viewed

	@@ -0,0 +1,487 @@

+"""
+Gradio UI — Agent Trajectory Replay Viewer for DataQA.
+Designed for judges: zero clicks needed, auto-plays on load.
+Tab per task, step slider, prominent metric cards, color-coded dataset.
+"""
+from __future__ import annotations
+import csv
+import io
+import gradio as gr
+from .environment import DataQAEnvironment, parse_issue_key
+from .tasks import list_tasks, PlantedIssue
+from ..models import DataQAAction
+# ── Pre-built agent trajectories (simulates baseline agent) ──
+AGENT_TRAJECTORIES = {
+    "easy": [
+        {
+            "issues": [
+                "row:4,col:name,issue:missing_value",
+                "row:7,col:salary,issue:wrong_type",
+                "row:9,col:salary,issue:out_of_range",
+                "row:3,col:email,issue:format_violation",  # FP
+            ],
+            "fixes": [],
+        },
+        {
+            "issues": [
+                "row:4,col:name,issue:missing_value",
+                "row:7,col:salary,issue:wrong_type",
+                "row:9,col:salary,issue:out_of_range",
+                "row:11,col:employee_id,issue:duplicate_row",
+            ],
+            "fixes": [
+                "row:4,col:name,fix:David Kim",
+                "row:7,col:salary,fix:75000",
+                "row:9,col:salary,fix:73000",
+            ],
+        },
+    ],
+    "medium": [
+        {
+            "issues": [
+                "row:5,col:total,issue:inconsistent_value",
+                "row:10,col:category,issue:format_violation",
+                "row:14,col:product_name,issue:missing_value",
+                "row:17,col:quantity,issue:out_of_range",
+                "row:19,col:order_id,issue:duplicate_row",
+                "row:12,col:order_date,issue:format_violation",
+            ],
+            "fixes": [
+                "row:5,col:total,fix:42.00",
+                "row:10,col:category,fix:Sports",
+                "row:12,col:order_date,fix:2024-01-26",
+                "row:14,col:product_name,fix:LED Strip Lights",
+            ],
+        },
+    ],
+    "hard": [
+        {
+            "issues": [
+                "row:14,col:training_time_hours,issue:out_of_range",
+                "row:13,col:learning_rate,issue:out_of_range",
+                "row:15,col:model_name,issue:missing_value",
+                "row:9,col:batch_size,issue:format_violation",
+                "row:10,col:train_size,issue:inconsistent_value",
+            ],
+            "fixes": [],
+        },
+        {
+            "issues": [
+                "row:14,col:training_time_hours,issue:out_of_range",
+                "row:13,col:learning_rate,issue:out_of_range",
+                "row:15,col:model_name,issue:missing_value",
+                "row:9,col:batch_size,issue:format_violation",
+                "row:10,col:train_size,issue:inconsistent_value",
+                "row:5,col:val_loss,issue:inconsistent_value",
+                "row:7,col:gpu_memory_gb,issue:statistical_outlier",
+                "row:11,col:timestamp,issue:inconsistent_value",
+                "row:9,col:training_time_hours,issue:statistical_outlier",
+                "row:12,col:test_accuracy,issue:statistical_outlier",
+            ],
+            "fixes": [
+                "row:14,col:training_time_hours,fix:72.0",
+                "row:13,col:learning_rate,fix:0.00001",
+                "row:15,col:model_name,fix:whisper-small",
+                "row:9,col:batch_size,fix:256",
+                "row:9,col:training_time_hours,fix:36.0",
+            ],
+        },
+    ],
+}
+# ── HTML rendering ──
+def _metric_card(label: str, value: str, color: str = "#333") -> str:
+    return (
+        f'<div style="text-align:center;padding:12px 16px;background:#f8f9fa;'
+        f'border-radius:8px;min-width:100px;">'
+        f'<div style="font-size:11px;color:#666;text-transform:uppercase;letter-spacing:1px;">{label}</div>'
+        f'<div style="font-size:28px;font-weight:700;color:{color};margin-top:2px;">{value}</div>'
+        f'</div>'
+    )
+def _csv_to_html(
+    csv_text: str,
+    planted: list[PlantedIssue],
+    correct: set[tuple[int, str]],
+    fp: set[tuple[int, str]],
+    missed: set[tuple[int, str]],
+    fixed: dict[tuple[int, str], str],
+    fix_values: dict[tuple[int, str], str] | None = None,
+) -> str:
+    """Render CSV as HTML with color-coded cells and inline fix proposals."""
+    fix_values = fix_values or {}
+    desc_map = {(i.row, i.col): i for i in planted}
+    reader = csv.reader(io.StringIO(csv_text.strip()))
+    rows = list(reader)
+    if not rows:
+        return ""
+    header = rows[0]
+    header_lower = [h.strip().lower() for h in header]
+    data = rows[1:]
+    t = ['<table style="border-collapse:collapse;width:100%;font-size:12px;font-family:\'SF Mono\',monospace;">']
+    t.append('<tr>')
+    t.append('<th style="border:1px solid #dee2e6;padding:6px 8px;background:#343a40;color:#fff;font-size:11px;">Row</th>')
+    for h in header:
+        t.append(f'<th style="border:1px solid #dee2e6;padding:6px 8px;background:#343a40;color:#fff;font-size:11px;">{h}</th>')
+    t.append('</tr>')
+    for i, row in enumerate(data):
+        rn = i + 1
+        bg = "#fff" if i % 2 == 0 else "#f8f9fa"
+        t.append(f'<tr style="background:{bg};">')
+        t.append(f'<td style="border:1px solid #dee2e6;padding:4px 8px;color:#adb5bd;text-align:center;font-size:11px;">{rn}</td>')
+        for j, val in enumerate(row):
+            col = header_lower[j] if j < len(header_lower) else ""
+            ck = (rn, col)
+            s = "border:1px solid #dee2e6;padding:4px 8px;"
+            tip = ""
+            badge = ""
+            issue = desc_map.get(ck)
+            if ck in correct:
+                s += "background:#d4edda;"
+                tip = f"FOUND: {issue.description}" if issue else ""
+                badge = '<span style="font-size:9px;background:#28a745;color:#fff;padding:1px 4px;border-radius:3px;margin-left:4px;">TP</span>'
+            elif ck in fp:
+                s += "background:#f8d7da;"
+                badge = '<span style="font-size:9px;background:#dc3545;color:#fff;padding:1px 4px;border-radius:3px;margin-left:4px;">FP</span>'
+            elif ck in missed:
+                s += "background:#fff3cd;"
+                tip = f"MISSED: {issue.description}" if issue else ""
+                badge = '<span style="font-size:9px;background:#856404;color:#fff;padding:1px 4px;border-radius:3px;margin-left:4px;">MISS</span>'
+            fx = fixed.get(ck)
+            proposed = fix_values.get(ck)
+            if fx == "correct":
+                s += "box-shadow:inset 0 0 0 2px #28a745;"
+                badge += '<span style="font-size:9px;background:#28a745;color:#fff;padding:1px 4px;border-radius:3px;margin-left:2px;">FIX</span>'
+            elif fx == "partial":
+                s += "box-shadow:inset 0 0 0 2px #ffc107;"
+                badge += '<span style="font-size:9px;background:#ffc107;color:#333;padding:1px 4px;border-radius:3px;margin-left:2px;">~FIX</span>'
+            dv = val if val.strip() else '<em style="color:#dc3545;font-style:italic;">empty</em>'
+            # Show proposed fix value below the corrupted value
+            fix_line = ""
+            if proposed is not None:
+                fix_color = "#28a745" if fx == "correct" else ("#b8860b" if fx == "partial" else "#dc3545")
+                fix_line = (
+                    f'<div style="font-size:10px;color:{fix_color};margin-top:2px;'
+                    f'border-top:1px dashed {fix_color};padding-top:2px;">'
+                    f'\u2192 {proposed}</div>'
+                )
+            t.append(f'<td style="{s}" title="{tip}">{dv}{badge}{fix_line}</td>')
+        t.append('</tr>')
+    t.append('</table>')
+    return "".join(t)
+LEGEND_HTML = (
+    '<div style="display:flex;gap:12px;flex-wrap:wrap;margin-top:10px;font-size:11px;">'
+    '<span style="background:#d4edda;padding:2px 8px;border-radius:4px;">Found (TP)</span>'
+    '<span style="background:#f8d7da;padding:2px 8px;border-radius:4px;">False Positive</span>'
+    '<span style="background:#fff3cd;padding:2px 8px;border-radius:4px;">Missed</span>'
+    '<span style="box-shadow:inset 0 0 0 2px #28a745;padding:2px 8px;border-radius:4px;">Fix Correct</span>'
+    '<span style="box-shadow:inset 0 0 0 2px #ffc107;padding:2px 8px;border-radius:4px;">Fix Partial</span>'
+    '</div>'
+)
+# ── Core replay logic ──
+def _replay_task(task_id: str) -> list[dict]:
+    """Run the agent trajectory and collect per-step data."""
+    env = DataQAEnvironment()
+    obs = env.reset(task_id=task_id)
+    task = env._current_task
+    planted_keys = {i.to_key() for i in task.planted_issues}
+    steps_data = []
+    # Step 0: initial state
+    steps_data.append({
+        "label": "Initial — corrupted dataset",
+        "html": _csv_to_html(obs.dataset_csv, task.planted_issues, set(), set(), set(), {}),
+        "metrics": {"reward": 0.0, "tp": 0, "fp": 0, "fn": len(task.planted_issues),
+                    "identify": 0.0, "fix": 0.0, "fixes_correct": 0},
+        "feedback": f"Task: {task.name}\nIssues to find: {obs.num_issues_hint}\n\n{task.description}",
+    })
+    trajectory = AGENT_TRAJECTORIES.get(task_id, [])
+    for i, step_data in enumerate(trajectory):
+        action = DataQAAction(
+            issues=step_data["issues"],
+            fixes=step_data.get("fixes", []),
+            task_id=task_id,
+        )
+        obs = env.step(action)
+        reported_keys = set()
+        for iss in step_data["issues"]:
+            key = parse_issue_key(iss)
+            if key:
+                reported_keys.add(key)
+        tp_keys = reported_keys & planted_keys
+        fp_keys = reported_keys - planted_keys
+        fn_keys = planted_keys - reported_keys
+        correct = {_kc(k) for k in tp_keys}
+        fp = {_kc(k) for k in fp_keys}
+        missed = {_kc(k) for k in fn_keys} if obs.done else set()
+        fixed: dict[tuple[int, str], str] = {}
+        for d in obs.metadata.get("fix_details", []):
+            c = (d["row"], d["col"])
+            fixed[c] = "correct" if d["score"] >= 0.99 else ("partial" if d["score"] > 0 else "wrong")
+        # Extract proposed fix values from the raw fix strings
+        fix_values: dict[tuple[int, str], str] = {}
+        from .environment import parse_fix
+        for raw_fix in step_data.get("fixes", []):
+            parsed = parse_fix(raw_fix)
+            if parsed:
+                row, col, val = parsed
+                fix_values[(row, col)] = val
+        html = _csv_to_html(obs.dataset_csv, task.planted_issues, correct, fp, missed, fixed, fix_values)
+        has_fixes = bool(step_data.get("fixes"))
+        if has_fixes:
+            label = f"Step {i+1} — identify + fix"
+        else:
+            label = f"Step {i+1} — identify only"
+        steps_data.append({
+            "label": label,
+            "html": html,
+            "metrics": {
+                "reward": obs.reward,
+                "tp": obs.metadata["tp"],
+                "fp": obs.metadata["fp"],
+                "fn": obs.metadata["fn"],
+                "identify": obs.metadata["identify_score"],
+                "fix": obs.metadata["fix_score"],
+                "fixes_correct": obs.metadata["fixes_correct"],
+            },
+            "feedback": obs.feedback,
+        })
+    return steps_data
+def _kc(key: str) -> tuple[int, str]:
+    parts = key.split(",")
+    return (int(parts[0].split(":")[1]), parts[1].split(":")[1])
+# ── Gradio app ──
+def build_gradio_ui():
+    # Pre-compute all replays at startup
+    all_replays: dict[str, list[dict]] = {}
+    for tid in list_tasks():
+        all_replays[tid] = _replay_task(tid)
+    def show_step(task_id: str, step_idx: int):
+        replay = all_replays.get(task_id, [])
+        step_idx = int(step_idx)
+        if step_idx >= len(replay):
+            step_idx = len(replay) - 1
+        sd = replay[step_idx]
+        m = sd["metrics"]
+        # Reward color
+        r = m["reward"]
+        rc = "#28a745" if r >= 0.8 else ("#ffc107" if r >= 0.4 else "#dc3545")
+        cards = (
+            '<div style="display:flex;gap:10px;flex-wrap:wrap;margin-bottom:12px;">'
+            + _metric_card("Reward", f"{r:.2f}", rc)
+            + _metric_card("Found", str(m["tp"]), "#28a745")
+            + _metric_card("False Pos", str(m["fp"]), "#dc3545" if m["fp"] > 0 else "#28a745")
+            + _metric_card("Missed", str(m["fn"]), "#dc3545" if m["fn"] > 0 else "#28a745")
+            + _metric_card("Identify", f"{m['identify']:.2f}", "#333")
+            + _metric_card("Fix", f"{m['fix']:.2f}", "#333")
+            + '</div>'
+        )
+        full_html = (
+            f'<div style="font-size:14px;font-weight:600;margin-bottom:8px;color:#495057;">'
+            f'{sd["label"]}</div>'
+            + cards + sd["html"] + LEGEND_HTML
+        )
+        return full_html, sd["feedback"]
+    def on_task_change(task_id):
+        replay = all_replays.get(task_id, [])
+        max_step = len(replay) - 1
+        html, fb = show_step(task_id, 0)
+        return (
+            gr.update(maximum=max_step, value=0),
+            html,
+            fb,
+        )
+    def on_step_change(task_id, step_idx):
+        html, fb = show_step(task_id, step_idx)
+        return html, fb
+    # ── Live agent runner (connects to the env server) ──
+    live_env = DataQAEnvironment()
+    live_state: dict = {"obs": None, "task_id": "easy", "steps": []}
+    def live_reset(task_id):
+        obs = live_env.reset(task_id=task_id)
+        task = live_env._current_task
+        live_state["obs"] = obs
+        live_state["task_id"] = task_id
+        live_state["steps"] = []
+        html = _csv_to_html(obs.dataset_csv, task.planted_issues, set(), set(), set(), {})
+        info = f"**{task.name}** — {obs.num_issues_hint} issues to find, {obs.max_steps} steps max"
+        return html, info, "", "0.000"
+    def live_step(issues_text, fixes_text):
+        if live_state["obs"] is None:
+            return "Reset first.", "", "", ""
+        obs = live_state["obs"]
+        task = live_env._current_task
+        planted_keys = {i.to_key() for i in task.planted_issues}
+        issues = [l.strip() for l in issues_text.strip().split("\n") if l.strip()]
+        fixes = [l.strip() for l in fixes_text.strip().split("\n") if l.strip()] if fixes_text.strip() else []
+        action = DataQAAction(issues=issues, fixes=fixes, task_id=live_state["task_id"])
+        obs = live_env.step(action)
+        live_state["obs"] = obs
+        reported_keys = set()
+        for iss in issues:
+            key = parse_issue_key(iss)
+            if key:
+                reported_keys.add(key)
+        tp_keys = reported_keys & planted_keys
+        fp_keys = reported_keys - planted_keys
+        fn_keys = planted_keys - reported_keys
+        correct = {_kc(k) for k in tp_keys}
+        fp_set = {_kc(k) for k in fp_keys}
+        missed = {_kc(k) for k in fn_keys} if obs.done else set()
+        fixed: dict[tuple[int, str], str] = {}
+        for d in obs.metadata.get("fix_details", []):
+            c = (d["row"], d["col"])
+            fixed[c] = "correct" if d["score"] >= 0.99 else ("partial" if d["score"] > 0 else "wrong")
+        from .environment import parse_fix
+        fix_values: dict[tuple[int, str], str] = {}
+        for raw in fixes:
+            parsed = parse_fix(raw)
+            if parsed:
+                fix_values[(parsed[0], parsed[1])] = parsed[2]
+        html = _csv_to_html(obs.dataset_csv, task.planted_issues, correct, fp_set, missed, fixed, fix_values)
+        m = obs.metadata
+        r = obs.reward
+        rc = "#28a745" if r >= 0.8 else ("#ffc107" if r >= 0.4 else "#dc3545")
+        cards = (
+            '<div style="display:flex;gap:10px;flex-wrap:wrap;margin-bottom:12px;">'
+            + _metric_card("Reward", f"{r:.2f}", rc)
+            + _metric_card("Found", str(m["tp"]), "#28a745")
+            + _metric_card("False Pos", str(m["fp"]), "#dc3545" if m["fp"] > 0 else "#28a745")
+            + _metric_card("Missed", str(m["fn"]), "#dc3545" if m["fn"] > 0 else "#28a745")
+            + '</div>'
+        )
+        full_html = cards + html + LEGEND_HTML
+        return full_html, obs.feedback, f"{r:.3f}", ""
+    # ── Build the UI ──
+    with gr.Blocks(title="DataQA Environment") as demo:
+        gr.Markdown(
+            "# DataQA — Data Quality Assurance Environment\n"
+            "Two-phase RL environment: **Identify** data quality issues, then **Fix** them."
+        )
+        with gr.Tabs():
+            # ── Tab 1: Demo replay ──
+            with gr.Tab("Demo (Baseline Agent)"):
+                gr.Markdown(
+                    "*Replay of the baseline Qwen-72B agent. "
+                    "Use the slider to step through the agent's trajectory.*"
+                )
+                with gr.Row():
+                    task_dd = gr.Dropdown(choices=list_tasks(), value="easy", label="Task", scale=1)
+                    step_slider = gr.Slider(minimum=0, maximum=2, step=1, value=0, label="Step", scale=3)
+                viz_html = gr.HTML()
+                feedback_box = gr.Textbox(label="Agent Feedback", lines=10, interactive=False)
+                task_dd.change(on_task_change, inputs=[task_dd], outputs=[step_slider, viz_html, feedback_box])
+                step_slider.change(on_step_change, inputs=[task_dd, step_slider], outputs=[viz_html, feedback_box])
+                demo.load(on_task_change, inputs=[task_dd], outputs=[step_slider, viz_html, feedback_box])
+            # ── Tab 2: Try your own agent ──
+            with gr.Tab("Try Your Own Agent"):
+                gr.Markdown(
+                    "*Submit your own issues and fixes to see how the environment scores them. "
+                    "This is the same environment the baseline agent talks to.*"
+                )
+                with gr.Row():
+                    live_task_dd = gr.Dropdown(choices=list_tasks(), value="easy", label="Task", scale=1)
+                    live_reset_btn = gr.Button("Reset", variant="primary", scale=1)
+                with gr.Row():
+                    live_info = gr.Markdown()
+                    live_reward = gr.Textbox(label="Reward", interactive=False, scale=1)
+                live_viz = gr.HTML()
+                with gr.Row():
+                    live_issues = gr.Textbox(
+                        label="Issues (one per line)",
+                        placeholder="row:4,col:name,issue:missing_value\nrow:7,col:salary,issue:wrong_type",
+                        lines=5,
+                    )
+                    live_fixes = gr.Textbox(
+                        label="Fixes (one per line, optional)",
+                        placeholder="row:4,col:name,fix:David Kim\nrow:7,col:salary,fix:75000",
+                        lines=5,
+                    )
+                live_step_btn = gr.Button("Submit Step", variant="primary")
+                live_feedback = gr.Textbox(label="Feedback", lines=10, interactive=False)
+                live_reset_btn.click(
+                    live_reset, inputs=[live_task_dd],
+                    outputs=[live_viz, live_info, live_feedback, live_reward],
+                )
+                live_step_btn.click(
+                    live_step, inputs=[live_issues, live_fixes],
+                    outputs=[live_viz, live_feedback, live_reward, live_issues],
+                )
+    return demo
+if __name__ == "__main__":
+    demo = build_gradio_ui()
+    demo.launch()

dataqa_env/server/tasks.py CHANGED Viewed

@@ -43,6 +43,28 @@ class Task:
     corrupted_csv: str = ""
     max_steps: int = 3
 def _csv_to_rows(csv_text: str) -> List[List[str]]:
     reader = csv.reader(io.StringIO(csv_text.strip()))
@@ -354,6 +376,32 @@ EXP-015,whisper-small,common-voice,520000,16000,16000,0.00005,16,5,0.55,0.68,0.0
     issues.append(PlantedIssue(row=r + 1, col="model_name", issue_type="missing_value",
                                description="model_name is whitespace-only", difficulty=2.5))
     corrupted = _rows_to_csv([header] + data)
     return Task(

     corrupted_csv: str = ""
     max_steps: int = 3
+    def get_clean_value(self, row: int, col: str) -> str | None:
+        """
+        Look up the original clean value for a given (row, col).
+        Row is 1-indexed (data row after header).
+        Returns None if row/col is out of bounds or column not found.
+        """
+        rows = _csv_to_rows(self.clean_csv)
+        if len(rows) < 2:
+            return None
+        header = [h.strip().lower() for h in rows[0]]
+        if col.lower() not in header:
+            return None
+        col_idx = header.index(col.lower())
+        data_row_idx = row  # row is 1-indexed, rows[0] is header, so rows[row] is the data row
+        if data_row_idx < 1 or data_row_idx >= len(rows):
+            return None
+        return rows[data_row_idx][col_idx].strip()
+    def get_planted_issue_map(self) -> dict:
+        """Return dict mapping issue key -> PlantedIssue for quick lookups."""
+        return {issue.to_key(): issue for issue in self.planted_issues}
 def _csv_to_rows(csv_text: str) -> List[List[str]]:
     reader = csv.reader(io.StringIO(csv_text.strip()))
     issues.append(PlantedIssue(row=r + 1, col="model_name", issue_type="missing_value",
                                description="model_name is whitespace-only", difficulty=2.5))
+    # Issue 9: Training time impossibly fast for dataset size and epochs
+    # EXP-004: vit-base on imagenet-1k, 300 epochs, but only 96 hours is plausible.
+    # Let's make EXP-009: efficientnet-b0 on imagenet-1k, 350 epochs = should take ~40+ hours
+    # but we set it to 0.5 hours — impossible for 1.2M images * 350 epochs
+    r = 8  # EXP-009 (same row as batch_size issue, different column)
+    data[r][13] = "0.5"  # 30 minutes for 350 epochs on imagenet? impossible
+    issues.append(PlantedIssue(row=r + 1, col="training_time_hours", issue_type="statistical_outlier",
+                               description="0.5 hours for 350 epochs on imagenet-1k (1.2M images) is impossibly fast",
+                               difficulty=3.0))
+    # Issue 10: test_accuracy of 95.1% for roberta-large on SST-2 with train_size=500
+    # is suspiciously high — SOTA is ~96% with full dataset (67k). With only 500 training
+    # samples, 95.1% accuracy suggests data contamination or evaluation bug
+    r = 9  # EXP-010 (same row as train_size issue, different column)
+    # train_size is already corrupted to 500, but the test_accuracy 95.1 is from the
+    # original full-dataset run — this cross-column inconsistency is the real issue
+    # We don't modify the value — the inconsistency emerges from the train_size corruption
+    # So let's use a different row. EXP-001: resnet50 on imagenet, accuracy 76.3 is fine.
+    # Instead: EXP-012 wav2vec2 on librispeech — set test_accuracy to 98.5 (way too high
+    # for a speech model with only 20 epochs, SOTA is ~96% with much more training)
+    r = 11  # EXP-012
+    data[r][11] = "98.5"  # wav2vec2 with 20 epochs shouldn't hit 98.5% — SOTA is ~96%
+    issues.append(PlantedIssue(row=r + 1, col="test_accuracy", issue_type="statistical_outlier",
+                               description="test_accuracy 98.5% for wav2vec2 with only 20 epochs exceeds known SOTA (~96%), likely evaluation error",
+                               difficulty=3.0))
     corrupted = _rows_to_csv([header] + data)
     return Task(

inference.py CHANGED Viewed

@@ -1,8 +1,11 @@
 #!/usr/bin/env python3
 """
-DataQA Inference Script
------------------------
-LLM agent that plays the DataQA environment.
 Uses the OpenAI client to interact with any OpenAI-compatible LLM API.
 Required environment variables:
@@ -92,10 +95,10 @@ class EnvHTTPClient:
         r.raise_for_status()
         return r.json()
-    def step(self, issues: list[str], task_id: str = "easy") -> dict:
         r = self.session.post(
             f"{self.base_url}/step",
-            json={"action": {"issues": issues, "task_id": task_id}},
             timeout=30,
         )
         r.raise_for_status()
@@ -103,10 +106,10 @@ class EnvHTTPClient:
 # ---------------------------------------------------------------------------
-# LLM Agent
 # ---------------------------------------------------------------------------
-SYSTEM_PROMPT = """You are a data quality analyst. Your job is to inspect datasets and identify data quality issues.
 You will be given:
 1. A dataset in CSV format
@@ -141,7 +144,26 @@ Respond with ONLY the list of issues, one per line. No other text.
 Example: row:3,col:salary,issue:missing_value"""
-def build_user_prompt(observation: dict) -> str:
     obs = observation if isinstance(observation, dict) else observation
     parts = []
@@ -160,6 +182,12 @@ def build_user_prompt(observation: dict) -> str:
     if feedback and "reset" not in feedback.lower():
         parts.append(f"FEEDBACK FROM PREVIOUS ATTEMPT:\n{feedback}")
     return "\n\n".join(parts)
@@ -170,7 +198,6 @@ def parse_llm_response(response: str) -> list[str]:
         line = line.strip()
         if not line:
             continue
-        # Remove numbering like "1. " or "- " or "* "
         line = re.sub(r"^\s*[\d]+[.\)]\s*", "", line)
         line = re.sub(r"^\s*[-*]\s*", "", line)
         line = line.strip()
@@ -186,8 +213,60 @@ def parse_llm_response(response: str) -> list[str]:
     return issues
 def run_task(client: OpenAI, env: EnvHTTPClient, task_id: str) -> float:
-    """Run a single task and return the best score."""
     log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
     rewards: List[float] = []
@@ -196,48 +275,38 @@ def run_task(client: OpenAI, env: EnvHTTPClient, task_id: str) -> float:
     success = False
     try:
-        # Reset environment for this task
         reset_response = env.reset(task_id=task_id)
         observation = reset_response.get("observation", reset_response)
-        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
         for step_num in range(1, MAX_STEPS_PER_TASK + 1):
-            user_prompt = build_user_prompt(observation)
-            messages_for_call = messages + [{"role": "user", "content": user_prompt}]
-            # Call LLM with retry on rate limit
-            llm_output = ""
             error_msg = None
-            for attempt in range(3):
-                try:
-                    response = client.chat.completions.create(
-                        model=MODEL_NAME,
-                        messages=messages_for_call,
-                        temperature=0.1,
-                        max_tokens=2048,
-                    )
-                    llm_output = response.choices[0].message.content or ""
-                    break
-                except Exception as e:
-                    if "rate_limit" in str(e).lower() or "429" in str(e):
-                        wait = 10 * (attempt + 1)
-                        print(f"[DEBUG] Rate limited, waiting {wait}s...", file=sys.stderr, flush=True)
-                        time.sleep(wait)
-                    else:
-                        error_msg = str(e)
-                        print(f"[DEBUG] LLM call failed: {e}", file=sys.stderr, flush=True)
-                        break
-            # Parse issues from LLM response
-            issues = parse_llm_response(llm_output)
-            action_str = ";".join(issues) if issues else "none"
             if not issues and not error_msg:
                 error_msg = "no issues parsed from LLM response"
-            # Submit to environment
-            step_response = env.step(issues, task_id=task_id)
             observation = step_response.get("observation", step_response)
             reward = float(step_response.get("reward", 0.0) or 0.0)
@@ -257,9 +326,8 @@ def run_task(client: OpenAI, env: EnvHTTPClient, task_id: str) -> float:
             if done:
                 break
-            # Add context for next attempt
-            messages.append({"role": "user", "content": user_prompt})
-            messages.append({"role": "assistant", "content": llm_output})
         success = best_score >= 0.5
@@ -279,21 +347,18 @@ def main():
     print(f"[DEBUG] API_BASE_URL={API_BASE_URL}", file=sys.stderr, flush=True)
     print(f"[DEBUG] MODEL_NAME={MODEL_NAME}", file=sys.stderr, flush=True)
-    # Initialize clients
     env = EnvHTTPClient(ENV_URL)
     llm_client = OpenAI(
         base_url=API_BASE_URL,
         api_key=API_KEY or "no-key",
     )
-    # Check environment health
     if not env.health():
         print("[DEBUG] Environment is not healthy. Exiting.", file=sys.stderr, flush=True)
         sys.exit(1)
     print(f"[DEBUG] Environment is healthy", file=sys.stderr, flush=True)
-    # Run all tasks
     scores = {}
     for task_id in TASKS:
         try:
@@ -303,7 +368,6 @@ def main():
             print(f"[DEBUG] Task {task_id} failed: {e}", file=sys.stderr, flush=True)
             scores[task_id] = 0.0
-    # Summary to stderr (stdout is reserved for structured logs only)
     avg_score = sum(scores.values()) / len(scores) if scores else 0.0
     print(f"\n[DEBUG] FINAL RESULTS: {scores} avg={avg_score:.3f}", file=sys.stderr, flush=True)

 #!/usr/bin/env python3
 """
+DataQA Inference Script — Two-Phase Agent
+------------------------------------------
+LLM agent that plays the DataQA environment in two phases:
+  Phase 1: Identify all data quality issues
+  Phase 2: Propose fixes for identified issues
 Uses the OpenAI client to interact with any OpenAI-compatible LLM API.
 Required environment variables:
         r.raise_for_status()
         return r.json()
+    def step(self, issues: list[str], fixes: list[str], task_id: str = "easy") -> dict:
         r = self.session.post(
             f"{self.base_url}/step",
+            json={"action": {"issues": issues, "fixes": fixes, "task_id": task_id}},
             timeout=30,
         )
         r.raise_for_status()
 # ---------------------------------------------------------------------------
+# LLM Prompts
 # ---------------------------------------------------------------------------
+IDENTIFY_SYSTEM_PROMPT = """You are a data quality analyst. Your job is to inspect datasets and identify data quality issues.
 You will be given:
 1. A dataset in CSV format
 Example: row:3,col:salary,issue:missing_value"""
+FIX_SYSTEM_PROMPT = """You are a data repair specialist. You have already identified data quality issues in a dataset. Now you must propose the correct values to fix each issue.
+For each issue you identified, propose a fix in EXACTLY this format:
+row:<row_number>,col:<column_name>,fix:<corrected_value>
+Guidelines for proposing fixes:
+- For missing_value: infer the correct value from context, schema, and other rows
+- For wrong_type: convert to the correct type (e.g., "seventy-five thousand" → "75000")
+- For out_of_range: propose a value within the valid range that makes sense in context
+- For format_violation: correct the format (e.g., "26/01/2024" → "2024-01-26")
+- For inconsistent_value: compute the correct value from related fields
+- For duplicate_row: propose a corrected unique key or indicate removal
+- For statistical_outlier: propose a reasonable value given the model/context
+Use the schema, validation rules, and surrounding data to determine the correct fix.
+Respond with ONLY the list of fixes, one per line. No other text.
+Example: row:3,col:salary,fix:75000"""
+def build_user_prompt(observation: dict, include_fixes: bool = False) -> str:
     obs = observation if isinstance(observation, dict) else observation
     parts = []
     if feedback and "reset" not in feedback.lower():
         parts.append(f"FEEDBACK FROM PREVIOUS ATTEMPT:\n{feedback}")
+    if include_fixes:
+        parts.append(
+            "Now propose fixes for ALL issues. "
+            "Use format: row:<N>,col:<name>,fix:<corrected_value>"
+        )
     return "\n\n".join(parts)
         line = line.strip()
         if not line:
             continue
         line = re.sub(r"^\s*[\d]+[.\)]\s*", "", line)
         line = re.sub(r"^\s*[-*]\s*", "", line)
         line = line.strip()
     return issues
+def parse_fix_response(response: str) -> list[str]:
+    """Extract fix lines from LLM response."""
+    fixes = []
+    for line in response.strip().split("\n"):
+        line = line.strip()
+        if not line:
+            continue
+        line = re.sub(r"^\s*[\d]+[.\)]\s*", "", line)
+        line = re.sub(r"^\s*[-*]\s*", "", line)
+        line = line.strip()
+        if "row" in line.lower() and "fix" in line.lower():
+            match = re.search(
+                r"row\s*[:=]\s*(\d+)\s*[,;\s]+col(?:umn)?\s*[:=]\s*([\w_]+)\s*[,;\s]+fix\s*[:=]\s*(.+?)$",
+                line,
+                re.IGNORECASE,
+            )
+            if match:
+                normalized = f"row:{match.group(1)},col:{match.group(2).lower()},fix:{match.group(3).strip()}"
+                fixes.append(normalized)
+    return fixes
+def call_llm(client: OpenAI, system_prompt: str, user_prompt: str) -> str:
+    """Call the LLM with retry on rate limit."""
+    for attempt in range(3):
+        try:
+            response = client.chat.completions.create(
+                model=MODEL_NAME,
+                messages=[
+                    {"role": "system", "content": system_prompt},
+                    {"role": "user", "content": user_prompt},
+                ],
+                temperature=0.1,
+                max_tokens=2048,
+            )
+            return response.choices[0].message.content or ""
+        except Exception as e:
+            if "rate_limit" in str(e).lower() or "429" in str(e):
+                wait = 10 * (attempt + 1)
+                print(f"[DEBUG] Rate limited, waiting {wait}s...", file=sys.stderr, flush=True)
+                time.sleep(wait)
+            else:
+                print(f"[DEBUG] LLM call failed: {e}", file=sys.stderr, flush=True)
+                return ""
+    return ""
 def run_task(client: OpenAI, env: EnvHTTPClient, task_id: str) -> float:
+    """
+    Run a single task with two-phase strategy:
+      Step 1: Identify issues only
+      Step 2: Identify + Fix (using feedback from step 1)
+      Step 3: Refined identify + fix (if needed)
+    """
     log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
     rewards: List[float] = []
     success = False
     try:
         reset_response = env.reset(task_id=task_id)
         observation = reset_response.get("observation", reset_response)
+        last_issues: list[str] = []
+        last_llm_output = ""
         for step_num in range(1, MAX_STEPS_PER_TASK + 1):
             error_msg = None
+            # ── Phase 1: Identify issues ──
+            user_prompt = build_user_prompt(observation)
+            identify_output = call_llm(client, IDENTIFY_SYSTEM_PROMPT, user_prompt)
+            issues = parse_llm_response(identify_output)
             if not issues and not error_msg:
                 error_msg = "no issues parsed from LLM response"
+            # ── Phase 2: Propose fixes (from step 2 onward, or always if we have issues) ──
+            fixes: list[str] = []
+            if issues and step_num >= 2:
+                # Build a fix prompt that includes the identified issues
+                fix_prompt = build_user_prompt(observation, include_fixes=True)
+                fix_prompt += f"\n\nISSUES FOUND:\n" + "\n".join(issues)
+                fix_output = call_llm(client, FIX_SYSTEM_PROMPT, fix_prompt)
+                fixes = parse_fix_response(fix_output)
+            # ── Submit to environment ──
+            action_str = ";".join(issues[:5]) if issues else "none"
+            if fixes:
+                action_str += "|fixes:" + ";".join(fixes[:3])
+            step_response = env.step(issues, fixes, task_id=task_id)
             observation = step_response.get("observation", step_response)
             reward = float(step_response.get("reward", 0.0) or 0.0)
             if done:
                 break
+            last_issues = issues
+            last_llm_output = identify_output
         success = best_score >= 0.5
     print(f"[DEBUG] API_BASE_URL={API_BASE_URL}", file=sys.stderr, flush=True)
     print(f"[DEBUG] MODEL_NAME={MODEL_NAME}", file=sys.stderr, flush=True)
     env = EnvHTTPClient(ENV_URL)
     llm_client = OpenAI(
         base_url=API_BASE_URL,
         api_key=API_KEY or "no-key",
     )
     if not env.health():
         print("[DEBUG] Environment is not healthy. Exiting.", file=sys.stderr, flush=True)
         sys.exit(1)
     print(f"[DEBUG] Environment is healthy", file=sys.stderr, flush=True)
     scores = {}
     for task_id in TASKS:
         try:
             print(f"[DEBUG] Task {task_id} failed: {e}", file=sys.stderr, flush=True)
             scores[task_id] = 0.0
     avg_score = sum(scores.values()) / len(scores) if scores else 0.0
     print(f"\n[DEBUG] FINAL RESULTS: {scores} avg={avg_score:.3f}", file=sys.stderr, flush=True)

tests/test_environment.py CHANGED Viewed

@@ -1,16 +1,24 @@
-"""Tests for the DataQA environment (reset, step, scoring)."""
 import pytest
 from dataqa_env.server.environment import (
     DataQAEnvironment,
     parse_issue_key,
     compute_f1,
     compute_weighted_reward,
 )
 from dataqa_env.models import DataQAAction
-from dataqa_env.server.tasks import PlantedIssue
 class TestParseIssueKey:
     def test_standard_format(self):
         assert parse_issue_key("row:3,col:salary,issue:missing_value") == "row:3,col:salary,issue:missing_value"
@@ -28,7 +36,7 @@ class TestParseIssueKey:
         assert parse_issue_key("this is garbage") is None
     def test_partial_match(self):
-        assert parse_issue_key("row:3,col:salary") is None  # missing issue
     def test_empty_string(self):
         assert parse_issue_key("") is None
@@ -38,14 +46,49 @@ class TestParseIssueKey:
         assert result == "row:3,col:salary,issue:missing_value"
 class TestComputeF1:
     def test_perfect_match(self):
         keys = {"row:1,col:a,issue:missing_value"}
         result = compute_f1(keys, keys)
         assert result["f1"] == 1.0
-        assert result["tp"] == 1
-        assert result["fp"] == 0
-        assert result["fn"] == 0
     def test_no_reported_no_planted(self):
         result = compute_f1(set(), set())
@@ -61,9 +104,6 @@ class TestComputeF1:
         reported = {"row:99,col:x,issue:wrong_type"}
         planted = {"row:1,col:a,issue:missing_value"}
         result = compute_f1(reported, planted)
-        assert result["tp"] == 0
-        assert result["fp"] == 1
-        assert result["fn"] == 1
         assert result["f1"] == 0.0
     def test_partial_match(self):
@@ -83,6 +123,10 @@ class TestComputeF1:
         assert result["recall"] == pytest.approx(2 / 3)
 class TestComputeWeightedReward:
     def test_perfect_match(self):
         issues = [
@@ -101,14 +145,11 @@ class TestComputeWeightedReward:
         issues = [PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=2.0)]
         result = compute_weighted_reward(set(), issues)
         assert result["weighted_reward"] == 0.0
-        assert result["difficulty_missed"] == 2.0
     def test_hard_issue_worth_more(self):
         easy = PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=1.0)
         hard = PlantedIssue(row=2, col="b", issue_type="statistical_outlier", description="", difficulty=3.0)
         issues = [easy, hard]
-        # Finding only the hard issue should score higher than only the easy issue
         hard_found = compute_weighted_reward({hard.to_key()}, issues)
         easy_found = compute_weighted_reward({easy.to_key()}, issues)
         assert hard_found["weighted_reward"] > easy_found["weighted_reward"]
@@ -122,6 +163,92 @@ class TestComputeWeightedReward:
         assert r_correct["weighted_reward"] > r_with_fp["weighted_reward"]
 class TestDataQAEnvironment:
     @pytest.fixture
     def env(self):
@@ -137,6 +264,7 @@ class TestDataQAEnvironment:
         assert obs.max_steps == 3
         assert obs.done is False
         assert obs.reward == 0.0
     def test_reset_medium(self, env):
         obs = env.reset(task_id="medium")
@@ -144,11 +272,11 @@ class TestDataQAEnvironment:
     def test_reset_hard(self, env):
         obs = env.reset(task_id="hard")
-        assert obs.num_issues_hint == 8
-    def test_step_with_correct_issues(self, env):
         env.reset(task_id="easy")
-        # Submit all correct issues for easy task
         action = DataQAAction(
             issues=[
                 "row:4,col:name,issue:missing_value",
@@ -160,7 +288,46 @@ class TestDataQAEnvironment:
         )
         obs = env.step(action)
         assert obs.done is True
-        assert obs.reward >= 0.999
     def test_step_with_partial_issues(self, env):
         env.reset(task_id="easy")
@@ -186,7 +353,6 @@ class TestDataQAEnvironment:
         assert obs.done is True
     def test_auto_reset_on_step(self, env):
-        # step() without prior reset should auto-reset
         action = DataQAAction(
             issues=["row:4,col:name,issue:missing_value"],
             task_id="easy",
@@ -214,19 +380,26 @@ class TestDataQAEnvironment:
         env.step(action1)
         score_after_1 = env.state.best_score
-        # Worse submission shouldn't decrease best_score
         action2 = DataQAAction(issues=["row:99,col:x,issue:wrong_type"], task_id="easy")
         env.step(action2)
         assert env.state.best_score >= score_after_1
-    def test_metadata_included_in_observation(self, env):
         env.reset(task_id="easy")
-        action = DataQAAction(issues=["row:4,col:name,issue:missing_value"], task_id="easy")
         obs = env.step(action)
-        assert "f1" in obs.metadata
-        assert "weighted_reward" in obs.metadata
-        assert "tp" in obs.metadata
-        assert "difficulty_found" in obs.metadata
     def test_parse_error_in_feedback(self, env):
         env.reset(task_id="easy")
@@ -243,7 +416,55 @@ class TestDataQAEnvironment:
         for _ in range(3):
             action = DataQAAction(
                 issues=["row:1,col:x,issue:wrong_type", "row:99,col:y,issue:missing_value"],
                 task_id="hard",
             )
             obs = env.step(action)
             assert 0.0 <= obs.reward <= 1.0

+"""Tests for the DataQA environment (reset, step, scoring, two-phase identify+fix)."""
 import pytest
 from dataqa_env.server.environment import (
     DataQAEnvironment,
     parse_issue_key,
+    parse_fix,
     compute_f1,
     compute_weighted_reward,
+    grade_fixes,
+    IDENTIFY_WEIGHT,
+    FIX_WEIGHT,
 )
 from dataqa_env.models import DataQAAction
+from dataqa_env.server.tasks import PlantedIssue, create_task_easy, create_task_medium
+# ──────────────────────────────────────────────────────
+# Issue parsing
+# ──────────────────────────────────────────────────────
 class TestParseIssueKey:
     def test_standard_format(self):
         assert parse_issue_key("row:3,col:salary,issue:missing_value") == "row:3,col:salary,issue:missing_value"
         assert parse_issue_key("this is garbage") is None
     def test_partial_match(self):
+        assert parse_issue_key("row:3,col:salary") is None
     def test_empty_string(self):
         assert parse_issue_key("") is None
         assert result == "row:3,col:salary,issue:missing_value"
+# ──────────────────────────────────────────────────────
+# Fix parsing
+# ──────────────────────────────────────────────────────
+class TestParseFix:
+    def test_standard_format(self):
+        result = parse_fix("row:4,col:name,fix:Alice Chen")
+        assert result == (4, "name", "Alice Chen")
+    def test_with_equals(self):
+        result = parse_fix("row=4,col=name,fix=Alice Chen")
+        assert result == (4, "name", "Alice Chen")
+    def test_numeric_fix(self):
+        result = parse_fix("row:7,col:salary,fix:75000")
+        assert result == (7, "salary", "75000")
+    def test_date_fix(self):
+        result = parse_fix("row:12,col:order_date,fix:2024-01-26")
+        assert result == (12, "order_date", "2024-01-26")
+    def test_case_insensitive(self):
+        result = parse_fix("Row:4,Col:Name,Fix:Alice Chen")
+        assert result == (4, "name", "Alice Chen")
+    def test_unparseable(self):
+        assert parse_fix("garbage") is None
+        assert parse_fix("row:4,col:name") is None
+    def test_fix_with_special_chars(self):
+        result = parse_fix("row:1,col:email,fix:alice.chen@company.com")
+        assert result == (1, "email", "alice.chen@company.com")
+# ──────────────────────────────────────────────────────
+# F1 scoring
+# ──────────────────────────────────────────────────────
 class TestComputeF1:
     def test_perfect_match(self):
         keys = {"row:1,col:a,issue:missing_value"}
         result = compute_f1(keys, keys)
         assert result["f1"] == 1.0
     def test_no_reported_no_planted(self):
         result = compute_f1(set(), set())
         reported = {"row:99,col:x,issue:wrong_type"}
         planted = {"row:1,col:a,issue:missing_value"}
         result = compute_f1(reported, planted)
         assert result["f1"] == 0.0
     def test_partial_match(self):
         assert result["recall"] == pytest.approx(2 / 3)
+# ──────────────────────────────────────────────────────
+# Weighted reward
+# ──────────────────────────────────────────────────────
 class TestComputeWeightedReward:
     def test_perfect_match(self):
         issues = [
         issues = [PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=2.0)]
         result = compute_weighted_reward(set(), issues)
         assert result["weighted_reward"] == 0.0
     def test_hard_issue_worth_more(self):
         easy = PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=1.0)
         hard = PlantedIssue(row=2, col="b", issue_type="statistical_outlier", description="", difficulty=3.0)
         issues = [easy, hard]
         hard_found = compute_weighted_reward({hard.to_key()}, issues)
         easy_found = compute_weighted_reward({easy.to_key()}, issues)
         assert hard_found["weighted_reward"] > easy_found["weighted_reward"]
         assert r_correct["weighted_reward"] > r_with_fp["weighted_reward"]
+# ──────────────────────────────────────────────────────
+# Fix grading
+# ──────────────────────────────────────────────────────
+class TestGradeFixes:
+    @pytest.fixture
+    def easy_task(self):
+        return create_task_easy()
+    def test_no_fixes_no_issues(self):
+        from dataqa_env.server.tasks import Task
+        task = Task(task_id="empty", name="", description="", schema_description="",
+                    validation_rules="", clean_csv="a\n1")
+        result = grade_fixes([], task)
+        assert result["fix_score"] == 1.0
+    def test_no_fixes_submitted(self, easy_task):
+        result = grade_fixes([], easy_task)
+        assert result["fix_score"] == 0.0
+        assert result["fixes_attempted"] == 0
+    def test_exact_fix_for_missing_name(self, easy_task):
+        # Row 4 has empty name — clean value is "David Kim"
+        fixes = [(4, "name", "David Kim")]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fix_score"] > 0.0
+        assert result["fixes_correct"] == 1
+    def test_exact_fix_for_wrong_type_salary(self, easy_task):
+        # Row 7 has "seventy-five thousand" — clean value is "75000"
+        fixes = [(7, "salary", "75000")]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fixes_correct"] == 1
+    def test_numeric_close_match(self, easy_task):
+        # Row 9 has salary "5000" — clean value is "73000"
+        # Propose 73100 (within 1% of 73000)
+        fixes = [(9, "salary", "73100")]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fixes_partial"] == 1
+    def test_wrong_value_for_issue_cell(self, easy_task):
+        # Row 4 name is empty — propose wrong name
+        fixes = [(4, "name", "Wrong Person")]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fixes_partial"] == 1  # correct cell, wrong value
+        assert result["fix_score"] > 0.0  # gets partial credit
+    def test_fix_for_non_issue_cell(self, easy_task):
+        # Row 1 col name is fine — no issue there
+        fixes = [(1, "name", "Some Name")]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fixes_wrong"] == 1
+        assert result["fix_score"] == 0.0
+    def test_multiple_fixes_best_wins(self, easy_task):
+        # Submit two fixes for same cell — best one should count
+        fixes = [
+            (4, "name", "Wrong Person"),   # partial credit
+            (4, "name", "David Kim"),      # exact match
+        ]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fixes_correct"] >= 1
+    def test_all_fixes_correct(self, easy_task):
+        # Fix all 4 issues with exact values
+        fixes = [
+            (4, "name", "David Kim"),
+            (7, "salary", "75000"),
+            (9, "salary", "73000"),
+            # Row 11 is duplicate — clean value for employee_id is "Bob Martinez" row
+            # The duplicate is of row 2 (Bob Martinez), so the clean row 11 doesn't exist
+        ]
+        result = grade_fixes(fixes, easy_task)
+        assert result["fix_score"] > 0.5  # at least 3/4 issues fixed
+    def test_fix_score_bounded(self, easy_task):
+        fixes = [(4, "name", "David Kim"), (99, "x", "bad")]
+        result = grade_fixes(fixes, easy_task)
+        assert 0.0 <= result["fix_score"] <= 1.0
+# ──────────────────────────────────────────────────────
+# Full environment lifecycle
+# ──────────────────────────────────────────────────────
 class TestDataQAEnvironment:
     @pytest.fixture
     def env(self):
         assert obs.max_steps == 3
         assert obs.done is False
         assert obs.reward == 0.0
+        assert "fix" in obs.feedback.lower()  # mentions fix phase
     def test_reset_medium(self, env):
         obs = env.reset(task_id="medium")
     def test_reset_hard(self, env):
         obs = env.reset(task_id="hard")
+        assert obs.num_issues_hint == 10
+    def test_step_identify_only(self, env):
+        """Backward compatible: only issues, no fixes."""
         env.reset(task_id="easy")
         action = DataQAAction(
             issues=[
                 "row:4,col:name,issue:missing_value",
         )
         obs = env.step(action)
         assert obs.done is True
+        assert obs.reward >= 0.999  # identify-only uses identify_score directly
+    def test_step_with_fixes_increases_reward(self, env):
+        """Submitting correct fixes should increase reward beyond identify-only."""
+        env.reset(task_id="easy")
+        # Step 1: identify only
+        action1 = DataQAAction(
+            issues=[
+                "row:4,col:name,issue:missing_value",
+                "row:7,col:salary,issue:wrong_type",
+                "row:11,col:employee_id,issue:duplicate_row",
+                "row:9,col:salary,issue:out_of_range",
+            ],
+            task_id="easy",
+        )
+        obs1 = env.step(action1)
+        score_identify = obs1.reward
+        # Reset for fair comparison
+        env.reset(task_id="easy")
+        # Step with identify + fixes
+        action2 = DataQAAction(
+            issues=[
+                "row:4,col:name,issue:missing_value",
+                "row:7,col:salary,issue:wrong_type",
+                "row:11,col:employee_id,issue:duplicate_row",
+                "row:9,col:salary,issue:out_of_range",
+            ],
+            fixes=[
+                "row:4,col:name,fix:David Kim",
+                "row:7,col:salary,fix:75000",
+                "row:9,col:salary,fix:73000",
+            ],
+            task_id="easy",
+        )
+        obs2 = env.step(action2)
+        score_with_fixes = obs2.metadata["combined_reward"]
+        # With correct fixes, combined should be close to 1.0
+        assert score_with_fixes > 0.8
     def test_step_with_partial_issues(self, env):
         env.reset(task_id="easy")
         assert obs.done is True
     def test_auto_reset_on_step(self, env):
         action = DataQAAction(
             issues=["row:4,col:name,issue:missing_value"],
             task_id="easy",
         env.step(action1)
         score_after_1 = env.state.best_score
         action2 = DataQAAction(issues=["row:99,col:x,issue:wrong_type"], task_id="easy")
         env.step(action2)
         assert env.state.best_score >= score_after_1
+    def test_metadata_includes_both_phases(self, env):
         env.reset(task_id="easy")
+        action = DataQAAction(
+            issues=["row:4,col:name,issue:missing_value"],
+            fixes=["row:4,col:name,fix:David Kim"],
+            task_id="easy",
+        )
         obs = env.step(action)
+        m = obs.metadata
+        assert "identify_f1" in m
+        assert "identify_score" in m
+        assert "fix_score" in m
+        assert "combined_reward" in m
+        assert "tp" in m
+        assert "fixes_correct" in m
+        assert "fixes_attempted" in m
     def test_parse_error_in_feedback(self, env):
         env.reset(task_id="easy")
         for _ in range(3):
             action = DataQAAction(
                 issues=["row:1,col:x,issue:wrong_type", "row:99,col:y,issue:missing_value"],
+                fixes=["row:1,col:x,fix:wrong"],
                 task_id="hard",
             )
             obs = env.step(action)
             assert 0.0 <= obs.reward <= 1.0
+    def test_combined_reward_weights(self, env):
+        """Verify combined = IDENTIFY_WEIGHT * identify + FIX_WEIGHT * fix."""
+        env.reset(task_id="easy")
+        action = DataQAAction(
+            issues=[
+                "row:4,col:name,issue:missing_value",
+                "row:7,col:salary,issue:wrong_type",
+                "row:11,col:employee_id,issue:duplicate_row",
+                "row:9,col:salary,issue:out_of_range",
+            ],
+            fixes=["row:4,col:name,fix:David Kim"],
+            task_id="easy",
+        )
+        obs = env.step(action)
+        m = obs.metadata
+        expected = IDENTIFY_WEIGHT * m["identify_score"] + FIX_WEIGHT * m["fix_score"]
+        assert abs(m["combined_reward"] - expected) < 0.01
+    def test_fix_feedback_shown_when_fixes_submitted(self, env):
+        env.reset(task_id="easy")
+        action = DataQAAction(
+            issues=["row:4,col:name,issue:missing_value"],
+            fixes=["row:4,col:name,fix:David Kim"],
+            task_id="easy",
+        )
+        obs = env.step(action)
+        assert "Fix Proposals" in obs.feedback
+        assert "Combined Reward" in obs.feedback
+    def test_no_fix_penalty_when_no_fixes_submitted(self, env):
+        """If agent submits no fixes, reward = identify_score (no penalty)."""
+        env.reset(task_id="easy")
+        action = DataQAAction(
+            issues=[
+                "row:4,col:name,issue:missing_value",
+                "row:7,col:salary,issue:wrong_type",
+                "row:11,col:employee_id,issue:duplicate_row",
+                "row:9,col:salary,issue:out_of_range",
+            ],
+            task_id="easy",
+        )
+        obs = env.step(action)
+        # identify_score should be ~1.0 since all issues found
+        assert obs.reward >= 0.99
+        # combined_reward equals identify_score when no fixes
+        assert obs.metadata["combined_reward"] == obs.metadata["identify_score"]

tests/test_extensibility.py CHANGED Viewed

@@ -151,8 +151,8 @@ class TestRegisterTask:
 class TestCustomTaskInEnvironment:
-    def test_full_lifecycle(self):
-        """Custom task works end-to-end in the environment."""
         task = create_task_from_config(
             task_id="e2e_custom",
             name="E2E Custom",
@@ -171,7 +171,6 @@ class TestCustomTaskInEnvironment:
         obs = env.reset(task_id="e2e_custom")
         assert obs.num_issues_hint == 2
-        # Submit correct answers
         action = DataQAAction(
             issues=[i.to_key() for i in task.planted_issues],
             task_id="e2e_custom",
@@ -180,6 +179,37 @@ class TestCustomTaskInEnvironment:
         assert obs.done is True
         assert obs.reward >= 0.999
-        # Cleanup
         from dataqa_env.server.tasks import TASK_REGISTRY
         del TASK_REGISTRY["e2e_custom"]

 class TestCustomTaskInEnvironment:
+    def test_full_lifecycle_identify_only(self):
+        """Custom task works end-to-end with identify-only."""
         task = create_task_from_config(
             task_id="e2e_custom",
             name="E2E Custom",
         obs = env.reset(task_id="e2e_custom")
         assert obs.num_issues_hint == 2
         action = DataQAAction(
             issues=[i.to_key() for i in task.planted_issues],
             task_id="e2e_custom",
         assert obs.done is True
         assert obs.reward >= 0.999
         from dataqa_env.server.tasks import TASK_REGISTRY
         del TASK_REGISTRY["e2e_custom"]
+    def test_full_lifecycle_identify_and_fix(self):
+        """Custom task works end-to-end with both identify and fix."""
+        task = create_task_from_config(
+            task_id="e2e_fix",
+            name="E2E Fix",
+            description="End-to-end test with fixes",
+            schema_description="id: int, name: str, score: int",
+            validation_rules="No missing values",
+            clean_csv=SIMPLE_CSV,
+            contaminations=[
+                {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
+            ],
+        )
+        register_task("e2e_fix", lambda seed: task)
+        env = DataQAEnvironment()
+        env.reset(task_id="e2e_fix")
+        # Submit issues + fix
+        action = DataQAAction(
+            issues=[task.planted_issues[0].to_key()],
+            fixes=["row:1,col:name,fix:Alice"],  # clean value is "Alice"
+            task_id="e2e_fix",
+        )
+        obs = env.step(action)
+        assert obs.done is True
+        assert obs.metadata["fix_score"] > 0.0
+        assert obs.metadata["combined_reward"] > 0.0
+        from dataqa_env.server.tasks import TASK_REGISTRY
+        del TASK_REGISTRY["e2e_fix"]

tests/test_inference.py CHANGED Viewed

@@ -1,12 +1,11 @@
 """Tests for the inference script's parsing, prompt building, and log format."""
-import re
 import pytest
 import sys
 import os
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
-from inference import parse_llm_response, build_user_prompt, log_start, log_step, log_end
 class TestParseLLMResponse:
@@ -50,7 +49,7 @@ class TestParseLLMResponse:
     def test_deduplication_not_applied(self):
         response = "row:1,col:name,issue:missing_value\nrow:1,col:name,issue:missing_value"
         issues = parse_llm_response(response)
-        assert len(issues) == 2  # duplicates kept, env handles dedup
     def test_with_column_variant(self):
         response = "row:1,column:name,issue:missing_value"
@@ -58,6 +57,38 @@ class TestParseLLMResponse:
         assert len(issues) == 1
 class TestBuildUserPrompt:
     def test_includes_all_fields(self):
         obs = {
@@ -100,6 +131,18 @@ class TestBuildUserPrompt:
         prompt = build_user_prompt(obs)
         assert "FEEDBACK" not in prompt
 class TestLogFormat:
     """Verify stdout log format matches hackathon evaluation requirements."""

 """Tests for the inference script's parsing, prompt building, and log format."""
 import pytest
 import sys
 import os
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+from inference import parse_llm_response, parse_fix_response, build_user_prompt, log_start, log_step, log_end
 class TestParseLLMResponse:
     def test_deduplication_not_applied(self):
         response = "row:1,col:name,issue:missing_value\nrow:1,col:name,issue:missing_value"
         issues = parse_llm_response(response)
+        assert len(issues) == 2
     def test_with_column_variant(self):
         response = "row:1,column:name,issue:missing_value"
         assert len(issues) == 1
+class TestParseFixResponse:
+    def test_standard_format(self):
+        response = "row:4,col:name,fix:David Kim\nrow:7,col:salary,fix:75000"
+        fixes = parse_fix_response(response)
+        assert len(fixes) == 2
+        assert "row:4,col:name,fix:David Kim" in fixes
+    def test_numbered_list(self):
+        response = "1. row:4,col:name,fix:David Kim\n2. row:7,col:salary,fix:75000"
+        fixes = parse_fix_response(response)
+        assert len(fixes) == 2
+    def test_with_special_chars(self):
+        response = "row:1,col:email,fix:alice.chen@company.com"
+        fixes = parse_fix_response(response)
+        assert len(fixes) == 1
+        assert "alice.chen@company.com" in fixes[0]
+    def test_empty_response(self):
+        assert parse_fix_response("") == []
+    def test_date_fix(self):
+        response = "row:12,col:order_date,fix:2024-01-26"
+        fixes = parse_fix_response(response)
+        assert len(fixes) == 1
+    def test_ignores_issue_lines(self):
+        response = "row:4,col:name,issue:missing_value\nrow:4,col:name,fix:David Kim"
+        fixes = parse_fix_response(response)
+        assert len(fixes) == 1  # only the fix line
 class TestBuildUserPrompt:
     def test_includes_all_fields(self):
         obs = {
         prompt = build_user_prompt(obs)
         assert "FEEDBACK" not in prompt
+    def test_include_fixes_flag(self):
+        obs = {
+            "task_description": "Find issues",
+            "schema_description": "",
+            "validation_rules": "",
+            "dataset_csv": "a\n1",
+            "num_issues_hint": 0,
+            "feedback": "",
+        }
+        prompt = build_user_prompt(obs, include_fixes=True)
+        assert "fix" in prompt.lower()
 class TestLogFormat:
     """Verify stdout log format matches hackathon evaluation requirements."""

tests/test_tasks.py CHANGED Viewed

@@ -113,8 +113,8 @@ class TestTaskHard:
     def test_task_id(self, task):
         assert task.task_id == "hard"
-    def test_has_8_issues(self, task):
-        assert len(task.planted_issues) == 8
     def test_issue_types(self, task):
         types = {i.issue_type for i in task.planted_issues}

     def test_task_id(self, task):
         assert task.task_id == "hard"
+    def test_has_10_issues(self, task):
+        assert len(task.planted_issues) == 10
     def test_issue_types(self, task):
         types = {i.issue_type for i in task.planted_issues}