avanigupta committed on
Commit cd11aba · 1 Parent(s): 4c1a85d

fixes v1: add per step reward
README.md CHANGED
@@ -1,6 +1,6 @@
  ---
  title: DataQA Environment Server
- emoji: 🔍
  colorFrom: blue
  colorTo: gray
  sdk: docker
@@ -15,49 +15,173 @@ tags:
 
  An OpenEnv environment for **Data Quality Assurance** — an LLM agent inspects datasets with planted quality issues and must identify them all.
 
- ## Overview
 
- DataQA simulates the real-world task of validating datasets before they enter ML training pipelines or production databases. The agent receives a corrupted dataset along with its schema and validation rules, then must identify all planted data quality issues.
 
- ### Why Data QA?
-
- Every ML engineer and data scientist spends significant time debugging data quality issues — missing values, type mismatches, inconsistencies, and subtle statistical anomalies. This environment turns that task into a structured, gradable challenge.
 
  ## Environment API
 
- | Endpoint | Description |
- |----------|-------------|
- | `reset(task_id)` | Start a new episode with a corrupted dataset |
- | `step(issues)` | Submit identified issues, receive F1-scored feedback |
- | `state()` | Get current episode state |
 
  ## Tasks
 
- | Task | Issues | Difficulty | Description |
- |------|--------|------------|-------------|
- | `easy` | 4 | Beginner | Employee directory: nulls, wrong types, duplicates, out-of-range |
- | `medium` | 6 | Intermediate | E-commerce orders: format violations, inconsistent totals, duplicate keys |
- | `hard` | 8 | Advanced | ML experiment metadata: data leakage signals, unreasonable GPU usage, timestamp ordering |
 
  ## Reward Function
 
- Scoring uses **F1 score** (harmonic mean of precision and recall):
 
- - **Precision**: What fraction of reported issues are real?
- - **Recall**: What fraction of planted issues did the agent find?
- - **F1**: `2 * precision * recall / (precision + recall)`
 
- Issues are matched by `row:<N>,col:<column>,issue:<type>` keys.
 
- The agent gets up to 3 attempts per task, with feedback on each attempt (true positives, false positives, missed count).
 
- ## Action/Observation Space
 
- **Action**: List of issue strings in the format `row:<row_number>,col:<column_name>,issue:<issue_type>`
 
- **Observation**: Dataset CSV + schema + validation rules + feedback from the previous attempt
 
- **Issue Types**: `missing_value`, `wrong_type`, `duplicate_row`, `out_of_range`, `format_violation`, `inconsistent_value`, `statistical_outlier`, `referential_integrity`
 
  ## Quick Start
 
@@ -68,42 +192,75 @@ pip install -e .
  # Run server locally
  uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
 
- # Run inference
- API_BASE_URL=https://api.groq.com/openai/v1 \
- MODEL_NAME=llama-3.3-70b-versatile \
- LLM_API_KEY=your-key \
  python inference.py
  ```
 
  ## Docker
 
  ```bash
- docker build -t dataqa-env -f dataqa_env/server/Dockerfile .
  docker run -p 8000:8000 dataqa-env
  ```
 
  ## Environment Variables
 
  | Variable | Description | Default |
  |----------|-------------|---------|
- | `API_BASE_URL` | LLM API endpoint | `https://api.groq.com/openai/v1` |
- | `MODEL_NAME` | Model identifier | `llama-3.3-70b-versatile` |
- | `HF_TOKEN` | HuggingFace token | - |
  | `ENV_URL` | Environment server URL | `http://localhost:8000` |
- | `LLM_API_KEY` | API key for LLM provider | Falls back to `HF_TOKEN` |
 
  ## Architecture
 
  ```
  dataqa_env/
  ├── models.py # Pydantic: DataQAAction, DataQAObservation, DataQAState
  ├── client.py # EnvClient for WebSocket connections
  ├── server/
- │   ├── environment.py # Core DataQAEnvironment (reset/step/state)
- │   ├── tasks.py # Task definitions + data corruption + grading
- │   ├── app.py # FastAPI server
  │   └── Dockerfile
- ├── openenv.yaml
- ├── pyproject.toml
- ├── inference.py # LLM agent using OpenAI client
  ```
 
  ---
  title: DataQA Environment Server
+ emoji: "\U0001F50D"
  colorFrom: blue
  colorTo: gray
  sdk: docker
 
  An OpenEnv environment for **Data Quality Assurance** — an LLM agent inspects datasets with planted quality issues and must identify them all.
 
+ ## Motivation
 
+ Every ML engineer and data scientist spends significant time debugging data quality issues (missing values, type mismatches, logical inconsistencies, and subtle statistical anomalies) before data enters ML pipelines or production databases. This is a genuine, high-frequency human task that directly impacts model quality and business outcomes.
 
+ DataQA turns this into a structured, gradable RL environment where agents must systematically inspect corrupted datasets, reason about schema constraints and validation rules, and pinpoint every planted issue — from obvious nulls to subtle data leakage signals that require domain expertise.
 
  ## Environment API
 
+ | Endpoint | Method | Description |
+ |----------|--------|-------------|
+ | `/reset` | POST | Start a new episode with a corrupted dataset |
+ | `/step` | POST | Submit identified issues, receive scored feedback |
+ | `/state` | GET | Get current episode state |
+ | `/health` | GET | Health check |
 
  ## Tasks
 
+ | Task | Issues | Difficulty | Domain | Description |
+ |------|--------|------------|--------|-------------|
+ | `easy` | 4 | Beginner | HR/Employee data | Nulls, wrong types, duplicates, out-of-range values |
+ | `medium` | 6 | Intermediate | E-commerce orders | Format violations, inconsistent computed fields, duplicate keys |
+ | `hard` | 8 | Advanced | ML experiment metadata | Data leakage signals, unreasonable GPU memory, timestamp ordering, whitespace-only fields |
+
+ **Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (total != qty * price) and set-membership checks. Hard issues require ML domain knowledge (val_loss < train_loss suggests data leakage) and multi-row temporal reasoning.
+
+ ## Action Space
+
+ The agent submits a list of issue strings, each in the format:
+ ```
+ row:<row_number>,col:<column_name>,issue:<issue_type>
+ ```
+
+ - `row_number`: 1-indexed position in the CSV data (after the header). Row 1 = first data row.
+ - `column_name`: Exact column header name, lowercase.
+ - `issue_type`: One of the supported types below.
+
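The format above can be sketched with two small helpers (illustrative only: `make_issue` and `parse_issue` are hypothetical names for this example, not part of the package API):

```python
# Hypothetical helpers for building and parsing issue strings in the
# documented format; not part of the dataqa_env API.
def make_issue(row: int, col: str, issue_type: str) -> str:
    return f"row:{row},col:{col},issue:{issue_type}"

def parse_issue(key: str) -> dict:
    # Split "row:4,col:name,issue:missing_value" into its three fields
    parts = dict(p.split(":", 1) for p in key.split(","))
    return {"row": int(parts["row"]), "col": parts["col"], "issue": parts["issue"]}

key = make_issue(4, "name", "missing_value")
print(key)                      # row:4,col:name,issue:missing_value
print(parse_issue(key)["row"])  # 4
```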
+ **Supported Issue Types:**
+
+ | Type | Description | Example |
+ |------|-------------|---------|
+ | `missing_value` | Null, empty, or whitespace-only | Empty name field |
+ | `wrong_type` | Value doesn't match the expected type | Salary as "seventy-five thousand" |
+ | `duplicate_row` | Exact duplicate or duplicate key | Two rows with the same employee_id |
+ | `out_of_range` | Value outside the valid range | Salary of 5000 when the minimum is 50000 |
+ | `format_violation` | Wrong format or invalid enum value | Date as DD/MM/YYYY instead of YYYY-MM-DD |
+ | `inconsistent_value` | Computed-field mismatch or logical inconsistency | total != qty * price |
+ | `statistical_outlier` | Unreasonable value given context | resnet18 using 42.5 GB of GPU memory |
+ | `referential_integrity` | Foreign-key violation | (available for custom tasks) |
+
+ ## Observation Space
+
+ Each observation contains:
+
+ | Field | Type | Description |
+ |-------|------|-------------|
+ | `dataset_csv` | str | The corrupted dataset in CSV format |
+ | `schema_description` | str | Column types, ranges, and constraints |
+ | `validation_rules` | str | Business rules the data must satisfy |
+ | `task_description` | str | Task context and instructions |
+ | `feedback` | str | Results from the previous step (TP/FP/FN counts, precision/recall) |
+ | `num_issues_hint` | int | Exact count of planted issues |
+ | `max_steps` | int | Maximum attempts allowed |
+ | `done` | bool | Whether the episode has terminated |
+ | `reward` | float | Best weighted reward so far (0.0-1.0) |
+
+ **Observation metadata** (available after each step):
+ - `f1`, `weighted_reward`, `precision`, `recall`
+ - `tp`, `fp`, `fn`
+ - `difficulty_found`, `difficulty_missed`
 
  ## Reward Function
 
+ ### Difficulty-Weighted Reward (Primary)
+
+ Each planted issue has a **difficulty weight** (1.0-3.0) reflecting how hard it is to detect. The primary reward is a **weighted F1 score** that provides meaningful per-step partial-progress signals:
+
+ | Weight | Category | Examples |
+ |--------|----------|----------|
+ | 1.0 | Easy | Missing values, obvious out-of-range, wrong type |
+ | 1.5-2.0 | Medium | Duplicate keys, format violations, cross-column checks |
+ | 2.5-3.0 | Hard | Data leakage, statistical outliers, whitespace-only fields |
+
+ **Formula:**
+ - **Weighted recall** = (sum of difficulty weights of found issues) / (total difficulty weight)
+ - **Weighted precision** = (found weight) / (found weight + FP count * average difficulty)
+ - **Weighted F1** = harmonic mean of weighted precision and weighted recall
+
+ This means:
+ - Finding a hard issue (difficulty 3.0) increases reward 3x more than finding an easy one (1.0)
+ - False positives are penalized proportionally to the average issue difficulty
+ - The agent sees meaningful reward differences at every step, not just pass/fail
+
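The formula above can be sketched as follows (a simplified stand-in mirroring the stated definitions; the function name and signature are illustrative, not the package's actual implementation):

```python
# Sketch of the difficulty-weighted F1 described above. Inputs are the
# difficulty weights of found and missed issues plus a false-positive count.
def weighted_f1(found_difficulties, missed_difficulties, num_false_positives):
    total = sum(found_difficulties) + sum(missed_difficulties)
    if total == 0:
        return 0.0
    found = sum(found_difficulties)
    recall = found / total  # proportion of total difficulty captured
    # FP penalty is scaled by the average issue difficulty
    avg_difficulty = total / (len(found_difficulties) + len(missed_difficulties))
    denom = found + num_false_positives * avg_difficulty
    precision = found / denom if denom > 0 else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Finding the one hard issue (3.0) scores higher than the one easy issue (1.0):
print(round(weighted_f1([3.0], [1.0, 1.0], 0), 2))  # 0.75
print(round(weighted_f1([1.0], [3.0, 1.0], 0), 2))  # 0.33
```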
+ ### Standard F1 (also computed)
+
+ Available in observation metadata for comparison. Uses unweighted set matching.
+
+ ### Episode Boundaries
+
+ - Each task allows up to 3 steps (attempts)
+ - The episode ends when F1 >= 0.999 (perfect) or the step limit is reached
+ - The best score across all steps is the final reward (monotonically non-decreasing)
+ - Reward is always in [0.0, 1.0]
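The best-score bookkeeping above amounts to a running maximum (a sketch of the described behavior, not the package's code):

```python
# The reported reward is the best weighted score seen so far in the episode,
# so the reward sequence never decreases across steps.
def best_so_far(step_scores):
    best, history = 0.0, []
    for s in step_scores:
        best = max(best, s)
        history.append(best)
    return history

print(best_so_far([0.4, 0.7, 0.55]))  # [0.4, 0.7, 0.7]
```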
 
+ ## Baseline Scores
 
+ Baseline scores using Qwen2.5-72B-Instruct via the HuggingFace Router:
 
+ | Task | Expected Score Range | Description |
+ |------|----------------------|-------------|
+ | `easy` | 0.7 - 1.0 | Most LLMs find obvious issues reliably |
+ | `medium` | 0.5 - 0.8 | Cross-column reasoning is challenging |
+ | `hard` | 0.3 - 0.6 | Requires ML domain knowledge and subtle-pattern detection |
 
+ Scores vary by model capability. Frontier models (GPT-4, Claude) typically score higher on the hard task due to better domain reasoning.
 
+ ## Extensibility
 
+ DataQA supports custom tasks, contamination rules, and difficulty levels via a programmatic API.
 
+ ### Custom Contamination Rules
+
+ ```python
+ from dataqa_env import register_contamination_rule
+ from dataqa_env.server.tasks import PlantedIssue
+
+ def swap_digits(rows, header, col_idx, row_idx, rng):
+     val = rows[row_idx][col_idx]
+     corrupted = val[::-1]  # reverse the value, swapping the digit order
+     issue = PlantedIssue(
+         row=row_idx + 1, col=header[col_idx],
+         issue_type="format_violation",
+         description=f"Digits swapped in {header[col_idx]}",
+         difficulty=2.0,
+     )
+     return corrupted, issue
+
+ register_contamination_rule("swap_digits", swap_digits)
+ ```
+
+ ### Custom Tasks from Config
+
+ ```python
+ from dataqa_env import create_task_from_config, register_task
+
+ task = create_task_from_config(
+     task_id="custom",
+     name="Custom Validation",
+     description="Find quality issues in this dataset.",
+     schema_description="id: int, name: str, score: int (0-100)",
+     validation_rules="No missing values. Scores must be 0-100.",
+     clean_csv="id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92",
+     contaminations=[
+         {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
+         {"rule": "negative_value", "row": 2, "col": 2, "difficulty": 1.5},
+     ],
+ )
+ register_task("custom", lambda seed: task)
+ ```
+
+ ### Built-in Contamination Rules
+
+ | Rule | Effect | Default Difficulty |
+ |------|--------|--------------------|
+ | `missing_value` | Sets the field to an empty string | 1.0 |
+ | `whitespace_value` | Sets the field to a single space | 2.5 |
+ | `wrong_type_text` | Replaces the value with random text ("N/A", "null", etc.) | 1.0 |
+ | `negative_value` | Negates a numeric value | 1.0 |
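As an illustration, the documented effect of `negative_value` could be reproduced like this (a sketch of the behavior described in the table, not the package's actual rule implementation):

```python
# Illustrative re-implementation of the `negative_value` rule's documented
# effect: negate the numeric value at (row, col) in a list-of-rows table.
def negative_value(rows, row, col):
    rows[row][col] = str(-float(rows[row][col]))
    return rows

table = [["1", "Alice", "95"], ["2", "Bob", "87"]]
print(negative_value(table, 0, 2)[0][2])  # -95.0
```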
 
  ## Quick Start
 
  # Run server locally
  uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
 
+ # Run inference (set your API credentials)
+ API_BASE_URL=https://router.huggingface.co/v1 \
+ MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
+ HF_TOKEN=your-token \
  python inference.py
  ```
 
  ## Docker
 
  ```bash
+ docker build -t dataqa-env .
  docker run -p 8000:8000 dataqa-env
  ```
 
+ ## Testing
+
+ ```bash
+ pip install -e ".[dev]"
+ pytest tests/ -v
+ ```
+
+ 89 tests covering:
+ - Task creation, corruption, and issue planting (difficulty weights, seed determinism)
+ - Issue-key parsing (standard, lenient, edge cases)
+ - F1 and difficulty-weighted reward computation
+ - Full environment reset/step lifecycle
+ - Inference-script parsing and prompt building
+ - **Structured log format** ([START], [STEP], [END] — exact field names and ordering)
+ - Score bounds (0.0-1.0), best-score monotonicity
+ - Extensibility API (custom rules, custom tasks, environment integration)
+
+ ## Validation
+
+ ```bash
+ # OpenEnv spec validation
+ openenv validate .
+
+ # Pre-submission validation (requires an HF Space URL)
+ ./prevalidation_script.sh https://your-space.hf.space
+ ```
+
  ## Environment Variables
 
  | Variable | Description | Default |
  |----------|-------------|---------|
+ | `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
+ | `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
+ | `HF_TOKEN` | HuggingFace token / API key | - |
  | `ENV_URL` | Environment server URL | `http://localhost:8000` |
 
  ## Architecture
 
  ```
  dataqa_env/
+ ├── __init__.py # Public API + extensibility exports
  ├── models.py # Pydantic: DataQAAction, DataQAObservation, DataQAState
  ├── client.py # EnvClient for WebSocket connections
  ├── server/
+ │   ├── environment.py # Core DataQAEnvironment (reset/step/state + weighted rewards)
+ │   ├── tasks.py # Task definitions + contamination rules + extensibility API
+ │   ├── app.py # FastAPI server (via openenv-core create_app)
  │   └── Dockerfile
+ tests/
+ ├── test_tasks.py # Task creation, corruption, difficulty weights
+ ├── test_environment.py # Environment lifecycle, scoring, metadata
+ ├── test_inference.py # LLM response parsing, prompt building, log format
+ └── test_extensibility.py # Custom rules, custom tasks, registration API
+ inference.py # Baseline LLM agent (OpenAI client, structured logs)
+ openenv.yaml # OpenEnv/HF Spaces spec
+ pyproject.toml # Package metadata and dependencies
+ Dockerfile # Production container
  ```
dataqa_env/__init__.py CHANGED
@@ -1,4 +1,19 @@
  from .client import DataQAEnv
  from .models import DataQAAction, DataQAObservation, DataQAState
 
- __all__ = ["DataQAEnv", "DataQAAction", "DataQAObservation", "DataQAState"]
 
  from .client import DataQAEnv
  from .models import DataQAAction, DataQAObservation, DataQAState
+ from .server.tasks import (
+     create_task_from_config,
+     register_task,
+     register_contamination_rule,
+     CONTAMINATION_RULES,
+ )
 
+ __all__ = [
+     "DataQAEnv",
+     "DataQAAction",
+     "DataQAObservation",
+     "DataQAState",
+     "create_task_from_config",
+     "register_task",
+     "register_contamination_rule",
+     "CONTAMINATION_RULES",
+ ]
dataqa_env/server/environment.py CHANGED
@@ -58,6 +58,61 @@ def compute_f1(reported_keys: Set[str], planted_keys: Set[str]) -> dict:
      return {"precision": precision, "recall": recall, "f1": f1, "tp": tp, "fp": fp, "fn": fn}
 
 
  class DataQAEnvironment(Environment):
      """
      Data Quality Assurance environment.
@@ -138,15 +193,21 @@ class DataQAEnvironment(Environment):
          else:
              parse_errors.append(f"Could not parse: '{raw_issue}'")
 
-         # Compute score
          metrics = compute_f1(reported_keys, self._planted_keys)
          score = metrics["f1"]
-         self._best_score = max(self._best_score, score)
          self._state.best_score = self._best_score
 
          # Check if done
          is_done = (
-             score >= 0.999  # Perfect score
              or self._state.current_step >= self._state.max_steps
          )
@@ -156,6 +217,7 @@
              f"Issues reported: {len(reported_keys)}",
              f"True positives: {metrics['tp']}, False positives: {metrics['fp']}, Missed: {metrics['fn']}",
              f"Precision: {metrics['precision']:.3f}, Recall: {metrics['recall']:.3f}, F1: {score:.3f}",
          ]
 
          if parse_errors:
@@ -173,7 +235,7 @@
              )
              feedback_lines.append("You can submit again with an updated list of issues.")
          else:
-             feedback_lines.append(f"Task complete! Final best F1 score: {self._best_score:.3f}")
 
          return DataQAObservation(
              dataset_csv=self._current_task.corrupted_csv,
@@ -186,6 +248,17 @@
              max_steps=self._state.max_steps,
              done=is_done,
              reward=self._best_score,
          )
 
      @property
      return {"precision": precision, "recall": recall, "f1": f1, "tp": tp, "fp": fp, "fn": fn}
 
 
+ def compute_weighted_reward(
+     reported_keys: Set[str],
+     planted_issues: list,
+ ) -> dict:
+     """
+     Compute difficulty-weighted reward for a richer per-step signal.
+
+     Each planted issue has a difficulty weight (1.0-3.0). Finding harder issues
+     earns more reward. False positives incur a penalty scaled by average difficulty.
+
+     Returns a dict with weighted_reward (0.0-1.0) plus a per-issue breakdown.
+     """
+     if not planted_issues and not reported_keys:
+         return {"weighted_reward": 1.0, "difficulty_found": 0.0, "difficulty_missed": 0.0}
+
+     planted_by_key = {issue.to_key(): issue for issue in planted_issues}
+     planted_keys = set(planted_by_key.keys())
+
+     if not reported_keys:
+         total_weight = sum(i.difficulty for i in planted_issues)
+         return {"weighted_reward": 0.0, "difficulty_found": 0.0, "difficulty_missed": total_weight}
+
+     if not planted_keys:
+         return {"weighted_reward": 0.0, "difficulty_found": 0.0, "difficulty_missed": 0.0}
+
+     # Sum difficulty weights for found vs missed issues
+     found_keys = reported_keys & planted_keys
+     missed_keys = planted_keys - reported_keys
+     false_positive_count = len(reported_keys - planted_keys)
+
+     difficulty_found = sum(planted_by_key[k].difficulty for k in found_keys)
+     difficulty_missed = sum(planted_by_key[k].difficulty for k in missed_keys)
+     total_weight = sum(i.difficulty for i in planted_issues)
+
+     # Weighted recall: proportion of total difficulty captured
+     weighted_recall = difficulty_found / total_weight if total_weight > 0 else 0.0
+
+     # Penalty for false positives (scaled by avg difficulty so the penalty is proportional)
+     avg_difficulty = total_weight / len(planted_issues)
+     fp_penalty_weight = false_positive_count * avg_difficulty
+     weighted_precision = difficulty_found / (difficulty_found + fp_penalty_weight) if (difficulty_found + fp_penalty_weight) > 0 else 0.0
+
+     # Weighted F1
+     if (weighted_precision + weighted_recall) > 0:
+         weighted_reward = 2 * weighted_precision * weighted_recall / (weighted_precision + weighted_recall)
+     else:
+         weighted_reward = 0.0
+
+     return {
+         "weighted_reward": round(weighted_reward, 4),
+         "difficulty_found": round(difficulty_found, 2),
+         "difficulty_missed": round(difficulty_missed, 2),
+     }
+
+
  class DataQAEnvironment(Environment):
      """
      Data Quality Assurance environment.
 
          else:
              parse_errors.append(f"Could not parse: '{raw_issue}'")
 
+         # Compute score (standard F1)
          metrics = compute_f1(reported_keys, self._planted_keys)
          score = metrics["f1"]
+
+         # Compute difficulty-weighted reward (richer per-step signal)
+         weighted = compute_weighted_reward(reported_keys, self._current_task.planted_issues)
+         weighted_reward = weighted["weighted_reward"]
+
+         # Use the weighted reward as the primary reward signal
+         self._best_score = max(self._best_score, weighted_reward)
          self._state.best_score = self._best_score
 
          # Check if done
          is_done = (
+             score >= 0.999  # Perfect score (all issues found exactly)
              or self._state.current_step >= self._state.max_steps
          )
 
              f"Issues reported: {len(reported_keys)}",
              f"True positives: {metrics['tp']}, False positives: {metrics['fp']}, Missed: {metrics['fn']}",
              f"Precision: {metrics['precision']:.3f}, Recall: {metrics['recall']:.3f}, F1: {score:.3f}",
+             f"Weighted reward: {weighted_reward:.3f} (difficulty found: {weighted['difficulty_found']}, missed: {weighted['difficulty_missed']})",
          ]
 
          if parse_errors:
 
              )
              feedback_lines.append("You can submit again with an updated list of issues.")
          else:
+             feedback_lines.append(f"Task complete! Final best weighted reward: {self._best_score:.3f}")
 
          return DataQAObservation(
              dataset_csv=self._current_task.corrupted_csv,
 
              max_steps=self._state.max_steps,
              done=is_done,
              reward=self._best_score,
+             metadata={
+                 "f1": score,
+                 "weighted_reward": weighted_reward,
+                 "precision": metrics["precision"],
+                 "recall": metrics["recall"],
+                 "tp": metrics["tp"],
+                 "fp": metrics["fp"],
+                 "fn": metrics["fn"],
+                 "difficulty_found": weighted["difficulty_found"],
+                 "difficulty_missed": weighted["difficulty_missed"],
+             },
          )
 
      @property
dataqa_env/server/tasks.py CHANGED
@@ -25,6 +25,7 @@ class PlantedIssue:
      col: str
      issue_type: str
      description: str
 
      def to_key(self) -> str:
          return f"row:{self.row},col:{self.col},issue:{self.issue_type}"
@@ -93,29 +94,29 @@ def create_task_easy(seed: int = 42) -> Task:
      data = rows[1:]
      issues: List[PlantedIssue] = []
 
-     # Issue 1: Missing value - null out a name
      r = 3  # row index in data (0-based), displayed as row 4 in CSV
      data[r][1] = ""
      issues.append(PlantedIssue(row=r + 1, col="name", issue_type="missing_value",
-                                description="Empty name field"))
 
-     # Issue 2: Wrong type - salary as text
      r = 6
      data[r][4] = "seventy-five thousand"
      issues.append(PlantedIssue(row=r + 1, col="salary", issue_type="wrong_type",
-                                description="Salary is text instead of integer"))
 
-     # Issue 3: Duplicate row
      dup_source = 1
      data.append(list(data[dup_source]))
      issues.append(PlantedIssue(row=len(data), col="employee_id", issue_type="duplicate_row",
-                                description=f"Exact duplicate of row {dup_source + 1}"))
 
-     # Issue 4: Out of range salary
      r = 8
      data[r][4] = "5000"
      issues.append(PlantedIssue(row=r + 1, col="salary", issue_type="out_of_range",
-                                description="Salary 5000 is below minimum 50000"))
 
      corrupted = _rows_to_csv([header] + data)
 
@@ -190,41 +191,41 @@ ORD-020,CUST-118,Fitness Tracker,Electronics,1,79.99,2024-02-03,AU,delivered,79.
      data = rows[1:]
      issues: List[PlantedIssue] = []
 
-     # Issue 1: total doesn't match quantity * unit_price
      r = 4  # ORD-005
      data[r][9] = "84.00"  # should be 42.00 (qty=1, price=42.00)
      issues.append(PlantedIssue(row=r + 1, col="total", issue_type="inconsistent_value",
-                                description="total (84.00) != quantity (1) * unit_price (42.00)"))
 
-     # Issue 2: Invalid category
      r = 9  # ORD-010
      data[r][3] = "Fitness"  # should be Sports
      issues.append(PlantedIssue(row=r + 1, col="category", issue_type="format_violation",
-                                description="'Fitness' is not in allowed categories"))
 
-     # Issue 3: Missing value in product_name
      r = 13  # ORD-014
      data[r][2] = ""
      issues.append(PlantedIssue(row=r + 1, col="product_name", issue_type="missing_value",
-                                description="Empty product_name"))
 
-     # Issue 4: Out of range quantity
      r = 16  # ORD-017
      data[r][4] = "-1"
      issues.append(PlantedIssue(row=r + 1, col="quantity", issue_type="out_of_range",
-                                description="Negative quantity"))
 
-     # Issue 5: Duplicate order_id
      r = 18  # ORD-019
      data[r][0] = "ORD-003"
      issues.append(PlantedIssue(row=r + 1, col="order_id", issue_type="duplicate_row",
-                                description="Duplicate order_id ORD-003"))
 
-     # Issue 6: Wrong date format
      r = 11  # ORD-012
      data[r][6] = "26/01/2024"
      issues.append(PlantedIssue(row=r + 1, col="order_date", issue_type="format_violation",
-                                description="Date format DD/MM/YYYY instead of YYYY-MM-DD"))
 
      corrupted = _rows_to_csv([header] + data)
 
@@ -301,53 +302,57 @@ EXP-015,whisper-small,common-voice,520000,16000,16000,0.00005,16,5,0.55,0.68,0.0
      data = rows[1:]
      issues: List[PlantedIssue] = []
 
-     # Issue 1: Data leakage signal — val_loss much lower than train_loss
      r = 4  # EXP-005
      data[r][10] = "0.15"  # val_loss=0.15 but train_loss=0.28 → suspicious
      issues.append(PlantedIssue(row=r + 1, col="val_loss", issue_type="inconsistent_value",
-                                description="val_loss (0.15) significantly less than train_loss (0.28), potential data leakage"))
 
-     # Issue 2: Batch size not a power of 2
      r = 8  # EXP-009
      data[r][7] = "250"  # not a power of 2
      issues.append(PlantedIssue(row=r + 1, col="batch_size", issue_type="format_violation",
-                                description="batch_size 250 is not a power of 2"))
 
-     # Issue 3: GPU memory unreasonable for the model
      r = 6  # EXP-007 resnet18 on cifar10
      data[r][12] = "42.5"  # resnet18 shouldn't need 42.5 GB
      issues.append(PlantedIssue(row=r + 1, col="gpu_memory_gb", issue_type="statistical_outlier",
-                                description="resnet18 on cifar10 using 42.5 GB GPU memory is unreasonable"))
 
-     # Issue 4: Timestamp out of order
      r = 10  # EXP-011
      data[r][14] = "2024-03-02T09:00:00"  # should be after EXP-010's timestamp
      issues.append(PlantedIssue(row=r + 1, col="timestamp", issue_type="inconsistent_value",
-                                description="Timestamp 2024-03-02 is before EXP-010's timestamp 2024-03-11"))
 
-     # Issue 5: Train size smaller than test size
      r = 9  # EXP-010
      data[r][3] = "500"  # train_size=500 but test_size=1821
      issues.append(PlantedIssue(row=r + 1, col="train_size", issue_type="inconsistent_value",
-                                description="train_size (500) is smaller than test_size (1821)"))
 
-     # Issue 6: Negative training time
      r = 13  # EXP-014
      data[r][13] = "-72.0"
      issues.append(PlantedIssue(row=r + 1, col="training_time_hours", issue_type="out_of_range",
-                                description="Negative training time"))
 
-     # Issue 7: Learning rate out of range
      r = 12  # EXP-013
      data[r][6] = "2.5"  # way too high
      issues.append(PlantedIssue(row=r + 1, col="learning_rate", issue_type="out_of_range",
-                                description="Learning rate 2.5 exceeds maximum of 1.0"))
 
-     # Issue 8: Missing model name (subtle: single space instead of empty)
      r = 14  # EXP-015
      data[r][1] = " "
      issues.append(PlantedIssue(row=r + 1, col="model_name", issue_type="missing_value",
-                                description="model_name is whitespace-only"))
 
      corrupted = _rows_to_csv([header] + data)
 
@@ -370,6 +375,123 @@ EXP-015,whisper-small,common-voice,520000,16000,16000,0.00005,16,5,0.55,0.68,0.0
      )
 
 
  # ---------------------------------------------------------------------------
  # Task registry
  # ---------------------------------------------------------------------------
 
25
  col: str
26
  issue_type: str
27
  description: str
28
+ difficulty: float = 1.0 # 1.0=easy, 2.0=medium, 3.0=hard (for weighted reward)
29
 
30
  def to_key(self) -> str:
31
  return f"row:{self.row},col:{self.col},issue:{self.issue_type}"
 
94
  data = rows[1:]
95
  issues: List[PlantedIssue] = []
96
 
97
+ # Issue 1: Missing value - null out a name (easy to spot)
98
  r = 3 # row index in data (0-based), displayed as row 4 in CSV
99
  data[r][1] = ""
100
  issues.append(PlantedIssue(row=r + 1, col="name", issue_type="missing_value",
101
+ description="Empty name field", difficulty=1.0))
102
 
103
+ # Issue 2: Wrong type - salary as text (easy to spot)
104
  r = 6
105
  data[r][4] = "seventy-five thousand"
106
  issues.append(PlantedIssue(row=r + 1, col="salary", issue_type="wrong_type",
107
+ description="Salary is text instead of integer", difficulty=1.0))
108
 
109
+ # Issue 3: Duplicate row (moderate — requires cross-row comparison)
110
  dup_source = 1
111
  data.append(list(data[dup_source]))
112
  issues.append(PlantedIssue(row=len(data), col="employee_id", issue_type="duplicate_row",
113
+ description=f"Exact duplicate of row {dup_source + 1}", difficulty=1.5))
114
 
115
+ # Issue 4: Out of range salary (easy to spot)
116
  r = 8
117
  data[r][4] = "5000"
118
  issues.append(PlantedIssue(row=r + 1, col="salary", issue_type="out_of_range",
119
+ description="Salary 5000 is below minimum 50000", difficulty=1.0))
120
 
121
  corrupted = _rows_to_csv([header] + data)
122
 
 
   data = rows[1:]
   issues: List[PlantedIssue] = []
 
+  # Issue 1: total doesn't match quantity * unit_price (requires cross-column check)
   r = 4  # ORD-005
   data[r][9] = "84.00"  # should be 42.00 (qty=1, price=42.00)
   issues.append(PlantedIssue(row=r + 1, col="total", issue_type="inconsistent_value",
+                             description="total (84.00) != quantity (1) * unit_price (42.00)", difficulty=2.0))
 
+  # Issue 2: Invalid category (requires knowing the allowed set)
   r = 9  # ORD-010
   data[r][3] = "Fitness"  # should be Sports
   issues.append(PlantedIssue(row=r + 1, col="category", issue_type="format_violation",
+                             description="'Fitness' is not in allowed categories", difficulty=1.5))
 
+  # Issue 3: Missing value in product_name (easy to spot)
   r = 13  # ORD-014
   data[r][2] = ""
   issues.append(PlantedIssue(row=r + 1, col="product_name", issue_type="missing_value",
+                             description="Empty product_name", difficulty=1.0))
 
+  # Issue 4: Out of range quantity (easy to spot)
   r = 16  # ORD-017
   data[r][4] = "-1"
   issues.append(PlantedIssue(row=r + 1, col="quantity", issue_type="out_of_range",
+                             description="Negative quantity", difficulty=1.0))
 
+  # Issue 5: Duplicate order_id (requires cross-row comparison)
   r = 18  # ORD-019
   data[r][0] = "ORD-003"
   issues.append(PlantedIssue(row=r + 1, col="order_id", issue_type="duplicate_row",
+                             description="Duplicate order_id ORD-003", difficulty=1.5))
 
+  # Issue 6: Wrong date format (moderate — format mismatch)
   r = 11  # ORD-012
   data[r][6] = "26/01/2024"
   issues.append(PlantedIssue(row=r + 1, col="order_date", issue_type="format_violation",
+                             description="Date format DD/MM/YYYY instead of YYYY-MM-DD", difficulty=1.5))
 
   corrupted = _rows_to_csv([header] + data)
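The inconsistent-total corruption planted above is the kind of cross-column check an agent has to run to earn the reward. A minimal, illustrative sketch; the column names follow the order schema used here, but `find_inconsistent_totals` and the `SAMPLE` data are hypothetical, not part of the environment:

```python
import csv
import io

SAMPLE = """order_id,quantity,unit_price,total
ORD-001,2,10.00,20.00
ORD-005,1,42.00,84.00
"""

def find_inconsistent_totals(csv_text: str) -> list[int]:
    """Return 1-based data-row numbers where total != quantity * unit_price."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bad = []
    for i, row in enumerate(reader, start=1):  # row 1 = first data row
        expected = float(row["quantity"]) * float(row["unit_price"])
        if abs(float(row["total"]) - expected) > 1e-9:
            bad.append(i)
    return bad

print(find_inconsistent_totals(SAMPLE))  # [2] -> the ORD-005 row is flagged
```

Note the 1-based data-row convention, which matches the `row=r + 1` bookkeeping in the planting code.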
 
 
   data = rows[1:]
   issues: List[PlantedIssue] = []
 
+  # Issue 1: Data leakage signal — val_loss much lower than train_loss (hard — requires ML knowledge)
   r = 4  # EXP-005
   data[r][10] = "0.15"  # val_loss=0.15 but train_loss=0.28 → suspicious
   issues.append(PlantedIssue(row=r + 1, col="val_loss", issue_type="inconsistent_value",
+                             description="val_loss (0.15) significantly less than train_loss (0.28), potential data leakage",
+                             difficulty=3.0))
 
+  # Issue 2: Batch size not power of 2 (moderate — domain convention)
   r = 8  # EXP-009
   data[r][7] = "250"  # not a power of 2
   issues.append(PlantedIssue(row=r + 1, col="batch_size", issue_type="format_violation",
+                             description="batch_size 250 is not a power of 2", difficulty=2.0))
 
+  # Issue 3: GPU memory unreasonable for model (hard — requires model size reasoning)
   r = 6  # EXP-007 resnet18 on cifar10
   data[r][12] = "42.5"  # resnet18 shouldn't need 42.5 GB
   issues.append(PlantedIssue(row=r + 1, col="gpu_memory_gb", issue_type="statistical_outlier",
+                             description="resnet18 on cifar10 using 42.5 GB GPU memory is unreasonable",
+                             difficulty=3.0))
 
+  # Issue 4: Timestamp out of order (moderate — requires sequential comparison)
   r = 10  # EXP-011
   data[r][14] = "2024-03-02T09:00:00"  # should be after EXP-010's timestamp
   issues.append(PlantedIssue(row=r + 1, col="timestamp", issue_type="inconsistent_value",
+                             description="Timestamp 2024-03-02 is before EXP-010's timestamp 2024-03-11",
+                             difficulty=2.0))
 
+  # Issue 5: Train size smaller than test size (moderate — cross-column logic)
   r = 9  # EXP-010
   data[r][3] = "500"  # train_size=500 but test_size=1821
   issues.append(PlantedIssue(row=r + 1, col="train_size", issue_type="inconsistent_value",
+                             description="train_size (500) is smaller than test_size (1821)",
+                             difficulty=2.0))
 
+  # Issue 6: Negative training time (easy to spot)
   r = 13  # EXP-014
   data[r][13] = "-72.0"
   issues.append(PlantedIssue(row=r + 1, col="training_time_hours", issue_type="out_of_range",
+                             description="Negative training time", difficulty=1.0))
 
+  # Issue 7: Learning rate out of range (easy to spot)
   r = 12  # EXP-013
   data[r][6] = "2.5"  # way too high
   issues.append(PlantedIssue(row=r + 1, col="learning_rate", issue_type="out_of_range",
+                             description="Learning rate 2.5 exceeds maximum of 1.0", difficulty=1.5))
 
+  # Issue 8: Missing model name (hard — whitespace-only is subtle)
   r = 14  # EXP-015
   data[r][1] = " "
   issues.append(PlantedIssue(row=r + 1, col="model_name", issue_type="missing_value",
+                             description="model_name is whitespace-only", difficulty=2.5))
 
   corrupted = _rows_to_csv([header] + data)
 
 
   )
 
 
+# ---------------------------------------------------------------------------
+# Contamination rules for extensible task creation
+# ---------------------------------------------------------------------------
+
+# Each contamination rule is a callable: (rows, header, col_idx, row_idx, rng) -> (new_value, PlantedIssue)
+# Users can define their own and register them.
+
+CONTAMINATION_RULES = {
+    "missing_value": lambda rows, header, col_idx, row_idx, rng: (
+        "",
+        PlantedIssue(
+            row=row_idx + 1, col=header[col_idx], issue_type="missing_value",
+            description=f"Empty {header[col_idx]} field", difficulty=1.0,
+        ),
+    ),
+    "whitespace_value": lambda rows, header, col_idx, row_idx, rng: (
+        " ",
+        PlantedIssue(
+            row=row_idx + 1, col=header[col_idx], issue_type="missing_value",
+            description=f"Whitespace-only {header[col_idx]} field", difficulty=2.5,
+        ),
+    ),
+    "wrong_type_text": lambda rows, header, col_idx, row_idx, rng: (
+        rng.choice(["not-a-number", "N/A", "null", "undefined"]),
+        PlantedIssue(
+            row=row_idx + 1, col=header[col_idx], issue_type="wrong_type",
+            description=f"{header[col_idx]} is text instead of expected type", difficulty=1.0,
+        ),
+    ),
+    "negative_value": lambda rows, header, col_idx, row_idx, rng: (
+        str(-abs(float(rows[row_idx][col_idx]) if rows[row_idx][col_idx] else 1)),
+        PlantedIssue(
+            row=row_idx + 1, col=header[col_idx], issue_type="out_of_range",
+            description=f"Negative {header[col_idx]}", difficulty=1.0,
+        ),
+    ),
+}
+
+
+def create_task_from_config(
+    task_id: str,
+    name: str,
+    description: str,
+    schema_description: str,
+    validation_rules: str,
+    clean_csv: str,
+    contaminations: List[dict],
+    max_steps: int = 3,
+    seed: int = 42,
+) -> Task:
+    """
+    Create a custom task from a configuration dict.
+
+    Each contamination entry should have:
+      - rule: str (key in CONTAMINATION_RULES) or callable
+      - row: int (0-based row index in data)
+      - col: int (column index in header)
+      - difficulty: float (optional, overrides rule default)
+
+    Example:
+        contaminations = [
+            {"rule": "missing_value", "row": 2, "col": 1, "difficulty": 1.5},
+            {"rule": "negative_value", "row": 5, "col": 4},
+        ]
+    """
+    rng = random.Random(seed)
+    rows = _csv_to_rows(clean_csv)
+    header = rows[0]
+    data = rows[1:]
+    issues: List[PlantedIssue] = []
+
+    for spec in contaminations:
+        rule = spec["rule"]
+        row_idx = spec["row"]
+        col_idx = spec["col"]
+
+        if callable(rule):
+            new_val, issue = rule(data, header, col_idx, row_idx, rng)
+        elif rule in CONTAMINATION_RULES:
+            new_val, issue = CONTAMINATION_RULES[rule](data, header, col_idx, row_idx, rng)
+        else:
+            raise ValueError(f"Unknown contamination rule: {rule}. Available: {list(CONTAMINATION_RULES.keys())}")
+
+        data[row_idx][col_idx] = new_val
+        if "difficulty" in spec:
+            issue.difficulty = spec["difficulty"]
+        issues.append(issue)
+
+    corrupted = _rows_to_csv([header] + data)
+
+    return Task(
+        task_id=task_id,
+        name=name,
+        description=description,
+        schema_description=schema_description,
+        validation_rules=validation_rules,
+        clean_csv=clean_csv,
+        planted_issues=issues,
+        corrupted_csv=corrupted,
+        max_steps=max_steps,
+    )
+
+
+def register_task(task_id: str, factory_fn):
+    """Register a custom task factory. Factory must accept (seed: int) -> Task."""
+    TASK_REGISTRY[task_id] = factory_fn
+
+
+def register_contamination_rule(name: str, rule_fn):
+    """
+    Register a custom contamination rule.
+
+    rule_fn signature: (rows, header, col_idx, row_idx, rng) -> (new_value, PlantedIssue)
+    """
+    CONTAMINATION_RULES[name] = rule_fn
+
495
  # ---------------------------------------------------------------------------
496
  # Task registry
497
  # ---------------------------------------------------------------------------
inference.py CHANGED
@@ -6,21 +6,23 @@ LLM agent that plays the DataQA environment.
 Uses the OpenAI client to interact with any OpenAI-compatible LLM API.
 
 Required environment variables:
-    API_BASE_URL - LLM API endpoint (e.g., https://api.groq.com/openai/v1)
-    MODEL_NAME   - Model identifier (e.g., llama-3.3-70b-versatile)
-    HF_TOKEN     - HuggingFace token (for HF Spaces access)
-
-Structured logging format: [START], [STEP], [END] tags for evaluation.
+    API_BASE_URL - LLM API endpoint (e.g., https://router.huggingface.co/v1)
+    MODEL_NAME   - Model identifier (e.g., Qwen/Qwen2.5-72B-Instruct)
+    HF_TOKEN     - HuggingFace token / API key
+
+STDOUT FORMAT (mandatory for evaluation):
+    [START] task=<task_name> env=<benchmark> model=<model_name>
+    [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+    [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
 """
 
 from __future__ import annotations
 
-import json
 import os
 import re
 import sys
 import time
+from typing import List, Optional
 
 import requests
 from openai import OpenAI
@@ -28,52 +30,43 @@ from openai import OpenAI
 # ---------------------------------------------------------------------------
 # Configuration
 # ---------------------------------------------------------------------------
-API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.groq.com/openai/v1")
-MODEL_NAME = os.environ.get("MODEL_NAME", "llama-3.3-70b-versatile")
-HF_TOKEN = os.environ.get("HF_TOKEN", "")
-ENV_URL = os.environ.get("ENV_URL", "http://localhost:8000")
+API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
 
+BENCHMARK = "dataqa_env"
 TASKS = ["easy", "medium", "hard"]
 MAX_STEPS_PER_TASK = 3
 
 
 # ---------------------------------------------------------------------------
-# Logging helpers (structured stdout for evaluation)
+# Logging helpers (structured stdout, exact format required by evaluation)
 # ---------------------------------------------------------------------------
 
-def log_start(task_id: str, metadata: Optional[dict] = None):
-    entry = {"event": "START", "task_id": task_id, "timestamp": time.time()}
-    if metadata:
-        entry["metadata"] = metadata
-    print(f"[START] {json.dumps(entry)}", flush=True)
-
-
-def log_step(task_id: str, step: int, reward: float, details: Optional[dict] = None):
-    entry = {
-        "event": "STEP",
-        "task_id": task_id,
-        "step": step,
-        "reward": reward,
-        "timestamp": time.time(),
-    }
-    if details:
-        entry["details"] = details
-    print(f"[STEP] {json.dumps(entry)}", flush=True)
-
-
-def log_end(task_id: str, final_score: float, metadata: Optional[dict] = None):
-    entry = {
-        "event": "END",
-        "task_id": task_id,
-        "final_score": final_score,
-        "timestamp": time.time(),
-    }
-    if metadata:
-        entry["metadata"] = metadata
-    print(f"[END] {json.dumps(entry)}", flush=True)
+def log_start(task: str, env: str, model: str) -> None:
+    print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+    error_val = error if error else "null"
+    done_val = str(done).lower()
+    print(
+        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
+        flush=True,
+    )
+
+
+def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+    print(
+        f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+        flush=True,
+    )
 
 
 # ---------------------------------------------------------------------------
-# Environment HTTP client (simple, no WebSocket needed for inference)
+# Environment HTTP client
 # ---------------------------------------------------------------------------
 
 class EnvHTTPClient:
@@ -108,11 +101,6 @@ class EnvHTTPClient:
         r.raise_for_status()
         return r.json()
 
-    def state(self) -> dict:
-        r = self.session.get(f"{self.base_url}/state", timeout=10)
-        r.raise_for_status()
-        return r.json()
-
 
 # ---------------------------------------------------------------------------
 # LLM Agent
@@ -142,7 +130,6 @@ CRITICAL INSTRUCTIONS FOR ROW NUMBERING:
 - Row numbers refer to the ROW POSITION in the CSV data, NOT the value of any ID column
 - Row 1 = the FIRST data row after the header
 - Row 2 = the SECOND data row after the header
-- For example, if the CSV has header on line 1 and data starting on line 2, the data on line 2 is row 1, line 3 is row 2, etc.
 - DO NOT use the employee_id, order_id, or experiment_id as the row number
 - Column names must match exactly (use the CSV header names, lowercase)
 - Check EVERY row and EVERY column systematically
@@ -188,14 +175,12 @@ def parse_llm_response(response: str) -> list[str]:
         line = re.sub(r"^\s*[-*]\s*", "", line)
         line = line.strip()
         if "row" in line.lower() and "col" in line.lower():
-            # Lenient regex: accept : or = as delimiters, case-insensitive
             match = re.search(
                 r"row\s*[:=]\s*(\d+)\s*[,;\s]+col(?:umn)?\s*[:=]\s*([\w_]+)\s*[,;\s]+issue\s*[:=]\s*([\w_]+)",
                 line,
                 re.IGNORECASE,
             )
             if match:
-                # Normalize to lowercase canonical format
                 normalized = f"row:{match.group(1)},col:{match.group(2).lower()},issue:{match.group(3).lower()}"
                 issues.append(normalized)
     return issues
@@ -203,68 +188,84 @@ def parse_llm_response(response: str) -> list[str]:
 
 def run_task(client: OpenAI, env: EnvHTTPClient, task_id: str) -> float:
     """Run a single task and return the best score."""
-    log_start(task_id)
-
-    # Reset environment for this task
-    reset_response = env.reset(task_id=task_id)
-    observation = reset_response.get("observation", reset_response)
-
-    best_score = 0.0
-    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
-
-    for step_num in range(1, MAX_STEPS_PER_TASK + 1):
-        user_prompt = build_user_prompt(observation)
-        messages_for_call = messages + [{"role": "user", "content": user_prompt}]
-
-        # Call LLM with retry on rate limit
-        llm_output = ""
-        for attempt in range(3):
-            try:
-                response = client.chat.completions.create(
-                    model=MODEL_NAME,
-                    messages=messages_for_call,
-                    temperature=0.1,
-                    max_tokens=2048,
-                )
-                llm_output = response.choices[0].message.content or ""
-                break
-            except Exception as e:
-                if "rate_limit" in str(e).lower() or "429" in str(e):
-                    wait = 10 * (attempt + 1)
-                    print(f"[WARN] Rate limited, waiting {wait}s...", flush=True)
-                    time.sleep(wait)
-                else:
-                    print(f"[ERROR] LLM call failed: {e}", file=sys.stderr, flush=True)
-                    break
-
-        # Parse issues from LLM response
-        issues = parse_llm_response(llm_output)
-
-        if not issues:
-            print(f"[WARN] No issues parsed from LLM response for {task_id} step {step_num}", file=sys.stderr, flush=True)
-
-        # Submit to environment
-        step_response = env.step(issues, task_id=task_id)
-        observation = step_response.get("observation", step_response)
-
-        # reward and done are at the top level of the response, not inside observation
-        reward = float(step_response.get("reward", 0.0) or 0.0)
-        done = bool(step_response.get("done", False))
-        best_score = max(best_score, reward)
-
-        log_step(task_id, step_num, reward, {
-            "issues_reported": len(issues),
-            "feedback": observation.get("feedback", ""),
-        })
-
-        if done:
-            break
-
-        # Add context for next attempt
-        messages.append({"role": "user", "content": user_prompt})
-        messages.append({"role": "assistant", "content": llm_output})
-
-    log_end(task_id, best_score)
+    log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+
+    rewards: List[float] = []
+    steps_taken = 0
+    best_score = 0.0
+    success = False
+
+    try:
+        # Reset environment for this task
+        reset_response = env.reset(task_id=task_id)
+        observation = reset_response.get("observation", reset_response)
+
+        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+
+        for step_num in range(1, MAX_STEPS_PER_TASK + 1):
+            user_prompt = build_user_prompt(observation)
+            messages_for_call = messages + [{"role": "user", "content": user_prompt}]
+
+            # Call LLM with retry on rate limit
+            llm_output = ""
+            error_msg = None
+            for attempt in range(3):
+                try:
+                    response = client.chat.completions.create(
+                        model=MODEL_NAME,
+                        messages=messages_for_call,
+                        temperature=0.1,
+                        max_tokens=2048,
+                    )
+                    llm_output = response.choices[0].message.content or ""
+                    break
+                except Exception as e:
+                    if "rate_limit" in str(e).lower() or "429" in str(e):
+                        wait = 10 * (attempt + 1)
+                        print(f"[DEBUG] Rate limited, waiting {wait}s...", file=sys.stderr, flush=True)
+                        time.sleep(wait)
+                    else:
+                        error_msg = str(e)
+                        print(f"[DEBUG] LLM call failed: {e}", file=sys.stderr, flush=True)
+                        break
+
+            # Parse issues from LLM response
+            issues = parse_llm_response(llm_output)
+            action_str = ";".join(issues) if issues else "none"
+
+            if not issues and not error_msg:
+                error_msg = "no issues parsed from LLM response"
+
+            # Submit to environment
+            step_response = env.step(issues, task_id=task_id)
+            observation = step_response.get("observation", step_response)
+
+            # reward and done are at the top level of the response, not inside observation
+            reward = float(step_response.get("reward", 0.0) or 0.0)
+            done = bool(step_response.get("done", False))
+            best_score = max(best_score, reward)
+            rewards.append(reward)
+            steps_taken = step_num
+
+            log_step(
+                step=step_num,
+                action=action_str,
+                reward=reward,
+                done=done,
+                error=error_msg,
+            )
+
+            if done:
+                break
+
+            # Add context for next attempt
+            messages.append({"role": "user", "content": user_prompt})
+            messages.append({"role": "assistant", "content": llm_output})
+
+        success = best_score >= 0.5
+
+    finally:
+        log_end(success=success, steps=steps_taken, score=best_score, rewards=rewards)
+
     return best_score
 
 
@@ -273,49 +274,38 @@ def run_task(client: OpenAI, env: EnvHTTPClient, task_id: str) -> float:
 # ---------------------------------------------------------------------------
 
 def main():
-    print(f"[INFO] DataQA Inference starting", flush=True)
-    print(f"[INFO] ENV_URL={ENV_URL}", flush=True)
-    print(f"[INFO] API_BASE_URL={API_BASE_URL}", flush=True)
-    print(f"[INFO] MODEL_NAME={MODEL_NAME}", flush=True)
+    print(f"[DEBUG] DataQA Inference starting", file=sys.stderr, flush=True)
+    print(f"[DEBUG] ENV_URL={ENV_URL}", file=sys.stderr, flush=True)
+    print(f"[DEBUG] API_BASE_URL={API_BASE_URL}", file=sys.stderr, flush=True)
+    print(f"[DEBUG] MODEL_NAME={MODEL_NAME}", file=sys.stderr, flush=True)
 
     # Initialize clients
     env = EnvHTTPClient(ENV_URL)
     llm_client = OpenAI(
         base_url=API_BASE_URL,
-        api_key=os.environ.get("LLM_API_KEY", HF_TOKEN or "no-key"),
+        api_key=API_KEY or "no-key",
    )
 
     # Check environment health
     if not env.health():
-        print("[ERROR] Environment is not healthy. Exiting.", file=sys.stderr, flush=True)
+        print("[DEBUG] Environment is not healthy. Exiting.", file=sys.stderr, flush=True)
         sys.exit(1)
 
-    print(f"[INFO] Environment is healthy", flush=True)
+    print(f"[DEBUG] Environment is healthy", file=sys.stderr, flush=True)
 
     # Run all tasks
     scores = {}
     for task_id in TASKS:
-        print(f"\n{'='*60}", flush=True)
-        print(f"[INFO] Starting task: {task_id}", flush=True)
-        print(f"{'='*60}", flush=True)
-
         try:
             score = run_task(llm_client, env, task_id)
             scores[task_id] = score
-            print(f"[INFO] Task {task_id} completed with score: {score:.3f}", flush=True)
         except Exception as e:
-            print(f"[ERROR] Task {task_id} failed: {e}", file=sys.stderr, flush=True)
+            print(f"[DEBUG] Task {task_id} failed: {e}", file=sys.stderr, flush=True)
             scores[task_id] = 0.0
 
-    # Summary
-    print(f"\n{'='*60}", flush=True)
-    print("[INFO] FINAL RESULTS", flush=True)
-    print(f"{'='*60}", flush=True)
-    for task_id, score in scores.items():
-        print(f"[INFO] {task_id}: {score:.3f}", flush=True)
-
+    # Summary to stderr (stdout is reserved for structured logs only)
     avg_score = sum(scores.values()) / len(scores) if scores else 0.0
-    print(f"[INFO] Average score: {avg_score:.3f}", flush=True)
+    print(f"\n[DEBUG] FINAL RESULTS: {scores} avg={avg_score:.3f}", file=sys.stderr, flush=True)
 
 
 if __name__ == "__main__":
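The lenient row/col/issue pattern used by `parse_llm_response` in the diff above can be exercised on its own. This standalone demo reuses the same regex and the same lowercase normalization:

```python
import re

# Same lenient pattern as parse_llm_response: accepts ':' or '=' as delimiters,
# "col" or "column", and is case-insensitive.
PATTERN = re.compile(
    r"row\s*[:=]\s*(\d+)\s*[,;\s]+col(?:umn)?\s*[:=]\s*([\w_]+)\s*[,;\s]+issue\s*[:=]\s*([\w_]+)",
    re.IGNORECASE,
)

for line in [
    "Row: 3, Col: salary, Issue: wrong_type",
    "row=12; column=total; issue=Inconsistent_Value",
]:
    m = PATTERN.search(line)
    # Normalize to the canonical lowercase form the environment expects.
    print(f"row:{m.group(1)},col:{m.group(2).lower()},issue:{m.group(3).lower()}")
# row:3,col:salary,issue:wrong_type
# row:12,col:total,issue:inconsistent_value
```

Both delimiter styles collapse to the same canonical string, which is what keeps the parser robust to small formatting drift in LLM output.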
prevalidation_script.sh ADDED
@@ -0,0 +1,185 @@
+#!/usr/bin/env bash
+#
+# validate-submission.sh — OpenEnv Submission Validator
+#
+# Checks that your HF Space is live, your Docker image builds, and openenv validate passes.
+#
+# Prerequisites:
+#   - Docker: https://docs.docker.com/get-docker/
+#   - openenv-core: pip install openenv-core
+#   - curl (usually pre-installed)
+#
+# Run:
+#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
+#
+# Or download and run locally:
+#   chmod +x validate-submission.sh
+#   ./validate-submission.sh <ping_url> [repo_dir]
+#
+# Arguments:
+#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
+#   repo_dir   Path to your repo (default: current directory)
+#
+# Examples:
+#   ./validate-submission.sh https://my-team.hf.space
+#   ./validate-submission.sh https://my-team.hf.space ./my-repo
+#
+
+set -uo pipefail
+
+DOCKER_BUILD_TIMEOUT=600
+if [ -t 1 ]; then
+  RED='\033[0;31m'
+  GREEN='\033[0;32m'
+  YELLOW='\033[1;33m'
+  BOLD='\033[1m'
+  NC='\033[0m'
+else
+  RED='' GREEN='' YELLOW='' BOLD='' NC=''
+fi
+
+run_with_timeout() {
+  local secs="$1"; shift
+  if command -v timeout &>/dev/null; then
+    timeout "$secs" "$@"
+  elif command -v gtimeout &>/dev/null; then
+    gtimeout "$secs" "$@"
+  else
+    "$@" &
+    local pid=$!
+    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
+    local watcher=$!
+    wait "$pid" 2>/dev/null
+    local rc=$?
+    kill "$watcher" 2>/dev/null
+    wait "$watcher" 2>/dev/null
+    return $rc
+  fi
+}
+
+portable_mktemp() {
+  local prefix="${1:-validate}"
+  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
+}
+
+CLEANUP_FILES=()
+cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
+trap cleanup EXIT
+
+PING_URL="${1:-}"
+REPO_DIR="${2:-.}"
+
+if [ -z "$PING_URL" ]; then
+  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
+  printf "\n"
+  printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
+  printf "  repo_dir   Path to your repo (default: current directory)\n"
+  exit 1
+fi
+
+if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
+  printf "Error: directory '%s' not found\n" "${2:-.}"
+  exit 1
+fi
+PING_URL="${PING_URL%/}"
+export PING_URL
+PASS=0
+
+log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
+pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
+fail() { log "${RED}FAILED${NC} -- $1"; }
+hint() { printf "       ${YELLOW}Hint:${NC} %b\n" "$1"; }
+stop_at() {
+  printf "\n"
+  printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
+  exit 1
+}
+
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+log "Repo:     $REPO_DIR"
+log "Ping URL: $PING_URL"
+printf "\n"
+
+log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
+
+CURL_OUTPUT=$(portable_mktemp "validate-curl")
+CLEANUP_FILES+=("$CURL_OUTPUT")
+HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
+  -H "Content-Type: application/json" -d '{}' \
+  "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")
+
+if [ "$HTTP_CODE" = "200" ]; then
+  pass "HF Space is live and responds to /reset"
+elif [ "$HTTP_CODE" = "000" ]; then
+  fail "HF Space not reachable (connection failed or timed out)"
+  hint "Check your network connection and that the Space is running."
+  hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
+  stop_at "Step 1"
+else
+  fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
+  hint "Make sure your Space is running and the URL is correct."
+  hint "Try opening $PING_URL in your browser first."
+  stop_at "Step 1"
+fi
+
+log "${BOLD}Step 2/3: Running docker build${NC} ..."
+
+if ! command -v docker &>/dev/null; then
+  fail "docker command not found"
+  hint "Install Docker: https://docs.docker.com/get-docker/"
+  stop_at "Step 2"
+fi
+
+if [ -f "$REPO_DIR/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR"
+elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
+  DOCKER_CONTEXT="$REPO_DIR/server"
+else
+  fail "No Dockerfile found in repo root or server/ directory"
+  stop_at "Step 2"
+fi
+
+log "  Found Dockerfile in $DOCKER_CONTEXT"
+
+BUILD_OK=false
+BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
+
+if [ "$BUILD_OK" = true ]; then
+  pass "Docker build succeeded"
+else
+  fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
+  printf "%s\n" "$BUILD_OUTPUT" | tail -20
+  stop_at "Step 2"
+fi
+
+log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
+
+if ! command -v openenv &>/dev/null; then
+  fail "openenv command not found"
+  hint "Install it: pip install openenv-core"
+  stop_at "Step 3"
+fi
+
+VALIDATE_OK=false
+VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
+
+if [ "$VALIDATE_OK" = true ]; then
+  pass "openenv validate passed"
+  [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
+else
+  fail "openenv validate failed"
+  printf "%s\n" "$VALIDATE_OUTPUT"
+  stop_at "Step 3"
+fi
+
+printf "\n"
+printf "${BOLD}========================================${NC}\n"
+printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
+printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
+printf "${BOLD}========================================${NC}\n"
+printf "\n"
+
+exit 0
sample_inference_script.py ADDED
@@ -0,0 +1,188 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ """
+ Inference Script Example
+ ===================================
+ MANDATORY
+ - Before submitting, ensure the following variables are defined in your environment configuration:
+     API_BASE_URL      The API endpoint for the LLM.
+     MODEL_NAME        The model identifier to use for inference.
+     HF_TOKEN          Your Hugging Face / API key.
+     LOCAL_IMAGE_NAME  The name of the local image to use for the environment, if you are using the
+                       from_docker_image() method.
+
+ - Defaults are set only for API_BASE_URL and MODEL_NAME
+   (and should reflect your active inference setup):
+     API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
+     MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
+
+ - The inference script must be named `inference.py` and placed in the root directory of the project.
+ - Participants must use the OpenAI client for all LLM calls, using the variables above.
+
+ STDOUT FORMAT
+ - The script must emit exactly three line types to stdout, in this order:
+
+     [START] task=<task_name> env=<benchmark> model=<model_name>
+     [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+     [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+
+ Rules:
+ - One [START] line at episode begin.
+ - One [STEP] line per step, immediately after env.step() returns.
+ - One [END] line after env.close(), always emitted (even on exception).
+ - reward and rewards are formatted to 2 decimal places.
+ - done and success are lowercase booleans: true or false.
+ - error is the raw last_action_error string, or null if none.
+ - All fields on a single line with no newlines within a line.
+ - Each task should return a score in [0, 1].
+
+ Example:
+     [START] task=click-test env=miniwob model=Qwen3-VL-30B
+     [STEP] step=1 action=click('123') reward=0.00 done=false error=null
+     [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
+     [STEP] step=3 action=click('789') reward=1.00 done=true error=null
+     [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
+ """
+
+ import asyncio
+ import os
+ import textwrap
+ from typing import List, Optional
+
+ from openai import OpenAI
+
+ from my_env_v4 import MyEnvV4Action, MyEnvV4Env
+
+ IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")  # Set if you are using a local Docker image
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+
+ API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
+ MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
+ TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
+ BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
+ MAX_STEPS = 8
+ TEMPERATURE = 0.7
+ MAX_TOKENS = 150
+ SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]
+
+ # Approximate max possible reward: reward is 0.1 per character, with message
+ # length bounded by MAX_TOKENS per step, across all steps.
+ _MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
+ MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP
+
+ SYSTEM_PROMPT = textwrap.dedent(
+     """
+     You are interacting with a simple echo environment.
+     Each turn you must send a message. The environment will echo it back.
+     Reward is proportional to message length: reward = len(message) * 0.1
+     Your goal is to maximize total reward by sending meaningful, substantive messages.
+     Reply with exactly one message string — no quotes, no prefixes, just the message text.
+     """
+ ).strip()
+
+
+ def log_start(task: str, env: str, model: str) -> None:
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+     error_val = error if error else "null"
+     done_val = str(done).lower()
+     print(
+         f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
+         flush=True,
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+     print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
+
+
+ def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
+     history_block = "\n".join(history[-4:]) if history else "None"
+     return textwrap.dedent(
+         f"""
+         Step: {step}
+         Last echoed message: {last_echoed!r}
+         Last reward: {last_reward:.2f}
+         Previous steps:
+         {history_block}
+         Send your next message.
+         """
+     ).strip()
+
+
+ def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
+     user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
+     try:
+         completion = client.chat.completions.create(
+             model=MODEL_NAME,
+             messages=[
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": user_prompt},
+             ],
+             temperature=TEMPERATURE,
+             max_tokens=MAX_TOKENS,
+             stream=False,
+         )
+         text = (completion.choices[0].message.content or "").strip()
+         return text if text else "hello"
+     except Exception as exc:
+         print(f"[DEBUG] Model request failed: {exc}", flush=True)
+         return "hello"
+
+
+ async def main() -> None:
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+     env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)
+
+     history: List[str] = []
+     rewards: List[float] = []
+     steps_taken = 0
+     score = 0.0
+     success = False
+
+     log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
+
+     try:
+         result = await env.reset()  # OpenEnv reset()
+         last_echoed = result.observation.echoed_message
+         last_reward = 0.0
+
+         for step in range(1, MAX_STEPS + 1):
+             if result.done:
+                 break
+
+             message = get_model_message(client, step, last_echoed, last_reward, history)
+
+             result = await env.step(MyEnvV4Action(message=message))
+             obs = result.observation
+
+             reward = result.reward or 0.0
+             done = result.done
+             error = None
+
+             rewards.append(reward)
+             steps_taken = step
+             last_echoed = obs.echoed_message
+             last_reward = reward
+
+             log_step(step=step, action=message, reward=reward, done=done, error=error)
+
+             history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")
+
+             if done:
+                 break
+
+         score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
+         score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
+         success = score >= SUCCESS_SCORE_THRESHOLD
+
+     finally:
+         try:
+             await env.close()
+         except Exception as e:
+             print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
+         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
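
The `[STEP]` contract documented in the docstring above can be sanity-checked with a small parser sketch. This is not part of the submission; the helper name and regex are illustrative, assuming the exact field order and `null` convention described in the STDOUT FORMAT rules:

```python
import re
from typing import Dict, Optional

# Fields appear in a fixed order; action is matched non-greedily so it may
# contain spaces, and "null" maps to Python's None.
STEP_RE = re.compile(
    r"\[STEP\] step=(?P<step>\d+) action=(?P<action>.+?) "
    r"reward=(?P<reward>[\d.]+) done=(?P<done>true|false) error=(?P<error>.+)"
)

def parse_step_line(line: str) -> Dict[str, object]:
    """Parse one [STEP] line into typed fields; raise ValueError on mismatch."""
    m = STEP_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a [STEP] line: {line!r}")
    error: Optional[str] = None if m.group("error") == "null" else m.group("error")
    return {
        "step": int(m.group("step")),
        "action": m.group("action"),
        "reward": float(m.group("reward")),
        "done": m.group("done") == "true",
        "error": error,
    }
```

For example, `parse_step_line("[STEP] step=3 action=click('789') reward=1.00 done=true error=null")` yields step 3, reward 1.0, done True, error None.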
tests/__init__.py ADDED
File without changes
tests/test_environment.py ADDED
@@ -0,0 +1,249 @@
+ """Tests for the DataQA environment (reset, step, scoring)."""
+
+ import pytest
+ from dataqa_env.server.environment import (
+     DataQAEnvironment,
+     parse_issue_key,
+     compute_f1,
+     compute_weighted_reward,
+ )
+ from dataqa_env.models import DataQAAction
+ from dataqa_env.server.tasks import PlantedIssue
+
+
+ class TestParseIssueKey:
+     def test_standard_format(self):
+         assert parse_issue_key("row:3,col:salary,issue:missing_value") == "row:3,col:salary,issue:missing_value"
+
+     def test_with_equals(self):
+         assert parse_issue_key("row=3,col=salary,issue=missing_value") == "row:3,col:salary,issue:missing_value"
+
+     def test_case_insensitive(self):
+         assert parse_issue_key("Row:3,Col:Salary,Issue:Missing_Value") == "row:3,col:salary,issue:missing_value"
+
+     def test_with_spaces(self):
+         assert parse_issue_key("row: 3, col: salary, issue: missing_value") == "row:3,col:salary,issue:missing_value"
+
+     def test_unparseable(self):
+         assert parse_issue_key("this is garbage") is None
+
+     def test_partial_match(self):
+         assert parse_issue_key("row:3,col:salary") is None  # missing issue
+
+     def test_empty_string(self):
+         assert parse_issue_key("") is None
+
+     def test_semicolon_separator(self):
+         result = parse_issue_key("row:3;col:salary;issue:missing_value")
+         assert result == "row:3,col:salary,issue:missing_value"
+
+
+ class TestComputeF1:
+     def test_perfect_match(self):
+         keys = {"row:1,col:a,issue:missing_value"}
+         result = compute_f1(keys, keys)
+         assert result["f1"] == 1.0
+         assert result["tp"] == 1
+         assert result["fp"] == 0
+         assert result["fn"] == 0
+
+     def test_no_reported_no_planted(self):
+         result = compute_f1(set(), set())
+         assert result["f1"] == 1.0
+
+     def test_no_reported_some_planted(self):
+         planted = {"row:1,col:a,issue:missing_value"}
+         result = compute_f1(set(), planted)
+         assert result["f1"] == 0.0
+         assert result["fn"] == 1
+
+     def test_all_false_positives(self):
+         reported = {"row:99,col:x,issue:wrong_type"}
+         planted = {"row:1,col:a,issue:missing_value"}
+         result = compute_f1(reported, planted)
+         assert result["tp"] == 0
+         assert result["fp"] == 1
+         assert result["fn"] == 1
+         assert result["f1"] == 0.0
+
+     def test_partial_match(self):
+         reported = {"row:1,col:a,issue:missing_value", "row:2,col:b,issue:wrong_type"}
+         planted = {"row:1,col:a,issue:missing_value", "row:3,col:c,issue:duplicate_row"}
+         result = compute_f1(reported, planted)
+         assert result["tp"] == 1
+         assert result["fp"] == 1
+         assert result["fn"] == 1
+         assert 0 < result["f1"] < 1
+
+     def test_precision_recall_calculation(self):
+         reported = {"a", "b", "c"}
+         planted = {"a", "b", "d"}
+         result = compute_f1(reported, planted)
+         assert result["precision"] == pytest.approx(2 / 3)
+         assert result["recall"] == pytest.approx(2 / 3)
+
+
+ class TestComputeWeightedReward:
+     def test_perfect_match(self):
+         issues = [
+             PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=1.0),
+             PlantedIssue(row=2, col="b", issue_type="wrong_type", description="", difficulty=3.0),
+         ]
+         reported = {i.to_key() for i in issues}
+         result = compute_weighted_reward(reported, issues)
+         assert result["weighted_reward"] == 1.0
+
+     def test_empty_both(self):
+         result = compute_weighted_reward(set(), [])
+         assert result["weighted_reward"] == 1.0
+
+     def test_no_reported(self):
+         issues = [PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=2.0)]
+         result = compute_weighted_reward(set(), issues)
+         assert result["weighted_reward"] == 0.0
+         assert result["difficulty_missed"] == 2.0
+
+     def test_hard_issue_worth_more(self):
+         easy = PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=1.0)
+         hard = PlantedIssue(row=2, col="b", issue_type="statistical_outlier", description="", difficulty=3.0)
+         issues = [easy, hard]
+
+         # Finding only the hard issue should score higher than only the easy issue
+         hard_found = compute_weighted_reward({hard.to_key()}, issues)
+         easy_found = compute_weighted_reward({easy.to_key()}, issues)
+         assert hard_found["weighted_reward"] > easy_found["weighted_reward"]
+
+     def test_false_positives_reduce_reward(self):
+         issues = [PlantedIssue(row=1, col="a", issue_type="missing_value", description="", difficulty=1.0)]
+         correct = {issues[0].to_key()}
+         with_fp = correct | {"row:99,col:x,issue:wrong_type"}
+         r_correct = compute_weighted_reward(correct, issues)
+         r_with_fp = compute_weighted_reward(with_fp, issues)
+         assert r_correct["weighted_reward"] > r_with_fp["weighted_reward"]
+
+
+ class TestDataQAEnvironment:
+     @pytest.fixture
+     def env(self):
+         return DataQAEnvironment()
+
+     def test_reset_returns_observation(self, env):
+         obs = env.reset(task_id="easy")
+         assert obs.dataset_csv
+         assert obs.schema_description
+         assert obs.validation_rules
+         assert obs.task_description
+         assert obs.num_issues_hint == 4
+         assert obs.max_steps == 3
+         assert obs.done is False
+         assert obs.reward == 0.0
+
+     def test_reset_medium(self, env):
+         obs = env.reset(task_id="medium")
+         assert obs.num_issues_hint == 6
+
+     def test_reset_hard(self, env):
+         obs = env.reset(task_id="hard")
+         assert obs.num_issues_hint == 8
+
+     def test_step_with_correct_issues(self, env):
+         env.reset(task_id="easy")
+         # Submit all correct issues for easy task
+         action = DataQAAction(
+             issues=[
+                 "row:4,col:name,issue:missing_value",
+                 "row:7,col:salary,issue:wrong_type",
+                 "row:11,col:employee_id,issue:duplicate_row",
+                 "row:9,col:salary,issue:out_of_range",
+             ],
+             task_id="easy",
+         )
+         obs = env.step(action)
+         assert obs.done is True
+         assert obs.reward >= 0.999
+
+     def test_step_with_partial_issues(self, env):
+         env.reset(task_id="easy")
+         action = DataQAAction(
+             issues=["row:4,col:name,issue:missing_value"],
+             task_id="easy",
+         )
+         obs = env.step(action)
+         assert 0 < obs.reward < 1.0
+         assert obs.done is False
+
+     def test_step_with_no_issues(self, env):
+         env.reset(task_id="easy")
+         action = DataQAAction(issues=[], task_id="easy")
+         obs = env.step(action)
+         assert obs.reward == 0.0
+
+     def test_step_exhausts_max_steps(self, env):
+         env.reset(task_id="easy")
+         for _ in range(3):
+             action = DataQAAction(issues=["row:99,col:x,issue:wrong_type"], task_id="easy")
+             obs = env.step(action)
+         assert obs.done is True
+
+     def test_auto_reset_on_step(self, env):
+         # step() without prior reset should auto-reset
+         action = DataQAAction(
+             issues=["row:4,col:name,issue:missing_value"],
+             task_id="easy",
+         )
+         obs = env.step(action)
+         assert obs.task_id == "easy"
+
+     def test_state_tracking(self, env):
+         env.reset(task_id="easy")
+         assert env.state.task_id == "easy"
+         assert env.state.current_step == 0
+         assert env.state.best_score == 0.0
+
+         action = DataQAAction(issues=["row:4,col:name,issue:missing_value"], task_id="easy")
+         env.step(action)
+         assert env.state.current_step == 1
+         assert env.state.best_score > 0.0
+
+     def test_best_score_monotonic(self, env):
+         env.reset(task_id="easy")
+         action1 = DataQAAction(
+             issues=["row:4,col:name,issue:missing_value", "row:7,col:salary,issue:wrong_type"],
+             task_id="easy",
+         )
+         env.step(action1)
+         score_after_1 = env.state.best_score
+
+         # Worse submission shouldn't decrease best_score
+         action2 = DataQAAction(issues=["row:99,col:x,issue:wrong_type"], task_id="easy")
+         env.step(action2)
+         assert env.state.best_score >= score_after_1
+
+     def test_metadata_included_in_observation(self, env):
+         env.reset(task_id="easy")
+         action = DataQAAction(issues=["row:4,col:name,issue:missing_value"], task_id="easy")
+         obs = env.step(action)
+         assert "f1" in obs.metadata
+         assert "weighted_reward" in obs.metadata
+         assert "tp" in obs.metadata
+         assert "difficulty_found" in obs.metadata
+
+     def test_parse_error_in_feedback(self, env):
+         env.reset(task_id="easy")
+         action = DataQAAction(issues=["garbage input"], task_id="easy")
+         obs = env.step(action)
+         assert "Parse error" in obs.feedback
+
+     def test_concurrent_sessions_flag(self):
+         assert DataQAEnvironment.SUPPORTS_CONCURRENT_SESSIONS is True
+
+     def test_reward_between_0_and_1(self, env):
+         """Hackathon requirement: scores must be 0.0-1.0."""
+         env.reset(task_id="hard")
+         for _ in range(3):
+             action = DataQAAction(
+                 issues=["row:1,col:x,issue:wrong_type", "row:99,col:y,issue:missing_value"],
+                 task_id="hard",
+             )
+             obs = env.step(action)
+             assert 0.0 <= obs.reward <= 1.0
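
The set-based F1 behavior these tests assert (empty/empty scores 1.0, symmetric 2/3 precision and recall) can be sketched as a standalone helper. This is a minimal illustration of the scoring rule, not the environment's actual `compute_f1` implementation:

```python
def f1_counts(reported: set, planted: set) -> dict:
    """Set-based precision/recall/F1 over issue keys."""
    tp = len(reported & planted)   # correctly reported issues
    fp = len(reported - planted)   # reported but not planted
    fn = len(planted - reported)   # planted but missed
    if not reported and not planted:
        # Convention: nothing planted, nothing reported is a perfect score.
        return {"tp": 0, "fp": 0, "fn": 0, "precision": 1.0, "recall": 1.0, "f1": 1.0}
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall, "f1": f1}
```

With `reported = {"a", "b", "c"}` and `planted = {"a", "b", "d"}`, precision and recall are both 2/3, so F1 is also 2/3, matching `test_precision_recall_calculation` above.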
tests/test_extensibility.py ADDED
@@ -0,0 +1,185 @@
+ """Tests for the extensibility API — custom tasks and contamination rules."""
+
+ import pytest
+ from dataqa_env.server.tasks import (
+     PlantedIssue,
+     create_task_from_config,
+     register_task,
+     register_contamination_rule,
+     CONTAMINATION_RULES,
+     get_task,
+     list_tasks,
+ )
+ from dataqa_env.server.environment import DataQAEnvironment, compute_weighted_reward
+ from dataqa_env.models import DataQAAction
+
+
+ SIMPLE_CSV = "id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92\n4,Dave,78"
+
+
+ class TestCreateTaskFromConfig:
+     def test_basic_creation(self):
+         task = create_task_from_config(
+             task_id="test_custom",
+             name="Test Task",
+             description="Test",
+             schema_description="id: int, name: str, score: int",
+             validation_rules="No missing values",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[
+                 {"rule": "missing_value", "row": 0, "col": 1},
+             ],
+         )
+         assert task.task_id == "test_custom"
+         assert len(task.planted_issues) == 1
+         assert task.planted_issues[0].issue_type == "missing_value"
+         assert task.planted_issues[0].col == "name"
+
+     def test_multiple_contaminations(self):
+         task = create_task_from_config(
+             task_id="multi",
+             name="Multi",
+             description="Test",
+             schema_description="",
+             validation_rules="",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[
+                 {"rule": "missing_value", "row": 0, "col": 1},
+                 {"rule": "missing_value", "row": 2, "col": 1},
+             ],
+         )
+         assert len(task.planted_issues) == 2
+
+     def test_custom_difficulty_override(self):
+         task = create_task_from_config(
+             task_id="custom_diff",
+             name="Custom Difficulty",
+             description="Test",
+             schema_description="",
+             validation_rules="",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[
+                 {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 2.5},
+             ],
+         )
+         assert task.planted_issues[0].difficulty == 2.5
+
+     def test_callable_rule(self):
+         def custom_rule(rows, header, col_idx, row_idx, rng):
+             return "CORRUPTED", PlantedIssue(
+                 row=row_idx + 1, col=header[col_idx], issue_type="wrong_type",
+                 description="Custom corruption", difficulty=1.5,
+             )
+
+         task = create_task_from_config(
+             task_id="callable",
+             name="Callable Rule",
+             description="Test",
+             schema_description="",
+             validation_rules="",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[
+                 {"rule": custom_rule, "row": 1, "col": 2},
+             ],
+         )
+         assert task.planted_issues[0].issue_type == "wrong_type"
+         assert "CORRUPTED" in task.corrupted_csv
+
+     def test_unknown_rule_raises(self):
+         with pytest.raises(ValueError, match="Unknown contamination rule"):
+             create_task_from_config(
+                 task_id="bad",
+                 name="Bad",
+                 description="",
+                 schema_description="",
+                 validation_rules="",
+                 clean_csv=SIMPLE_CSV,
+                 contaminations=[{"rule": "nonexistent_rule", "row": 0, "col": 0}],
+             )
+
+
+ class TestRegisterContaminationRule:
+     def test_register_and_use(self):
+         def reverse_value(rows, header, col_idx, row_idx, rng):
+             val = rows[row_idx][col_idx]
+             return val[::-1], PlantedIssue(
+                 row=row_idx + 1, col=header[col_idx], issue_type="format_violation",
+                 description="Reversed value", difficulty=1.5,
+             )
+
+         register_contamination_rule("reverse", reverse_value)
+         assert "reverse" in CONTAMINATION_RULES
+
+         task = create_task_from_config(
+             task_id="rev_test",
+             name="Reverse Test",
+             description="",
+             schema_description="",
+             validation_rules="",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[{"rule": "reverse", "row": 0, "col": 1}],
+         )
+         assert task.planted_issues[0].issue_type == "format_violation"
+         # "Alice" reversed is "ecilA"
+         assert "ecilA" in task.corrupted_csv
+
+         # Cleanup
+         del CONTAMINATION_RULES["reverse"]
+
+
+ class TestRegisterTask:
+     def test_register_and_get(self):
+         task = create_task_from_config(
+             task_id="registered",
+             name="Registered Task",
+             description="Test registered task",
+             schema_description="id: int, name: str",
+             validation_rules="No missing values",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[{"rule": "missing_value", "row": 1, "col": 1}],
+         )
+         register_task("registered", lambda seed: task)
+         assert "registered" in list_tasks()
+
+         fetched = get_task("registered")
+         assert fetched.task_id == "registered"
+         assert len(fetched.planted_issues) == 1
+
+         # Cleanup
+         from dataqa_env.server.tasks import TASK_REGISTRY
+         del TASK_REGISTRY["registered"]
+
+
+ class TestCustomTaskInEnvironment:
+     def test_full_lifecycle(self):
+         """Custom task works end-to-end in the environment."""
+         task = create_task_from_config(
+             task_id="e2e_custom",
+             name="E2E Custom",
+             description="End-to-end test",
+             schema_description="id: int, name: str, score: int",
+             validation_rules="No missing values",
+             clean_csv=SIMPLE_CSV,
+             contaminations=[
+                 {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
+                 {"rule": "whitespace_value", "row": 2, "col": 1, "difficulty": 2.5},
+             ],
+         )
+         register_task("e2e_custom", lambda seed: task)
+
+         env = DataQAEnvironment()
+         obs = env.reset(task_id="e2e_custom")
+         assert obs.num_issues_hint == 2
+
+         # Submit correct answers
+         action = DataQAAction(
+             issues=[i.to_key() for i in task.planted_issues],
+             task_id="e2e_custom",
+         )
+         obs = env.step(action)
+         assert obs.done is True
+         assert obs.reward >= 0.999
+
+         # Cleanup
+         from dataqa_env.server.tasks import TASK_REGISTRY
+         del TASK_REGISTRY["e2e_custom"]
tests/test_inference.py ADDED
@@ -0,0 +1,148 @@
+ """Tests for the inference script's parsing, prompt building, and log format."""
+
+ import re
+ import pytest
+ import sys
+ import os
+
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
+ from inference import parse_llm_response, build_user_prompt, log_start, log_step, log_end
+
+
+ class TestParseLLMResponse:
+     def test_standard_format(self):
+         response = "row:1,col:name,issue:missing_value\nrow:2,col:salary,issue:wrong_type"
+         issues = parse_llm_response(response)
+         assert len(issues) == 2
+         assert "row:1,col:name,issue:missing_value" in issues
+
+     def test_numbered_list(self):
+         response = "1. row:1,col:name,issue:missing_value\n2. row:2,col:salary,issue:wrong_type"
+         issues = parse_llm_response(response)
+         assert len(issues) == 2
+
+     def test_bullet_list(self):
+         response = "- row:1,col:name,issue:missing_value\n* row:2,col:salary,issue:wrong_type"
+         issues = parse_llm_response(response)
+         assert len(issues) == 2
+
+     def test_equals_delimiter(self):
+         response = "row=1,col=name,issue=missing_value"
+         issues = parse_llm_response(response)
+         assert len(issues) == 1
+         assert issues[0] == "row:1,col:name,issue:missing_value"
+
+     def test_mixed_case(self):
+         response = "Row:1,Col:Name,Issue:Missing_Value"
+         issues = parse_llm_response(response)
+         assert len(issues) == 1
+         assert issues[0] == "row:1,col:name,issue:missing_value"
+
+     def test_empty_response(self):
+         assert parse_llm_response("") == []
+         assert parse_llm_response("   ") == []
+
+     def test_garbage_lines_skipped(self):
+         response = "Here are the issues:\nrow:1,col:name,issue:missing_value\nNo more issues."
+         issues = parse_llm_response(response)
+         assert len(issues) == 1
+
+     def test_deduplication_not_applied(self):
+         response = "row:1,col:name,issue:missing_value\nrow:1,col:name,issue:missing_value"
+         issues = parse_llm_response(response)
+         assert len(issues) == 2  # duplicates kept, env handles dedup
+
+     def test_with_column_variant(self):
+         response = "row:1,column:name,issue:missing_value"
+         issues = parse_llm_response(response)
+         assert len(issues) == 1
+
+
+ class TestBuildUserPrompt:
+     def test_includes_all_fields(self):
+         obs = {
+             "task_description": "Find issues",
+             "schema_description": "col: int",
+             "validation_rules": "no nulls",
+             "dataset_csv": "a,b\n1,2",
+             "num_issues_hint": 3,
+             "feedback": "",
+         }
+         prompt = build_user_prompt(obs)
+         assert "Find issues" in prompt
+         assert "col: int" in prompt
+         assert "no nulls" in prompt
+         assert "a,b" in prompt
+         assert "3 issues" in prompt
+
+     def test_includes_feedback_on_retry(self):
+         obs = {
+             "task_description": "Find issues",
+             "schema_description": "",
+             "validation_rules": "",
+             "dataset_csv": "a\n1",
+             "num_issues_hint": 0,
+             "feedback": "Step 1/3: You missed 2 issues",
+         }
+         prompt = build_user_prompt(obs)
+         assert "FEEDBACK" in prompt
+         assert "missed 2" in prompt
+
+     def test_excludes_reset_feedback(self):
+         obs = {
+             "task_description": "",
+             "schema_description": "",
+             "validation_rules": "",
+             "dataset_csv": "",
+             "num_issues_hint": 0,
+             "feedback": "Environment reset. Start inspecting.",
+         }
+         prompt = build_user_prompt(obs)
+         assert "FEEDBACK" not in prompt
+
+
+ class TestLogFormat:
+     """Verify stdout log format matches hackathon evaluation requirements."""
+
+     def test_log_start_format(self, capsys):
+         log_start(task="easy", env="dataqa_env", model="test-model")
+         out = capsys.readouterr().out.strip()
+         assert out == "[START] task=easy env=dataqa_env model=test-model"
+
+     def test_log_step_format(self, capsys):
+         log_step(step=1, action="row:1,col:name,issue:missing_value", reward=0.50, done=False, error=None)
+         out = capsys.readouterr().out.strip()
+         assert out == "[STEP] step=1 action=row:1,col:name,issue:missing_value reward=0.50 done=false error=null"
+
+     def test_log_step_with_error(self, capsys):
+         log_step(step=2, action="none", reward=0.00, done=True, error="timeout")
+         out = capsys.readouterr().out.strip()
+         assert "error=timeout" in out
+         assert "done=true" in out
+
+     def test_log_end_format(self, capsys):
+         log_end(success=True, steps=3, score=0.85, rewards=[0.25, 0.50, 0.85])
+         out = capsys.readouterr().out.strip()
+         assert out == "[END] success=true steps=3 score=0.850 rewards=0.25,0.50,0.85"
+
+     def test_log_end_failure(self, capsys):
+         log_end(success=False, steps=1, score=0.0, rewards=[0.0])
+         out = capsys.readouterr().out.strip()
+         assert "success=false" in out
+         assert "score=0.000" in out
+
+     def test_reward_format_2_decimal(self, capsys):
+         log_step(step=1, action="test", reward=0.123456, done=False, error=None)
+         out = capsys.readouterr().out.strip()
+         assert "reward=0.12" in out
+
+     def test_no_newlines_within_line(self, capsys):
+         log_start(task="easy", env="dataqa_env", model="model")
+         log_step(step=1, action="act", reward=0.0, done=False, error=None)
+         log_end(success=False, steps=1, score=0.0, rewards=[0.0])
+         out = capsys.readouterr().out
+         lines = [l for l in out.split("\n") if l.strip()]
+         assert len(lines) == 3
+         assert lines[0].startswith("[START]")
+         assert lines[1].startswith("[STEP]")
+         assert lines[2].startswith("[END]")
tests/test_tasks.py ADDED
@@ -0,0 +1,161 @@
+ """Tests for task definitions, data corruption, and issue planting."""
+
+ import pytest
+ from dataqa_env.server.tasks import (
+     PlantedIssue,
+     Task,
+     create_task_easy,
+     create_task_medium,
+     create_task_hard,
+     get_task,
+     list_tasks,
+     _csv_to_rows,
+     _rows_to_csv,
+ )
+
+
+ class TestPlantedIssue:
+     def test_to_key(self):
+         issue = PlantedIssue(row=3, col="salary", issue_type="missing_value", description="test")
+         assert issue.to_key() == "row:3,col:salary,issue:missing_value"
+
+     def test_difficulty_default(self):
+         issue = PlantedIssue(row=1, col="name", issue_type="missing_value", description="test")
+         assert issue.difficulty == 1.0
+
+     def test_difficulty_custom(self):
+         issue = PlantedIssue(row=1, col="name", issue_type="missing_value", description="test", difficulty=3.0)
+         assert issue.difficulty == 3.0
+
+
+ class TestCSVHelpers:
+     def test_roundtrip(self):
+         csv_text = "a,b,c\n1,2,3\n4,5,6"
+         rows = _csv_to_rows(csv_text)
+         assert len(rows) == 3
+         result = _rows_to_csv(rows)
+         assert "1,2,3" in result
+
+     def test_empty_csv(self):
+         rows = _csv_to_rows("a,b\n")
+         assert len(rows) == 1  # header only
+
+
+ class TestTaskEasy:
+     @pytest.fixture
+     def task(self):
+         return create_task_easy()
+
+     def test_task_id(self, task):
+         assert task.task_id == "easy"
+
+     def test_has_4_issues(self, task):
+         assert len(task.planted_issues) == 4
+
+     def test_issue_types(self, task):
+         types = {i.issue_type for i in task.planted_issues}
+         assert "missing_value" in types
+         assert "wrong_type" in types
+         assert "duplicate_row" in types
+         assert "out_of_range" in types
+
+     def test_corrupted_csv_differs_from_clean(self, task):
+         assert task.corrupted_csv != task.clean_csv
+
+     def test_issue_keys_unique(self, task):
+         keys = [i.to_key() for i in task.planted_issues]
+         assert len(keys) == len(set(keys))
+
+     def test_max_steps(self, task):
+         assert task.max_steps == 3
+
+     def test_corrupted_csv_has_more_rows(self, task):
+         clean_rows = _csv_to_rows(task.clean_csv)
+         corrupt_rows = _csv_to_rows(task.corrupted_csv)
+         assert len(corrupt_rows) > len(clean_rows)  # duplicate row added
+
+     def test_difficulty_weights(self, task):
+         for issue in task.planted_issues:
+             assert 1.0 <= issue.difficulty <= 3.0
+
+
+ class TestTaskMedium:
+     @pytest.fixture
+     def task(self):
+         return create_task_medium()
+
+     def test_task_id(self, task):
+         assert task.task_id == "medium"
+
+     def test_has_6_issues(self, task):
+         assert len(task.planted_issues) == 6
+
+     def test_issue_types(self, task):
+         types = {i.issue_type for i in task.planted_issues}
+         assert "inconsistent_value" in types
+         assert "format_violation" in types
+         assert "missing_value" in types
+
+     def test_issue_keys_unique(self, task):
+         keys = [i.to_key() for i in task.planted_issues]
+         assert len(keys) == len(set(keys))
+
+     def test_difficulty_weights(self, task):
+         for issue in task.planted_issues:
+             assert 1.0 <= issue.difficulty <= 3.0
+
+
+ class TestTaskHard:
+     @pytest.fixture
+     def task(self):
+         return create_task_hard()
+
+     def test_task_id(self, task):
+         assert task.task_id == "hard"
+
+     def test_has_8_issues(self, task):
+         assert len(task.planted_issues) == 8
+
+     def test_issue_types(self, task):
+         types = {i.issue_type for i in task.planted_issues}
+         assert "inconsistent_value" in types
+         assert "format_violation" in types
+         assert "statistical_outlier" in types
+         assert "out_of_range" in types
+         assert "missing_value" in types
+
+     def test_has_high_difficulty_issues(self, task):
+         hard_issues = [i for i in task.planted_issues if i.difficulty >= 2.5]
+         assert len(hard_issues) >= 2  # data leakage, GPU outlier, whitespace
+
+     def test_issue_keys_unique(self, task):
+         keys = [i.to_key() for i in task.planted_issues]
+         assert len(keys) == len(set(keys))
+
+
+ class TestTaskRegistry:
+     def test_list_tasks(self):
+         tasks = list_tasks()
+         assert set(tasks) == {"easy", "medium", "hard"}
+
+     def test_get_task_easy(self):
+         task = get_task("easy")
+         assert task.task_id == "easy"
+
+     def test_get_task_medium(self):
+         task = get_task("medium")
+         assert task.task_id == "medium"
+
+     def test_get_task_hard(self):
+         task = get_task("hard")
+         assert task.task_id == "hard"
+
+     def test_get_task_unknown_raises(self):
+         with pytest.raises(ValueError, match="Unknown task"):
+             get_task("nonexistent")
+
+     def test_seed_determinism(self):
+         t1 = get_task("easy", seed=42)
+         t2 = get_task("easy", seed=42)
+         assert t1.corrupted_csv == t2.corrupted_csv
+         assert [i.to_key() for i in t1.planted_issues] == [i.to_key() for i in t2.planted_issues]