varb15 committed on
Commit 0c216ef · verified · 1 Parent(s): 8ced877

Upload folder using huggingface_hub
Dockerfile ADDED
@@ -0,0 +1,36 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+# Install system deps
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git curl \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install uv for fast dependency management
+RUN curl -LsSf https://astral.sh/uv/install.sh | sh && \
+    mv /root/.local/bin/uv /usr/local/bin/uv && \
+    mv /root/.local/bin/uvx /usr/local/bin/uvx
+
+# Copy project files
+COPY pyproject.toml /app/
+COPY openenv.yaml /app/
+COPY dataqa_env/ /app/dataqa_env/
+COPY inference.py /app/
+COPY README.md /app/
+
+# Install dependencies
+RUN uv sync --no-editable 2>/dev/null || pip install -e .
+
+# Set environment
+ENV PATH="/app/.venv/bin:$PATH"
+ENV PYTHONPATH="/app:$PYTHONPATH"
+
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+
+EXPOSE 8000
+
+ENV ENABLE_WEB_INTERFACE=true
+CMD ["uvicorn", "dataqa_env.server.app:app", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -1,10 +1,109 @@
 ---
-title: Dataqa Env
-emoji: 💻
-colorFrom: red
-colorTo: yellow
+title: DataQA Environment Server
+emoji: 🔍
+colorFrom: blue
+colorTo: gray
 sdk: docker
 pinned: false
+app_port: 8000
+base_path: /web
+tags:
+  - openenv
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# DataQA Environment
+
+An OpenEnv environment for **Data Quality Assurance** — an LLM agent inspects datasets with planted quality issues and must identify them all.
+
+## Overview
+
+DataQA simulates the real-world task of validating datasets before they enter ML training pipelines or production databases. The agent receives a corrupted dataset along with its schema and validation rules, then must identify all planted data quality issues.
+
+### Why Data QA?
+
+Every ML engineer and data scientist spends significant time debugging data quality issues — missing values, type mismatches, inconsistencies, and subtle statistical anomalies. This environment turns that task into a structured, gradable challenge.
+
+## Environment API
+
+| Endpoint | Description |
+|----------|-------------|
+| `reset(task_id)` | Start a new episode with a corrupted dataset |
+| `step(issues)` | Submit identified issues, receive F1-scored feedback |
+| `state()` | Get current episode state |
+
+## Tasks
+
+| Task | Issues | Difficulty | Description |
+|------|--------|------------|-------------|
+| `easy` | 4 | Beginner | Employee directory — nulls, wrong types, duplicates, out-of-range |
+| `medium` | 6 | Intermediate | E-commerce orders — format violations, inconsistent totals, duplicate keys |
+| `hard` | 8 | Advanced | ML experiment metadata — data leakage signals, unreasonable GPU usage, timestamp ordering |
+
+## Reward Function
+
+Scoring uses the **F1 score** (harmonic mean of precision and recall):
+
+- **Precision**: What fraction of reported issues are real?
+- **Recall**: What fraction of planted issues did the agent find?
+- **F1**: `2 * precision * recall / (precision + recall)`
+
+Issues are matched by `row:<N>,col:<column>,issue:<type>` keys.
+
+The agent gets up to 3 attempts per task, with feedback on each attempt (true positives, false positives, missed count).
+
+## Action/Observation Space
+
+**Action**: List of issue strings in the format `row:<row_number>,col:<column_name>,issue:<issue_type>`
+
+**Observation**: Dataset CSV + schema + validation rules + feedback from the previous attempt
+
+**Issue Types**: `missing_value`, `wrong_type`, `duplicate_row`, `out_of_range`, `format_violation`, `inconsistent_value`, `statistical_outlier`, `referential_integrity`
+
+## Quick Start
+
+```bash
+# Install
+pip install -e .
+
+# Run server locally
+uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000
+
+# Run inference
+API_BASE_URL=https://api.groq.com/openai/v1 \
+MODEL_NAME=llama-3.3-70b-versatile \
+LLM_API_KEY=your-key \
+python inference.py
+```
+
+## Docker
+
+```bash
+docker build -t dataqa-env -f dataqa_env/server/Dockerfile .
+docker run -p 8000:8000 dataqa-env
+```
+
+## Environment Variables
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `API_BASE_URL` | LLM API endpoint | `https://api.groq.com/openai/v1` |
+| `MODEL_NAME` | Model identifier | `llama-3.3-70b-versatile` |
+| `HF_TOKEN` | Hugging Face token | - |
+| `ENV_URL` | Environment server URL | `http://localhost:8000` |
+| `LLM_API_KEY` | API key for LLM provider | Falls back to `HF_TOKEN` |
+
+## Architecture
+
+```
+dataqa_env/
+├── models.py          # Pydantic: DataQAAction, DataQAObservation, DataQAState
+├── client.py          # EnvClient for WebSocket connections
+├── server/
+│   ├── environment.py # Core DataQAEnvironment (reset/step/state)
+│   ├── tasks.py       # Task definitions + data corruption + grading
+│   ├── app.py         # FastAPI server
+│   └── Dockerfile
+├── openenv.yaml
+├── pyproject.toml
+└── inference.py       # LLM agent using OpenAI client
+```
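The reward function above can be sketched directly from the stated formula. A minimal sketch, assuming reported and planted issues are already normalized `row:<N>,col:<column>,issue:<type>` strings (the `f1_score` helper is illustrative, not part of the package):

```python
def f1_score(reported: set, planted: set) -> float:
    """F1 over matched issue keys, mirroring the README's formula."""
    # Degenerate cases: reporting nothing against nothing is perfect,
    # otherwise an empty side yields zero.
    if not reported or not planted:
        return 1.0 if not reported and not planted else 0.0
    tp = len(reported & planted)          # true positives: exact key matches
    precision = tp / len(reported)        # fraction of reported issues that are real
    recall = tp / len(planted)            # fraction of planted issues found
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

planted = {
    "row:4,col:name,issue:missing_value",
    "row:7,col:salary,issue:wrong_type",
}
reported = {
    "row:4,col:name,issue:missing_value",       # correct
    "row:2,col:email,issue:format_violation",   # false positive
}
print(round(f1_score(reported, planted), 3))  # 0.5
```

With one true positive out of two reported and two planted, precision and recall are both 0.5, so F1 is 0.5.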
__init__.py ADDED
@@ -0,0 +1,3 @@
+from dataqa_env import DataQAEnv, DataQAAction, DataQAObservation, DataQAState
+
+__all__ = ["DataQAEnv", "DataQAAction", "DataQAObservation", "DataQAState"]
client.py ADDED
@@ -0,0 +1,5 @@
+"""Root-level client for OpenEnv compatibility."""
+from dataqa_env.client import DataQAEnv
+from dataqa_env.models import DataQAAction, DataQAObservation, DataQAState
+
+__all__ = ["DataQAEnv", "DataQAAction", "DataQAObservation", "DataQAState"]
dataqa_env/__init__.py ADDED
@@ -0,0 +1,4 @@
+from .client import DataQAEnv
+from .models import DataQAAction, DataQAObservation, DataQAState
+
+__all__ = ["DataQAEnv", "DataQAAction", "DataQAObservation", "DataQAState"]
dataqa_env/client.py ADDED
@@ -0,0 +1,37 @@
+"""
+DataQAEnv Client
+----------------
+Client-side wrapper for the DataQA environment server.
+"""
+
+from __future__ import annotations
+
+from openenv.core.client_types import StepResult
+from openenv.core.env_client import EnvClient
+
+from .models import DataQAAction, DataQAObservation, DataQAState
+
+
+class DataQAEnv(EnvClient[DataQAAction, DataQAObservation, DataQAState]):
+
+    def _step_payload(self, action: DataQAAction) -> dict:
+        return {"issues": action.issues, "task_id": action.task_id}
+
+    def _parse_result(self, payload: dict) -> StepResult[DataQAObservation]:
+        obs = DataQAObservation(**payload["observation"])
+        return StepResult(
+            observation=obs,
+            reward=payload.get("reward"),
+            done=bool(payload.get("done", False)),
+        )
+
+    def _parse_state(self, payload: dict) -> DataQAState:
+        return DataQAState(
+            episode_id=payload.get("episode_id"),
+            step_count=payload.get("step_count", 0),
+            task_id=payload.get("task_id", ""),
+            current_step=payload.get("current_step", 0),
+            max_steps=payload.get("max_steps", 3),
+            best_score=payload.get("best_score", 0.0),
+            total_planted_issues=payload.get("total_planted_issues", 0),
+        )
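The defaults in `_parse_result` matter when the server omits optional fields. A standalone sketch of the same logic, using a plain dataclass in place of OpenEnv's `StepResult` so it runs without the `openenv` package:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepResultSketch:
    """Stand-in for openenv's StepResult, for illustration only."""
    observation: dict
    reward: Optional[float]
    done: bool


def parse_result(payload: dict) -> StepResultSketch:
    # Mirrors DataQAEnv._parse_result: a missing "reward" becomes None,
    # a missing "done" becomes False, so partial payloads still parse.
    return StepResultSketch(
        observation=payload["observation"],
        reward=payload.get("reward"),
        done=bool(payload.get("done", False)),
    )


result = parse_result({"observation": {"feedback": "ok"}})
print(result.reward, result.done)  # None False
```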
dataqa_env/models.py ADDED
@@ -0,0 +1,75 @@
+"""
+DataQA Environment Models
+-------------------------
+Action/Observation/State types for the Data Quality Assurance environment.
+
+The agent receives a dataset with planted quality issues and must identify them.
+Grading is based on the F1 score (harmonic mean of precision and recall) over correctly identified issues.
+"""
+
+from __future__ import annotations
+
+from typing import List
+
+from openenv.core.env_server.interfaces import Action, Observation, State
+
+
+class DataQAAction(Action):
+    """
+    Agent submits a list of identified data quality issues.
+
+    Each issue is a string in the format: "row:<row_idx>,col:<col_name>,issue:<issue_type>"
+    Supported issue types:
+    - missing_value
+    - wrong_type
+    - duplicate_row
+    - out_of_range
+    - format_violation
+    - inconsistent_value
+    - statistical_outlier
+    - referential_integrity
+    """
+
+    issues: List[str]
+    # Include task_id so step() can reconstruct context in stateless HTTP mode
+    task_id: str = "easy"
+
+
+class DataQAObservation(Observation):
+    """
+    What the agent sees: a dataset, its schema/rules, and feedback.
+    """
+
+    # The dataset as CSV text
+    dataset_csv: str = ""
+
+    # Schema description (column names, expected types, constraints)
+    schema_description: str = ""
+
+    # Validation rules in plain text
+    validation_rules: str = ""
+
+    # Task description
+    task_description: str = ""
+
+    # Feedback from previous step (empty on reset)
+    feedback: str = ""
+
+    # Current task ID
+    task_id: str = ""
+
+    # Number of planted issues (hint for the agent)
+    num_issues_hint: int = 0
+
+    # Max allowed steps for this task
+    max_steps: int = 3
+
+
+class DataQAState(State):
+    """Tracks episode progress."""
+
+    task_id: str = ""
+    current_step: int = 0
+    max_steps: int = 3
+    best_score: float = 0.0
+    total_planted_issues: int = 0
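The action format documented above can be checked with a tiny validator before submitting, which avoids wasting one of the three attempts on malformed strings. A sketch assuming only the documented string grammar (`is_valid_issue` is illustrative, not part of the package):

```python
import re

# The eight issue types listed in DataQAAction's docstring
ISSUE_TYPES = {
    "missing_value", "wrong_type", "duplicate_row", "out_of_range",
    "format_violation", "inconsistent_value", "statistical_outlier",
    "referential_integrity",
}

# Strict form of the documented grammar: row:<row_idx>,col:<col_name>,issue:<issue_type>
PATTERN = re.compile(r"^row:(\d+),col:(\w+),issue:(\w+)$")


def is_valid_issue(s: str) -> bool:
    """Return True if s matches the canonical issue-string grammar."""
    m = PATTERN.match(s)
    return bool(m) and m.group(3) in ISSUE_TYPES


print(is_valid_issue("row:4,col:name,issue:missing_value"))  # True
print(is_valid_issue("row:4,col:name,issue:typo"))           # False
```

The server itself parses more leniently, but validating client-side against the canonical form keeps submissions unambiguous.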
dataqa_env/server/Dockerfile ADDED
@@ -0,0 +1,33 @@
+FROM python:3.11-slim
+
+WORKDIR /app
+
+# Install system deps
+RUN apt-get update && apt-get install -y --no-install-recommends \
+    git curl \
+    && rm -rf /var/lib/apt/lists/*
+
+# Install uv for fast dependency management
+RUN curl -LsSf https://astral.sh/uv/install.sh | sh && \
+    mv /root/.local/bin/uv /usr/local/bin/uv && \
+    mv /root/.local/bin/uvx /usr/local/bin/uvx
+
+# Copy project files
+COPY . /app/env
+
+WORKDIR /app/env
+
+# Install dependencies
+RUN uv sync --frozen --no-editable 2>/dev/null || uv sync --no-editable
+
+# Set environment
+ENV PATH="/app/env/.venv/bin:$PATH"
+ENV PYTHONPATH="/app/env:$PYTHONPATH"
+
+# Health check
+HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
+
+EXPOSE 8000
+
+CMD ["uvicorn", "dataqa_env.server.app:app", "--host", "0.0.0.0", "--port", "8000"]
dataqa_env/server/__init__.py ADDED
File without changes
dataqa_env/server/app.py ADDED
@@ -0,0 +1,28 @@
+"""
+FastAPI application for the DataQA Environment.
+
+Usage:
+    uvicorn dataqa_env.server.app:app --reload --host 0.0.0.0 --port 8000
+"""
+
+try:
+    from openenv.core.env_server.http_server import create_app
+    from .environment import DataQAEnvironment
+    from ..models import DataQAAction, DataQAObservation
+except ImportError:
+    from openenv.core.env_server.http_server import create_app
+    from dataqa_env.server.environment import DataQAEnvironment
+    from dataqa_env.models import DataQAAction, DataQAObservation
+
+app = create_app(
+    DataQAEnvironment, DataQAAction, DataQAObservation, env_name="dataqa_env"
+)
+
+
+def main():
+    import uvicorn
+    uvicorn.run(app, host="0.0.0.0", port=8000)
+
+
+if __name__ == "__main__":
+    main()
dataqa_env/server/environment.py ADDED
@@ -0,0 +1,193 @@
+"""
+DataQA Environment
+------------------
+Server-side environment for data quality assurance tasks.
+
+The agent receives corrupted datasets and must identify planted quality issues.
+Scoring is based on the F1 score (harmonic mean of precision and recall) over correctly matched issues.
+"""
+
+from __future__ import annotations
+
+import re
+import uuid
+from typing import Any, Optional, Set
+
+from openenv.core.env_server.interfaces import Action, Environment, Observation
+
+from ..models import DataQAAction, DataQAObservation, DataQAState
+from .tasks import PlantedIssue, Task, get_task, list_tasks
+
+
+def parse_issue_key(raw: str) -> Optional[str]:
+    """
+    Parse an agent-reported issue string into a normalized key.
+    Expected format: row:<N>,col:<name>,issue:<type>
+    Returns the normalized key, or None if unparseable.
+    """
+    raw = raw.strip().lower()
+    # Be lenient with formatting
+    row_match = re.search(r"row\s*[:=]\s*(\d+)", raw)
+    col_match = re.search(r"col\s*[:=]\s*([\w_]+)", raw)
+    issue_match = re.search(r"issue\s*[:=]\s*([\w_]+)", raw)
+
+    if row_match and col_match and issue_match:
+        return f"row:{row_match.group(1)},col:{col_match.group(1)},issue:{issue_match.group(1)}"
+    return None
+
+
+def compute_f1(reported_keys: Set[str], planted_keys: Set[str]) -> dict:
+    """Compute precision, recall, and F1 score."""
+    if not reported_keys and not planted_keys:
+        return {"precision": 1.0, "recall": 1.0, "f1": 1.0, "tp": 0, "fp": 0, "fn": 0}
+
+    if not reported_keys:
+        return {"precision": 0.0, "recall": 0.0, "f1": 0.0, "tp": 0, "fp": 0, "fn": len(planted_keys)}
+
+    if not planted_keys:
+        return {"precision": 0.0, "recall": 0.0, "f1": 0.0, "tp": 0, "fp": len(reported_keys), "fn": 0}
+
+    tp = len(reported_keys & planted_keys)
+    fp = len(reported_keys - planted_keys)
+    fn = len(planted_keys - reported_keys)
+
+    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
+    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
+    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
+
+    return {"precision": precision, "recall": recall, "f1": f1, "tp": tp, "fp": fp, "fn": fn}
+
+
+class DataQAEnvironment(Environment):
+    """
+    Data Quality Assurance environment.
+
+    The agent inspects corrupted datasets and reports quality issues.
+    Reward is the F1 score of correctly identified issues vs the planted ground truth.
+    """
+
+    SUPPORTS_CONCURRENT_SESSIONS = True
+
+    def __init__(self):
+        self._state = DataQAState()
+        self._current_task: Optional[Task] = None
+        self._planted_keys: Set[str] = set()
+        self._best_score: float = 0.0
+
+    def reset(
+        self,
+        seed: Optional[int] = None,
+        episode_id: Optional[str] = None,
+        **kwargs: Any,
+    ) -> Observation:
+        task_id = kwargs.get("task_id", "easy")
+        task_seed = seed if seed is not None else 42
+
+        self._current_task = get_task(task_id, seed=task_seed)
+        self._planted_keys = {issue.to_key() for issue in self._current_task.planted_issues}
+        self._best_score = 0.0
+
+        ep_id = episode_id or str(uuid.uuid4())
+        self._state = DataQAState(
+            episode_id=ep_id,
+            step_count=0,
+            task_id=task_id,
+            current_step=0,
+            max_steps=self._current_task.max_steps,
+            best_score=0.0,
+            total_planted_issues=len(self._current_task.planted_issues),
+        )
+
+        return DataQAObservation(
+            dataset_csv=self._current_task.corrupted_csv,
+            schema_description=self._current_task.schema_description,
+            validation_rules=self._current_task.validation_rules,
+            task_description=self._current_task.description,
+            feedback="Environment reset. Inspect the dataset and report all quality issues.",
+            task_id=task_id,
+            num_issues_hint=len(self._current_task.planted_issues),
+            max_steps=self._current_task.max_steps,
+            done=False,
+            reward=0.0,
+        )
+
+    def step(
+        self,
+        action: Action,
+        timeout_s: Optional[float] = None,
+        **kwargs: Any,
+    ) -> Observation:
+        if not isinstance(action, DataQAAction):
+            raise ValueError(f"Expected DataQAAction, got {type(action)}")
+
+        # In stateless HTTP mode, each request creates a fresh env instance.
+        # Auto-reset using the task_id from the action so step() works standalone.
+        if self._current_task is None:
+            self.reset(task_id=action.task_id)
+
+        self._state.step_count += 1
+        self._state.current_step += 1
+
+        # Parse reported issues
+        reported_keys: Set[str] = set()
+        parse_errors: list[str] = []
+        for raw_issue in action.issues:
+            key = parse_issue_key(raw_issue)
+            if key:
+                reported_keys.add(key)
+            else:
+                parse_errors.append(f"Could not parse: '{raw_issue}'")
+
+        # Compute score
+        metrics = compute_f1(reported_keys, self._planted_keys)
+        score = metrics["f1"]
+        self._best_score = max(self._best_score, score)
+        self._state.best_score = self._best_score
+
+        # Check if done
+        is_done = (
+            score >= 0.999  # Perfect score
+            or self._state.current_step >= self._state.max_steps
+        )
+
+        # Build feedback
+        feedback_lines = [
+            f"Step {self._state.current_step}/{self._state.max_steps}",
+            f"Issues reported: {len(reported_keys)}",
+            f"True positives: {metrics['tp']}, False positives: {metrics['fp']}, Missed: {metrics['fn']}",
+            f"Precision: {metrics['precision']:.3f}, Recall: {metrics['recall']:.3f}, F1: {score:.3f}",
+        ]
+
+        if parse_errors:
+            feedback_lines.append(f"Parse errors ({len(parse_errors)}): {'; '.join(parse_errors[:3])}")
+
+        if not is_done:
+            # Give hints about what was missed without revealing exact answers
+            if metrics["fn"] > 0:
+                feedback_lines.append(
+                    f"You missed {metrics['fn']} issue(s). Review the dataset carefully."
+                )
+            if metrics["fp"] > 0:
+                feedback_lines.append(
+                    f"{metrics['fp']} of your reported issues were incorrect."
+                )
+            feedback_lines.append("You can submit again with an updated list of issues.")
+        else:
+            feedback_lines.append(f"Task complete! Final best F1 score: {self._best_score:.3f}")
+
+        return DataQAObservation(
+            dataset_csv=self._current_task.corrupted_csv,
+            schema_description=self._current_task.schema_description,
+            validation_rules=self._current_task.validation_rules,
+            task_description=self._current_task.description,
+            feedback="\n".join(feedback_lines),
+            task_id=self._current_task.task_id,
+            num_issues_hint=len(self._current_task.planted_issues),
+            max_steps=self._state.max_steps,
+            done=is_done,
+            reward=self._best_score,
+        )
+
+    @property
+    def state(self) -> DataQAState:
+        return self._state
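The lenient parsing in `parse_issue_key` is worth seeing in action: the regexes accept `:` or `=` separators, arbitrary whitespace, and mixed case, then emit one canonical key so set intersection works for grading. The function below repeats the server's logic so the demo runs standalone:

```python
import re
from typing import Optional


def parse_issue_key(raw: str) -> Optional[str]:
    """Normalize a reported issue string, mirroring the server's lenient parser."""
    raw = raw.strip().lower()
    # Accept ':' or '=' as the separator, with optional whitespace around it
    row = re.search(r"row\s*[:=]\s*(\d+)", raw)
    col = re.search(r"col\s*[:=]\s*([\w_]+)", raw)
    issue = re.search(r"issue\s*[:=]\s*([\w_]+)", raw)
    if row and col and issue:
        return f"row:{row.group(1)},col:{col.group(1)},issue:{issue.group(1)}"
    return None


print(parse_issue_key("Row: 4, Col: salary, Issue: out_of_range"))
# row:4,col:salary,issue:out_of_range
print(parse_issue_key("not an issue"))  # None
```

Because all accepted variants collapse to the same key, an agent that writes `Row=4` still gets credit for the issue at `row:4`.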
dataqa_env/server/tasks.py ADDED
@@ -0,0 +1,391 @@
+"""
+Task definitions for the DataQA environment.
+
+Each task provides:
+- A clean dataset (CSV)
+- A schema + validation rules
+- A set of planted issues (ground truth)
+- A function to inject those issues into the clean data
+"""
+
+from __future__ import annotations
+
+import csv
+import io
+import random
+from dataclasses import dataclass, field
+from typing import List, Set
+
+
+@dataclass
+class PlantedIssue:
+    """A single planted data quality issue."""
+
+    row: int
+    col: str
+    issue_type: str
+    description: str
+
+    def to_key(self) -> str:
+        return f"row:{self.row},col:{self.col},issue:{self.issue_type}"
+
+
+@dataclass
+class Task:
+    task_id: str
+    name: str
+    description: str
+    schema_description: str
+    validation_rules: str
+    clean_csv: str
+    planted_issues: List[PlantedIssue] = field(default_factory=list)
+    corrupted_csv: str = ""
+    max_steps: int = 3
+
+
+def _csv_to_rows(csv_text: str) -> List[List[str]]:
+    reader = csv.reader(io.StringIO(csv_text.strip()))
+    return [row for row in reader]
+
+
+def _rows_to_csv(rows: List[List[str]]) -> str:
+    output = io.StringIO()
+    writer = csv.writer(output)
+    writer.writerows(rows)
+    return output.getvalue()
+
+
+# ---------------------------------------------------------------------------
+# TASK 1: Easy — Employee directory with obvious issues
+# ---------------------------------------------------------------------------
+
+def create_task_easy(seed: int = 42) -> Task:
+    rng = random.Random(seed)
+
+    clean_csv = """employee_id,name,email,department,salary,start_date
+101,Alice Chen,alice.chen@company.com,Engineering,95000,2022-03-15
+102,Bob Martinez,bob.martinez@company.com,Marketing,72000,2021-07-01
+103,Carol Davis,carol.davis@company.com,Engineering,98000,2020-11-20
+104,David Kim,david.kim@company.com,Sales,68000,2023-01-10
+105,Eve Johnson,eve.johnson@company.com,HR,71000,2022-06-05
+106,Frank Wilson,frank.wilson@company.com,Engineering,102000,2019-08-12
+107,Grace Lee,grace.lee@company.com,Marketing,75000,2021-12-01
+108,Hank Brown,hank.brown@company.com,Sales,65000,2023-04-18
+109,Iris Patel,iris.patel@company.com,HR,73000,2020-02-28
+110,Jack Taylor,jack.taylor@company.com,Engineering,97000,2022-09-14"""
+
+    schema_desc = """Columns:
+- employee_id: integer, unique, range 100-999
+- name: string, non-empty, format "FirstName LastName"
+- email: string, valid email format, must match pattern firstname.lastname@company.com
+- department: string, one of [Engineering, Marketing, Sales, HR]
+- salary: integer, range 50000-150000
+- start_date: string, format YYYY-MM-DD, must be between 2015-01-01 and 2025-12-31"""
+
+    rules = """1. No missing values in any column
+2. employee_id must be unique
+3. email must follow the pattern: lowercase(firstname).lowercase(lastname)@company.com
+4. salary must be within the valid range
+5. No duplicate rows"""
+
+    rows = _csv_to_rows(clean_csv)
+    header = rows[0]
+    data = rows[1:]
+    issues: List[PlantedIssue] = []
+
+    # Issue 1: Missing value - null out a name
+    r = 3  # row index in data (0-based), displayed as row 4 in CSV
+    data[r][1] = ""
+    issues.append(PlantedIssue(row=r + 1, col="name", issue_type="missing_value",
+                               description="Empty name field"))
+
+    # Issue 2: Wrong type - salary as text
+    r = 6
+    data[r][4] = "seventy-five thousand"
+    issues.append(PlantedIssue(row=r + 1, col="salary", issue_type="wrong_type",
+                               description="Salary is text instead of integer"))
+
+    # Issue 3: Duplicate row
+    dup_source = 1
+    data.append(list(data[dup_source]))
+    issues.append(PlantedIssue(row=len(data), col="employee_id", issue_type="duplicate_row",
+                               description=f"Exact duplicate of row {dup_source + 1}"))
+
+    # Issue 4: Out of range salary
+    r = 8
+    data[r][4] = "5000"
+    issues.append(PlantedIssue(row=r + 1, col="salary", issue_type="out_of_range",
+                               description="Salary 5000 is below minimum 50000"))
+
+    corrupted = _rows_to_csv([header] + data)
+
+    return Task(
+        task_id="easy",
+        name="Employee Directory Validation",
+        description=(
+            "You are given an employee directory dataset. "
+            "Find all data quality issues based on the schema and validation rules. "
+            "Report each issue in the format: row:<row_number>,col:<column_name>,issue:<issue_type>"
+        ),
+        schema_description=schema_desc,
+        validation_rules=rules,
+        clean_csv=clean_csv,
+        planted_issues=issues,
+        corrupted_csv=corrupted,
+        max_steps=3,
+    )
+
+
+# ---------------------------------------------------------------------------
+# TASK 2: Medium — E-commerce orders with moderate issues
+# ---------------------------------------------------------------------------
+
+def create_task_medium(seed: int = 42) -> Task:
+    rng = random.Random(seed)
+
+    clean_csv = """order_id,customer_id,product_name,category,quantity,unit_price,order_date,shipping_country,status,total
+ORD-001,CUST-100,Wireless Mouse,Electronics,2,29.99,2024-01-15,US,delivered,59.98
+ORD-002,CUST-101,Python Cookbook,Books,1,45.50,2024-01-16,UK,delivered,45.50
+ORD-003,CUST-102,USB-C Hub,Electronics,1,35.00,2024-01-17,US,shipped,35.00
+ORD-004,CUST-103,Yoga Mat,Sports,1,25.99,2024-01-18,CA,delivered,25.99
+ORD-005,CUST-104,Desk Lamp,Home,1,42.00,2024-01-19,US,processing,42.00
+ORD-006,CUST-105,Running Shoes,Sports,1,89.99,2024-01-20,DE,delivered,89.99
+ORD-007,CUST-106,Mechanical Keyboard,Electronics,1,129.99,2024-01-21,US,shipped,129.99
+ORD-008,CUST-100,Monitor Stand,Home,1,55.00,2024-01-22,US,delivered,55.00
+ORD-009,CUST-107,Data Science Handbook,Books,2,39.99,2024-01-23,UK,delivered,79.98
+ORD-010,CUST-108,Resistance Bands,Sports,3,12.99,2024-01-24,CA,shipped,38.97
+ORD-011,CUST-109,Webcam HD,Electronics,1,65.00,2024-01-25,US,delivered,65.00
+ORD-012,CUST-110,Standing Desk,Home,1,299.99,2024-01-26,US,processing,299.99
+ORD-013,CUST-111,Tennis Racket,Sports,1,75.00,2024-01-27,AU,delivered,75.00
+ORD-014,CUST-112,LED Strip Lights,Home,2,18.50,2024-01-28,US,shipped,37.00
+ORD-015,CUST-113,AI Textbook,Books,1,59.99,2024-01-29,DE,delivered,59.99
+ORD-016,CUST-114,Bluetooth Speaker,Electronics,1,49.99,2024-01-30,UK,delivered,49.99
+ORD-017,CUST-115,Jump Rope,Sports,2,8.99,2024-01-31,US,shipped,17.98
+ORD-018,CUST-116,Coffee Table Book,Books,1,32.00,2024-02-01,CA,delivered,32.00
+ORD-019,CUST-117,Ergonomic Chair,Home,1,450.00,2024-02-02,US,processing,450.00
+ORD-020,CUST-118,Fitness Tracker,Electronics,1,79.99,2024-02-03,AU,delivered,79.99"""
+
+    schema_desc = """Columns:
+- order_id: string, unique, format ORD-NNN
+- customer_id: string, format CUST-NNN
+- product_name: string, non-empty
+- category: string, one of [Electronics, Books, Sports, Home]
+- quantity: integer, range 1-100
+- unit_price: float, range 0.01-10000.00
+- order_date: string, format YYYY-MM-DD
+- shipping_country: string, ISO 2-letter country code
+- status: string, one of [processing, shipped, delivered, cancelled, returned]
+- total: float, must equal quantity * unit_price"""
+
+    rules = """1. No missing values in any column
+2. order_id must be unique
+3. total must equal quantity * unit_price (tolerance: 0.01)
+4. order_date must be in valid chronological order for sequential order_ids
+5. category must be from the allowed set
+6. All monetary values must have at most 2 decimal places
+7. shipping_country must be a valid ISO 2-letter code"""
+
+    rows = _csv_to_rows(clean_csv)
+    header = rows[0]
+    data = rows[1:]
+    issues: List[PlantedIssue] = []
+
+    # Issue 1: total doesn't match quantity * unit_price
+    r = 4  # ORD-005
+    data[r][9] = "84.00"  # should be 42.00 (qty=1, price=42.00)
+    issues.append(PlantedIssue(row=r + 1, col="total", issue_type="inconsistent_value",
+                               description="total (84.00) != quantity (1) * unit_price (42.00)"))
+
+    # Issue 2: Invalid category
+    r = 9  # ORD-010
+    data[r][3] = "Fitness"  # should be Sports
+    issues.append(PlantedIssue(row=r + 1, col="category", issue_type="format_violation",
+                               description="'Fitness' is not in allowed categories"))
+
+    # Issue 3: Missing value in product_name
+    r = 13  # ORD-014
+    data[r][2] = ""
+    issues.append(PlantedIssue(row=r + 1, col="product_name", issue_type="missing_value",
+                               description="Empty product_name"))
+
+    # Issue 4: Out of range quantity
+    r = 16  # ORD-017
+    data[r][4] = "-1"
+    issues.append(PlantedIssue(row=r + 1, col="quantity", issue_type="out_of_range",
+                               description="Negative quantity"))
+
+    # Issue 5: Duplicate order_id
+    r = 18  # ORD-019
+    data[r][0] = "ORD-003"
+    issues.append(PlantedIssue(row=r + 1, col="order_id", issue_type="duplicate_row",
+                               description="Duplicate order_id ORD-003"))
+
+    # Issue 6: Wrong date format
+    r = 11  # ORD-012
+    data[r][6] = "26/01/2024"
+    issues.append(PlantedIssue(row=r + 1, col="order_date", issue_type="format_violation",
+                               description="Date format DD/MM/YYYY instead of YYYY-MM-DD"))
+
+    corrupted = _rows_to_csv([header] + data)
+
+    return Task(
+        task_id="medium",
+        name="E-commerce Orders Validation",
+        description=(
+            "You are given an e-commerce orders dataset. "
+            "Find all data quality issues based on the schema and validation rules. "
+            "Report each issue in the format: row:<row_number>,col:<column_name>,issue:<issue_type>"
+        ),
+        schema_description=schema_desc,
+        validation_rules=rules,
+        clean_csv=clean_csv,
+        planted_issues=issues,
+        corrupted_csv=corrupted,
+        max_steps=3,
+    )
+
+
+# ---------------------------------------------------------------------------
+# TASK 3: Hard — ML training metadata with subtle issues
+# ---------------------------------------------------------------------------
+
+def create_task_hard(seed: int = 42) -> Task:
+    rng = random.Random(seed)
+
+    clean_csv = """experiment_id,model_name,dataset,train_size,val_size,test_size,learning_rate,batch_size,epochs,train_loss,val_loss,test_accuracy,gpu_memory_gb,training_time_hours,timestamp
+EXP-001,resnet50,imagenet-1k,1281167,50000,100000,0.001,256,90,0.85,1.12,76.3,12.4,48.5,2024-03-01T10:00:00
+EXP-002,bert-base,squad-v2,130319,11873,8862,0.00003,32,3,0.45,0.52,81.2,7.8,2.1,2024-03-02T14:30:00
+EXP-003,gpt2-small,openwebtext,8013769,100000,100000,0.0003,64,1,3.12,3.28,0.0,14.2,72.0,2024-03-03T09:15:00
+EXP-004,vit-base,imagenet-1k,1281167,50000,100000,0.001,512,300,0.72,0.98,79.8,15.6,96.0,2024-03-05T08:00:00
+EXP-005,distilbert,mnli,392702,9815,9796,0.00005,16,5,0.28,0.35,84.6,5.2,1.5,2024-03-06T11:00:00
+EXP-006,llama2-7b,alpaca-52k,51760,500,500,0.00002,4,3,1.05,1.18,0.0,38.5,8.2,2024-03-07T16:00:00
+EXP-007,resnet18,cifar10,50000,5000,10000,0.01,128,200,0.15,0.28,93.5,3.2,1.8,2024-03-08T10:30:00
+EXP-008,t5-small,cnn-dailymail,287113,13368,11490,0.0001,16,10,1.45,1.62,0.0,6.8,4.5,2024-03-09T13:00:00
+EXP-009,efficientnet-b0,imagenet-1k,1281167,50000,100000,0.005,256,350,0.68,0.89,77.1,8.4,36.0,2024-03-10T07:45:00
+EXP-010,roberta-large,sst2,67349,872,1821,0.00001,8,10,0.08,0.12,95.1,14.8,3.2,2024-03-11T15:00:00
+EXP-011,yolov5-m,coco-2017,118287,5000,40670,0.01,32,300,0.032,0.045,0.0,10.2,24.0,2024-03-12T09:00:00
+EXP-012,wav2vec2,librispeech,281241,5567,2620,0.0001,8,20,0.92,1.05,0.0,12.6,15.0,2024-03-13T11:30:00
+EXP-013,clip-base,cc3m,2818102,15000,15000,0.00001,256,32,2.15,2.38,0.0,22.4,48.0,2024-03-14T08:00:00
+EXP-014,detr,coco-2017,118287,5000,40670,0.0001,4,500,1.85,2.12,0.0,16.0,72.0,2024-03-15T10:00:00
+EXP-015,whisper-small,common-voice,520000,16000,16000,0.00005,16,5,0.55,0.68,0.0,7.4,6.5,2024-03-16T14:00:00"""
+
+    schema_desc = """Columns:
+- experiment_id: string, unique, format EXP-NNN
+- model_name: string, non-empty
+- dataset: string, non-empty
+- train_size: integer, positive, must be > val_size and > test_size
+- val_size: integer, positive
+- test_size: integer, positive
+- learning_rate: float, range 1e-7 to 1.0
+- batch_size: integer, must be power of 2, range 1-1024
+- epochs: integer, positive, range 1-1000
+- train_loss: float, non-negative
+- val_loss: float, non-negative, typically >= train_loss (if not, may indicate data leakage)
+- test_accuracy: float, range 0-100 (percentage), 0.0 is valid for generative models
+- gpu_memory_gb: float, positive
+- training_time_hours: float, positive
+- timestamp: string, ISO 8601 format, chronological order by experiment_id"""
+
+    rules = """1. No missing values
290
+ 2. experiment_id must be unique
291
+ 3. val_loss should be >= train_loss (if val_loss < train_loss significantly, flag as potential data leakage)
292
+ 4. batch_size must be a power of 2
293
+ 5. train_size must be larger than both val_size and test_size
294
+ 6. learning_rate must be within valid range
295
+ 7. gpu_memory_gb should be reasonable for the model size (e.g., resnet18 shouldn't need 40GB)
296
+ 8. training_time should be proportional to dataset size and epochs (flag major inconsistencies)
297
+ 9. timestamps must be in chronological order"""
298
+
299
+ rows = _csv_to_rows(clean_csv)
300
+ header = rows[0]
301
+ data = rows[1:]
302
+ issues: List[PlantedIssue] = []
303
+
304
+ # Issue 1: Data leakage signal — val_loss much lower than train_loss
305
+ r = 4 # EXP-005
306
+ data[r][10] = "0.15" # val_loss=0.15 but train_loss=0.28 → suspicious
307
+ issues.append(PlantedIssue(row=r + 1, col="val_loss", issue_type="inconsistent_value",
308
+ description="val_loss (0.15) significantly less than train_loss (0.28), potential data leakage"))
309
+
310
+ # Issue 2: Batch size not power of 2
311
+ r = 8 # EXP-009
312
+ data[r][7] = "250" # not a power of 2
313
+ issues.append(PlantedIssue(row=r + 1, col="batch_size", issue_type="format_violation",
314
+ description="batch_size 250 is not a power of 2"))
315
+
316
+ # Issue 3: GPU memory unreasonable for model
317
+ r = 6 # EXP-007 resnet18 on cifar10
318
+ data[r][12] = "42.5" # resnet18 shouldn't need 42.5 GB
319
+ issues.append(PlantedIssue(row=r + 1, col="gpu_memory_gb", issue_type="statistical_outlier",
320
+ description="resnet18 on cifar10 using 42.5 GB GPU memory is unreasonable"))
321
+
322
+ # Issue 4: Timestamp out of order
323
+ r = 10 # EXP-011
324
+ data[r][14] = "2024-03-02T09:00:00" # should be after EXP-010's timestamp
325
+ issues.append(PlantedIssue(row=r + 1, col="timestamp", issue_type="inconsistent_value",
326
+ description="Timestamp 2024-03-02 is before EXP-010's timestamp 2024-03-11"))
327
+
328
+ # Issue 5: Train size smaller than test size
329
+ r = 9 # EXP-010
330
+ data[r][3] = "500" # train_size=500 but test_size=1821
331
+ issues.append(PlantedIssue(row=r + 1, col="train_size", issue_type="inconsistent_value",
332
+ description="train_size (500) is smaller than test_size (1821)"))
333
+
334
+ # Issue 6: Negative training time
335
+ r = 13 # EXP-014
336
+ data[r][13] = "-72.0"
337
+ issues.append(PlantedIssue(row=r + 1, col="training_time_hours", issue_type="out_of_range",
338
+ description="Negative training time"))
339
+
340
+ # Issue 7: Learning rate out of range
341
+ r = 12 # EXP-013
342
+ data[r][6] = "2.5" # way too high
343
+ issues.append(PlantedIssue(row=r + 1, col="learning_rate", issue_type="out_of_range",
344
+ description="Learning rate 2.5 exceeds maximum of 1.0"))
345
+
346
+ # Issue 8: Missing model name (subtle — single space instead of empty)
347
+ r = 14 # EXP-015
348
+ data[r][1] = " "
349
+ issues.append(PlantedIssue(row=r + 1, col="model_name", issue_type="missing_value",
350
+ description="model_name is whitespace-only"))
351
+
352
+ corrupted = _rows_to_csv([header] + data)
353
+
354
+ return Task(
355
+ task_id="hard",
356
+ name="ML Experiment Metadata Validation",
357
+ description=(
358
+ "You are given an ML experiment tracking dataset. "
359
+ "Find all data quality issues based on the schema and validation rules. "
360
+ "This dataset contains subtle issues including potential data leakage signals, "
361
+ "unreasonable resource usage, and logical inconsistencies. "
362
+ "Report each issue in the format: row:<row_number>,col:<column_name>,issue:<issue_type>"
363
+ ),
364
+ schema_description=schema_desc,
365
+ validation_rules=rules,
366
+ clean_csv=clean_csv,
367
+ planted_issues=issues,
368
+ corrupted_csv=corrupted,
369
+ max_steps=3,
370
+ )
371
+
372
+
373
+ # ---------------------------------------------------------------------------
374
+ # Task registry
375
+ # ---------------------------------------------------------------------------
376
+
377
+ TASK_REGISTRY = {
378
+ "easy": create_task_easy,
379
+ "medium": create_task_medium,
380
+ "hard": create_task_hard,
381
+ }
382
+
383
+
384
+ def get_task(task_id: str, seed: int = 42) -> Task:
385
+ if task_id not in TASK_REGISTRY:
386
+ raise ValueError(f"Unknown task: {task_id}. Available: {list(TASK_REGISTRY.keys())}")
387
+ return TASK_REGISTRY[task_id](seed=seed)
388
+
389
+
390
+ def list_tasks() -> List[str]:
391
+ return list(TASK_REGISTRY.keys())
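Two conventions recur throughout the tasks above: rule 4 of the hard task ("batch_size must be a power of 2") and the canonical report string `row:<row_number>,col:<column_name>,issue:<issue_type>`. A minimal standalone sketch (not part of the repository; `is_power_of_two` and `issue_string` are illustrative names):

```python
def is_power_of_two(n: int) -> bool:
    # A positive power of two has exactly one bit set,
    # so n & (n - 1) clears that bit and yields 0.
    return n > 0 and (n & (n - 1)) == 0


def issue_string(row: int, col: str, issue: str) -> str:
    # The canonical report format the tasks expect agents to emit.
    return f"row:{row},col:{col},issue:{issue}"


print(is_power_of_two(256))  # True  (a valid batch size)
print(is_power_of_two(250))  # False (the value planted in EXP-009)
print(issue_string(9, "batch_size", "format_violation"))
```

Note that the planted issues use `row=r + 1` for the same reason: `r` indexes the `data` list (0-based), while reports are 1-based over data rows.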
inference.py ADDED
@@ -0,0 +1,322 @@
+ #!/usr/bin/env python3
+ """
+ DataQA Inference Script
+ -----------------------
+ LLM agent that plays the DataQA environment.
+ Uses the OpenAI client to interact with any OpenAI-compatible LLM API.
+
+ Required environment variables:
+     API_BASE_URL - LLM API endpoint (e.g., https://api.groq.com/openai/v1)
+     MODEL_NAME - Model identifier (e.g., llama-3.3-70b-versatile)
+     HF_TOKEN - HuggingFace token (for HF Spaces access)
+
+ Structured logging format: [START], [STEP], [END] tags for evaluation.
+ """
+
+ from __future__ import annotations
+
+ import json
+ import os
+ import re
+ import sys
+ import time
+ from typing import Optional
+
+ import requests
+ from openai import OpenAI
+
+ # ---------------------------------------------------------------------------
+ # Configuration
+ # ---------------------------------------------------------------------------
+ API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.groq.com/openai/v1")
+ MODEL_NAME = os.environ.get("MODEL_NAME", "llama-3.3-70b-versatile")
+ HF_TOKEN = os.environ.get("HF_TOKEN", "")
+ ENV_URL = os.environ.get("ENV_URL", "http://localhost:8000")
+
+ TASKS = ["easy", "medium", "hard"]
+ MAX_STEPS_PER_TASK = 3
+
+ # ---------------------------------------------------------------------------
+ # Logging helpers (structured stdout for evaluation)
+ # ---------------------------------------------------------------------------
+
+ def log_start(task_id: str, metadata: Optional[dict] = None):
+     entry = {"event": "START", "task_id": task_id, "timestamp": time.time()}
+     if metadata:
+         entry["metadata"] = metadata
+     print(f"[START] {json.dumps(entry)}", flush=True)
+
+
+ def log_step(task_id: str, step: int, reward: float, details: Optional[dict] = None):
+     entry = {
+         "event": "STEP",
+         "task_id": task_id,
+         "step": step,
+         "reward": reward,
+         "timestamp": time.time(),
+     }
+     if details:
+         entry["details"] = details
+     print(f"[STEP] {json.dumps(entry)}", flush=True)
+
+
+ def log_end(task_id: str, final_score: float, metadata: Optional[dict] = None):
+     entry = {
+         "event": "END",
+         "task_id": task_id,
+         "final_score": final_score,
+         "timestamp": time.time(),
+     }
+     if metadata:
+         entry["metadata"] = metadata
+     print(f"[END] {json.dumps(entry)}", flush=True)
+
+
+ # ---------------------------------------------------------------------------
+ # Environment HTTP client (simple, no WebSocket needed for inference)
+ # ---------------------------------------------------------------------------
+
+ class EnvHTTPClient:
+     """Minimal HTTP client for the DataQA environment."""
+
+     def __init__(self, base_url: str):
+         self.base_url = base_url.rstrip("/")
+         self.session = requests.Session()
+
+     def health(self) -> bool:
+         try:
+             r = self.session.get(f"{self.base_url}/health", timeout=10)
+             return r.status_code == 200
+         except Exception:
+             return False
+
+     def reset(self, task_id: str = "easy") -> dict:
+         r = self.session.post(
+             f"{self.base_url}/reset",
+             json={"task_id": task_id},
+             timeout=30,
+         )
+         r.raise_for_status()
+         return r.json()
+
+     def step(self, issues: list[str], task_id: str = "easy") -> dict:
+         r = self.session.post(
+             f"{self.base_url}/step",
+             json={"action": {"issues": issues, "task_id": task_id}},
+             timeout=30,
+         )
+         r.raise_for_status()
+         return r.json()
+
+     def state(self) -> dict:
+         r = self.session.get(f"{self.base_url}/state", timeout=10)
+         r.raise_for_status()
+         return r.json()
+
+
+ # ---------------------------------------------------------------------------
+ # LLM Agent
+ # ---------------------------------------------------------------------------
+
+ SYSTEM_PROMPT = """You are a data quality analyst. Your job is to inspect datasets and identify data quality issues.
+
+ You will be given:
+ 1. A dataset in CSV format
+ 2. A schema describing expected column types and constraints
+ 3. Validation rules that the data should satisfy
+
+ You must identify ALL data quality issues and report each one in EXACTLY this format:
+ row:<row_number>,col:<column_name>,issue:<issue_type>
+
+ Supported issue types:
+ - missing_value (null, empty, or whitespace-only)
+ - wrong_type (value doesn't match expected type)
+ - duplicate_row (exact duplicate or duplicate key)
+ - out_of_range (value outside valid range)
+ - format_violation (wrong format, invalid enum value)
+ - inconsistent_value (computed field doesn't match, logical inconsistency)
+ - statistical_outlier (value is unreasonable given context)
+ - referential_integrity (foreign key violation)
+
+ CRITICAL INSTRUCTIONS FOR ROW NUMBERING:
+ - Row numbers refer to the ROW POSITION in the CSV data, NOT the value of any ID column
+ - Row 1 = the FIRST data row after the header
+ - Row 2 = the SECOND data row after the header
+ - For example, if the CSV has the header on line 1 and data starting on line 2, the data on line 2 is row 1, line 3 is row 2, etc.
+ - DO NOT use the employee_id, order_id, or experiment_id as the row number
+ - Column names must match exactly (use the CSV header names, lowercase)
+ - Check EVERY row and EVERY column systematically
+ - Consider cross-column consistency (e.g., total = quantity * price)
+ - Look for subtle issues like whitespace-only values and near-duplicates
+ - Report ALL issues you find, even if uncertain
+
+ Respond with ONLY the list of issues, one per line. No other text.
+ Example: row:3,col:salary,issue:missing_value"""
+
+
+ def build_user_prompt(observation: dict) -> str:
+     obs = observation if isinstance(observation, dict) else {}
+     parts = []
+
+     if obs.get("task_description"):
+         parts.append(f"TASK: {obs['task_description']}")
+
+     parts.append(f"SCHEMA:\n{obs.get('schema_description', '')}")
+     parts.append(f"VALIDATION RULES:\n{obs.get('validation_rules', '')}")
+     parts.append(f"DATASET:\n{obs.get('dataset_csv', '')}")
+
+     hint = obs.get("num_issues_hint", 0)
+     if hint:
+         parts.append(f"HINT: There are exactly {hint} issues to find.")
+
+     feedback = obs.get("feedback", "")
+     if feedback and "reset" not in feedback.lower():
+         parts.append(f"FEEDBACK FROM PREVIOUS ATTEMPT:\n{feedback}")
+
+     return "\n\n".join(parts)
+
+
+ def parse_llm_response(response: str) -> list[str]:
+     """Extract issue lines from an LLM response."""
+     issues = []
+     for line in response.strip().split("\n"):
+         line = line.strip()
+         if not line:
+             continue
+         # Remove numbering like "1. " or "- " or "* "
+         line = re.sub(r"^\s*[\d]+[.\)]\s*", "", line)
+         line = re.sub(r"^\s*[-*]\s*", "", line)
+         line = line.strip()
+         if "row" in line.lower() and "col" in line.lower():
+             # Lenient regex: accept : or = as delimiters, case-insensitive
+             match = re.search(
+                 r"row\s*[:=]\s*(\d+)\s*[,;\s]+col(?:umn)?\s*[:=]\s*([\w_]+)\s*[,;\s]+issue\s*[:=]\s*([\w_]+)",
+                 line,
+                 re.IGNORECASE,
+             )
+             if match:
+                 # Normalize to lowercase canonical format
+                 normalized = f"row:{match.group(1)},col:{match.group(2).lower()},issue:{match.group(3).lower()}"
+                 issues.append(normalized)
+     return issues
+
+
+ def run_task(client: OpenAI, env: EnvHTTPClient, task_id: str) -> float:
+     """Run a single task and return the best score."""
+     log_start(task_id)
+
+     # Reset environment for this task
+     reset_response = env.reset(task_id=task_id)
+     observation = reset_response.get("observation", reset_response)
+
+     best_score = 0.0
+     messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+
+     for step_num in range(1, MAX_STEPS_PER_TASK + 1):
+         user_prompt = build_user_prompt(observation)
+         messages_for_call = messages + [{"role": "user", "content": user_prompt}]
+
+         # Call LLM with retry on rate limit
+         llm_output = ""
+         for attempt in range(3):
+             try:
+                 response = client.chat.completions.create(
+                     model=MODEL_NAME,
+                     messages=messages_for_call,
+                     temperature=0.1,
+                     max_tokens=2048,
+                 )
+                 llm_output = response.choices[0].message.content or ""
+                 break
+             except Exception as e:
+                 if "rate_limit" in str(e).lower() or "429" in str(e):
+                     wait = 10 * (attempt + 1)
+                     print(f"[WARN] Rate limited, waiting {wait}s...", flush=True)
+                     time.sleep(wait)
+                 else:
+                     print(f"[ERROR] LLM call failed: {e}", file=sys.stderr, flush=True)
+                     break
+
+         # Parse issues from LLM response
+         issues = parse_llm_response(llm_output)
+
+         if not issues:
+             print(f"[WARN] No issues parsed from LLM response for {task_id} step {step_num}", file=sys.stderr, flush=True)
+
+         # Submit to environment
+         step_response = env.step(issues, task_id=task_id)
+         observation = step_response.get("observation", step_response)
+
+         # reward and done are at the top level of the response, not inside observation
+         reward = float(step_response.get("reward", 0.0) or 0.0)
+         done = bool(step_response.get("done", False))
+         best_score = max(best_score, reward)
+
+         log_step(task_id, step_num, reward, {
+             "issues_reported": len(issues),
+             "feedback": observation.get("feedback", ""),
+         })
+
+         if done:
+             break
+
+         # Add context for next attempt
+         messages.append({"role": "user", "content": user_prompt})
+         messages.append({"role": "assistant", "content": llm_output})
+
+     log_end(task_id, best_score)
+     return best_score
+
+
+ # ---------------------------------------------------------------------------
+ # Main
+ # ---------------------------------------------------------------------------
+
+ def main():
+     print("[INFO] DataQA Inference starting", flush=True)
+     print(f"[INFO] ENV_URL={ENV_URL}", flush=True)
+     print(f"[INFO] API_BASE_URL={API_BASE_URL}", flush=True)
+     print(f"[INFO] MODEL_NAME={MODEL_NAME}", flush=True)
+
+     # Initialize clients
+     env = EnvHTTPClient(ENV_URL)
+     llm_client = OpenAI(
+         base_url=API_BASE_URL,
+         api_key=os.environ.get("LLM_API_KEY", HF_TOKEN or "no-key"),
+     )
+
+     # Check environment health
+     if not env.health():
+         print("[ERROR] Environment is not healthy. Exiting.", file=sys.stderr, flush=True)
+         sys.exit(1)
+
+     print("[INFO] Environment is healthy", flush=True)
+
+     # Run all tasks
+     scores = {}
+     for task_id in TASKS:
+         print(f"\n{'='*60}", flush=True)
+         print(f"[INFO] Starting task: {task_id}", flush=True)
+         print(f"{'='*60}", flush=True)
+
+         try:
+             score = run_task(llm_client, env, task_id)
+             scores[task_id] = score
+             print(f"[INFO] Task {task_id} completed with score: {score:.3f}", flush=True)
+         except Exception as e:
+             print(f"[ERROR] Task {task_id} failed: {e}", file=sys.stderr, flush=True)
+             scores[task_id] = 0.0
+
+     # Summary
+     print(f"\n{'='*60}", flush=True)
+     print("[INFO] FINAL RESULTS", flush=True)
+     print(f"{'='*60}", flush=True)
+     for task_id, score in scores.items():
+         print(f"[INFO] {task_id}: {score:.3f}", flush=True)
+
+     avg_score = sum(scores.values()) / len(scores) if scores else 0.0
+     print(f"[INFO] Average score: {avg_score:.3f}", flush=True)
+
+
+ if __name__ == "__main__":
+     main()
models.py ADDED
@@ -0,0 +1,4 @@
+ """Root-level models for OpenEnv compatibility."""
+ from dataqa_env.models import DataQAAction, DataQAObservation, DataQAState
+
+ __all__ = ["DataQAAction", "DataQAObservation", "DataQAState"]
openenv.yaml ADDED
@@ -0,0 +1,6 @@
+ spec_version: 1
+ name: dataqa_env
+ type: space
+ runtime: fastapi
+ app: dataqa_env.server.app:app
+ port: 8000
openenv_dataqa_env.egg-info/PKG-INFO ADDED
@@ -0,0 +1,13 @@
+ Metadata-Version: 2.4
+ Name: openenv-dataqa-env
+ Version: 0.1.0
+ Summary: Data Quality Assurance Environment for OpenEnv - An LLM agent inspects datasets to find planted quality issues
+ Requires-Python: >=3.10
+ Requires-Dist: openenv-core[core]>=0.2.2
+ Requires-Dist: fastapi>=0.115.0
+ Requires-Dist: pydantic>=2.0.0
+ Requires-Dist: uvicorn[standard]>=0.24.0
+ Requires-Dist: requests>=2.31.0
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
openenv_dataqa_env.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,15 @@
+ README.md
+ pyproject.toml
+ dataqa_env/__init__.py
+ dataqa_env/client.py
+ dataqa_env/models.py
+ dataqa_env/server/__init__.py
+ dataqa_env/server/app.py
+ dataqa_env/server/environment.py
+ dataqa_env/server/tasks.py
+ openenv_dataqa_env.egg-info/PKG-INFO
+ openenv_dataqa_env.egg-info/SOURCES.txt
+ openenv_dataqa_env.egg-info/dependency_links.txt
+ openenv_dataqa_env.egg-info/entry_points.txt
+ openenv_dataqa_env.egg-info/requires.txt
+ openenv_dataqa_env.egg-info/top_level.txt
openenv_dataqa_env.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
openenv_dataqa_env.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = dataqa_env.server.app:main
openenv_dataqa_env.egg-info/requires.txt ADDED
@@ -0,0 +1,9 @@
+ openenv-core[core]>=0.2.2
+ fastapi>=0.115.0
+ pydantic>=2.0.0
+ uvicorn[standard]>=0.24.0
+ requests>=2.31.0
+
+ [dev]
+ pytest>=8.0.0
+ pytest-cov>=4.0.0
openenv_dataqa_env.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ dataqa_env
pyproject.toml ADDED
@@ -0,0 +1,32 @@
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-dataqa-env"
+ version = "0.1.0"
+ description = "Data Quality Assurance Environment for OpenEnv - An LLM agent inspects datasets to find planted quality issues"
+ requires-python = ">=3.10"
+ dependencies = [
+     "openenv-core[core]>=0.2.2",
+     "fastapi>=0.115.0",
+     "pydantic>=2.0.0",
+     "uvicorn[standard]>=0.24.0",
+     "requests>=2.31.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ server = "dataqa_env.server.app:main"
+
+ [tool.setuptools]
+ packages = ["dataqa_env", "dataqa_env.server"]
+ package-dir = { "dataqa_env" = "dataqa_env", "dataqa_env.server" = "dataqa_env/server" }
+
+ [tool.setuptools.package-data]
+ dataqa_env = ["**/*.yaml", "**/*.yml"]
server/__init__.py ADDED
File without changes
server/app.py ADDED
@@ -0,0 +1,14 @@
+ """
+ Root-level server entry point for OpenEnv compatibility.
+ """
+
+ from dataqa_env.server.app import app  # noqa: F401
+
+
+ def main():
+     import uvicorn
+     uvicorn.run(app, host="0.0.0.0", port=8000)
+
+
+ if __name__ == "__main__":
+     main()
uv.lock ADDED
The diff for this file is too large to render. See raw diff