Prithvigg committed on
Commit a8a3c90 · verified · 1 Parent(s): 452be68

Upload folder using huggingface_hub
Dockerfile ADDED
@@ -0,0 +1,81 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ # Multi-stage build using openenv-base
+ # This Dockerfile is flexible and works for both:
+ #   - In-repo environments (with local OpenEnv sources)
+ #   - Standalone environments (with openenv from PyPI/Git)
+ # The build script (openenv build) handles context detection and sets appropriate build args.
+
+ ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+ FROM ${BASE_IMAGE} AS builder
+
+ WORKDIR /app
+
+ # Ensure git is available (required for installing dependencies from VCS)
+ RUN apt-get update && \
+     apt-get install -y --no-install-recommends git && \
+     rm -rf /var/lib/apt/lists/*
+
+ # Build argument to control whether we're building standalone or in-repo
+ ARG BUILD_MODE=in-repo
+ ARG ENV_NAME=queryforge
+
+ # Copy environment code (always at root of build context)
+ COPY . /app/env
+
+ # For in-repo builds, openenv is already vendored in the build context
+ # For standalone builds, openenv will be installed via pyproject.toml
+ WORKDIR /app/env
+
+ # Ensure uv is available (for local builds where the base image lacks it)
+ RUN if ! command -v uv >/dev/null 2>&1; then \
+         curl -LsSf https://astral.sh/uv/install.sh | sh && \
+         mv /root/.local/bin/uv /usr/local/bin/uv && \
+         mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+     fi
+
+ # Install dependencies using uv sync
+ # If uv.lock exists, use it; otherwise resolve on the fly
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+         uv sync --frozen --no-install-project --no-editable; \
+     else \
+         uv sync --no-install-project --no-editable; \
+     fi
+
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+         uv sync --frozen --no-editable; \
+     else \
+         uv sync --no-editable; \
+     fi
+
+ # Final runtime stage
+ FROM ${BASE_IMAGE}
+
+ WORKDIR /app
+
+ # Copy the virtual environment from the builder
+ COPY --from=builder /app/env/.venv /app/.venv
+
+ # Copy the environment code
+ COPY --from=builder /app/env /app/env
+
+ # Set PATH to use the virtual environment
+ ENV PATH="/app/.venv/bin:$PATH"
+
+ # Set PYTHONPATH so imports work correctly
+ ENV PYTHONPATH="/app/env:$PYTHONPATH"
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+     CMD curl -f http://localhost:8000/health || exit 1
+
+ # Run the FastAPI server
+ # The module path is constructed to work with the /app/env structure
+ ENV ENABLE_WEB_INTERFACE=true
+ CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
README.md CHANGED
@@ -1,10 +1,353 @@
  ---
- title: Queryforge
- emoji: 🏆
- colorFrom: green
- colorTo: blue
  sdk: docker
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: QueryForge Environment Server
+ emoji: 🔍
+ colorFrom: blue
+ colorTo: indigo
  sdk: docker
  pinned: false
+ app_port: 8000
+ base_path: /web
+ tags:
+ - openenv
+ - sql
+ - reinforcement-learning
  ---

+ # QueryForge — SQL Debugging & Optimisation Environment
+
+ SQL is the language that runs the world's data infrastructure. Yet SQL bugs are silent killers — a missing JOIN condition inflates totals by 3×, a correlated subquery scans a million rows once per row, a typo in a keyword stops production cold. These bugs are rarely caught by linters, rarely surfaced by error messages, and routinely shipped to production.
+
+ QueryForge is an **OpenEnv-compatible reinforcement learning environment** in which an agent learns to debug and optimise SQL queries. The agent receives a broken or slow query, submits fixes, and gets graded feedback from a deterministic DuckDB engine combined with an Anthropic AI quality judge — a smooth, informative reward signal across the full 0.0 → 1.0 range.
+
+ ---
+
+ ## Why SQL Debugging as an RL Environment?
+
+ LLMs can write SQL. What they struggle with is the **iterative, feedback-driven debugging loop** that real engineers work through:
+
+ - Read the error message
+ - Form a hypothesis about the root cause
+ - Patch the query
+ - Check whether the output is now correct
+ - Refine until it's both correct *and* efficient
+
+ This is precisely the loop that RL is built for. QueryForge provides the environment that closes this loop with a graded, multi-stage reward signal — not just "correct / incorrect" but partial credit for syntax validity, execution success, row correctness, and code quality.
+
+ ---
+
+ ## Environment Overview
+
+ | Property | Value |
+ |---|---|
+ | Task type | SQL debugging & optimisation |
+ | Action space | Single SQL query string |
+ | Observation space | Task description + graded feedback |
+ | Reward range | 0.0 – 1.0 (continuous) |
+ | Episode termination | Score ≥ 0.90, no improvement for 2 steps, or max steps |
+ | Grading engine | DuckDB (deterministic) + Anthropic AI judge |
+ | Concurrent sessions | Supported |
+
+ ---
+
+ ## Reward Scale
+
+ The grading pipeline has four stages that produce a smooth partial-progress signal:
+
+ | Score | Meaning |
+ |---|---|
+ | **0.00** | Syntax error — query could not be parsed |
+ | **0.15** | Syntax valid but runtime error |
+ | **0.30** | Executes but returns 0 rows or the wrong row count |
+ | **0.30 – 0.80** | Partial row correctness (deterministic, DuckDB) |
+ | **0.80 – 1.00** | Correct rows + AI quality assessment (Anthropic) |
+
+ The AI judge scores on three axes: **Correctness** (0–0.50), **Optimization** (0–0.30 — penalises cartesian products and correlated subqueries), and **Code quality** (0–0.20 — readability, aliases, formatting).
+
+ > **Offline mode:** If `ANTHROPIC_API_KEY` is not set, the AI judge is skipped and scoring is fully deterministic (capped at 0.80). The done threshold self-adjusts to 0.80 in this case so episodes still terminate correctly.
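The reward table can be read as a piecewise score function. Below is a minimal sketch of that mapping, for illustration only: the actual judge lives in `judge.py`, and the `row_match` and `ai_quality` inputs here are hypothetical stand-ins for its internal row-comparison and AI-judge results.

```python
def staged_score(syntax_ok: bool, runs: bool, row_match: float,
                 ai_quality: float = 0.0) -> float:
    """Illustrative staged reward: maps grading outcomes to [0.0, 1.0].

    row_match  -- fraction of expected rows matched (hypothetical input)
    ai_quality -- AI judge score in [0, 1], applied once rows are correct
    """
    if not syntax_ok:
        return 0.0                      # Stage 1: query could not be parsed
    if not runs:
        return 0.15                     # Stage 2: parsed, but runtime error
    if row_match <= 0.0:
        return 0.30                     # Stage 3 floor: ran, wrong rows
    if row_match < 1.0:
        return 0.30 + 0.50 * row_match  # partial row credit, capped at 0.80
    return 0.80 + 0.20 * ai_quality     # Stage 4: AI judge lifts toward 1.00
```

In offline mode, `ai_quality` stays 0.0, so the maximum reachable score under this sketch is 0.80, matching the deterministic cap.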
+
+ ---
+
+ ## Action Space
+
+ ```python
+ class SQLAction(Action):
+     sql: str  # The SQL query to submit for grading
+ ```
+
+ One field: the agent submits a SQL string. Multi-statement queries (`;`-separated) are not allowed and are rejected with score 0.0.
+
+ ---
+
+ ## Observation Space
+
+ ```python
+ class SQLObservation(Observation):
+     # Task context (set on reset, constant within an episode)
+     task_id: str           # e.g. "task_easy_syntax"
+     task_level: str        # "easy" | "medium" | "hard" | "custom"
+     task_title: str        # Human-readable title
+     task_description: str  # Full context: schema, broken query, error, goal
+
+     # Per-step grading signals
+     syntax_valid: bool       # True if the query parsed without error
+     execution_success: bool  # True if the query ran to completion in DuckDB
+     execution_error: str     # Runtime error message, if any
+     rows_returned: int       # Number of rows returned
+
+     # Feedback
+     feedback: str  # Detailed grading feedback (DuckDB + AI judge)
+     hint: str      # Actionable hint (suppressed once score >= 0.90)
+
+     # Episode progress
+     attempt: int       # Number of queries submitted this episode
+     best_score: float  # Highest score achieved so far
+     done: bool
+     reward: float      # Score for this specific step (0.0 – 1.0)
+ ```
+
+ ---
+
+ ## Built-in Tasks
+
+ ### Easy — Fix Syntax Errors
+ Three SQL keywords are misspelled (`SELEC`, `FORM`, `WEHRE`). The agent must identify and correct them.
+
+ **Schema:** `users(id, name, age, city)` — 6 rows
+ **Goal:** Return the name and age of users older than 30 in New York, ordered by name
+ **Max steps:** 5
+
+ ### Medium — Fix the Cartesian JOIN
+ A missing `JOIN` condition (`o.product_id = p.id`) causes a cartesian product, inflating every total by 3×. The agent must rewrite the query using explicit `INNER JOIN … ON` syntax.
+
+ **Schema:** `orders`, `users`, `products` — e-commerce dataset
+ **Goal:** Correct per-(user, product) total amount spent, ordered by total DESC
+ **Max steps:** 5
+
+ ### Hard — Rewrite Correlated Subquery as CTE
+ A semantically correct but O(N²) query re-executes `AVG(salary)` for every employee row. The agent must rewrite it using a `WITH` clause that computes department averages exactly once.
+
+ **Schema:** `departments`, `employees` — 9 employees across 3 departments
+ **Goal:** Employees who earn strictly above their department average, ordered by dept/salary
+ **Max steps:** 6
+
+ > Tasks have **structural penalties**: the hard task requires a `WITH` clause (−0.30 if absent); the medium task requires explicit `JOIN` syntax (−0.20 if absent). This prevents an agent from gaming the score by submitting the broken query verbatim.
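Checks like these can be implemented as simple pattern tests on the submitted SQL before the deterministic score is finalised. A hedged sketch (the function name and exact penalty mechanics here are illustrative, not the environment's actual code):

```python
import re

def structural_penalty(task_id: str, sql: str) -> float:
    """Illustrative structural check: returns a penalty to subtract
    from the deterministic score (0.0 means no penalty)."""
    normalized = sql.upper()
    # Hard task: the point is to rewrite with a CTE, so demand a WITH clause
    if task_id == "task_hard_cte" and not re.search(r"\bWITH\b", normalized):
        return 0.30
    # Medium task: demand explicit JOIN syntax instead of comma joins
    if task_id == "task_medium_join" and not re.search(r"\bJOIN\b", normalized):
        return 0.20
    return 0.0
```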
+
+ ---
+
+ ## Custom Tasks
+
+ Register any SQL task at runtime — no code changes needed.
+
+ ### Via Python
+ ```python
+ from tasks import REGISTRY, task_from_dict
+
+ REGISTRY.register(task_from_dict({
+     "id": "my_window_task",
+     "level": "hard",
+     "title": "Rank Employees by Salary",
+     "schema_ddl": "CREATE TABLE emp (id INT, name VARCHAR, dept VARCHAR, salary DECIMAL); INSERT INTO emp VALUES ...",
+     "broken_query": "SELECT name, salary FROM emp ORDER BY salary DESC",
+     "expected_rows": [{"name": "Alice", "rank": 1}, ...],
+     "hint": "Use ROW_NUMBER() OVER (PARTITION BY dept ORDER BY salary DESC)",
+     "solution_query": "SELECT name, RANK() OVER (ORDER BY salary DESC) AS rank FROM emp",
+ }))
+ ```
+
+ ### Via REST API (when the server is running)
+ ```bash
+ # Register a custom task
+ curl -X POST http://localhost:8000/tasks \
+     -H "Content-Type: application/json" \
+     -d '{"id": "my_task", "schema_ddl": "...", "expected_rows": [...]}'
+
+ # List all tasks
+ curl http://localhost:8000/tasks
+
+ # Remove a custom task
+ curl -X DELETE http://localhost:8000/tasks/my_task
+ ```
+
+ ### Via JSON file
+ ```python
+ REGISTRY.load_from_json("my_tasks.json")
+ ```
+
+ ---
+
+ ## Quickstart
+
+ ### Install dependencies
+ ```bash
+ python -m venv .venv
+ .venv/bin/pip install -e ".[dev]"
+ ```
+
+ ### Run the local playbook (no server needed)
+ Tests all three built-in tasks directly, with progressive SQL attempts:
+ ```bash
+ ANTHROPIC_API_KEY=your_key .venv/bin/python playbook.py
+ ```
+
+ ### Run the baseline inference script
+ Runs a Claude model as an agent against all tasks and reports scores:
+ ```bash
+ # Default model (claude-haiku-4-5)
+ ANTHROPIC_API_KEY=your_key .venv/bin/python baseline.py
+
+ # Specific model
+ ANTHROPIC_API_KEY=your_key .venv/bin/python baseline.py --model claude-opus-4-6
+
+ # Single task with verbose output
+ ANTHROPIC_API_KEY=your_key .venv/bin/python baseline.py --task task_hard_cte --verbose
+ ```
+
+ ### Run the HTTP server
+ ```bash
+ uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+ ```
+
+ ---
+
+ ## Baseline Results
+
+ The following scores were produced by running `claude-haiku-4-5` as the agent against all three tasks with the full AI judge active. These serve as the reproducible baseline for this environment.
+
+ | Task | Level | Steps Used | Best Score |
+ |---|---|---|---|
+ | Fix the Syntax Errors | easy | 1 | **1.000** |
+ | Fix the Cartesian JOIN | medium | 1 | **0.900** |
+ | Rewrite Correlated Subquery as CTE | hard | 1 | **0.950** |
+ | **Average** | | | **0.950** |
+
+ All three tasks were solved (or near-solved) on the first step, demonstrating that:
+ - The reward pipeline returns a meaningful signal immediately
+ - The environment terminates cleanly when the done threshold (≥ 0.90) is met
+ - A stronger model or a harder task set would produce more training-relevant trajectories
+
+ ---
+
+ ## API Endpoints
+
+ | Method | Endpoint | Description |
+ |---|---|---|
+ | `POST` | `/reset` | Start a new episode. Pass `{"task_id": "..."}` to pin to a task |
+ | `POST` | `/step` | Submit a SQL query: `{"sql": "SELECT ..."}` |
+ | `GET` | `/state` | Current episode ID and step count |
+ | `GET` | `/schema` | Action and observation JSON schemas |
+ | `POST` | `/tasks` | Register a custom task |
+ | `GET` | `/tasks` | List all registered tasks |
+ | `DELETE` | `/tasks/{task_id}` | Remove a custom task (built-ins protected) |
+ | `WS` | `/ws` | WebSocket endpoint for persistent low-latency sessions |
+ | `GET` | `/health` | Container health check |
+ | `GET` | `/docs` | Interactive OpenAPI documentation |
+
+ ### Examples
+
+ ```bash
+ # Start an episode pinned to the hard task
+ curl -X POST http://localhost:8000/reset \
+     -H "Content-Type: application/json" \
+     -d '{"task_id": "task_hard_cte"}'
+
+ # Submit a query
+ curl -X POST http://localhost:8000/step \
+     -H "Content-Type: application/json" \
+     -d '{"sql": "WITH dept_avg AS (SELECT department_id, AVG(salary) AS avg_salary FROM employees GROUP BY department_id) SELECT e.name, e.department_id, e.salary FROM employees e JOIN dept_avg d ON e.department_id = d.department_id WHERE e.salary > d.avg_salary ORDER BY e.department_id, e.salary DESC"}'
+
+ # List all available tasks
+ curl http://localhost:8000/tasks
+ ```
+
+ ---
+
+ ## Python Client
+
+ ```python
+ from queryforge import QueryforgeEnv, SQLAction, TaskSpec
+
+ with QueryforgeEnv(base_url="http://localhost:8000") as env:
+     # Pin to a specific task
+     obs = env.reset(task_id="task_medium_join")
+     print(obs.task_description)
+
+     # Submit a fix
+     result = env.step(SQLAction(sql="""
+         SELECT u.name, p.title, SUM(o.amount) AS total_spent
+         FROM orders o
+         INNER JOIN users u ON o.user_id = u.id
+         INNER JOIN products p ON o.product_id = p.id
+         GROUP BY u.name, p.title
+         ORDER BY total_spent DESC
+     """))
+     print(f"Score: {result.reward:.3f}")
+     print(f"Feedback: {result.observation.feedback}")
+     print(f"Done: {result.done}")
+
+     # Register and use a custom task
+     env.register_task(TaskSpec(
+         id="my_task",
+         schema_ddl="CREATE TABLE ...; INSERT INTO ...",
+         expected_rows=[{"col": "val"}],
+         title="My Custom Task",
+     ))
+     obs = env.reset(task_id="my_task")
+ ```
+
+ ---
+
+ ## Project Structure
+
+ ```
+ queryforge/
+ ├── __init__.py                   # Public exports (SQLAction, SQLObservation, TaskSpec, REGISTRY)
+ ├── models.py                     # SQLAction, SQLObservation, TaskSpec Pydantic models
+ ├── tasks.py                      # Built-in tasks + thread-safe TaskRegistry
+ ├── judge.py                      # 4-stage grading pipeline (DuckDB + Anthropic)
+ ├── client.py                     # QueryforgeEnv client with task management helpers
+ ├── playbook.py                   # Local test runner (no server required)
+ ├── baseline.py                   # Baseline inference script (Claude as agent)
+ ├── openenv.yaml                  # OpenEnv manifest
+ ├── pyproject.toml                # Project metadata and dependencies
+ ├── uv.lock                       # Locked dependencies
+ └── server/
+     ├── app.py                    # FastAPI app — core + /tasks REST endpoints
+     ├── queryforge_environment.py # Environment class (reset, step, state)
+     ├── Dockerfile                # Container image
+     └── requirements.txt          # Server dependencies
+ ```
+
+ ---
+
+ ## Deployment
+
+ ### Hugging Face Spaces (recommended)
+
+ ```bash
+ UV_CACHE_DIR=/tmp/uv-cache openenv push . --repo-id <hf-username>/queryforge
+ ```
+
+ Add `ANTHROPIC_API_KEY` as a Space secret after deployment. Without it, the environment runs in deterministic-only mode (scores capped at 0.80; the done threshold self-adjusts accordingly).
+
+ ### Docker
+
+ ```bash
+ docker build -t queryforge:latest -f server/Dockerfile .
+ docker run -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY queryforge:latest
+ ```
+
+ The deployed environment exposes:
+ - **`/web`** — Interactive UI for exploring the environment
+ - **`/docs`** — Full OpenAPI / Swagger interface
+ - **`/ws`** — WebSocket endpoint for persistent agent sessions
+ - **`/health`** — Container health monitoring
+
+ ---
+
+ ## Environment Design Notes
+
+ **Why DuckDB?** DuckDB runs fully in-memory with no external process or network dependency. Each `step()` call creates an isolated connection, seeds it with the task's schema, runs the agent's query, then closes — complete isolation with zero shared state between steps.
+
+ **Why a 4-stage reward?** Binary correct/incorrect rewards give an agent no gradient to climb when its query is simply broken. The 4-stage pipeline means every improvement — fixing a typo, avoiding a runtime error, returning the right row count, getting the right rows, writing clean SQL — is rewarded. This produces a smooth loss landscape for policy-gradient methods.
+
+ **Why structural penalties?** Without them, an agent could achieve 0.80 on the hard CTE task by submitting the original correlated subquery verbatim (rows match, but the task was never solved). Structural penalties enforce that the agent actually learned *what* to change, not just that rows matched.
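The connect/seed/run/close pattern behind per-step isolation can be sketched with the stdlib `sqlite3` module for portability (the environment itself uses DuckDB, but the isolation idea is the same; the helper name is illustrative):

```python
import sqlite3

def run_isolated(schema_ddl: str, sql: str) -> list:
    """Run `sql` against a fresh in-memory database seeded with `schema_ddl`.

    Illustrative sketch of per-step isolation: every call gets its own
    database, so no state leaks between grading steps.
    """
    conn = sqlite3.connect(":memory:")  # fresh database per call
    try:
        conn.executescript(schema_ddl)  # seed the task's tables
        return conn.execute(sql).fetchall()
    finally:
        conn.close()                    # discard everything after grading

rows = run_isolated(
    "CREATE TABLE users (id INT, name TEXT);"
    "INSERT INTO users VALUES (1, 'Ada'), (2, 'Bo');",
    "SELECT name FROM users WHERE id = 1",
)
```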
__init__.py ADDED
@@ -0,0 +1,23 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """QueryForge — SQL Debugger & Optimiser Environment."""
+
+ from .client import QueryforgeEnv
+ from .models import SQLAction, SQLObservation, TaskSpec
+ from .tasks import TASKS, TASK_BY_ID, SQLTask, REGISTRY, task_from_dict
+
+ __all__ = [
+     "SQLAction",
+     "SQLObservation",
+     "TaskSpec",
+     "QueryforgeEnv",
+     "TASKS",
+     "TASK_BY_ID",
+     "SQLTask",
+     "REGISTRY",
+     "task_from_dict",
+ ]
baseline.py ADDED
@@ -0,0 +1,244 @@
+ """
+ QueryForge Baseline Inference Script
+ ────────────────────────────────────
+ Runs a Claude model as an agent against all 3 built-in tasks and reports
+ a reproducible baseline score.
+
+ Usage:
+     # All tasks, default model (claude-haiku-4-5):
+     python baseline.py
+
+     # Specific model:
+     python baseline.py --model claude-opus-4-6
+
+     # Single task:
+     python baseline.py --task task_easy_syntax
+
+     # More verbose output:
+     python baseline.py --verbose
+
+ Requirements:
+     ANTHROPIC_API_KEY must be set in the environment.
+ """
+
+ import argparse
+ import os
+ import re
+ import sys
+
+ import anthropic
+
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+
+ from models import SQLAction
+ from server.queryforge_environment import QueryforgeEnvironment
+ from tasks import REGISTRY
+
+ # ── Constants ─────────────────────────────────────────────────────────────────
+
+ DEFAULT_MODEL = "claude-haiku-4-5"
+
+ SYSTEM_PROMPT = """\
+ You are an expert SQL engineer. You will be given a SQL debugging or \
+ optimisation challenge. Your job is to submit a corrected or improved SQL query.
+
+ Rules:
+ - Respond with ONLY a single SQL query inside a ```sql ... ``` code block.
+ - Do not explain your reasoning outside the code block.
+ - Do not include multiple statements (no semicolons except at the very end).
+ - If you receive feedback on a previous attempt, use it to improve your query.
+ """
+
+ # ── SQL extraction ────────────────────────────────────────────────────────────
+
+ _SQL_BLOCK = re.compile(r"```(?:sql)?\s*(.*?)```", re.DOTALL | re.IGNORECASE)
+
+
+ def _extract_sql(text: str) -> str:
+     """Pull the first SQL code block out of Claude's response."""
+     match = _SQL_BLOCK.search(text)
+     if match:
+         return match.group(1).strip()
+     # Fallback: return the whole response stripped — better than crashing
+     return text.strip()
+
+
+ # ── Formatting helpers ────────────────────────────────────────────────────────
+
+ def _hr(char="═", width=70):
+     print(char * width)
+
+
+ def _score_bar(score: float, width: int = 25) -> str:
+     filled = int(score * width)
+     bar = "█" * filled + "░" * (width - filled)
+     return f"[{bar}] {score:.3f}"
+
+
+ # ── Per-task agent loop ───────────────────────────────────────────────────────
+
+ def run_task(
+     task_id: str,
+     model: str,
+     client: anthropic.Anthropic,
+     verbose: bool = False,
+ ) -> dict:
+     """
+     Run one episode of a single task.
+
+     Returns a dict with keys:
+         task_id, task_title, task_level,
+         best_score, attempts, done
+     """
+     env = QueryforgeEnvironment()
+     obs = env.reset(task_id=task_id)
+
+     if obs.done:
+         # reset() returned an error (unknown task_id)
+         print(f"  ERROR: {obs.feedback}")
+         return {"task_id": task_id, "best_score": 0.0, "attempts": 0, "done": False}
+
+     print(f"\n  Task : {obs.task_title} [{obs.task_level}] (max {env._current_task.max_steps} steps)")
+     if verbose:
+         print(f"  ID   : {obs.task_id}")
+
+     # ── Build initial conversation ────────────────────────────────────────
+     messages = [
+         {
+             "role": "user",
+             "content": (
+                 f"Here is your SQL challenge:\n\n{obs.task_description}\n\n"
+                 "Provide your fixed SQL query."
+             ),
+         }
+     ]
+
+     step = 0
+     while not obs.done:
+         step += 1
+
+         # ── Call Claude ───────────────────────────────────────────────────
+         with client.messages.stream(
+             model=model,
+             max_tokens=512,
+             system=SYSTEM_PROMPT,
+             messages=messages,
+         ) as stream:
+             response_text = ""
+             for text in stream.text_stream:
+                 response_text += text
+
+         sql = _extract_sql(response_text)
+
+         if verbose:
+             print(f"\n  ── Step {step}")
+             short_sql = sql[:120] + ("…" if len(sql) > 120 else "")
+             print(f"  SQL: {short_sql}")
+
+         # ── Submit to environment ─────────────────────────────────────────
+         obs = env.step(SQLAction(sql=sql))
+
+         score_bar = _score_bar(obs.reward or 0.0)
+         status = "✓ DONE" if obs.done else f"step {step}/{env._current_task.max_steps}"
+         print(f"  [{status}] Score: {score_bar}")
+
+         if verbose and obs.feedback:
+             fb = obs.feedback[:200] + ("…" if len(obs.feedback) > 200 else "")
+             print(f"  Feedback: {fb}")
+
+         if obs.done:
+             break
+
+         # ── Append exchange to conversation for next attempt ──────────────
+         messages.append({"role": "assistant", "content": response_text})
+         messages.append({
+             "role": "user",
+             "content": (
+                 f"Your query scored {obs.reward:.3f}. Here is the feedback:\n\n"
+                 f"{obs.feedback}\n\n"
+                 f"Hint: {obs.hint}\n\n"
+                 "Please try again with an improved SQL query."
+             ),
+         })
+
+     return {
+         "task_id": task_id,
+         "task_title": obs.task_title,
+         "task_level": obs.task_level,
+         "best_score": obs.best_score,
+         "attempts": obs.attempt,
+         "done": obs.done,
+     }
+
+
+ # ── Main ──────────────────────────────────────────────────────────────────────
+
+ def main():
+     parser = argparse.ArgumentParser(description="QueryForge Baseline Inference")
+     parser.add_argument(
+         "--model", default=DEFAULT_MODEL,
+         help=f"Anthropic model ID to use (default: {DEFAULT_MODEL})"
+     )
+     parser.add_argument(
+         "--task", default=None,
+         help="Run a single task by ID instead of all built-in tasks"
+     )
+     parser.add_argument(
+         "--verbose", action="store_true",
+         help="Print SQL queries and full feedback for each step"
+     )
+     args = parser.parse_args()
+
+     # ── Validate API key ──────────────────────────────────────────────────
+     api_key = os.environ.get("ANTHROPIC_API_KEY")
+     if not api_key:
+         print("ERROR: ANTHROPIC_API_KEY is not set.")
+         sys.exit(1)
+
+     client = anthropic.Anthropic(api_key=api_key)
+
+     # ── Determine tasks to run ────────────────────────────────────────────
+     if args.task:
+         task_ids = [args.task]
+     else:
+         task_ids = ["task_easy_syntax", "task_medium_join", "task_hard_cte"]
+
+     # ── Header ────────────────────────────────────────────────────────────
+     _hr()
+     print("  QueryForge — Baseline Inference")
+     print(f"  Model : {args.model}")
+     print(f"  Tasks : {', '.join(task_ids)}")
+     _hr()
+
+     # ── Run each task ─────────────────────────────────────────────────────
+     results = []
+     for task_id in task_ids:
+         print(f"\n{'─' * 70}")
+         result = run_task(task_id, args.model, client, verbose=args.verbose)
+         results.append(result)
+
+     # ── Results table ─────────────────────────────────────────────────────
+     print(f"\n{'═' * 70}")
+     print("  BASELINE RESULTS")
+     print(f"  Model: {args.model}")
+     print(f"{'═' * 70}")
+     print(f"  {'Task':<28} {'Level':<8} {'Steps':>5} {'Best Score'}")
+     print(f"  {'─' * 28} {'─' * 8} {'─' * 5} {'─' * 30}")
+
+     total_score = 0.0
+     for r in results:
+         title = r.get("task_title", r["task_id"])[:27]
+         level = r.get("task_level", "?")
+         attempts = r.get("attempts", "?")
+         score = r["best_score"]
+         total_score += score
+         bar = _score_bar(score)
+         print(f"  {title:<28} {level:<8} {attempts:>5} {bar}")
+
+     avg = total_score / len(results) if results else 0.0
+     print(f"{'─' * 70}")
+     print(f"  {'AVERAGE':<28} {'':8} {'':5} {_score_bar(avg)}")
+     print(f"{'═' * 70}\n")
+
+
+ if __name__ == "__main__":
+     main()
client.py ADDED
@@ -0,0 +1,94 @@
+ """QueryForge Environment Client."""
+
+ from typing import Any, Dict, List, Optional
+
+ import httpx
+ from openenv.core import EnvClient
+ from openenv.core.client_types import StepResult
+ from openenv.core.env_server.types import State
+
+ from .models import SQLAction, SQLObservation, TaskSpec
+
+
+ class QueryforgeEnv(EnvClient[SQLAction, SQLObservation, State]):
+     """
+     Client for the QueryForge SQL Debugger & Optimiser environment.
+
+     Maintains a persistent WebSocket connection to the environment server.
+     Each client instance has its own dedicated session (isolated task state).
+
+     Example:
+         >>> with QueryforgeEnv(base_url="http://localhost:8000") as env:
+         ...     obs = env.reset()
+         ...     print(obs.task_title)
+         ...     print(obs.task_description)
+         ...
+         ...     result = env.step(SQLAction(sql="SELECT name, age FROM users WHERE age > 30"))
+         ...     print(result.reward, result.observation.feedback)
+
+     Example with Docker:
+         >>> env = QueryforgeEnv.from_docker_image("queryforge-env:latest")
+         >>> try:
+         ...     obs = env.reset()
+         ...     result = env.step(SQLAction(sql="SELECT ..."))
+         ... finally:
+         ...     env.close()
+     """
+
+     def _step_payload(self, action: SQLAction) -> Dict:
+         return {"sql": action.sql}
+
+     def _parse_result(self, payload: Dict) -> StepResult[SQLObservation]:
+         obs_data = payload.get("observation", {})
+         observation = SQLObservation(
+             task_id=obs_data.get("task_id", ""),
+             task_level=obs_data.get("task_level", ""),
+             task_title=obs_data.get("task_title", ""),
+             task_description=obs_data.get("task_description", ""),
+             syntax_valid=obs_data.get("syntax_valid", False),
+             execution_success=obs_data.get("execution_success", False),
+             execution_error=obs_data.get("execution_error"),
+             rows_returned=obs_data.get("rows_returned", 0),
+             feedback=obs_data.get("feedback", ""),
+             hint=obs_data.get("hint", ""),
+             attempt=obs_data.get("attempt", 0),
+             best_score=obs_data.get("best_score", 0.0),
+             done=payload.get("done", False),
+             reward=payload.get("reward", 0.0),
+             metadata=obs_data.get("metadata", {}),
+         )
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward", 0.0),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: Dict) -> State:
+         return State(
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+         )
+
+     # ── Task Registry helpers ─────────────────────────────────────────────
+
+     def register_task(self, spec: TaskSpec) -> Dict[str, Any]:
+         """Register a custom task on the server. Returns the server response dict."""
+         resp = httpx.post(
+             f"{self.base_url}/tasks",
+             json=spec.model_dump(),
+             timeout=10,
+         )
+         resp.raise_for_status()
+         return resp.json()
+
+     def list_tasks(self) -> List[Dict[str, Any]]:
+         """Return all registered tasks (built-in + custom) as a list of dicts."""
+         resp = httpx.get(f"{self.base_url}/tasks", timeout=10)
+         resp.raise_for_status()
+         return resp.json()
+
+     def delete_task(self, task_id: str) -> Dict[str, Any]:
+         """Remove a custom task by ID. Raises httpx.HTTPStatusError on 403/404."""
+         resp = httpx.delete(f"{self.base_url}/tasks/{task_id}", timeout=10)
+         resp.raise_for_status()
+         return resp.json()
judge.py ADDED
@@ -0,0 +1,414 @@
+ """
+ QueryForge Judge — deterministic DuckDB grading + Anthropic AI quality scoring.
+
+ Grading pipeline for each submitted SQL query:
+
+ Stage 1 — Syntax (0.0 → 0.15)
+ DuckDB EXPLAIN parses the query. Fail → score = 0.0.
+
+ Stage 2 — Execution (→ 0.30)
+ Run the full query against in-memory DuckDB seeded with task data.
+ Fail → score floor of 0.15 (syntax was fine, runtime error); AI feedback can lift it to at most 0.30.
+
+ Stage 3 — Correctness (→ 0.80)
+ Compare returned rows against expected rows.
+ Perfect match → deterministic score reaches 0.80.
+ Partial credit for correct row count or partial row matches.
+
+ Stage 4 — AI Quality (→ 1.0)
+ Anthropic claude-sonnet-4-6 evaluates optimization, code style, and
+ semantic correctness vs. the reference solution.
+ The AI score can move the final score up to 1.0 when rows are correct,
+ or provide nuanced feedback even when rows are partially wrong.
+
+ Environment variable required:
+ ANTHROPIC_API_KEY — standard Anthropic SDK key.
+ """
+
+ import json
+ import re
+ from typing import Any, Dict, List, Optional, Tuple
+
+ import anthropic
+ import duckdb
+
+ try:
+ from .tasks import SQLTask, TestCase
+ except ImportError:
+ from tasks import SQLTask, TestCase
+
+ JUDGE_MODEL = "claude-sonnet-4-6"
+ # ---------------------------------------------------------------------------
+ # Stage 1 — Syntax check
+ # ---------------------------------------------------------------------------
+
+ def _reject_multi_statement(query: str) -> Optional[str]:
+ """Return an error message if the query contains multiple statements."""
+ # Strip string literals and comments before checking for semicolons
+ stripped = re.sub(r"'[^']*'", "", query) # remove string literals
+ stripped = re.sub(r"--[^\n]*", "", stripped) # remove line comments
+ stripped = re.sub(r"/\*.*?\*/", "", stripped, flags=re.DOTALL) # block comments
+ stripped = stripped.strip().rstrip(";") # allow a single trailing semicolon
+ if ";" in stripped:
+ return "Multi-statement queries are not allowed."
+ return None
+
+
+ def check_syntax(query: str) -> Tuple[bool, Optional[str]]:
+ """
+ Return (is_valid, error_message).
+
+ Strategy: run EXPLAIN against an empty in-memory DuckDB.
+ - "Parser Error" in the exception → genuine syntax error → invalid.
+ - "Catalog Error" / "Binder Error" → tables unknown but syntax is fine → valid.
+ - Any other exception → treat as syntax error to be safe.
+ """
+ multi_err = _reject_multi_statement(query)
+ if multi_err:
+ return False, multi_err
+
+ conn = duckdb.connect(":memory:")
+ try:
+ conn.execute(f"EXPLAIN {query}")
+ return True, None
+ except Exception as exc:
+ msg = str(exc)
+ # Catalog/Binder errors mean the SQL parsed fine; tables just aren't seeded.
+ if any(
+ tag in msg
+ for tag in ("Catalog Error", "Binder Error", "Table with name",
+ "Referenced column", "does not exist", "column")
+ ):
+ return True, None
+ return False, msg
+ finally:
+ conn.close()
+
+
+ # ---------------------------------------------------------------------------
+ # Stage 2 — Execution
+ # ---------------------------------------------------------------------------
+
+ def execute_query(
+ schema_ddl: str, query: str
+ ) -> Tuple[bool, Optional[List[Dict[str, Any]]], Optional[str]]:
+ """
+ Seed a fresh DuckDB in-memory DB with *schema_ddl*, then run *query*.
+ Returns (success, rows_as_list_of_dicts, error_message).
+ """
+ conn = duckdb.connect(":memory:")
+ try:
+ conn.execute(schema_ddl)
+ result = conn.execute(query).fetchdf()
+ rows = result.to_dict(orient="records")
+ # Convert numpy types to native Python
+ clean: List[Dict[str, Any]] = []
+ for row in rows:
+ clean.append({k: _native(v) for k, v in row.items()})
+ return True, clean, None
+ except Exception as exc:
+ return False, None, str(exc)
+ finally:
+ conn.close()
+
+
+ def _native(value: Any) -> Any:
+ """Convert numpy scalars → native Python types for JSON-safe comparison."""
+ try:
+ import numpy as np # duckdb fetchdf() returns numpy types
+ if isinstance(value, (np.integer,)):
+ return int(value)
+ if isinstance(value, (np.floating,)):
+ return float(value)
+ if isinstance(value, np.bool_):
+ return bool(value)
+ except ImportError:
+ pass
+ return value
+
+
+ # ---------------------------------------------------------------------------
+ # Stage 3 — Row correctness
+ # ---------------------------------------------------------------------------
+
+ def _normalize(row: Dict[str, Any]) -> Dict[str, Any]:
+ """Round floats to 2 dp so 999.99000000001 == 999.99."""
+ return {
+ k: (round(float(v), 2) if isinstance(v, float) else v)
+ for k, v in row.items()
+ }
+
+
+ def _sort_key(row: Dict[str, Any], order_by: Optional[str]) -> tuple:
+ if order_by:
+ cols = [c.strip() for c in order_by.split(",")]
+ return tuple(str(row.get(c, "")) for c in cols)
+ return tuple(str(v) for v in row.values())
+
+
+ def rows_match(
+ actual: List[Dict[str, Any]],
+ expected: List[Dict[str, Any]],
+ order_by: Optional[str] = None,
+ ) -> Tuple[float, str]:
+ """
+ Compare *actual* vs *expected* rows.
+
+ Scoring:
+ 1.0 — exact match
+ 0.5–0.9 — row count matches, some rows differ
+ up to 0.3 — row count wrong; 0.3 scaled by the overlap ratio
+ 0.0 — empty when non-empty expected
+ """
+ if not expected:
+ return (1.0, "No expected rows — query accepted.") if not actual else (
+ 0.8, f"Expected empty result but got {len(actual)} row(s)."
+ )
+
+ if not actual:
+ return 0.0, f"Query returned 0 rows; expected {len(expected)}."
+
+ # Project actual rows to only the expected columns (agent may SELECT extra).
+ # Use case-insensitive matching: build a map from lower(actual_col) → actual_col.
+ expected_cols = list(expected[0].keys())
+ lower_map = {k.lower(): k for k in actual[0].keys()} if actual else {}
+
+ def _project(row: Dict[str, Any]) -> Dict[str, Any]:
+ out: Dict[str, Any] = {}
+ for ec in expected_cols:
+ actual_key = lower_map.get(ec.lower())
+ if actual_key is not None:
+ out[ec] = row[actual_key]
+ return out
+
+ projected = [_project(row) for row in actual]
+
+ if len(projected) != len(expected):
+ overlap_ratio = min(len(projected), len(expected)) / max(len(projected), len(expected))
+ score = 0.3 * overlap_ratio
+ return score, (
+ f"Row count mismatch: got {len(projected)}, expected {len(expected)}. "
+ f"({overlap_ratio:.0%} overlap ratio)"
+ )
+
+ actual_sorted = sorted([_normalize(r) for r in projected], key=lambda r: _sort_key(r, order_by))
+ expected_sorted = sorted([_normalize(r) for r in expected], key=lambda r: _sort_key(r, order_by))
+
+ matches = sum(1 for a, e in zip(actual_sorted, expected_sorted) if a == e)
+ row_accuracy = matches / len(expected)
+
+ if row_accuracy == 1.0:
+ return 1.0, "All rows match perfectly."
+
+ score = 0.5 + 0.4 * row_accuracy
+ return score, f"{matches}/{len(expected)} rows match correctly."
+
+
+ # ---------------------------------------------------------------------------
+ # Stage 4 — Anthropic AI judge
+ # ---------------------------------------------------------------------------
+
+ def call_anthropic_judge(
+ task: SQLTask,
+ agent_query: str,
+ execution_success: bool,
+ execution_error: Optional[str],
+ actual_rows: Optional[List[Dict[str, Any]]],
+ deterministic_score: float,
+ ) -> Tuple[float, str, str]:
+ """
+ Call claude-sonnet-4-6 to evaluate query quality across three axes:
+ - Correctness (0–0.50)
+ - Optimization (0–0.30) — avoids inefficiencies, uses best SQL patterns
+ - Code quality (0–0.20) — readable, well-aliased, idiomatic SQL
+
+ Returns (final_score, feedback, improvement_hint).
+ Falls back to deterministic_score if the API call fails.
+ """
+ client = anthropic.Anthropic()
+
+ sample_actual = json.dumps(actual_rows[:5] if actual_rows else [], indent=2)
+ sample_expected = json.dumps(
+ task.test_cases[0].expected_rows if task.test_cases else [], indent=2
+ )
+
+ prompt = f"""\
+ You are a strict SQL expert judge scoring an agent's query for the task below.
+
+ ## Task ({task.level})
+ {task.description}
+
+ ## Agent Query
+ ```sql
+ {agent_query}
+ ```
+
+ ## Execution
+ - Success: {execution_success}
+ - Error: {execution_error or "None"}
+ - Rows returned (first 5): {sample_actual}
+ - Expected rows: {sample_expected}
+
+ ## Reference Solution
+ ```sql
+ {task.solution_query}
+ ```
+
+ ## Deterministic row-match score (0.0–1.0): {deterministic_score:.3f}
+
+ Score the agent query on THREE axes and sum them for the final score:
+
+ | Axis | Max | Criteria |
+ |--------------|------|----------|
+ | Correctness | 0.50 | Produces the right rows for the stated goal |
+ | Optimization | 0.30 | Avoids cartesian products / correlated subqueries; uses efficient patterns (CTEs, explicit JOINs, proper GROUP BY) |
+ | Code quality | 0.20 | Readable aliases, clean formatting, no redundant clauses |
+
+ IMPORTANT rules:
+ - If execution failed with a runtime error, Correctness ≤ 0.10.
+ - If the deterministic row-match score is ≥ 0.95 (rows fully correct), Correctness ≥ 0.40.
+ - For the medium task: a query that still uses comma-join syntax scores Optimization ≤ 0.05.
+ - For the hard task: a query without a CTE scores Optimization ≤ 0.10.
+
+ Respond with ONLY valid JSON (no markdown fences):
+ {{
+ "correctness": <float 0.0–0.50>,
+ "optimization": <float 0.0–0.30>,
+ "code_quality": <float 0.0–0.20>,
+ "score": <sum of above, float 0.0–1.0>,
+ "feedback": "<2–3 sentences summarising what the agent did right/wrong>",
+ "hint": "<one concrete actionable improvement, or 'Excellent!' if score >= 0.95>"
+ }}"""
+
+ try:
+ message = client.messages.create(
+ model=JUDGE_MODEL,
+ max_tokens=512,
+ messages=[{"role": "user", "content": prompt}],
+ )
+ raw = message.content[0].text.strip()
+
+ # Strip accidental markdown fences
+ if raw.startswith("```"):
+ raw = raw.split("```")[1]
+ if raw.startswith("json"):
+ raw = raw[4:]
+ raw = raw.rsplit("```", 1)[0].strip()
+
+ data = json.loads(raw)
+ score = float(data["score"])
+ score = max(0.0, min(1.0, score))
+ feedback = str(data.get("feedback", ""))
+ hint = str(data.get("hint", ""))
+ return score, feedback, hint
+
+ except Exception as exc:
+ # Graceful fallback — no API key, network error, or parse failure
+ msg = str(exc).lower()
+ reason = (
+ "no ANTHROPIC_API_KEY set"
+ if "api_key" in msg or "auth" in msg or "authentication" in msg
+ else type(exc).__name__
+ )
+ return (
+ deterministic_score,
+ f"AI judge offline ({reason}). Using deterministic score.",
+ task.hint,
+ )
+
+
+ # ---------------------------------------------------------------------------
+ # Public entry point
+ # ---------------------------------------------------------------------------
+
+ def grade(
+ task: SQLTask, agent_query: str
+ ) -> Tuple[float, str, Dict[str, Any]]:
+ """
+ Full grading pipeline. Returns (score 0.0–1.0, feedback, details_dict).
+
+ Partial progress scoring:
+ 0.00 — syntax error (unparseable)
+ 0.15–0.30 — syntax valid, runtime error (AI-adjusted)
+ 0.30 — executes, but 0 rows returned
+ 0.30–0.80 — partial row matches (deterministic)
+ 0.80–1.00 — correct rows + AI quality assessment
+ """
+ details: Dict[str, Any] = {}
+
+ # ── Stage 1: syntax ──────────────────────────────────────────────────────
+ syntax_ok, syntax_error = check_syntax(agent_query)
+ details["syntax_valid"] = syntax_ok
+ details["syntax_error"] = syntax_error
+
+ if not syntax_ok:
+ return 0.0, f"Syntax error: {syntax_error}", details
+
+ # ── Stage 2: execution ───────────────────────────────────────────────────
+ exec_ok, rows, exec_error = execute_query(task.schema_ddl, agent_query)
+ details["execution_success"] = exec_ok
+ details["execution_error"] = exec_error
+ details["rows_returned"] = len(rows) if rows else 0
+
+ if not exec_ok:
+ # Syntax valid but runtime error — call AI for nuanced feedback
+ ai_score, ai_feedback, ai_hint = call_anthropic_judge(
+ task, agent_query, False, exec_error, None, 0.15
+ )
+ details["ai_score"] = ai_score
+ details["ai_feedback"] = ai_feedback
+ final = max(0.15, ai_score * 0.3) # floor 0.15; ai_score * 0.3 keeps this branch ≤ 0.30
+ return final, f"Runtime error: {exec_error} | AI: {ai_feedback}", details
+
+ # ── Stage 3: row correctness ─────────────────────────────────────────────
+ test_case = task.test_cases[0]
+ row_score, row_feedback = rows_match(rows, test_case.expected_rows, test_case.order_by)
+ details["row_match_score"] = row_score
+ details["row_match_feedback"] = row_feedback
+
+ # ── Stage 3b: structural checks (task-specific) ─────────────────────────
+ # These prevent high scores when the agent submits the broken query verbatim
+ # or ignores the task's structural requirement.
+ structural_penalty = 0.0
+ query_upper = agent_query.upper()
+
+ if task.level == "hard" and "WITH " not in query_upper:
+ structural_penalty = 0.30 # hard task demands a CTE
+ row_feedback += " (Penalty: no CTE detected — task requires WITH clause.)"
+ elif task.level == "medium" and "JOIN " not in query_upper:
+ structural_penalty = 0.20 # medium task demands explicit JOINs
+ row_feedback += " (Penalty: no explicit JOIN — task requires JOIN … ON syntax.)"
+
+ details["structural_penalty"] = structural_penalty
+
+ # Deterministic score: 0.30 base for executing + up to 0.50 for rows − penalty
+ deterministic_score = max(0.30, 0.30 + 0.50 * row_score - structural_penalty)
+
+ # ── Stage 4: AI quality ──────────────────────────────────────────────────
+ ai_score, ai_feedback, ai_hint = call_anthropic_judge(
+ task, agent_query, True, None, rows, deterministic_score
+ )
+ details["ai_score"] = ai_score
+ details["ai_feedback"] = ai_feedback
+ details["ai_hint"] = ai_hint
+
+ # Final blending:
+ # rows fully correct → trust AI score (can reach 1.0)
+ # rows partially wrong → clamp AI score to not exceed deterministic
+ if row_score >= 0.95:
+ final_score = ai_score
+ elif row_score >= 0.5:
+ # Blend: AI provides nuance but can't exceed deterministic ceiling
+ final_score = min(deterministic_score, ai_score + 0.05)
+ else:
+ # Low row accuracy — stay near deterministic
+ final_score = min(deterministic_score, ai_score * 0.6)
+
+ final_score = max(0.0, min(1.0, final_score))
+
+ feedback = (
+ f"[Rows] {row_feedback} "
+ f"[AI Judge] {ai_feedback} "
+ f"[Hint] {ai_hint}"
+ )
+ return final_score, feedback, details
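The deterministic half of the ladder above can be exercised on its own. A minimal sketch, assuming nothing beyond the thresholds stated in `grade()`'s docstring; `deterministic_score` is an illustrative helper name, not part of judge.py:

```python
# Illustrative sketch of grade()'s deterministic score ladder (not the judge.py API).
def deterministic_score(syntax_ok: bool, exec_ok: bool,
                        row_score: float, penalty: float = 0.0) -> float:
    if not syntax_ok:   # Stage 1 failed: unparseable SQL
        return 0.0
    if not exec_ok:     # Stage 2 failed: runtime-error floor
        return 0.15
    # Stage 3: 0.30 base for executing, up to 0.50 for row correctness,
    # minus any structural penalty, floored at 0.30.
    return max(0.30, 0.30 + 0.50 * row_score - penalty)

print(round(deterministic_score(False, False, 0.0), 2))       # 0.0
print(round(deterministic_score(True, False, 0.0), 2))        # 0.15
print(round(deterministic_score(True, True, 0.0), 2))         # 0.3
print(round(deterministic_score(True, True, 1.0), 2))         # 0.8
print(round(deterministic_score(True, True, 1.0, 0.30), 2))   # 0.5 (missing-CTE penalty)
```

This mirrors why a perfect row match caps at 0.80 before the AI stage: only Stage 4 can close the remaining 0.20.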
models.py ADDED
@@ -0,0 +1,126 @@
+ """
+ Data models for the QueryForge SQL environment.
+
+ SQLAction — the agent's submitted SQL query.
+ SQLObservation — task description + grading feedback returned after each step.
+ TaskSpec — payload for registering a custom task via POST /tasks.
+ """
+
+ from typing import Any, Dict, List, Optional
+
+ from openenv.core.env_server.types import Action, Observation
+ from pydantic import BaseModel, Field
+
+
+ class SQLAction(Action):
+ """Action: submit a SQL query for evaluation."""
+
+ sql: str = Field(..., description="The SQL query to submit for grading")
+
+
+ class SQLObservation(Observation):
+ """Observation returned after reset() or step()."""
+
+ # ── Task context ─────────────────────────────────────────────────────────
+ task_id: str = Field(default="", description="Active task identifier")
+ task_level: str = Field(
+ default="", description="Difficulty: easy | medium | hard"
+ )
+ task_title: str = Field(default="", description="Human-readable task title")
+ task_description: str = Field(
+ default="",
+ description=(
+ "Full task description: schema, broken query, error message, and goal"
+ ),
+ )
+
+ # ── Per-step grading signals ──────────────────────────────────────────────
+ syntax_valid: bool = Field(
+ default=False, description="True if the submitted query parsed without error"
+ )
+ execution_success: bool = Field(
+ default=False, description="True if the query ran to completion in DuckDB"
+ )
+ execution_error: Optional[str] = Field(
+ default=None, description="Runtime error message, if any"
+ )
+ rows_returned: int = Field(
+ default=0, description="Number of rows the query returned"
+ )
+ feedback: str = Field(
+ default="",
+ description="Detailed grading feedback from DuckDB + AI judge",
+ )
+ hint: str = Field(
+ default="", description="Actionable hint for the next attempt"
+ )
+
+ # ── Episode progress ──────────────────────────────────────────────────────
+ attempt: int = Field(
+ default=0, description="Number of queries submitted this episode"
+ )
+ best_score: float = Field(
+ default=0.0, description="Highest score achieved so far this episode"
+ )
+
+
+ class TaskSpec(BaseModel):
+ """
+ Payload for registering a custom SQL task via POST /tasks
+ or directly via REGISTRY.register(task_from_dict(spec.model_dump())).
+
+ Required: id, title, schema_ddl, expected_rows
+ Everything else has sensible defaults.
+ """
+
+ id: str = Field(
+ ..., description="Unique task identifier, e.g. 'null_handling_task'"
+ )
+ level: str = Field(
+ default="custom",
+ description="Difficulty label: easy | medium | hard | custom",
+ )
+ title: str = Field(..., description="Human-readable task title")
+ description: str = Field(
+ default="",
+ description="Full task description shown to the agent (schema, goal, etc.)",
+ )
+ schema_ddl: str = Field(
+ ...,
+ description="CREATE TABLE + INSERT statements to seed the DuckDB test DB",
+ )
+ broken_query: str = Field(
+ default="",
+ description="The broken or slow query the agent must fix",
+ )
+ error_message: str = Field(
+ default="",
+ description="Error or performance warning shown to the agent alongside the task",
+ )
+ hint: str = Field(
+ default="",
+ description="Actionable hint surfaced in the observation after each wrong attempt",
+ )
+ expected_rows: List[Dict[str, Any]] = Field(
+ ...,
+ description=(
+ "Exact rows the correct query must return. "
+ "Used for deterministic row-match scoring."
+ ),
+ )
+ order_by: Optional[str] = Field(
+ default=None,
+ description="Comma-separated column names used to sort rows before comparison",
+ )
+ solution_query: str = Field(
+ default="",
+ description="Reference solution shown to the AI judge for quality scoring",
+ )
+ test_description: str = Field(
+ default="Custom test case",
+ description="One-line description of what the test case checks",
+ )
+ max_steps: int = Field(
+ default=5, ge=1, le=20,
+ description="Maximum number of step() calls allowed per episode",
+ )
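As a quick illustration of the TaskSpec contract, the required-versus-defaulted split can be checked on a plain dict. A hedged sketch: `REQUIRED` and `missing_fields` are illustrative names, not part of models.py, and this does not replace pydantic validation:

```python
# Illustrative check of a TaskSpec-style payload's required fields
# (the fields marked with Field(...) above: id, title, schema_ddl, expected_rows).
REQUIRED = ("id", "title", "schema_ddl", "expected_rows")

def missing_fields(payload: dict) -> list:
    return [k for k in REQUIRED if k not in payload]

spec = {
    "id": "custom_null_avg",
    "title": "Handle NULLs in Aggregation",
    "schema_ddl": "CREATE TABLE students (id INTEGER, score INTEGER);",
    "expected_rows": [{"avg_score": 48.33}],
}
print(missing_fields(spec))           # []
print(missing_fields({"id": "x"}))    # ['title', 'schema_ddl', 'expected_rows']
```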
openenv.yaml ADDED
@@ -0,0 +1,26 @@
+ spec_version: 1
+ name: queryforge
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
+
+ description: |
+ SQL Query Debugger & Optimiser environment.
+
+ An agent receives a broken or slow SQL query together with the schema and an
+ error/performance warning. It must produce a working, optimised query.
+
+ Tasks (3 levels, cycled in order):
+ easy — fix three misspelled SQL keywords (SELECT / FROM / WHERE)
+ medium — fix a missing JOIN condition that causes a cartesian product
+ hard — rewrite a correlated subquery (O(N²)) as a CTE (O(N))
+
+ Reward signal (0.0 – 1.0):
+ 0.00 syntax error
+ 0.15 syntax valid, runtime error
+ 0.30 executes, wrong / empty results
+ 0.30–0.80 partial row correctness (deterministic, DuckDB)
+ 0.80–1.00 correct results + AI quality score (Anthropic claude-sonnet-4-6)
+
+ Required env var: ANTHROPIC_API_KEY
openenv_queryforge.egg-info/PKG-INFO ADDED
@@ -0,0 +1,11 @@
+ Metadata-Version: 2.4
+ Name: openenv-queryforge
+ Version: 0.1.0
+ Summary: Queryforge environment for OpenEnv
+ Requires-Python: >=3.10
+ Requires-Dist: openenv-core[core]>=0.2.1
+ Requires-Dist: duckdb>=0.10.0
+ Requires-Dist: anthropic>=0.25.0
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
openenv_queryforge.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,16 @@
+ README.md
+ pyproject.toml
+ ./__init__.py
+ ./client.py
+ ./judge.py
+ ./models.py
+ ./tasks.py
+ openenv_queryforge.egg-info/PKG-INFO
+ openenv_queryforge.egg-info/SOURCES.txt
+ openenv_queryforge.egg-info/dependency_links.txt
+ openenv_queryforge.egg-info/entry_points.txt
+ openenv_queryforge.egg-info/requires.txt
+ openenv_queryforge.egg-info/top_level.txt
+ server/__init__.py
+ server/app.py
+ server/queryforge_environment.py
openenv_queryforge.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
openenv_queryforge.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = queryforge.server.app:main
openenv_queryforge.egg-info/requires.txt ADDED
@@ -0,0 +1,7 @@
+ openenv-core[core]>=0.2.1
+ duckdb>=0.10.0
+ anthropic>=0.25.0
+
+ [dev]
+ pytest>=8.0.0
+ pytest-cov>=4.0.0
openenv_queryforge.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ queryforge
playbook.py ADDED
@@ -0,0 +1,249 @@
+ """
+ QueryForge Local Playbook
+ ─────────────────────────
+ Tests the environment directly (no HTTP server needed).
+
+ Run from the queryforge directory:
+ .venv/bin/python playbook.py
+
+ If ANTHROPIC_API_KEY is set, Stage 4 AI scoring is live.
+ If not set, the judge falls back to deterministic scoring (capped at 0.80).
+ """
+
+ import os
+ import sys
+ import textwrap
+
+ # Make imports work whether run directly or as a module
+ sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+
+ from server.queryforge_environment import QueryforgeEnvironment
+ from models import SQLAction
+ from tasks import REGISTRY, task_from_dict
+
+ # ── Formatting helpers ────────────────────────────────────────────────────────
+
+ def _hr(char="═", width=70):
+ print(char * width)
+
+ def _section(title):
+ print()
+ _hr()
+ print(f" {title}")
+ _hr()
+
+ def _score_bar(score: float, width: int = 30) -> str:
+ filled = int(score * width)
+ bar = "█" * filled + "░" * (width - filled)
+ return f"[{bar}] {score:.2f}"
+
+ def _print_obs(obs, show_description=False):
+ if show_description:
+ print()
+ print(textwrap.indent(obs.task_description, " "))
+ print()
+ if obs.feedback and obs.feedback != "New task loaded. Submit your fixed/optimised SQL query.":
+ print(f" Syntax valid : {obs.syntax_valid}")
+ print(f" Execution OK : {obs.execution_success}")
+ if obs.execution_error:
+ print(f" Execution error : {obs.execution_error[:100]}")
+ print(f" Rows returned : {obs.rows_returned}")
+ print(f" Score : {_score_bar(obs.reward or 0.0)}")
+ print(f" Best this ep. : {_score_bar(obs.best_score)}")
+ # Print just the first 250 chars of feedback to keep output clean
+ fb = obs.feedback[:250] + ("…" if len(obs.feedback) > 250 else "")
+ print(f" Feedback : {fb}")
+ if obs.hint:
+ print(f" Hint : {obs.hint[:120]}")
+
+ def _attempt(env, label: str, sql: str):
+ print(f"\n ── Attempt: {label}")
+ print(f" SQL: {sql[:100]}{'…' if len(sql) > 100 else ''}")
+ obs = env.step(SQLAction(sql=sql))
+ _print_obs(obs)
+ return obs
+
+
+ # ── Task runners ──────────────────────────────────────────────────────────────
+
+ def run_easy(env):
+ _section("TASK 1 · EASY — Fix Syntax Errors")
+ env._task_index = 0 # pin to easy
+ obs = env.reset()
+ print(f"\n Task : {obs.task_title} [{obs.task_level}]")
+ print(" Steps: up to 5")
+ _print_obs(obs, show_description=True)
+
+ _attempt(env, "still broken",
+ "SELEC name, age FORM users WEHRE age > 30")
+
+ _attempt(env, "one keyword fixed",
+ "SELECT name, age FORM users WEHRE age > 30")
+
+ _attempt(env, "all keywords fixed, no filter",
+ "SELECT name, age FROM users WHERE age > 30")
+
+ obs = _attempt(env, "correct solution",
+ "SELECT name, age FROM users "
+ "WHERE age > 30 AND city = 'New York' "
+ "ORDER BY name ASC")
+
+ print(f"\n Episode done: {obs.done} | Best score: {obs.best_score:.2f}")
+
+
+ def run_medium(env):
+ _section("TASK 2 · MEDIUM — Fix the Cartesian JOIN")
+ env._task_index = 1 # pin to medium
+ obs = env.reset()
+ print(f"\n Task : {obs.task_title} [{obs.task_level}]")
+ print(" Steps: up to 5")
+ _print_obs(obs, show_description=True)
+
+ _attempt(env, "broken verbatim (cartesian product)",
+ "SELECT u.name, p.title, SUM(o.amount) AS total_spent "
+ "FROM orders o, users u, products p "
+ "WHERE o.user_id = u.id "
+ "GROUP BY u.name, p.title "
+ "ORDER BY total_spent DESC")
+
+ _attempt(env, "comma-join but missing product condition",
+ "SELECT u.name, p.title, SUM(o.amount) AS total_spent "
+ "FROM orders o, users u, products p "
+ "WHERE o.user_id = u.id AND o.product_id = p.id "
+ "GROUP BY u.name, p.title "
+ "ORDER BY total_spent DESC")
+
+ obs = _attempt(env, "correct INNER JOINs",
+ "SELECT u.name, p.title, SUM(o.amount) AS total_spent\n"
+ "FROM orders o\n"
+ "INNER JOIN users u ON o.user_id = u.id\n"
+ "INNER JOIN products p ON o.product_id = p.id\n"
+ "GROUP BY u.name, p.title\n"
+ "ORDER BY total_spent DESC")
+
+ print(f"\n Episode done: {obs.done} | Best score: {obs.best_score:.2f}")
+
+
+ def run_hard(env):
+ _section("TASK 3 · HARD — Rewrite Correlated Subquery as CTE")
+ env._task_index = 2 # pin to hard
+ obs = env.reset()
+ print(f"\n Task : {obs.task_title} [{obs.task_level}]")
+ print(" Steps: up to 6")
+ _print_obs(obs, show_description=True)
+
+ _attempt(env, "broken verbatim (no CTE — penalised even though rows match)",
+ "SELECT e.name, e.department_id, e.salary\n"
+ "FROM employees e\n"
+ "WHERE e.salary > (\n"
+ " SELECT AVG(e2.salary) FROM employees e2\n"
+ " WHERE e2.department_id = e.department_id\n"
+ ")\n"
+ "ORDER BY e.department_id, e.salary DESC")
+
+ _attempt(env, "halfway — CTE defined but wrong join",
+ "WITH dept_avg AS (\n"
+ " SELECT department_id, AVG(salary) AS avg_salary\n"
+ " FROM employees GROUP BY department_id\n"
+ ")\n"
+ "SELECT e.name, e.department_id, e.salary\n"
+ "FROM employees e, dept_avg d\n"
+ "WHERE e.salary > d.avg_salary\n"
+ "ORDER BY e.department_id, e.salary DESC")
+
+ obs = _attempt(env, "correct CTE with proper JOIN",
+ "WITH dept_avg AS (\n"
+ " SELECT department_id, AVG(salary) AS avg_salary\n"
+ " FROM employees\n"
+ " GROUP BY department_id\n"
+ ")\n"
+ "SELECT e.name, e.department_id, e.salary\n"
+ "FROM employees e\n"
+ "JOIN dept_avg d ON e.department_id = d.department_id\n"
+ "WHERE e.salary > d.avg_salary\n"
+ "ORDER BY e.department_id, e.salary DESC")
+
+ print(f"\n Episode done: {obs.done} | Best score: {obs.best_score:.2f}")
+
+
+ # ── Custom task demo ──────────────────────────────────────────────────────────
+
+ def run_custom(env):
+ _section("TASK 4 · CUSTOM — NULL Handling in Aggregation")
+
+ # Register a brand-new task at runtime
+ custom_task = task_from_dict({
+ "id": "custom_null_avg",
+ "level": "custom",
+ "title": "Handle NULLs in Aggregation",
+ "description": """\
+ TASK: The query below skips NULL scores, making the class average look higher.
+ Fix it so NULL scores are treated as 0.
+
+ SCHEMA:
+ students(id INTEGER, name VARCHAR, score INTEGER)
+
+ BROKEN QUERY:
+ SELECT AVG(score) AS avg_score FROM students
+
+ ERROR:
+ NULL values are silently excluded by AVG(), inflating the result.
+
+ GOAL: Return a single row with avg_score that treats NULL as 0.
+ Expected result: avg_score ≈ 48.33""",
+ "schema_ddl": """\
+ CREATE TABLE students (id INTEGER, name VARCHAR, score INTEGER);
+ INSERT INTO students VALUES
+ (1, 'Alice', 90),
+ (2, 'Bob', NULL),
+ (3, 'Carol', 80),
+ (4, 'Dave', NULL),
+ (5, 'Eve', 70),
+ (6, 'Frank', 50);
+ """,
+ "broken_query": "SELECT AVG(score) AS avg_score FROM students",
+ "error_message": "NULL scores are silently skipped by AVG().",
+ "hint": "Wrap score with COALESCE(score, 0) before averaging.",
+ "expected_rows": [{"avg_score": 48.33}],
+ "solution_query": "SELECT AVG(COALESCE(score, 0)) AS avg_score FROM students",
+ "test_description": "AVG treats NULL as 0 → 290 / 6 ≈ 48.33",
+ "max_steps": 4,
+ })
+ REGISTRY.register(custom_task)
+
+ obs = env.reset(task_id="custom_null_avg")
+ print(f"\n Task : {obs.task_title} [{obs.task_level}]")
+ print(f" Steps: up to {custom_task.max_steps}")
+ _print_obs(obs, show_description=True)
+
+ _attempt(env, "broken (NULL excluded)",
+ "SELECT AVG(score) AS avg_score FROM students")
+
+ obs = _attempt(env, "correct (COALESCE)",
+ "SELECT AVG(COALESCE(score, 0)) AS avg_score FROM students")
+
+ print(f"\n Episode done: {obs.done} | Best score: {obs.best_score:.2f}")
+
+ # Clean up: remove custom task from registry
+ REGISTRY.unregister("custom_null_avg")
+ print(" Custom task unregistered from registry.")
+
+
+ # ── Main ──────────────────────────────────────────────────────────────────────
+
+ if __name__ == "__main__":
+ ai_key = os.environ.get("ANTHROPIC_API_KEY")
+
+ _hr("═")
+ print(" QueryForge — Local Playbook")
+ print(f" AI judge : {'LIVE (ANTHROPIC_API_KEY set)' if ai_key else 'OFFLINE (fallback to deterministic, max 0.80)'}")
+ _hr("═")
+
+ # Create a fresh env for each task so cycling order never matters
+ run_easy(QueryforgeEnvironment())
+ run_medium(QueryforgeEnvironment())
+ run_hard(QueryforgeEnvironment())
+ run_custom(QueryforgeEnvironment())
+
+ _section("DONE")
+ print(" All 4 tasks completed.\n")
pyproject.toml ADDED
@@ -0,0 +1,41 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-queryforge"
+ version = "0.1.0"
+ description = "QueryForge environment for OpenEnv"
+ requires-python = ">=3.10"
+ dependencies = [
+     # Core OpenEnv runtime (provides the FastAPI server + HTTP client types)
+     # To install from GitHub instead, use:
+     # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+     "openenv-core[core]>=0.2.1",
+     # SQL execution engine (in-memory, no external deps)
+     "duckdb>=0.10.0",
+     # AI judge — quality scoring via the Anthropic API
+     "anthropic>=0.25.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ # Server entry point - enables running via: uv run --project . server
+ # or: python -m queryforge.server.app
+ server = "queryforge.server.app:main"
+
+ [tool.setuptools]
+ include-package-data = true
+ packages = ["queryforge", "queryforge.server"]
+ package-dir = { "queryforge" = ".", "queryforge.server" = "server" }
server/__init__.py ADDED
@@ -0,0 +1,11 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """QueryForge environment server components."""
+
+ from .queryforge_environment import QueryforgeEnvironment
+
+ __all__ = ["QueryforgeEnvironment"]
server/app.py ADDED
@@ -0,0 +1,119 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ FastAPI application for the QueryForge Environment.
+
+ This module creates an HTTP server that exposes the QueryforgeEnvironment
+ over HTTP and WebSocket endpoints, compatible with EnvClient.
+
+ Endpoints:
+     - POST /reset: Reset the environment
+     - POST /step: Execute an action
+     - GET /state: Get current environment state
+     - GET /schema: Get action/observation schemas
+     - WS /ws: WebSocket endpoint for persistent sessions
+
+ Usage:
+     # Development (with auto-reload):
+     uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+
+     # Production:
+     uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
+
+     # Or run directly:
+     python -m server.app
+ """
+
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:  # pragma: no cover
+     raise ImportError(
+         "openenv-core is required for the web interface. Install dependencies with 'uv sync'."
+     ) from e
+
+ try:
+     from ..models import SQLAction, SQLObservation, TaskSpec
+     from ..tasks import REGISTRY, task_from_dict
+     from .queryforge_environment import QueryforgeEnvironment
+ except ImportError:
+     from models import SQLAction, SQLObservation, TaskSpec
+     from tasks import REGISTRY, task_from_dict
+     from server.queryforge_environment import QueryforgeEnvironment
+
+ from fastapi import HTTPException
+
+
+ # Create the app with web interface and README integration
+ app = create_app(
+     QueryforgeEnvironment,
+     SQLAction,
+     SQLObservation,
+     env_name="queryforge",
+     max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
+ )
+
+
+ # ── Task Registry REST endpoints ──────────────────────────────────────────────
+
+ @app.post("/tasks", tags=["Task Registry"], status_code=201)
+ async def register_task(spec: TaskSpec):
+     """Register a custom SQL task. Replaces silently if the ID already exists."""
+     task = task_from_dict(spec.model_dump())
+     REGISTRY.register(task)
+     return {"ok": True, "task_id": task.id, "total_tasks": len(REGISTRY)}
+
+
+ @app.get("/tasks", tags=["Task Registry"])
+ async def list_tasks():
+     """List all registered tasks (built-in + custom)."""
+     return [
+         {"id": t.id, "level": t.level, "title": t.title}
+         for t in REGISTRY.list_all()
+     ]
+
+
+ @app.delete("/tasks/{task_id}", tags=["Task Registry"])
+ async def delete_task(task_id: str):
+     """Remove a custom task. Returns 403 for built-in tasks, 404 if not found."""
+     try:
+         REGISTRY.unregister(task_id)
+         return {"ok": True, "task_id": task_id}
+     except ValueError as exc:
+         raise HTTPException(status_code=403, detail=str(exc)) from exc
+     except KeyError as exc:
+         raise HTTPException(status_code=404, detail=f"Task '{task_id}' not found.") from exc
+
+
+ def main(host: str = "0.0.0.0", port: int = 8000):
+     """
+     Entry point for direct execution via uv run or python -m.
+
+     This function enables running the server without Docker:
+         uv run --project . server
+         uv run --project . server --port 8001
+         python -m queryforge.server.app
+
+     Args:
+         host: Host address to bind to (default: "0.0.0.0")
+         port: Port number to listen on (default: 8000)
+
+     For production deployments, consider using uvicorn directly with
+     multiple workers:
+         uvicorn queryforge.server.app:app --workers 4
+     """
+     import uvicorn
+
+     uvicorn.run(app, host=host, port=port)
+
+
+ if __name__ == "__main__":
+     import argparse
+
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--port", type=int, default=8000)
+     args = parser.parse_args()
+     main(port=args.port)
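A registration payload for POST /tasks can be sketched as plain JSON. The field names below mirror `task_from_dict` (only `id`, `schema_ddl`, and `expected_rows` are required; the rest default); the concrete values are illustrative, not part of the repo:

```python
import json

# Hypothetical minimal payload for POST /tasks (values are made up for
# illustration; required keys per task_from_dict: id, schema_ddl, expected_rows)
payload = {
    "id": "my_null_task",
    "level": "custom",
    "title": "Handle NULLs in aggregation",
    "schema_ddl": "CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1), (NULL);",
    "broken_query": "SELECT AVG(x) AS avg_x FROM t",
    "expected_rows": [{"avg_x": 0.5}],
    "hint": "Use COALESCE(x, 0).",
}

body = json.dumps(payload)
required = {"id", "schema_ddl", "expected_rows"}
print(required.issubset(payload))  # True
```

With the server running on the uvicorn defaults above, this body would be POSTed to `http://localhost:8000/tasks` with a `Content-Type: application/json` header; the new task is then available via `reset(task_id="my_null_task")`.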
server/queryforge_environment.py ADDED
@@ -0,0 +1,180 @@
+ """
+ QueryForge SQL Environment — server-side implementation.
+
+ The agent interacts with a SQL debugging and optimisation challenge:
+     reset()            → next task in round-robin rotation
+     reset(task_id="x") → pin to a specific task by ID (built-in or custom)
+     step()             → grade the submitted query, return scored observation
+     state              → episode_id + step count
+
+ Reward scale:
+     0.00       syntax error
+     0.15       syntax valid, runtime error
+     0.30       executes, wrong / empty results
+     0.30–0.80  partial row correctness (deterministic, DuckDB)
+     0.80–1.00  correct results + AI quality assessment (Anthropic)
+
+ Episode ends when:
+     - score >= 0.90 (correct + high-quality solution)
+     - best_score has not improved for 2 consecutive steps (early stopping)
+     - max_steps for the task is exhausted
+ """
+
+ import os
+ from typing import Optional
+ from uuid import uuid4
+
+ from openenv.core.env_server.interfaces import Environment
+ from openenv.core.env_server.types import State
+
+ try:
+     from ..models import SQLAction, SQLObservation
+     from ..tasks import REGISTRY, SQLTask
+     from ..judge import grade
+ except ImportError:
+     from models import SQLAction, SQLObservation
+     from tasks import REGISTRY, SQLTask
+     from judge import grade
+
+
+ class QueryforgeEnvironment(Environment):
+     """
+     SQL Query Debugger & Optimiser environment.
+
+     Built-in tasks (cycled in order by default):
+         1. easy   — fix three misspelled SQL keywords
+         2. medium — fix a missing JOIN condition causing a cartesian product
+         3. hard   — rewrite a correlated subquery as a CTE
+
+     Custom tasks can be registered at runtime via POST /tasks and then
+     requested by passing task_id to reset():
+         env.reset(task_id="my_custom_task")
+
+     Each episode ends when:
+         - The agent achieves score ≥ 0.90 (correct + high-quality solution), or
+         - best_score has not improved for 2 consecutive steps (early stopping), or
+         - The maximum steps for the current task are exhausted.
+
+     Supports concurrent WebSocket sessions (each client gets its own instance).
+     """
+
+     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+
+     # Episode ends when score >= this threshold.
+     # Falls back to 0.80 when ANTHROPIC_API_KEY is unset (AI judge offline,
+     # deterministic scoring caps at 0.80).
+     DONE_THRESHOLD: float = 0.90 if os.environ.get("ANTHROPIC_API_KEY") else 0.80
+     # Episode ends when best_score has not improved for this many consecutive steps
+     EARLY_STOP_STEPS: int = 2
+
+     def __init__(self) -> None:
+         self._state = State(episode_id=str(uuid4()), step_count=0)
+         self._current_task: Optional[SQLTask] = None
+         self._best_score: float = 0.0
+         self._attempt: int = 0
+         self._stale_steps: int = 0  # consecutive steps with no best_score improvement
+
+     # ── OpenEnv interface ─────────────────────────────────────────────────────
+
+     def reset(
+         self,
+         task_id: Optional[str] = None,
+         seed: Optional[int] = None,
+         episode_id: Optional[str] = None,
+         **kwargs,
+     ) -> SQLObservation:
+         """
+         Start a new episode.
+
+         Args:
+             task_id: Pin to a specific task by ID. If None, the registry
+                 cycles round-robin through all registered tasks.
+             seed: Ignored (reserved for future use).
+             episode_id: Optional custom episode identifier.
+         """
+         ep_id = episode_id or str(uuid4())
+         self._state = State(episode_id=ep_id, step_count=0)
+         self._best_score = 0.0
+         self._attempt = 0
+         self._stale_steps = 0
+
+         if task_id is not None:
+             try:
+                 self._current_task = REGISTRY.get(task_id)
+             except KeyError as exc:
+                 # Unknown task_id — return an error observation so the caller
+                 # gets clear feedback instead of a silent 500.
+                 return SQLObservation(
+                     feedback=str(exc),
+                     hint=f"Available task IDs: {', '.join(REGISTRY.ids())}",
+                     done=True,
+                     reward=0.0,
+                 )
+         else:
+             self._current_task = REGISTRY.cycle_next()
+
+         return SQLObservation(
+             task_id=self._current_task.id,
+             task_level=self._current_task.level,
+             task_title=self._current_task.title,
+             task_description=self._current_task.description,
+             syntax_valid=False,
+             execution_success=False,
+             execution_error=None,
+             rows_returned=0,
+             feedback="New task loaded. Submit your fixed/optimised SQL query.",
+             hint=self._current_task.hint,
+             attempt=0,
+             best_score=0.0,
+             done=False,
+             reward=0.0,
+         )
+
+     def step(self, action: SQLAction) -> SQLObservation:  # type: ignore[override]
+         """Grade the submitted SQL query and return a scored observation."""
+         self._state.step_count += 1
+         self._attempt += 1
+
+         if self._current_task is None:
+             return SQLObservation(
+                 feedback="No task active. Call reset() first.",
+                 hint="Call reset() to start a new episode.",
+                 done=True,
+                 reward=0.0,
+             )
+
+         score, feedback, details = grade(self._current_task, action.sql)
+
+         # Early stopping: track consecutive steps with no improvement
+         if score > self._best_score:
+             self._stale_steps = 0
+         else:
+             self._stale_steps += 1
+         self._best_score = max(self._best_score, score)
+
+         # Done when the threshold is reached, progress has stalled,
+         # or the step budget is exhausted
+         done = (
+             score >= self.DONE_THRESHOLD
+             or self._stale_steps >= self.EARLY_STOP_STEPS
+             or self._state.step_count >= self._current_task.max_steps
+         )
+
+         return SQLObservation(
+             task_id=self._current_task.id,
+             task_level=self._current_task.level,
+             task_title=self._current_task.title,
+             task_description=self._current_task.description,
+             syntax_valid=bool(details.get("syntax_valid", False)),
+             execution_success=bool(details.get("execution_success", False)),
+             execution_error=details.get("execution_error"),
+             rows_returned=int(details.get("rows_returned", 0)),
+             feedback=feedback,
+             hint="" if score >= 0.9 else self._current_task.hint,
+             attempt=self._attempt,
+             best_score=self._best_score,
+             done=done,
+             reward=score,
+         )
+
+     @property
+     def state(self) -> State:
+         return self._state
server/requirements.txt ADDED
@@ -0,0 +1,8 @@
+ openenv-core[core]>=0.2.1
+ fastapi>=0.115.0
+ uvicorn>=0.24.0
+ duckdb>=0.10.0
+ anthropic>=0.25.0
+
+
+
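The episode-termination rule in `QueryforgeEnvironment.step()` above combines three conditions. A standalone sketch of the same logic (a hypothetical helper, not part of the environment code):

```python
def episode_done(score, stale_steps, step_count,
                 threshold=0.90, early_stop=2, max_steps=5):
    """Mirror of the done condition in QueryforgeEnvironment.step():
    threshold reached, progress stalled, or step budget exhausted.
    stale_steps counts consecutive steps where score did not beat best_score."""
    return (
        score >= threshold
        or stale_steps >= early_stop
        or step_count >= max_steps
    )

print(episode_done(0.95, 0, 1))  # True  (threshold reached)
print(episode_done(0.30, 2, 3))  # True  (two stale steps in a row)
print(episode_done(0.30, 0, 5))  # True  (step budget exhausted)
print(episode_done(0.50, 0, 2))  # False (still improving, budget left)
```

Note that with the AI judge offline the threshold drops to 0.80, so a deterministically perfect answer still terminates the episode.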
tasks.py ADDED
@@ -0,0 +1,415 @@
+ """
+ SQL task definitions and runtime task registry for the QueryForge environment.
+
+ Built-in tasks:
+     easy   — fix three misspelled SQL keywords
+     medium — fix a cartesian JOIN producing wrong results
+     hard   — rewrite a correlated subquery as a CTE
+
+ Custom tasks can be added at runtime via REGISTRY.register() or
+ POST /tasks on the running server.
+ """
+
+ import json
+ from dataclasses import dataclass
+ from pathlib import Path
+ from threading import Lock
+ from typing import Any, Dict, List, Optional
+
+
+ # ── Data classes ──────────────────────────────────────────────────────────────
+
+ @dataclass
+ class TestCase:
+     """A single test case: expected output rows for correctness grading."""
+
+     description: str
+     expected_rows: List[Dict[str, Any]]
+     order_by: Optional[str] = None  # comma-separated columns to sort by
+
+
+ @dataclass
+ class SQLTask:
+     """Full definition of one SQL challenge."""
+
+     id: str
+     level: str            # "easy" | "medium" | "hard" | "custom"
+     title: str
+     description: str
+     schema_ddl: str       # DDL + seed INSERT statements for DuckDB
+     broken_query: str     # broken/slow query the agent must fix
+     error_message: str    # error or performance warning shown to the agent
+     hint: str
+     test_cases: List[TestCase]
+     solution_query: str   # reference solution used by the AI judge
+     max_steps: int = 5
+
+
+ # ── Built-in tasks ────────────────────────────────────────────────────────────
+
+ _TASK_EASY = SQLTask(
+     id="task_easy_syntax",
+     level="easy",
+     title="Fix the Syntax Errors",
+     description="""\
+ TASK: Fix the syntax errors in the query below so it runs correctly.
+
+ SCHEMA:
+   users(id INTEGER, name VARCHAR, age INTEGER, city VARCHAR)
+
+ BROKEN QUERY:
+   SELEC name, age FORM users WEHRE age > 30 AND city = 'New York'
+
+ ERROR:
+   Parser Error: syntax error at or near "SELEC"
+
+ GOAL: Return a valid SQL query that retrieves `name` and `age`
+ of users who are older than 30 AND live in New York.
+ Order by name ASC.""",
+     schema_ddl="""\
+ CREATE TABLE users (
+     id INTEGER,
+     name VARCHAR,
+     age INTEGER,
+     city VARCHAR
+ );
+ INSERT INTO users VALUES
+     (1, 'Alice', 35, 'New York'),
+     (2, 'Bob', 28, 'New York'),
+     (3, 'Carol', 42, 'Chicago'),
+     (4, 'Dave', 31, 'New York'),
+     (5, 'Eve', 25, 'New York'),
+     (6, 'Frank', 38, 'New York');
+ """,
+     broken_query="SELEC name, age FORM users WEHRE age > 30 AND city = 'New York'",
+     error_message='Parser Error: syntax error at or near "SELEC"',
+     hint="Three SQL keywords are misspelled: SELEC → SELECT, FORM → FROM, WEHRE → WHERE.",
+     test_cases=[
+         TestCase(
+             description="Users over 30 living in New York, ordered by name",
+             expected_rows=[
+                 {"name": "Alice", "age": 35},
+                 {"name": "Dave", "age": 31},
+                 {"name": "Frank", "age": 38},
+             ],
+             order_by="name",
+         )
+     ],
+     solution_query=(
+         "SELECT name, age FROM users "
+         "WHERE age > 30 AND city = 'New York' "
+         "ORDER BY name ASC"
+     ),
+ )
+
+ _TASK_MEDIUM = SQLTask(
+     id="task_medium_join",
+     level="medium",
+     title="Fix the Cartesian JOIN",
+     description="""\
+ TASK: The query below produces wildly inflated totals because a JOIN condition
+ is missing, creating a cartesian product with the `products` table. Fix it.
+
+ SCHEMAS:
+   users(id INTEGER, name VARCHAR, age INTEGER)
+   products(id INTEGER, title VARCHAR, price DECIMAL)
+   orders(id INTEGER, user_id INTEGER, product_id INTEGER, amount DECIMAL)
+
+ BROKEN QUERY:
+   SELECT u.name, p.title, SUM(o.amount) AS total_spent
+   FROM orders o, users u, products p
+   WHERE o.user_id = u.id
+   GROUP BY u.name, p.title
+   ORDER BY total_spent DESC
+
+ PROBLEM:
+   Missing join condition `o.product_id = p.id`.
+   Every order row is multiplied by ALL products, inflating every total by 3×.
+
+ GOAL: Rewrite using explicit INNER JOIN … ON syntax with all correct join
+ conditions. Return user name, product title, and true total amount spent per
+ (user, product) pair, ordered by total_spent DESC.""",
+     schema_ddl="""\
+ CREATE TABLE users (id INTEGER, name VARCHAR, age INTEGER);
+ CREATE TABLE products (id INTEGER, title VARCHAR, price DECIMAL);
+ CREATE TABLE orders (id INTEGER, user_id INTEGER, product_id INTEGER, amount DECIMAL);
+
+ INSERT INTO users VALUES (1,'Alice',30),(2,'Bob',25),(3,'Carol',35);
+ INSERT INTO products VALUES (1,'Laptop',999.99),(2,'Phone',599.99),(3,'Tablet',399.99);
+ INSERT INTO orders VALUES
+     (1,1,1,999.99),(2,1,2,599.99),
+     (3,2,1,999.99),(4,2,3,399.99),
+     (5,3,2,599.99),(6,3,1,999.99);
+ """,
+     broken_query="""\
+ SELECT u.name, p.title, SUM(o.amount) AS total_spent
+ FROM orders o, users u, products p
+ WHERE o.user_id = u.id
+ GROUP BY u.name, p.title
+ ORDER BY total_spent DESC""",
+     error_message=(
+         "Query runs but produces WRONG results: totals are 3× too high "
+         "because every order is joined to every product (cartesian product)."
+     ),
+     hint=(
+         "Use INNER JOIN … ON for every table. "
+         "You need both: o.user_id = u.id AND o.product_id = p.id."
+     ),
+     test_cases=[
+         TestCase(
+             description="Correct per-(user, product) totals",
+             expected_rows=[
+                 {"name": "Alice", "title": "Laptop", "total_spent": 999.99},
+                 {"name": "Alice", "title": "Phone", "total_spent": 599.99},
+                 {"name": "Bob", "title": "Laptop", "total_spent": 999.99},
+                 {"name": "Bob", "title": "Tablet", "total_spent": 399.99},
+                 {"name": "Carol", "title": "Laptop", "total_spent": 999.99},
+                 {"name": "Carol", "title": "Phone", "total_spent": 599.99},
+             ],
+             order_by="name,title",
+         )
+     ],
+     solution_query="""\
+ SELECT u.name, p.title, SUM(o.amount) AS total_spent
+ FROM orders o
+ INNER JOIN users u ON o.user_id = u.id
+ INNER JOIN products p ON o.product_id = p.id
+ GROUP BY u.name, p.title
+ ORDER BY total_spent DESC""",
+ )
+
+ _TASK_HARD = SQLTask(
+     id="task_hard_cte",
+     level="hard",
+     title="Rewrite Correlated Subquery as CTE",
+     description="""\
+ TASK: The query below is semantically correct but executes the inner AVG(salary)
+ once per employee row — O(N) full scans. Rewrite it using a WITH (CTE) so the
+ department averages are computed exactly once.
+
+ SCHEMAS:
+   departments(id INTEGER, dept_name VARCHAR)
+   employees(id INTEGER, name VARCHAR, department_id INTEGER, salary DECIMAL)
+
+ SLOW QUERY:
+   SELECT e.name, e.department_id, e.salary
+   FROM employees e
+   WHERE e.salary > (
+       SELECT AVG(e2.salary)
+       FROM employees e2
+       WHERE e2.department_id = e.department_id
+   )
+   ORDER BY e.department_id, e.salary DESC
+
+ PERFORMANCE WARNING:
+   For 1 M employees the inner subquery executes 1 M times.
+   DuckDB's EXPLAIN shows: 'FILTER ... (subquery)' with nested loop.
+
+ GOAL: Rewrite using a CTE that computes per-department average salary once,
+ then join it to employees and filter. The result must be identical:
+ employees who earn strictly above their own department's average salary,
+ ordered by department_id ASC, salary DESC.""",
+     schema_ddl="""\
+ CREATE TABLE departments (id INTEGER, dept_name VARCHAR);
+ CREATE TABLE employees (id INTEGER, name VARCHAR, department_id INTEGER, salary DECIMAL);
+
+ INSERT INTO departments VALUES (1,'Engineering'),(2,'Marketing'),(3,'Sales');
+ INSERT INTO employees VALUES
+     (1,'Alice', 1, 95000),(2,'Bob', 1, 75000),(3,'Carol', 1, 85000),
+     (4,'Dave', 2, 65000),(5,'Eve', 2, 70000),(6,'Frank', 2, 60000),
+     (7,'Grace', 3, 55000),(8,'Hank', 3, 72000),(9,'Iris', 3, 58000);
+ """,
+     broken_query="""\
+ SELECT e.name, e.department_id, e.salary
+ FROM employees e
+ WHERE e.salary > (
+     SELECT AVG(e2.salary)
+     FROM employees e2
+     WHERE e2.department_id = e.department_id
+ )
+ ORDER BY e.department_id, e.salary DESC""",
+     error_message=(
+         "PERFORMANCE: Correlated subquery re-executes AVG() for every row. "
+         "On large tables this is O(N²). Rewrite as a CTE for O(N) execution."
+     ),
+     hint=(
+         "WITH dept_avg AS (SELECT department_id, AVG(salary) AS avg_salary "
+         "FROM employees GROUP BY department_id) — then JOIN employees to dept_avg "
+         "and filter WHERE e.salary > d.avg_salary."
+     ),
+     test_cases=[
+         TestCase(
+             description="Employees strictly above their department's average salary",
+             expected_rows=[
+                 {"name": "Alice", "department_id": 1, "salary": 95000.0},
+                 {"name": "Eve", "department_id": 2, "salary": 70000.0},
+                 {"name": "Hank", "department_id": 3, "salary": 72000.0},
+             ],
+             order_by="department_id,name",
+         )
+     ],
+     solution_query="""\
+ WITH dept_avg AS (
+     SELECT department_id, AVG(salary) AS avg_salary
+     FROM employees
+     GROUP BY department_id
+ )
+ SELECT e.name, e.department_id, e.salary
+ FROM employees e
+ JOIN dept_avg d ON e.department_id = d.department_id
+ WHERE e.salary > d.avg_salary
+ ORDER BY e.department_id, e.salary DESC""",
+     max_steps=6,
+ )
+
+
+ # ── Task Registry ─────────────────────────────────────────────────────────────
+
+ class TaskRegistry:
+     """
+     Thread-safe registry of SQL tasks, shared across all environment sessions.
+
+     Built-in tasks (easy / medium / hard) are always present and cannot be removed.
+     Custom tasks can be added via register(), load_from_json(), or POST /tasks.
+     """
+
+     _BUILTIN_IDS: frozenset = frozenset(
+         ["task_easy_syntax", "task_medium_join", "task_hard_cte"]
+     )
+
+     def __init__(self, initial_tasks: List[SQLTask]) -> None:
+         self._lock = Lock()
+         # Insertion-ordered dict preserves cycling order
+         self._tasks: Dict[str, SQLTask] = {t.id: t for t in initial_tasks}
+         self._cycle_index: int = 0
+
+     # ── CRUD ─────────────────────────────────────────────────────────────────
+
+     def register(self, task: SQLTask) -> None:
+         """Add or replace a task. Replaces silently if the ID already exists."""
+         with self._lock:
+             self._tasks[task.id] = task
+
+     def unregister(self, task_id: str) -> None:
+         """
+         Remove a custom task.
+         Raises ValueError for built-in tasks, KeyError if not found.
+         """
+         if task_id in self._BUILTIN_IDS:
+             raise ValueError(f"Built-in task '{task_id}' cannot be removed.")
+         with self._lock:
+             if task_id not in self._tasks:
+                 raise KeyError(task_id)
+             del self._tasks[task_id]
+
+     def get(self, task_id: str) -> SQLTask:
+         """Return a task by ID. Raises KeyError with available IDs if not found."""
+         with self._lock:
+             if task_id not in self._tasks:
+                 available = ", ".join(self._tasks.keys())
+                 raise KeyError(
+                     f"Task '{task_id}' not found. "
+                     f"Available: {available}"
+                 )
+             return self._tasks[task_id]
+
+     def list_all(self) -> List[SQLTask]:
+         """Return all registered tasks in insertion order."""
+         with self._lock:
+             return list(self._tasks.values())
+
+     def ids(self) -> List[str]:
+         """Return all task IDs in insertion order."""
+         with self._lock:
+             return list(self._tasks.keys())
+
+     # ── Cycling ───────────────────────────────────────────────────────────────
+
+     def cycle_next(self) -> SQLTask:
+         """Return the next task in round-robin order (wraps at end)."""
+         with self._lock:
+             tasks = list(self._tasks.values())
+             task = tasks[self._cycle_index % len(tasks)]
+             self._cycle_index += 1
+             return task
+
+     # ── Bulk loading ──────────────────────────────────────────────────────────
+
+     def load_from_json(self, path: str) -> int:
+         """
+         Load tasks from a JSON file (list of task spec objects).
+         Returns the number of tasks loaded.
+
+         Minimal required fields per task:
+             id, schema_ddl, expected_rows
+
+         Example file::
+
+             [
+                 {
+                     "id": "my_null_task",
+                     "level": "medium",
+                     "title": "Handle NULLs in aggregation",
+                     "schema_ddl": "CREATE TABLE ...; INSERT ...",
+                     "broken_query": "SELECT AVG(score) FROM ...",
+                     "expected_rows": [{"avg_score": 72.5}],
+                     "hint": "Use COALESCE to handle NULL scores."
+                 }
+             ]
+         """
+         raw = json.loads(Path(path).read_text())
+         if isinstance(raw, dict):
+             raw = [raw]
+         for item in raw:
+             self.register(task_from_dict(item))
+         return len(raw)
+
+     # ── Helpers ───────────────────────────────────────────────────────────────
+
+     def __len__(self) -> int:
+         with self._lock:
+             return len(self._tasks)
+
+     def __contains__(self, task_id: str) -> bool:
+         with self._lock:
+             return task_id in self._tasks
+
+
+ # ── Conversion helper ─────────────────────────────────────────────────────────
+
+ def task_from_dict(d: Dict[str, Any]) -> SQLTask:
+     """
+     Construct an SQLTask from a plain dict (JSON payload or loaded file).
+
+     Required keys : id, schema_ddl, expected_rows
+     Optional keys : level, title, description, broken_query, error_message,
+                     hint, order_by, solution_query, test_description, max_steps
+     """
+     return SQLTask(
+         id=d["id"],
+         level=d.get("level", "custom"),
+         title=d.get("title", d["id"]),
+         description=d.get("description", ""),
+         schema_ddl=d["schema_ddl"],
+         broken_query=d.get("broken_query", ""),
+         error_message=d.get("error_message", ""),
+         hint=d.get("hint", ""),
+         test_cases=[
+             TestCase(
+                 description=d.get("test_description", "Custom test case"),
+                 expected_rows=d["expected_rows"],
+                 order_by=d.get("order_by"),
+             )
+         ],
+         solution_query=d.get("solution_query", ""),
+         max_steps=d.get("max_steps", 5),
+     )
+
+
+ # ── Global singleton ──────────────────────────────────────────────────────────
+
+ REGISTRY = TaskRegistry([_TASK_EASY, _TASK_MEDIUM, _TASK_HARD])
+
+ # Backwards-compat: snapshot of the three built-in tasks at import time
+ TASKS: List[SQLTask] = [_TASK_EASY, _TASK_MEDIUM, _TASK_HARD]
+ TASK_BY_ID: Dict[str, SQLTask] = {t.id: t for t in TASKS}
uv.lock ADDED
The diff for this file is too large to render. See raw diff