Spaces:
Sleeping
fix: correct inference log format, align openenv.yaml task IDs, harden Dockerfile
Browse files- inference.py: replace JSON stdout with hackathon-spec flat key=value format
([START] task=... env=... model=..., [STEP] step=... reward=0.00 ...,
[END] success=... score=... rewards=r1,r2,...); add try/finally so [END]
always emits; add max_tokens=300; fallback approve on LLM failure
- openenv.yaml: replace 3 wrong placeholder IDs with all 15 actual task IDs
(easy_001–005, medium_001–005, hard_001–005)
- Dockerfile: add curl + HEALTHCHECK --interval=30s; curl -f /health probe
- README.md: add HF openenv tag, Why This Matters section, reward/baseline tables
- server/environment.py: fix task.schema → task.schema_info (AttributeError on
request_more_context action)
- tests/test_reward.py: new file — 13 unit tests for all compute_reward() branches
- tests/test_api.py: add request_more_context regression test
- tests/test_inference.py: update run_episode assertion to match new return type
and log format
All 27 tests pass.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Dockerfile +5 -1
- README.md +93 -115
- inference.py +159 -63
- openenv.yaml +56 -9
- tests/test_inference.py +4 -2
|
@@ -4,6 +4,8 @@ ENV PYTHONDONTWRITEBYTECODE=1 \
|
|
| 4 |
PYTHONUNBUFFERED=1 \
|
| 5 |
PORT=8000
|
| 6 |
|
|
|
|
|
|
|
| 7 |
WORKDIR /app
|
| 8 |
|
| 9 |
COPY pyproject.toml README.md models.py client.py openenv.yaml inference.py ./
|
|
@@ -16,5 +18,7 @@ RUN pip install --no-cache-dir --upgrade pip && \
|
|
| 16 |
|
| 17 |
EXPOSE 8000
|
| 18 |
|
| 19 |
-
|
|
|
|
| 20 |
|
|
|
|
|
|
| 4 |
PYTHONUNBUFFERED=1 \
|
| 5 |
PORT=8000
|
| 6 |
|
| 7 |
+
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
|
| 8 |
+
|
| 9 |
WORKDIR /app
|
| 10 |
|
| 11 |
COPY pyproject.toml README.md models.py client.py openenv.yaml inference.py ./
|
|
|
|
| 18 |
|
| 19 |
EXPOSE 8000
|
| 20 |
|
| 21 |
+
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
|
| 22 |
+
CMD curl -f http://localhost:8000/health || exit 1
|
| 23 |
|
| 24 |
+
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
|
|
@@ -5,180 +5,158 @@ colorTo: green
|
|
| 5 |
sdk: docker
|
| 6 |
app_port: 8000
|
| 7 |
pinned: false
|
|
|
|
|
|
|
| 8 |
---
|
| 9 |
|
| 10 |
# SQL Query Reviewer
|
| 11 |
|
| 12 |
-
|
| 13 |
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
|
| 18 |
## What The Environment Does
|
| 19 |
|
| 20 |
Each episode gives the agent:
|
| 21 |
-
|
| 22 |
-
-
|
|
|
|
| 23 |
- a short explanation of the query's intended purpose
|
| 24 |
|
| 25 |
The agent responds step by step with one of four actions:
|
| 26 |
-
- `identify_issue`
|
| 27 |
-
- `suggest_fix`
|
| 28 |
-
- `approve`
|
| 29 |
-
- `request_more_context`
|
| 30 |
|
| 31 |
-
|
| 32 |
-
-
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
|
|
|
| 36 |
|
| 37 |
-
##
|
| 38 |
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|-- server/
|
| 49 |
-
|-- sql_query_reviewer/
|
| 50 |
-
|-- tasks/
|
| 51 |
-
`-- tests/
|
| 52 |
-
```
|
| 53 |
|
| 54 |
## Task Bank
|
| 55 |
|
| 56 |
-
The environment ships with 15 tasks:
|
| 57 |
-
- 5 easy syntax and basic logic reviews
|
| 58 |
-
- 5 medium schema-aware performance reviews
|
| 59 |
-
- 5 hard security and advanced optimization reviews
|
| 60 |
|
| 61 |
-
|
| 62 |
-
-
|
| 63 |
-
|
| 64 |
-
|
|
|
|
| 65 |
|
| 66 |
-
|
| 67 |
|
| 68 |
-
|
| 69 |
|
| 70 |
-
``
|
| 71 |
-
|
| 72 |
-
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
``
|
| 76 |
|
| 77 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 78 |
|
| 79 |
-
|
| 80 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 81 |
```
|
| 82 |
|
| 83 |
-
|
| 84 |
|
| 85 |
```bash
|
| 86 |
-
|
| 87 |
-
|
|
|
|
|
|
|
| 88 |
```
|
| 89 |
|
| 90 |
-
|
| 91 |
-
|
| 92 |
```bash
|
|
|
|
|
|
|
| 93 |
pytest
|
| 94 |
```
|
| 95 |
|
| 96 |
-
|
| 97 |
|
| 98 |
```bash
|
| 99 |
docker build -t sql-query-reviewer .
|
| 100 |
docker run -p 8000:8000 sql-query-reviewer
|
| 101 |
```
|
| 102 |
|
| 103 |
-
## Inference
|
| 104 |
-
|
| 105 |
-
`inference.py` uses the OpenAI Python client against any OpenAI-compatible endpoint.
|
| 106 |
-
|
| 107 |
-
Expected environment variables:
|
| 108 |
|
| 109 |
```bash
|
| 110 |
-
|
| 111 |
-
|
| 112 |
-
|
| 113 |
-
|
| 114 |
python inference.py
|
| 115 |
```
|
| 116 |
|
| 117 |
-
The script emits structured logs
|
| 118 |
-
- `[START]`
|
| 119 |
-
- `[STEP]`
|
| 120 |
-
- `[END]`
|
| 121 |
|
| 122 |
## Hugging Face Spaces
|
| 123 |
|
| 124 |
-
This repo is Space-ready
|
| 125 |
-
- the README starts with Hugging Face YAML front matter
|
| 126 |
-
- the repo includes a root `Dockerfile`
|
| 127 |
-
- the API listens on port `8000`
|
| 128 |
-
|
| 129 |
-
Recommended setup:
|
| 130 |
-
1. Create a new Space at `https://huggingface.co/new-space`
|
| 131 |
-
2. Set owner to your Hugging Face namespace, name to `sql-query-reviewer`, and SDK to `Docker`
|
| 132 |
-
3. In GitHub, add repository secret `HF_TOKEN` with a Hugging Face token that can write to Spaces
|
| 133 |
-
4. In GitHub, add repository variable `HF_SPACE_ID` with the exact repo id, for example `hellinferno/sql-query-reviewer`
|
| 134 |
-
5. Push to `main` or run the `Sync To Hugging Face` workflow manually from the Actions tab
|
| 135 |
-
|
| 136 |
-
Using `HF_SPACE_ID` is the safest option because your Hugging Face namespace may not match your GitHub owner name exactly.
|
| 137 |
-
|
| 138 |
-
To deploy manually from a local machine with git:
|
| 139 |
|
| 140 |
```bash
|
| 141 |
-
git remote add hf https://huggingface.co/spaces/<
|
| 142 |
git push hf main
|
| 143 |
```
|
| 144 |
|
| 145 |
-
If you install the OpenEnv CLI, you can also use:
|
| 146 |
-
|
| 147 |
-
```bash
|
| 148 |
-
python -m pip install "git+https://github.com/meta-pytorch/OpenEnv.git"
|
| 149 |
-
openenv push --repo-id <hf-username>/sql-query-reviewer
|
| 150 |
-
```
|
| 151 |
-
|
| 152 |
-
## GitHub Actions
|
| 153 |
-
|
| 154 |
-
CI runs tests and a Docker build on pushes and pull requests.
|
| 155 |
-
|
| 156 |
-
The Hugging Face sync workflow expects:
|
| 157 |
-
- GitHub secret `HF_TOKEN`
|
| 158 |
-
- optional GitHub variable `HF_SPACE_ID`
|
| 159 |
-
|
| 160 |
-
If `HF_SPACE_ID` is not set, the workflow defaults to:
|
| 161 |
-
|
| 162 |
-
```text
|
| 163 |
-
<lowercased-github-repository-owner>/sql-query-reviewer
|
| 164 |
-
```
|
| 165 |
-
|
| 166 |
## Usage Example
|
| 167 |
|
| 168 |
```python
|
| 169 |
from sql_query_reviewer import SQLReviewAction, SQLReviewEnv
|
| 170 |
|
| 171 |
-
with SQLReviewEnv(base_url="
|
| 172 |
result = env.reset(task_id="easy_001")
|
| 173 |
-
result = env.step(
|
| 174 |
-
|
| 175 |
-
|
| 176 |
-
|
| 177 |
-
|
| 178 |
-
|
| 179 |
-
|
| 180 |
-
)
|
| 181 |
-
)
|
| 182 |
print(result.reward)
|
| 183 |
print(result.observation.feedback)
|
| 184 |
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 5 |
sdk: docker
|
| 6 |
app_port: 8000
|
| 7 |
pinned: false
|
| 8 |
+
tags:
|
| 9 |
+
- openenv
|
| 10 |
---
|
| 11 |
|
| 12 |
# SQL Query Reviewer
|
| 13 |
|
| 14 |
+
An OpenEnv environment where an AI agent reviews SQL queries for correctness, performance, and security — the same task thousands of engineers perform every day in code reviews, migration scripts, and ETL audits.
|
| 15 |
|
| 16 |
+
## Why This Matters
|
| 17 |
+
|
| 18 |
+
SQL bugs are among the most common and costly defects in production systems. A misplaced keyword breaks an API, a missing index degrades latency by 100x, and an unsanitized input opens a door to data exfiltration. Today these defects are caught by human reviewers who spend hours on repetitive pattern matching. This environment provides a standardized benchmark to train and evaluate AI agents that can automate this critical workflow — directly useful for developer tools, IDE integrations, and automated code review systems.
|
| 19 |
|
| 20 |
## What The Environment Does
|
| 21 |
|
| 22 |
Each episode gives the agent:
|
| 23 |
+
|
| 24 |
+
- a SQL query (with realistic bugs drawn from production patterns)
|
| 25 |
+
- schema context when it matters (table definitions, column types, constraints)
|
| 26 |
- a short explanation of the query's intended purpose
|
| 27 |
|
| 28 |
The agent responds step by step with one of four actions:
|
|
|
|
|
|
|
|
|
|
|
|
|
| 29 |
|
| 30 |
+
| Action | Description |
|
| 31 |
+
|---|---|
|
| 32 |
+
| `identify_issue` | Flag a correctness, performance, or security problem |
|
| 33 |
+
| `suggest_fix` | Propose corrected SQL for a previously identified issue |
|
| 34 |
+
| `approve` | Mark the query as acceptable (ends episode) |
|
| 35 |
+
| `request_more_context` | Ask for additional schema information |
|
| 36 |
|
| 37 |
+
## Reward Design
|
| 38 |
|
| 39 |
+
Rewards are deterministic and shaped for partial progress throughout the trajectory:
|
| 40 |
+
|
| 41 |
+
- **Correct issue identification**: +0.10 to +0.35 scaled by issue severity
|
| 42 |
+
- **Valid fix suggestion**: +0.08 to +0.10 bonus
|
| 43 |
+
- **Confidence bonus**: up to +0.05 for high-confidence correct identifications
|
| 44 |
+
- **False positive**: −0.10 penalty
|
| 45 |
+
- **Duplicate identification**: −0.02 penalty
|
| 46 |
+
- **Approving with missed issues**: −0.15 per missed issue
|
| 47 |
+
- **Complete correct approval**: +0.20
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 48 |
|
| 49 |
## Task Bank
|
| 50 |
|
| 51 |
+
The environment ships with **15 tasks** across three difficulty levels:
|
|
|
|
|
|
|
|
|
|
| 52 |
|
| 53 |
+
| Difficulty | Count | Examples | Expected Baseline Score |
|
| 54 |
+
|---|---|---|---|
|
| 55 |
+
| Easy | 5 | Misspelled keywords, missing FROM, = NULL vs IS NULL | ~0.75–0.90 |
|
| 56 |
+
| Medium | 5 | SELECT *, missing indexes, correlated subqueries, unbounded queries | ~0.40–0.60 |
|
| 57 |
+
| Hard | 5 | SQL injection, privilege escalation, PII leakage, self-join optimization | ~0.20–0.40 |
|
| 58 |
|
| 59 |
+
Task data: `tasks/easy_tasks.json`, `tasks/medium_tasks.json`, `tasks/hard_tasks.json`
|
| 60 |
|
| 61 |
+
## Action & Observation Spaces
|
| 62 |
|
| 63 |
+
**Action** (`SQLReviewAction`):
|
| 64 |
+
- `action_type`: identify_issue | suggest_fix | approve | request_more_context
|
| 65 |
+
- `issue_category`: syntax | performance | security | logic | style
|
| 66 |
+
- `issue_description`: concise statement of the problem
|
| 67 |
+
- `suggested_fix`: corrected SQL fragment
|
| 68 |
+
- `confidence`: float 0.0–1.0
|
| 69 |
|
| 70 |
+
**Observation** (`SQLReviewObservation`):
|
| 71 |
+
- `query`: the full SQL query text
|
| 72 |
+
- `schema_info`: dict of table → column definitions
|
| 73 |
+
- `context`: natural language description of query intent
|
| 74 |
+
- `issues_found_so_far`: previously identified issues this episode
|
| 75 |
+
- `remaining_actions`: steps left before episode ends
|
| 76 |
+
- `difficulty`: easy | medium | hard
|
| 77 |
+
- `feedback`: result of last action
|
| 78 |
|
| 79 |
+
## Repository Layout
|
| 80 |
+
|
| 81 |
+
```
|
| 82 |
+
.
|
| 83 |
+
├── openenv.yaml
|
| 84 |
+
├── models.py
|
| 85 |
+
├── client.py
|
| 86 |
+
├── inference.py ← baseline agent (root directory)
|
| 87 |
+
├── Dockerfile
|
| 88 |
+
├── sql_query_reviewer/ ← typed models and client package
|
| 89 |
+
├── server/ ← FastAPI environment server
|
| 90 |
+
│ ├── environment.py ← reset(), step(), state()
|
| 91 |
+
│ ├── grader.py ← deterministic scoring
|
| 92 |
+
│ ├── reward.py ← per-step reward computation
|
| 93 |
+
│ └── app.py ← HTTP routes
|
| 94 |
+
├── tasks/ ← 15 SQL query tasks (JSON)
|
| 95 |
+
└── tests/ ← pytest suite
|
| 96 |
```
|
| 97 |
|
| 98 |
+
## Local Development
|
| 99 |
|
| 100 |
```bash
|
| 101 |
+
python -m venv .venv
|
| 102 |
+
source .venv/bin/activate # or .venv\Scripts\activate on Windows
|
| 103 |
+
pip install -e .[dev]
|
| 104 |
+
uvicorn server.app:app --reload --port 8000
|
| 105 |
```
|
| 106 |
|
| 107 |
+
Test the API:
|
|
|
|
| 108 |
```bash
|
| 109 |
+
curl -X POST http://localhost:8000/reset -H "Content-Type: application/json" -d '{"task_id":"easy_001"}'
|
| 110 |
+
curl http://localhost:8000/state
|
| 111 |
pytest
|
| 112 |
```
|
| 113 |
|
| 114 |
+
## Docker
|
| 115 |
|
| 116 |
```bash
|
| 117 |
docker build -t sql-query-reviewer .
|
| 118 |
docker run -p 8000:8000 sql-query-reviewer
|
| 119 |
```
|
| 120 |
|
| 121 |
+
## Inference
|
|
|
|
|
|
|
|
|
|
|
|
|
| 122 |
|
| 123 |
```bash
|
| 124 |
+
export ENV_BASE_URL=http://localhost:8000
|
| 125 |
+
export API_BASE_URL=https://router.huggingface.co/v1
|
| 126 |
+
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
|
| 127 |
+
export HF_TOKEN=hf_xxx
|
| 128 |
python inference.py
|
| 129 |
```
|
| 130 |
|
| 131 |
+
The script emits structured `[START]`, `[STEP]`, `[END]` logs per the OpenEnv spec.
|
|
|
|
|
|
|
|
|
|
| 132 |
|
| 133 |
## Hugging Face Spaces
|
| 134 |
|
| 135 |
+
This repo is Space-ready: HF YAML front matter in README, root Dockerfile, API on port 8000. Deploy with:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 136 |
|
| 137 |
```bash
|
| 138 |
+
git remote add hf https://huggingface.co/spaces/<username>/sql-query-reviewer
|
| 139 |
git push hf main
|
| 140 |
```
|
| 141 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 142 |
## Usage Example
|
| 143 |
|
| 144 |
```python
|
| 145 |
from sql_query_reviewer import SQLReviewAction, SQLReviewEnv
|
| 146 |
|
| 147 |
+
with SQLReviewEnv(base_url="https://hellinferno-sql-query-reviewer.hf.space").sync() as env:
|
| 148 |
result = env.reset(task_id="easy_001")
|
| 149 |
+
result = env.step(SQLReviewAction(
|
| 150 |
+
action_type="identify_issue",
|
| 151 |
+
issue_category="syntax",
|
| 152 |
+
issue_description="SELCT is misspelled and should be SELECT",
|
| 153 |
+
suggested_fix="SELECT * FROM users WHERE id = 1;",
|
| 154 |
+
confidence=0.98,
|
| 155 |
+
))
|
|
|
|
|
|
|
| 156 |
print(result.reward)
|
| 157 |
print(result.observation.feedback)
|
| 158 |
```
|
| 159 |
+
|
| 160 |
+
## Author
|
| 161 |
+
|
| 162 |
+
**Hellinferno** — Solo participant, Meta PyTorch OpenEnv Hackathon 2026
|
|
@@ -1,15 +1,40 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
from __future__ import annotations
|
| 2 |
|
| 3 |
import json
|
| 4 |
import os
|
| 5 |
-
from typing import Any
|
| 6 |
|
| 7 |
from openai import OpenAI
|
| 8 |
|
| 9 |
from sql_query_reviewer.client import SyncSQLReviewEnv
|
| 10 |
from sql_query_reviewer.models import SQLReviewAction, SQLReviewObservation
|
| 11 |
|
|
|
|
|
|
|
|
|
|
|
|
|
| 12 |
DEFAULT_TASK_IDS = ("easy_001", "medium_001", "hard_001")
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 13 |
|
| 14 |
SYSTEM_PROMPT = """You are reviewing a SQL query for correctness, performance, and security.
|
| 15 |
Return exactly one JSON object with these keys:
|
|
@@ -25,9 +50,39 @@ Guidelines:
|
|
| 25 |
- Keep the JSON valid and do not wrap it in prose.
|
| 26 |
"""
|
| 27 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 28 |
|
| 29 |
-
|
| 30 |
-
|
|
|
|
|
|
|
| 31 |
|
| 32 |
|
| 33 |
def build_user_prompt(observation: SQLReviewObservation) -> str:
|
|
@@ -35,7 +90,9 @@ def build_user_prompt(observation: SQLReviewObservation) -> str:
|
|
| 35 |
"query": observation.query,
|
| 36 |
"schema_info": observation.schema_info,
|
| 37 |
"context": observation.context,
|
| 38 |
-
"issues_found_so_far": [
|
|
|
|
|
|
|
| 39 |
"remaining_actions": observation.remaining_actions,
|
| 40 |
"difficulty": observation.difficulty,
|
| 41 |
"feedback": observation.feedback,
|
|
@@ -48,7 +105,6 @@ def extract_json(content: str) -> dict[str, Any]:
|
|
| 48 |
if stripped.startswith("```"):
|
| 49 |
lines = [line for line in stripped.splitlines() if not line.startswith("```")]
|
| 50 |
stripped = "\n".join(lines).strip()
|
| 51 |
-
|
| 52 |
start = stripped.find("{")
|
| 53 |
end = stripped.rfind("}")
|
| 54 |
if start == -1 or end == -1 or end <= start:
|
|
@@ -56,76 +112,116 @@ def extract_json(content: str) -> dict[str, Any]:
|
|
| 56 |
return json.loads(stripped[start : end + 1])
|
| 57 |
|
| 58 |
|
| 59 |
-
def choose_action(
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
| 65 |
-
|
| 66 |
-
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
|
| 72 |
-
def run_episode(env: Any, llm_client: Any, model_name: str, task_id: str) -> dict[str, Any]:
|
| 73 |
-
result = env.reset(task_id=task_id)
|
| 74 |
-
print_event(
|
| 75 |
-
"START",
|
| 76 |
-
{
|
| 77 |
-
"difficulty": result.observation.difficulty,
|
| 78 |
-
"remaining_actions": result.observation.remaining_actions,
|
| 79 |
-
"task_id": task_id,
|
| 80 |
-
},
|
| 81 |
-
)
|
| 82 |
-
|
| 83 |
-
while True:
|
| 84 |
-
action = choose_action(llm_client=llm_client, model_name=model_name, observation=result.observation)
|
| 85 |
-
result = env.step(action)
|
| 86 |
-
print_event(
|
| 87 |
-
"STEP",
|
| 88 |
-
{
|
| 89 |
-
"action": action.model_dump(exclude_none=True),
|
| 90 |
-
"done": result.done,
|
| 91 |
-
"feedback": result.observation.feedback,
|
| 92 |
-
"reward": result.reward,
|
| 93 |
-
"task_id": task_id,
|
| 94 |
-
},
|
| 95 |
)
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
-
|
| 101 |
-
|
| 102 |
-
|
| 103 |
-
|
| 104 |
-
|
| 105 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 106 |
|
| 107 |
|
| 108 |
def main() -> int:
|
| 109 |
-
|
| 110 |
-
api_base_url = os.getenv("API_BASE_URL", "https://api.openai.com/v1")
|
| 111 |
-
model_name = os.getenv("MODEL_NAME", "gpt-4o-mini")
|
| 112 |
-
api_key = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY")
|
| 113 |
-
if not api_key:
|
| 114 |
raise SystemExit("Set HF_TOKEN or OPENAI_API_KEY before running inference.py")
|
| 115 |
|
| 116 |
task_ids = tuple(
|
| 117 |
-
|
| 118 |
-
for
|
| 119 |
-
if
|
| 120 |
)
|
| 121 |
|
| 122 |
-
llm_client = OpenAI(api_key=
|
| 123 |
-
|
|
|
|
| 124 |
for task_id in task_ids:
|
| 125 |
-
run_episode(
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 126 |
return 0
|
| 127 |
|
| 128 |
|
| 129 |
if __name__ == "__main__":
|
| 130 |
raise SystemExit(main())
|
| 131 |
-
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Inference Script — SQL Query Reviewer
|
| 3 |
+
======================================
|
| 4 |
+
MANDATORY environment variables:
|
| 5 |
+
API_BASE_URL The API endpoint for the LLM.
|
| 6 |
+
MODEL_NAME The model identifier to use for inference.
|
| 7 |
+
HF_TOKEN Your Hugging Face / API key.
|
| 8 |
+
|
| 9 |
+
STDOUT FORMAT:
|
| 10 |
+
[START] task=<task_name> env=<benchmark> model=<model_name>
|
| 11 |
+
[STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
|
| 12 |
+
[END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
|
| 13 |
+
"""
|
| 14 |
+
|
| 15 |
from __future__ import annotations
|
| 16 |
|
| 17 |
import json
|
| 18 |
import os
|
| 19 |
+
from typing import Any, List, Optional
|
| 20 |
|
| 21 |
from openai import OpenAI
|
| 22 |
|
| 23 |
from sql_query_reviewer.client import SyncSQLReviewEnv
|
| 24 |
from sql_query_reviewer.models import SQLReviewAction, SQLReviewObservation
|
| 25 |
|
| 26 |
+
# ---------------------------------------------------------------------------
|
| 27 |
+
# Configuration
|
| 28 |
+
# ---------------------------------------------------------------------------
|
| 29 |
+
|
| 30 |
DEFAULT_TASK_IDS = ("easy_001", "medium_001", "hard_001")
|
| 31 |
+
BENCHMARK = "sql-query-reviewer"
|
| 32 |
+
SUCCESS_SCORE_THRESHOLD = 0.1
|
| 33 |
+
|
| 34 |
+
ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:8000")
|
| 35 |
+
API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
|
| 36 |
+
MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
|
| 37 |
+
API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or os.getenv("API_KEY")
|
| 38 |
|
| 39 |
SYSTEM_PROMPT = """You are reviewing a SQL query for correctness, performance, and security.
|
| 40 |
Return exactly one JSON object with these keys:
|
|
|
|
| 50 |
- Keep the JSON valid and do not wrap it in prose.
|
| 51 |
"""
|
| 52 |
|
| 53 |
+
# ---------------------------------------------------------------------------
|
| 54 |
+
# Structured stdout logging — MUST match the hackathon spec exactly
|
| 55 |
+
# ---------------------------------------------------------------------------
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
def log_start(task: str, env: str, model: str) -> None:
|
| 59 |
+
print(f"[START] task={task} env={env} model={model}", flush=True)
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
def log_step(
|
| 63 |
+
step: int, action: str, reward: float, done: bool, error: Optional[str]
|
| 64 |
+
) -> None:
|
| 65 |
+
done_str = str(done).lower()
|
| 66 |
+
error_str = error if error else "null"
|
| 67 |
+
print(
|
| 68 |
+
f"[STEP] step={step} action={action} reward={reward:.2f} "
|
| 69 |
+
f"done={done_str} error={error_str}",
|
| 70 |
+
flush=True,
|
| 71 |
+
)
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
|
| 75 |
+
rewards_str = ",".join(f"{r:.2f}" for r in rewards)
|
| 76 |
+
print(
|
| 77 |
+
f"[END] success={str(success).lower()} steps={steps} "
|
| 78 |
+
f"score={score:.2f} rewards={rewards_str}",
|
| 79 |
+
flush=True,
|
| 80 |
+
)
|
| 81 |
|
| 82 |
+
|
| 83 |
+
# ---------------------------------------------------------------------------
|
| 84 |
+
# LLM interaction
|
| 85 |
+
# ---------------------------------------------------------------------------
|
| 86 |
|
| 87 |
|
| 88 |
def build_user_prompt(observation: SQLReviewObservation) -> str:
|
|
|
|
| 90 |
"query": observation.query,
|
| 91 |
"schema_info": observation.schema_info,
|
| 92 |
"context": observation.context,
|
| 93 |
+
"issues_found_so_far": [
|
| 94 |
+
issue.model_dump() for issue in observation.issues_found_so_far
|
| 95 |
+
],
|
| 96 |
"remaining_actions": observation.remaining_actions,
|
| 97 |
"difficulty": observation.difficulty,
|
| 98 |
"feedback": observation.feedback,
|
|
|
|
| 105 |
if stripped.startswith("```"):
|
| 106 |
lines = [line for line in stripped.splitlines() if not line.startswith("```")]
|
| 107 |
stripped = "\n".join(lines).strip()
|
|
|
|
| 108 |
start = stripped.find("{")
|
| 109 |
end = stripped.rfind("}")
|
| 110 |
if start == -1 or end == -1 or end <= start:
|
|
|
|
| 112 |
return json.loads(stripped[start : end + 1])
|
| 113 |
|
| 114 |
|
| 115 |
+
def choose_action(
|
| 116 |
+
llm_client: OpenAI, model_name: str, observation: SQLReviewObservation
|
| 117 |
+
) -> SQLReviewAction:
|
| 118 |
+
try:
|
| 119 |
+
response = llm_client.chat.completions.create(
|
| 120 |
+
model=model_name,
|
| 121 |
+
temperature=0,
|
| 122 |
+
max_tokens=300,
|
| 123 |
+
messages=[
|
| 124 |
+
{"role": "system", "content": SYSTEM_PROMPT},
|
| 125 |
+
{"role": "user", "content": build_user_prompt(observation)},
|
| 126 |
+
],
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 127 |
)
|
| 128 |
+
content = response.choices[0].message.content or ""
|
| 129 |
+
return SQLReviewAction.model_validate(extract_json(content))
|
| 130 |
+
except Exception as exc:
|
| 131 |
+
print(f"[DEBUG] Model request failed: {exc}", flush=True)
|
| 132 |
+
# Fallback: approve to end the episode gracefully
|
| 133 |
+
return SQLReviewAction(action_type="approve", confidence=0.1)
|
| 134 |
+
|
| 135 |
+
|
| 136 |
+
# ---------------------------------------------------------------------------
|
| 137 |
+
# Episode runner
|
| 138 |
+
# ---------------------------------------------------------------------------
|
| 139 |
+
|
| 140 |
+
|
| 141 |
+
def run_episode(
|
| 142 |
+
env: SyncSQLReviewEnv, llm_client: OpenAI, model_name: str, task_id: str
|
| 143 |
+
) -> None:
|
| 144 |
+
rewards: List[float] = []
|
| 145 |
+
steps_taken = 0
|
| 146 |
+
score = 0.0
|
| 147 |
+
success = False
|
| 148 |
+
last_error: Optional[str] = None
|
| 149 |
+
|
| 150 |
+
log_start(task=task_id, env=BENCHMARK, model=model_name)
|
| 151 |
+
|
| 152 |
+
try:
|
| 153 |
+
result = env.reset(task_id=task_id)
|
| 154 |
+
|
| 155 |
+
step = 0
|
| 156 |
+
while not result.done:
|
| 157 |
+
step += 1
|
| 158 |
+
action = choose_action(
|
| 159 |
+
llm_client=llm_client,
|
| 160 |
+
model_name=model_name,
|
| 161 |
+
observation=result.observation,
|
| 162 |
+
)
|
| 163 |
+
|
| 164 |
+
action_str = action.action_type
|
| 165 |
+
if action.issue_description:
|
| 166 |
+
# Keep action string short and readable
|
| 167 |
+
action_str = f"{action.action_type}({action.issue_category})"
|
| 168 |
+
|
| 169 |
+
result = env.step(action)
|
| 170 |
+
|
| 171 |
+
reward = result.reward
|
| 172 |
+
rewards.append(reward)
|
| 173 |
+
steps_taken = step
|
| 174 |
+
last_error = result.info.get("error") if result.info else None
|
| 175 |
+
|
| 176 |
+
log_step(
|
| 177 |
+
step=step,
|
| 178 |
+
action=action_str,
|
| 179 |
+
reward=reward,
|
| 180 |
+
done=result.done,
|
| 181 |
+
error=last_error,
|
| 182 |
+
)
|
| 183 |
+
|
| 184 |
+
# Get final score from state
|
| 185 |
+
state = env.state()
|
| 186 |
+
score = state.final_score if state.final_score is not None else 0.0
|
| 187 |
+
success = score >= SUCCESS_SCORE_THRESHOLD
|
| 188 |
+
|
| 189 |
+
except Exception as exc:
|
| 190 |
+
print(f"[DEBUG] Episode error: {exc}", flush=True)
|
| 191 |
+
last_error = str(exc)
|
| 192 |
+
|
| 193 |
+
finally:
|
| 194 |
+
log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
|
| 195 |
+
|
| 196 |
+
|
| 197 |
+
# ---------------------------------------------------------------------------
|
| 198 |
+
# Main
|
| 199 |
+
# ---------------------------------------------------------------------------
|
| 200 |
|
| 201 |
|
| 202 |
def main() -> int:
|
| 203 |
+
if not API_KEY:
|
|
|
|
|
|
|
|
|
|
|
|
|
| 204 |
raise SystemExit("Set HF_TOKEN or OPENAI_API_KEY before running inference.py")
|
| 205 |
|
| 206 |
task_ids = tuple(
|
| 207 |
+
tid.strip()
|
| 208 |
+
for tid in os.getenv("TASK_IDS", ",".join(DEFAULT_TASK_IDS)).split(",")
|
| 209 |
+
if tid.strip()
|
| 210 |
)
|
| 211 |
|
| 212 |
+
llm_client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
|
| 213 |
+
|
| 214 |
+
with SyncSQLReviewEnv(base_url=ENV_BASE_URL) as env:
|
| 215 |
for task_id in task_ids:
|
| 216 |
+
run_episode(
|
| 217 |
+
env=env,
|
| 218 |
+
llm_client=llm_client,
|
| 219 |
+
model_name=MODEL_NAME,
|
| 220 |
+
task_id=task_id,
|
| 221 |
+
)
|
| 222 |
+
|
| 223 |
return 0
|
| 224 |
|
| 225 |
|
| 226 |
if __name__ == "__main__":
|
| 227 |
raise SystemExit(main())
|
|
|
|
@@ -8,16 +8,63 @@ tags:
|
|
| 8 |
- code-review
|
| 9 |
- security
|
| 10 |
tasks:
|
| 11 |
-
- id:
|
| 12 |
-
name: Syntax
|
| 13 |
difficulty: easy
|
| 14 |
-
description:
|
| 15 |
-
- id:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 16 |
name: Performance Anti-Pattern Review
|
| 17 |
difficulty: medium
|
| 18 |
-
description: Identify schema-aware performance problems.
|
| 19 |
-
- id:
|
| 20 |
-
name:
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 21 |
difficulty: hard
|
| 22 |
-
description:
|
| 23 |
-
|
|
|
|
| 8 |
- code-review
|
| 9 |
- security
|
| 10 |
tasks:
|
| 11 |
+
- id: easy_001
|
| 12 |
+
name: Syntax Keyword Typos
|
| 13 |
difficulty: easy
|
| 14 |
+
description: "Detect misspelled SQL keywords (SELCT, FORM, WEHRE) and unnecessary SELECT *."
|
| 15 |
+
- id: easy_002
|
| 16 |
+
name: Missing FROM Clause
|
| 17 |
+
difficulty: easy
|
| 18 |
+
description: "Find missing FROM keyword before table name."
|
| 19 |
+
- id: easy_003
|
| 20 |
+
name: NULL Comparison Logic
|
| 21 |
+
difficulty: easy
|
| 22 |
+
description: "Detect = NULL instead of IS NULL."
|
| 23 |
+
- id: easy_004
|
| 24 |
+
name: Unclosed String Literal
|
| 25 |
+
difficulty: easy
|
| 26 |
+
description: "Find unterminated quote in WHERE clause."
|
| 27 |
+
- id: easy_005
|
| 28 |
+
name: Unknown Column Name
|
| 29 |
+
difficulty: easy
|
| 30 |
+
description: "Detect column name typo (statuz vs status)."
|
| 31 |
+
- id: medium_001
|
| 32 |
name: Performance Anti-Pattern Review
|
| 33 |
difficulty: medium
|
| 34 |
+
description: "Identify schema-aware performance problems like SELECT *, missing indexes, correlated subqueries."
|
| 35 |
+
- id: medium_002
|
| 36 |
+
name: Unbounded Query Detection
|
| 37 |
+
difficulty: medium
|
| 38 |
+
description: "Find queries missing LIMIT on large tables."
|
| 39 |
+
- id: medium_003
|
| 40 |
+
name: Redundant Operations
|
| 41 |
+
difficulty: medium
|
| 42 |
+
description: "Detect unnecessary DISTINCT on unique columns."
|
| 43 |
+
- id: medium_004
|
| 44 |
+
name: Correlated Subquery Optimization
|
| 45 |
+
difficulty: medium
|
| 46 |
+
description: "Find correlated subqueries that could be JOINs."
|
| 47 |
+
- id: medium_005
|
| 48 |
+
name: Join Performance Issues
|
| 49 |
+
difficulty: medium
|
| 50 |
+
description: "Identify missing index hints and inefficient joins."
|
| 51 |
+
- id: hard_001
|
| 52 |
+
name: SQL Injection Detection
|
| 53 |
+
difficulty: hard
|
| 54 |
+
description: "Find string concatenation enabling SQL injection vectors."
|
| 55 |
+
- id: hard_002
|
| 56 |
+
name: Privilege Escalation via UNION
|
| 57 |
+
difficulty: hard
|
| 58 |
+
description: "Detect UNION with system tables exposing sensitive data."
|
| 59 |
+
- id: hard_003
|
| 60 |
+
name: PII Data Leakage
|
| 61 |
+
difficulty: hard
|
| 62 |
+
description: "Find unfiltered JOINs exposing personally identifiable information."
|
| 63 |
+
- id: hard_004
|
| 64 |
+
name: Self-Join Optimization
|
| 65 |
+
difficulty: hard
|
| 66 |
+
description: "Detect self-joins replaceable with window functions for 10x improvement."
|
| 67 |
+
- id: hard_005
|
| 68 |
+
name: Transaction Isolation Issues
|
| 69 |
difficulty: hard
|
| 70 |
+
description: "Find missing transaction isolation causing phantom reads."
|
|
|
|
@@ -72,11 +72,13 @@ def test_run_episode_emits_start_step_end_logs(capsys) -> None:
|
|
| 72 |
def __init__(self) -> None:
|
| 73 |
self.chat = SimpleNamespace(completions=DummyCompletions())
|
| 74 |
|
| 75 |
-
|
| 76 |
captured = capsys.readouterr().out
|
| 77 |
|
| 78 |
assert "[START]" in captured
|
|
|
|
| 79 |
assert "[STEP]" in captured
|
| 80 |
assert "[END]" in captured
|
| 81 |
-
assert
|
|
|
|
| 82 |
|
|
|
|
| 72 |
def __init__(self) -> None:
|
| 73 |
self.chat = SimpleNamespace(completions=DummyCompletions())
|
| 74 |
|
| 75 |
+
inference.run_episode(DummyEnv(), DummyClient(), "dummy-model", "easy_999")
|
| 76 |
captured = capsys.readouterr().out
|
| 77 |
|
| 78 |
assert "[START]" in captured
|
| 79 |
+
assert "task=easy_999" in captured
|
| 80 |
assert "[STEP]" in captured
|
| 81 |
assert "[END]" in captured
|
| 82 |
+
assert "success=true" in captured
|
| 83 |
+
assert "score=1.00" in captured
|
| 84 |
|