---
title: OpenEnv SQL Analyst
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
- openenv
---
# SQL Data Analyst RL Environment
> A production-grade, containerized Reinforcement Learning environment for evaluating LLM-powered Data Analysts on real SQL business intelligence tasks.
**OpenEnv Hackathon Submission** | Meta x Scaler
---
## Environment Description and Motivation
This environment simulates a **mission-critical enterprise task**: an AI agent querying a production SQL database to extract business intelligence. In real-world enterprises, data analysts spend countless hours writing SQL queries to answer ad-hoc business questions from stakeholders. This environment provides a standardized benchmark to evaluate whether LLM agents can safely and accurately perform this task autonomously, measuring both **correctness** and **efficiency**.
### Why This Matters
- **Real-World Applicability**: Data analysis is one of the most common knowledge work tasks that LLMs are being deployed for
- **Safety-Critical**: Database access requires strict guardrails to prevent data corruption
- **Measurable Outcomes**: Business questions have definitive correct answers, enabling objective evaluation
### Production-Grade Security
The environment implements security safeguards that mirror real enterprise database access controls:
| Security Layer | Implementation | Purpose |
|----------------|----------------|---------|
| **Mutation Blocker** | Regex-based blocking of `INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `TRUNCATE` | Prevents data corruption |
| **OOM Protection** | `cursor.fetchmany(50)` instead of `fetchall()` | Prevents memory exhaustion on large result sets |
| **Query Timeout** | 2-second timeout wrapper | Prevents runaway queries from consuming resources |
| **Read-Only Sandbox** | In-memory SQLite (`:memory:` mode) | Isolated execution environment |
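Conceptually, these layers compose into one guarded execution path. Below is a minimal sketch, assuming an in-memory `sqlite3` connection; the function and constant names are illustrative, and the timeout is approximated with a progress handler since `sqlite3` has no native per-query timeout (the actual `db_engine.py` may implement this differently):
```python
import re
import sqlite3
import time

# Illustrative guard mirroring the mutation blocker described above
MUTATION_PATTERN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE)\b", re.IGNORECASE
)
MAX_ROWS = 50          # OOM protection: never fetchall()
TIMEOUT_SECONDS = 2.0  # runaway-query shield

def run_query_safely(conn: sqlite3.Connection, sql: str) -> list:
    """Execute a read-only query under the guardrails listed above."""
    if MUTATION_PATTERN.search(sql):
        raise PermissionError("Destructive SQL statement blocked")
    deadline = time.monotonic() + TIMEOUT_SECONDS
    # A truthy return from the progress handler aborts the running query
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 10_000)
    try:
        return conn.execute(sql).fetchmany(MAX_ROWS)  # cap the result size
    finally:
        conn.set_progress_handler(None, 0)
```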
---
## Action Space
The agent submits an `Action` object with **exactly one** of two fields:
| Field | Type | Description |
|-------|------|-------------|
| `sql_query` | `Optional[str]` | Execute a SQL query against the database |
| `submit_answer` | `Optional[str]` | Submit a final answer for grading |
**Mutual Exclusivity Enforced**: A Pydantic `@model_validator` ensures the agent provides exactly one of `sql_query` or `submit_answer`. Providing both or neither raises a `ValueError`.
```python
# Example Actions
action_query = Action(sql_query="SELECT COUNT(*) FROM users")
action_submit = Action(submit_answer="15")
```
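For reference, a minimal sketch of how this model could be declared with Pydantic v2 (field names follow the table above; the validator name and error message are illustrative, and the real `models.py` may differ):
```python
from typing import Optional

from pydantic import BaseModel, model_validator

class Action(BaseModel):
    sql_query: Optional[str] = None
    submit_answer: Optional[str] = None

    @model_validator(mode="after")
    def exactly_one_field(self) -> "Action":
        # Mutual exclusivity: exactly one of the two fields must be set
        if (self.sql_query is None) == (self.submit_answer is None):
            raise ValueError("Provide exactly one of sql_query or submit_answer")
        return self
```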
---
## Observation Space
The agent receives an `Observation` object containing four fields:
| Field | Type | Description |
|-------|------|-------------|
| `schema_info` | `str` | Database schema information (tables, columns, types) |
| `current_question` | `str` | The business question the agent must answer |
| `last_query_result` | `str` | Result from the most recent SQL query (markdown table format) |
| `error_message` | `str` | Any error from the last action (empty string if none) |
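As a data structure this is a flat Pydantic model; a sketch (the empty-string defaults follow the `error_message` convention above and are otherwise an assumption):
```python
from pydantic import BaseModel

class Observation(BaseModel):
    schema_info: str = ""
    current_question: str = ""
    last_query_result: str = ""
    error_message: str = ""  # empty string when the last action succeeded
```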
---
## Reward Shaping
The environment emits partial reward signals at every step to guide learning:
| Event | Reward | Episode Ends? |
|-------|--------|---------------|
| Successful SQL query (no errors) | `+0.1` | No |
| SQLite syntax error | `-0.1` | No |
| Destructive action detected | `-1.0` | **Yes** |
| Step count >= 15 (infinite loop shield) | `-0.5` | **Yes** |
| Correct answer submitted | `+1.0` | **Yes** |
| Incorrect answer submitted | `0.0` | **Yes** |
**Final Score Calculation**:
- If incorrect: `score = 0.0`
- If correct: `score = 0.7 + (1 - steps/15) * 0.3`
- Score range: `0.0` to `1.0`
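Worked example: a correct answer submitted on step 2 scores `0.7 + (1 - 2/15) * 0.3 = 0.96`, which matches the sample log at the end of this README. As a one-line helper (hypothetical, for illustration):
```python
def final_score(correct: bool, steps: int, max_steps: int = 15) -> float:
    # 0.7 base for correctness plus up to 0.3 for step efficiency
    return 0.7 + (1 - steps / max_steps) * 0.3 if correct else 0.0

assert round(final_score(True, 2), 2) == 0.96  # matches the example log
```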
---
## Task Descriptions
The environment includes **3 deterministic tasks** of increasing difficulty:
### Easy: User Count
| Attribute | Value |
|-----------|-------|
| **Task ID** | `easy_user_count` |
| **Difficulty** | Easy |
| **Question** | "How many users are registered in the system? Provide the total count as a single number." |
| **Ground Truth** | `15` |
| **SQL Complexity** | Single table `COUNT` query |
| **Reference SQL** | `SELECT COUNT(*) FROM users` |
### Medium: USA Revenue
| Attribute | Value |
|-----------|-------|
| **Task ID** | `medium_usa_revenue` |
| **Difficulty** | Medium |
| **Question** | "What is the total revenue (sum of total_amount) from purchases made by users in the USA? Provide the total as a number (rounded to 2 decimal places if needed)." |
| **Ground Truth** | `2423.87` |
| **SQL Complexity** | Two-table `JOIN` with `SUM` aggregation filtered by country |
| **Reference SQL** | `SELECT ROUND(SUM(p.total_amount), 2) FROM purchases p JOIN users u ON p.user_id = u.user_id WHERE u.country = 'USA'` |
### Hard: Top Spender
| Attribute | Value |
|-----------|-------|
| **Task ID** | `hard_top_spender` |
| **Difficulty** | Hard |
| **Question** | "Who is the top spender (user with highest total purchase amount)? Provide the username of the user who spent the most money in total." |
| **Ground Truth** | `alice` |
| **SQL Complexity** | Complex query with `JOIN`, `GROUP BY`, `ORDER BY`, and `LIMIT` |
| **Reference SQL** | `SELECT u.username FROM users u JOIN purchases p ON u.user_id = p.user_id GROUP BY u.user_id, u.username ORDER BY SUM(p.total_amount) DESC LIMIT 1` |
### Grading System
All graders implement:
- **Type-agnostic normalization**: Whitespace trimming, lowercasing, numeric rounding to 2 decimal places
- **Numeric tolerance**: Answers within an absolute tolerance of 0.01 count as exact matches
- **Partial credit**: Numeric answers within 10% of the ground truth receive a score of 0.5
- **SQL evaluation**: If the agent submits SQL as its final answer, the query is executed and its results are compared to the ground truth
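A sketch of the normalization and tolerance rules above, assuming string inputs (the real `graders.py` may structure this differently; SQL-as-answer execution is omitted here):
```python
def grade(submitted: str, truth: str) -> float:
    """Return 1.0 for a match, 0.5 for a close numeric answer, else 0.0."""
    s, t = submitted.strip().lower(), truth.strip().lower()
    if s == t:
        return 1.0
    try:
        s_num, t_num = float(s), float(t)
    except ValueError:
        return 0.0
    if abs(round(s_num, 2) - round(t_num, 2)) <= 0.01:  # absolute tolerance
        return 1.0
    if t_num and abs(s_num - t_num) / abs(t_num) <= 0.10:  # within 10%
        return 0.5
    return 0.0
```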
---
## Setup and Usage Instructions
### Prerequisites
- Docker installed and running
- Python 3.10+ (for local development)
- (Optional) HuggingFace token for inference with HF-hosted models
### Quick Start with Docker
```bash
# Clone the repository
git clone https://github.com/hitanshu04/openenv-sql-analyst.git
cd openenv_sql_analyst
# Build the Docker image
docker build -t openenv-sql-analyst .
# Run the container
docker run -p 7860:7860 openenv-sql-analyst
```
The server will be available at `http://localhost:7860`.
### API Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Health check (returns 200 OK) |
| `/reset` | POST | Reset environment, returns initial observation |
| `/step` | POST | Execute action, returns (observation, reward, done, info) |
| `/state` | GET | Get current internal state |
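A minimal client sketch using `requests`; the JSON payload and response shapes shown here are assumptions inferred from the Action/Observation schemas above, not a documented contract:
```python
import requests

BASE = "http://localhost:7860"

# Reset the environment and read the initial observation
obs = requests.post(f"{BASE}/reset").json()
print(obs.get("current_question"))

# Step with an Action payload (assumed to mirror the Action model)
result = requests.post(
    f"{BASE}/step",
    json={"sql_query": "SELECT COUNT(*) FROM users"},
).json()
print(result)  # expected to carry observation, reward, done, info
```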
### Local Development (Without Docker)
```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Run the server directly
python -m server.app
# Or run validation
chmod +x validate.sh
./validate.sh
```
### Running Inference
```bash
# Set environment variables
export HF_TOKEN="your-huggingface-token"
export API_BASE_URL="https://api.openai.com/v1" # or HF inference endpoint
export MODEL_NAME="gpt-4o-mini"
# Run inference
python inference.py
```
### Environment Variables
| Variable | Description | Default |
|----------|-------------|---------|
| `HF_TOKEN` | HuggingFace API token (used as API key) | Required for inference |
| `API_BASE_URL` | OpenAI-compatible API endpoint | `https://api.openai.com/v1` |
| `MODEL_NAME` | Model identifier | `gpt-4o-mini` |
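An OpenAI-compatible client would typically read these as follows (variable names here are illustrative; `inference.py` may differ):
```python
import os

api_key = os.environ["HF_TOKEN"]  # required; raises KeyError if unset
base_url = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
model = os.environ.get("MODEL_NAME", "gpt-4o-mini")
```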
### Validation Gates
Run `./validate.sh` before submission. All 4 checks must pass:
| Step | Check | Failure Condition |
|------|-------|-------------------|
| 1/4 | Prerequisites | `docker` or `openenv` CLI not found |
| 2/4 | Docker Build | `Dockerfile` missing or build fails |
| 3/4 | OpenEnv Spec | `openenv validate` fails (yaml/models mismatch) |
| 4/4 | Inference Logs | Missing `[START]`/`[STEP]`/`[END]` tags or invalid score |
---
## Baseline Scores
Expected performance with `gpt-4o-mini`:
| Task | Difficulty | Expected Steps | Expected Score |
|------|------------|----------------|----------------|
| `easy_user_count` | Easy | 2-3 | 0.90 - 1.00 |
| `medium_usa_revenue` | Medium | 3-5 | 0.85 - 0.95 |
| `hard_top_spender` | Hard | 4-7 | 0.75 - 0.90 |
### STDOUT Log Format
The inference script outputs logs in the exact required format:
```
[START] task=<task_id> env=sql_analyst model=<model_name>
[STEP] step=<n> action=<action_type>=<value> reward=<r.rr> done=<bool> error=<msg>
[END] success=<bool> steps=<n> score=<s.ss> rewards=<r1>,<r2>,...
```
**Example Output**:
```
[START] task=easy_user_count env=sql_analyst model=gpt-4o-mini
[STEP] step=1 action=sql_query=SELECT COUNT(*) FROM users reward=0.10 done=false error=null
[STEP] step=2 action=submit_answer=15 reward=1.00 done=true error=null
[END] success=true steps=2 score=0.96 rewards=0.10,1.00
```
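Because the format is line-oriented, the final score can be scraped with a simple regex; a sketch, assuming exactly the `[END]` layout shown above:
```python
import re

END_RE = re.compile(
    r"\[END\] success=(?P<success>\w+) steps=(?P<steps>\d+) "
    r"score=(?P<score>[\d.]+) rewards=(?P<rewards>.+)"
)

line = "[END] success=true steps=2 score=0.96 rewards=0.10,1.00"
match = END_RE.match(line)
assert match is not None and float(match["score"]) == 0.96
```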
---
## Project Architecture
```
openenv_sql_analyst/
├── openenv.yaml          # OpenEnv specification (name, schemas, endpoints)
├── Dockerfile            # Container config (python:3.10-slim, port 7860)
├── requirements.txt      # Python dependencies
├── pyproject.toml        # Python project configuration
├── validate.sh           # Pre-submission validation (4 gates)
├── inference.py          # Baseline LLM agent implementation
├── data/
│   └── mock_data.sql     # SQLite mock database (3 tables, ~50 rows)
├── environment/
│   ├── __init__.py       # Package exports
│   ├── models.py         # Pydantic schemas (Action, Observation, Reward)
│   ├── db_engine.py      # SQLite engine with security safeguards
│   ├── tasks.py          # Task definitions (Easy, Medium, Hard)
│   ├── graders.py        # Deterministic grading system
│   └── env.py            # Main SQLAnalystEnv class (reset, step, state)
└── server/
    └── app.py            # FastAPI server (/reset, /step, /state endpoints)
```
---
## Technical Specifications
| Specification | Value |
|---------------|-------|
| Python Version | 3.10 |
| Container Base | `python:3.10-slim` |
| Container Port | 7860 |
| vCPU Limit | 2 |
| Memory Limit | 8 GB |
| Max Runtime | 20 minutes |
| Max Steps per Episode | 15 |
| Query Timeout | 2 seconds |
| Max Fetch Rows | 50 |
| Database | SQLite (in-memory) |
---
## Database Schema
The mock database contains 3 tables:
### users
| Column | Type | Constraints |
|--------|------|-------------|
| user_id | INTEGER | PRIMARY KEY |
| username | TEXT | NOT NULL |
| email | TEXT | NOT NULL |
| country | TEXT | NOT NULL |
| created_at | TEXT | NOT NULL |
### products
| Column | Type | Constraints |
|--------|------|-------------|
| product_id | INTEGER | PRIMARY KEY |
| product_name | TEXT | NOT NULL |
| category | TEXT | NOT NULL |
| price | REAL | NOT NULL |
| stock | INTEGER | NOT NULL |
### purchases
| Column | Type | Constraints |
|--------|------|-------------|
| purchase_id | INTEGER | PRIMARY KEY |
| user_id | INTEGER | NOT NULL, FOREIGN KEY |
| product_id | INTEGER | NOT NULL, FOREIGN KEY |
| quantity | INTEGER | NOT NULL |
| purchase_date | TEXT | NOT NULL |
| total_amount | REAL | NOT NULL |
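The DDL implied by these tables could be reconstructed as the following sketch (constraint spellings are assumptions; the authoritative definitions live in `data/mock_data.sql`):
```python
import sqlite3

DDL = """
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    username   TEXT NOT NULL,
    email      TEXT NOT NULL,
    country    TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE products (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category     TEXT NOT NULL,
    price        REAL NOT NULL,
    stock        INTEGER NOT NULL
);
CREATE TABLE purchases (
    purchase_id   INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL REFERENCES users(user_id),
    product_id    INTEGER NOT NULL REFERENCES products(product_id),
    quantity      INTEGER NOT NULL,
    purchase_date TEXT NOT NULL,
    total_amount  REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")  # same in-memory mode the environment uses
conn.executescript(DDL)
```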
---
## License
MIT License
---
## Acknowledgments
Built for the **Meta x Scaler OpenEnv Hackathon**, advancing the frontier of LLM agent evaluation through standardized, production-grade reinforcement learning environments.