--- title: OpenEnv SQL Analyst emoji: 📊 colorFrom: blue colorTo: green sdk: docker pinned: false tags: - openenv --- # SQL Data Analyst RL Environment > A production-grade, containerized Reinforcement Learning environment for evaluating LLM-powered Data Analysts on real SQL business intelligence tasks. **OpenEnv Hackathon Submission** | Meta x Scaler --- ## Environment Description and Motivation This environment simulates a **mission-critical enterprise task**: an AI agent querying a production SQL database to extract business intelligence. In real-world enterprises, data analysts spend countless hours writing SQL queries to answer ad-hoc business questions from stakeholders. This environment provides a standardized benchmark to evaluate whether LLM agents can safely and accurately perform this task autonomously, measuring both **correctness** and **efficiency**. ### Why This Matters - **Real-World Applicability**: Data analysis is one of the most common knowledge work tasks that LLMs are being deployed for - **Safety-Critical**: Database access requires strict guardrails to prevent data corruption - **Measurable Outcomes**: Business questions have definitive correct answers, enabling objective evaluation ### Production-Grade Security The environment implements security safeguards that mirror real enterprise database access controls: | Security Layer | Implementation | Purpose | |----------------|----------------|---------| | **Mutation Blocker** | Regex-based blocking of `INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `TRUNCATE` | Prevents data corruption | | **OOM Protection** | `cursor.fetchmany(50)` instead of `fetchall()` | Prevents memory exhaustion on large result sets | | **Query Timeout** | 2-second timeout wrapper | Prevents runaway queries from consuming resources | | **Read-Only Sandbox** | In-memory SQLite (`:memory:` mode) | Isolated execution environment | --- ## Action Space The agent submits an `Action` object with **exactly one** of two fields: | Field | Type | Description | |-------|------|-------------| | `sql_query` | `Optional[str]` | Execute a SQL query against the database | | `submit_answer` | `Optional[str]` | Submit a final answer for grading | **Mutual Exclusivity Enforced**: A Pydantic `@model_validator` ensures the agent provides exactly one of `sql_query` or `submit_answer`. Providing both or neither raises a `ValueError`. ```python # Example Actions action_query = Action(sql_query="SELECT COUNT(*) FROM users") action_submit = Action(submit_answer="15") ``` --- ## Observation Space The agent receives an `Observation` object containing four fields: | Field | Type | Description | |-------|------|-------------| | `schema_info` | `str` | Database schema information (tables, columns, types) | | `current_question` | `str` | The business question the agent must answer | | `last_query_result` | `str` | Result from the most recent SQL query (markdown table format) | | `error_message` | `str` | Any error from the last action (empty string if none) | --- ## Reward Shaping The environment implements precise partial reward signals to guide learning: | Event | Reward | Episode Ends? | |-------|--------|---------------| | Successful SQL query (no errors) | `+0.1` | No | | SQLite syntax error | `-0.1` | No | | Destructive action detected | `-1.0` | **Yes** | | Step count >= 15 (infinite loop shield) | `-0.5` | **Yes** | | Correct answer submitted | `+1.0` | **Yes** | | Incorrect answer submitted | `0.0` | **Yes** | **Final Score Calculation**: - If incorrect: `score = 0.0` - If correct: `score = 0.7 + (1 - steps/15) * 0.3` - Score range: `0.0` to `1.0` --- ## Task Descriptions The environment includes **3 deterministic tasks** of increasing difficulty: ### Easy: User Count | Attribute | Value | |-----------|-------| | **Task ID** | `easy_user_count` | | **Difficulty** | Easy | | **Question** | "How many users are registered in the system? Provide the total count as a single number." | | **Ground Truth** | `15` | | **SQL Complexity** | Single table `COUNT` query | | **Reference SQL** | `SELECT COUNT(*) FROM users` | ### Medium: USA Revenue | Attribute | Value | |-----------|-------| | **Task ID** | `medium_usa_revenue` | | **Difficulty** | Medium | | **Question** | "What is the total revenue (sum of total_amount) from purchases made by users in the USA? Provide the total as a number (rounded to 2 decimal places if needed)." | | **Ground Truth** | `2423.87` | | **SQL Complexity** | Two-table `JOIN` with `SUM` aggregation filtered by country | | **Reference SQL** | `SELECT ROUND(SUM(p.total_amount), 2) FROM purchases p JOIN users u ON p.user_id = u.user_id WHERE u.country = 'USA'` | ### Hard: Top Spender | Attribute | Value | |-----------|-------| | **Task ID** | `hard_top_spender` | | **Difficulty** | Hard | | **Question** | "Who is the top spender (user with highest total purchase amount)? Provide the username of the user who spent the most money in total." | | **Ground Truth** | `alice` | | **SQL Complexity** | Complex query with `JOIN`, `GROUP BY`, `ORDER BY`, and `LIMIT` | | **Reference SQL** | `SELECT u.username FROM users u JOIN purchases p ON u.user_id = p.user_id GROUP BY u.user_id, u.username ORDER BY SUM(p.total_amount) DESC LIMIT 1` | ### Grading System All graders implement: - **Type-agnostic normalization**: Whitespace trimming, lowercasing, numeric rounding to 2 decimal places - **Numeric tolerance**: Answers within 0.01 absolute tolerance are exact matches - **Partial credit**: Numeric answers within 10% receive 0.5 score - **SQL evaluation**: If agent submits SQL as answer, it's executed and results compared --- ## Setup and Usage Instructions ### Prerequisites - Docker installed and running - Python 3.10+ (for local development) - (Optional) HuggingFace token for inference with HF-hosted models ### Quick Start with Docker ```bash # Clone the repository git clone https://github.com/hitanshu04/openenv-sql-analyst.git cd openenv_sql_analyst # Build the Docker image docker build -t openenv-sql-analyst . # Run the container docker run -p 7860:7860 openenv-sql-analyst ``` The server will be available at `http://localhost:7860` ### API Endpoints | Endpoint | Method | Description | |----------|--------|-------------| | `/` | GET | Health check (returns 200 OK) | | `/reset` | POST | Reset environment, returns initial observation | | `/step` | POST | Execute action, returns (observation, reward, done, info) | | `/state` | GET | Get current internal state | ### Local Development (Without Docker) ```bash # Create virtual environment python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate # Install dependencies pip install -r requirements.txt # Run the server directly python -m server.app # Or run validation chmod +x validate.sh ./validate.sh ``` ### Running Inference ```bash # Set environment variables export HF_TOKEN="your-huggingface-token" export API_BASE_URL="https://api.openai.com/v1" # or HF inference endpoint export MODEL_NAME="gpt-4o-mini" # Run inference python inference.py ``` ### Environment Variables | Variable | Description | Default | |----------|-------------|---------| | `HF_TOKEN` | HuggingFace API token (used as API key) | Required for inference | | `API_BASE_URL` | OpenAI-compatible API endpoint | `https://api.openai.com/v1` | | `MODEL_NAME` | Model identifier | `gpt-4o-mini` | ### Validation Gates Run `./validate.sh` before submission. All 4 checks must pass: | Step | Check | Failure Condition | |------|-------|-------------------| | 1/4 | Prerequisites | `docker` or `openenv` CLI not found | | 2/4 | Docker Build | `Dockerfile` missing or build fails | | 3/4 | OpenEnv Spec | `openenv validate` fails (yaml/models mismatch) | | 4/4 | Inference Logs | Missing `[START]`/`[STEP]`/`[END]` tags or invalid score | --- ## Baseline Scores Expected performance with `gpt-4o-mini`: | Task | Difficulty | Expected Steps | Expected Score | |------|------------|----------------|----------------| | `easy_user_count` | Easy | 2-3 | 0.90 - 1.00 | | `medium_usa_revenue` | Medium | 3-5 | 0.85 - 0.95 | | `hard_top_spender` | Hard | 4-7 | 0.75 - 0.90 | ### STDOUT Log Format The inference script outputs logs in the exact required format: ``` [START] task= env=sql_analyst model= [STEP] step= action== reward= done= error= [END] success= steps= score= rewards=,,... ``` **Example Output**: ``` [START] task=easy_user_count env=sql_analyst model=gpt-4o-mini [STEP] step=1 action=sql_query=SELECT COUNT(*) FROM users reward=0.10 done=false error=null [STEP] step=2 action=submit_answer=15 reward=1.00 done=true error=null [END] success=true steps=2 score=0.96 rewards=0.10,1.00 ``` --- ## Project Architecture ``` openenv_sql_analyst/ ├── openenv.yaml # OpenEnv specification (name, schemas, endpoints) ├── Dockerfile # Container config (python:3.10-slim, port 7860) ├── requirements.txt # Python dependencies ├── pyproject.toml # Python project configuration ├── validate.sh # Pre-submission validation (4 gates) ├── inference.py # Baseline LLM agent implementation ├── data/ │ └── mock_data.sql # SQLite mock database (3 tables, ~50 rows) ├── environment/ │ ├── __init__.py # Package exports │ ├── models.py # Pydantic schemas (Action, Observation, Reward) │ ├── db_engine.py # SQLite engine with security safeguards │ ├── tasks.py # Task definitions (Easy, Medium, Hard) │ ├── graders.py # Deterministic grading system │ └── env.py # Main SQLAnalystEnv class (reset, step, state) └── server/ └── app.py # FastAPI server (/reset, /step, /state endpoints) ``` --- ## Technical Specifications | Specification | Value | |---------------|-------| | Python Version | 3.10 | | Container Base | `python:3.10-slim` | | Container Port | 7860 | | vCPU Limit | 2 | | Memory Limit | 8 GB | | Max Runtime | 20 minutes | | Max Steps per Episode | 15 | | Query Timeout | 2 seconds | | Max Fetch Rows | 50 | | Database | SQLite (in-memory) | --- ## Database Schema The mock database contains 3 tables: ### users | Column | Type | Constraints | |--------|------|-------------| | user_id | INTEGER | PRIMARY KEY | | username | TEXT | NOT NULL | | email | TEXT | NOT NULL | | country | TEXT | NOT NULL | | created_at | TEXT | NOT NULL | ### products | Column | Type | Constraints | |--------|------|-------------| | product_id | INTEGER | PRIMARY KEY | | product_name | TEXT | NOT NULL | | category | TEXT | NOT NULL | | price | REAL | NOT NULL | | stock | INTEGER | NOT NULL | ### purchases | Column | Type | Constraints | |--------|------|-------------| | purchase_id | INTEGER | PRIMARY KEY | | user_id | INTEGER | NOT NULL, FOREIGN KEY | | product_id | INTEGER | NOT NULL, FOREIGN KEY | | quantity | INTEGER | NOT NULL | | purchase_date | TEXT | NOT NULL | | total_amount | REAL | NOT NULL | --- ## License MIT License --- ## Acknowledgments Built for the **Meta x Scaler OpenEnv Hackathon** - advancing the frontier of LLM agent evaluation through standardized, production-grade reinforcement learning environments.