YashashMathur's picture
Sync from GitHub - all files
f762b8d verified
metadata
title: OpenEnv SQL Analyst
emoji: πŸ“Š
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - openenv

SQL Data Analyst RL Environment

A production-grade, containerized Reinforcement Learning environment for evaluating LLM-powered Data Analysts on real SQL business intelligence tasks.

OpenEnv Hackathon Submission | Meta x Scaler


Environment Description and Motivation

This environment simulates a mission-critical enterprise task: an AI agent querying a production SQL database to extract business intelligence. In real-world enterprises, data analysts spend countless hours writing SQL queries to answer ad-hoc business questions from stakeholders. This environment provides a standardized benchmark to evaluate whether LLM agents can safely and accurately perform this task autonomously, measuring both correctness and efficiency.

Why This Matters

  • Real-World Applicability: Data analysis is one of the most common knowledge work tasks that LLMs are being deployed for
  • Safety-Critical: Database access requires strict guardrails to prevent data corruption
  • Measurable Outcomes: Business questions have definitive correct answers, enabling objective evaluation

Production-Grade Security

The environment implements security safeguards that mirror real enterprise database access controls:

Security Layer Implementation Purpose
Mutation Blocker Regex-based blocking of INSERT, UPDATE, DELETE, DROP, ALTER, TRUNCATE Prevents data corruption
OOM Protection cursor.fetchmany(50) instead of fetchall() Prevents memory exhaustion on large result sets
Query Timeout 2-second timeout wrapper Prevents runaway queries from consuming resources
Read-Only Sandbox In-memory SQLite (:memory: mode) Isolated execution environment

Action Space

The agent submits an Action object with exactly one of two fields:

Field Type Description
sql_query Optional[str] Execute a SQL query against the database
submit_answer Optional[str] Submit a final answer for grading

Mutual Exclusivity Enforced: A Pydantic @model_validator ensures the agent provides exactly one of sql_query or submit_answer. Providing both or neither raises a ValueError.

# Example Actions
action_query = Action(sql_query="SELECT COUNT(*) FROM users")
action_submit = Action(submit_answer="15")

Observation Space

The agent receives an Observation object containing four fields:

Field Type Description
schema_info str Database schema information (tables, columns, types)
current_question str The business question the agent must answer
last_query_result str Result from the most recent SQL query (markdown table format)
error_message str Any error from the last action (empty string if none)

Reward Shaping

The environment implements precise partial reward signals to guide learning:

Event Reward Episode Ends?
Successful SQL query (no errors) +0.1 No
SQLite syntax error -0.1 No
Destructive action detected -1.0 Yes
Step count >= 15 (infinite loop shield) -0.5 Yes
Correct answer submitted +1.0 Yes
Incorrect answer submitted 0.0 Yes

Final Score Calculation:

  • If incorrect: score = 0.0
  • If correct: score = 0.7 + (1 - steps/15) * 0.3
  • Score range: 0.0 to 1.0

Task Descriptions

The environment includes 3 deterministic tasks of increasing difficulty:

Easy: User Count

Attribute Value
Task ID easy_user_count
Difficulty Easy
Question "How many users are registered in the system? Provide the total count as a single number."
Ground Truth 15
SQL Complexity Single table COUNT query
Reference SQL SELECT COUNT(*) FROM users

Medium: USA Revenue

Attribute Value
Task ID medium_usa_revenue
Difficulty Medium
Question "What is the total revenue (sum of total_amount) from purchases made by users in the USA? Provide the total as a number (rounded to 2 decimal places if needed)."
Ground Truth 2423.87
SQL Complexity Two-table JOIN with SUM aggregation filtered by country
Reference SQL SELECT ROUND(SUM(p.total_amount), 2) FROM purchases p JOIN users u ON p.user_id = u.user_id WHERE u.country = 'USA'

Hard: Top Spender

Attribute Value
Task ID hard_top_spender
Difficulty Hard
Question "Who is the top spender (user with highest total purchase amount)? Provide the username of the user who spent the most money in total."
Ground Truth alice
SQL Complexity Complex query with JOIN, GROUP BY, ORDER BY, and LIMIT
Reference SQL SELECT u.username FROM users u JOIN purchases p ON u.user_id = p.user_id GROUP BY u.user_id, u.username ORDER BY SUM(p.total_amount) DESC LIMIT 1

Grading System

All graders implement:

  • Type-agnostic normalization: Whitespace trimming, lowercasing, numeric rounding to 2 decimal places
  • Numeric tolerance: Answers within 0.01 absolute tolerance are exact matches
  • Partial credit: Numeric answers within 10% receive 0.5 score
  • SQL evaluation: If agent submits SQL as answer, it's executed and results compared

Setup and Usage Instructions

Prerequisites

  • Docker installed and running
  • Python 3.10+ (for local development)
  • (Optional) HuggingFace token for inference with HF-hosted models

Quick Start with Docker

# Clone the repository
git clone https://github.com/hitanshu04/openenv-sql-analyst.git
cd openenv_sql_analyst

# Build the Docker image
docker build -t openenv-sql-analyst .

# Run the container
docker run -p 7860:7860 openenv-sql-analyst

The server will be available at http://localhost:7860

API Endpoints

Endpoint Method Description
/ GET Health check (returns 200 OK)
/reset POST Reset environment, returns initial observation
/step POST Execute action, returns (observation, reward, done, info)
/state GET Get current internal state

Local Development (Without Docker)

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the server directly
python -m server.app

# Or run validation
chmod +x validate.sh
./validate.sh

Running Inference

# Set environment variables
export HF_TOKEN="your-huggingface-token"
export API_BASE_URL="https://api.openai.com/v1"  # or HF inference endpoint
export MODEL_NAME="gpt-4o-mini"

# Run inference
python inference.py

Environment Variables

Variable Description Default
HF_TOKEN HuggingFace API token (used as API key) Required for inference
API_BASE_URL OpenAI-compatible API endpoint https://api.openai.com/v1
MODEL_NAME Model identifier gpt-4o-mini

Validation Gates

Run ./validate.sh before submission. All 4 checks must pass:

Step Check Failure Condition
1/4 Prerequisites docker or openenv CLI not found
2/4 Docker Build Dockerfile missing or build fails
3/4 OpenEnv Spec openenv validate fails (yaml/models mismatch)
4/4 Inference Logs Missing [START]/[STEP]/[END] tags or invalid score

Baseline Scores

Expected performance with gpt-4o-mini:

Task Difficulty Expected Steps Expected Score
easy_user_count Easy 2-3 0.90 - 1.00
medium_usa_revenue Medium 3-5 0.85 - 0.95
hard_top_spender Hard 4-7 0.75 - 0.90

STDOUT Log Format

The inference script outputs logs in the exact required format:

[START] task=<task_id> env=sql_analyst model=<model_name>
[STEP]  step=<n> action=<action_type>=<value> reward=<r.rr> done=<bool> error=<msg>
[END]   success=<bool> steps=<n> score=<s.ss> rewards=<r1>,<r2>,...

Example Output:

[START] task=easy_user_count env=sql_analyst model=gpt-4o-mini
[STEP]  step=1 action=sql_query=SELECT COUNT(*) FROM users reward=0.10 done=false error=null
[STEP]  step=2 action=submit_answer=15 reward=1.00 done=true error=null
[END]   success=true steps=2 score=0.96 rewards=0.10,1.00

Project Architecture

openenv_sql_analyst/
β”œβ”€β”€ openenv.yaml          # OpenEnv specification (name, schemas, endpoints)
β”œβ”€β”€ Dockerfile            # Container config (python:3.10-slim, port 7860)
β”œβ”€β”€ requirements.txt      # Python dependencies
β”œβ”€β”€ pyproject.toml        # Python project configuration
β”œβ”€β”€ validate.sh           # Pre-submission validation (4 gates)
β”œβ”€β”€ inference.py          # Baseline LLM agent implementation
β”œβ”€β”€ data/
β”‚   └── mock_data.sql     # SQLite mock database (3 tables, ~50 rows)
β”œβ”€β”€ environment/
β”‚   β”œβ”€β”€ __init__.py       # Package exports
β”‚   β”œβ”€β”€ models.py         # Pydantic schemas (Action, Observation, Reward)
β”‚   β”œβ”€β”€ db_engine.py      # SQLite engine with security safeguards
β”‚   β”œβ”€β”€ tasks.py          # Task definitions (Easy, Medium, Hard)
β”‚   β”œβ”€β”€ graders.py        # Deterministic grading system
β”‚   └── env.py            # Main SQLAnalystEnv class (reset, step, state)
└── server/
    └── app.py            # FastAPI server (/reset, /step, /state endpoints)

Technical Specifications

Specification Value
Python Version 3.10
Container Base python:3.10-slim
Container Port 7860
vCPU Limit 2
Memory Limit 8 GB
Max Runtime 20 minutes
Max Steps per Episode 15
Query Timeout 2 seconds
Max Fetch Rows 50
Database SQLite (in-memory)

Database Schema

The mock database contains 3 tables:

users

Column Type Constraints
user_id INTEGER PRIMARY KEY
username TEXT NOT NULL
email TEXT NOT NULL
country TEXT NOT NULL
created_at TEXT NOT NULL

products

Column Type Constraints
product_id INTEGER PRIMARY KEY
product_name TEXT NOT NULL
category TEXT NOT NULL
price REAL NOT NULL
stock INTEGER NOT NULL

purchases

Column Type Constraints
purchase_id INTEGER PRIMARY KEY
user_id INTEGER NOT NULL, FOREIGN KEY
product_id INTEGER NOT NULL, FOREIGN KEY
quantity INTEGER NOT NULL
purchase_date TEXT NOT NULL
total_amount REAL NOT NULL

License

MIT License


Acknowledgments

Built for the Meta x Scaler OpenEnv Hackathon - advancing the frontier of LLM agent evaluation through standardized, production-grade reinforcement learning environments.