---
title: OpenEnv SQL Analyst
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - openenv
---
# SQL Data Analyst RL Environment

A production-grade, containerized Reinforcement Learning environment for evaluating LLM-powered Data Analysts on real SQL business intelligence tasks.

**OpenEnv Hackathon Submission | Meta x Scaler**
## Environment Description and Motivation
This environment simulates a mission-critical enterprise task: an AI agent querying a production SQL database to extract business intelligence. In real-world enterprises, data analysts spend countless hours writing SQL queries to answer ad-hoc business questions from stakeholders. This environment provides a standardized benchmark to evaluate whether LLM agents can safely and accurately perform this task autonomously, measuring both correctness and efficiency.
### Why This Matters
- Real-World Applicability: Data analysis is one of the most common knowledge work tasks that LLMs are being deployed for
- Safety-Critical: Database access requires strict guardrails to prevent data corruption
- Measurable Outcomes: Business questions have definitive correct answers, enabling objective evaluation
## Production-Grade Security
The environment implements security safeguards that mirror real enterprise database access controls:
| Security Layer | Implementation | Purpose |
|---|---|---|
| Mutation Blocker | Regex-based blocking of `INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `TRUNCATE` | Prevents data corruption |
| OOM Protection | `cursor.fetchmany(50)` instead of `fetchall()` | Prevents memory exhaustion on large result sets |
| Query Timeout | 2-second timeout wrapper | Prevents runaway queries from consuming resources |
| Read-Only Sandbox | In-memory SQLite (`:memory:` mode) | Isolated execution environment |
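A minimal sketch of how the first two guards might look (illustrative only; the actual `db_engine.py` may differ in pattern and error handling):

```python
import re
import sqlite3

# Assumed pattern: mirrors the blocked verbs in the table above.
MUTATION_RE = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE)\b", re.IGNORECASE
)

def run_query(conn: sqlite3.Connection, sql: str, max_rows: int = 50):
    """Reject mutating statements, then fetch at most max_rows rows."""
    if MUTATION_RE.search(sql):
        raise PermissionError("Destructive SQL statements are blocked.")
    cursor = conn.execute(sql)
    return cursor.fetchmany(max_rows)  # bounded fetch guards against OOM
```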
## Action Space
The agent submits an Action object with exactly one of two fields:
| Field | Type | Description |
|---|---|---|
| `sql_query` | `Optional[str]` | Execute a SQL query against the database |
| `submit_answer` | `Optional[str]` | Submit a final answer for grading |
**Mutual Exclusivity Enforced:** A Pydantic `@model_validator` ensures the agent provides exactly one of `sql_query` or `submit_answer`. Providing both or neither raises a `ValueError`.
```python
# Example Actions
action_query = Action(sql_query="SELECT COUNT(*) FROM users")
action_submit = Action(submit_answer="15")
```
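A minimal sketch of the model and its validator (assuming Pydantic v2; the actual `models.py` may differ):

```python
from typing import Optional
from pydantic import BaseModel, model_validator

class Action(BaseModel):
    sql_query: Optional[str] = None
    submit_answer: Optional[str] = None

    @model_validator(mode="after")
    def exactly_one_field(self) -> "Action":
        # XOR check: both set or both unset is invalid.
        if (self.sql_query is None) == (self.submit_answer is None):
            raise ValueError("Provide exactly one of sql_query or submit_answer.")
        return self
```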
## Observation Space
The agent receives an Observation object containing four fields:
| Field | Type | Description |
|---|---|---|
| `schema_info` | `str` | Database schema information (tables, columns, types) |
| `current_question` | `str` | The business question the agent must answer |
| `last_query_result` | `str` | Result from the most recent SQL query (markdown table format) |
| `error_message` | `str` | Any error from the last action (empty string if none) |
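The corresponding model is straightforward; a sketch matching the table above (field defaults are an assumption):

```python
from pydantic import BaseModel

class Observation(BaseModel):
    schema_info: str
    current_question: str
    last_query_result: str = ""  # markdown table from the last query
    error_message: str = ""      # empty string when the last action succeeded
```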
## Reward Shaping
The environment implements precise partial reward signals to guide learning:
| Event | Reward | Episode Ends? |
|---|---|---|
| Successful SQL query (no errors) | +0.1 | No |
| SQLite syntax error | -0.1 | No |
| Destructive action detected | -1.0 | Yes |
| Step count >= 15 (infinite loop shield) | -0.5 | Yes |
| Correct answer submitted | +1.0 | Yes |
| Incorrect answer submitted | 0.0 | Yes |
**Final Score Calculation:**
- If incorrect: `score = 0.0`
- If correct: `score = 0.7 + (1 - steps/15) * 0.3`
- Score range: `0.0` to `1.0`
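As a quick sanity check, this formula reproduces the example log at the end of this README: a correct answer after 2 steps scores `0.7 + (1 - 2/15) * 0.3 = 0.96`.

```python
def final_score(correct: bool, steps: int, max_steps: int = 15) -> float:
    """Final score per the shaping rules above."""
    if not correct:
        return 0.0
    return 0.7 + (1 - steps / max_steps) * 0.3

assert abs(final_score(True, 2) - 0.96) < 1e-9
```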
## Task Descriptions
The environment includes 3 deterministic tasks of increasing difficulty:
### Easy: User Count
| Attribute | Value |
|---|---|
| Task ID | `easy_user_count` |
| Difficulty | Easy |
| Question | "How many users are registered in the system? Provide the total count as a single number." |
| Ground Truth | 15 |
| SQL Complexity | Single-table COUNT query |
| Reference SQL | `SELECT COUNT(*) FROM users` |
### Medium: USA Revenue
| Attribute | Value |
|---|---|
| Task ID | `medium_usa_revenue` |
| Difficulty | Medium |
| Question | "What is the total revenue (sum of total_amount) from purchases made by users in the USA? Provide the total as a number (rounded to 2 decimal places if needed)." |
| Ground Truth | 2423.87 |
| SQL Complexity | Two-table JOIN with SUM aggregation filtered by country |
| Reference SQL | `SELECT ROUND(SUM(p.total_amount), 2) FROM purchases p JOIN users u ON p.user_id = u.user_id WHERE u.country = 'USA'` |
### Hard: Top Spender
| Attribute | Value |
|---|---|
| Task ID | `hard_top_spender` |
| Difficulty | Hard |
| Question | "Who is the top spender (user with highest total purchase amount)? Provide the username of the user who spent the most money in total." |
| Ground Truth | alice |
| SQL Complexity | Complex query with JOIN, GROUP BY, ORDER BY, and LIMIT |
| Reference SQL | `SELECT u.username FROM users u JOIN purchases p ON u.user_id = p.user_id GROUP BY u.user_id, u.username ORDER BY SUM(p.total_amount) DESC LIMIT 1` |
## Grading System
All graders implement:
- Type-agnostic normalization: whitespace trimming, lowercasing, and numeric rounding to 2 decimal places
- Numeric tolerance: answers within 0.01 absolute tolerance count as exact matches
- Partial credit: numeric answers within 10% of the ground truth receive a 0.5 score (see the sketch below)
- SQL evaluation: if the agent submits SQL as its answer, the query is executed and its result is compared to the ground truth
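A minimal sketch of the numeric-tolerance logic described above (names and control flow are illustrative; `graders.py` is the source of truth):

```python
def grade_numeric(answer: str, truth: float) -> float:
    """Normalize a free-text numeric answer and score it against the ground truth."""
    try:
        value = round(float(answer.strip()), 2)
    except ValueError:
        return 0.0
    if abs(value - truth) <= 0.01:  # within absolute tolerance: exact match
        return 1.0
    if truth and abs(value - truth) / abs(truth) <= 0.10:  # within 10%: partial credit
        return 0.5
    return 0.0
```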
## Setup and Usage Instructions
### Prerequisites
- Docker installed and running
- Python 3.10+ (for local development)
- (Optional) HuggingFace token for inference with HF-hosted models
### Quick Start with Docker
```bash
# Clone the repository
git clone https://github.com/hitanshu04/openenv-sql-analyst.git
cd openenv_sql_analyst

# Build the Docker image
docker build -t openenv-sql-analyst .

# Run the container
docker run -p 7860:7860 openenv-sql-analyst
```

The server will be available at `http://localhost:7860`.
### API Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Health check (returns 200 OK) |
| `/reset` | POST | Reset environment, returns initial observation |
| `/step` | POST | Execute action, returns (observation, reward, done, info) |
| `/state` | GET | Get current internal state |
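A quick smoke test against a running container (the exact request and response payload shapes are assumptions; see `server/app.py` for the authoritative schemas):

```python
import requests

BASE = "http://localhost:7860"

obs = requests.post(f"{BASE}/reset").json()
print(obs)  # initial observation: schema_info, current_question, ...

result = requests.post(
    f"{BASE}/step", json={"sql_query": "SELECT COUNT(*) FROM users"}
).json()
print(result)  # observation, reward, done, info
```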
### Local Development (Without Docker)
```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Run the server directly
python -m server.app

# Or run validation
chmod +x validate.sh
./validate.sh
```
### Running Inference
```bash
# Set environment variables
export HF_TOKEN="your-huggingface-token"
export API_BASE_URL="https://api.openai.com/v1"  # or HF inference endpoint
export MODEL_NAME="gpt-4o-mini"

# Run inference
python inference.py
```
### Environment Variables
| Variable | Description | Default |
|---|---|---|
| `HF_TOKEN` | HuggingFace API token (used as API key) | Required for inference |
| `API_BASE_URL` | OpenAI-compatible API endpoint | `https://api.openai.com/v1` |
| `MODEL_NAME` | Model identifier | `gpt-4o-mini` |
## Validation Gates
Run `./validate.sh` before submission. All 4 checks must pass:
| Step | Check | Failure Condition |
|---|---|---|
| 1/4 | Prerequisites | docker or openenv CLI not found |
| 2/4 | Docker Build | Dockerfile missing or build fails |
| 3/4 | OpenEnv Spec | openenv validate fails (yaml/models mismatch) |
| 4/4 | Inference Logs | Missing [START]/[STEP]/[END] tags or invalid score |
## Baseline Scores
Expected performance with `gpt-4o-mini`:
| Task | Difficulty | Expected Steps | Expected Score |
|---|---|---|---|
| `easy_user_count` | Easy | 2-3 | 0.90 - 1.00 |
| `medium_usa_revenue` | Medium | 3-5 | 0.85 - 0.95 |
| `hard_top_spender` | Hard | 4-7 | 0.75 - 0.90 |
## STDOUT Log Format
The inference script outputs logs in the exact required format:
```text
[START] task=<task_id> env=sql_analyst model=<model_name>
[STEP] step=<n> action=<action_type>=<value> reward=<r.rr> done=<bool> error=<msg>
[END] success=<bool> steps=<n> score=<s.ss> rewards=<r1>,<r2>,...
```
**Example Output:**
```text
[START] task=easy_user_count env=sql_analyst model=gpt-4o-mini
[STEP] step=1 action=sql_query=SELECT COUNT(*) FROM users reward=0.10 done=false error=null
[STEP] step=2 action=submit_answer=15 reward=1.00 done=true error=null
[END] success=true steps=2 score=0.96 rewards=0.10,1.00
```
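A one-function sketch of emitting a conforming `[STEP]` line (illustrative; `inference.py` is the source of truth for formatting details):

```python
def log_step(step: int, action_type: str, value: str,
             reward: float, done: bool, error: str | None) -> None:
    # Booleans are lowercased and a missing error prints as "null",
    # matching the example output above.
    print(f"[STEP] step={step} action={action_type}={value} "
          f"reward={reward:.2f} done={str(done).lower()} error={error or 'null'}")
```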
## Project Architecture
```text
openenv_sql_analyst/
├── openenv.yaml          # OpenEnv specification (name, schemas, endpoints)
├── Dockerfile            # Container config (python:3.10-slim, port 7860)
├── requirements.txt      # Python dependencies
├── pyproject.toml        # Python project configuration
├── validate.sh           # Pre-submission validation (4 gates)
├── inference.py          # Baseline LLM agent implementation
├── data/
│   └── mock_data.sql     # SQLite mock database (3 tables, ~50 rows)
├── environment/
│   ├── __init__.py       # Package exports
│   ├── models.py         # Pydantic schemas (Action, Observation, Reward)
│   ├── db_engine.py      # SQLite engine with security safeguards
│   ├── tasks.py          # Task definitions (Easy, Medium, Hard)
│   ├── graders.py        # Deterministic grading system
│   └── env.py            # Main SQLAnalystEnv class (reset, step, state)
└── server/
    └── app.py            # FastAPI server (/reset, /step, /state endpoints)
```
## Technical Specifications
| Specification | Value |
|---|---|
| Python Version | 3.10 |
| Container Base | python:3.10-slim |
| Container Port | 7860 |
| vCPU Limit | 2 |
| Memory Limit | 8 GB |
| Max Runtime | 20 minutes |
| Max Steps per Episode | 15 |
| Query Timeout | 2 seconds |
| Max Fetch Rows | 50 |
| Database | SQLite (in-memory) |
## Database Schema
The mock database contains 3 tables:
### users
| Column | Type | Constraints |
|---|---|---|
| user_id | INTEGER | PRIMARY KEY |
| username | TEXT | NOT NULL |
| email | TEXT | NOT NULL |
| country | TEXT | NOT NULL |
| created_at | TEXT | NOT NULL |
### products
| Column | Type | Constraints |
|---|---|---|
| product_id | INTEGER | PRIMARY KEY |
| product_name | TEXT | NOT NULL |
| category | TEXT | NOT NULL |
| price | REAL | NOT NULL |
| stock | INTEGER | NOT NULL |
### purchases
| Column | Type | Constraints |
|---|---|---|
| purchase_id | INTEGER | PRIMARY KEY |
| user_id | INTEGER | NOT NULL, FOREIGN KEY |
| product_id | INTEGER | NOT NULL, FOREIGN KEY |
| quantity | INTEGER | NOT NULL |
| purchase_date | TEXT | NOT NULL |
| total_amount | REAL | NOT NULL |
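For orientation, a DDL sketch consistent with the tables above, loaded into an in-memory database the same way the environment does (the real `data/mock_data.sql` may differ in details such as foreign-key syntax and seed rows):

```python
import sqlite3

DDL = """
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    username   TEXT NOT NULL,
    email      TEXT NOT NULL,
    country    TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE products (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category     TEXT NOT NULL,
    price        REAL NOT NULL,
    stock        INTEGER NOT NULL
);
CREATE TABLE purchases (
    purchase_id   INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL REFERENCES users(user_id),
    product_id    INTEGER NOT NULL REFERENCES products(product_id),
    quantity      INTEGER NOT NULL,
    purchase_date TEXT NOT NULL,
    total_amount  REAL NOT NULL
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```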
## License
MIT License
## Acknowledgments
Built for the Meta x Scaler OpenEnv Hackathon - advancing the frontier of LLM agent evaluation through standardized, production-grade reinforcement learning environments.