	---
	title: OpenEnv SQL Analyst
	emoji: 📊
	colorFrom: blue
	colorTo: green
	sdk: docker
	pinned: false
	tags:
	- openenv
	---

	# SQL Data Analyst RL Environment

	> A production-grade, containerized Reinforcement Learning environment for evaluating LLM-powered Data Analysts on real SQL business intelligence tasks.

	OpenEnv Hackathon Submission \| Meta x Scaler

	---

	## Environment Description and Motivation

	This environment simulates a common enterprise task: an AI agent querying a SQL database to extract business intelligence. Data analysts routinely write ad-hoc SQL queries to answer business questions from stakeholders. The environment provides a standardized benchmark for evaluating whether LLM agents can perform this task safely and accurately on their own, measuring both correctness and efficiency.

	### Why This Matters

	- Real-World Applicability: Data analysis is one of the most common knowledge work tasks that LLMs are being deployed for
	- Safety-Critical: Database access requires strict guardrails to prevent data corruption
	- Measurable Outcomes: Business questions have definitive correct answers, enabling objective evaluation

	### Production-Grade Security

	The environment implements security safeguards that mirror real enterprise database access controls:

	\| Security Layer \| Implementation \| Purpose \|
	\|----------------\|----------------\|---------\|
	\| Mutation Blocker \| Regex-based blocking of `INSERT`, `UPDATE`, `DELETE`, `DROP`, `ALTER`, `TRUNCATE` \| Prevents data corruption \|
	\| OOM Protection \| `cursor.fetchmany(50)` instead of `fetchall()` \| Prevents memory exhaustion on large result sets \|
	\| Query Timeout \| 2-second timeout wrapper \| Prevents runaway queries from consuming resources \|
	\| Read-Only Sandbox \| In-memory SQLite (`:memory:` mode) \| Isolated execution environment \|
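The safeguards in the table can be combined in a single guarded query function. The sketch below is illustrative only: `run_guarded_query`, `DestructiveQueryError`, and the use of SQLite's progress handler for the timeout are assumptions, not the code in `environment/db_engine.py`.

```python
import re
import sqlite3
import time

# Hypothetical names and structure; the real db_engine.py may differ.
MUTATION_PATTERN = re.compile(
    r"\b(INSERT|UPDATE|DELETE|DROP|ALTER|TRUNCATE)\b", re.IGNORECASE
)
MAX_ROWS = 50
TIMEOUT_SECONDS = 2.0


class DestructiveQueryError(Exception):
    """Raised when a query contains a blocked mutation keyword."""


def run_guarded_query(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    # Mutation blocker: reject the query before it reaches the database.
    if MUTATION_PATTERN.search(sql):
        raise DestructiveQueryError("mutation keyword detected")

    # Query timeout (assumed mechanism): SQLite calls the handler every
    # 1000 VM instructions; a non-zero return value aborts the statement.
    deadline = time.monotonic() + TIMEOUT_SECONDS

    def past_deadline() -> int:
        return 1 if time.monotonic() > deadline else 0

    conn.set_progress_handler(past_deadline, 1000)
    try:
        cursor = conn.execute(sql)
        # OOM protection: never pull more than MAX_ROWS rows into memory.
        return cursor.fetchmany(MAX_ROWS)
    finally:
        conn.set_progress_handler(None, 0)


# Sandbox: an in-memory database seeded directly (seeding bypasses the guard).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany(
    "INSERT INTO users (username) VALUES (?)", [(f"u{i}",) for i in range(120)]
)

rows = run_guarded_query(conn, "SELECT username FROM users")
print(len(rows))
```

A keyword regex is deliberately conservative: it also rejects harmless queries that merely mention a blocked word, for example inside a string literal.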

	---

	## Action Space

	The agent submits an `Action` object with exactly one of two fields:

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `sql_query` \| `Optional[str]` \| Execute a SQL query against the database \|
	\| `submit_answer` \| `Optional[str]` \| Submit a final answer for grading \|

	Mutual Exclusivity Enforced: A Pydantic `@model_validator` ensures the agent provides exactly one of `sql_query` or `submit_answer`. Providing both or neither raises a `ValueError`.

	```python
	# Example Actions
	action_query = Action(sql_query="SELECT COUNT(*) FROM users")
	action_submit = Action(submit_answer="15")
	```
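For readers unfamiliar with Pydantic validators, the mutual-exclusivity rule might be expressed roughly as follows. This is a minimal sketch assuming Pydantic v2 and `None` defaults for both fields; the validator name `exactly_one_field` and the helper `is_valid` are hypothetical, and the actual model lives in `environment/models.py`.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError, model_validator


class Action(BaseModel):
    """Sketch of the action schema (assumed defaults, not the project's code)."""

    sql_query: Optional[str] = None
    submit_answer: Optional[str] = None

    @model_validator(mode="after")
    def exactly_one_field(self) -> "Action":
        provided = [v for v in (self.sql_query, self.submit_answer) if v is not None]
        if len(provided) != 1:
            raise ValueError("provide exactly one of sql_query or submit_answer")
        return self


def is_valid(**fields) -> bool:
    """Return True if the fields form a valid Action (hypothetical helper)."""
    try:
        Action(**fields)
        return True
    except ValidationError:
        return False


print(is_valid(sql_query="SELECT 1"))
print(is_valid())
print(is_valid(sql_query="SELECT 1", submit_answer="1"))
```

In Pydantic v2 the `ValueError` raised inside the validator surfaces to the caller as a `ValidationError`, which is itself a subclass of `ValueError`.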

	---

	## Observation Space

	The agent receives an `Observation` object containing four fields:

	\| Field \| Type \| Description \|
	\|-------\|------\|-------------\|
	\| `schema_info` \| `str` \| Database schema information (tables, columns, types) \|
	\| `current_question` \| `str` \| The business question the agent must answer \|
	\| `last_query_result` \| `str` \| Result from the most recent SQL query (markdown table format) \|
	\| `error_message` \| `str` \| Any error from the last action (empty string if none) \|
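Since `last_query_result` is a string in markdown table format, the environment has to serialize cursor output somehow. One possible rendering is sketched below; `rows_to_markdown` is a hypothetical helper, and the exact layout produced by the environment may differ.

```python
from typing import Sequence


def rows_to_markdown(columns: Sequence[str], rows: Sequence[Sequence[object]]) -> str:
    """Render column names and rows as a markdown table (assumed layout)."""
    if not columns:
        return ""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = ["| " + " | ".join(str(value) for value in row) + " |" for row in rows]
    return "\n".join([header, divider, *body])


table = rows_to_markdown(["COUNT(*)"], [(15,)])
print(table)
```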

	---

	## Reward Shaping

	The environment implements precise partial reward signals to guide learning:

	\| Event \| Reward \| Episode Ends? \|
	\|-------\|--------\|---------------\|
	\| Successful SQL query (no errors) \| `+0.1` \| No \|
	\| SQLite syntax error \| `-0.1` \| No \|
	\| Destructive action detected \| `-1.0` \| Yes \|
	\| Step count >= 15 (infinite loop shield) \| `-0.5` \| Yes \|
	\| Correct answer submitted \| `+1.0` \| Yes \|
	\| Incorrect answer submitted \| `0.0` \| Yes \|

	Final Score Calculation:
	- If incorrect: `score = 0.0`
	- If correct: `score = 0.7 + (1 - steps/15) * 0.3`, where `steps` is the number of steps taken in the episode, including the final submission
	- Score range: `0.0` to `1.0`
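The reward table and the final score formula translate directly into code. In this sketch the event keys in `REWARDS` and the function name `final_score` are hypothetical; the numbers are taken from the table and formula above.

```python
MAX_STEPS = 15

# (reward, episode_ends) per event, copied from the reward table.
# The event names are illustrative labels, not identifiers from the project.
REWARDS = {
    "query_ok": (0.1, False),
    "syntax_error": (-0.1, False),
    "destructive_action": (-1.0, True),
    "step_limit": (-0.5, True),
    "correct_answer": (1.0, True),
    "incorrect_answer": (0.0, True),
}


def final_score(correct: bool, steps: int) -> float:
    """Apply the final score formula: fewer steps give a higher score."""
    if not correct:
        return 0.0
    return 0.7 + (1 - steps / MAX_STEPS) * 0.3


# One successful query followed by a correct submission: 2 steps.
print(f"{final_score(True, 2):.2f}")  # 0.96
```

Under this formula a correct answer keeps at least 0.7 of the score, and the remaining 0.3 shrinks linearly with the number of steps used.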

	---

	## Task Descriptions

	The environment includes 3 deterministic tasks of increasing difficulty:

	### Easy: User Count
	\| Attribute \| Value \|
	\|-----------\|-------\|
	\| Task ID \| `easy_user_count` \|
	\| Difficulty \| Easy \|
	\| Question \| "How many users are registered in the system? Provide the total count as a single number." \|
	\| Ground Truth \| `15` \|
	\| SQL Complexity \| Single table `COUNT` query \|
	\| Reference SQL \| `SELECT COUNT(*) FROM users` \|

	### Medium: USA Revenue
	\| Attribute \| Value \|
	\|-----------\|-------\|
	\| Task ID \| `medium_usa_revenue` \|
	\| Difficulty \| Medium \|
	\| Question \| "What is the total revenue (sum of total_amount) from purchases made by users in the USA? Provide the total as a number (rounded to 2 decimal places if needed)." \|
	\| Ground Truth \| `2423.87` \|
	\| SQL Complexity \| Two-table `JOIN` with `SUM` aggregation filtered by country \|
	\| Reference SQL \| `SELECT ROUND(SUM(p.total_amount), 2) FROM purchases p JOIN users u ON p.user_id = u.user_id WHERE u.country = 'USA'` \|

	### Hard: Top Spender
	\| Attribute \| Value \|
	\|-----------\|-------\|
	\| Task ID \| `hard_top_spender` \|
	\| Difficulty \| Hard \|
	\| Question \| "Who is the top spender (user with highest total purchase amount)? Provide the username of the user who spent the most money in total." \|
	\| Ground Truth \| `alice` \|
	\| SQL Complexity \| Complex query with `JOIN`, `GROUP BY`, `ORDER BY`, and `LIMIT` \|
	\| Reference SQL \| `SELECT u.username FROM users u JOIN purchases p ON u.user_id = p.user_id GROUP BY u.user_id, u.username ORDER BY SUM(p.total_amount) DESC LIMIT 1` \|

	### Grading System

	All graders implement:
	- Type-agnostic normalization: Whitespace trimming, lowercasing, and numeric rounding to 2 decimal places
	- Numeric tolerance: Answers within an absolute tolerance of 0.01 count as exact matches
	- Partial credit: Numeric answers within 10% of the ground truth receive a score of 0.5
	- SQL evaluation: If the agent submits SQL as its answer, the query is executed and its result is compared with the ground truth
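The first three grading rules can be sketched as below. This is an illustration under stated assumptions, not the code in `environment/graders.py`: the function names are hypothetical, the 10% band is taken relative to the ground truth, and SQL-as-answer evaluation is left out.

```python
def normalize(answer: str) -> str:
    """Trim, lowercase, and render numeric strings with 2 decimal places."""
    text = answer.strip().lower()
    try:
        return f"{round(float(text), 2):.2f}"
    except ValueError:
        return text


def grade(submitted: str, truth: str) -> float:
    """Return 1.0 for a match, 0.5 for a near-miss number, else 0.0."""
    sub, ref = normalize(submitted), normalize(truth)
    if sub == ref:
        return 1.0
    try:
        sub_num, ref_num = float(sub), float(ref)
    except ValueError:
        return 0.0  # non-numeric answers must match exactly
    diff = abs(sub_num - ref_num)
    if diff <= 0.01:
        return 1.0  # inside the absolute tolerance
    if ref_num != 0 and diff / abs(ref_num) <= 0.10:
        return 0.5  # partial credit for numbers within 10%
    return 0.0


print(grade("  Alice ", "alice"))    # 1.0
print(grade("2423.870", "2423.87"))  # 1.0
print(grade("2500", "2423.87"))      # 0.5
print(grade("bob", "alice"))         # 0.0
```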

	---

	## Setup and Usage Instructions

	### Prerequisites

	- Docker installed and running
	- Python 3.10+ (for local development)
	- (Optional) HuggingFace token for inference with HF-hosted models

	### Quick Start with Docker

	```bash
	# Clone the repository
	git clone https://github.com/hitanshu04/openenv-sql-analyst.git
	cd openenv-sql-analyst

	# Build the Docker image
	docker build -t openenv-sql-analyst .

	# Run the container
	docker run -p 7860:7860 openenv-sql-analyst
	```

	The server is then available at `http://localhost:7860`.

	### API Endpoints

	\| Endpoint \| Method \| Description \|
	\|----------\|--------\|-------------\|
	\| `/` \| GET \| Health check (returns 200 OK) \|
	\| `/reset` \| POST \| Reset environment, returns initial observation \|
	\| `/step` \| POST \| Execute action, returns (observation, reward, done, info) \|
	\| `/state` \| GET \| Get current internal state \|
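A client can drive these endpoints with the standard library alone. The sketch below assumes that `/step` accepts the `Action` fields as a flat JSON body and that `/reset` accepts an empty JSON object; neither payload shape is confirmed here, so check `openenv.yaml` or `server/app.py` before relying on it. `build_post` and `call` are hypothetical helpers.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"


def build_post(path: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for an endpoint."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(request: urllib.request.Request) -> dict:
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))


# Assumed payload shapes; adjust to the server's actual request schema.
reset_request = build_post("/reset", {})
step_request = build_post("/step", {"sql_query": "SELECT COUNT(*) FROM users"})

# With the container running, the requests can be sent like this:
# observation = call(reset_request)
# result = call(step_request)
```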

	### Local Development (Without Docker)

	```bash
	# Create virtual environment
	python -m venv venv
	source venv/bin/activate # On Windows: venv\Scripts\activate

	# Install dependencies
	pip install -r requirements.txt

	# Run the server directly
	python -m server.app

	# Or run validation
	chmod +x validate.sh
	./validate.sh
	```

	### Running Inference

	```bash
	# Set environment variables
	export HF_TOKEN="your-huggingface-token"
	export API_BASE_URL="https://api.openai.com/v1" # or HF inference endpoint
	export MODEL_NAME="gpt-4o-mini"

	# Run inference
	python inference.py
	```

	### Environment Variables

	\| Variable \| Description \| Default \|
	\|----------\|-------------\|---------\|
	\| `HF_TOKEN` \| HuggingFace API token (used as API key) \| Required for inference \|
	\| `API_BASE_URL` \| OpenAI-compatible API endpoint \| `https://api.openai.com/v1` \|
	\| `MODEL_NAME` \| Model identifier \| `gpt-4o-mini` \|

	### Validation Gates

	Run `./validate.sh` before submission. All 4 checks must pass:

	\| Step \| Check \| Failure Condition \|
	\|------\|-------\|-------------------\|
	\| 1/4 \| Prerequisites \| `docker` or `openenv` CLI not found \|
	\| 2/4 \| Docker Build \| `Dockerfile` missing or build fails \|
	\| 3/4 \| OpenEnv Spec \| `openenv validate` fails (yaml/models mismatch) \|
	\| 4/4 \| Inference Logs \| Missing `[START]`/`[STEP]`/`[END]` tags or invalid score \|

	---

	## Baseline Scores

	Expected performance with `gpt-4o-mini`:

	\| Task \| Difficulty \| Expected Steps \| Expected Score \|
	\|------\|------------\|----------------\|----------------\|
	\| `easy_user_count` \| Easy \| 2-3 \| 0.90 - 1.00 \|
	\| `medium_usa_revenue` \| Medium \| 3-5 \| 0.85 - 0.95 \|
	\| `hard_top_spender` \| Hard \| 4-7 \| 0.75 - 0.90 \|

	### STDOUT Log Format

	The inference script outputs logs in the exact required format:

	```
	[START] task=<task_id> env=sql_analyst model=<model_name>
	[STEP] step=<n> action=<action_type>=<value> reward=<r.rr> done=<bool> error=<msg>
	[END] success=<bool> steps=<n> score=<s.ss> rewards=<r1>,<r2>,...
	```

	Example Output:
	```
	[START] task=easy_user_count env=sql_analyst model=gpt-4o-mini
	[STEP] step=1 action=sql_query=SELECT COUNT(*) FROM users reward=0.10 done=false error=null
	[STEP] step=2 action=submit_answer=15 reward=1.00 done=true error=null
	[END] success=true steps=2 score=0.96 rewards=0.10,1.00
	```
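Because the validation gate checks these tags, it helps to generate the lines from one place. The formatter below is a sketch that reproduces the example output above; the function names are hypothetical and are not taken from `inference.py`.

```python
from typing import Optional, Sequence


def format_start(task: str, model: str, env: str = "sql_analyst") -> str:
    return f"[START] task={task} env={env} model={model}"


def format_step(
    step: int,
    action_type: str,
    value: str,
    reward: float,
    done: bool,
    error: Optional[str] = None,
) -> str:
    # Booleans are lowercased and a missing error is printed as "null".
    return (
        f"[STEP] step={step} action={action_type}={value} "
        f"reward={reward:.2f} done={str(done).lower()} error={error or 'null'}"
    )


def format_end(success: bool, steps: int, score: float, rewards: Sequence[float]) -> str:
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return (
        f"[END] success={str(success).lower()} steps={steps} "
        f"score={score:.2f} rewards={joined}"
    )


print(format_start("easy_user_count", "gpt-4o-mini"))
print(format_step(1, "sql_query", "SELECT COUNT(*) FROM users", 0.1, False))
print(format_step(2, "submit_answer", "15", 1.0, True))
print(format_end(True, 2, 0.96, [0.1, 1.0]))
```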

	---

	## Project Architecture

	```
	openenv_sql_analyst/
	├── openenv.yaml          # OpenEnv specification (name, schemas, endpoints)
	├── Dockerfile            # Container config (python:3.10-slim, port 7860)
	├── requirements.txt      # Python dependencies
	├── pyproject.toml        # Python project configuration
	├── validate.sh           # Pre-submission validation (4 gates)
	├── inference.py          # Baseline LLM agent implementation
	├── data/
	│   └── mock_data.sql     # SQLite mock database (3 tables, ~50 rows)
	├── environment/
	│   ├── __init__.py       # Package exports
	│   ├── models.py         # Pydantic schemas (Action, Observation, Reward)
	│   ├── db_engine.py      # SQLite engine with security safeguards
	│   ├── tasks.py          # Task definitions (Easy, Medium, Hard)
	│   ├── graders.py        # Deterministic grading system
	│   └── env.py            # Main SQLAnalystEnv class (reset, step, state)
	└── server/
	    └── app.py            # FastAPI server (/reset, /step, /state endpoints)
	```

	---

	## Technical Specifications

	\| Specification \| Value \|
	\|---------------\|-------\|
	\| Python Version \| 3.10 \|
	\| Container Base \| `python:3.10-slim` \|
	\| Container Port \| 7860 \|
	\| vCPU Limit \| 2 \|
	\| Memory Limit \| 8 GB \|
	\| Max Runtime \| 20 minutes \|
	\| Max Steps per Episode \| 15 \|
	\| Query Timeout \| 2 seconds \|
	\| Max Fetch Rows \| 50 \|
	\| Database \| SQLite (in-memory) \|

	---

	## Database Schema

	The mock database contains 3 tables:

	### users
	\| Column \| Type \| Constraints \|
	\|--------\|------\|-------------\|
	\| user_id \| INTEGER \| PRIMARY KEY \|
	\| username \| TEXT \| NOT NULL \|
	\| email \| TEXT \| NOT NULL \|
	\| country \| TEXT \| NOT NULL \|
	\| created_at \| TEXT \| NOT NULL \|

	### products
	\| Column \| Type \| Constraints \|
	\|--------\|------\|-------------\|
	\| product_id \| INTEGER \| PRIMARY KEY \|
	\| product_name \| TEXT \| NOT NULL \|
	\| category \| TEXT \| NOT NULL \|
	\| price \| REAL \| NOT NULL \|
	\| stock \| INTEGER \| NOT NULL \|

	### purchases
	\| Column \| Type \| Constraints \|
	\|--------\|------\|-------------\|
	\| purchase_id \| INTEGER \| PRIMARY KEY \|
	\| user_id \| INTEGER \| NOT NULL, FOREIGN KEY \|
	\| product_id \| INTEGER \| NOT NULL, FOREIGN KEY \|
	\| quantity \| INTEGER \| NOT NULL \|
	\| purchase_date \| TEXT \| NOT NULL \|
	\| total_amount \| REAL \| NOT NULL \|
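To experiment with the schema outside the container, the three tables can be recreated in an in-memory SQLite database. The DDL below follows the column tables above; the foreign key targets are assumed from the column names, and the inserted rows are hypothetical demo data, not the contents of `data/mock_data.sql`.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    username   TEXT NOT NULL,
    email      TEXT NOT NULL,
    country    TEXT NOT NULL,
    created_at TEXT NOT NULL
);
CREATE TABLE products (
    product_id   INTEGER PRIMARY KEY,
    product_name TEXT NOT NULL,
    category     TEXT NOT NULL,
    price        REAL NOT NULL,
    stock        INTEGER NOT NULL
);
CREATE TABLE purchases (
    purchase_id   INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL,
    product_id    INTEGER NOT NULL,
    quantity      INTEGER NOT NULL,
    purchase_date TEXT NOT NULL,
    total_amount  REAL NOT NULL,
    FOREIGN KEY (user_id) REFERENCES users(user_id),          -- assumed target
    FOREIGN KEY (product_id) REFERENCES products(product_id)  -- assumed target
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Hypothetical demo rows, for illustration only.
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?, ?, ?)",
    [
        (1, "demo_a", "a@example.com", "USA", "2024-01-01"),
        (2, "demo_b", "b@example.com", "India", "2024-01-02"),
    ],
)
conn.execute("INSERT INTO products VALUES (1, 'widget', 'tools', 10.0, 100)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, 1, 2, "2024-02-01", 20.0),
        (2, 2, 1, 5, "2024-02-02", 50.0),
        (3, 1, 1, 1, "2024-02-03", 10.0),
    ],
)

# Reference SQL from the Medium and Hard tasks, run against the demo rows.
usa_revenue = conn.execute(
    "SELECT ROUND(SUM(p.total_amount), 2) FROM purchases p "
    "JOIN users u ON p.user_id = u.user_id WHERE u.country = 'USA'"
).fetchone()[0]
top_spender = conn.execute(
    "SELECT u.username FROM users u JOIN purchases p ON u.user_id = p.user_id "
    "GROUP BY u.user_id, u.username ORDER BY SUM(p.total_amount) DESC LIMIT 1"
).fetchone()[0]

print(usa_revenue, top_spender)
```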

	---

	## License

	MIT License

	---

	## Acknowledgments

	Built for the Meta x Scaler OpenEnv Hackathon - advancing the frontier of LLM agent evaluation through standardized, production-grade reinforcement learning environments.