
DataInsightEnv: Data Analysis Challenge Environment

Conceptual Design Document for OpenEnv Challenge


1. Executive Summary

DataInsightEnv is a reinforcement learning environment where an agent acts as a data analyst: given a dataset and a question, it must explore, analyze, and extract insights through iterative code execution. Unlike SQL-only environments, DataInsightEnv encompasses the full data analysis workflow: data loading, exploration, transformation, statistical analysis, and insight synthesis.

Why this matters: While coding environments exist (the Coding Environment Server in OpenEnv) and SQL benchmarks exist (Spider), no environment captures the end-to-end data analysis process where an agent must decide WHAT to analyze, not just HOW to query.


2. Learning from Existing Environments

2.1 How Reviewer Two Handles Open-Ended Evaluation

The "Reviewer Two" environment faces a challenge similar to ours: research plans are subjective, yet it achieves automated evaluation. How?

| Reviewer Two Technique | What It Solves | DataInsightEnv Adaptation |
|---|---|---|
| Hidden rubric criteria | Evaluates without agent seeing exact targets | Hidden "insight criteria" the agent must discover |
| Rubric coverage scoring | Multi-factor evaluation via semantic similarity | Insight coverage: did the agent find key patterns? |
| Format scoring | Rewards structure, not just content | Rewards well-organized analysis (steps, reasoning) |
| Progressive hint disclosure | Guides agent toward criteria | After failed attempts, reveal "the answer involves X column" |
| Compliance penalties | Punishes ignoring revealed hints | If a hint is revealed, penalize not addressing it |

Key insight: Reviewer Two shows that even open-ended tasks (research plans) can be automatically evaluated through rubric coverage + format checks + semantic similarity. DataInsightEnv can use the same pattern for "insight coverage."

2.2 How InfiAgent-DABench Evaluates Data Analysis

The InfiAgent-DABench benchmark specifically targets data analysis with LLM agents:

| DABench Approach | What It Does | Why It Works |
|---|---|---|
| Format prompting | Converts open questions to closed-form answers | "What's the correlation?" → "Answer as a decimal between -1 and 1" |
| Sandboxed execution | Runs agent code in isolated environment | Security + reproducibility |
| Multi-step tasks | Some questions require chained reasoning | Tests planning, not just single operations |
| Diverse CSV files | 52 different datasets | Prevents memorization |

Key insight: The "format prompting" technique is crucial. We don't ask "Analyze the data" (unevaluable). We ask "Which month has the highest sales? Answer as an abbreviated month name (Jan, Feb, etc.)." Now we can verify.

2.3 How SWE-bench Creates Ground Truth from Natural Artifacts

SWE-bench doesn't synthetically generate problems; it mines real GitHub issues:

| SWE-bench Pattern | SQLEnv | DataInsightEnv |
|---|---|---|
| Source: GitHub PRs | Source: Spider questions | Source: Kaggle notebooks / data science tutorials |
| Ground truth: merged patch | Ground truth: gold SQL | Ground truth: known insight from notebook |
| Verification: run tests | Verification: compare results | Verification: check if insight matches |

Key insight: Real data science workflows (Kaggle notebooks, tutorials) already contain question-answer pairs. "In this analysis of Titanic data, we discovered that..." gives us ground truth insights to verify against.

2.4 How CodeBenchGen Scales Test Generation

CodeBenchGen uses LLMs to automatically generate test cases from arbitrary code:

| CodeBenchGen Step | What It Does | DataInsightEnv Parallel |
|---|---|---|
| Find code in wild | Scrape GitHub for functions | Scrape Kaggle for analysis notebooks |
| LLM sandboxing | Isolate function, generate tests | Extract question + expected answer |
| Human validation | Verify solvability (81.3% pass) | Expert review of question clarity |

Key insight: We can scale test case generation by having an LLM extract question-answer pairs from existing notebooks, then human-verify a subset.


3. DataInsightEnv Core Design

3.1 The Environment Loop

┌──────────────────────────────────────────────────────────┐
│                     EPISODE START                        │
│  Agent receives:                                         │
│    • Dataset description (name, columns, sample rows)    │
│    • Question to answer                                  │
│    • Answer format specification                         │
└──────────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────┐
│                    ANALYSIS PHASE                        │
│  Agent iteratively executes Python code:                 │
│    • Load and inspect data (pandas)                      │
│    • Transform and clean (fillna, merge, etc.)           │
│    • Compute statistics (groupby, corr, etc.)            │
│    • Optionally visualize (output captured)              │
│  Each code execution returns output + errors             │
└──────────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────┐
│                    SUBMISSION PHASE                      │
│  Agent submits: INSIGHT in specified format              │
│  Format enforces verifiability:                          │
│    • Numeric: "42.5"                                     │
│    • Categorical: "March"                                │
│    • List: "['A', 'B', 'C']"                             │
│    • Boolean: "True"                                     │
└──────────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────┐
│                     VERIFICATION                         │
│  Compare to gold answer with appropriate matcher:        │
│    • Numeric: within tolerance (±1% or ±0.01)            │
│    • String: case-insensitive exact match                │
│    • List: set equality (order-independent)              │
│    • Boolean: exact match                                │
└──────────────────────────────────────────────────────────┘
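The type-specific matchers in the verification step can be sketched as a single dispatch function. This is a minimal sketch; the function name and the exact tolerance defaults are illustrative assumptions, not a fixed API.

```python
# Sketch of the type-specific answer matchers described above.
# verify_answer and the tolerance defaults are assumptions for
# illustration, not the environment's actual API.
import ast

def verify_answer(submitted: str, gold, answer_type: str,
                  rel_tol: float = 0.01, abs_tol: float = 0.01) -> bool:
    """Compare a submitted answer string against the gold answer."""
    if answer_type == "numeric":
        try:
            value = float(submitted)
        except ValueError:
            return False
        # Within ±1% relative or ±0.01 absolute tolerance
        return abs(value - float(gold)) <= max(rel_tol * abs(float(gold)), abs_tol)
    if answer_type == "string":
        # Case-insensitive exact match
        return submitted.strip().lower() == str(gold).strip().lower()
    if answer_type == "list":
        try:
            items = ast.literal_eval(submitted)
        except (ValueError, SyntaxError):
            return False
        return set(items) == set(gold)  # order-independent set equality
    if answer_type == "boolean":
        return submitted.strip() == str(gold)  # exact match on "True"/"False"
    raise ValueError(f"unknown answer type: {answer_type}")
```

Parsing the list answer with `ast.literal_eval` (rather than `eval`) keeps the matcher safe even though the submitted string comes from the agent.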

3.2 State (Observation)

The agent observes:

| Component | Description | Example |
|---|---|---|
| dataset_name | Identifier for the dataset | "titanic_passengers" |
| columns | List of column names and types | [("age", "float"), ("survived", "int"), ...] |
| sample_rows | First 5 rows as preview | [{"age": 22, "survived": 1, ...}, ...] |
| dataset_stats | Basic statistics (nulls, uniques) | {"age": {"nulls": 177, "mean": 29.7}} |
| question | The analytical question | "What is the survival rate for passengers over 50?" |
| answer_format | Expected format | "Answer as percentage with 1 decimal (e.g., 45.2)" |
| code_history | Previous code and outputs | [{"code": "df.head()", "output": "..."}] |
| step_count | Current step | 4 |
| error | Last execution error, if any | "KeyError: 'Survived'" |
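The observation above can be written down as a simple data structure. A dataclass sketch follows; the real environment would use type-safe Pydantic models (per the deliverables checklist), and the field defaults here are assumptions.

```python
# Minimal sketch of the observation in section 3.2 as a dataclass.
# Field names follow the table above; the actual implementation would
# use Pydantic models, so treat this shape as illustrative.
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class DataInsightObservation:
    dataset_name: str
    columns: list                      # (name, dtype) pairs
    sample_rows: list                  # first 5 rows as dicts
    dataset_stats: dict                # per-column null counts, means, ...
    question: str
    answer_format: str
    code_history: list = field(default_factory=list)
    step_count: int = 0
    error: Optional[str] = None        # last execution error, if any

obs = DataInsightObservation(
    dataset_name="titanic_passengers",
    columns=[("age", "float"), ("survived", "int")],
    sample_rows=[{"age": 22, "survived": 1}],
    dataset_stats={"age": {"nulls": 177, "mean": 29.7}},
    question="What is the survival rate for passengers over 50?",
    answer_format="Answer as percentage with 1 decimal (e.g., 45.2)",
)
```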

3.3 Actions

The agent chooses from:

| Action | Parameters | Effect | Cost |
|---|---|---|---|
| EXECUTE | python_code | Runs code in sandbox, returns stdout/result | 1 step |
| INSPECT | column_name | Returns detailed stats for that column | 1 step |
| SUBMIT | answer_value | Submits final answer, ends episode | 0 steps |

Code execution environment includes:

  • pandas, numpy, scipy, sklearn (for stats)
  • matplotlib, seaborn (plots saved as images)
  • Dataset pre-loaded as df

3.4 Reward Structure

Multi-factor reward following Reviewer Two's pattern:

| Factor | Weight | How Computed |
|---|---|---|
| Correctness | 60% | Does the submitted answer match gold? |
| Efficiency | 20% | (max_steps - steps_used) / max_steps |
| Code Quality | 10% | No errors, clean pandas usage |
| Reasoning Trace | 10% | Did the agent show its work? (intermediate outputs) |

Progressive penalties (adapted from Reviewer Two):

  • First 3 code executions: Free exploration
  • Executions 4+: -0.02 Γ— (execution_number - 3)

This encourages focused analysis over trial-and-error spam.
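The weighted reward plus the progressive penalty can be combined in one function. This is a sketch: the weights and the free-exploration threshold come from the tables above, but the function name and the assumption that the code-quality and reasoning scores arrive as values in [0, 1] are illustrative.

```python
# Sketch of the multi-factor reward in section 3.4. Weights match the
# table; code_quality and reasoning are assumed to be scores in [0, 1].
def compute_reward(correct: bool, steps_used: int, max_steps: int,
                   code_quality: float, reasoning: float,
                   executions: int) -> float:
    efficiency = (max_steps - steps_used) / max_steps
    reward = (0.6 * (1.0 if correct else 0.0)   # correctness, 60%
              + 0.2 * efficiency                # efficiency, 20%
              + 0.1 * code_quality              # code quality, 10%
              + 0.1 * reasoning)                # reasoning trace, 10%
    # First 3 executions free, then -0.02 * (n - 3) for each execution n >= 4
    penalty = sum(0.02 * (n - 3) for n in range(4, executions + 1))
    return reward - penalty
```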


4. Test Case Generation Strategy

4.1 Source: Mining Existing Data Analysis

We don't write questions from scratch. We extract from real analyses:

| Source | What We Extract | Quality Control |
|---|---|---|
| Kaggle Notebooks | Popular notebooks have known insights | Filter to notebooks with 50+ upvotes |
| Data Science Tutorials | Step-by-step analyses with answers | Verify answers are still correct |
| Textbook Exercises | Structured questions with solutions | Already verified |
| Academic Datasets | Well-documented datasets (UCI ML) | Established ground truths |

4.2 Question Types (Difficulty Curriculum)

| Level | Question Type | Example | Skill Tested |
|---|---|---|---|
| Easy | Single aggregation | "What is the average price?" | Basic pandas |
| Easy | Filtering | "How many items cost over $100?" | Boolean indexing |
| Medium | Grouped aggregation | "Which category has the highest avg rating?" | groupby + agg |
| Medium | Correlation | "Is price correlated with reviews?" | Statistical thinking |
| Medium | Missing data | "What % of ages are missing?" | Data quality awareness |
| Hard | Multi-step | "For users who bought X, what % also bought Y?" | Chained reasoning |
| Hard | Outlier detection | "How many values are >3 std from mean?" | Statistical methods |
| Hard | Temporal patterns | "Which month shows highest year-over-year growth?" | Time series |

4.3 Format-Prompted Answers

Following DABench, every question has a format specification:

| Question | Answer Format | Gold Answer | Verification |
|---|---|---|---|
| "Average order value?" | Float, 2 decimals | 47.23 | Within ±0.01 |
| "Month with most sales?" | Abbreviated (Jan/Feb/...) | "Mar" | Exact match |
| "Top 3 products by revenue?" | List of product names | ["A", "B", "C"] | Set equality |
| "Is there correlation?" | Yes or No | "Yes" | Exact match |
| "Correlation coefficient?" | Float, 2 decimals, range [-1,1] | 0.73 | Within ±0.02 |

4.4 Preventing Overfitting

| Strategy | Implementation |
|---|---|
| Dataset variation | Same question type on different datasets |
| Column renaming | Same data, different column names |
| Data perturbation | Slightly different values, same patterns |
| Question paraphrasing | "Average X?" vs "Mean of X?" vs "What's the typical X?" |
| Hidden test set | 70% public dev, 30% hidden eval |
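Two of the transforms above (column renaming and data perturbation) can be sketched directly in pandas. The function names and the 1% noise scale are assumptions for illustration.

```python
# Sketch of two anti-overfitting transforms from the table above.
# Function names and the noise scale are illustrative assumptions.
import numpy as np
import pandas as pd

def rename_columns(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Same data, different column names."""
    return df.rename(columns=mapping)

def perturb_numeric(df: pd.DataFrame, scale: float = 0.01,
                    seed: int = 0) -> pd.DataFrame:
    """Add small relative noise to numeric columns, preserving patterns."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        noise = rng.normal(0.0, scale, size=len(out))
        out[col] = out[col] * (1.0 + noise)
    return out

base = pd.DataFrame({"price": [10.0, 20.0, 30.0], "name": ["a", "b", "c"]})
variant = rename_columns(perturb_numeric(base), {"price": "cost"})
```

Because the noise is small and relative, a question like "which item has the highest price?" has the same answer on the variant, while memorized exact values no longer match.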

5. Comparison: SQLEnv vs DataInsightEnv

| Aspect | SQLEnv | DataInsightEnv |
|---|---|---|
| Scope | Database querying only | Full analysis workflow |
| Language | SQL | Python (pandas) |
| Data Source | Relational database | CSV/DataFrame |
| Skill Focus | Query formulation | End-to-end analysis |
| Complexity | Lower | Higher |
| Implementation Time | 2-3 weeks | 3-4 weeks |
| Novelty | Highest (no SQL RL environment exists) | High (no analysis RL environment exists) |

Recommendation: Start with SQLEnv (simpler), then extend to DataInsightEnv as a "level 2" environment.


6. Why DataInsightEnv Is Novel

6.1 Gap Analysis

| Existing | What It Does | What It Lacks |
|---|---|---|
| Coding Environment (OpenEnv) | General Python execution | No data analysis focus, no insight verification |
| InfiAgent-DABench | Static data analysis benchmark | Not an RL environment, no multi-turn |
| Spider/BIRD | SQL benchmarks | No Python, no full analysis workflow |
| Kaggle competitions | Human data science | No RL formulation |

DataInsightEnv fills the gap: First RL environment for end-to-end data analysis with iterative code execution.

6.2 Real-World Value

This matters because:

  1. Universal skill: Every knowledge worker analyzes data
  2. Teachable workflow: Exploration → Hypothesis → Test → Conclude
  3. Verifiable: Format-prompted answers enable automatic evaluation
  4. GRPO-trainable: Clear reward signal for reasoning chains

7. Storytelling Angle

The narrative: "We built an environment that teaches AI to be a data scientist."

A data scientist doesn't just run queries. They:

  1. Understand the question (what are we trying to find?)
  2. Explore the data (what do we have to work with?)
  3. Form hypotheses (I think X relates to Y)
  4. Test hypotheses (let me compute the correlation)
  5. Validate findings (is this statistically significant?)
  6. Communicate results (the answer is Z)

DataInsightEnv captures this workflow. We can demonstrate:

  • Before GRPO: Agent randomly executes code, misses the insight
  • After GRPO: Agent systematically explores, forms hypotheses, finds answer

Compelling blog angle:

"We gave an AI the same dataset used in a famous Kaggle competition. Could it discover the same insights that won human data scientists $100k? Let's find out..."


8. Implementation Complexity Assessment

| Component | Complexity | Rationale |
|---|---|---|
| Python sandbox | Medium | Need secure execution (Docker + resource limits) |
| Pandas/numpy pre-installed | Low | Standard Python environment |
| OpenEnv API wrapper | Low | Standard reset/step/state pattern |
| Code execution capture | Medium | Capture stdout, stderr, return values |
| Answer verification | Medium | Type-specific matchers (numeric, string, list) |
| Dataset sourcing | Medium | Mine from Kaggle, clean and format |
| Question generation | Medium | Extract from notebooks or write manually |
| GRPO training | Medium | Standard Unsloth/TRL pipeline |

Total estimate: 3-4 weeks with AI-assisted development


9. Technical Considerations

9.1 Sandboxing

Code execution requires security:

  • Docker isolation: Agent code runs in ephemeral container
  • Resource limits: CPU time, memory caps
  • Network disabled: No external API calls
  • Filesystem read-only: Agent can only read the dataset

9.2 Output Capture

Need to capture:

  • stdout: Print statements
  • Return value: Last expression result
  • Errors: Tracebacks (informative for agent)
  • Plots: Save matplotlib figures as images (optional observation)

9.3 Dataset Preparation

Each dataset needs:

  • CSV file: The actual data
  • Metadata: Column descriptions, known issues
  • Questions: 5-10 questions per dataset
  • Gold answers: Verified correct answers
  • Answer formats: How to verify each answer

10. Deliverables Checklist

For OpenEnv Challenge submission:

  • DataInsightEnv environment with reset(), step(), state() methods
  • Dockerfile with Python, pandas, numpy, scipy, sklearn, matplotlib
  • openenv.yaml manifest file
  • EnvClient implementation for remote access
  • Type-safe Action, Observation, State Pydantic models
  • Green Agent wrapper (automated evaluation mode)
  • 10+ datasets with 5-10 questions each (50+ total)
  • Answer verification suite (numeric, string, list, boolean)
  • GRPO training script (Unsloth compatible)
  • Baseline vs trained agent comparison
  • HuggingFace Space deployment
  • Blog post with learning curves and demo

11. Extension Ideas (Post-MVP)

| Extension | Description | Value |
|---|---|---|
| Visualization challenges | "Create a chart showing X" | Tests matplotlib skills |
| Report generation | "Write a summary of findings" | Tests communication |
| Multi-dataset joins | "Combine these datasets to answer..." | Tests data integration |
| Hypothesis testing | "Is the difference statistically significant?" | Tests statistical rigor |
| ML model building | "Train a model to predict X" | Extends to ML workflow |

12. Summary

DataInsightEnv extends the SQLEnv concept to the full data analysis workflow. By combining:

  • Reviewer Two's rubric-based evaluation and progressive hints
  • DABench's format-prompted answers for verifiability
  • SWE-bench's natural artifact mining for test cases
  • CodeBenchGen's LLM-assisted test generation at scale

We create an environment where agents learn to be data scientists: exploring, hypothesizing, testing, and concluding through iterative code execution. This captures a universal knowledge work skill with clear real-world value.


13. Relationship to SQLEnv

These environments can be connected:

┌─────────────────┐     ┌─────────────────┐
│     SQLEnv      │     │  DataInsightEnv │
│   (Database)    │────▶│   (Analysis)    │
│                 │     │                 │
│ - Schema explore│     │ - Data explore  │
│ - SQL queries   │     │ - Python code   │
│ - Result verify │     │ - Insight verify│
└─────────────────┘     └─────────────────┘
        │                       │
        └───────────┬───────────┘
                    │
                    ▼
         ┌─────────────────────┐
         │   Combined Suite    │
         │                     │
         │ SQL → Pandas → ML   │
         │ (Progressive skill) │
         └─────────────────────┘

Submission strategy: Build SQLEnv first (simpler, highest novelty), then extend to DataInsightEnv, then combine into a comprehensive "Data Skills" environment suite.