
DataInsightEnv: Data Analysis Challenge Environment

Conceptual Design Document for OpenEnv Challenge


1. Executive Summary

DataInsightEnv is a reinforcement learning environment where an agent acts as a data analyst: given a dataset and a question, it must explore, analyze, and extract insights through iterative code execution. Unlike SQL-only environments, DataInsightEnv encompasses the full data analysis workflow: data loading, exploration, transformation, statistical analysis, and insight synthesis.

Why this matters: While coding environments exist (the Coding Environment Server in OpenEnv) and SQL benchmarks exist (Spider), no environment captures the end-to-end data analysis process where an agent must decide WHAT to analyze, not just HOW to query.


2. Learning from Existing Environments

2.1 How Reviewer Two Handles Open-Ended Evaluation

The "Reviewer Two" environment faces a challenge similar to ours: research plans are subjective, yet it achieves automated evaluation. How?

| Reviewer Two Technique | What It Solves | DataInsightEnv Adaptation |
|---|---|---|
| Hidden rubric criteria | Evaluates without agent seeing exact targets | Hidden "insight criteria" the agent must discover |
| Rubric coverage scoring | Multi-factor evaluation via semantic similarity | Insight coverage: did the agent find key patterns? |
| Format scoring | Rewards structure, not just content | Rewards well-organized analysis (steps, reasoning) |
| Progressive hint disclosure | Guides agent toward criteria | After failed attempts, reveal "the answer involves X column" |
| Compliance penalties | Punishes ignoring revealed hints | If a hint is revealed, penalize not addressing it |

Key insight: Reviewer Two shows that even open-ended tasks (research plans) can be automatically evaluated through rubric coverage + format checks + semantic similarity. DataInsightEnv can use the same pattern for "insight coverage."

2.2 How InfiAgent-DABench Evaluates Data Analysis

The InfiAgent-DABench benchmark specifically targets data analysis with LLM agents:

| DABench Approach | What It Does | Why It Works |
|---|---|---|
| Format prompting | Converts open questions to closed-form answers | "What's the correlation?" → "Answer as a decimal between -1 and 1" |
| Sandboxed execution | Runs agent code in isolated environment | Security + reproducibility |
| Multi-step tasks | Some questions require chained reasoning | Tests planning, not just single operations |
| Diverse CSV files | 52 different datasets | Prevents memorization |

Key insight: The "format prompting" technique is crucial. We don't ask "Analyze the data" (unevaluable). We ask "Which month has the highest sales? Answer as an abbreviated month name (Jan, Feb, etc.)." Now we can verify.

2.3 How SWE-bench Creates Ground Truth from Natural Artifacts

SWE-bench doesn't synthetically generate problems; it mines real GitHub issues:

| SWE-bench Pattern | SQLEnv | DataInsightEnv |
|---|---|---|
| Source: GitHub PRs | Source: Spider questions | Source: Kaggle notebooks / data science tutorials |
| Ground truth: merged patch | Ground truth: gold SQL | Ground truth: known insight from notebook |
| Verification: run tests | Verification: compare results | Verification: check if insight matches |

Key insight: Real data science workflows (Kaggle notebooks, tutorials) already contain question-answer pairs. "In this analysis of Titanic data, we discovered that..." gives us ground truth insights to verify against.

2.4 How CodeBenchGen Scales Test Generation

CodeBenchGen uses LLMs to automatically generate test cases from arbitrary code:

| CodeBenchGen Step | What It Does | DataInsightEnv Parallel |
|---|---|---|
| Find code in wild | Scrape GitHub for functions | Scrape Kaggle for analysis notebooks |
| LLM sandboxing | Isolate function, generate tests | Extract question + expected answer |
| Human validation | Verify solvability (81.3% pass) | Expert review of question clarity |

Key insight: We can scale test case generation by having an LLM extract question-answer pairs from existing notebooks, then human-verify a subset.


3. DataInsightEnv Core Design

3.1 The Environment Loop

┌──────────────────────────────────────────────────────────┐
│                     EPISODE START                        │
│  Agent receives:                                         │
│    • Dataset description (name, columns, sample rows)    │
│    • Question to answer                                  │
│    • Answer format specification                         │
└──────────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────┐
│                    ANALYSIS PHASE                        │
│  Agent iteratively executes Python code:                 │
│    • Load and inspect data (pandas)                      │
│    • Transform and clean (fillna, merge, etc.)           │
│    • Compute statistics (groupby, corr, etc.)            │
│    • Optionally visualize (output captured)              │
│  Each code execution returns output + errors             │
└──────────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────┐
│                    SUBMISSION PHASE                      │
│  Agent submits: INSIGHT in specified format              │
│  Format enforces verifiability:                          │
│    • Numeric: "42.5"                                     │
│    • Categorical: "March"                                │
│    • List: "['A', 'B', 'C']"                             │
│    • Boolean: "True"                                     │
└──────────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────────┐
│                     VERIFICATION                         │
│  Compare to gold answer with appropriate matcher:        │
│    • Numeric: within tolerance (±1% or ±0.01)            │
│    • String: case-insensitive exact match                │
│    • List: set equality (order-independent)              │
│    • Boolean: exact match                                │
└──────────────────────────────────────────────────────────┘
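The type-specific matchers in the verification step can be sketched as a single dispatch function. This is a minimal sketch; the function name and the exact tolerance defaults are illustrative assumptions, not a fixed API.

```python
# Sketch of the type-specific answer matchers described above.
# verify_answer and the tolerance defaults are assumptions for
# illustration, not the environment's actual API.
import ast

def verify_answer(submitted: str, gold, answer_type: str,
                  rel_tol: float = 0.01, abs_tol: float = 0.01) -> bool:
    """Compare a submitted answer string against the gold answer."""
    if answer_type == "numeric":
        try:
            value = float(submitted)
        except ValueError:
            return False
        # Within ±1% relative or ±0.01 absolute tolerance
        return abs(value - float(gold)) <= max(rel_tol * abs(float(gold)), abs_tol)
    if answer_type == "string":
        # Case-insensitive exact match
        return submitted.strip().lower() == str(gold).strip().lower()
    if answer_type == "list":
        try:
            items = ast.literal_eval(submitted)
        except (ValueError, SyntaxError):
            return False
        return set(items) == set(gold)  # order-independent set equality
    if answer_type == "boolean":
        return submitted.strip() == str(gold)  # exact match on "True"/"False"
    raise ValueError(f"unknown answer type: {answer_type}")
```

Parsing the list answer with `ast.literal_eval` (rather than `eval`) keeps the matcher safe even though the submitted string comes from the agent.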

3.2 State (Observation)

The agent observes:

| Component | Description | Example |
|---|---|---|
| dataset_name | Identifier for the dataset | "titanic_passengers" |
| columns | List of column names and types | [("age", "float"), ("survived", "int"), ...] |
| sample_rows | First 5 rows as preview | [{"age": 22, "survived": 1, ...}, ...] |
| dataset_stats | Basic statistics (nulls, uniques) | {"age": {"nulls": 177, "mean": 29.7}} |
| question | The analytical question | "What is the survival rate for passengers over 50?" |
| answer_format | Expected format | "Answer as percentage with 1 decimal (e.g., 45.2)" |
| code_history | Previous code and outputs | [{"code": "df.head()", "output": "..."}] |
| step_count | Current step | 4 |
| error | Last execution error, if any | "KeyError: 'Survived'" |
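The observation above can be written down as a simple data structure. A dataclass sketch follows; the real environment would use type-safe Pydantic models (per the deliverables checklist), and the field defaults here are assumptions.

```python
# Minimal sketch of the observation in section 3.2 as a dataclass.
# Field names follow the table above; the actual implementation would
# use Pydantic models, so treat this shape as illustrative.
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class DataInsightObservation:
    dataset_name: str
    columns: list                      # (name, dtype) pairs
    sample_rows: list                  # first 5 rows as dicts
    dataset_stats: dict                # per-column null counts, means, ...
    question: str
    answer_format: str
    code_history: list = field(default_factory=list)
    step_count: int = 0
    error: Optional[str] = None        # last execution error, if any

obs = DataInsightObservation(
    dataset_name="titanic_passengers",
    columns=[("age", "float"), ("survived", "int")],
    sample_rows=[{"age": 22, "survived": 1}],
    dataset_stats={"age": {"nulls": 177, "mean": 29.7}},
    question="What is the survival rate for passengers over 50?",
    answer_format="Answer as percentage with 1 decimal (e.g., 45.2)",
)
```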

3.3 Actions

The agent chooses from:

| Action | Parameters | Effect | Cost |
|---|---|---|---|
| EXECUTE | python_code | Runs code in sandbox, returns stdout/result | 1 step |
| INSPECT | column_name | Returns detailed stats for that column | 1 step |
| SUBMIT | answer_value | Submits final answer, ends episode | 0 steps |

Code execution environment includes:

  • pandas, numpy, scipy, sklearn (for stats)
  • matplotlib, seaborn (plots saved as images)
  • Dataset pre-loaded as df

3.4 Reward Structure

Multi-factor reward following Reviewer Two's pattern:

| Factor | Weight | How Computed |
|---|---|---|
| Correctness | 60% | Does the submitted answer match gold? |
| Efficiency | 20% | (max_steps - steps_used) / max_steps |
| Code Quality | 10% | No errors, clean pandas usage |
| Reasoning Trace | 10% | Did the agent show its work? (intermediate outputs) |

Progressive penalties (adapted from Reviewer Two):

  • First 3 code executions: Free exploration
  • Executions 4+: -0.02 Γ— (execution_number - 3)

This encourages focused analysis over trial-and-error spam.
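The weighted reward plus the progressive penalty can be combined in one function. This is a sketch: the weights and the free-exploration threshold come from the tables above, but the function name and the assumption that the code-quality and reasoning scores arrive as values in [0, 1] are illustrative.

```python
# Sketch of the multi-factor reward in section 3.4. Weights match the
# table; code_quality and reasoning are assumed to be scores in [0, 1].
def compute_reward(correct: bool, steps_used: int, max_steps: int,
                   code_quality: float, reasoning: float,
                   executions: int) -> float:
    efficiency = (max_steps - steps_used) / max_steps
    reward = (0.6 * (1.0 if correct else 0.0)   # correctness, 60%
              + 0.2 * efficiency                # efficiency, 20%
              + 0.1 * code_quality              # code quality, 10%
              + 0.1 * reasoning)                # reasoning trace, 10%
    # First 3 executions free, then -0.02 * (n - 3) for each execution n >= 4
    penalty = sum(0.02 * (n - 3) for n in range(4, executions + 1))
    return reward - penalty
```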


4. Test Case Generation Strategy

4.1 Source: Mining Existing Data Analysis

We don't write questions from scratch. We extract from real analyses:

| Source | What We Extract | Quality Control |
|---|---|---|
| Kaggle Notebooks | Popular notebooks have known insights | Filter to notebooks with 50+ upvotes |
| Data Science Tutorials | Step-by-step analyses with answers | Verify answers are still correct |
| Textbook Exercises | Structured questions with solutions | Already verified |
| Academic Datasets | Well-documented datasets (UCI ML) | Established ground truths |

4.2 Question Types (Difficulty Curriculum)

| Level | Question Type | Example | Skill Tested |
|---|---|---|---|
| Easy | Single aggregation | "What is the average price?" | Basic pandas |
| Easy | Filtering | "How many items cost over $100?" | Boolean indexing |
| Medium | Grouped aggregation | "Which category has the highest avg rating?" | groupby + agg |
| Medium | Correlation | "Is price correlated with reviews?" | Statistical thinking |
| Medium | Missing data | "What % of ages are missing?" | Data quality awareness |
| Hard | Multi-step | "For users who bought X, what % also bought Y?" | Chained reasoning |
| Hard | Outlier detection | "How many values are >3 std from mean?" | Statistical methods |
| Hard | Temporal patterns | "Which month shows highest year-over-year growth?" | Time series |

4.3 Format-Prompted Answers

Following DABench, every question has a format specification:

| Question | Answer Format | Gold Answer | Verification |
|---|---|---|---|
| "Average order value?" | Float, 2 decimals | 47.23 | Within ±0.01 |
| "Month with most sales?" | Abbreviated (Jan/Feb/...) | "Mar" | Exact match |
| "Top 3 products by revenue?" | List of product names | ["A", "B", "C"] | Set equality |
| "Is there correlation?" | Yes or No | "Yes" | Exact match |
| "Correlation coefficient?" | Float, 2 decimals, range [-1,1] | 0.73 | Within ±0.02 |

4.4 Preventing Overfitting

| Strategy | Implementation |
|---|---|
| Dataset variation | Same question type on different datasets |
| Column renaming | Same data, different column names |
| Data perturbation | Slightly different values, same patterns |
| Question paraphrasing | "Average X?" vs "Mean of X?" vs "What's the typical X?" |
| Hidden test set | 70% public dev, 30% hidden eval |
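Two of the transforms above (column renaming and data perturbation) can be sketched directly in pandas. The function names and the 1% noise scale are assumptions for illustration.

```python
# Sketch of two anti-overfitting transforms from the table above.
# Function names and the noise scale are illustrative assumptions.
import numpy as np
import pandas as pd

def rename_columns(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
    """Same data, different column names."""
    return df.rename(columns=mapping)

def perturb_numeric(df: pd.DataFrame, scale: float = 0.01,
                    seed: int = 0) -> pd.DataFrame:
    """Add small relative noise to numeric columns, preserving patterns."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        noise = rng.normal(0.0, scale, size=len(out))
        out[col] = out[col] * (1.0 + noise)
    return out

base = pd.DataFrame({"price": [10.0, 20.0, 30.0], "name": ["a", "b", "c"]})
variant = rename_columns(perturb_numeric(base), {"price": "cost"})
```

Because the noise is small and relative, a question like "which item has the highest price?" has the same answer on the variant, while memorized exact values no longer match.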

5. Comparison: SQLEnv vs DataInsightEnv

| Aspect | SQLEnv | DataInsightEnv |
|---|---|---|
| Scope | Database querying only | Full analysis workflow |
| Language | SQL | Python (pandas) |
| Data Source | Relational database | CSV/DataFrame |
| Skill Focus | Query formulation | End-to-end analysis |
| Complexity | Lower | Higher |
| Implementation Time | 2-3 weeks | 3-4 weeks |
| Novelty | Highest (no SQL RL environment exists) | High (no analysis RL environment exists) |

Recommendation: Start with SQLEnv (simpler), then extend to DataInsightEnv as a "level 2" environment.


6. Why DataInsightEnv Is Novel

6.1 Gap Analysis

| Existing | What It Does | What It Lacks |
|---|---|---|
| Coding Environment (OpenEnv) | General Python execution | No data analysis focus, no insight verification |
| InfiAgent-DABench | Static data analysis benchmark | Not an RL environment, no multi-turn |
| Spider/BIRD | SQL benchmarks | No Python, no full analysis workflow |
| Kaggle competitions | Human data science | No RL formulation |

DataInsightEnv fills the gap: First RL environment for end-to-end data analysis with iterative code execution.

6.2 Real-World Value

This matters because:

  1. Universal skill: Every knowledge worker analyzes data
  2. Teachable workflow: Exploration → Hypothesis → Test → Conclude
  3. Verifiable: Format-prompted answers enable automatic evaluation
  4. GRPO-trainable: Clear reward signal for reasoning chains

7. Storytelling Angle

The narrative: "We built an environment that teaches AI to be a data scientist."

A data scientist doesn't just run queries. They:

  1. Understand the question (what are we trying to find?)
  2. Explore the data (what do we have to work with?)
  3. Form hypotheses (I think X relates to Y)
  4. Test hypotheses (let me compute the correlation)
  5. Validate findings (is this statistically significant?)
  6. Communicate results (the answer is Z)

DataInsightEnv captures this workflow. We can demonstrate:

  • Before GRPO: Agent randomly executes code, misses the insight
  • After GRPO: Agent systematically explores, forms hypotheses, finds answer

Compelling blog angle:

"We gave an AI the same dataset used in a famous Kaggle competition. Could it discover the same insights that won human data scientists $100k? Let's find out..."


8. Implementation Complexity Assessment

| Component | Complexity | Rationale |
|---|---|---|
| Python sandbox | Medium | Need secure execution (Docker + resource limits) |
| Pandas/numpy pre-installed | Low | Standard Python environment |
| OpenEnv API wrapper | Low | Standard reset/step/state pattern |
| Code execution capture | Medium | Capture stdout, stderr, return values |
| Answer verification | Medium | Type-specific matchers (numeric, string, list) |
| Dataset sourcing | Medium | Mine from Kaggle, clean and format |
| Question generation | Medium | Extract from notebooks or write manually |
| GRPO training | Medium | Standard Unsloth/TRL pipeline |

Total estimate: 3-4 weeks with AI-assisted development


9. Technical Considerations

9.1 Sandboxing

Code execution requires security:

  • Docker isolation: Agent code runs in ephemeral container
  • Resource limits: CPU time, memory caps
  • Network disabled: No external API calls
  • Filesystem read-only: Agent can only read the dataset

9.2 Output Capture

Need to capture:

  • stdout: Print statements
  • Return value: Last expression result
  • Errors: Tracebacks (informative for agent)
  • Plots: Save matplotlib figures as images (optional observation)

9.3 Dataset Preparation

Each dataset needs:

  • CSV file: The actual data
  • Metadata: Column descriptions, known issues
  • Questions: 5-10 questions per dataset
  • Gold answers: Verified correct answers
  • Answer formats: How to verify each answer

10. Deliverables Checklist

For OpenEnv Challenge submission:

  • DataInsightEnv environment with reset(), step(), state() methods
  • Dockerfile with Python, pandas, numpy, scipy, sklearn, matplotlib
  • openenv.yaml manifest file
  • EnvClient implementation for remote access
  • Type-safe Action, Observation, State Pydantic models
  • Green Agent wrapper (automated evaluation mode)
  • 10+ datasets with 5-10 questions each (50+ total)
  • Answer verification suite (numeric, string, list, boolean)
  • GRPO training script (Unsloth compatible)
  • Baseline vs trained agent comparison
  • HuggingFace Space deployment
  • Blog post with learning curves and demo

11. Extension Ideas (Post-MVP)

| Extension | Description | Value |
|---|---|---|
| Visualization challenges | "Create a chart showing X" | Tests matplotlib skills |
| Report generation | "Write a summary of findings" | Tests communication |
| Multi-dataset joins | "Combine these datasets to answer..." | Tests data integration |
| Hypothesis testing | "Is the difference statistically significant?" | Tests statistical rigor |
| ML model building | "Train a model to predict X" | Extends to ML workflow |

12. Summary

DataInsightEnv extends the SQLEnv concept to the full data analysis workflow. By combining:

  • Reviewer Two's rubric-based evaluation and progressive hints
  • DABench's format-prompted answers for verifiability
  • SWE-bench's natural artifact mining for test cases
  • CodeBenchGen's LLM-assisted test generation at scale

We create an environment where agents learn to be data scientists: exploring, hypothesizing, testing, and concluding through iterative code execution. This captures a universal knowledge work skill with clear real-world value.


13. Relationship to SQLEnv

These environments can be connected:

┌─────────────────┐     ┌─────────────────┐
│     SQLEnv      │     │  DataInsightEnv │
│   (Database)    │────▶│   (Analysis)    │
│                 │     │                 │
│ - Schema explore│     │ - Data explore  │
│ - SQL queries   │     │ - Python code   │
│ - Result verify │     │ - Insight verify│
└─────────────────┘     └─────────────────┘
        │                       │
        └───────────┬───────────┘
                    │
                    ▼
         ┌─────────────────────┐
         │   Combined Suite    │
         │                     │
         │ SQL → Pandas → ML   │
         │ (Progressive skill) │
         └─────────────────────┘

Submission strategy: Build SQLEnv first (simpler, highest novelty), then extend to DataInsightEnv, then combine into a comprehensive "Data Skills" environment suite.