| # DataInsightEnv: Data Analysis Challenge Environment |
|
|
| ## Conceptual Design Document for OpenEnv Challenge |
|
|
| --- |
|
|
| ## 1. Executive Summary |
|
|
**DataInsightEnv** is a reinforcement learning environment where an agent acts as a data analyst: given a dataset and a question, it must explore, analyze, and extract insights through iterative code execution. Unlike SQL-only environments, DataInsightEnv encompasses the full data analysis workflow: data loading, exploration, transformation, statistical analysis, and insight synthesis.
|
|
**Why this matters**: While coding environments exist (the Coding Environment Server in OpenEnv) and SQL benchmarks exist (Spider), no environment captures the end-to-end data analysis process, where an agent must decide WHAT to analyze, not just HOW to query.
|
|
| --- |
|
|
| ## 2. Learning from Existing Environments |
|
|
| ### 2.1 How Reviewer Two Handles Open-Ended Evaluation |
|
|
| The "Reviewer Two" environment faces a challenge similar to ours: research plans are subjective, yet it achieves automated evaluation. How? |
|
|
| | Reviewer Two Technique | What It Solves | DataInsightEnv Adaptation | |
| |------------------------|----------------|---------------------------| |
| | **Hidden rubric criteria** | Evaluates without agent seeing exact targets | Hidden "insight criteria" agent must discover | |
| | **Rubric coverage scoring** | Multi-factor evaluation via semantic similarity | Insight coverage: did agent find key patterns? | |
| | **Format scoring** | Rewards structure, not just content | Rewards well-organized analysis (steps, reasoning) | |
| | **Progressive hint disclosure** | Guides agent toward criteria | After failed attempts, reveal "the answer involves X column" | |
| | **Compliance penalties** | Punishes ignoring revealed hints | If hint revealed, penalize not addressing it | |
|
|
| **Key insight**: Reviewer Two shows that even open-ended tasks (research plans) can be automatically evaluated through rubric coverage + format checks + semantic similarity. DataInsightEnv can use the same pattern for "insight coverage." |
|
|
| ### 2.2 How InfiAgent-DABench Evaluates Data Analysis |
|
|
| The InfiAgent-DABench benchmark specifically targets data analysis with LLM agents: |
|
|
| | DABench Approach | What It Does | Why It Works | |
| |------------------|--------------|--------------| |
| **Format prompting** | Converts open questions to closed-form answers | "What's the correlation?" → "Answer as a decimal between -1 and 1" |
| | **Sandboxed execution** | Runs agent code in isolated environment | Security + reproducibility | |
| | **Multi-step tasks** | Some questions require chained reasoning | Tests planning, not just single operations | |
| | **Diverse CSV files** | 52 different datasets | Prevents memorization | |
|
|
**Key insight**: The "format prompting" technique is crucial. We don't ask "Analyze the data" (unevaluable). We ask "Which month had the highest sales? Answer as an abbreviated month name (Jan, Feb, etc.)" Now we can verify automatically.
|
|
| ### 2.3 How SWE-bench Creates Ground Truth from Natural Artifacts |
|
|
SWE-bench doesn't synthetically generate problems; it mines real GitHub issues:
|
|
| | SWE-bench Pattern | SQLEnv | DataInsightEnv | |
| |-------------------|--------|----------------| |
| | Source: GitHub PRs | Source: Spider questions | Source: Kaggle notebooks / data science tutorials | |
| | Ground truth: merged patch | Ground truth: gold SQL | Ground truth: known insight from notebook | |
| | Verification: run tests | Verification: compare results | Verification: check if insight matches | |
|
|
| **Key insight**: Real data science workflows (Kaggle notebooks, tutorials) already contain question-answer pairs. "In this analysis of Titanic data, we discovered that..." gives us ground truth insights to verify against. |
|
|
| ### 2.4 How CodeBenchGen Scales Test Generation |
|
|
| CodeBenchGen uses LLMs to automatically generate test cases from arbitrary code: |
|
|
| | CodeBenchGen Step | What It Does | DataInsightEnv Parallel | |
| |-------------------|--------------|-------------------------| |
| | Find code in wild | Scrape GitHub for functions | Scrape Kaggle for analysis notebooks | |
| | LLM sandboxing | Isolate function, generate tests | Extract question + expected answer | |
| | Human validation | Verify solvability (81.3% pass) | Expert review of question clarity | |
|
|
| **Key insight**: We can scale test case generation by having an LLM extract question-answer pairs from existing notebooks, then human-verify a subset. |
|
|
| --- |
|
|
| ## 3. DataInsightEnv Core Design |
|
|
| ### 3.1 The Environment Loop |
|
|
| ``` |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β EPISODE START β |
| β Agent receives: β |
| β β’ Dataset description (name, columns, sample rows) β |
| β β’ Question to answer β |
| β β’ Answer format specification β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β |
| βΌ |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β ANALYSIS PHASE β |
| β Agent iteratively executes Python code: β |
| β β’ Load and inspect data (pandas) β |
| β β’ Transform and clean (fillna, merge, etc.) β |
| β β’ Compute statistics (groupby, corr, etc.) β |
| β β’ Optionally visualize (output captured) β |
| β Each code execution returns output + errors β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β |
| βΌ |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β SUBMISSION PHASE β |
| β Agent submits: INSIGHT in specified format β |
| β Format enforces verifiability: β |
| β β’ Numeric: "42.5" β |
| β β’ Categorical: "March" β |
| β β’ List: "['A', 'B', 'C']" β |
| β β’ Boolean: "True" β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β |
| βΌ |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β VERIFICATION β |
| β Compare to gold answer with appropriate matcher: β |
| β β’ Numeric: within tolerance (Β±1% or Β±0.01) β |
| β β’ String: case-insensitive exact match β |
| β β’ List: set equality (order-independent) β |
| β β’ Boolean: exact match β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| ``` |
|
|
| ### 3.2 State (Observation) |
|
|
| The agent observes: |
|
|
| | Component | Description | Example | |
| |-----------|-------------|---------| |
| | **dataset_name** | Identifier for the dataset | "titanic_passengers" | |
| | **columns** | List of column names and types | `[("age", "float"), ("survived", "int"), ...]` | |
| | **sample_rows** | First 5 rows as preview | `[{"age": 22, "survived": 1, ...}, ...]` | |
| | **dataset_stats** | Basic statistics (nulls, uniques) | `{"age": {"nulls": 177, "mean": 29.7}}` | |
| | **question** | The analytical question | "What is the survival rate for passengers over 50?" | |
| | **answer_format** | Expected format | "Answer as percentage with 1 decimal (e.g., 45.2)" | |
| | **code_history** | Previous code and outputs | `[{"code": "df.head()", "output": "..."}]` | |
| | **step_count** | Current step | 4 | |
| | **error** | Last execution error if any | "KeyError: 'Survived'" | |
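
This observation can be sketched as a plain dataclass. Field names mirror the table above; the deliverables call for Pydantic models, which would look much the same:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataInsightObservation:
    """Sketch of the observation schema; types are illustrative, not final."""
    dataset_name: str
    columns: list[tuple[str, str]]   # e.g. [("age", "float"), ("survived", "int")]
    sample_rows: list[dict]
    dataset_stats: dict
    question: str
    answer_format: str
    code_history: list[dict] = field(default_factory=list)
    step_count: int = 0
    error: Optional[str] = None
```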
| |
| ### 3.3 Actions |
| |
| The agent chooses from: |
| |
| | Action | Parameters | Effect | Cost | |
| |--------|------------|--------|------| |
| | **EXECUTE** | python_code | Runs code in sandbox, returns stdout/result | 1 step | |
| | **INSPECT** | column_name | Returns detailed stats for that column | 1 step | |
| | **SUBMIT** | answer_value | Submits final answer, ends episode | 0 steps | |
|
|
| **Code execution environment** includes: |
| - pandas, numpy, scipy, sklearn (for stats) |
| - matplotlib, seaborn (plots saved as images) |
| - Dataset pre-loaded as `df` |
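
For example, the survival-rate question from Section 3.2 could be answered with a single EXECUTE action (using a toy five-row stand-in for the pre-loaded `df`):

```python
import pandas as pd

# Toy stand-in for the pre-loaded frame; real episodes load the full dataset.
df = pd.DataFrame({"age": [62, 70, 22, 54, 8],
                   "survived": [1, 0, 1, 1, 0]})

# "What is the survival rate for passengers over 50?"
over_50 = df[df["age"] > 50]
rate = round(over_50["survived"].mean() * 100, 1)
print(rate)  # 66.7 for this toy data (2 of 3 passengers over 50 survived)
```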
|
|
| ### 3.4 Reward Structure |
|
|
| Multi-factor reward following Reviewer Two's pattern: |
|
|
| | Factor | Weight | How Computed | |
| |--------|--------|--------------| |
| | **Correctness** | 60% | Does submitted answer match gold? | |
| | **Efficiency** | 20% | `(max_steps - steps_used) / max_steps` | |
| | **Code Quality** | 10% | No errors, clean pandas usage | |
| | **Reasoning Trace** | 10% | Did agent show work? (intermediate outputs) | |
|
|
| **Progressive penalties** (adapted from Reviewer Two): |
| - First 3 code executions: Free exploration |
- Executions 4+: -0.02 × (execution_number - 3)
| |
| This encourages focused analysis over trial-and-error spam. |
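
A sketch of the full computation (the weights and the penalty schedule come from the tables above; the code-quality and reasoning heuristics are placeholders):

```python
def episode_reward(correct, steps_used, max_steps, error_count, showed_work,
                   n_executions):
    """Weighted reward sketch: 60/20/10/10 weights plus progressive penalty."""
    correctness = 1.0 if correct else 0.0
    efficiency = (max_steps - steps_used) / max_steps
    code_quality = 1.0 if error_count == 0 else 0.0   # placeholder heuristic
    reasoning = 1.0 if showed_work else 0.0           # placeholder heuristic
    reward = (0.6 * correctness + 0.2 * efficiency
              + 0.1 * code_quality + 0.1 * reasoning)
    # Progressive penalty: execution n (for n >= 4) costs 0.02 * (n - 3)
    penalty = sum(0.02 * (n - 3) for n in range(4, n_executions + 1))
    return reward - penalty
```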
| |
| --- |
| |
| ## 4. Test Case Generation Strategy |
| |
| ### 4.1 Source: Mining Existing Data Analysis |
| |
| We don't write questions from scratch. We extract from real analyses: |
| |
| | Source | What We Extract | Quality Control | |
| |--------|-----------------|-----------------| |
| | **Kaggle Notebooks** | Popular notebooks have known insights | Filter to notebooks with 50+ upvotes | |
| | **Data Science Tutorials** | Step-by-step analyses with answers | Verify answers are still correct | |
| | **Textbook Exercises** | Structured questions with solutions | Already verified | |
| | **Academic Datasets** | Well-documented datasets (UCI ML) | Established ground truths | |
| |
| ### 4.2 Question Types (Difficulty Curriculum) |
| |
| | Level | Question Type | Example | Skill Tested | |
| |-------|---------------|---------|--------------| |
| | **Easy** | Single aggregation | "What is the average price?" | Basic pandas | |
| | **Easy** | Filtering | "How many items cost over $100?" | Boolean indexing | |
| | **Medium** | Grouped aggregation | "Which category has highest avg rating?" | groupby + agg | |
| | **Medium** | Correlation | "Is price correlated with reviews?" | Statistical thinking | |
| | **Medium** | Missing data | "What % of ages are missing?" | Data quality awareness | |
| | **Hard** | Multi-step | "For users who bought X, what % also bought Y?" | Chained reasoning | |
| | **Hard** | Outlier detection | "How many values are >3 std from mean?" | Statistical methods | |
| | **Hard** | Temporal patterns | "Which month shows highest year-over-year growth?" | Time series | |
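
The harder levels chain several pandas operations. For example, with toy data:

```python
import pandas as pd

# Hard / multi-step: "For users who bought X, what % also bought Y?"
purchases = pd.DataFrame({"user": [1, 1, 2, 2, 3],
                          "product": ["X", "Y", "X", "Z", "Y"]})
bought_x = set(purchases.loc[purchases["product"] == "X", "user"])
bought_y = set(purchases.loc[purchases["product"] == "Y", "user"])
pct = 100 * len(bought_x & bought_y) / len(bought_x)
print(pct)  # users 1 and 2 bought X; only user 1 also bought Y -> 50.0

# Hard / outlier detection: "How many values are >3 std from the mean?"
values = pd.Series([1, 2, 3, 2, 1, 100])
n_outliers = int(((values - values.mean()).abs() > 3 * values.std()).sum())
```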
| |
| ### 4.3 Format-Prompted Answers |
| |
| Following DABench, every question has a format specification: |
| |
| | Question | Answer Format | Gold Answer | Verification | |
| |----------|---------------|-------------|--------------| |
| | "Average order value?" | "Float, 2 decimals" | 47.23 | Within Β±0.01 | |
| | "Month with most sales?" | "Abbreviated (Jan/Feb/...)" | "Mar" | Exact match | |
| | "Top 3 products by revenue?" | "List of product names" | ["A", "B", "C"] | Set equality | |
| | "Is there correlation?" | "Yes or No" | "Yes" | Exact match | |
| | "Correlation coefficient?" | "Float, 2 decimals, range [-1,1]" | 0.73 | Within Β±0.02 | |
| |
| ### 4.4 Preventing Overfitting |
| |
| | Strategy | Implementation | |
| |----------|----------------| |
| | **Dataset variation** | Same question type on different datasets | |
| | **Column renaming** | Same data, different column names | |
| | **Data perturbation** | Slightly different values, same patterns | |
| | **Question paraphrasing** | "Average X?" vs "Mean of X?" vs "What's the typical X?" | |
| | **Hidden test set** | 70% public dev, 30% hidden eval | |
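
Column renaming and data perturbation can be sketched as a single transform (the function name is illustrative):

```python
import numpy as np
import pandas as pd


def make_variant(df, rename_map, noise_scale=0.01, seed=0):
    """Rename columns and jitter numeric values while preserving patterns."""
    rng = np.random.default_rng(seed)
    out = df.rename(columns=rename_map)
    for col in out.select_dtypes("number").columns:
        out[col] = out[col] * (1 + rng.normal(0, noise_scale, len(out)))
    return out


variant = make_variant(pd.DataFrame({"price": [10.0, 20.0, 30.0]}),
                       {"price": "unit_cost"})
```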
| |
| --- |
| |
| ## 5. Comparison: SQLEnv vs DataInsightEnv |
| |
| | Aspect | SQLEnv | DataInsightEnv | |
| |--------|--------|----------------| |
| | **Scope** | Database querying only | Full analysis workflow | |
| | **Language** | SQL | Python (pandas) | |
| | **Data Source** | Relational database | CSV/DataFrame | |
| | **Skill Focus** | Query formulation | End-to-end analysis | |
| | **Complexity** | Lower | Higher | |
| | **Implementation Time** | 2-3 weeks | 3-4 weeks | |
| | **Novelty** | Highest (no SQL RL exists) | High (no analysis RL exists) | |
| |
| **Recommendation**: Start with SQLEnv (simpler), then extend to DataInsightEnv as a "level 2" environment. |
| |
| --- |
| |
| ## 6. Why DataInsightEnv Is Novel |
| |
| ### 6.1 Gap Analysis |
| |
| | Existing | What It Does | What It Lacks | |
| |----------|--------------|---------------| |
| | Coding Environment (OpenEnv) | General Python execution | No data analysis focus, no insight verification | |
| | InfiAgent-DABench | Static data analysis benchmark | Not RL environment, no multi-turn | |
| | Spider/BIRD | SQL benchmarks | No Python, no full analysis workflow | |
| | Kaggle competitions | Human data science | No RL formulation | |
| |
| **DataInsightEnv fills the gap**: First RL environment for end-to-end data analysis with iterative code execution. |
| |
| ### 6.2 Real-World Value |
| |
| This matters because: |
| 1. **Universal skill**: Every knowledge worker analyzes data |
2. **Teachable workflow**: Exploration → Hypothesis → Test → Conclude
| 3. **Verifiable**: Format-prompted answers enable automatic evaluation |
| 4. **GRPO-trainable**: Clear reward signal for reasoning chains |
| |
| --- |
| |
| ## 7. Storytelling Angle |
| |
| **The narrative**: "We built an environment that teaches AI to be a data scientist." |
| |
| A data scientist doesn't just run queries. They: |
| 1. Understand the question (what are we trying to find?) |
| 2. Explore the data (what do we have to work with?) |
| 3. Form hypotheses (I think X relates to Y) |
| 4. Test hypotheses (let me compute the correlation) |
| 5. Validate findings (is this statistically significant?) |
| 6. Communicate results (the answer is Z) |
| |
| DataInsightEnv captures this workflow. We can demonstrate: |
| - **Before GRPO**: Agent randomly executes code, misses the insight |
| - **After GRPO**: Agent systematically explores, forms hypotheses, finds answer |
| |
| **Compelling blog angle**: |
| > "We gave an AI the same dataset used in a famous Kaggle competition. Could it discover the same insights that won human data scientists $100k? Let's find out..." |
| |
| --- |
| |
| ## 8. Implementation Complexity Assessment |
| |
| | Component | Complexity | Rationale | |
| |-----------|------------|-----------| |
| | Python sandbox | Medium | Need secure execution (Docker + resource limits) | |
| | Pandas/numpy pre-installed | Low | Standard Python environment | |
| | OpenEnv API wrapper | Low | Standard reset/step/state pattern | |
| | Code execution capture | Medium | Capture stdout, stderr, return values | |
| | Answer verification | Medium | Type-specific matchers (numeric, string, list) | |
| | Dataset sourcing | Medium | Mine from Kaggle, clean and format | |
| | Question generation | Medium | Extract from notebooks or write manually | |
| | GRPO training | Medium | Standard Unsloth/TRL pipeline | |
| |
| **Total estimate**: 3-4 weeks with AI-assisted development |
| |
| --- |
| |
| ## 9. Technical Considerations |
| |
| ### 9.1 Sandboxing |
| |
| Code execution requires security: |
| - **Docker isolation**: Agent code runs in ephemeral container |
| - **Resource limits**: CPU time, memory caps |
| - **Network disabled**: No external API calls |
| - **Filesystem read-only**: Agent can only read the dataset |
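
These constraints map directly onto standard `docker run` flags. A sketch, with placeholder image name and mount path:

```python
def sandbox_cmd(code, image="datainsight-sandbox", mem="512m", cpus="1.0"):
    """Build the `docker run` invocation for one code execution (sketch)."""
    return ["docker", "run", "--rm",
            "--network", "none",             # no external API calls
            "--memory", mem,                 # memory cap
            "--cpus", cpus,                  # CPU cap
            "--read-only",                   # root filesystem read-only
            "-v", "/srv/datasets:/data:ro",  # dataset mounted read-only
            image, "python", "-c", code]

# Run with e.g. subprocess.run(sandbox_cmd("print('ok')"),
#                              capture_output=True, text=True, timeout=30)
# so the host enforces the CPU-time limit via the subprocess timeout.
```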
| |
| ### 9.2 Output Capture |
| |
| Need to capture: |
| - **stdout**: Print statements |
| - **Return value**: Last expression result |
| - **Errors**: Tracebacks (informative for agent) |
| - **Plots**: Save matplotlib figures as images (optional observation) |
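
A standard-library sketch of this capture scheme (plot capture omitted): if the agent's last statement is a bare expression, it is evaluated separately so its value can be returned the way a notebook cell would display it.

```python
import ast
import contextlib
import io
import traceback


def run_and_capture(code, namespace):
    """Capture stdout, the value of a trailing expression (REPL-style),
    and a traceback string on error. Sketch only."""
    buf = io.StringIO()
    result, error = None, None
    try:
        tree = ast.parse(code)
        # Split off a trailing bare expression so it can be eval'ed for its value.
        last = (tree.body.pop()
                if tree.body and isinstance(tree.body[-1], ast.Expr) else None)
        with contextlib.redirect_stdout(buf):
            exec(compile(tree, "<agent>", "exec"), namespace)
            if last is not None:
                result = eval(compile(ast.Expression(body=last.value),
                                      "<agent>", "eval"), namespace)
    except Exception:
        error = traceback.format_exc()  # informative for the agent
    return {"stdout": buf.getvalue(), "result": result, "error": error}
```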
| |
| ### 9.3 Dataset Preparation |
| |
| Each dataset needs: |
| - **CSV file**: The actual data |
| - **Metadata**: Column descriptions, known issues |
| - **Questions**: 5-10 questions per dataset |
| - **Gold answers**: Verified correct answers |
| - **Answer formats**: How to verify each answer |
| |
| --- |
| |
| ## 10. Deliverables Checklist |
| |
| For OpenEnv Challenge submission: |
| |
| - [ ] DataInsightEnv environment with `reset()`, `step()`, `state()` methods |
| - [ ] Dockerfile with Python, pandas, numpy, scipy, sklearn, matplotlib |
| - [ ] `openenv.yaml` manifest file |
| - [ ] EnvClient implementation for remote access |
| - [ ] Type-safe Action, Observation, State Pydantic models |
| - [ ] Green Agent wrapper (automated evaluation mode) |
| - [ ] 10+ datasets with 5-10 questions each (50+ total) |
| - [ ] Answer verification suite (numeric, string, list, boolean) |
| - [ ] GRPO training script (Unsloth compatible) |
| - [ ] Baseline vs trained agent comparison |
| - [ ] HuggingFace Space deployment |
| - [ ] Blog post with learning curves and demo |
| |
| --- |
| |
| ## 11. Extension Ideas (Post-MVP) |
| |
| | Extension | Description | Value | |
| |-----------|-------------|-------| |
| | **Visualization challenges** | "Create a chart showing X" | Tests matplotlib skills | |
| | **Report generation** | "Write a summary of findings" | Tests communication | |
| | **Multi-dataset joins** | "Combine these datasets to answer..." | Tests data integration | |
| | **Hypothesis testing** | "Is the difference statistically significant?" | Tests statistical rigor | |
| | **ML model building** | "Train a model to predict X" | Extends to ML workflow | |
| |
| --- |
| |
| ## 12. Summary |
| |
| DataInsightEnv extends the SQLEnv concept to the full data analysis workflow. By combining: |
| - **Reviewer Two's** rubric-based evaluation and progressive hints |
| - **DABench's** format-prompted answers for verifiability |
| - **SWE-bench's** natural artifact mining for test cases |
| - **CodeBenchGen's** LLM-assisted test generation at scale |
| |
We create an environment where agents learn to be data scientists: exploring, hypothesizing, testing, and concluding through iterative code execution. This captures a universal knowledge work skill with clear real-world value.
| |
| --- |
| |
| ## 13. Relationship to SQLEnv |
| |
| These environments can be connected: |
| |
| ``` |
| βββββββββββββββββββ βββββββββββββββββββ |
| β SQLEnv β β DataInsightEnv β |
| β (Database) ββββββΆβ (Analysis) β |
| β β β β |
| β - Schema exploreβ β - Data explore β |
| β - SQL queries β β - Python code β |
| β - Result verify β β - Insight verifyβ |
| βββββββββββββββββββ βββββββββββββββββββ |
| β β |
| βββββββββββββ¬ββββββββββββ |
| β |
| βΌ |
| βββββββββββββββββββββββ |
| β Combined Suite β |
| β β |
| β SQL β Pandas β ML β |
| β (Progressive skill) β |
| βββββββββββββββββββββββ |
| ``` |
| |
| **Submission strategy**: Build SQLEnv first (simpler, highest novelty), then extend to DataInsightEnv, then combine into a comprehensive "Data Skills" environment suite. |
| |