| # DataInsightEnv: Data Analysis Challenge Environment |
|
|
| ## Conceptual Design Document for OpenEnv Challenge |
|
|
| --- |
|
|
| ## 1. Executive Summary |
|
|
**DataInsightEnv** is a reinforcement learning environment where an agent acts as a data analyst: given a dataset and a question, it must explore, analyze, and extract insights through iterative code execution. Unlike SQL-only environments, DataInsightEnv encompasses the full data analysis workflow: data loading, exploration, transformation, statistical analysis, and insight synthesis.
|
|
**Why this matters**: While coding environments exist (the Coding Environment Server in OpenEnv) and SQL benchmarks exist (Spider), no environment captures the end-to-end data analysis process, where an agent must decide WHAT to analyze, not just HOW to query.
|
|
| --- |
|
|
| ## 2. Learning from Existing Environments |
|
|
| ### 2.1 How Reviewer Two Handles Open-Ended Evaluation |
|
|
| The "Reviewer Two" environment faces a challenge similar to ours: research plans are subjective, yet it achieves automated evaluation. How? |
|
|
| | Reviewer Two Technique | What It Solves | DataInsightEnv Adaptation | |
| |------------------------|----------------|---------------------------| |
| | **Hidden rubric criteria** | Evaluates without agent seeing exact targets | Hidden "insight criteria" agent must discover | |
| | **Rubric coverage scoring** | Multi-factor evaluation via semantic similarity | Insight coverage: did agent find key patterns? | |
| | **Format scoring** | Rewards structure, not just content | Rewards well-organized analysis (steps, reasoning) | |
| | **Progressive hint disclosure** | Guides agent toward criteria | After failed attempts, reveal "the answer involves X column" | |
| | **Compliance penalties** | Punishes ignoring revealed hints | If hint revealed, penalize not addressing it | |
|
|
| **Key insight**: Reviewer Two shows that even open-ended tasks (research plans) can be automatically evaluated through rubric coverage + format checks + semantic similarity. DataInsightEnv can use the same pattern for "insight coverage." |
|
|
| ### 2.2 How InfiAgent-DABench Evaluates Data Analysis |
|
|
| The InfiAgent-DABench benchmark specifically targets data analysis with LLM agents: |
|
|
| | DABench Approach | What It Does | Why It Works | |
| |------------------|--------------|--------------| |
| **Format prompting** | Converts open questions to closed-form answers | "What's the correlation?" → "Answer as a decimal between -1 and 1" |
| | **Sandboxed execution** | Runs agent code in isolated environment | Security + reproducibility | |
| | **Multi-step tasks** | Some questions require chained reasoning | Tests planning, not just single operations | |
| | **Diverse CSV files** | 52 different datasets | Prevents memorization | |
|
|
**Key insight**: The "format prompting" technique is crucial. We don't ask "Analyze the data" (unevaluable). We ask "Which month had the highest sales? Answer as an abbreviated month name (Jan, Feb, etc.)" Now we can verify automatically.
|
|
| ### 2.3 How SWE-bench Creates Ground Truth from Natural Artifacts |
|
|
SWE-bench doesn't synthetically generate problems; it mines real GitHub issues:
|
|
| | SWE-bench Pattern | SQLEnv | DataInsightEnv | |
| |-------------------|--------|----------------| |
| | Source: GitHub PRs | Source: Spider questions | Source: Kaggle notebooks / data science tutorials | |
| | Ground truth: merged patch | Ground truth: gold SQL | Ground truth: known insight from notebook | |
| | Verification: run tests | Verification: compare results | Verification: check if insight matches | |
|
|
| **Key insight**: Real data science workflows (Kaggle notebooks, tutorials) already contain question-answer pairs. "In this analysis of Titanic data, we discovered that..." gives us ground truth insights to verify against. |
|
|
| ### 2.4 How CodeBenchGen Scales Test Generation |
|
|
| CodeBenchGen uses LLMs to automatically generate test cases from arbitrary code: |
|
|
| | CodeBenchGen Step | What It Does | DataInsightEnv Parallel | |
| |-------------------|--------------|-------------------------| |
| | Find code in wild | Scrape GitHub for functions | Scrape Kaggle for analysis notebooks | |
| | LLM sandboxing | Isolate function, generate tests | Extract question + expected answer | |
| | Human validation | Verify solvability (81.3% pass) | Expert review of question clarity | |
|
|
| **Key insight**: We can scale test case generation by having an LLM extract question-answer pairs from existing notebooks, then human-verify a subset. |
|
|
| --- |
|
|
| ## 3. DataInsightEnv Core Design |
|
|
| ### 3.1 The Environment Loop |
|
|
| ``` |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β EPISODE START β |
| β Agent receives: β |
| β β’ Dataset description (name, columns, sample rows) β |
| β β’ Question to answer β |
| β β’ Answer format specification β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β |
| βΌ |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β ANALYSIS PHASE β |
| β Agent iteratively executes Python code: β |
| β β’ Load and inspect data (pandas) β |
| β β’ Transform and clean (fillna, merge, etc.) β |
| β β’ Compute statistics (groupby, corr, etc.) β |
| β β’ Optionally visualize (output captured) β |
| β Each code execution returns output + errors β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β |
| βΌ |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β SUBMISSION PHASE β |
| β Agent submits: INSIGHT in specified format β |
| β Format enforces verifiability: β |
| β β’ Numeric: "42.5" β |
| β β’ Categorical: "March" β |
| β β’ List: "['A', 'B', 'C']" β |
| β β’ Boolean: "True" β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β |
| βΌ |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| β VERIFICATION β |
| β Compare to gold answer with appropriate matcher: β |
| β β’ Numeric: within tolerance (Β±1% or Β±0.01) β |
| β β’ String: case-insensitive exact match β |
| β β’ List: set equality (order-independent) β |
| β β’ Boolean: exact match β |
| βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ |
| ``` |
|
|
| ### 3.2 State (Observation) |
|
|
| The agent observes: |
|
|
| | Component | Description | Example | |
| |-----------|-------------|---------| |
| | **dataset_name** | Identifier for the dataset | "titanic_passengers" | |
| | **columns** | List of column names and types | `[("age", "float"), ("survived", "int"), ...]` | |
| | **sample_rows** | First 5 rows as preview | `[{"age": 22, "survived": 1, ...}, ...]` | |
| | **dataset_stats** | Basic statistics (nulls, uniques) | `{"age": {"nulls": 177, "mean": 29.7}}` | |
| | **question** | The analytical question | "What is the survival rate for passengers over 50?" | |
| | **answer_format** | Expected format | "Answer as percentage with 1 decimal (e.g., 45.2)" | |
| | **code_history** | Previous code and outputs | `[{"code": "df.head()", "output": "..."}]` | |
| | **step_count** | Current step | 4 | |
| | **error** | Last execution error if any | "KeyError: 'Survived'" | |
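
This observation can be sketched as a plain dataclass. Field names mirror the table above; the deliverables call for Pydantic models, which would look much the same:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DataInsightObservation:
    """Sketch of the observation schema; types are illustrative, not final."""
    dataset_name: str
    columns: list[tuple[str, str]]   # e.g. [("age", "float"), ("survived", "int")]
    sample_rows: list[dict]
    dataset_stats: dict
    question: str
    answer_format: str
    code_history: list[dict] = field(default_factory=list)
    step_count: int = 0
    error: Optional[str] = None
```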
| |
| ### 3.3 Actions |
| |
| The agent chooses from: |
| |
| | Action | Parameters | Effect | Cost | |
| |--------|------------|--------|------| |
| | **EXECUTE** | python_code | Runs code in sandbox, returns stdout/result | 1 step | |
| | **INSPECT** | column_name | Returns detailed stats for that column | 1 step | |
| | **SUBMIT** | answer_value | Submits final answer, ends episode | 0 steps | |
|
|
| **Code execution environment** includes: |
| - pandas, numpy, scipy, sklearn (for stats) |
| - matplotlib, seaborn (plots saved as images) |
| - Dataset pre-loaded as `df` |
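
For example, the survival-rate question from Section 3.2 could be answered with a single EXECUTE action (using a toy five-row stand-in for the pre-loaded `df`):

```python
import pandas as pd

# Toy stand-in for the pre-loaded frame; real episodes load the full dataset.
df = pd.DataFrame({"age": [62, 70, 22, 54, 8],
                   "survived": [1, 0, 1, 1, 0]})

# "What is the survival rate for passengers over 50?"
over_50 = df[df["age"] > 50]
rate = round(over_50["survived"].mean() * 100, 1)
print(rate)  # 66.7 for this toy data (2 of 3 passengers over 50 survived)
```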
|
|
| ### 3.4 Reward Structure |
|
|
| Multi-factor reward following Reviewer Two's pattern: |
|
|
| | Factor | Weight | How Computed | |
| |--------|--------|--------------| |
| | **Correctness** | 60% | Does submitted answer match gold? | |
| | **Efficiency** | 20% | `(max_steps - steps_used) / max_steps` | |
| | **Code Quality** | 10% | No errors, clean pandas usage | |
| | **Reasoning Trace** | 10% | Did agent show work? (intermediate outputs) | |
|
|
| **Progressive penalties** (adapted from Reviewer Two): |
| - First 3 code executions: Free exploration |
- Executions 4+: -0.02 × (execution_number - 3)
| |
| This encourages focused analysis over trial-and-error spam. |
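
A sketch of the full computation (the weights and the penalty schedule come from the tables above; the code-quality and reasoning heuristics are placeholders):

```python
def episode_reward(correct, steps_used, max_steps, error_count, showed_work,
                   n_executions):
    """Weighted reward sketch: 60/20/10/10 weights plus progressive penalty."""
    correctness = 1.0 if correct else 0.0
    efficiency = (max_steps - steps_used) / max_steps
    code_quality = 1.0 if error_count == 0 else 0.0   # placeholder heuristic
    reasoning = 1.0 if showed_work else 0.0           # placeholder heuristic
    reward = (0.6 * correctness + 0.2 * efficiency
              + 0.1 * code_quality + 0.1 * reasoning)
    # Progressive penalty: execution n (for n >= 4) costs 0.02 * (n - 3)
    penalty = sum(0.02 * (n - 3) for n in range(4, n_executions + 1))
    return reward - penalty
```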
| |
| --- |
| |
| ## 4. Test Case Generation Strategy |
| |
| ### 4.1 Source: Mining Existing Data Analysis |
| |
| We don't write questions from scratch. We extract from real analyses: |
| |
| | Source | What We Extract | Quality Control | |
| |--------|-----------------|-----------------| |
| | **Kaggle Notebooks** | Popular notebooks have known insights | Filter to notebooks with 50+ upvotes | |
| | **Data Science Tutorials** | Step-by-step analyses with answers | Verify answers are still correct | |
| | **Textbook Exercises** | Structured questions with solutions | Already verified | |
| | **Academic Datasets** | Well-documented datasets (UCI ML) | Established ground truths | |
| |
| ### 4.2 Question Types (Difficulty Curriculum) |
| |
| | Level | Question Type | Example | Skill Tested | |
| |-------|---------------|---------|--------------| |
| | **Easy** | Single aggregation | "What is the average price?" | Basic pandas | |
| | **Easy** | Filtering | "How many items cost over $100?" | Boolean indexing | |
| | **Medium** | Grouped aggregation | "Which category has highest avg rating?" | groupby + agg | |
| | **Medium** | Correlation | "Is price correlated with reviews?" | Statistical thinking | |
| | **Medium** | Missing data | "What % of ages are missing?" | Data quality awareness | |
| | **Hard** | Multi-step | "For users who bought X, what % also bought Y?" | Chained reasoning | |
| | **Hard** | Outlier detection | "How many values are >3 std from mean?" | Statistical methods | |
| | **Hard** | Temporal patterns | "Which month shows highest year-over-year growth?" | Time series | |
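
The harder levels chain several pandas operations. For example, with toy data:

```python
import pandas as pd

# Hard / multi-step: "For users who bought X, what % also bought Y?"
purchases = pd.DataFrame({"user": [1, 1, 2, 2, 3],
                          "product": ["X", "Y", "X", "Z", "Y"]})
bought_x = set(purchases.loc[purchases["product"] == "X", "user"])
bought_y = set(purchases.loc[purchases["product"] == "Y", "user"])
pct = 100 * len(bought_x & bought_y) / len(bought_x)
print(pct)  # users 1 and 2 bought X; only user 1 also bought Y -> 50.0

# Hard / outlier detection: "How many values are >3 std from the mean?"
values = pd.Series([1, 2, 3, 2, 1, 100])
n_outliers = int(((values - values.mean()).abs() > 3 * values.std()).sum())
```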
| |
| ### 4.3 Format-Prompted Answers |
| |
| Following DABench, every question has a format specification: |
| |
| | Question | Answer Format | Gold Answer | Verification | |
| |----------|---------------|-------------|--------------| |
| | "Average order value?" | "Float, 2 decimals" | 47.23 | Within Β±0.01 | |
| | "Month with most sales?" | "Abbreviated (Jan/Feb/...)" | "Mar" | Exact match | |
| | "Top 3 products by revenue?" | "List of product names" | ["A", "B", "C"] | Set equality | |
| | "Is there correlation?" | "Yes or No" | "Yes" | Exact match | |
| | "Correlation coefficient?" | "Float, 2 decimals, range [-1,1]" | 0.73 | Within Β±0.02 | |
| |
| ### 4.4 Preventing Overfitting |
| |
| | Strategy | Implementation | |
| |----------|----------------| |
| | **Dataset variation** | Same question type on different datasets | |
| | **Column renaming** | Same data, different column names | |
| | **Data perturbation** | Slightly different values, same patterns | |
| | **Question paraphrasing** | "Average X?" vs "Mean of X?" vs "What's the typical X?" | |
| | **Hidden test set** | 70% public dev, 30% hidden eval | |
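
Column renaming and data perturbation can be sketched as a single transform (the function name is illustrative):

```python
import numpy as np
import pandas as pd


def make_variant(df, rename_map, noise_scale=0.01, seed=0):
    """Rename columns and jitter numeric values while preserving patterns."""
    rng = np.random.default_rng(seed)
    out = df.rename(columns=rename_map)
    for col in out.select_dtypes("number").columns:
        out[col] = out[col] * (1 + rng.normal(0, noise_scale, len(out)))
    return out


variant = make_variant(pd.DataFrame({"price": [10.0, 20.0, 30.0]}),
                       {"price": "unit_cost"})
```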
| |
| --- |
| |
| ## 5. Comparison: SQLEnv vs DataInsightEnv |
| |
| | Aspect | SQLEnv | DataInsightEnv | |
| |--------|--------|----------------| |
| | **Scope** | Database querying only | Full analysis workflow | |
| | **Language** | SQL | Python (pandas) | |
| | **Data Source** | Relational database | CSV/DataFrame | |
| | **Skill Focus** | Query formulation | End-to-end analysis | |
| | **Complexity** | Lower | Higher | |
| | **Implementation Time** | 2-3 weeks | 3-4 weeks | |
| | **Novelty** | Highest (no SQL RL exists) | High (no analysis RL exists) | |
| |
| **Recommendation**: Start with SQLEnv (simpler), then extend to DataInsightEnv as a "level 2" environment. |
| |
| --- |
| |
| ## 6. Why DataInsightEnv Is Novel |
| |
| ### 6.1 Gap Analysis |
| |
| | Existing | What It Does | What It Lacks | |
| |----------|--------------|---------------| |
| | Coding Environment (OpenEnv) | General Python execution | No data analysis focus, no insight verification | |
| | InfiAgent-DABench | Static data analysis benchmark | Not RL environment, no multi-turn | |
| | Spider/BIRD | SQL benchmarks | No Python, no full analysis workflow | |
| | Kaggle competitions | Human data science | No RL formulation | |
| |
| **DataInsightEnv fills the gap**: First RL environment for end-to-end data analysis with iterative code execution. |
| |
| ### 6.2 Real-World Value |
| |
| This matters because: |
| 1. **Universal skill**: Every knowledge worker analyzes data |
2. **Teachable workflow**: Exploration → Hypothesis → Test → Conclude
| 3. **Verifiable**: Format-prompted answers enable automatic evaluation |
| 4. **GRPO-trainable**: Clear reward signal for reasoning chains |
| |
| --- |
| |
| ## 7. Storytelling Angle |
| |
| **The narrative**: "We built an environment that teaches AI to be a data scientist." |
| |
| A data scientist doesn't just run queries. They: |
| 1. Understand the question (what are we trying to find?) |
| 2. Explore the data (what do we have to work with?) |
| 3. Form hypotheses (I think X relates to Y) |
| 4. Test hypotheses (let me compute the correlation) |
| 5. Validate findings (is this statistically significant?) |
| 6. Communicate results (the answer is Z) |
| |
| DataInsightEnv captures this workflow. We can demonstrate: |
| - **Before GRPO**: Agent randomly executes code, misses the insight |
| - **After GRPO**: Agent systematically explores, forms hypotheses, finds answer |
| |
| **Compelling blog angle**: |
| > "We gave an AI the same dataset used in a famous Kaggle competition. Could it discover the same insights that won human data scientists $100k? Let's find out..." |
| |
| --- |
| |
| ## 8. Implementation Complexity Assessment |
| |
| | Component | Complexity | Rationale | |
| |-----------|------------|-----------| |
| | Python sandbox | Medium | Need secure execution (Docker + resource limits) | |
| | Pandas/numpy pre-installed | Low | Standard Python environment | |
| | OpenEnv API wrapper | Low | Standard reset/step/state pattern | |
| | Code execution capture | Medium | Capture stdout, stderr, return values | |
| | Answer verification | Medium | Type-specific matchers (numeric, string, list) | |
| | Dataset sourcing | Medium | Mine from Kaggle, clean and format | |
| | Question generation | Medium | Extract from notebooks or write manually | |
| | GRPO training | Medium | Standard Unsloth/TRL pipeline | |
| |
| **Total estimate**: 3-4 weeks with AI-assisted development |
| |
| --- |
| |
| ## 9. Technical Considerations |
| |
| ### 9.1 Sandboxing |
| |
| Code execution requires security: |
| - **Docker isolation**: Agent code runs in ephemeral container |
| - **Resource limits**: CPU time, memory caps |
| - **Network disabled**: No external API calls |
| - **Filesystem read-only**: Agent can only read the dataset |
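
These constraints map directly onto standard `docker run` flags. A sketch, with placeholder image name and mount path:

```python
def sandbox_cmd(code, image="datainsight-sandbox", mem="512m", cpus="1.0"):
    """Build the `docker run` invocation for one code execution (sketch)."""
    return ["docker", "run", "--rm",
            "--network", "none",             # no external API calls
            "--memory", mem,                 # memory cap
            "--cpus", cpus,                  # CPU cap
            "--read-only",                   # root filesystem read-only
            "-v", "/srv/datasets:/data:ro",  # dataset mounted read-only
            image, "python", "-c", code]

# Run with e.g. subprocess.run(sandbox_cmd("print('ok')"),
#                              capture_output=True, text=True, timeout=30)
# so the host enforces the CPU-time limit via the subprocess timeout.
```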
| |
| ### 9.2 Output Capture |
| |
| Need to capture: |
| - **stdout**: Print statements |
| - **Return value**: Last expression result |
| - **Errors**: Tracebacks (informative for agent) |
| - **Plots**: Save matplotlib figures as images (optional observation) |
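
A standard-library sketch of this capture scheme (plot capture omitted): if the agent's last statement is a bare expression, it is evaluated separately so its value can be returned the way a notebook cell would display it.

```python
import ast
import contextlib
import io
import traceback


def run_and_capture(code, namespace):
    """Capture stdout, the value of a trailing expression (REPL-style),
    and a traceback string on error. Sketch only."""
    buf = io.StringIO()
    result, error = None, None
    try:
        tree = ast.parse(code)
        # Split off a trailing bare expression so it can be eval'ed for its value.
        last = (tree.body.pop()
                if tree.body and isinstance(tree.body[-1], ast.Expr) else None)
        with contextlib.redirect_stdout(buf):
            exec(compile(tree, "<agent>", "exec"), namespace)
            if last is not None:
                result = eval(compile(ast.Expression(body=last.value),
                                      "<agent>", "eval"), namespace)
    except Exception:
        error = traceback.format_exc()  # informative for the agent
    return {"stdout": buf.getvalue(), "result": result, "error": error}
```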
| |
| ### 9.3 Dataset Preparation |
| |
| Each dataset needs: |
| - **CSV file**: The actual data |
| - **Metadata**: Column descriptions, known issues |
| - **Questions**: 5-10 questions per dataset |
| - **Gold answers**: Verified correct answers |
| - **Answer formats**: How to verify each answer |
| |
| --- |
| |
| ## 10. Deliverables Checklist |
| |
| For OpenEnv Challenge submission: |
| |
| - [ ] DataInsightEnv environment with `reset()`, `step()`, `state()` methods |
| - [ ] Dockerfile with Python, pandas, numpy, scipy, sklearn, matplotlib |
| - [ ] `openenv.yaml` manifest file |
| - [ ] EnvClient implementation for remote access |
| - [ ] Type-safe Action, Observation, State Pydantic models |
| - [ ] Green Agent wrapper (automated evaluation mode) |
| - [ ] 10+ datasets with 5-10 questions each (50+ total) |
| - [ ] Answer verification suite (numeric, string, list, boolean) |
| - [ ] GRPO training script (Unsloth compatible) |
| - [ ] Baseline vs trained agent comparison |
| - [ ] HuggingFace Space deployment |
| - [ ] Blog post with learning curves and demo |
| |
| --- |
| |
| ## 11. Extension Ideas (Post-MVP) |
| |
| | Extension | Description | Value | |
| |-----------|-------------|-------| |
| | **Visualization challenges** | "Create a chart showing X" | Tests matplotlib skills | |
| | **Report generation** | "Write a summary of findings" | Tests communication | |
| | **Multi-dataset joins** | "Combine these datasets to answer..." | Tests data integration | |
| | **Hypothesis testing** | "Is the difference statistically significant?" | Tests statistical rigor | |
| | **ML model building** | "Train a model to predict X" | Extends to ML workflow | |
| |
| --- |
| |
| ## 12. Summary |
| |
| DataInsightEnv extends the SQLEnv concept to the full data analysis workflow. By combining: |
| - **Reviewer Two's** rubric-based evaluation and progressive hints |
| - **DABench's** format-prompted answers for verifiability |
| - **SWE-bench's** natural artifact mining for test cases |
| - **CodeBenchGen's** LLM-assisted test generation at scale |
| |
We create an environment where agents learn to be data scientists: exploring, hypothesizing, testing, and concluding through iterative code execution. This captures a universal knowledge work skill with clear real-world value.
| |
| --- |
| |
| ## 13. Relationship to SQLEnv |
| |
| These environments can be connected: |
| |
| ``` |
| βββββββββββββββββββ βββββββββββββββββββ |
| β SQLEnv β β DataInsightEnv β |
| β (Database) ββββββΆβ (Analysis) β |
| β β β β |
| β - Schema exploreβ β - Data explore β |
| β - SQL queries β β - Python code β |
| β - Result verify β β - Insight verifyβ |
| βββββββββββββββββββ βββββββββββββββββββ |
| β β |
| βββββββββββββ¬ββββββββββββ |
| β |
| βΌ |
| βββββββββββββββββββββββ |
| β Combined Suite β |
| β β |
| β SQL β Pandas β ML β |
| β (Progressive skill) β |
| βββββββββββββββββββββββ |
| ``` |
| |
| **Submission strategy**: Build SQLEnv first (simpler, highest novelty), then extend to DataInsightEnv, then combine into a comprehensive "Data Skills" environment suite. |
| |