---
title: Data Cleaning OpenEnv Benchmark
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - openenv
---

# Data Cleaning OpenEnv Benchmark

A practical benchmark where LLM agents clean messy tabular datasets through a structured action API.

## Why This Matters

Data cleaning still takes a large share of real analytics work. This environment tests whether an agent can detect and correct common data quality problems such as duplicates, missing values, inconsistent formats, and outliers.

## Tasks

| ID | Difficulty | Description |
|----|-----------|-------------|
| `task1_easy` | Easy | Remove exact duplicates, fill missing emails and ages, standardise country names |
| `task2_medium` | Medium | Normalise mixed date formats, convert price strings to float, fix category typos |
| `task3_hard` | Hard | Resolve duplicate user IDs, clip session outliers, fix invalid bounce rates |
| `task4_medium_alt` | Medium | Alternate order-cleaning scenario that uses the same grader contract as `task2_medium` |
| `task5_hard_alt` | Hard | Alternate analytics-cleaning scenario that uses the same grader contract as `task3_hard` |

Each task is graded independently, and scores always fall strictly between 0 and 1.

## Action Space

| Action | Required Fields |
|--------|----------------|
| `fill_missing` | `column`, `strategy` (`mean`/`median`/`mode`/`constant`), `value` when needed |
| `standardize_values` | `column`, `mapping` |
| `remove_duplicates` | None |
| `remove_row` | `row_id` |
| `convert_type` | `column`, `target_type` |
| `clip_outliers` | `column`, `lower`, `upper` |
| `submit` | None |

## Observation Space

At each step, the agent receives `table_preview`, `schema_info`, `issues_detected`, `cleaning_log`, `valid_actions`, `step`, and `max_steps`.

## Reward Design

Correct cleaning actions receive positive intermediate rewards, wasted actions receive small penalties, invalid actions receive larger penalties, and `submit` returns the final grader score.
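The action table above can be encoded as a small validation helper on the agent side. This is a minimal sketch: the action names and required fields come from the table, but the exact JSON envelope (`{"action": ..., ...}`) is an assumption — check `GET /schema` for the authoritative shape.

```python
# Per-action required fields, taken from the Action Space table above.
REQUIRED_FIELDS = {
    "fill_missing": ["column", "strategy"],
    "standardize_values": ["column", "mapping"],
    "remove_duplicates": [],
    "remove_row": ["row_id"],
    "convert_type": ["column", "target_type"],
    "clip_outliers": ["column", "lower", "upper"],
    "submit": [],
}

def make_action(name: str, **fields) -> dict:
    """Validate required fields for an action and return its payload.

    The flat {"action": name, **fields} envelope is an assumption;
    consult GET /schema for the real contract.
    """
    if name not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {name}")
    missing = [f for f in REQUIRED_FIELDS[name] if f not in fields]
    if missing:
        raise ValueError(f"{name} is missing fields: {missing}")
    return {"action": name, **fields}

# Example: fill missing ages with the column median.
action = make_action("fill_missing", column="age", strategy="median")
```

Validating locally before calling `/step` avoids the larger penalty the environment charges for invalid actions.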
## Setup & Local Run

```bash
git clone https://huggingface.co/spaces/AnkushRaheja/data-cleaning-benchmark
cd data-cleaning-benchmark
pip install -r requirements.txt
uvicorn app:app --port 7860
```

## Run Baseline

```bash
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="meta-llama/llama-4-scout-17b-16e-instruct"
export HF_TOKEN="$GROQ_API_KEY"
export TASK_ID="task1_easy"
python inference.py
```

## Docker

```bash
docker build -t data-cleaning-benchmark .
docker run -p 7860:7860 \
  -e API_BASE_URL="https://api.groq.com/openai/v1" \
  -e MODEL_NAME="meta-llama/llama-4-scout-17b-16e-instruct" \
  -e HF_TOKEN="$GROQ_API_KEY" \
  data-cleaning-benchmark
```

## Baseline Scores

| Task | Score |
|------|-------|
| task1_easy | 0.99 |
| task2_medium | 0.99 |
| task3_hard | 0.97 |
| task4_medium_alt | 0.99 |
| task5_hard_alt | 0.97 |

## API Reference

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/health` | Health check |
| POST | `/reset` | Start a new episode: `{"task_id": "task1_easy"}` |
| POST | `/step` | Submit an action and receive a reward (compat route with `session_id` in body or query) |
| POST | `/step/{session_id}` | Legacy route for direct session addressing |
| GET | `/state` | Retrieve state by query (`session_id`) |
| GET | `/state/{session_id}` | Legacy route for direct session addressing |
| GET | `/tasks` | List all tasks |
| GET | `/metadata` | Benchmark metadata including task and score-range contract |
| GET | `/schema` | JSON schemas for action/observation/step response |
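The reset/step loop above can be driven with nothing but the standard library. This is a sketch under assumptions: the endpoint paths come from the API reference, but the response field names used here (`session_id`, `reward`) are hypothetical — consult `GET /schema` for the real contract.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # local uvicorn run from the setup section

def build_request(path: str, body: dict) -> urllib.request.Request:
    """Compose a JSON POST request for one of the benchmark endpoints."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def post(path: str, body: dict) -> dict:
    with urllib.request.urlopen(build_request(path, body)) as resp:
        return json.load(resp)

def run_episode(task_id: str = "task1_easy") -> float:
    """Reset, take one cleaning action, then submit for the grader score."""
    obs = post("/reset", {"task_id": task_id})
    session_id = obs["session_id"]  # assumed field name
    post("/step", {"session_id": session_id, "action": "remove_duplicates"})
    final = post("/step", {"session_id": session_id, "action": "submit"})
    return final["reward"]  # assumed field name
```

The `/step` compat route takes `session_id` in the body, as shown; the legacy `/step/{session_id}` route addresses the session in the path instead.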