---
title: Data Cleaning OpenEnv Benchmark
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - openenv
---
# Data Cleaning OpenEnv Benchmark
A practical benchmark where LLM agents clean messy tabular datasets through a structured action API.
## Why This Matters
Data cleaning still takes a large share of real analytics work. This environment tests whether an agent can detect and correct common data quality problems such as duplicates, missing values, inconsistent formats, and outliers.
## Tasks

| ID | Difficulty | Description |
|---|---|---|
| task1_easy | Easy | Remove exact duplicates, fill missing emails and ages, standardise country names |
| task2_medium | Medium | Normalise mixed date formats, convert price strings to float, fix category typos |
| task3_hard | Hard | Resolve duplicate user IDs, clip session outliers, fix invalid bounce rates |
| task4_medium_alt | Medium | Alternate order-cleaning scenario that uses the same grader contract as task2_medium |
| task5_hard_alt | Hard | Alternate analytics-cleaning scenario that uses the same grader contract as task3_hard |
Each task is graded independently, and scores are always strictly between 0 and 1.
## Action Space

| Action | Required Fields |
|---|---|
| `fill_missing` | column, strategy (mean/median/mode/constant), value when needed |
| `standardize_values` | column, mapping |
| `remove_duplicates` | None |
| `remove_row` | row_id |
| `convert_type` | column, target_type |
| `clip_outliers` | column, lower, upper |
| `submit` | None |
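As a rough sketch of how these actions might be assembled client-side (the exact wire format is defined by `GET /schema`, so the dict shape below is an assumption; only the action names and required fields come from the table above):

```python
# Required fields per action, taken from the Action Space table above.
# The {"action": ..., **fields} payload shape is an assumption; consult
# GET /schema for the authoritative format.
REQUIRED_FIELDS = {
    "fill_missing": ["column", "strategy"],
    "standardize_values": ["column", "mapping"],
    "remove_duplicates": [],
    "remove_row": ["row_id"],
    "convert_type": ["column", "target_type"],
    "clip_outliers": ["column", "lower", "upper"],
    "submit": [],
}

def make_action(name: str, **fields) -> dict:
    """Build an action dict, checking required fields before sending."""
    if name not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {name}")
    missing = [f for f in REQUIRED_FIELDS[name] if f not in fields]
    if missing:
        raise ValueError(f"{name} is missing required fields: {missing}")
    return {"action": name, **fields}

action = make_action("clip_outliers", column="session_length", lower=0, upper=3600)
```

Validating locally like this avoids spending steps on invalid actions, which carry the larger penalty.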
## Observation Space

At each step the agent receives `table_preview`, `schema_info`, `issues_detected`, `cleaning_log`, `valid_actions`, `step`, and `max_steps`.
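For orientation, here is a sketch of how an agent might use such an observation; the dict below is illustrative only (the real field contents come from `GET /schema`), but the keys match the list above:

```python
# Illustrative observation; keys follow the Observation Space list, while
# the values shown are invented for the example.
obs = {
    "table_preview": [{"id": 1, "email": None}, {"id": 1, "email": None}],
    "schema_info": {"id": "int", "email": "str"},
    "issues_detected": ["duplicates", "missing:email"],
    "cleaning_log": [],
    "valid_actions": ["fill_missing", "remove_duplicates", "submit"],
    "step": 0,
    "max_steps": 20,
}

def should_submit(obs: dict) -> bool:
    """Submit when no issues remain or the step budget is nearly spent."""
    return not obs["issues_detected"] or obs["step"] >= obs["max_steps"] - 1
```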
## Reward Design

Correct cleaning actions earn positive intermediate rewards, wasted actions incur small penalties, invalid actions incur larger penalties, and `submit` returns the final grader score.
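A minimal episode loop under this reward scheme might look like the following, with a stub standing in for the HTTP environment (the response keys `reward`/`done` and the specific reward magnitudes are assumptions for illustration):

```python
# Stub environment illustrating the reward scheme: a positive reward for a
# useful cleaning action, a small penalty for a wasted one, and the final
# grader score on submit. The numbers are invented for the example.
class StubEnv:
    def step(self, action: dict) -> dict:
        if action["action"] == "submit":
            return {"reward": 0.97, "done": True}   # final grader score
        if action["action"] == "remove_duplicates":
            return {"reward": 0.1, "done": False}   # correct cleaning action
        return {"reward": -0.05, "done": False}     # wasted-action penalty

def run_episode(env, actions) -> float:
    """Play actions in order, accumulating reward until the episode ends."""
    total = 0.0
    for action in actions:
        resp = env.step(action)
        total += resp["reward"]
        if resp["done"]:
            break
    return total
```

The same loop drives a real run: replace `StubEnv` with calls to `/reset` and `/step` against the live service.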
## Setup & Local Run

```bash
git clone https://huggingface.co/spaces/AnkushRaheja/data-cleaning-benchmark
cd data-cleaning-benchmark
pip install -r requirements.txt
uvicorn app:app --port 7860
```
## Run Baseline

```bash
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="meta-llama/llama-4-scout-17b-16e-instruct"
export HF_TOKEN="$GROQ_API_KEY"
export TASK_ID="task1_easy"
python inference.py
```
## Docker

```bash
docker build -t data-cleaning-benchmark .
docker run -p 7860:7860 \
  -e API_BASE_URL="https://api.groq.com/openai/v1" \
  -e MODEL_NAME="meta-llama/llama-4-scout-17b-16e-instruct" \
  -e HF_TOKEN="$GROQ_API_KEY" \
  data-cleaning-benchmark
```
## Baseline Scores
| Task | Score |
|---|---|
| task1_easy | 0.99 |
| task2_medium | 0.99 |
| task3_hard | 0.97 |
| task4_medium_alt | 0.99 |
| task5_hard_alt | 0.97 |
## API Reference

| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| POST | `/reset` | Start a new episode: `{"task_id": "task1_easy"}` |
| POST | `/step` | Submit an action and receive a reward (compat route with `session_id` in body/query) |
| POST | `/step/{session_id}` | Legacy route for direct session addressing |
| GET | `/state` | Retrieve state by query (`session_id`) |
| GET | `/state/{session_id}` | Legacy route for direct session addressing |
| GET | `/tasks` | List all tasks |
| GET | `/metadata` | Benchmark metadata, including the task and score-range contract |
| GET | `/schema` | JSON schemas for the action, observation, and step response |
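As a sketch of the request shapes implied by the table above, a thin client can be built with the standard library alone (the base URL matches the local run; request-body fields beyond `task_id` are assumptions):

```python
import json
import urllib.request

# Matches the local `uvicorn app:app --port 7860` run from Setup & Local Run.
BASE = "http://localhost:7860"

def reset_request(task_id: str) -> urllib.request.Request:
    """POST /reset with a task_id body to start a new episode."""
    body = json.dumps({"task_id": task_id}).encode()
    return urllib.request.Request(
        f"{BASE}/reset", data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )

def step_request(session_id: str, action: dict) -> urllib.request.Request:
    """POST /step/{session_id}, the direct session-addressing route."""
    body = json.dumps(action).encode()
    return urllib.request.Request(
        f"{BASE}/step/{session_id}", data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )

# Sending is a one-liner once the server is up:
# with urllib.request.urlopen(reset_request("task1_easy")) as resp:
#     episode = json.load(resp)
```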