---
title: Data Cleaning OpenEnv Benchmark
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - openenv
---

# Data Cleaning OpenEnv Benchmark

A practical benchmark where LLM agents clean messy tabular datasets through a structured action API.

## Why This Matters

Data cleaning still takes a large share of real analytics work. This environment tests whether an agent can detect and correct common data quality problems such as duplicates, missing values, inconsistent formats, and outliers.

## Tasks

| ID | Difficulty | Description |
|---|---|---|
| `task1_easy` | Easy | Remove exact duplicates, fill missing emails and ages, standardise country names |
| `task2_medium` | Medium | Normalise mixed date formats, convert price strings to float, fix category typos |
| `task3_hard` | Hard | Resolve duplicate user IDs, clip session outliers, fix invalid bounce rates |
| `task4_medium_alt` | Medium | Alternate order-cleaning scenario that uses the same grader contract as `task2_medium` |
| `task5_hard_alt` | Hard | Alternate analytics-cleaning scenario that uses the same grader contract as `task3_hard` |

Each task is graded independently, and scores are always strictly between 0 and 1.

## Action Space

| Action | Required Fields |
|---|---|
| `fill_missing` | `column`, `strategy` (`mean`/`median`/`mode`/`constant`), `value` when needed |
| `standardize_values` | `column`, `mapping` |
| `remove_duplicates` | None |
| `remove_row` | `row_id` |
| `convert_type` | `column`, `target_type` |
| `clip_outliers` | `column`, `lower`, `upper` |
| `submit` | None |
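Actions are submitted as JSON objects carrying the fields above. A minimal sketch of assembling such payloads (the exact wire format is an assumption inferred from this README; field names come from the table):

```python
# Sketch: build action payloads matching the table above.
# Assumption: each action is a flat JSON object with an "action" key
# plus its required fields.

def make_action(action: str, **fields) -> dict:
    """Assemble an action dict for POST /step."""
    return {"action": action, **fields}

fill = make_action("fill_missing", column="age", strategy="median")
standardize = make_action(
    "standardize_values",
    column="country",
    mapping={"USA": "United States", "U.S.": "United States"},
)
clip = make_action("clip_outliers", column="session_length", lower=0, upper=3600)
submit = make_action("submit")
```

Actions with no required fields, such as `remove_duplicates` and `submit`, carry only the `action` key.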

## Observation Space

At each step, the agent receives `table_preview`, `schema_info`, `issues_detected`, `cleaning_log`, `valid_actions`, `step`, and `max_steps`.
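For illustration, an observation might look like the following (keys follow the list above; all values are made up and the concrete value types are assumptions):

```python
# Illustrative observation shape. Keys come from the docs above;
# the value types and contents are assumptions for illustration only.
observation = {
    "table_preview": [{"user_id": 1, "email": None, "age": 34}],
    "schema_info": {"user_id": "int", "email": "str", "age": "int"},
    "issues_detected": ["missing values in 'email'"],
    "cleaning_log": [],
    "valid_actions": ["fill_missing", "remove_duplicates", "submit"],
    "step": 0,
    "max_steps": 20,
}
```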

## Reward Design

Correct cleaning actions receive positive intermediate rewards, wasted actions receive small penalties, invalid actions receive larger penalties, and submit returns the final grader score.

## Setup & Local Run

```bash
git clone https://huggingface.co/spaces/AnkushRaheja/data-cleaning-benchmark
cd data-cleaning-benchmark
pip install -r requirements.txt
uvicorn app:app --port 7860
```

## Run Baseline

```bash
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="meta-llama/llama-4-scout-17b-16e-instruct"
export HF_TOKEN="$GROQ_API_KEY"
export TASK_ID="task1_easy"
python inference.py
```

## Docker

```bash
docker build -t data-cleaning-benchmark .
docker run -p 7860:7860 \
  -e API_BASE_URL="https://api.groq.com/openai/v1" \
  -e MODEL_NAME="meta-llama/llama-4-scout-17b-16e-instruct" \
  -e HF_TOKEN="$GROQ_API_KEY" \
  data-cleaning-benchmark
```

## Baseline Scores

| Task | Score |
|---|---|
| `task1_easy` | 0.99 |
| `task2_medium` | 0.99 |
| `task3_hard` | 0.97 |
| `task4_medium_alt` | 0.99 |
| `task5_hard_alt` | 0.97 |

## API Reference

| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| POST | `/reset` | Start a new episode: `{"task_id": "task1_easy"}` |
| POST | `/step` | Submit an action and receive a reward (compat route; `session_id` in body/query) |
| POST | `/step/{session_id}` | Legacy route for direct session addressing |
| GET | `/state` | Retrieve state by query (`session_id`) |
| GET | `/state/{session_id}` | Legacy route for direct session addressing |
| GET | `/tasks` | List all tasks |
| GET | `/metadata` | Benchmark metadata, including the task and score-range contract |
| GET | `/schema` | JSON Schemas for the action/observation/step response |