---
title: Data Cleaning OpenEnv Benchmark
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - openenv
---
# Data Cleaning OpenEnv Benchmark
A practical benchmark where LLM agents clean messy tabular datasets through a structured action API.
## Why This Matters
Data cleaning still takes a large share of real analytics work. This environment tests whether an agent can detect and correct common data quality problems such as duplicates, missing values, inconsistent formats, and outliers.
## Tasks

| ID | Difficulty | Description |
|---|---|---|
| task1_easy | Easy | Remove exact duplicates, fill missing emails and ages, standardise country names |
| task2_medium | Medium | Normalise mixed date formats, convert price strings to float, fix category typos |
| task3_hard | Hard | Resolve duplicate user IDs, clip session outliers, fix invalid bounce rates |
| task4_medium_alt | Medium | Alternate order-cleaning scenario that uses the same grader contract as task2_medium |
| task5_hard_alt | Hard | Alternate analytics-cleaning scenario that uses the same grader contract as task3_hard |
Each task is graded independently, and scores are always strictly between 0 and 1.
## Action Space

| Action | Required Fields |
|---|---|
| `fill_missing` | column, strategy (mean/median/mode/constant), value when needed |
| `standardize_values` | column, mapping |
| `remove_duplicates` | None |
| `remove_row` | row_id |
| `convert_type` | column, target_type |
| `clip_outliers` | column, lower, upper |
| `submit` | None |
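As a rough sketch of how these actions might be assembled client-side (the exact wire format is defined by `GET /schema`, so the dict shape below is an assumption; only the action names and required fields come from the table above):

```python
# Required fields per action, taken from the Action Space table above.
# The {"action": ..., **fields} payload shape is an assumption; consult
# GET /schema for the authoritative format.
REQUIRED_FIELDS = {
    "fill_missing": ["column", "strategy"],
    "standardize_values": ["column", "mapping"],
    "remove_duplicates": [],
    "remove_row": ["row_id"],
    "convert_type": ["column", "target_type"],
    "clip_outliers": ["column", "lower", "upper"],
    "submit": [],
}

def make_action(name: str, **fields) -> dict:
    """Build an action dict, checking required fields before sending."""
    if name not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {name}")
    missing = [f for f in REQUIRED_FIELDS[name] if f not in fields]
    if missing:
        raise ValueError(f"{name} is missing required fields: {missing}")
    return {"action": name, **fields}

action = make_action("clip_outliers", column="session_length", lower=0, upper=3600)
```

Validating locally like this avoids spending steps on invalid actions, which carry the larger penalty.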
## Observation Space

At each step the agent receives `table_preview`, `schema_info`, `issues_detected`, `cleaning_log`, `valid_actions`, `step`, and `max_steps`.
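For orientation, here is a sketch of how an agent might use such an observation; the dict below is illustrative only (the real field contents come from `GET /schema`), but the keys match the list above:

```python
# Illustrative observation; keys follow the Observation Space list, while
# the values shown are invented for the example.
obs = {
    "table_preview": [{"id": 1, "email": None}, {"id": 1, "email": None}],
    "schema_info": {"id": "int", "email": "str"},
    "issues_detected": ["duplicates", "missing:email"],
    "cleaning_log": [],
    "valid_actions": ["fill_missing", "remove_duplicates", "submit"],
    "step": 0,
    "max_steps": 20,
}

def should_submit(obs: dict) -> bool:
    """Submit when no issues remain or the step budget is nearly spent."""
    return not obs["issues_detected"] or obs["step"] >= obs["max_steps"] - 1
```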
## Reward Design

Correct cleaning actions earn positive intermediate rewards, wasted actions incur small penalties, invalid actions incur larger penalties, and `submit` returns the final grader score.
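A minimal episode loop under this reward scheme might look like the following, with a stub standing in for the HTTP environment (the response keys `reward`/`done` and the specific reward magnitudes are assumptions for illustration):

```python
# Stub environment illustrating the reward scheme: a positive reward for a
# useful cleaning action, a small penalty for a wasted one, and the final
# grader score on submit. The numbers are invented for the example.
class StubEnv:
    def step(self, action: dict) -> dict:
        if action["action"] == "submit":
            return {"reward": 0.97, "done": True}   # final grader score
        if action["action"] == "remove_duplicates":
            return {"reward": 0.1, "done": False}   # correct cleaning action
        return {"reward": -0.05, "done": False}     # wasted-action penalty

def run_episode(env, actions) -> float:
    """Play actions in order, accumulating reward until the episode ends."""
    total = 0.0
    for action in actions:
        resp = env.step(action)
        total += resp["reward"]
        if resp["done"]:
            break
    return total
```

The same loop drives a real run: replace `StubEnv` with calls to `/reset` and `/step` against the live service.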
## Setup & Local Run

```bash
git clone https://huggingface.co/spaces/AnkushRaheja/data-cleaning-benchmark
cd data-cleaning-benchmark
pip install -r requirements.txt
uvicorn app:app --port 7860
```
## Run Baseline

```bash
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="meta-llama/llama-4-scout-17b-16e-instruct"
export HF_TOKEN="$GROQ_API_KEY"
export TASK_ID="task1_easy"
python inference.py
```
## Docker

```bash
docker build -t data-cleaning-benchmark .
docker run -p 7860:7860 \
  -e API_BASE_URL="https://api.groq.com/openai/v1" \
  -e MODEL_NAME="meta-llama/llama-4-scout-17b-16e-instruct" \
  -e HF_TOKEN="$GROQ_API_KEY" \
  data-cleaning-benchmark
```
## Baseline Scores
| Task | Score |
|---|---|
| task1_easy | 0.99 |
| task2_medium | 0.99 |
| task3_hard | 0.97 |
| task4_medium_alt | 0.99 |
| task5_hard_alt | 0.97 |
## API Reference

| Method | Endpoint | Description |
|---|---|---|
| GET | `/health` | Health check |
| POST | `/reset` | Start a new episode: `{"task_id": "task1_easy"}` |
| POST | `/step` | Submit an action and receive a reward (compat route with `session_id` in body/query) |
| POST | `/step/{session_id}` | Legacy route for direct session addressing |
| GET | `/state` | Retrieve state by query (`session_id`) |
| GET | `/state/{session_id}` | Legacy route for direct session addressing |
| GET | `/tasks` | List all tasks |
| GET | `/metadata` | Benchmark metadata, including the task and score-range contract |
| GET | `/schema` | JSON schemas for the action, observation, and step response |
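As a sketch of the request shapes implied by the table above, a thin client can be built with the standard library alone (the base URL matches the local run; request-body fields beyond `task_id` are assumptions):

```python
import json
import urllib.request

# Matches the local `uvicorn app:app --port 7860` run from Setup & Local Run.
BASE = "http://localhost:7860"

def reset_request(task_id: str) -> urllib.request.Request:
    """POST /reset with a task_id body to start a new episode."""
    body = json.dumps({"task_id": task_id}).encode()
    return urllib.request.Request(
        f"{BASE}/reset", data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )

def step_request(session_id: str, action: dict) -> urllib.request.Request:
    """POST /step/{session_id}, the direct session-addressing route."""
    body = json.dumps(action).encode()
    return urllib.request.Request(
        f"{BASE}/step/{session_id}", data=body,
        headers={"Content-Type": "application/json"}, method="POST",
    )

# Sending is a one-liner once the server is up:
# with urllib.request.urlopen(reset_request("task1_easy")) as resp:
#     episode = json.load(resp)
```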