---
title: Data Cleaning OpenEnv Benchmark
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
tags:
  - openenv
---

# Data Cleaning OpenEnv Benchmark

A practical benchmark where LLM agents clean messy tabular datasets through a structured action API.

## Why This Matters

Data cleaning still takes a large share of real analytics work. This environment tests whether an agent can detect and correct common data quality problems such as duplicates, missing values, inconsistent formats, and outliers.

## Tasks

| ID | Difficulty | Description |
|----|-----------|-------------|
| `task1_easy` | Easy | Remove exact duplicates, fill missing emails and ages, standardise country names |
| `task2_medium` | Medium | Normalise mixed date formats, convert price strings to float, fix category typos |
| `task3_hard` | Hard | Resolve duplicate user IDs, clip session outliers, fix invalid bounce rates |
| `task4_medium_alt` | Medium | Alternate order-cleaning scenario that uses the same grader contract as `task2_medium` |
| `task5_hard_alt` | Hard | Alternate analytics-cleaning scenario that uses the same grader contract as `task3_hard` |

Each task is graded independently, and scores always fall strictly between 0 and 1.

## Action Space

| Action | Required Fields |
|--------|----------------|
| `fill_missing` | `column`, `strategy` (`mean`/`median`/`mode`/`constant`), `value` when needed |
| `standardize_values` | `column`, `mapping` |
| `remove_duplicates` | None |
| `remove_row` | `row_id` |
| `convert_type` | `column`, `target_type` |
| `clip_outliers` | `column`, `lower`, `upper` |
| `submit` | None |

## Observation Space

At each step, the agent receives `table_preview`, `schema_info`, `issues_detected`, `cleaning_log`, `valid_actions`, `step`, and `max_steps`.

## Reward Design

Correct cleaning actions receive positive intermediate rewards, wasted actions receive small penalties, invalid actions receive larger penalties, and `submit` returns the final grader score.
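The action table above can be encoded as a small validation helper on the agent side. This is a minimal sketch: the action names and required fields come from the table, but the exact JSON envelope (`{"action": ..., ...}`) is an assumption — check `GET /schema` for the authoritative shape.

```python
# Per-action required fields, taken from the Action Space table above.
REQUIRED_FIELDS = {
    "fill_missing": ["column", "strategy"],
    "standardize_values": ["column", "mapping"],
    "remove_duplicates": [],
    "remove_row": ["row_id"],
    "convert_type": ["column", "target_type"],
    "clip_outliers": ["column", "lower", "upper"],
    "submit": [],
}

def make_action(name: str, **fields) -> dict:
    """Validate required fields for an action and return its payload.

    The flat {"action": name, **fields} envelope is an assumption;
    consult GET /schema for the real contract.
    """
    if name not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {name}")
    missing = [f for f in REQUIRED_FIELDS[name] if f not in fields]
    if missing:
        raise ValueError(f"{name} is missing fields: {missing}")
    return {"action": name, **fields}

# Example: fill missing ages with the column median.
action = make_action("fill_missing", column="age", strategy="median")
```

Validating locally before calling `/step` avoids the larger penalty the environment charges for invalid actions.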
## Setup & Local Run

```bash
git clone https://huggingface.co/spaces/AnkushRaheja/data-cleaning-benchmark
cd data-cleaning-benchmark
pip install -r requirements.txt
uvicorn app:app --port 7860
```

## Run Baseline

```bash
export API_BASE_URL="https://api.groq.com/openai/v1"
export MODEL_NAME="meta-llama/llama-4-scout-17b-16e-instruct"
export HF_TOKEN="$GROQ_API_KEY"
export TASK_ID="task1_easy"
python inference.py
```

## Docker

```bash
docker build -t data-cleaning-benchmark .
docker run -p 7860:7860 \
  -e API_BASE_URL="https://api.groq.com/openai/v1" \
  -e MODEL_NAME="meta-llama/llama-4-scout-17b-16e-instruct" \
  -e HF_TOKEN="$GROQ_API_KEY" \
  data-cleaning-benchmark
```

## Baseline Scores

| Task | Score |
|------|-------|
| task1_easy | 0.99 |
| task2_medium | 0.99 |
| task3_hard | 0.97 |
| task4_medium_alt | 0.99 |
| task5_hard_alt | 0.97 |

## API Reference

| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/health` | Health check |
| POST | `/reset` | Start a new episode: `{"task_id": "task1_easy"}` |
| POST | `/step` | Submit an action and receive a reward (compat route with `session_id` in body or query) |
| POST | `/step/{session_id}` | Legacy route for direct session addressing |
| GET | `/state` | Retrieve state by query (`session_id`) |
| GET | `/state/{session_id}` | Legacy route for direct session addressing |
| GET | `/tasks` | List all tasks |
| GET | `/metadata` | Benchmark metadata including task and score-range contract |
| GET | `/schema` | JSON schemas for action/observation/step response |
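The reset/step loop above can be driven with nothing but the standard library. This is a sketch under assumptions: the endpoint paths come from the API reference, but the response field names used here (`session_id`, `reward`) are hypothetical — consult `GET /schema` for the real contract.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # local uvicorn run from the setup section

def build_request(path: str, body: dict) -> urllib.request.Request:
    """Compose a JSON POST request for one of the benchmark endpoints."""
    return urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def post(path: str, body: dict) -> dict:
    with urllib.request.urlopen(build_request(path, body)) as resp:
        return json.load(resp)

def run_episode(task_id: str = "task1_easy") -> float:
    """Reset, take one cleaning action, then submit for the grader score."""
    obs = post("/reset", {"task_id": task_id})
    session_id = obs["session_id"]  # assumed field name
    post("/step", {"session_id": session_id, "action": "remove_duplicates"})
    final = post("/step", {"session_id": session_id, "action": "submit"})
    return final["reward"]  # assumed field name
```

The `/step` compat route takes `session_id` in the body, as shown; the legacy `/step/{session_id}` route addresses the session in the path instead.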