---
title: Data Cleaning Environment
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
tags:
- openenv
- rl
- data-cleaning
- multi-agent
- data-quality
---
# DataMedic — AI Data Cleaning OpenEnv
An **agentic data quality environment** for training and evaluating AI agents on real-world data cleaning tasks.
An agent interacts with dirty pandas DataFrames through a standard `reset() / step() / state()` HTTP API, learning to fix missing values, duplicate rows, inconsistent formats, statistical outliers, and dtype errors — across **four progressively harder tasks**, including a novel multi-source schema alignment challenge.
🤗 **Live HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space
🖥️ **Live DataMedic UI:** https://srishtichugh-openenv-hack.hf.space
📖 **Interactive API docs:** https://srishtichugh-openenv-hack.hf.space/docs
✅ **Health check:** https://srishtichugh-openenv-hack.hf.space/health
---
## What Makes This Different
Most data cleaning tools are one-shot. DataMedic is an **RL training environment** where:
- The agent **diagnoses** a dirty dataset via `/profile` (completeness, uniqueness, validity %)
- It **plans** a treatment — every observation includes a `plan` field with the next recommended actions
- It **executes** cleaning operations step by step, with dense per-step rewards
- It **receives a health certificate** via `/report` summarising what was fixed and how efficiently
- It **exports** the cleaned result via `/export`
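The diagnose/plan/execute loop can be sketched against the HTTP API like this. This is a minimal illustration using `requests`: the fill-only policy and the `ENV_URL` default are assumptions for the sketch, not the bundled baseline agent.

```python
import requests

ENV_URL = "http://localhost:8000"  # assumption: a locally running server


def fill_missing(column: str, strategy: str = "median") -> dict:
    """Build a fill_missing action payload for POST /step."""
    return {"operation": "fill_missing", "column": column,
            "params": {"strategy": strategy}}


def run_episode(task_id: int = 1, base: str = ENV_URL) -> float:
    """Reset, then repeatedly fill the most-null column until done."""
    obs = requests.post(f"{base}/reset", json={"task_id": task_id},
                        timeout=30).json()["observation"]
    while not obs["done"]:
        missing = {c: n for c, n in obs["missing_counts"].items() if n > 0}
        if not missing:
            break  # this toy policy only handles nulls
        column = max(missing, key=missing.get)
        obs = requests.post(f"{base}/step", json=fill_missing(column),
                            timeout=30).json()["observation"]
    return obs["current_score"]

# Calling run_episode(task_id=1) would drive a full Task 1 episode.
```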
Grounded in peer-reviewed research:
- **Bendinelli et al. 2025** — LLM Agents for Cleaning Tabular ML Datasets (arXiv:2503.06664)
- **CleanAgent** — Qi & Wang 2024 (arXiv:2403.08291)
- **AutoDCWorkflow** — EMNLP 2025 Findings
- **HoloClean** — Rekatsinas et al. 2017
---
## Environment Description & Motivation
Real-world datasets are almost never clean. Data engineers routinely spend 60–80% of their time on data cleaning. This environment turns that work into an RL challenge with:
- **Deterministic, programmatic graders** — ground-truth DataFrames generated with `seed=42`; every reward is reproducible
- **Meaningful partial rewards** — a dense delta reward every step, not just at episode end
- **Four difficulty levels** — easy → medium → hard → expert (multi-source merge)
- **Live DQ metrics** — completeness %, uniqueness %, and validity % in every observation
- **Agentic planning** — the `plan` field recommends next actions; `tried_operations` prevents loops
- **No external data downloads** — all datasets are generated synthetically via `numpy` + `Faker`
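As an illustration of the determinism claim, a generator in this style yields identical dirty frames for the same seed. This is a hypothetical sketch, not the actual `data_generator.py`, which also uses `Faker` for names and produces more columns.

```python
import numpy as np
import pandas as pd


def make_dirty_employees(seed: int = 42, n: int = 100) -> pd.DataFrame:
    """Generate a reproducible dirty frame: ~20% NaN in age and salary."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "age": rng.integers(22, 60, n).astype(float),
        "salary": rng.normal(60_000, 15_000, n).round(2),
    })
    # Knock out a fixed fraction of cells per column, reproducibly.
    for col, frac in [("age", 0.20), ("salary", 0.20)]:
        idx = rng.choice(n, size=int(n * frac), replace=False)
        df.loc[idx, col] = np.nan
    return df
```

Because all randomness flows through one seeded `Generator`, two calls with the same seed produce byte-identical data, which is what makes the graders reproducible.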
---
## DataMedic UI
Open `https://srishtichugh-openenv-hack.hf.space` in your browser to see the live monitoring dashboard:
- **Health Score Ring** — animated score gauge, color-coded by severity (green/amber/red)
- **DQ Dimension Bars** — live completeness, uniqueness, and validity bars, updated each step
- **Score Trajectory Chart** — real-time line chart of score vs. steps
- **Agent Treatment Plan** — the next recommended actions, shown before each step
- **Operation Log** — every action taken, its result, and the reward delta, streamed live
- **Dataset Preview** — the first 10 rows, with NULL values highlighted in red
- **Export CSV** — download the cleaned DataFrame at any point
Click any task button — the dataset loads automatically and the demo agent runs end-to-end.
---
## Action Space
Actions are JSON objects sent to `POST /step`.
| `operation` | Required `column` | `params` | Description |
|---|---|---|---|
| `fill_missing` | ✅ | `{"strategy": "median\|mean\|mode\|constant", "value": ...}` | Fill NaN values in a column |
| `drop_duplicates` | ❌ | — | Remove all duplicate rows |
| `fix_format` | ✅ | — | Standardise phone/date/country format |
| `replace_value` | ✅ | `{"old": ..., "new": ...}` | Replace a specific value |
| `drop_outliers` | ✅ | — | Remove IQR outliers from a numeric column |
| `fix_dtype` | ✅ | `{"dtype": "float\|int\|str"}` | Cast a column to the correct dtype |
| `align_schema` | ❌ | — | Rename Source A columns to the canonical schema *(Task 4 only)* |
| `merge_sources` | ❌ | — | Concatenate aligned Source A + Source B *(Task 4 only)* |
**Format rules enforced by `fix_format`:**
| Column | Target format |
|---|---|
| `phone` | `NNN-NNN-NNNN` |
| `listed_date` / `signup_date` | `YYYY-MM-DD` |
| `country` | Canonical name (`USA`, `UK`, `Canada`, `Australia`, `Germany`) |
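The target formats above can be approximated with a couple of helpers. This is a sketch of plausible normalisation logic, not the server's actual implementation:

```python
import re

import pandas as pd


def normalize_phone(raw) -> str:
    """Coerce a phone variant to NNN-NNN-NNNN if it contains 10 digits."""
    digits = re.sub(r"\D", "", str(raw))[-10:]
    if len(digits) != 10:
        return raw  # leave unparseable values untouched
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def normalize_date(raw) -> str:
    """Coerce mixed date formats to ISO YYYY-MM-DD."""
    try:
        return pd.to_datetime(raw).strftime("%Y-%m-%d")
    except (ValueError, TypeError):
        return raw  # leave unparseable values untouched
```

Applied column-wise with `df["phone"].map(normalize_phone)`, helpers like these would bring mixed-format columns to the canonical forms the grader checks.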
---
## Observation Space
Every `POST /reset` and `POST /step` returns:
```json
{
"observation": {
"done": false,
"reward": 0.40,
"data_preview": "name,age,salary,...\n...",
"data_shape": [100, 5],
"missing_counts": {"age": 20, "salary": 20, "department": 10},
"duplicate_count": 0,
"dtype_issues": {},
"task_description": "Task 1 (Easy) — Fill Missing Values\n...",
"message": "Filled 20 missing values in 'age' using median.",
"step_count": 1,
"current_score": 0.4000,
"dq_metrics": {
"completeness_pct": 86.67,
"uniqueness_pct": 100.0,
"validity_pct": 94.5,
"total_cells": 500,
"null_cells": 50,
"duplicate_rows": 0,
"invalid_cells": 12
},
"tried_operations": ["fill_missing:age"],
"plan": [
"fill_missing on \"salary\" (20 nulls) using median",
"fill_missing on \"department\" (10 nulls) using mode"
]
},
"reward": 0.40,
"done": false,
"info": {}
}
```
| Field | Type | Description |
|---|---|---|
| `done` | bool | Episode finished (score ≥ 0.95 or max steps reached) |
| `reward` | float | Per-step delta reward |
| `data_preview` | string | First 10 rows as CSV |
| `data_shape` | [int, int] | Current `[rows, cols]` |
| `missing_counts` | object | `{column: null_count}` for columns with NaN |
| `duplicate_count` | int | Number of duplicate rows |
| `dtype_issues` | object | `{column: issue_description}` |
| `task_description` | string | Full task instructions |
| `message` | string | Human-readable result of last action |
| `step_count` | int | Steps taken this episode |
| `current_score` | float | Running grader score 0.0–1.0 |
| `dq_metrics` | object | Completeness / uniqueness / validity % + raw counts |
| `tried_operations` | array | Operations already applied — prevents agent loops |
| `plan` | array | Up to 3 recommended next actions (rule-based planning engine) |
---
## Tasks
### Task 1 — Fill Missing Values *(Easy)*
| Property | Value |
|---|---|
| Dataset | 100-row employee records (name, age, salary, department, experience) |
| Issues | ~20% NaN in `age`, `salary`; ~10% NaN in `department` |
| Goal | Fill all missing values |
| Valid operations | `fill_missing` |
| Grader | `1.0 − remaining_nulls / original_nulls` |
| Max steps | 20 |
| Optimal steps | 3 |
### Task 2 — Fix Formats + Remove Duplicates *(Medium)*
| Property | Value |
|---|---|
| Dataset | 215-row product catalog (product_id, price, category, phone, listed_date) |
| Issues | ~60% phone numbers in mixed formats, ~60% dates in mixed formats, 15 duplicate rows |
| Goal | Standardise all phone/date formats and remove duplicates |
| Valid operations | `fix_format`, `drop_duplicates` |
| Grader | `0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score` |
| Max steps | 30 |
| Optimal steps | 3 |
### Task 3 — Full Cleaning Pipeline *(Hard)*
| Property | Value |
|---|---|
| Dataset | 320-row customer database (name, age, purchase_amount, country, email, signup_date) |
| Issues | Missing values (4 cols), 20 duplicate rows, outliers in `purchase_amount`, mixed country case, mixed date formats |
| Goal | Fix all issues end-to-end |
| Valid operations | All 6 operations |
| Grader | `0.25×null + 0.20×dupe + 0.20×outlier + 0.175×country + 0.175×date` |
| Max steps | 40 |
| Optimal steps | 8 |
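The Task 3 grader is a weighted sum of per-issue sub-scores, and its weights cover the whole unit interval. A quick sketch, with sub-score names taken from the grader formula in the table above:

```python
TASK3_WEIGHTS = {"null": 0.25, "dupe": 0.20, "outlier": 0.20,
                 "country": 0.175, "date": 0.175}


def task3_score(subscores: dict) -> float:
    """Weighted sum of per-issue sub-scores, each assumed to be in [0, 1]."""
    return sum(w * subscores[name] for name, w in TASK3_WEIGHTS.items())
```

Because the weights sum to 1.0, a perfect clean on every dimension yields exactly 1.0, and fixing only nulls caps the score at 0.25.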
### Task 4 — Multi-Source Schema Alignment + Merge *(Expert)*
| Property | Value |
|---|---|
| Source A | 150-row CRM export: `cust_id, full_name, Age, purchase_amt, Country, signup, email` |
| Source B | 100-row Marketing export: `customer_id, name, age_years, spend, country_name, registration_date, email` |
| Issues | Misaligned schemas, missing values, mixed country case, mixed date formats, 10 duplicate rows |
| Goal | Align schemas → merge → clean |
| Valid operations | `align_schema`, `merge_sources`, `fill_missing`, `fix_format`, `drop_duplicates` |
| Grader | `0.30×schema + 0.25×null + 0.20×country + 0.15×date + 0.10×dupe` |
| Max steps | 50 |
| Optimal steps | 8 |
*Inspired by Meta's DataSchema system — column-level semantic annotation across misaligned sources.*
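In pandas terms, the alignment step amounts to a column rename followed by a concatenation. A sketch, assuming Source B's column names are the canonical schema (which the source tables above suggest) and using a hypothetical rename map:

```python
import pandas as pd

# Hypothetical Source A -> canonical (Source B) column mapping.
RENAME_A = {"cust_id": "customer_id", "full_name": "name", "Age": "age_years",
            "purchase_amt": "spend", "Country": "country_name",
            "signup": "registration_date"}  # "email" already matches


def align_and_merge(source_a: pd.DataFrame,
                    source_b: pd.DataFrame) -> pd.DataFrame:
    """align_schema + merge_sources expressed as plain pandas operations."""
    aligned = source_a.rename(columns=RENAME_A)
    return pd.concat([aligned, source_b], ignore_index=True)
```

After this, the merged frame has one canonical schema and the remaining issues (nulls, country case, date formats, duplicates) can be handled with the ordinary Task 3 operations.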
---
## Reward Function
| Scenario | Reward |
|---|---|
| Score improves (delta > 0) | `new_score − old_score` (positive) |
| Operation had no effect | `−0.01` |
| Invalid operation / bad column | `βˆ’0.05` |
Rewards are bounded to **[−0.05, 0.99]**, so the agent receives a dense signal at every step.
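Put together, the per-step reward might be computed roughly like this. This is a sketch inferred from the table above; the argument names and the exact clamping are assumptions:

```python
def step_reward(old_score: float, new_score: float,
                valid: bool = True, changed: bool = True) -> float:
    """Dense per-step reward: score delta, with small penalties for no-ops."""
    if not valid:
        return -0.05  # invalid operation / bad column
    if not changed:
        return -0.01  # operation had no effect
    # Score improvement, clamped to the documented bounds.
    return max(-0.05, min(0.99, new_score - old_score))
```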
---
## Intelligence Endpoints (Phase 2)
| Method | Path | Description |
|---|---|---|
| `GET` | `/profile` | Rich per-column DQ profile — null %, unique %, min/max/mean, top values |
| `GET` | `/report` | Full episode cleaning summary — score improvement, efficiency, issues fixed |
| `GET` | `/export` | Download the current cleaned DataFrame as CSV |
### `/profile` response example
```json
{
"dq_metrics": {
"completeness_pct": 90.0,
"uniqueness_pct": 100.0,
"validity_pct": 88.5
},
"columns": {
"age": {"null_count": 20, "null_pct": 20.0, "min": 22, "max": 59, "mean": 40.3}
}
}
```
### `/report` response example
```json
{
"initial_score": 0.01,
"final_score": 0.99,
"score_improvement": 0.98,
"steps_taken": 3,
"step_efficiency_pct": 85.0,
"issues_fixed": {"nulls_filled": 50, "dupes_removed": 15, "formats_fixed": 168},
"completed": true
}
```
---
## All API Endpoints
| Method | Path | Description |
|---|---|---|
| `GET` | `/` | DataMedic live monitoring UI |
| `GET` | `/health` | Health check → `{"status": "healthy"}` |
| `POST` | `/reset` | Start episode. Body: `{"task_id": 1\|2\|3\|4}` |
| `POST` | `/step` | Execute action. Body: action JSON |
| `GET` | `/state` | Episode metadata |
| `GET` | `/metadata` | Environment info + paper citations |
| `GET` | `/schema` | Full action/observation/state JSON schemas |
| `GET` | `/profile` | Rich data quality profile of current DataFrame |
| `GET` | `/report` | Full episode cleaning summary |
| `GET` | `/export` | Download cleaned DataFrame as CSV |
| `GET` | `/docs` | Interactive Swagger UI |
---
## Baseline Scores
| Task | Difficulty | Score |
|---|---|---|
| 1 — Fill Missing Values | Easy | 0.999 |
| 2 — Fix Formats + Duplicates | Medium | 0.999 |
| 3 — Full Cleaning Pipeline | Hard | 0.999 |
| 4 — Multi-Source Merge | Expert | 0.990 |
| **Average** | — | **0.997** |
---
## Setup & Usage
### Prerequisites
- Python 3.11+
- Docker (for containerised deployment)
### Local β€” Python
```bash
git clone https://github.com/Tanvi51204/openEnv.git
cd openEnv
pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```
Then open:
- UI: http://localhost:8000
- Docs: http://localhost:8000/docs
### Local β€” Docker
```bash
docker build -t data-cleaning-env .
docker run -p 8000:8000 data-cleaning-env
```
### Run baseline inference
```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="sk-..."
export ENV_URL="http://localhost:8000"
python inference.py
```
Produces `[START]` / `[STEP]` / `[END]` lines to stdout and `baseline_scores.json`.
### Environment variables
| Variable | Default | Description |
|---|---|---|
| `API_BASE_URL` | `https://api.openai.com/v1` | LLM API endpoint (OpenAI-compatible) |
| `MODEL_NAME` | `gpt-4o-mini` | Model identifier |
| `HF_TOKEN` | — | API key for LLM calls |
| `ENV_URL` | `http://localhost:8000` | Environment server URL |
---
## Project Structure
```
openenv-data-cleaning/
├── models.py              Pydantic contracts — Action / Observation / State / DQMetrics / Report
├── client.py              Sync HTTP client (reset / step / state / health)
├── inference.py           Baseline LLM agent with [START]/[STEP]/[END] logging
├── Dockerfile             python:3.11-slim, non-root user, HEALTHCHECK
├── requirements.txt       pip dependencies
└── server/
    ├── app.py             FastAPI routes + /profile + /report + /export + UI
    ├── environment.py     reset / step / state + 8 operations + planning engine + DQ metrics
    ├── data_generator.py  Synthetic dataset generation (seed=42, reproducible)
    ├── ui.html            DataMedic live monitoring dashboard
    └── tasks/
        ├── task1_missing.py   Easy — fill-NaN grader
        ├── task2_format.py    Medium — format + duplicates grader
        ├── task3_pipeline.py  Hard — full-pipeline grader
        └── task4_merge.py     Expert — multi-source schema alignment + merge grader
```
---
## Live Demo
🤗 **HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space
- UI: https://srishtichugh-openenv-hack.hf.space
- Health: https://srishtichugh-openenv-hack.hf.space/health
- Docs: https://srishtichugh-openenv-hack.hf.space/docs
- Profile: https://srishtichugh-openenv-hack.hf.space/profile
- Report: https://srishtichugh-openenv-hack.hf.space/report