Spaces:
Sleeping
Sleeping
| title: Data Cleaning Environment | |
| emoji: π§Ή | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: docker | |
| pinned: false | |
| app_port: 8000 | |
| tags: | |
| - openenv | |
| - rl | |
| - data-cleaning | |
| - multi-agent | |
| - data-quality | |
| # DataMedic β AI Data Cleaning OpenEnv | |
| An **agentic data quality environment** for training and evaluating AI agents on real-world data cleaning tasks. | |
| An agent interacts with dirty pandas DataFrames through a standard `reset() / step() / state()` HTTP API, learning to fix missing values, duplicate rows, inconsistent formats, statistical outliers, and dtype errors β across **four progressively harder tasks** including a novel multi-source schema alignment challenge. | |
| π€ **Live HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space | |
| π₯οΈ **Live DataMedic UI:** https://srishtichugh-openenv-hack.hf.space | |
| π **Interactive API docs:** https://srishtichugh-openenv-hack.hf.space/docs | |
| β **Health check:** https://srishtichugh-openenv-hack.hf.space/health | |
| --- | |
| ## What Makes This Different | |
| Most data cleaning tools are one-shot. DataMedic is an **RL training environment** where: | |
| - The agent **diagnoses** a dirty dataset via `/profile` (completeness, uniqueness, validity %) | |
| - It **plans** a treatment β every observation includes a `plan` field with the next recommended actions | |
| - It **executes** cleaning operations step by step with dense per-step rewards | |
| - It **receives a health certificate** via `/report` summarising what was fixed and how efficiently | |
| - It **exports** the cleaned result via `/export` | |
| Grounded in peer-reviewed research: | |
| - **Bendinelli et al. 2025** β LLM Agents for Cleaning Tabular ML Datasets (arXiv:2503.06664) | |
| - **CleanAgent** β Qi & Wang 2024 (arXiv:2403.08291) | |
| - **AutoDCWorkflow** β EMNLP 2025 Findings | |
| - **HoloClean** β Rekatsinas et al. 2017 | |
| --- | |
| ## Environment Description & Motivation | |
| Real-world datasets are almost never clean. Data engineers routinely spend 60β80% of their time on data cleaning. This environment turns that into an RL challenge with: | |
| - **Deterministic, programmatic graders** β ground-truth DataFrames generated with `seed=42`; every reward is reproducible | |
| - **Meaningful partial rewards** β dense delta reward every step, not just at episode end | |
| - **Four difficulty levels** β easy β medium β hard β expert (multi-source merge) | |
| - **Live DQ metrics** β completeness %, uniqueness %, validity % in every observation | |
| - **Agentic planning** β `plan` field recommends next actions; `tried_operations` prevents loops | |
| - **No external data downloads** β all datasets generated synthetically via `numpy` + `Faker` | |
| --- | |
| ## DataMedic UI | |
| Open `https://srishtichugh-openenv-hack.hf.space` in your browser to see the live monitoring dashboard: | |
| - **Health Score Ring** β animated score gauge, color-coded by severity (green/amber/red) | |
| - **DQ Dimension Bars** β live completeness, uniqueness, validity bars updating each step | |
| - **Score Trajectory Chart** β real-time line chart of score vs steps | |
| - **Agent Treatment Plan** β next recommended actions shown before each step | |
| - **Operation Log** β every action taken, result, and reward delta streamed live | |
| - **Dataset Preview** β first 10 rows with NULL values highlighted in red | |
| - **Export CSV** β download the cleaned DataFrame at any point | |
| Click any task button β the dataset loads automatically and the demo agent runs end-to-end. | |
| --- | |
| ## Action Space | |
| Actions are JSON objects sent to `POST /step`. | |
| | `operation` | Required `column` | `params` | Description | | |
| |---|---|---|---| | |
| | `fill_missing` | β | `{"strategy": "median\|mean\|mode\|constant", "value": ...}` | Fill NaN values in a column | | |
| | `drop_duplicates` | β | β | Remove all duplicate rows | | |
| | `fix_format` | β | β | Standardise phone/date/country format | | |
| | `replace_value` | β | `{"old": ..., "new": ...}` | Replace a specific value | | |
| | `drop_outliers` | β | β | Remove IQR outliers from a numeric column | | |
| | `fix_dtype` | β | `{"dtype": "float\|int\|str"}` | Cast column to correct dtype | | |
| | `align_schema` | β | β | Rename Source A columns to canonical schema *(Task 4 only)* | | |
| | `merge_sources` | β | β | Concatenate aligned Source A + Source B *(Task 4 only)* | | |
| **Format rules enforced by `fix_format`:** | |
| | Column | Target format | | |
| |---|---| | |
| | `phone` | `NNN-NNN-NNNN` | | |
| | `listed_date` / `signup_date` | `YYYY-MM-DD` | | |
| | `country` | Canonical name (`USA`, `UK`, `Canada`, `Australia`, `Germany`) | | |
| --- | |
| ## Observation Space | |
| Every `POST /reset` and `POST /step` returns: | |
| ```json | |
| { | |
| "observation": { | |
| "done": false, | |
| "reward": 0.40, | |
| "data_preview": "name,age,salary,...\n...", | |
| "data_shape": [100, 5], | |
| "missing_counts": {"age": 20, "salary": 20, "department": 10}, | |
| "duplicate_count": 0, | |
| "dtype_issues": {}, | |
| "task_description": "Task 1 (Easy) β Fill Missing Values\n...", | |
| "message": "Filled 20 missing values in 'age' using median.", | |
| "step_count": 1, | |
| "current_score": 0.4000, | |
| "dq_metrics": { | |
| "completeness_pct": 86.67, | |
| "uniqueness_pct": 100.0, | |
| "validity_pct": 94.5, | |
| "total_cells": 500, | |
| "null_cells": 50, | |
| "duplicate_rows": 0, | |
| "invalid_cells": 12 | |
| }, | |
| "tried_operations": ["fill_missing:age"], | |
| "plan": [ | |
| "fill_missing on \"salary\" (20 nulls) using median", | |
| "fill_missing on \"department\" (10 nulls) using mode" | |
| ] | |
| }, | |
| "reward": 0.40, | |
| "done": false, | |
| "info": {} | |
| } | |
| ``` | |
| | Field | Type | Description | | |
| |---|---|---| | |
| | `done` | bool | Episode finished (score β₯ 0.95 or max steps reached) | | |
| | `reward` | float | Per-step delta reward | | |
| | `data_preview` | string | First 10 rows as CSV | | |
| | `data_shape` | [int, int] | Current `[rows, cols]` | | |
| | `missing_counts` | object | `{column: null_count}` for columns with NaN | | |
| | `duplicate_count` | int | Number of duplicate rows | | |
| | `dtype_issues` | object | `{column: issue_description}` | | |
| | `task_description` | string | Full task instructions | | |
| | `message` | string | Human-readable result of last action | | |
| | `step_count` | int | Steps taken this episode | | |
| | `current_score` | float | Running grader score 0.0β1.0 | | |
| | `dq_metrics` | object | Completeness / uniqueness / validity % + raw counts | | |
| | `tried_operations` | array | Operations already applied β prevents agent loops | | |
| | `plan` | array | Up to 3 recommended next actions (rule-based planning engine) | | |
| --- | |
| ## Tasks | |
| ### Task 1 β Fill Missing Values *(Easy)* | |
| | Property | Value | | |
| |---|---| | |
| | Dataset | 100-row employee records (name, age, salary, department, experience) | | |
| | Issues | ~20% NaN in `age`, `salary`; ~10% NaN in `department` | | |
| | Goal | Fill all missing values | | |
| | Valid operations | `fill_missing` | | |
| | Grader | `1.0 β remaining_nulls / original_nulls` | | |
| | Max steps | 20 | | |
| | Optimal steps | 3 | | |
| ### Task 2 β Fix Formats + Remove Duplicates *(Medium)* | |
| | Property | Value | | |
| |---|---| | |
| | Dataset | 215-row product catalog (product_id, price, category, phone, listed_date) | | |
| | Issues | ~60% phone numbers in mixed formats, ~60% dates in mixed formats, 15 duplicate rows | | |
| | Goal | Standardise all phone/date formats and remove duplicates | | |
| | Valid operations | `fix_format`, `drop_duplicates` | | |
| | Grader | `0.35 Γ phone_score + 0.35 Γ date_score + 0.30 Γ dupe_score` | | |
| | Max steps | 30 | | |
| | Optimal steps | 3 | | |
| ### Task 3 β Full Cleaning Pipeline *(Hard)* | |
| | Property | Value | | |
| |---|---| | |
| | Dataset | 320-row customer database (name, age, purchase_amount, country, email, signup_date) | | |
| | Issues | Missing values (4 cols), 20 duplicate rows, outliers in `purchase_amount`, mixed country case, mixed date formats | | |
| | Goal | Fix all issues end-to-end | | |
| | Valid operations | All 6 operations | | |
| | Grader | `0.25Γnull + 0.20Γdupe + 0.20Γoutlier + 0.175Γcountry + 0.175Γdate` | | |
| | Max steps | 40 | | |
| | Optimal steps | 8 | | |
| ### Task 4 β Multi-Source Schema Alignment + Merge *(Expert)* | |
| | Property | Value | | |
| |---|---| | |
| | Source A | 150-row CRM export: `cust_id, full_name, Age, purchase_amt, Country, signup, email` | | |
| | Source B | 100-row Marketing export: `customer_id, name, age_years, spend, country_name, registration_date, email` | | |
| | Issues | Misaligned schemas, missing values, mixed country case, mixed date formats, 10 duplicate rows | | |
| | Goal | Align schemas β merge β clean | | |
| | Valid operations | `align_schema`, `merge_sources`, `fill_missing`, `fix_format`, `drop_duplicates` | | |
| | Grader | `0.30Γschema + 0.25Γnull + 0.20Γcountry + 0.15Γdate + 0.10Γdupe` | | |
| | Max steps | 50 | | |
| | Optimal steps | 8 | | |
| *Inspired by Meta's DataSchema system β column-level semantic annotation across misaligned sources.* | |
| --- | |
| ## Reward Function | |
| | Scenario | Reward | | |
| |---|---| | |
| | Score improves (delta > 0) | `new_score β old_score` (positive) | | |
| | Operation had no effect | `β0.01` | | |
| | Invalid operation / bad column | `β0.05` | | |
| Rewards are bounded to **[β0.05, 0.99]**. Dense signal every step. | |
| --- | |
| ## Intelligence Endpoints (Phase 2) | |
| | Method | Path | Description | | |
| |---|---|---| | |
| | `GET` | `/profile` | Rich per-column DQ profile β null %, unique %, min/max/mean, top values | | |
| | `GET` | `/report` | Full episode cleaning summary β score improvement, efficiency, issues fixed | | |
| | `GET` | `/export` | Download current cleaned DataFrame as CSV | | |
| ### `/profile` response example | |
| ```json | |
| { | |
| "dq_metrics": { | |
| "completeness_pct": 90.0, | |
| "uniqueness_pct": 100.0, | |
| "validity_pct": 88.5 | |
| }, | |
| "columns": { | |
| "age": {"null_count": 20, "null_pct": 20.0, "min": 22, "max": 59, "mean": 40.3} | |
| } | |
| } | |
| ``` | |
| ### `/report` response example | |
| ```json | |
| { | |
| "initial_score": 0.01, | |
| "final_score": 0.99, | |
| "score_improvement": 0.98, | |
| "steps_taken": 3, | |
| "step_efficiency_pct": 85.0, | |
| "issues_fixed": {"nulls_filled": 50, "dupes_removed": 15, "formats_fixed": 168}, | |
| "completed": true | |
| } | |
| ``` | |
| --- | |
| ## All API Endpoints | |
| | Method | Path | Description | | |
| |---|---|---| | |
| | `GET` | `/` | DataMedic live monitoring UI | | |
| | `GET` | `/health` | Health check β `{"status": "healthy"}` | | |
| | `POST` | `/reset` | Start episode. Body: `{"task_id": 1\|2\|3\|4}` | | |
| | `POST` | `/step` | Execute action. Body: action JSON | | |
| | `GET` | `/state` | Episode metadata | | |
| | `GET` | `/metadata` | Environment info + paper citations | | |
| | `GET` | `/schema` | Full action/observation/state JSON schemas | | |
| | `GET` | `/profile` | Rich data quality profile of current DataFrame | | |
| | `GET` | `/report` | Full episode cleaning summary | | |
| | `GET` | `/export` | Download cleaned DataFrame as CSV | | |
| | `GET` | `/docs` | Interactive Swagger UI | | |
| --- | |
| ## Baseline Scores | |
| | Task | Difficulty | Score | | |
| |---|---|---| | |
| | 1 β Fill Missing Values | Easy | 0.999 | | |
| | 2 β Fix Formats + Duplicates | Medium | 0.999 | | |
| | 3 β Full Cleaning Pipeline | Hard | 0.999 | | |
| | 4 β Multi-Source Merge | Expert | 0.990 | | |
| | **Average** | β | **0.997** | | |
| --- | |
| ## Setup & Usage | |
| ### Prerequisites | |
| - Python 3.11+ | |
| - Docker (for containerised deployment) | |
| ### Local β Python | |
| ```bash | |
| git clone https://github.com/Tanvi51204/openEnv.git | |
| cd openEnv | |
| pip install -r requirements.txt | |
| python -m uvicorn server.app:app --host 0.0.0.0 --port 8000 | |
| ``` | |
| Then open: | |
| - UI: http://localhost:8000 | |
| - Docs: http://localhost:8000/docs | |
| ### Local β Docker | |
| ```bash | |
| docker build -t data-cleaning-env . | |
| docker run -p 8000:8000 data-cleaning-env | |
| ``` | |
| ### Run baseline inference | |
| ```bash | |
| export API_BASE_URL="https://api.openai.com/v1" | |
| export MODEL_NAME="gpt-4o-mini" | |
| export HF_TOKEN="sk-..." | |
| export ENV_URL="http://localhost:8000" | |
| python inference.py | |
| ``` | |
| Produces `[START]` / `[STEP]` / `[END]` lines to stdout and `baseline_scores.json`. | |
| ### Environment variables | |
| | Variable | Default | Description | | |
| |---|---|---| | |
| | `API_BASE_URL` | `https://api.openai.com/v1` | LLM API endpoint (OpenAI-compatible) | | |
| | `MODEL_NAME` | `gpt-4o-mini` | Model identifier | | |
| | `HF_TOKEN` | β | API key for LLM calls | | |
| | `ENV_URL` | `http://localhost:8000` | Environment server URL | | |
| --- | |
| ## Project Structure | |
| ``` | |
| openenv-data-cleaning/ | |
| βββ models.py Pydantic contracts β Action / Observation / State / DQMetrics / Report | |
| βββ client.py Sync HTTP client (reset / step / state / health) | |
| βββ inference.py Baseline LLM agent with [START]/[STEP]/[END] logging | |
| βββ Dockerfile python:3.11-slim, non-root user, HEALTHCHECK | |
| βββ requirements.txt pip dependencies | |
| βββ server/ | |
| βββ app.py FastAPI routes + /profile + /report + /export + UI | |
| βββ environment.py reset / step / state + 8 operations + planning engine + DQ metrics | |
| βββ data_generator.py Synthetic dataset generation (seed=42, reproducible) | |
| βββ ui.html DataMedic live monitoring dashboard | |
| βββ tasks/ | |
| βββ task1_missing.py Easy β fill NaN grader | |
| βββ task2_format.py Medium β format + duplicates grader | |
| βββ task3_pipeline.py Hard β full pipeline grader | |
| βββ task4_merge.py Expert β multi-source schema alignment + merge grader | |
| ``` | |
| --- | |
| ## Live Demo | |
| π€ **HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space | |
| - UI: https://srishtichugh-openenv-hack.hf.space | |
| - Health: https://srishtichugh-openenv-hack.hf.space/health | |
| - Docs: https://srishtichugh-openenv-hack.hf.space/docs | |
| - Profile: https://srishtichugh-openenv-hack.hf.space/profile | |
| - Report: https://srishtichugh-openenv-hack.hf.space/report |