---
title: Data Cleaning Environment
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
  - rl
  - data-cleaning
  - multi-agent
  - data-quality
---

# DataMedic — AI Data Cleaning OpenEnv

An **agentic data quality environment** for training and evaluating AI agents on real-world data cleaning tasks. An agent interacts with dirty pandas DataFrames through a standard `reset() / step() / state()` HTTP API, learning to fix missing values, duplicate rows, inconsistent formats, statistical outliers, and dtype errors — across **four progressively harder tasks**, including a novel multi-source schema alignment challenge.

🤗 **Live HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space
🖥️ **Live DataMedic UI:** https://srishtichugh-openenv-hack.hf.space
📖 **Interactive API docs:** https://srishtichugh-openenv-hack.hf.space/docs
✅ **Health check:** https://srishtichugh-openenv-hack.hf.space/health

---

## What Makes This Different

Most data cleaning tools are one-shot. DataMedic is an **RL training environment** where:

- The agent **diagnoses** a dirty dataset via `/profile` (completeness, uniqueness, validity %)
- It **plans** a treatment — every observation includes a `plan` field with the next recommended actions
- It **executes** cleaning operations step by step with dense per-step rewards
- It **receives a health certificate** via `/report` summarising what was fixed and how efficiently
- It **exports** the cleaned result via `/export`

Grounded in peer-reviewed research:

- **Bendinelli et al. 2025** — LLM Agents for Cleaning Tabular ML Datasets (arXiv:2503.06664)
- **CleanAgent** — Qi & Wang 2024 (arXiv:2403.08291)
- **AutoDCWorkflow** — EMNLP 2025 Findings
- **HoloClean** — Rekatsinas et al. 2017

---

## Environment Description & Motivation

Real-world datasets are almost never clean. Data engineers routinely spend 60–80% of their time on data cleaning.
This environment turns that into an RL challenge with:

- **Deterministic, programmatic graders** — ground-truth DataFrames generated with `seed=42`; every reward is reproducible
- **Meaningful partial rewards** — dense delta reward every step, not just at episode end
- **Four difficulty levels** — easy → medium → hard → expert (multi-source merge)
- **Live DQ metrics** — completeness %, uniqueness %, validity % in every observation
- **Agentic planning** — `plan` field recommends next actions; `tried_operations` prevents loops
- **No external data downloads** — all datasets generated synthetically via `numpy` + `Faker`

---

## DataMedic UI

Open `https://srishtichugh-openenv-hack.hf.space` in your browser to see the live monitoring dashboard:

- **Health Score Ring** — animated score gauge, color-coded by severity (green/amber/red)
- **DQ Dimension Bars** — live completeness, uniqueness, validity bars updating each step
- **Score Trajectory Chart** — real-time line chart of score vs. steps
- **Agent Treatment Plan** — next recommended actions shown before each step
- **Operation Log** — every action taken, its result, and its reward delta, streamed live
- **Dataset Preview** — first 10 rows with NULL values highlighted in red
- **Export CSV** — download the cleaned DataFrame at any point

Click any task button — the dataset loads automatically and the demo agent runs end-to-end.

---

## Action Space

Actions are JSON objects sent to `POST /step`.
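For example, a `fill_missing` action can be built and sent with nothing but the Python standard library. This is a sketch: the body shape follows the action fields documented in this README (`operation` / `column` / `params`), but `GET /schema` is the authoritative contract, so the actual HTTP call is left commented out rather than presented as guaranteed-correct.

```python
import json
import urllib.request

# A fill_missing action targeting the "age" column.
action = {
    "operation": "fill_missing",
    "column": "age",
    "params": {"strategy": "median"},
}

def post_step(env_url: str, action: dict) -> dict:
    """POST one action JSON to /step and return the parsed response."""
    req = urllib.request.Request(
        f"{env_url}/step",
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# With a server running locally (see Setup & Usage below):
# obs = post_step("http://localhost:8000", action)
# print(obs["reward"], obs["observation"]["message"])
```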
| `operation` | Required `column` | `params` | Description |
|---|---|---|---|
| `fill_missing` | ✅ | `{"strategy": "median\|mean\|mode\|constant", "value": ...}` | Fill NaN values in a column |
| `drop_duplicates` | ❌ | — | Remove all duplicate rows |
| `fix_format` | ✅ | — | Standardise phone/date/country format |
| `replace_value` | ✅ | `{"old": ..., "new": ...}` | Replace a specific value |
| `drop_outliers` | ✅ | — | Remove IQR outliers from a numeric column |
| `fix_dtype` | ✅ | `{"dtype": "float\|int\|str"}` | Cast column to correct dtype |
| `align_schema` | ❌ | — | Rename Source A columns to canonical schema *(Task 4 only)* |
| `merge_sources` | ❌ | — | Concatenate aligned Source A + Source B *(Task 4 only)* |

**Format rules enforced by `fix_format`:**

| Column | Target format |
|---|---|
| `phone` | `NNN-NNN-NNNN` |
| `listed_date` / `signup_date` | `YYYY-MM-DD` |
| `country` | Canonical name (`USA`, `UK`, `Canada`, `Australia`, `Germany`) |

---

## Observation Space

Every `POST /reset` and `POST /step` returns:

```json
{
  "observation": {
    "done": false,
    "reward": 0.40,
    "data_preview": "name,age,salary,...\n...",
    "data_shape": [100, 5],
    "missing_counts": {"age": 20, "salary": 20, "department": 10},
    "duplicate_count": 0,
    "dtype_issues": {},
    "task_description": "Task 1 (Easy) — Fill Missing Values\n...",
    "message": "Filled 20 missing values in 'age' using median.",
    "step_count": 1,
    "current_score": 0.4000,
    "dq_metrics": {
      "completeness_pct": 86.67,
      "uniqueness_pct": 100.0,
      "validity_pct": 94.5,
      "total_cells": 500,
      "null_cells": 50,
      "duplicate_rows": 0,
      "invalid_cells": 12
    },
    "tried_operations": ["fill_missing:age"],
    "plan": [
      "fill_missing on \"salary\" (20 nulls) using median",
      "fill_missing on \"department\" (10 nulls) using mode"
    ]
  },
  "reward": 0.40,
  "done": false,
  "info": {}
}
```

| Field | Type | Description |
|---|---|---|
| `done` | bool | Episode finished (score ≥ 0.95 or max steps reached) |
| `reward` | float | Per-step delta reward |
| `data_preview` | string | First 10 rows as CSV |
| `data_shape` | [int, int] | Current `[rows, cols]` |
| `missing_counts` | object | `{column: null_count}` for columns with NaN |
| `duplicate_count` | int | Number of duplicate rows |
| `dtype_issues` | object | `{column: issue_description}` |
| `task_description` | string | Full task instructions |
| `message` | string | Human-readable result of the last action |
| `step_count` | int | Steps taken this episode |
| `current_score` | float | Running grader score, 0.0–1.0 |
| `dq_metrics` | object | Completeness / uniqueness / validity % + raw counts |
| `tried_operations` | array | Operations already applied — prevents agent loops |
| `plan` | array | Up to 3 recommended next actions (rule-based planning engine) |

---

## Tasks

### Task 1 — Fill Missing Values *(Easy)*

| Property | Value |
|---|---|
| Dataset | 100-row employee records (name, age, salary, department, experience) |
| Issues | ~20% NaN in `age`, `salary`; ~10% NaN in `department` |
| Goal | Fill all missing values |
| Valid operations | `fill_missing` |
| Grader | `1.0 − remaining_nulls / original_nulls` |
| Max steps | 20 |
| Optimal steps | 3 |

### Task 2 — Fix Formats + Remove Duplicates *(Medium)*

| Property | Value |
|---|---|
| Dataset | 215-row product catalog (product_id, price, category, phone, listed_date) |
| Issues | ~60% phone numbers in mixed formats, ~60% dates in mixed formats, 15 duplicate rows |
| Goal | Standardise all phone/date formats and remove duplicates |
| Valid operations | `fix_format`, `drop_duplicates` |
| Grader | `0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score` |
| Max steps | 30 |
| Optimal steps | 3 |

### Task 3 — Full Cleaning Pipeline *(Hard)*

| Property | Value |
|---|---|
| Dataset | 320-row customer database (name, age, purchase_amount, country, email, signup_date) |
| Issues | Missing values (4 cols), 20 duplicate rows, outliers in `purchase_amount`, mixed country case, mixed date formats |
| Goal | Fix all issues end-to-end |
| Valid operations | All 6 core operations |
| Grader | `0.25×null + 0.20×dupe + 0.20×outlier + 0.175×country + 0.175×date` |
| Max steps | 40 |
| Optimal steps | 8 |

### Task 4 — Multi-Source Schema Alignment + Merge *(Expert)*

| Property | Value |
|---|---|
| Source A | 150-row CRM export: `cust_id, full_name, Age, purchase_amt, Country, signup, email` |
| Source B | 100-row Marketing export: `customer_id, name, age_years, spend, country_name, registration_date, email` |
| Issues | Misaligned schemas, missing values, mixed country case, mixed date formats, 10 duplicate rows |
| Goal | Align schemas → merge → clean |
| Valid operations | `align_schema`, `merge_sources`, `fill_missing`, `fix_format`, `drop_duplicates` |
| Grader | `0.30×schema + 0.25×null + 0.20×country + 0.15×date + 0.10×dupe` |
| Max steps | 50 |
| Optimal steps | 8 |

*Inspired by Meta's DataSchema system — column-level semantic annotation across misaligned sources.*

---

## Reward Function

| Scenario | Reward |
|---|---|
| Score improves (delta > 0) | `new_score − old_score` (positive) |
| Operation had no effect | `−0.01` |
| Invalid operation / bad column | `−0.05` |

Rewards are bounded to **[−0.05, 0.99]**. Dense signal every step.
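The reward rule above can be condensed into a few lines. This is my reconstruction from the table, not the server's actual code:

```python
def step_reward(old_score: float, new_score: float, valid_op: bool = True) -> float:
    """Per-step reward following the table above (a sketch, not the server code)."""
    if not valid_op:
        return -0.05               # invalid operation / bad column
    delta = new_score - old_score
    if delta > 0:
        return min(delta, 0.99)    # score improved: positive delta, capped at 0.99
    return -0.01                   # operation had no effect
```

Because the reward is the *delta* in grader score, an agent that fixes the biggest issue first collects the largest reward earliest, and redundant operations cost a small penalty.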
---

## Intelligence Endpoints (Phase 2)

| Method | Path | Description |
|---|---|---|
| `GET` | `/profile` | Rich per-column DQ profile — null %, unique %, min/max/mean, top values |
| `GET` | `/report` | Full episode cleaning summary — score improvement, efficiency, issues fixed |
| `GET` | `/export` | Download current cleaned DataFrame as CSV |

### `/profile` response example

```json
{
  "dq_metrics": {
    "completeness_pct": 90.0,
    "uniqueness_pct": 100.0,
    "validity_pct": 88.5
  },
  "columns": {
    "age": {"null_count": 20, "null_pct": 20.0, "min": 22, "max": 59, "mean": 40.3}
  }
}
```

### `/report` response example

```json
{
  "initial_score": 0.01,
  "final_score": 0.99,
  "score_improvement": 0.98,
  "steps_taken": 3,
  "step_efficiency_pct": 85.0,
  "issues_fixed": {"nulls_filled": 50, "dupes_removed": 15, "formats_fixed": 168},
  "completed": true
}
```

---

## All API Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/` | DataMedic live monitoring UI |
| `GET` | `/health` | Health check → `{"status": "healthy"}` |
| `POST` | `/reset` | Start episode. Body: `{"task_id": 1\|2\|3\|4}` |
| `POST` | `/step` | Execute action. Body: action JSON |
| `GET` | `/state` | Episode metadata |
| `GET` | `/metadata` | Environment info + paper citations |
| `GET` | `/schema` | Full action/observation/state JSON schemas |
| `GET` | `/profile` | Rich data quality profile of current DataFrame |
| `GET` | `/report` | Full episode cleaning summary |
| `GET` | `/export` | Download cleaned DataFrame as CSV |
| `GET` | `/docs` | Interactive Swagger UI |

---

## Baseline Scores

| Task | Difficulty | Score |
|---|---|---|
| 1 — Fill Missing Values | Easy | 0.999 |
| 2 — Fix Formats + Duplicates | Medium | 0.999 |
| 3 — Full Cleaning Pipeline | Hard | 0.999 |
| 4 — Multi-Source Merge | Expert | 0.990 |
| **Average** | — | **0.997** |

---

## Setup & Usage

### Prerequisites

- Python 3.11+
- Docker (for containerised deployment)

### Local — Python

```bash
git clone https://github.com/Tanvi51204/openEnv.git
cd openEnv
pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

Then open:

- UI: http://localhost:8000
- Docs: http://localhost:8000/docs

### Local — Docker

```bash
docker build -t data-cleaning-env .
docker run -p 8000:8000 data-cleaning-env
```

### Run baseline inference

```bash
export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="sk-..."
export ENV_URL="http://localhost:8000"
python inference.py
```

Produces `[START]` / `[STEP]` / `[END]` lines to stdout and `baseline_scores.json`.
### Environment variables

| Variable | Default | Description |
|---|---|---|
| `API_BASE_URL` | `https://api.openai.com/v1` | LLM API endpoint (OpenAI-compatible) |
| `MODEL_NAME` | `gpt-4o-mini` | Model identifier |
| `HF_TOKEN` | — | API key for LLM calls |
| `ENV_URL` | `http://localhost:8000` | Environment server URL |

---

## Project Structure

```
openenv-data-cleaning/
├── models.py            Pydantic contracts — Action / Observation / State / DQMetrics / Report
├── client.py            Sync HTTP client (reset / step / state / health)
├── inference.py         Baseline LLM agent with [START]/[STEP]/[END] logging
├── Dockerfile           python:3.11-slim, non-root user, HEALTHCHECK
├── requirements.txt     pip dependencies
└── server/
    ├── app.py               FastAPI routes + /profile + /report + /export + UI
    ├── environment.py       reset / step / state + 8 operations + planning engine + DQ metrics
    ├── data_generator.py    Synthetic dataset generation (seed=42, reproducible)
    ├── ui.html              DataMedic live monitoring dashboard
    └── tasks/
        ├── task1_missing.py     Easy — fill NaN grader
        ├── task2_format.py      Medium — format + duplicates grader
        ├── task3_pipeline.py    Hard — full pipeline grader
        └── task4_merge.py       Expert — multi-source schema alignment + merge grader
```

---

## Live Demo

🤗 **HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space

- UI: https://srishtichugh-openenv-hack.hf.space
- Health: https://srishtichugh-openenv-hack.hf.space/health
- Docs: https://srishtichugh-openenv-hack.hf.space/docs
- Profile: https://srishtichugh-openenv-hack.hf.space/profile
- Report: https://srishtichugh-openenv-hack.hf.space/report