
title: Data Cleaning Environment
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
  - rl
  - data-cleaning
  - multi-agent
  - data-quality

DataMedic — AI Data Cleaning OpenEnv

An agentic data quality environment for training and evaluating AI agents on real-world data cleaning tasks.

An agent interacts with dirty pandas DataFrames through a standard reset() / step() / state() HTTP API and learns to fix missing values, duplicate rows, inconsistent formats, statistical outliers, and dtype errors. The four tasks get progressively harder and end with a novel multi-source schema alignment challenge.

🤗 Live HuggingFace Space: https://srishtichugh-openenv-hack.hf.space
🖥️ Live DataMedic UI: https://srishtichugh-openenv-hack.hf.space
📖 Interactive API docs: https://srishtichugh-openenv-hack.hf.space/docs
✅ Health check: https://srishtichugh-openenv-hack.hf.space/health

What Makes This Different

Most data cleaning tools are one-shot. DataMedic is an RL training environment where:

The agent diagnoses a dirty dataset via /profile (completeness, uniqueness, validity %)
It plans a treatment — every observation includes a plan field with the next recommended actions
It executes cleaning operations step by step with dense per-step rewards
It receives a health certificate via /report summarising what was fixed and how efficiently
It exports the cleaned result via /export

Grounded in peer-reviewed research:

Bendinelli et al. 2025 — LLM Agents for Cleaning Tabular ML Datasets (arXiv:2503.06664)
CleanAgent — Qi & Wang 2024 (arXiv:2403.08291)
AutoDCWorkflow — EMNLP 2025 Findings
HoloClean — Rekatsinas et al. 2017

Environment Description & Motivation

Real-world datasets are almost never clean, and data engineers are often reported to spend 60–80% of their time on data cleaning. This environment turns that work into an RL challenge with:

Deterministic, programmatic graders — ground-truth DataFrames generated with seed=42; every reward is reproducible
Meaningful partial rewards — dense delta reward every step, not just at episode end
Four difficulty levels — easy → medium → hard → expert (multi-source merge)
Live DQ metrics — completeness %, uniqueness %, validity % in every observation
Agentic planning — plan field recommends next actions; tried_operations prevents loops
No external data downloads — all datasets generated synthetically via numpy + Faker

DataMedic UI

Open https://srishtichugh-openenv-hack.hf.space in your browser to see the live monitoring dashboard:

Health Score Ring — animated score gauge, color-coded by severity (green/amber/red)
DQ Dimension Bars — live completeness, uniqueness, validity bars updating each step
Score Trajectory Chart — real-time line chart of score vs steps
Agent Treatment Plan — next recommended actions shown before each step
Operation Log — every action taken, result, and reward delta streamed live
Dataset Preview — first 10 rows with NULL values highlighted in red
Export CSV — download the cleaned DataFrame at any point

Click any task button — the dataset loads automatically and the demo agent runs end-to-end.

Action Space

Actions are JSON objects sent to POST /step.

`operation`	`column` required	`params`	Description
`fill_missing`	✅	`{"strategy": "median|mean|mode|constant", "value": ...}`	Fill NaN values in a column
`drop_duplicates`	❌	—	Remove all duplicate rows
`fix_format`	✅	—	Standardise phone/date/country format
`replace_value`	✅	`{"old": ..., "new": ...}`	Replace a specific value
`drop_outliers`	✅	—	Remove IQR outliers from a numeric column
`fix_dtype`	✅	`{"dtype": "float|int|str"}`	Cast column to correct dtype
`align_schema`	❌	—	Rename Source A columns to canonical schema (Task 4 only)
`merge_sources`	❌	—	Concatenate aligned Source A + Source B (Task 4 only)
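To make the table concrete, here is a minimal sketch of a helper that builds and validates an action before it is sent to POST /step. The key names `operation`, `column`, and `params` follow the table headers; the helper name `make_action` and the choice to omit unused keys are assumptions, so treat `/schema` as the authoritative contract.

```python
# Operations from the action table, mapped to whether `column` is required.
OPERATIONS = {
    "fill_missing": True,
    "drop_duplicates": False,
    "fix_format": True,
    "replace_value": True,
    "drop_outliers": True,
    "fix_dtype": True,
    "align_schema": False,   # Task 4 only
    "merge_sources": False,  # Task 4 only
}


def make_action(operation, column=None, params=None):
    """Build an action dict for POST /step (hypothetical helper, not part of the repo)."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation!r}")
    if OPERATIONS[operation] and column is None:
        raise ValueError(f"{operation} requires a column")
    action = {"operation": operation}
    if column is not None:
        action["column"] = column
    if params:
        action["params"] = params
    return action


fill_age = make_action("fill_missing", "age", {"strategy": "median"})
dedupe = make_action("drop_duplicates")
```

Validating on the client side avoids spending a step on the −0.05 penalty for an invalid operation or a missing column.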

Format rules enforced by fix_format:

Column	Target format
`phone`	`NNN-NNN-NNNN`
`listed_date` / `signup_date`	`YYYY-MM-DD`
`country`	Canonical name (`USA`, `UK`, `Canada`, `Australia`, `Germany`)
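The target formats above can be illustrated with a small normaliser. This is only a sketch of the kind of logic `fix_format` might apply, not the server's implementation: the accepted input date layouts, the handling of a leading `1` country code, and the alias table are all assumptions. Only the target formats and the canonical country names come from the table.

```python
import re
from datetime import datetime

# Input layouts to try, in order (assumed; the README only fixes the target YYYY-MM-DD).
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y")

# Hypothetical alias table keyed by lower-cased input; values are the canonical names.
COUNTRY_ALIASES = {
    "usa": "USA",
    "uk": "UK",
    "canada": "Canada",
    "australia": "Australia",
    "germany": "Germany",
}


def normalize_phone(raw):
    """Return NNN-NNN-NNNN, or the input unchanged if it does not contain 10 digits."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # assumed: drop a leading country code
    if len(digits) != 10:
        return raw
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


def normalize_date(raw):
    """Return YYYY-MM-DD, or the input unchanged if no known layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw


def normalize_country(raw):
    """Map any casing of a known country to its canonical name."""
    return COUNTRY_ALIASES.get(raw.strip().lower(), raw)
```

Values that cannot be parsed are returned unchanged, which keeps the function safe to apply to a whole column.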

Observation Space

Every POST /reset and POST /step returns:

{
  "observation": {
    "done":             false,
    "reward":           0.40,
    "data_preview":     "name,age,salary,...\n...",
    "data_shape":       [100, 5],
    "missing_counts":   {"salary": 20, "department": 10},
    "duplicate_count":  0,
    "dtype_issues":     {},
    "task_description": "Task 1 (Easy) — Fill Missing Values\n...",
    "message":          "Filled 20 missing values in 'age' using median.",
    "step_count":       1,
    "current_score":    0.4000,
    "dq_metrics": {
      "completeness_pct": 86.67,
      "uniqueness_pct":   100.0,
      "validity_pct":     94.5,
      "total_cells":      500,
      "null_cells":       50,
      "duplicate_rows":   0,
      "invalid_cells":    12
    },
    "tried_operations": ["fill_missing:age"],
    "plan": [
      "fill_missing on \"salary\" (20 nulls) using median",
      "fill_missing on \"department\" (10 nulls) using mode"
    ]
  },
  "reward": 0.40,
  "done":   false,
  "info":   {}
}

Field	Type	Description
`done`	bool	Episode finished (score ≥ 0.95 or max steps reached)
`reward`	float	Per-step delta reward
`data_preview`	string	First 10 rows as CSV
`data_shape`	[int, int]	Current `[rows, cols]`
`missing_counts`	object	`{column: null_count}` for columns with NaN
`duplicate_count`	int	Number of duplicate rows
`dtype_issues`	object	`{column: issue_description}`
`task_description`	string	Full task instructions
`message`	string	Human-readable result of last action
`step_count`	int	Steps taken this episode
`current_score`	float	Running grader score 0.0–1.0
`dq_metrics`	object	Completeness / uniqueness / validity % + raw counts
`tried_operations`	array	Operations already applied — prevents agent loops
`plan`	array	Up to 3 recommended next actions (rule-based planning engine)
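As an example of how an agent might consume these fields, the sketch below picks the next `fill_missing` action from `missing_counts` while skipping anything already listed in `tried_operations`. The `operation:column` string format is inferred from the example observation, and `next_fill_action` and its `numeric_columns` parameter are hypothetical names introduced here.

```python
def next_fill_action(observation, numeric_columns=frozenset()):
    """Choose the untried column with the most nulls, or None if nothing is left."""
    tried = set(observation.get("tried_operations", []))
    pending = {
        column: count
        for column, count in observation.get("missing_counts", {}).items()
        if count > 0 and f"fill_missing:{column}" not in tried
    }
    if not pending:
        return None
    column = max(pending, key=pending.get)
    # Assumed heuristic: median for numeric columns, mode for everything else.
    strategy = "median" if column in numeric_columns else "mode"
    return {
        "operation": "fill_missing",
        "column": column,
        "params": {"strategy": strategy},
    }


observation = {
    "missing_counts": {"age": 20, "salary": 20, "department": 10},
    "tried_operations": ["fill_missing:age"],
}
action = next_fill_action(observation, numeric_columns={"age", "salary"})
```

With the observation shown, `age` is skipped because it was already tried, and the result is a median fill on `salary`.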

Tasks

Task 1 — Fill Missing Values (Easy)

Property	Value
Dataset	100-row employee records (name, age, salary, department, experience)
Issues	~20% NaN in `age`, `salary`; ~10% NaN in `department`
Goal	Fill all missing values
Valid operations	`fill_missing`
Grader	`1.0 − remaining_nulls / original_nulls`
Max steps	20
Optimal steps	3
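The Task 1 grader formula is simple enough to check by hand. The sketch below restates it in Python; the function name `task1_score` and the convention of returning 1.0 when there were no nulls to begin with are assumptions.

```python
def task1_score(original_nulls, remaining_nulls):
    """Grader formula from the table: 1.0 - remaining_nulls / original_nulls."""
    if original_nulls == 0:
        return 1.0  # assumed convention: nothing to fix counts as fully clean
    return 1.0 - remaining_nulls / original_nulls


# Roughly 20 + 20 + 10 = 50 nulls at the start; filling the 20 in `age` leaves 30.
score_after_age = task1_score(original_nulls=50, remaining_nulls=30)
```

Filling one 20-null column therefore moves the score to about 0.4, and the three optimal steps (one per affected column) bring it to 1.0.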

Task 2 — Fix Formats + Remove Duplicates (Medium)

Property	Value
Dataset	215-row product catalog (product_id, price, category, phone, listed_date)
Issues	~60% phone numbers in mixed formats, ~60% dates in mixed formats, 15 duplicate rows
Goal	Standardise all phone/date formats and remove duplicates
Valid operations	`fix_format`, `drop_duplicates`
Grader	`0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score`
Max steps	30
Optimal steps	3
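The weighted graders for Tasks 2 to 4 all have the same shape, so one sketch covers them. Here it is with the Task 2 weights. How each component score is measured is not spelled out in this README; the example assumes each one is a fraction between 0 and 1, and `weighted_score` is a hypothetical name.

```python
TASK2_WEIGHTS = {"phone": 0.35, "date": 0.35, "dupe": 0.30}


def weighted_score(components, weights):
    """Combine per-issue scores (each assumed to lie in [0, 1]) into one grade."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1.0")
    return sum(weights[name] * components[name] for name in weights)


# Phones fully fixed, half of the dates fixed, duplicates not yet removed.
partial = weighted_score({"phone": 1.0, "date": 0.5, "dupe": 0.0}, TASK2_WEIGHTS)
```

In this scenario the score is 0.35 + 0.175 + 0 = 0.525, so removing the duplicates would add the remaining 0.30 in a single step.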

Task 3 — Full Cleaning Pipeline (Hard)

Property	Value
Dataset	320-row customer database (name, age, purchase_amount, country, email, signup_date)
Issues	Missing values (4 cols), 20 duplicate rows, outliers in `purchase_amount`, mixed country case, mixed date formats
Goal	Fix all issues end-to-end
Valid operations	All 6 general operations (everything except `align_schema` and `merge_sources`)
Grader	`0.25×null + 0.20×dupe + 0.20×outlier + 0.175×country + 0.175×date`
Max steps	40
Optimal steps	8
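Task 3 is the first task that involves `drop_outliers`, which the action table describes as removing IQR outliers. A minimal sketch of that idea follows, assuming the conventional 1.5 × IQR fences (the README does not state the multiplier) and a column with no remaining nulls.

```python
from statistics import quantiles


def drop_iqr_outliers(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


purchase_amounts = [10, 12, 11, 13, 12, 11, 10, 500]
cleaned = drop_iqr_outliers(purchase_amounts)
```

Because this operation removes rows, running it before `fill_missing` or `drop_duplicates` can change the counts those steps see, so the order of operations matters in the full pipeline.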

Task 4 — Multi-Source Schema Alignment + Merge (Expert)

Property	Value
Source A	150-row CRM export: `cust_id, full_name, Age, purchase_amt, Country, signup, email`
Source B	100-row Marketing export: `customer_id, name, age_years, spend, country_name, registration_date, email`
Issues	Misaligned schemas, missing values, mixed country case, mixed date formats, 10 duplicate rows
Goal	Align schemas → merge → clean
Valid operations	`align_schema`, `merge_sources`, `fill_missing`, `fix_format`, `drop_duplicates`
Grader	`0.30×schema + 0.25×null + 0.20×country + 0.15×date + 0.10×dupe`
Max steps	50
Optimal steps	8

Inspired by Meta's DataSchema system — column-level semantic annotation across misaligned sources.
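The align-then-merge sequence can be sketched with plain lists of dicts. The canonical schema is not listed in this README, so the mapping below assumes Source B's column names are canonical, which matches the description of `align_schema` as renaming only Source A. The sample rows are invented for illustration.

```python
# Assumed mapping: Source A column -> canonical column (taken from Source B's names).
A_TO_CANONICAL = {
    "cust_id": "customer_id",
    "full_name": "name",
    "Age": "age_years",
    "purchase_amt": "spend",
    "Country": "country_name",
    "signup": "registration_date",
    "email": "email",
}


def align_schema(rows, mapping=A_TO_CANONICAL):
    """Rename every key of every row; unknown keys are kept as they are."""
    return [{mapping.get(key, key): value for key, value in row.items()} for row in rows]


def merge_sources(aligned_a, source_b):
    """Concatenate the two sources once their columns agree."""
    return aligned_a + source_b


source_a = [{"cust_id": 1, "full_name": "Ada", "Age": 36, "purchase_amt": 12.5,
             "Country": "uk", "signup": "05-01-2023", "email": "ada@example.com"}]
source_b = [{"customer_id": 2, "name": "Bob", "age_years": 41, "spend": 30.0,
             "country_name": "USA", "registration_date": "2023-02-10",
             "email": "bob@example.com"}]
merged = merge_sources(align_schema(source_a), source_b)
```

With pandas the same two steps would typically be `DataFrame.rename(columns=...)` followed by `pd.concat`. The merged rows still carry mixed country case and date formats, which is why the task continues with `fix_format`, `fill_missing`, and `drop_duplicates`.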

Reward Function

Scenario	Reward
Score improves (delta > 0)	`new_score − old_score` (positive)
Operation had no effect	`−0.01`
Invalid operation / bad column	`−0.05`

Rewards are bounded to [−0.05, 0.99]. Every step returns a reward, so the signal is dense rather than delayed to the end of the episode.
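The reward table can be read as the following function. This is a sketch, not the environment's code: the README does not say what happens when a valid operation lowers the score, so the sketch treats any non-positive delta as "no effect", and `step_reward` is a hypothetical name.

```python
NO_EFFECT_PENALTY = -0.01
INVALID_PENALTY = -0.05
MAX_REWARD = 0.99


def step_reward(old_score, new_score, valid=True):
    """Per-step reward following the table above, bounded to [-0.05, 0.99]."""
    if not valid:
        return INVALID_PENALTY
    delta = new_score - old_score
    if delta > 0:
        return min(delta, MAX_REWARD)
    # Assumption: a score that stays flat or drops is scored like "no effect".
    return NO_EFFECT_PENALTY
```

For example, moving the score from 0.0 to 0.4 yields a reward of 0.4, repeating an operation that changes nothing yields −0.01, and naming a column that does not exist yields −0.05.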

Intelligence Endpoints (Phase 2)

Method	Path	Description
`GET`	`/profile`	Rich per-column DQ profile — null %, unique %, min/max/mean, top values
`GET`	`/report`	Full episode cleaning summary — score improvement, efficiency, issues fixed
`GET`	`/export`	Download current cleaned DataFrame as CSV

`/profile` response example

{
  "dq_metrics": {
    "completeness_pct": 90.0,
    "uniqueness_pct": 100.0,
    "validity_pct": 88.5
  },
  "columns": {
    "age": {"null_count": 20, "null_pct": 20.0, "min": 22, "max": 59, "mean": 40.3}
  }
}

`/report` response example

{
  "initial_score": 0.01,
  "final_score": 0.99,
  "score_improvement": 0.98,
  "steps_taken": 3,
  "step_efficiency_pct": 85.0,
  "issues_fixed": {"nulls_filled": 50, "dupes_removed": 15, "formats_fixed": 168},
  "completed": true
}

All API Endpoints

Method	Path	Description
`GET`	`/`	DataMedic live monitoring UI
`GET`	`/health`	Health check → `{"status": "healthy"}`
`POST`	`/reset`	Start episode. Body: `{"task_id": 1|2|3|4}`
`POST`	`/step`	Execute action. Body: action JSON
`GET`	`/state`	Episode metadata
`GET`	`/metadata`	Environment info + paper citations
`GET`	`/schema`	Full action/observation/state JSON schemas
`GET`	`/profile`	Rich data quality profile of current DataFrame
`GET`	`/report`	Full episode cleaning summary
`GET`	`/export`	Download cleaned DataFrame as CSV
`GET`	`/docs`	Interactive Swagger UI
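For readers who want to call these endpoints without the bundled `client.py`, here is a minimal standard-library sketch. The class name `EnvClient` and the injectable `opener` argument are assumptions made for this example, and the request bodies follow the table above. It is not a copy of the repository's client.

```python
import json
from urllib import request


class EnvClient:
    """Tiny synchronous client for the core endpoints (illustrative only)."""

    def __init__(self, base_url="http://localhost:8000", opener=request.urlopen):
        self.base_url = base_url.rstrip("/")
        self._open = opener  # injectable so the client can be exercised offline

    def _call(self, method, path, body=None):
        data = None if body is None else json.dumps(body).encode("utf-8")
        req = request.Request(
            self.base_url + path,
            data=data,
            method=method,
            headers={"Content-Type": "application/json"},
        )
        with self._open(req) as resp:
            return json.loads(resp.read().decode("utf-8"))

    def health(self):
        return self._call("GET", "/health")

    def reset(self, task_id):
        return self._call("POST", "/reset", {"task_id": task_id})

    def step(self, action):
        return self._call("POST", "/step", action)

    def state(self):
        return self._call("GET", "/state")
```

Against a running server, a typical session would call `EnvClient().reset(1)` and then `step(...)` with one action per call until the returned `done` flag is true.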

Baseline Scores

Task	Difficulty	Score
1 — Fill Missing Values	Easy	0.999
2 — Fix Formats + Duplicates	Medium	0.999
3 — Full Cleaning Pipeline	Hard	0.999
4 — Multi-Source Merge	Expert	0.990
Average	—	0.997

Setup & Usage

Prerequisites

Python 3.11+
Docker (for containerised deployment)

Local — Python

git clone https://github.com/Tanvi51204/openEnv.git
cd openEnv
pip install -r requirements.txt
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000

Then open:

UI: http://localhost:8000
Docs: http://localhost:8000/docs

Local — Docker

docker build -t data-cleaning-env .
docker run -p 8000:8000 data-cleaning-env

Run baseline inference

export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="sk-..."
export ENV_URL="http://localhost:8000"

python inference.py

The script prints [START] / [STEP] / [END] lines to stdout and writes baseline_scores.json.

Environment variables

Variable	Default	Description
`API_BASE_URL`	`https://api.openai.com/v1`	LLM API endpoint (OpenAI-compatible)
`MODEL_NAME`	`gpt-4o-mini`	Model identifier
`HF_TOKEN`	—	API key for LLM calls
`ENV_URL`	`http://localhost:8000`	Environment server URL
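A script that honours the defaults in this table could load its settings as sketched below. The function name `load_config` and the decision to fail fast when `HF_TOKEN` is missing are assumptions; `inference.py` may handle this differently.

```python
import os

# Defaults copied from the table above; HF_TOKEN has none.
DEFAULTS = {
    "API_BASE_URL": "https://api.openai.com/v1",
    "MODEL_NAME": "gpt-4o-mini",
    "ENV_URL": "http://localhost:8000",
}


def load_config(env=os.environ):
    """Read settings from a mapping (the process environment by default)."""
    config = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    token = env.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set and has no default")
    config["HF_TOKEN"] = token
    return config
```

Passing the environment in as a mapping keeps the function easy to test, for example `load_config({"HF_TOKEN": "dummy"})`.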

Project Structure

openenv-data-cleaning/
├── models.py              Pydantic contracts — Action / Observation / State / DQMetrics / Report
├── client.py              Sync HTTP client (reset / step / state / health)
├── inference.py           Baseline LLM agent with [START]/[STEP]/[END] logging
├── Dockerfile             python:3.11-slim, non-root user, HEALTHCHECK
├── requirements.txt       pip dependencies
└── server/
    ├── app.py             FastAPI routes + /profile + /report + /export + UI
    ├── environment.py     reset / step / state + 8 operations + planning engine + DQ metrics
    ├── data_generator.py  Synthetic dataset generation (seed=42, reproducible)
    ├── ui.html            DataMedic live monitoring dashboard
    └── tasks/
        ├── task1_missing.py    Easy   — fill NaN grader
        ├── task2_format.py     Medium — format + duplicates grader
        ├── task3_pipeline.py   Hard   — full pipeline grader
        └── task4_merge.py      Expert — multi-source schema alignment + merge grader

Live Demo

🤗 HuggingFace Space: https://srishtichugh-openenv-hack.hf.space

UI: https://srishtichugh-openenv-hack.hf.space
Health: https://srishtichugh-openenv-hack.hf.space/health
Docs: https://srishtichugh-openenv-hack.hf.space/docs
Profile: https://srishtichugh-openenv-hack.hf.space/profile
Report: https://srishtichugh-openenv-hack.hf.space/report