title: Data Cleaning Environment
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
tags:
  - openenv
  - rl
  - data-cleaning

Data Cleaning OpenEnv

A real-world data cleaning environment for training and evaluating AI agents.

An agent interacts with a dirty pandas DataFrame through a standard reset() / step() / state() HTTP API, learning to fix common data quality problems — missing values, duplicate rows, inconsistent formats, statistical outliers, and dtype errors — across three progressively harder tasks.

🤗 Live HuggingFace Space: https://srishtichugh-openenv-hack.hf.space
📖 Interactive API docs: https://srishtichugh-openenv-hack.hf.space/docs
✅ Health check: https://srishtichugh-openenv-hack.hf.space/health

Environment Description & Motivation

Real-world datasets are almost never clean. Data engineers are commonly estimated to spend 60–80% of their time on data cleaning: filling missing values with statistically appropriate strategies, removing duplicates, standardising inconsistent formats (phone numbers, dates, country names), and detecting extreme outliers.

This environment turns those tasks into a reinforcement learning challenge with:

Deterministic, programmatic graders — ground-truth clean DataFrames are generated with a fixed seed; every reward signal is reproducible.
Meaningful partial rewards — every step emits a delta reward proportional to how much of the dataset it cleaned, so the agent receives useful signal throughout the episode rather than only at the end.
Three difficulty levels — easy, medium, hard — letting agents learn a curriculum from simple null-filling up to full multi-issue pipelines.
No external data downloads — all datasets are generated synthetically via numpy + Faker with seed=42.

Action Space

Actions are JSON objects sent to POST /step.

| `operation` | Requires `column` | `params` | Description |
| --- | --- | --- | --- |
| `fill_missing` | ✅ | `{"strategy": "median\|mean\|mode\|constant", "value": ...}` | Fill NaN values in a column |
| `drop_duplicates` | ❌ | none | Remove all duplicate rows |
| `fix_format` | ✅ | none | Standardise phone/date/country format |
| `replace_value` | ✅ | `{"old": ..., "new": ...}` | Replace a specific value |
| `drop_outliers` | ✅ | none | Remove IQR outliers from a numeric column |
| `fix_dtype` | ✅ | `{"dtype": "float\|int\|str"}` | Cast column to correct dtype |
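The table above can be turned into a small client-side check before an action is sent. The following is a minimal sketch, not part of the repository: `build_action` and `REQUIRES_COLUMN` are hypothetical names, and the only rule encoded is the "requires column" flag listed in the table.

```python
import json

# Operations from the action table, mapped to whether `column` is required.
REQUIRES_COLUMN = {
    "fill_missing": True,
    "drop_duplicates": False,
    "fix_format": True,
    "replace_value": True,
    "drop_outliers": True,
    "fix_dtype": True,
}


def build_action(operation, column=None, params=None):
    """Return a JSON-serialisable action dict, raising ValueError on obvious mistakes."""
    if operation not in REQUIRES_COLUMN:
        raise ValueError(f"unknown operation: {operation!r}")
    if REQUIRES_COLUMN[operation] and column is None:
        raise ValueError(f"{operation!r} requires a column")
    action = {"operation": operation}
    if column is not None:
        action["column"] = column
    if params:
        action["params"] = params
    return action


# Serialise an action ready to be used as the POST /step body.
print(json.dumps(build_action("fill_missing", "salary", {"strategy": "median"})))
```

Catching a missing column locally avoids spending a step on an action the environment would penalise as invalid.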

Format rules enforced by fix_format:

| Column | Target format |
| --- | --- |
| `phone` | `NNN-NNN-NNNN` |
| `listed_date` / `signup_date` | `YYYY-MM-DD` |
| `country` | Canonical name (`USA`, `UK`, `Canada`, `Australia`, `Germany`): title case, with `USA` and `UK` kept as upper-case abbreviations |
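To make the target formats concrete, here is a rough sketch of how phone numbers and dates might be normalised to `NNN-NNN-NNNN` and `YYYY-MM-DD`. It is not the environment's implementation: the function names are hypothetical, and the list of accepted input date formats is an assumption, since the README does not say which mixed formats the generator produces.

```python
import re
from datetime import datetime


def normalise_phone(raw):
    """Reduce a phone string to its digits and re-emit it as NNN-NNN-NNNN."""
    digits = re.sub(r"\D", "", raw)
    # Assumption: an 11-digit number starting with 1 carries a country code.
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]
    if len(digits) != 10:
        return raw  # leave values we cannot interpret untouched
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"


# Assumed input formats; the real dataset may use others.
ASSUMED_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def normalise_date(raw):
    """Try each assumed format in turn and re-emit the date as YYYY-MM-DD."""
    for fmt in ASSUMED_DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw  # unknown format: return unchanged
```

Returning unparseable values unchanged keeps the helper from silently corrupting data it does not understand.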

Example actions:

{"operation": "fill_missing",    "column": "salary",          "params": {"strategy": "median"}}
{"operation": "fill_missing",    "column": "department",      "params": {"strategy": "mode"}}
{"operation": "drop_duplicates"}
{"operation": "fix_format",      "column": "phone"}
{"operation": "fix_format",      "column": "signup_date"}
{"operation": "drop_outliers",   "column": "purchase_amount"}
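The `drop_outliers` action is described as removing IQR outliers. A minimal sketch of that idea using only the standard library follows. The 1.5 × IQR fences are the conventional Tukey rule and an assumption here, because the README does not state the multiplier or quantile method the environment uses; `drop_iqr_outliers` is a hypothetical name.

```python
import statistics


def drop_iqr_outliers(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" uses linear interpolation between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lower <= v <= upper]


amounts = [10, 12, 11, 13, 12, 11, 10, 300]
print(drop_iqr_outliers(amounts))  # the extreme value 300 is removed
```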

Observation Space

Every POST /reset and POST /step returns:

{
  "observation": {
    "done":             false,
    "reward":           0.40,
    "data_preview":     "name,age,salary,...\n...",
    "data_shape":       [100, 5],
    "missing_counts":   {"age": 20, "salary": 20, "department": 10},
    "duplicate_count":  0,
    "dtype_issues":     {},
    "task_description": "Task 1 (Easy) — Fill Missing Values\n...",
    "message":          "Filled 20 missing values in 'age' using median.",
    "step_count":       1,
    "current_score":    0.4000
  },
  "reward": 0.40,
  "done":   false,
  "info":   {}
}

| Field | Type | Description |
| --- | --- | --- |
| `done` | bool | Episode finished (score ≥ 0.95 or max steps reached) |
| `reward` | float | Per-step delta reward (see Reward Function) |
| `data_preview` | string | First 10 rows of current DataFrame as CSV |
| `data_shape` | [int, int] | Current `[rows, cols]` |
| `missing_counts` | object | `{column: null_count}` for columns with NaN |
| `duplicate_count` | int | Number of duplicate rows |
| `dtype_issues` | object | `{column: issue_description}` for suspected dtype mismatches |
| `task_description` | string | Full task instructions with available operations |
| `message` | string | Human-readable result of the last action |
| `step_count` | int | Steps taken in this episode |
| `current_score` | float | Running grader score 0.0 – 1.0 |
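An agent that does not use the bundled client could decode the response into a typed structure along these lines. This is an illustrative sketch based only on the fields documented above; the `Observation` dataclass, `parse_observation`, and `total_missing` are hypothetical names and are not the Pydantic models in `models.py`.

```python
import json
from dataclasses import dataclass


@dataclass
class Observation:
    """Subset of the documented observation fields an agent typically needs."""
    done: bool
    reward: float
    data_shape: tuple
    missing_counts: dict
    duplicate_count: int
    step_count: int
    current_score: float


def parse_observation(response_body):
    """Extract the nested "observation" object from a /reset or /step response."""
    obs = json.loads(response_body)["observation"]
    return Observation(
        done=bool(obs["done"]),
        reward=float(obs["reward"]),
        data_shape=tuple(obs["data_shape"]),
        missing_counts=dict(obs["missing_counts"]),
        duplicate_count=int(obs["duplicate_count"]),
        step_count=int(obs["step_count"]),
        current_score=float(obs["current_score"]),
    )


def total_missing(obs):
    """Total number of null cells still reported by the environment."""
    return sum(obs.missing_counts.values())
```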

State Space

GET /state returns episode metadata (does not modify state):

{
  "episode_id":      "a8f026a9-...",
  "task_id":         1,
  "step_count":      2,
  "max_steps":       20,
  "total_errors":    50,
  "errors_remaining": 30
}

Tasks

Task 1 — Fill Missing Values (Easy)

| Property | Value |
| --- | --- |
| Dataset | 100-row employee records (name, age, salary, department, experience) |
| Issues | ~20 % NaN in `age`, `salary`; ~10 % NaN in `department` |
| Goal | Fill all missing values |
| Valid operations | `fill_missing` |
| Grader | `1.0 − remaining_nulls / original_nulls` |
| Max steps | 20 |
| Optimal steps | 3 (one per affected column) |
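The Task 1 grader formula can be traced with the null counts shown in the observation example (20 in `age`, 20 in `salary`, 10 in `department`, 50 in total). The sketch below is a direct transcription of the formula; the function name and the zero-null guard are assumptions, and the real grader may clamp or round differently.

```python
def task1_score(remaining_nulls, original_nulls):
    """Score = 1.0 - remaining_nulls / original_nulls."""
    if original_nulls == 0:
        return 1.0  # assumption: nothing to fix counts as fully clean
    return 1.0 - remaining_nulls / original_nulls


original = 50
# Remaining nulls after filling salary, then age, then department.
for remaining in (30, 10, 0):
    print(remaining, round(task1_score(remaining, original), 2))
```

Under these counts, each fill raises the score by the share of nulls it removed, which is why three steps (one per affected column) are enough.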

Task 2 — Fix Formats + Remove Duplicates (Medium)

| Property | Value |
| --- | --- |
| Dataset | 215-row product catalog (product_id, price, category, phone, listed_date) |
| Issues | ~60 % phone numbers in mixed formats, ~60 % dates in mixed formats, 15 duplicate rows |
| Goal | Standardise all phone/date formats and remove duplicates |
| Valid operations | `fix_format`, `drop_duplicates` |
| Grader | `0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score` |
| Max steps | 30 |
| Optimal steps | 3 |

Task 3 — Full Cleaning Pipeline (Hard)

| Property | Value |
| --- | --- |
| Dataset | 320-row customer database (name, age, purchase_amount, country, email, signup_date) |
| Issues | Missing values (4 cols), 20 duplicate rows, outliers in `purchase_amount` (~3× normal), mixed country capitalisation, mixed date formats |
| Goal | Fix all issues end-to-end |
| Valid operations | All 6 operations |
| Grader | `0.25×null + 0.20×dupe + 0.20×outlier + 0.175×country + 0.175×date` |
| Max steps | 40 |
| Optimal steps | 8 |
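The Task 2 and Task 3 graders are weighted sums of per-issue sub-scores. The sketch below copies the weights from the two grader formulas and shows how a partially cleaned dataset would be scored. How each sub-score is computed is not specified in this README, so the component values passed in are purely illustrative, and `weighted_score` is a hypothetical helper.

```python
# Weights copied from the Task 2 and Task 3 grader formulas.
TASK2_WEIGHTS = {"phone": 0.35, "date": 0.35, "dupe": 0.30}
TASK3_WEIGHTS = {
    "null": 0.25,
    "dupe": 0.20,
    "outlier": 0.20,
    "country": 0.175,
    "date": 0.175,
}


def weighted_score(weights, components):
    """Combine per-issue sub-scores (each in [0, 1]) into one grader score."""
    return sum(weight * components[name] for name, weight in weights.items())


# Illustrative Task 2 state: phones and dates fixed, duplicates still present.
print(round(weighted_score(TASK2_WEIGHTS, {"phone": 1.0, "date": 1.0, "dupe": 0.0}), 3))
```

Both weight sets sum to 1.0, so a fully cleaned dataset scores 1.0 under either formula.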

Reward Function

| Scenario | Reward |
| --- | --- |
| Score improves (delta > 0) | `new_score − old_score` (positive) |
| Operation had no effect | `−0.01` |
| Invalid operation / bad column | `−0.05` |
| Episode completed (score ≥ 0.95) | `delta + 0.20` terminal bonus |

Rewards are bounded to [−0.05, 1.2]. A partial reward is emitted on every step, giving the agent dense signal throughout the episode.
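The reward rules can be sketched as a single function. This is an interpretation of the table rather than the code in `environment.py`: `step_reward` is a hypothetical name, and treating a score decrease the same as "no effect" is an assumption, since the table does not cover that case.

```python
def step_reward(old_score, new_score, valid=True):
    """Per-step reward following the documented scenarios."""
    if not valid:
        return -0.05  # invalid operation or bad column
    delta = new_score - old_score
    if delta <= 0:
        # Assumption: a non-improving step is penalised like a no-op.
        return -0.01
    reward = delta
    if new_score >= 0.95:
        reward += 0.20  # terminal bonus on completion
    # Clamp to the documented bounds [-0.05, 1.2].
    return max(-0.05, min(1.2, reward))


print(step_reward(0.0, 0.4))   # plain improvement
print(step_reward(0.4, 0.4))   # no effect
print(step_reward(0.8, 1.0))   # improvement plus terminal bonus
```

The upper bound of 1.2 corresponds to going from a score of 0.0 to 1.0 in one step and collecting the 0.20 bonus.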

API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| `GET` | `/health` | Health check → `{"status": "healthy"}` |
| `POST` | `/reset` | Start episode. Body: `{"task_id": 1\|2\|3}` (optional; default: round-robin) |
| `POST` | `/step` | Execute action. Body: action JSON |
| `GET` | `/state` | Get episode metadata |
| `GET` | `/metadata` | Environment name, version, task list |
| `GET` | `/schema` | Full action / observation / state JSON schemas |
| `GET` | `/docs` | Interactive Swagger UI |
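For callers who prefer not to depend on the bundled client, the endpoints can be reached with the Python standard library alone. The sketch below is a hedged example: `build_request` and `call` are hypothetical helpers, and the assumption is that every endpoint used returns a JSON body, as the examples in this README suggest.

```python
import json
import urllib.request


def build_request(base_url, method, path, body=None):
    """Create a urllib Request; a dict body is sent as JSON."""
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    url = base_url.rstrip("/") + path
    return urllib.request.Request(url, data=data, headers=headers, method=method)


def call(base_url, method, path, body=None, timeout=10):
    """Send the request and decode the JSON response."""
    request = build_request(base_url, method, path, body)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example usage against a locally running server (not executed here):
# call("http://localhost:8000", "GET", "/health")
# call("http://localhost:8000", "POST", "/reset", {"task_id": 1})
# call("http://localhost:8000", "POST", "/step", {"operation": "drop_duplicates"})
```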

Baseline Scores

| Task | Difficulty | Score |
| --- | --- | --- |
| 1: Fill Missing Values | Easy | 0.999 |
| 2: Fix Formats + Duplicates | Medium | 0.999 |
| 3: Full Cleaning Pipeline | Hard | 0.999 |
| Average | n/a | 0.999 |

These scores were produced by google/gemma-3-27b-it via NVIDIA NIM at temperature=0. Full step-by-step agent logs are in inference_log.txt.

Setup & Usage

Prerequisites

Python 3.11+
Docker (for containerised deployment)

Local — Python

# 1. Clone and install dependencies
git clone https://github.com/Tanvi51204/openEnv.git
cd openEnv
pip install -r requirements.txt

# 2. Start the server
uvicorn server.app:app --host 0.0.0.0 --port 8000

# 3. Open Swagger UI (`open` is macOS-only; on Linux use xdg-open, or paste the URL into a browser)
open http://localhost:8000/docs

Local — Docker

docker build -t data-cleaning-env .
docker run -p 8000:8000 data-cleaning-env

Quick API test

# Health
curl http://localhost:8000/health

# Start Task 1
curl -X POST http://localhost:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": 1}'

# Fill missing values
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}}'

Python client

from client import DataCleaningEnvClient
from models import DataCleaningAction

with DataCleaningEnvClient("http://localhost:8000") as env:
    result = env.reset(task_id=1)
    print(result.observation.missing_counts)   # {'age': 20, 'salary': 20, 'department': 10}

    action = DataCleaningAction(
        operation="fill_missing",
        column="salary",
        params={"strategy": "median"},
    )
    result = env.step(action)
    print(result.observation.current_score)    # 0.4
    print(result.reward)                       # 0.4
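Building on the client example, a simple scripted policy for Task 1 could derive one `fill_missing` action per column from `missing_counts`. The sketch below is an assumption-laden illustration: `plan_fill_actions` is a hypothetical helper, and the choice of median for numeric columns and mode for the rest mirrors the example actions in this README rather than any rule the environment enforces.

```python
# Assumption: these Task 1 columns are numeric; everything else is categorical.
NUMERIC_COLUMNS = frozenset({"age", "salary", "experience"})


def plan_fill_actions(missing_counts, numeric_columns=NUMERIC_COLUMNS):
    """Return one fill_missing action dict per column that still has nulls."""
    actions = []
    for column, count in missing_counts.items():
        if count <= 0:
            continue  # nothing left to fill in this column
        strategy = "median" if column in numeric_columns else "mode"
        actions.append(
            {
                "operation": "fill_missing",
                "column": column,
                "params": {"strategy": strategy},
            }
        )
    return actions


for action in plan_fill_actions({"age": 20, "salary": 20, "department": 10}):
    print(action)
```

Each dict uses the same keyword names as the `DataCleaningAction` constructor shown above, so it should be possible to pass one as `DataCleaningAction(**action)` before calling `env.step`.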

Run baseline inference

export API_BASE_URL="https://api.openai.com/v1"
export MODEL_NAME="gpt-4o-mini"
export HF_TOKEN="sk-..."          # your API key
export ENV_URL="http://localhost:8000"

python inference.py

The script writes [START] / [STEP] / [END] log lines to stdout and saves the scores to baseline_scores.json.

Environment variables

| Variable | Default | Description |
| --- | --- | --- |
| `API_BASE_URL` | `https://api.openai.com/v1` | LLM API endpoint (OpenAI-compatible) |
| `MODEL_NAME` | `gpt-4o-mini` | Model identifier |
| `HF_TOKEN` | (none) | API key for LLM calls |
| `ENV_URL` | `http://localhost:8000` | Environment server URL |
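A script that reads these variables might resolve them as follows. This is a sketch of the documented defaults only and does not reproduce `inference.py`; `load_config` and `DEFAULTS` are hypothetical names.

```python
import os

# Defaults copied from the environment variable table; HF_TOKEN has none.
DEFAULTS = {
    "API_BASE_URL": "https://api.openai.com/v1",
    "MODEL_NAME": "gpt-4o-mini",
    "ENV_URL": "http://localhost:8000",
}


def load_config(environ=os.environ):
    """Resolve settings from the environment, falling back to the defaults."""
    config = {name: environ.get(name, default) for name, default in DEFAULTS.items()}
    # No default exists for the API key, so a missing value is reported as None.
    config["HF_TOKEN"] = environ.get("HF_TOKEN")
    return config


config = load_config()
if config["HF_TOKEN"] is None:
    print("HF_TOKEN is not set; LLM calls would fail without an API key.")
```

Accepting the mapping as a parameter makes the function easy to exercise in tests without touching the real process environment.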

Project Structure

openenv-data-cleaning/
├── models.py              Pydantic contracts — Action / Observation / State
├── client.py              Sync HTTP client (reset / step / state / health)
├── inference.py           Baseline LLM agent with [START]/[STEP]/[END] logging
├── openenv.yaml           OpenEnv manifest
├── Dockerfile             python:3.11-slim, non-root user, HEALTHCHECK
├── requirements.txt       pip dependencies
├── pyproject.toml         Python package metadata + openenv-core dependency
└── server/
    ├── app.py             FastAPI routes + /metadata + /schema
    ├── environment.py     reset / step / state logic + 6 operations + rewards
    ├── data_generator.py  Synthetic dataset generation (seed=42, reproducible)
    └── tasks/
        ├── task1_missing.py    Easy  — fill NaN grader
        ├── task2_format.py     Medium — format + duplicates grader
        └── task3_pipeline.py   Hard  — full pipeline grader

Live Demo

🤗 HuggingFace Space: https://srishtichugh-openenv-hack.hf.space