data_clean_env / README.md
anugrah55's picture
Upload folder using huggingface_hub
7b49766 verified
metadata
title: Data Clean Env
emoji: 🧹
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv

🧹 OpenEnv: Data Clean Environment

The Real-World Benchmarking for Agentic Data Engineering

OpenEnv Spec HF Space License


🌟 Overview

Data Clean Env is a high-fidelity, production-grade OpenEnv implementation designed to evaluate and train Reinforcement Learning (RL) agents on the messy, complex reality of Data Cleaning.

Unlike "toy" environments, this project simulates the exact workflow of a data engineer: identifying schema inconsistencies, handling missing values, casting types, and pruning noise from real-world datasets using the power of pandas.


πŸ› οΈ Environment Architecture

🧠 Action Space

The agent interacts with the environment through atomic, high-level data operations defined in models.py:

Action Parameters Description
fill_na column_name, value Replaces missing values with a specific constant.
drop_na column_name Removes rows containing missing data in the target column.
drop_column column_name Deletes irrelevant or noisy features from the dataset.
rename_column column_name, value Fixes naming inconsistencies to match target schemas.
change_type column_name, value Casts columns to int, float, or str for downstream compatibility.
submit - Finalizes the cleaning process and triggers the programmatic grader.

πŸ‘οΈ Observation Space

The agent perceives the state of the data through a detailed schema:

  • df_schema: Real-time dictionary of column data types.
  • missing_values: Current counts of NaN values per column.
  • head: A preview of the first 5 rows to identify formatting patterns.
  • feedback: Semantic descriptions of the impact of the last action.

πŸ“ˆ Task Progression & Grading

Each task is evaluated by a deterministic programmatic grader that compares the agent's output against a "Gold Standard" target, producing a score strictly between (0.0, 1.0).

  1. 🟒 Easy (easy_clean):
    • Goal: Basic imputation.
    • Challenge: Fill missing 'age' values.
  2. 🟑 Medium (medium_clean):
    • Goal: Noise reduction.
    • Challenge: Handle missing values across multiple columns and remove "junk" features.
  3. πŸ”΄ Hard (hard_clean):
    • Goal: Full schema alignment.
    • Challenge: Rename columns, perform safe type casting on dirty strings, and handle complex missing value fallbacks.

πŸš€ Quick Start

🐳 Run with Docker

# Build the production image
docker build -t openenv_data_clean:latest -f server/Dockerfile .

# Start the environment server
docker run -p 8000:8000 openenv_data_clean:latest

πŸ§ͺ Baseline Inference

We provide a deterministic, zero-temperature baseline script using the OpenAI client:

export HF_TOKEN="your_huggingface_token"
export IMAGE_NAME="openenv_data_clean:latest"
python inference.py

βš–οΈ Reward Shaping

Our reward function is designed for efficient RL convergence:

  • Incremental Progress: +0.1 for every valid schema improvement.
  • Penalization: -0.05 for invalid operations (e.g., targetting non-existent columns).
  • Completion Bonus: A final reward scaling with the total grader score [0.01 - 0.99].

🎯 Meta Hackathon Compliance

  • βœ… Typed Models: Fully Pydantic-powered Observation and Action.
  • βœ… API Standard: Implements step(), reset(), and state().
  • βœ… Strict Logs: Emits [START], [STEP], and [END] traces exactly as required.
  • βœ… Robustness: Handles network timeouts and invalid JSON carefully.

Built with ❀️ for the Meta & Hugging Face OpenEnv Hackathon.