anugrah55 committed · verified · Commit d2bd149 · Parent(s): 995d4d5

Enhance README layout and metadata

Files changed (1): README.md (+89, −54)
# 🧹 OpenEnv: Data Clean Environment
### A Real-World Benchmark for Agentic Data Engineering

[![OpenEnv Spec](https://img.shields.io/badge/OpenEnv-Compatible-green?style=for-the-badge&logo=pytorch)](https://github.com/meta-pytorch/OpenEnv)
[![HF Space](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue?style=for-the-badge)](https://huggingface.co/spaces/anugrah55/data_clean_env)
[![License](https://img.shields.io/badge/License-MIT-yellow.svg?style=for-the-badge)](LICENSE)

---
## 🌟 Overview
**Data Clean Env** is a high-fidelity, production-grade [OpenEnv](https://github.com/meta-pytorch/OpenEnv) implementation designed to evaluate and train Reinforcement Learning (RL) agents on the messy, complex reality of **data cleaning**.

Unlike "toy" environments, this project simulates the workflow of a data engineer: identifying schema inconsistencies, handling missing values, casting types, and pruning noise from real-world datasets with `pandas`.
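That workflow in miniature, assuming only a pandas install (column names and values are invented for the example):

```python
import pandas as pd  # assumption: pandas is available, as the environment is pandas-based

# A dirty frame in miniature: impute missing values, cast types, prune noise.
df = pd.DataFrame({"age": ["31", None, "28"], "junk": [1, 2, 3]})
df["age"] = df["age"].fillna("30").astype(int)  # impute, then cast to int
df = df.drop(columns=["junk"])                  # prune an irrelevant feature
print(df["age"].tolist())  # → [31, 30, 28]
```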
---
## 🛠️ Environment Architecture

### 🧠 Action Space
The agent interacts with the environment through atomic, high-level data operations defined in `models.py`:

| Action | Parameters | Description |
| :--- | :--- | :--- |
| `fill_na` | `column_name`, `value` | Replaces missing values with a specified constant. |
| `drop_na` | `column_name` | Removes rows containing missing data in the target column. |
| `drop_column` | `column_name` | Deletes irrelevant or noisy features from the dataset. |
| `rename_column` | `column_name`, `value` | Fixes naming inconsistencies to match the target schema. |
| `change_type` | `column_name`, `value` | Casts a column to `int`, `float`, or `str` for downstream compatibility. |
| `submit` | - | Finalizes the cleaning process and triggers the programmatic grader. |
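As a sketch, the action payload might be modeled like this (the real classes live in `models.py`; the field names follow the table above, but the class shape itself is an assumption):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in for the action model in models.py (shape assumed).
@dataclass
class DataCleanAction:
    op: str                            # one of the six operations above
    column_name: Optional[str] = None  # target column, where applicable
    value: Optional[str] = None        # fill value, new name, or target dtype

fill_age = DataCleanAction(op="fill_na", column_name="age", value="30")
```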
### 👁️ Observation Space
The agent perceives the state of the data through a detailed schema:
- **`df_schema`**: Real-time dictionary of column data types.
- **`missing_values`**: Current counts of `NaN` values per column.
- **`head`**: A preview of the first 5 rows to reveal formatting patterns.
- **`feedback`**: A description of the effect of the last action.
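For instance, a simple policy could read `missing_values` from the observation and target the worst column first (the payload values here are invented):

```python
# Pick the column with the most NaNs from a mocked observation payload.
obs = {
    "df_schema": {"age": "float64", "name": "object", "zip": "object"},
    "missing_values": {"age": 3, "name": 1, "zip": 0},
    "feedback": "",
}
target = max(obs["missing_values"], key=obs["missing_values"].get)
print(target)  # → age
```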
---
## 📈 Task Progression & Grading

Each task is evaluated by a **deterministic programmatic grader** that compares the agent's output against a gold-standard target, producing a score strictly between **0.0** and **1.0**.
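The grading idea can be sketched as cell-wise agreement with the gold table (a toy stand-in; the repository's actual grader may weigh things differently):

```python
# Toy grader: fraction of gold-standard cells the cleaned table reproduces.
def grade(result: dict, gold: dict) -> float:
    cells = [(row, col) for row in gold for col in gold[row]]
    if not cells:
        return 0.0
    hits = sum(result.get(r, {}).get(c) == gold[r][c] for r, c in cells)
    return hits / len(cells)

gold = {"0": {"age": 30}, "1": {"age": 25}}
print(grade({"0": {"age": 30}, "1": {"age": 40}}, gold))  # → 0.5
```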
1. **🟢 Easy (`easy_clean`)**:
   - **Goal**: Basic imputation.
   - **Challenge**: Fill missing 'age' values.
2. **🟡 Medium (`medium_clean`)**:
   - **Goal**: Noise reduction.
   - **Challenge**: Handle missing values across multiple columns and remove "junk" features.
3. **🔴 Hard (`hard_clean`)**:
   - **Goal**: Full schema alignment.
   - **Challenge**: Rename columns, perform safe type casting on dirty strings, and handle complex missing-value fallbacks.
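The hard task's "safe type casting on dirty strings" can be illustrated with a pure-Python coercion helper (the function name and fallback behavior are assumptions, not the environment's actual logic):

```python
# Coerce values that parse cleanly; substitute a fallback for the rest.
def safe_cast(values, to="float", fallback=None):
    caster = {"int": int, "float": float, "str": str}[to]
    out = []
    for v in values:
        try:
            out.append(caster(v))
        except (TypeError, ValueError):
            out.append(fallback)
    return out

print(safe_cast(["1.5", "oops", "2"], to="float"))  # → [1.5, None, 2.0]
```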
---
## 🚀 Quick Start

### 🐳 Run with Docker
```bash
# Build the production image
docker build -t openenv_data_clean:latest -f server/Dockerfile .

# Start the environment server
docker run -p 8000:8000 openenv_data_clean:latest
```
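Once the server is up, an agent talks to it over HTTP. A sketch of building such a request with the standard library (the `/step` path and JSON shape are assumptions; the OpenEnv client normally wraps this for you):

```python
import json
from urllib import request

# Build (but do not send) a step request against the local server.
def build_step_request(action: dict, base: str = "http://localhost:8000"):
    body = json.dumps({"action": action}).encode("utf-8")
    return request.Request(f"{base}/step", data=body,
                           headers={"Content-Type": "application/json"})

req = build_step_request({"op": "fill_na", "column_name": "age", "value": "30"})
print(req.get_method())  # → POST
```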
### 🧪 Baseline Inference
We provide a deterministic, zero-temperature baseline script using the OpenAI client:
```bash
export HF_TOKEN="your_huggingface_token"
export IMAGE_NAME="openenv_data_clean:latest"
python inference.py
```
---
## ⚖️ Reward Shaping
The reward function is designed for efficient RL convergence:
- **Incremental progress**: `+0.1` for every valid schema improvement.
- **Penalization**: `-0.05` for invalid operations (e.g., targeting non-existent columns).
- **Completion bonus**: a final reward scaling with the total grader score (`0.01`–`0.99`).
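Numerically, the shaping above can be sketched as follows (the constants come from the list; the function itself is illustrative, not the environment's code):

```python
# Combine the shaping terms described above into one step reward.
def shaped_reward(valid_improvement: bool, invalid_op: bool,
                  done: bool = False, grader_score: float = 0.0) -> float:
    reward = 0.0
    if valid_improvement:
        reward += 0.1   # incremental progress
    if invalid_op:
        reward -= 0.05  # penalty for invalid operations
    if done:
        reward += grader_score  # completion bonus scales with grader score
    return reward

print(shaped_reward(valid_improvement=True, invalid_op=False))  # → 0.1
```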
---
## 🎯 Meta Hackathon Compliance
- ✅ **Typed models**: Fully Pydantic-powered `Observation` and `Action`.
- ✅ **API standard**: Implements `step()`, `reset()`, and `state()`.
- ✅ **Strict logs**: Emits `[START]`, `[STEP]`, and `[END]` traces exactly as required.
- ✅ **Robustness**: Handles network timeouts and invalid JSON gracefully.
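The `[START]`/`[STEP]`/`[END]` trace convention can be met with a one-line helper (illustrative; the environment's actual logger may differ):

```python
# Emit a trace line in the required [TAG] format.
def trace(tag: str, msg: str = "") -> str:
    line = f"[{tag}] {msg}".rstrip()
    print(line)
    return line

trace("START", "episode 1")  # prints "[START] episode 1"
```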
---
Built with ❤️ for the Meta & Hugging Face OpenEnv Hackathon.