Taniieeee83 committed on
Commit da84c63 · 1 Parent(s): 2305b9f

updated readme, requirements.txt

README.md CHANGED
@@ -1,181 +1,120 @@
1
  ---
2
- title: Data Cleaning Environment
3
- emoji: 🧹
4
- colorFrom: blue
5
- colorTo: green
6
  sdk: docker
7
  pinned: false
8
  app_port: 8000
9
  tags:
10
  - openenv
11
  - rl
12
- - data-cleaning
 
13
  ---
14
 
15
- # Data Cleaning OpenEnv
16
 
17
- A **real-world data cleaning environment** for training and evaluating AI agents.
18
 
19
- An agent interacts with a dirty pandas DataFrame through a standard `reset() / step() / state()` HTTP API, learning to fix common data quality problems (missing values, duplicate rows, inconsistent formats, statistical outliers, and dtype errors) across three progressively harder tasks.
20
-
21
- 🤗 **Live HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space
22
- 📖 **Interactive API docs:** https://srishtichugh-openenv-hack.hf.space/docs
23
- ✅ **Health check:** https://srishtichugh-openenv-hack.hf.space/health
24
 
25
  ---
26
 
27
- ## Environment Description & Motivation
28
-
29
- Real-world datasets are almost never clean. Data engineers routinely spend 60–80 % of their time on data cleaning tasks: filling missing values with statistically appropriate strategies, removing duplicates, standardising inconsistent formats (phone numbers, dates, country names), and detecting extreme outliers.
30
 
31
- This environment turns those tasks into a reinforcement learning challenge with:
32
 
33
- - **Deterministic, programmatic graders**: ground-truth clean DataFrames are generated with a fixed seed; every reward signal is reproducible.
34
- - **Meaningful partial rewards**: every step emits a delta reward proportional to how much of the dataset it cleaned, so the agent receives useful signal throughout the episode rather than only at the end.
35
- - **Three difficulty levels** (easy, medium, hard), letting agents learn a curriculum from simple null-filling up to full multi-issue pipelines.
36
- - **No external data downloads**: all datasets are generated synthetically via `numpy` + `Faker` with `seed=42`.
 
37
 
38
  ---
39
 
40
- ## Action Space
41
 
42
- Actions are JSON objects sent to `POST /step`.
 
 
 
 
 
 
 
43
 
44
- | `operation` | Required `column` | `params` | Description |
45
- |---|---|---|---|
46
- | `fill_missing` | ✅ | `{"strategy": "median\|mean\|mode\|constant", "value": ...}` | Fill NaN values in a column |
47
- | `drop_duplicates` | ❌ | - | Remove all duplicate rows |
48
- | `fix_format` | ✅ | - | Standardise phone/date/country format |
49
- | `replace_value` | ✅ | `{"old": ..., "new": ...}` | Replace a specific value |
50
- | `drop_outliers` | ✅ | - | Remove IQR outliers from a numeric column |
51
- | `fix_dtype` | ✅ | `{"dtype": "float\|int\|str"}` | Cast column to correct dtype |
52
 
53
- **Format rules enforced by `fix_format`:**
54
 
55
- | Column | Target format |
56
  |---|---|
57
- | `phone` | `NNN-NNN-NNNN` |
58
- | `listed_date` / `signup_date` | `YYYY-MM-DD` |
59
- | `country` | Title-cased canonical name (`USA`, `UK`, `Canada`, `Australia`, `Germany`) |
60
-
61
- **Example actions:**
62
- ```json
63
- {"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}}
64
- {"operation": "fill_missing", "column": "department", "params": {"strategy": "mode"}}
65
- {"operation": "drop_duplicates"}
66
- {"operation": "fix_format", "column": "phone"}
67
- {"operation": "fix_format", "column": "signup_date"}
68
- {"operation": "drop_outliers", "column": "purchase_amount"}
69
- ```
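Reviewer sketch (not part of the commit): the removed action table can be read as a small validation rule set. The snippet below is a minimal illustration of that reading, assuming a hypothetical helper name `validate_action`; the server's real validation code is not shown in this diff and may differ.

```python
# Which operations need a "column", per the removed action table.
REQUIRES_COLUMN = {
    "fill_missing": True,
    "drop_duplicates": False,
    "fix_format": True,
    "replace_value": True,
    "drop_outliers": True,
    "fix_dtype": True,
}


def validate_action(action: dict) -> list:
    """Return a list of problems; an empty list means the action looks well-formed."""
    problems = []
    operation = action.get("operation")
    if operation not in REQUIRES_COLUMN:
        return [f"unknown operation: {operation!r}"]
    if REQUIRES_COLUMN[operation] and not action.get("column"):
        problems.append(f"{operation} requires a column")
    params = action.get("params") or {}
    if operation == "fill_missing":
        strategy = params.get("strategy")
        if strategy not in {"median", "mean", "mode", "constant"}:
            problems.append(f"unknown strategy: {strategy!r}")
        elif strategy == "constant" and "value" not in params:
            # The Task 1 description states that params.value is required here.
            problems.append("strategy 'constant' requires params.value")
    return problems
```

Checking an action on the client side before calling `POST /step` would avoid spending a step on the `-0.05` invalid-operation penalty listed in the removed reward table.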
70
 
71
  ---
72
 
73
- ## Observation Space
74
 
75
- Every `POST /reset` and `POST /step` returns:
76
- ```json
77
- {
78
- "observation": {
79
- "done": false,
80
- "reward": 0.40,
81
- "data_preview": "name,age,salary,...\n...",
82
- "data_shape": [100, 5],
83
- "missing_counts": {"age": 20, "salary": 20, "department": 10},
84
- "duplicate_count": 0,
85
- "dtype_issues": {},
86
- "task_description": "Task 1 (Easy) \u2014 Fill Missing Values\n...",
87
- "message": "Filled 20 missing values in 'age' using median.",
88
- "step_count": 1,
89
- "current_score": 0.4000
90
- },
91
- "reward": 0.40,
92
- "done": false,
93
- "info": {}
94
- }
95
- ```
96
 
97
- | Field | Type | Description |
98
- |---|---|---|
99
- | `done` | bool | Episode finished (score ≥ 0.95 or max steps reached) |
100
- | `reward` | float | Per-step delta reward (see Reward Function) |
101
- | `data_preview` | string | First 10 rows of current DataFrame as CSV |
102
- | `data_shape` | [int, int] | Current `[rows, cols]` |
103
- | `missing_counts` | object | `{column: null_count}` for columns with NaN |
104
- | `duplicate_count` | int | Number of duplicate rows |
105
- | `dtype_issues` | object | `{column: issue_description}` for suspected dtype mismatches |
106
- | `task_description` | string | Full task instructions with available operations |
107
- | `message` | string | Human-readable result of the last action |
108
- | `step_count` | int | Steps taken in this episode |
109
- | `current_score` | float | Running grader score 0.0 – 1.0 |
110
 
111
  ---
112
 
113
- ## State Space
114
 
115
- `GET /state` returns episode metadata (does not modify state):
 
 
 
 
 
116
  ```json
117
  {
118
- "episode_id": "a8f026a9-...",
119
- "task_id": 1,
120
- "step_count": 2,
121
- "max_steps": 20,
122
- "total_errors": 50,
123
- "errors_remaining": 30
124
  }
125
  ```
126
 
127
  ---
128
 
129
- ## Tasks
130
-
131
- ### Task 1 - Fill Missing Values *(Easy)*
132
-
133
- | Property | Value |
134
- |---|---|
135
- | Dataset | 100-row employee records (name, age, salary, department, experience) |
136
- | Issues | ~20 % NaN in `age`, `salary`; ~10 % NaN in `department` |
137
- | Goal | Fill all missing values |
138
- | Valid operations | `fill_missing` |
139
- | Grader | `1.0 − remaining_nulls / original_nulls` |
140
- | Max steps | 20 |
141
- | Optimal steps | 3 (one per affected column) |
142
-
143
- ### Task 2 - Fix Formats + Remove Duplicates *(Medium)*
144
-
145
- | Property | Value |
146
- |---|---|
147
- | Dataset | 215-row product catalog (product_id, price, category, phone, listed_date) |
148
- | Issues | ~60 % phone numbers in mixed formats, ~60 % dates in mixed formats, 15 duplicate rows |
149
- | Goal | Standardise all phone/date formats and remove duplicates |
150
- | Valid operations | `fix_format`, `drop_duplicates` |
151
- | Grader | `0.35 × phone_score + 0.35 × date_score + 0.30 × dupe_score` |
152
- | Max steps | 30 |
153
- | Optimal steps | 3 |
154
-
155
- ### Task 3 - Full Cleaning Pipeline *(Hard)*
156
-
157
- | Property | Value |
158
- |---|---|
159
- | Dataset | 320-row customer database (name, age, purchase_amount, country, email, signup_date) |
160
- | Issues | Missing values (4 cols), 20 duplicate rows, outliers in `purchase_amount` (~3× normal), mixed country capitalisation, mixed date formats |
161
- | Goal | Fix all issues end-to-end |
162
- | Valid operations | All 6 operations |
163
- | Grader | `0.25×null + 0.20×dupe + 0.20×outlier + 0.175×country + 0.175×date` |
164
- | Max steps | 40 |
165
- | Optimal steps | 8 |
166
-
167
- ---
168
-
169
  ## Reward Function
170
 
171
- | Scenario | Reward |
172
- |---|---|
173
- | Score improves (delta > 0) | `new_score − old_score` (positive) |
174
- | Operation had no effect | `−0.01` |
175
- | Invalid operation / bad column | `−0.05` |
176
- | Episode completed (score ≥ 0.95) | `delta + 0.20` terminal bonus |
177
-
178
- Rewards are bounded to **[−0.05, 1.2]**. A partial reward is emitted on every step, giving the agent dense signal throughout the episode.
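Reviewer sketch (not part of the commit): the removed reward table can be expressed as one small function. This is an illustration only; the function name `step_reward` is hypothetical, and treating a score decrease the same as "no effect" is an assumption, since the table does not cover that case.

```python
def step_reward(old_score: float, new_score: float, valid: bool = True) -> float:
    """Per-step reward following the removed reward table (a sketch, not the project's code)."""
    if not valid:
        return -0.05  # invalid operation or unknown column
    delta = new_score - old_score
    if delta <= 0:
        return -0.01  # no effect (assumption: a score decrease is treated the same way)
    reward = delta
    if new_score >= 0.95:
        reward += 0.20  # terminal bonus when the episode completes
    # Bounds stated in the README: [-0.05, 1.2]
    return max(-0.05, min(1.2, reward))
```

Under this reading the largest possible reward is a full-score jump of 1.0 plus the 0.20 bonus, which matches the stated upper bound of 1.2.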
 
 
 
 
179
 
180
  ---
181
 
@@ -183,139 +122,84 @@ Rewards are bounded to **[−0.05, 1.2]**. A partial reward is emitted on every
183
 
184
  | Method | Path | Description |
185
  |---|---|---|
186
- | `GET` | `/health` | Health check → `{"status": "healthy"}` |
187
- | `POST` | `/reset` | Start episode. Body: `{"task_id": 1\|2\|3}` (optional; default: round-robin) |
188
- | `POST` | `/step` | Execute action. Body: action JSON |
189
- | `POST` | `/state` | Get episode metadata |
190
- | `GET` | `/metadata` | Environment name, version, task list |
191
- | `GET` | `/schema` | Full action / observation / state JSON schemas |
192
- | `GET` | `/docs` | Interactive Swagger UI |
 
193
 
194
  ---
195
 
196
- ## Baseline Scores
197
 
198
- | Task | Difficulty | Score |
199
- |---|---|---|
200
- | 1 - Fill Missing Values | Easy | 0.999 |
201
- | 2 - Fix Formats + Duplicates | Medium | 0.999 |
202
- | 3 - Full Cleaning Pipeline | Hard | 0.999 |
203
- | **Average** | - | **0.999** |
204
 
205
- *Produced by `google/gemma-3-27b-it` via NVIDIA NIM, `temperature=0`. Full step-by-step agent logs: `inference_log.txt`.*
 
 
206
 
207
  ---
208
 
209
- ## Setup & Usage
210
-
211
- ### Prerequisites
212
-
213
- - Python 3.11+
214
- - Docker (for containerised deployment)
215
 
216
- ### Local - Python
217
  ```bash
218
- # 1. Clone and install dependencies
219
- git clone https://github.com/Tanvi51204/openEnv.git
220
- cd openEnv
221
  pip install -r requirements.txt
222
 
223
- # 2. Start the server
224
  uvicorn server.app:app --host 0.0.0.0 --port 8000
225
 
226
- # 3. Open Swagger UI
227
- open http://localhost:8000/docs
228
- ```
229
-
230
- ### Local - Docker
231
- ```bash
232
- docker build -t data-cleaning-env .
233
- docker run -p 8000:8000 data-cleaning-env
234
- ```
235
 
236
- ### Quick API test
237
- ```bash
238
- # Health
239
- curl http://localhost:8000/health
240
-
241
- # Start Task 1
242
- curl -X POST http://localhost:8000/reset \
243
- -H "Content-Type: application/json" \
244
- -d '{"task_id": 1}'
245
-
246
- # Fill missing values
247
- curl -X POST http://localhost:8000/step \
248
- -H "Content-Type: application/json" \
249
- -d '{"operation": "fill_missing", "column": "salary", "params": {"strategy": "median"}}'
250
  ```
251
 
252
- ### Python client
253
- ```python
254
- from client import DataCleaningEnvClient
255
- from models import DataCleaningAction
256
-
257
- with DataCleaningEnvClient("http://localhost:8000") as env:
258
- result = env.reset(task_id=1)
259
- print(result.observation.missing_counts) # {'age': 20, 'salary': 20, 'department': 10}
260
-
261
- action = DataCleaningAction(
262
- operation="fill_missing",
263
- column="salary",
264
- params={"strategy": "median"},
265
- )
266
- result = env.step(action)
267
- print(result.observation.current_score) # 0.4
268
- print(result.reward) # 0.4
269
- ```
270
 
271
- ### Run baseline inference
272
  ```bash
273
- export API_BASE_URL="https://api.openai.com/v1"
274
- export MODEL_NAME="gpt-4o-mini"
275
- export HF_TOKEN="sk-..." # your API key
276
- export ENV_URL="http://localhost:8000"
277
-
278
- python inference.py
279
  ```
280
 
281
- Produces `[START]` / `[STEP]` / `[END]` lines to stdout and `baseline_scores.json`.
282
-
283
- ### Environment variables
284
-
285
- | Variable | Default | Description |
286
- |---|---|---|
287
- | `API_BASE_URL` | `https://api.openai.com/v1` | LLM API endpoint (OpenAI-compatible) |
288
- | `MODEL_NAME` | `gpt-4o-mini` | Model identifier |
289
- | `HF_TOKEN` | - | API key for LLM calls |
290
- | `ENV_URL` | `http://localhost:8000` | Environment server URL |
291
-
292
  ---
293
 
294
  ## Project Structure
 
295
  ```
296
- openenv-data-cleaning/
297
- ├── models.py              Pydantic contracts: Action / Observation / State
298
- ├── client.py              Sync HTTP client (reset / step / state / health)
299
- ├── inference.py           Baseline LLM agent with [START]/[STEP]/[END] logging
300
- ├── openenv.yaml           OpenEnv manifest
301
- ├── Dockerfile             python:3.11-slim, non-root user, HEALTHCHECK
302
- ├── requirements.txt       pip dependencies
303
- ├── pyproject.toml         Python package metadata + openenv-core dependency
304
- └── server/
305
-     ├── app.py             FastAPI routes + /metadata + /schema
306
-     ├── environment.py     reset / step / state logic + 6 operations + rewards
307
-     ├── data_generator.py  Synthetic dataset generation (seed=42, reproducible)
308
-     └── tasks/
309
-         ├── task1_missing.py   Easy: fill NaN grader
310
-         ├── task2_format.py    Medium: format + duplicates grader
311
-         └── task3_pipeline.py  Hard: full pipeline grader
 
 
 
 
 
312
  ```
313
 
314
  ---
315
 
316
- ## Live Demo
317
-
318
- 🤗 **HuggingFace Space:** https://srishtichugh-openenv-hack.hf.space
319
-
320
- - Health: https://srishtichugh-openenv-hack.hf.space/health
321
- - Docs: https://srishtichugh-openenv-hack.hf.space/docs
 
1
  ---
2
+ title: OrgOS Enterprise Workflow RL Environment
3
+ emoji: 🏒
4
+ colorFrom: indigo
5
+ colorTo: cyan
6
  sdk: docker
7
  pinned: false
8
  app_port: 8000
9
  tags:
10
  - openenv
11
  - rl
12
+ - enterprise
13
+ - multi-app
14
  ---
15
 
16
+ # OrgOS - Enterprise Workflow RL Environment
17
 
18
+ **OrgOS** is a multi-app enterprise reinforcement learning environment where an AI agent completes real business workflows across four interconnected SaaS applications. Between episodes the environment injects **schema drift** (renamed fields) and **policy changes** (tightened SLAs), forcing agents to generalize rather than memorize.
19
 
20
+ Built for the [Meta PyTorch × Scaler OpenEnv Hackathon](https://huggingface.co/), targeting the **Multi-App Enterprise Workflow** sub-theme.
 
 
 
 
21
 
22
  ---
23
 
24
+ ## Live Demo
 
 
25
 
26
+ 🚀 **[HuggingFace Space →](https://huggingface.co/spaces/tanvibisht/orgos-openenv)**
27
 
28
+ ```bash
29
+ # Local quickstart
30
+ uvicorn server.app:app --host 0.0.0.0 --port 8000
31
+ # Open http://localhost:8000 for the live dashboard
32
+ ```
33
 
34
  ---
35
 
36
+ ## What Makes OrgOS Unique
37
 
38
+ | Feature | Description |
39
+ |---|---|
40
+ | **4 Mock SaaS Apps** | Jira, Zendesk, Salesforce, Workday, each with realistic operations |
41
+ | **Schema Drift** | Fields rename between episodes (e.g. `priority → severity → urgency_level`). Agent gets `-0.20` for stale names, `+0.10` for adapted names |
42
+ | **Policy Drift** | Every 3rd episode, SLA thresholds tighten automatically |
43
+ | **3 Workflows** | Cross-app tasks of increasing complexity: Bug Fix → Onboarding → Churn Alert |
44
+ | **RBAC** | Support vs. manager roles enforced; `-0.25` penalty for unauthorized actions |
45
+ | **Dense Reward** | Per-step composite signal tied to 5 measurable business outcomes |
46
 
47
+ ---
 
 
 
 
 
 
 
48
 
49
+ ## Applications & Operations
50
 
51
+ | App | Key Operations |
52
  |---|---|
53
+ | **Jira** | `get_issue`, `create_issue`, `update_status`, `set_priority`, `assign_owner`, `link_zendesk_ticket`, `close_issue`, `list_issues` |
54
+ | **Zendesk** | `get_ticket`, `acknowledge_ticket`, `set_urgency`, `assign_agent`, `escalate_to_jira`, `resolve_ticket`, `add_note`, `list_tickets` |
55
+ | **Salesforce** | `get_account`, `list_accounts`, `update_deal_stage`, `flag_churn_risk`, `assign_account_owner`, `log_interaction`, `get_opportunity` |
56
+ | **Workday** | `get_employee`, `list_employees`, `provision_access`, `log_sla_event`, `request_budget_approval`, `create_onboarding_task`, `complete_task` |
 
 
 
 
 
 
 
 
 
57
 
58
  ---
59
 
60
+ ## Workflows
61
 
62
+ ### Workflow A - Customer Bug Fix (support role, 5 workflow steps, max 15 agent steps)
63
+ 1. Acknowledge Zendesk ticket
64
+ 2. Create linked Jira issue
65
+ 3. Assign Jira issue to engineer
66
+ 4. Log SLA event in Workday
67
+ 5. Query Salesforce for account health
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
68
 
69
+ ### Workflow B - Employee Onboarding (manager role, 4 workflow steps, max 20 agent steps)
70
+ 1. Create employee record in Workday
71
+ 2. Provision Jira access
72
+ 3. Add employee to Salesforce team
73
+ 4. Create Zendesk support profile
74
+
75
+ ### Workflow C - Churn Risk Alert (support role, 4 workflow steps, max 18 agent steps)
76
+ 1. Flag churn risk in Salesforce
77
+ 2. Escalate to Zendesk ticket
78
+ 3. Create Jira tracking issue
79
+ 4. Log SLA event in Workday
 
 
80
 
81
  ---
82
 
83
+ ## Action / Observation Format
84
 
85
+ **Action:**
86
+ ```json
87
+ {"app": "zendesk", "operation": "acknowledge_ticket", "args": {"ticket_number": "ZD-001"}}
88
+ ```
89
+
90
+ **Observation (key fields):**
91
  ```json
92
  {
93
+ "workflow_goal": "Resolve customer bug report end-to-end",
94
+ "pending_steps": ["Assign Jira issue to engineer", "Log SLA event in Workday"],
95
+ "schema_hints": {"jira.priority": "severity"},
96
+ "active_rules": {"sla_p0_minutes": 30},
97
+ "current_score": 0.42,
98
+ "message": "Jira issue JI-001 created and linked to ZD-001"
99
  }
100
  ```
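Reviewer sketch (not part of the commit): one way an agent could use `schema_hints` from the observation is to rename argument keys before sending an action. The helper name `adapt_args`, the argument key `issue_id`, and the value `"P0"` are hypothetical; only the hint format `{"jira.priority": "severity"}` comes from the README.

```python
def adapt_args(app: str, args: dict, schema_hints: dict) -> dict:
    """Rename argument keys using hints such as {"jira.priority": "severity"}."""
    adapted = {}
    for key, value in args.items():
        # Hints are keyed as "<app>.<canonical_field>"; unknown keys pass through unchanged.
        adapted[schema_hints.get(f"{app}.{key}", key)] = value
    return adapted


# Observation fragment in the shape shown in the README.
observation = {"schema_hints": {"jira.priority": "severity"}}

action = {
    "app": "jira",
    "operation": "set_priority",
    # "issue_id" and "P0" are illustrative values, not confirmed argument names.
    "args": adapt_args("jira", {"issue_id": "JI-001", "priority": "P0"}, observation["schema_hints"]),
}
```

Hints for one app are not applied to another, so a `jira.priority` hint leaves a Zendesk action's `priority` key untouched.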
101
 
102
  ---
103
 
104
  ## Reward Function
105
 
106
+ ```
107
+ score = 0.30 × workflow_completion
108
+       + 0.25 × rule_compliance
109
+       + 0.20 × schema_adaptation
110
+       + 0.15 × efficiency
111
+       + 0.10 × policy_drift_handling
112
+
113
+ Per-step delta            = new_score − old_score
114
+ Schema error penalty      = −0.20
115
+ RBAC violation penalty    = −0.25
116
+ Terminal completion bonus = +0.20
117
+ ```
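Reviewer sketch (not part of the commit): the composite score is a weighted sum whose weights add up to 1.0. The snippet below illustrates the arithmetic, assuming each component is a value in [0, 1]; the function names are hypothetical, and how the penalties and the terminal bonus combine with the per-step delta is not specified in this diff, so they are left out.

```python
import math

# Weights copied from the README's reward formula.
WEIGHTS = {
    "workflow_completion": 0.30,
    "rule_compliance": 0.25,
    "schema_adaptation": 0.20,
    "efficiency": 0.15,
    "policy_drift_handling": 0.10,
}


def composite_score(components: dict) -> float:
    """Weighted sum of the five components; missing components count as 0.0."""
    return sum(weight * components.get(name, 0.0) for name, weight in WEIGHTS.items())


def step_delta(old_score: float, new_score: float) -> float:
    """Per-step delta as defined in the README: new_score - old_score."""
    return new_score - old_score


# Illustrative component values, not taken from a real episode.
before = composite_score({"workflow_completion": 0.4, "rule_compliance": 1.0})
after = composite_score({"workflow_completion": 0.6, "rule_compliance": 1.0, "schema_adaptation": 1.0})
delta = step_delta(before, after)
```

With these illustrative inputs the score moves from about 0.37 to about 0.63, so most of the per-step gain comes from the schema adaptation component.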
118
 
119
  ---
120
 
 
122
 
123
  | Method | Path | Description |
124
  |---|---|---|
125
+ | `GET` | `/health` | Health check |
126
+ | `POST` | `/reset` | Start new episode (`{"workflow_id": "A"\|"B"\|"C"}`) |
127
+ | `POST` | `/step` | Take action (`{"app": ..., "operation": ..., "args": {...}}`) |
128
+ | `GET` | `/state` | Current episode metadata |
129
+ | `GET` | `/schema/apps` | All app operations catalogue |
130
+ | `GET` | `/docs` | Swagger UI |
131
+ | `GET` | `/` | Live dashboard (UI) |
132
+ | `GET` | `/ui/run-agent` | SSE stream: live agent inference |
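Reviewer sketch (not part of the commit): the endpoint table maps onto plain HTTP requests. The snippet below only builds the request objects with the standard library and does not send them; the base URL is the local default from the README, and `build_request` is a hypothetical helper.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "http://localhost:8000"  # local default used elsewhere in the README


def build_request(method: str, path: str, body: Optional[dict] = None) -> Request:
    """Build (but do not send) a JSON request for one of the listed endpoints."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


reset_req = build_request("POST", "/reset", {"workflow_id": "A"})
step_req = build_request(
    "POST",
    "/step",
    {"app": "zendesk", "operation": "acknowledge_ticket", "args": {"ticket_number": "ZD-001"}},
)
state_req = build_request("GET", "/state")
```

With the server running, each request could be sent with `urllib.request.urlopen(reset_req)` and the response body parsed with `json.loads`.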
133
 
134
  ---
135
 
136
+ ## Training
137
 
138
+ The `training/grpo_orgos.ipynb` notebook trains **Qwen2.5-3B-Instruct** with **Unsloth 4-bit LoRA** using **HF TRL GRPOTrainer**:
 
 
 
 
 
139
 
140
+ - Before training: ~0.55 score (uses stale canonical field names → schema error penalties)
141
+ - After training: ~0.75 score (reads `schema_hints`, uses drifted field names → adaptation bonuses)
142
+ - **Δ ≈ +0.20** per episode, visible in `before_after_curves.png`
143
 
144
  ---
145
 
146
+ ## Local Setup
 
 
 
 
 
147
 
 
148
  ```bash
149
+ # 1. Install dependencies
 
 
150
  pip install -r requirements.txt
151
 
152
+ # 2. Start server
153
  uvicorn server.app:app --host 0.0.0.0 --port 8000
154
 
155
+ # 3. Run baseline inference (requires LLM API)
156
+ export API_BASE_URL=https://api.openai.com/v1
157
+ export MODEL_NAME=gpt-4o-mini
158
+ export HF_TOKEN=your_token
159
+ python inference.py
 
 
 
 
160
 
161
+ # 4. Or use the Python client (the lines below are Python; run them in a Python interpreter, not the shell)
162
+ from client import OrgOSEnvClient
163
+ client = OrgOSEnvClient("http://localhost:8000")
164
+ result = client.reset(workflow_id="A")
165
+ print(result.observation.workflow_goal)
 
 
 
 
 
 
 
 
 
166
  ```
167
 
168
+ ## Docker
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
169
 
 
170
  ```bash
171
+ docker build -t orgos-env .
172
+ docker run -p 8000:8000 orgos-env
 
 
 
 
173
  ```
174
 
 
 
 
 
 
 
 
 
 
 
 
175
  ---
176
 
177
  ## Project Structure
178
+
179
  ```
180
+ openEnv/
181
+ ├── server/
182
+ │   ├── app.py               # FastAPI routes (15 endpoints)
183
+ │   ├── environment.py       # OrgOSEnvironment: reset/step/state
184
+ │   ├── schema_drift.py      # Per-episode field renames
185
+ │   ├── business_rules.py    # RBAC + SLA enforcement
186
+ │   ├── workflow_engine.py   # 3 cross-app workflow definitions
187
+ │   ├── data_generator.py    # Synthetic data (seed=42)
188
+ │   └── apps/
189
+ │       ├── jira.py
190
+ │       ├── zendesk.py
191
+ │       ├── salesforce.py
192
+ │       └── workday.py
193
+ ├── models.py                # Pydantic models
194
+ ├── client.py                # OrgOSEnvClient
195
+ ├── inference.py             # Baseline inference loop + SSE generator
196
+ ├── ui/index.html            # Live dashboard (Tailwind + Alpine.js + Chart.js)
197
+ ├── training/
198
+ │   └── grpo_orgos.ipynb     # GRPO training notebook (Colab)
199
+ ├── openenv.yaml             # OpenEnv manifest
200
+ └── Dockerfile
201
  ```
202
 
203
  ---
204
 
205
+ MIT License · Built for Meta PyTorch × Scaler OpenEnv Hackathon Round 2
 
 
 
 
 
openenv.yaml CHANGED
@@ -1,73 +1,86 @@
1
- name: data-cleaning-env
2
- version: "0.1.0"
3
  description: >
4
- A real-world data cleaning environment where an AI agent fixes missing
5
- values, duplicate rows, format inconsistencies, outliers, and dtype errors
6
- across three progressively harder tasks.
 
7
 
8
- author: openenv-hackathon
9
  tags:
10
  - openenv
11
- - data-cleaning
 
 
12
  - rl
13
- - real-world
14
 
15
  tasks:
16
- - id: task1
17
- name: "Fill Missing Values"
18
  difficulty: easy
19
- max_steps: 20
20
  description: >
21
- Fill all NaN values in an employee records dataset.
22
- Columns with missing data: age, salary, department.
 
23
 
24
- - id: task2
25
- name: "Fix Formats and Remove Duplicates"
26
  difficulty: medium
27
- max_steps: 30
28
  description: >
29
- Standardise phone numbers (NNN-NNN-NNNN) and dates (YYYY-MM-DD)
30
- in a product catalog, and remove ~15 duplicate rows.
 
31
 
32
- - id: task3
33
- name: "Full Cleaning Pipeline"
34
- difficulty: hard
35
- max_steps: 40
36
  description: >
37
- End-to-end pipeline on a customer database: fill missing values,
38
- remove duplicates, drop outliers in purchase_amount, standardise
39
- country capitalisation, and fix mixed date formats.
40
 
41
  api:
42
- health: GET /health
43
- reset: POST /reset
44
- step: POST /step
45
- state: POST /state
46
- docs: GET /docs
 
 
47
 
48
  reward:
49
  range: [0.001, 0.999]
50
- partial: true
51
- terminal_bonus: 0.0
 
 
 
 
 
 
52
 
53
  observation_space:
54
- type: object
55
  fields:
56
- done: boolean
57
- reward: float
58
- data_preview: string # First 10 rows as CSV
59
- data_shape: list # [rows, cols]
60
- missing_counts: object # {column: count}
61
- duplicate_count: integer
62
- dtype_issues: object # {column: issue_description}
63
- task_description: string
64
- message: string
65
- step_count: integer
66
- current_score: float # 0.0–1.0
 
 
 
67
 
68
  action_space:
69
- type: object
70
  fields:
71
- operation: string # fill_missing | drop_duplicates | fix_format | replace_value | drop_outliers | fix_dtype
72
- column: string # optional depending on operation
73
- params: object # optional operation parameters
 
1
+ name: orgos-openenv
2
+ version: "2.0.0"
3
  description: >
4
+ OrgOS is a multi-app enterprise RL environment where an agent completes
5
+ business workflows across Jira, Zendesk, Salesforce, and Workday.
6
+ Between episodes, schema drift renames fields and policy drift tightens SLAs,
7
+ forcing agents to generalize rather than memorize.
8
 
9
+ author: tanvibisht
10
  tags:
11
  - openenv
12
+ - enterprise
13
+ - multi-app
14
+ - schema-drift
15
  - rl
 
16
 
17
  tasks:
18
+ - id: workflow_a
19
+ name: "Customer Bug Fix"
20
  difficulty: easy
21
+ max_steps: 15
22
  description: >
23
+ Triage a customer bug report end-to-end: acknowledge the Zendesk ticket,
24
+ create a linked Jira issue, assign it to an engineer, log the SLA event
25
+ in Workday, and query Salesforce for account health. Support role only.
26
 
27
+ - id: workflow_b
28
+ name: "Employee Onboarding"
29
  difficulty: medium
30
+ max_steps: 20
31
  description: >
32
+ Onboard a new employee: create their Workday record, provision Jira access
33
+ based on role, add them to the correct Salesforce territory team, and
34
+ create their Zendesk support profile. Manager role required.
35
 
36
+ - id: workflow_c
37
+ name: "Churn Risk Alert"
38
+ difficulty: medium
39
+ max_steps: 18
40
  description: >
41
+ Respond to a churn risk signal: flag the account in Salesforce, escalate
42
+ to a Zendesk ticket, create a Jira tracking issue, and log the SLA event
43
+ in Workday. Support role. Policy drift may tighten SLA thresholds.
44
 
45
  api:
46
+ routes:
47
+ health: GET /health
48
+ reset: POST /reset
49
+ step: POST /step
50
+ state: GET /state
51
+ docs: GET /docs
52
+ schema: GET /schema/apps
53
 
54
  reward:
55
  range: [0.001, 0.999]
56
+ partial_rewards: true
57
+ terminal_bonus: 0.20
58
+ components:
59
+ workflow_completion: 0.30
60
+ rule_compliance: 0.25
61
+ schema_adaptation: 0.20
62
+ efficiency: 0.15
63
+ policy_drift_handling: 0.10
64
 
65
  observation_space:
 
66
  fields:
67
+ - done: bool
68
+ - reward: float
69
+ - current_score: "float in [0.001, 0.999]"
70
+ - workflow_id: "A | B | C"
71
+ - step_count: int
72
+ - app_states: "dict[app_name, str] - preview of each app's records"
73
+ - workflow_goal: str
74
+ - completed_steps: "list[str]"
75
+ - pending_steps: "list[str]"
76
+ - schema_hints: "dict[str, str] - e.g. {\"jira.priority\": \"severity\"}"
77
+ - active_rules: "dict - current SLA thresholds and RBAC rules"
78
+ - rule_violations: "list[str] - violations from last action"
79
+ - reward_breakdown: "RewardBreakdown - 5-component score snapshot"
80
+ - message: "str - feedback from last action"
81
 
82
  action_space:
 
83
  fields:
84
+ - app: "jira | zendesk | salesforce | workday"
85
+ - operation: str
86
+ - args: "dict - operation-specific arguments"
pyproject.toml CHANGED
@@ -1,26 +1,23 @@
 
 
 
 
1
  [project]
2
- name = "data-cleaning-env"
3
- version = "0.1.0"
4
- description = "Real-world data cleaning environment for OpenEnv / Scaler hackathon"
5
  requires-python = ">=3.11"
6
  dependencies = [
7
- "fastapi==0.135.2",
8
- "uvicorn[standard]==0.40.0",
9
- "pydantic==2.12.5",
10
- "pandas==2.2.3",
11
- "numpy==2.2.4",
12
- "faker==40.12.0",
13
- "openai==2.15.0",
14
- "httpx==0.28.1",
15
- "openenv-core==0.2.3",
16
  ]
17
 
18
- [project.scripts]
19
- server = "server.app:main"
20
-
21
- [build-system]
22
- requires = ["hatchling"]
23
- build-backend = "hatchling.build"
24
-
25
  [tool.hatch.build.targets.wheel]
26
- packages = ["server"]
 
1
+ [build-system]
2
+ requires = ["hatchling"]
3
+ build-backend = "hatchling.build"
4
+
5
  [project]
6
+ name = "orgos"
7
+ version = "2.0.0"
8
+ description = "OrgOS - Multi-App Enterprise Workflow RL Environment"
9
  requires-python = ">=3.11"
10
  dependencies = [
11
+ "fastapi",
12
+ "uvicorn[standard]",
13
+ "pydantic",
14
+ "numpy",
15
+ "faker",
16
+ "openai",
17
+ "httpx",
18
+ "openenv-core",
19
+ "aiofiles",
20
  ]
21
 
 
 
 
 
 
 
 
22
  [tool.hatch.build.targets.wheel]
23
+ packages = ["server"]
requirements.txt CHANGED
@@ -1,9 +1,9 @@
1
  fastapi==0.135.2
2
  uvicorn[standard]==0.40.0
3
  pydantic==2.12.5
4
- pandas==2.2.3
5
  numpy==2.2.4
6
  faker==40.12.0
7
  openai==2.15.0
8
  httpx==0.28.1
9
- openenv-core==0.2.3
 
 
1
  fastapi==0.135.2
2
  uvicorn[standard]==0.40.0
3
  pydantic==2.12.5
 
4
  numpy==2.2.4
5
  faker==40.12.0
6
  openai==2.15.0
7
  httpx==0.28.1
8
+ openenv-core==0.2.3
9
+ aiofiles>=23.0.0
server/tasks/task1_missing.py DELETED
@@ -1,39 +0,0 @@
- """
- Task 1 - Easy: Fill Missing Values
- Objective: Fill all NaN values in the employee records DataFrame.
- Score: 1.0 - (remaining_nulls / original_nulls)
- """
-
- from server.data_generator import generate_task1_datasets
-
- TASK_ID = 1
- MAX_STEPS = 20
- DESCRIPTION = (
-     "Task 1 (Easy) \u2014 Fill Missing Values\n"
-     "You have an employee records dataset with missing values (NaN) in "
-     "'age', 'salary', and 'department' columns. "
-     "Your goal is to fill all missing values so the dataset is complete.\n\n"
-     "Available operation: fill_missing\n"
-     " params.strategy: 'median' | 'mean' | 'mode' | 'constant'\n"
-     " params.value: (required when strategy='constant') the fill value\n"
-     "Example action: {\"operation\": \"fill_missing\", \"column\": \"age\", \"params\": {\"strategy\": \"median\"}}"
- )
-
-
- def load():
-     """Return (dirty_df, clean_df, original_null_count)."""
-     dirty, clean = generate_task1_datasets()
-     original_nulls = int(dirty.isnull().sum().sum())
-     return dirty.copy(), clean, original_nulls
-
-
- def score(current_df, original_nulls: int) -> float:
-     """Score in [0, 1]: fraction of nulls filled."""
-     if original_nulls == 0:
-         return 0.99
-     remaining = int(current_df.isnull().sum().sum())
-     return round(max(0.01, min(0.99, 1.0 - remaining / original_nulls)), 4)
-
-
- def count_errors(current_df) -> int:
-     return int(current_df.isnull().sum().sum())
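Reviewer sketch (not part of the commit): the deleted Task 1 grader clamps its result to [0.01, 0.99], so a fully cleaned dataset scores 0.99 rather than 1.0. The pandas-free function below reproduces only the arithmetic of `score`; the name `task1_score` is hypothetical, and the null counts (20 + 20 + 10 = 50) come from the example observation in the old README.

```python
def task1_score(remaining_nulls: int, original_nulls: int) -> float:
    """Same arithmetic as the deleted score(), taking counts instead of a DataFrame."""
    if original_nulls == 0:
        return 0.99
    raw = 1.0 - remaining_nulls / original_nulls
    # Clamp to [0.01, 0.99], then round to 4 decimals, as in the deleted file.
    return round(max(0.01, min(0.99, raw)), 4)


# 50 nulls at reset; filling 'salary' (20 nulls) leaves 30.
after_salary = task1_score(30, 50)
fully_clean = task1_score(0, 50)
untouched = task1_score(50, 50)
```

The first value agrees with the `current_score` of 0.4 shown in the old README's example observation after one `fill_missing` step.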
 
server/tasks/task2_format.py DELETED
@@ -1,68 +0,0 @@
- """
- Task 2 - Medium: Fix Formats + Remove Duplicates
- Objective: Standardise phone & date formats and drop duplicate rows.
- Score: weighted average of format_score (0.7) + dupe_score (0.3)
- """
-
- import re
- import pandas as pd
- from server.data_generator import generate_task2_datasets
-
- TASK_ID = 2
- MAX_STEPS = 30
- DESCRIPTION = (
-     "Task 2 (Medium) \u2014 Fix Formats and Remove Duplicates\n"
-     "You have a product catalog with:\n"
-     " • Phone numbers in mixed formats (need: NNN-NNN-NNNN)\n"
-     " • Dates in mixed formats (need: YYYY-MM-DD)\n"
-     " • Duplicate rows (~15)\n\n"
-     "Available operations:\n"
-     " fix_format \u2014 column: 'phone' | 'listed_date'\n"
-     " drop_duplicates \u2014 no column needed\n\n"
-     "Example actions:\n"
-     ' {"operation": "fix_format", "column": "phone"}\n'
-     ' {"operation": "fix_format", "column": "listed_date"}\n'
-     ' {"operation": "drop_duplicates"}'
- )
-
- PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
- DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
-
-
- def load():
-     dirty, clean = generate_task2_datasets()
-     original_phone_issues = int((~dirty["phone"].str.match(PHONE_RE)).sum())
-     original_date_issues = int((~dirty["listed_date"].apply(
-         lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
-     )).sum())
-     original_dupes = len(dirty) - len(dirty.drop_duplicates())
-     meta = {
-         "orig_phone": original_phone_issues,
-         "orig_date": original_date_issues,
-         "orig_dupes": original_dupes,
-     }
-     return dirty.copy(), clean, meta
-
-
- def score(current_df, meta: dict) -> float:
-     phone_issues = int((~current_df["phone"].str.match(PHONE_RE)).sum())
-     date_issues = int((~current_df["listed_date"].apply(
-         lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
-     )).sum())
-     dupes = len(current_df) - len(current_df.drop_duplicates())
-
-     phone_score = 1.0 - phone_issues / max(meta["orig_phone"], 1)
-     date_score = 1.0 - date_issues / max(meta["orig_date"], 1)
-     dupe_score = 1.0 - dupes / max(meta["orig_dupes"], 1)
-
-     combined = 0.35 * phone_score + 0.35 * date_score + 0.30 * dupe_score
-     return round(max(0.01, min(0.99, combined)), 4)
-
-
- def count_errors(current_df, meta: dict) -> int:
-     phone_issues = int((~current_df["phone"].str.match(PHONE_RE)).sum())
-     date_issues = int((~current_df["listed_date"].apply(
-         lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
-     )).sum())
-     dupes = len(current_df) - len(current_df.drop_duplicates())
-     return phone_issues + date_issues + dupes
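Reviewer sketch (not part of the commit): the deleted Task 2 grader combines two regex checks with a 0.35 / 0.35 / 0.30 weighting. The pandas-free version below takes issue counts directly to show the arithmetic; `task2_score` is a hypothetical name and the counts in `meta` are made-up illustration values, not figures from the dataset generator.

```python
import re

# Patterns copied from the deleted task2_format.py.
PHONE_RE = re.compile(r"^\d{3}-\d{3}-\d{4}$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def task2_score(phone_issues: int, date_issues: int, dupes: int, meta: dict) -> float:
    """Same weighting and clamping as the deleted score(), taking counts instead of a DataFrame."""
    phone_score = 1.0 - phone_issues / max(meta["orig_phone"], 1)
    date_score = 1.0 - date_issues / max(meta["orig_date"], 1)
    dupe_score = 1.0 - dupes / max(meta["orig_dupes"], 1)
    combined = 0.35 * phone_score + 0.35 * date_score + 0.30 * dupe_score
    return round(max(0.01, min(0.99, combined)), 4)


# Hypothetical starting counts, for illustration only.
meta = {"orig_phone": 120, "orig_date": 120, "orig_dupes": 15}

phones_fixed_only = task2_score(0, 120, 15, meta)
everything_fixed = task2_score(0, 0, 0, meta)
```

Because each sub-score is normalised by its own starting count, fixing all phone numbers contributes 0.35 to the score regardless of how many rows were affected.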
 
server/tasks/task3_pipeline.py DELETED
@@ -1,104 +0,0 @@
- """
- Task 3 - Hard: Full Cleaning Pipeline
- Objective: Fix missing values, remove duplicates, handle outliers, standardise
- country capitalisation and date formats.
- Score: equal-weight average of 4 sub-scores.
- """
-
- import re
- import numpy as np
- import pandas as pd
- from server.data_generator import generate_task3_datasets
-
- TASK_ID = 3
- MAX_STEPS = 40
- DESCRIPTION = (
-     "Task 3 (Hard) - Full Cleaning Pipeline\n"
-     "You have a customer database with multiple issues:\n"
-     " 1. Missing values in 'age', 'purchase_amount', 'country', 'signup_date'\n"
-     " 2. ~20 duplicate rows\n"
-     " 3. Outliers in 'purchase_amount' (injected values ~10x normal)\n"
-     " 4. Mixed case in 'country' (need: title case, e.g. 'Usa' → 'USA')\n"
-     " 5. Mixed date formats in 'signup_date' (need: YYYY-MM-DD)\n\n"
-     "Available operations:\n"
-     " fill_missing - column + params.strategy ('median'|'mean'|'mode'|'constant')\n"
-     " drop_duplicates - no column needed\n"
-     " drop_outliers - column (numeric); uses IQR method\n"
-     " fix_format - column: 'country' | 'signup_date'\n"
-     " fix_dtype - column + params.dtype ('float'|'int'|'str')\n\n"
-     "Example actions:\n"
-     ' {"operation": "fill_missing", "column": "age", "params": {"strategy": "median"}}\n'
-     ' {"operation": "drop_duplicates"}\n'
-     ' {"operation": "drop_outliers", "column": "purchase_amount"}\n'
-     ' {"operation": "fix_format", "column": "signup_date"}\n'
-     ' {"operation": "fix_format", "column": "country"}'
- )
-
- DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")
- VALID_COUNTRIES = {"USA", "UK", "Canada", "Australia", "Germany"}
-
-
- def load():
-     dirty, clean = generate_task3_datasets()
-     orig_nulls = int(dirty.isnull().sum().sum())
-     orig_dupes = len(dirty) - len(dirty.drop_duplicates())
-
-     # Outlier baseline: count rows where purchase_amount > Q3 + 3*IQR
-     pa = dirty["purchase_amount"].dropna()
-     q1, q3 = pa.quantile(0.25), pa.quantile(0.75)
-     iqr = q3 - q1
-     orig_outliers = int((pa > q3 + 3 * iqr).sum())
-
-     orig_country_issues = int((~dirty["country"].isin(VALID_COUNTRIES) &
-                                dirty["country"].notna()).sum())
-     orig_date_issues = int((~dirty["signup_date"].apply(
-         lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
-     )).sum())
-
-     meta = {
-         "orig_nulls": orig_nulls,
-         "orig_dupes": orig_dupes,
-         "orig_outliers": max(orig_outliers, 1),
-         "orig_country_issues": max(orig_country_issues, 1),
-         "orig_date_issues": max(orig_date_issues, 1),
-         "q1": q1, "q3": q3, "iqr": iqr,
-     }
-     return dirty.copy(), clean, meta
-
-
- def score(current_df, meta: dict) -> float:
-     remaining_nulls = int(current_df.isnull().sum().sum())
-     remaining_dupes = len(current_df) - len(current_df.drop_duplicates())
-
-     pa = current_df["purchase_amount"].dropna()
-     remaining_outliers = int((pa > meta["q3"] + 3 * meta["iqr"]).sum())
-
-     remaining_country = int((~current_df["country"].isin(VALID_COUNTRIES) &
-                              current_df["country"].notna()).sum())
-     remaining_dates = int((~current_df["signup_date"].apply(
-         lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
-     )).sum())
-
-     null_score = 1.0 - remaining_nulls / max(meta["orig_nulls"], 1)
-     dupe_score = 1.0 - remaining_dupes / max(meta["orig_dupes"], 1)
-     outlier_score = 1.0 - remaining_outliers / meta["orig_outliers"]
-     country_score = 1.0 - remaining_country / meta["orig_country_issues"]
-     date_score = 1.0 - remaining_dates / meta["orig_date_issues"]
-
-     combined = 0.25 * null_score + 0.20 * dupe_score + 0.20 * outlier_score \
-                + 0.175 * country_score + 0.175 * date_score
-     return round(max(0.01, min(0.99, combined)), 4)
-
-
- def count_errors(current_df, meta: dict) -> int:
-     remaining_nulls = int(current_df.isnull().sum().sum())
-     remaining_dupes = len(current_df) - len(current_df.drop_duplicates())
-     pa = current_df["purchase_amount"].dropna()
-     remaining_outliers = int((pa > meta["q3"] + 3 * meta["iqr"]).sum())
-     remaining_country = int((~current_df["country"].isin(VALID_COUNTRIES) &
-                              current_df["country"].notna()).sum())
-     remaining_dates = int((~current_df["signup_date"].apply(
-         lambda x: bool(DATE_RE.match(str(x))) if pd.notna(x) else False
-     )).sum())
-     return remaining_nulls + remaining_dupes + remaining_outliers + \
-            remaining_country + remaining_dates
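
The deleted Task 3 grader computes `q3` and `iqr` once in `load()` and reuses them from `meta` in `score()` and `count_errors()`. The sketch below illustrates that frozen-baseline idea in plain Python, assuming a linear-interpolation quantile (the default interpolation of pandas `quantile`). The function names `quantile`, `outlier_baseline` and `count_outliers` and the sample values are hypothetical; only the `Q3 + 3 * IQR` threshold is taken from the diff.

```python
# Hypothetical, dependency-free sketch of the frozen outlier baseline.
# Only the "value > Q3 + 3 * IQR" rule comes from the deleted grader.

def quantile(values, q):
    """Linear-interpolation quantile over a list of numbers."""
    s = sorted(values)
    pos = (len(s) - 1) * q
    lo = int(pos)                    # floor, since pos is non-negative
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def outlier_baseline(values):
    """Compute the thresholds once, from the dirty data."""
    q1, q3 = quantile(values, 0.25), quantile(values, 0.75)
    return {"q1": q1, "q3": q3, "iqr": q3 - q1}


def count_outliers(values, baseline):
    """Count values above the frozen threshold Q3 + 3 * IQR."""
    threshold = baseline["q3"] + 3 * baseline["iqr"]
    return sum(1 for v in values if v > threshold)


# Made-up purchase amounts with two injected extreme values.
dirty = [10, 12, 11, 13, 12, 14, 11, 13, 120, 130]
baseline = outlier_baseline(dirty)
orig_outliers = count_outliers(dirty, baseline)

# Simulate an agent step that removes one of the two extreme rows.
after_one_fix = [v for v in dirty if v != 130]
remaining = count_outliers(after_one_fix, baseline)
outlier_score = 1.0 - remaining / max(orig_outliers, 1)
```

Freezing the thresholds means every step is measured against the same target. If the quartiles were instead recomputed from the current frame, the threshold would move as rows are dropped or filled, and the sub-score would likely no longer track progress in a stable way.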