graheetphartyal committed on
Commit b07d8dc · verified · 1 Parent(s): 4d67e80

Update README.md

Files changed (1): README.md +256 -191

README.md CHANGED
@@ -1,5 +1,4 @@
  ---
-
  title: Dataops Env
  emoji: 🧼
  colorFrom: indigo
@@ -7,306 +6,372 @@ colorTo: gray
  sdk: docker
  app_port: 7860
  pinned: false
-
  ---
 
- # ✨ DataOps Gym
-
- ### ⚡ The First Hallucination-Aware Data Cleaning Environment
-
- > ❌ Most systems ask: *“Did you fix the data?”*
- > ✅ We ask: *“Did you think before fixing?”*
 
- ---
-
- # 🚨 THE PROBLEM
 
- **60–80% of a data scientist’s time is spent cleaning data.**
 
- But current systems:
 
- * blindly fix values
- * hallucinate corrections
- * ignore contradictions
- * break real-world logic
-
- ---
 
- > 💡 **Wrong data is worse than missing data.**
 
- ---
 
- # 🧠 WHAT THIS PROJECT DOES
 
- DataOps Gym is a **step-based OpenEnv environment** where an AI agent:
 
- 1. Detects semantic inconsistencies
- 2. Fixes data **only when confident**
- 3. Outputs **"cannot determine"** when uncertain
- 4. Maintains **cross-record consistency**
- 5. Learns through **reward-based feedback**
 
  ---
 
- Each step teaches the agent:
 
- * when to fix
- * when to abstain ⚠️
- * when to say “I don’t know” 🧠
 
  ---
- # 🧩 ACTION SPACE
 
- All actions must follow strict JSON format:
-
- ```json
- {
-   "action_type": "detect_issue | fix_value | cannot_determine | skip",
-   "record_id": "string",
-   "field": "string",
-   "value": "string",
-   "confidence": 0.0
- }
- ```
-
- ---
 
- ## 🔥 Key Innovation
 
- 👉 `cannot_determine` is a **first-class action**
 
  ---
 
- # 🧠 WHY THIS IS DIFFERENT
 
- | Traditional Systems | DataOps Gym            |
- | ------------------- | ---------------------- |
- | Fix everything      | Fix only when safe     |
- | Always answer       | Can abstain            |
- | Ignore confidence   | Confidence-aware       |
- | Single-row logic    | Cross-record reasoning |
- | Output-based        | Behavior-based         |
 
- ---
-
- # 💰 REWARD SYSTEM
 
  ---
 
- ## Rewards
 
- * correct reasoning
- * safe corrections
- * correct uncertainty
- * consistency across records
 
  ---
 
- ## Penalties
 
- * hallucinated fixes 🚫
- * overconfidence 🚫
- * over-correction 🚫
- * inconsistency 🚫
 
  ---
 
- ### 🔥 Core Principle
 
- > **“Better to not fix than to fix incorrectly.”**
 
- ---
 
- # 📊 FINAL SCORING (0–1)
 
- ```text
- task_score =
-     0.5  * normalized_record_score
-   + 0.2  * (1 - hallucination_rate)
-   + 0.15 * uncertainty_accuracy
-   + 0.15 * consistency_score
  ```
  ---
 
- # 📉 METRICS
 
- | Metric                  | Description            |
- | ----------------------- | ---------------------- |
- | 🧠 Hallucination Rate   | Wrong invented fixes   |
- | ⚖️ Uncertainty Accuracy | Correct abstentions    |
- | 🔗 Consistency Score    | Cross-record reasoning |
 
- ---
 
- # 🧪 TASKS
-
- > ⚡ Each task is carefully designed to evaluate **reasoning, restraint, and reliability** — not just accuracy.
 
  ---
 
- ## 🟢 EASY — *Foundational Data Hygiene*
 
- <p align="left">
-   <b>“Can the agent fix obvious issues without breaking anything?”</b>
- </p>
 
- * Basic inconsistencies
- * Missing values
- * Duplicate records
 
  ---
 
- ## 🟡 MEDIUM — *Contextual Reasoning & Ambiguity*
 
- <p align="left">
-   <b>“Can the agent reason across records and handle uncertainty?”</b>
- </p>
 
- * Cross-table inconsistencies
- * Identity ambiguity
- * Data normalization
 
- ---
 
- ## 🔴 HARD — *Real-World Data Chaos*
 
- <p align="left">
-   <b>“Can the agent survive contradictions, missing context, and unsolvable data?”</b>
- </p>
 
- * Multi-table conflicts
- * Temporal inconsistencies
- * Non-fixable contradictions
 
- ---
 
- > 🔥 **Difficulty is not about complexity — it's about uncertainty.**
 
- | Level     | Focus                             |
- |-----------|-----------------------------------|
- | 🟢 Easy   | Precision on clear signals        |
- | 🟡 Medium | Reasoning under ambiguity         |
- | 🔴 Hard   | Decision-making under uncertainty |
 
  ---
- # 🧪 EXAMPLE FAILURE LOG
-
- ```json
- {
-   "record_id": "T3",
-   "error_type": "hallucination",
-   "details": "assigned value without evidence",
-   "confidence": 0.9
- }
- ```
 
- ---
 
- # 🚀 QUICK START
 
- ---
 
- ## Install
 
- ```bash
- pip install -r requirements.txt
- ```
 
  ---
 
- ## Run Server
 
- ```bash
- python -m server.app
- ```
 
- ---
 
- ## Run Baseline
 
- ```bash
- python inference.py
- ```
 
  ---
 
- ## Example Output
 
- ```text
- easy   → 0.73
- medium → 0.55
- hard   → 0.38
- ```
-
- > ⚠️ Replace with your actual results
 
- ---
 
- # 🌐 API ENDPOINTS
 
- | Endpoint  | Description       |
- | --------- | ----------------- |
- | `/reset`  | Start new episode |
- | `/step`   | Take action       |
- | `/state`  | Get current state |
- | `/health` | Health check      |
 
  ---
 
- # 🐳 DOCKER
 
- ```bash
- docker build -t dataops-gym .
- docker run -p 7860:7860 dataops-gym
- ```
 
- ---
 
- # 🧠 DESIGN PRINCIPLES
 
- 1. Prefer uncertainty over hallucination
- 2. Penalize confident mistakes
- 3. Avoid over-correction
- 4. Enforce cross-record consistency
- 5. Reward safe reasoning
 
  ---
 
- # 🏆 BENCHMARK (EXPECTED)
 
- | Task   | Score       |
- | ------ | ----------- |
- | Easy   | 0.65 – 0.85 |
- | Medium | 0.45 – 0.65 |
- | Hard   | 0.05 – 0.40 |
 
- ---
 
- # 📌 USE CASES
 
- * AI data pipelines
- * automated ETL validation
- * financial data cleaning
- * healthcare record validation
- * LLM safety benchmarking
 
  ---
 
- # 🏁 FINAL TAKEAWAY
 
- > 🧠 **The future of AI is not about answering everything.**
- > ⚡ **It’s about knowing when NOT to answer.**
 
- ---
 
- # 🔥 TAGLINE
 
- > **“We built a system that teaches AI when NOT to change data.”**
 
- ---
 
  ---
  title: Dataops Env
  emoji: 🧼
  colorFrom: indigo
  colorTo: gray
  sdk: docker
  app_port: 7860
  pinned: false
  ---
 
+ <div align="center">
+
+ # 🏋️ DataOps GYM
+
+ ### *The Benchmark That Punishes Overconfidence, Not Just Wrong Answers*
+
+ **A semantic, step-based reinforcement learning environment for evaluating data-cleaning agents on tabular datasets**
+
+ <br/>
+
+ [![Python](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python&logoColor=white)](https://python.org)
+ [![FastAPI](https://img.shields.io/badge/FastAPI-REST_API-009688?logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com)
+ [![Pydantic](https://img.shields.io/badge/Pydantic-Schema_Validation-E92063?logo=pydantic&logoColor=white)](https://docs.pydantic.dev)
+ [![Docker](https://img.shields.io/badge/Docker-Containerized-2496ED?logo=docker&logoColor=white)](https://docker.com)
+ [![HuggingFace](https://img.shields.io/badge/🤗_HuggingFace-Spaces_Compatible-FFD21E)](https://huggingface.co/spaces)
+
+ <br/>
+
+ > **"Any model can clean data. Only a smart one knows when *not* to."**
+ >
+ > DataOps GYM is an interactive gym environment for training and benchmarking LLM-based data-cleaning agents —
+ > with dense per-step rewards, structured action protocols, and deliberate adversarial traps
+ > designed to expose hallucination, overcorrection, and overconfidence.
+ > **The first benchmark that penalizes an LLM for being too confident about dirty data — not just for being wrong.**
+
+ <br/>
+
+ </div>
 
  ---
 
+ ## 📌 Table of Contents
+
+ - [Why DataOps GYM Exists](#-why-dataops-gym-exists)
+ - [Core Philosophy](#-core-philosophy)
+ - [Architecture Overview](#-architecture-overview)
+ - [Repository Layout](#-repository-layout)
+ - [The Environment Model](#-the-environment-model)
+ - [Action Protocol](#-action-protocol)
+ - [Task Difficulty Tiers](#-task-difficulty-tiers)
+ - [Scoring & Reward System](#-scoring--reward-system)
+ - [HTTP API Reference](#-http-api-reference)
 
  ---
 
+ ## 🔍 Why DataOps GYM Exists
+
+ Real-world data pipelines fail silently. Automated cleaners and LLM agents frequently:
+
+ - **Hallucinate corrections** — inventing plausible-sounding values with no evidentiary basis
+ - **Over-correct valid data** — mistaking unusual-but-correct formats for errors *(e.g., `q.xu+vip@example.com` is a valid plus-address — don't touch it)*
+ - **Flatten genuine ambiguity** — making irreversible decisions where `cannot_determine` was the right call
+ - **Ignore cross-record consistency** — fixing one row while silently creating a new constraint violation in another
+
+ **DataOps GYM was built to measure all of these failure modes simultaneously**, forcing agents to balance **precision, restraint, and consistency** — not just produce a tidy-looking output table.
 
  ---
 
+ ## 🧠 Core Philosophy
+
+ | Traditional Benchmark | DataOps GYM |
+ |---|---|
+ | Compares final table to ground truth | Evaluates **every step** semantically |
+ | Rewards correct fixes | Also **penalizes hallucination** and **rewards appropriate abstention** |
+ | Single-pass evaluation | Multi-turn, stateful episode loop |
+ | No cross-record awareness | Tracks **consistency across related rows** |
+ | Ignores agent confidence | **Confidence calibration** affects reward directly |
+ | `cannot_determine` = failure | `cannot_determine` = **first-class correct action** |
+
+ > DataOps GYM is purpose-built around the insight that **knowing when not to act is as important as knowing how to act.**
 
  ---
 
+ ## 🏗 Architecture Overview
+
+ ```
+ ┌──────────────────────────────────────────────────────────────────┐
+ │                           DataOps GYM                            │
+ │                                                                  │
+ │   ┌──────────┐      ┌──────────────┐      ┌──────────────┐       │
+ │   │ task.py  │─────▶│ env.py       │─────▶│ grader.py    │       │
+ │   │          │      │              │      │              │       │
+ │   │ Task     │      │ Episode      │      │ Per-step     │       │
+ │   │ Factory  │      │ Lifecycle    │      │ Reward +     │       │
+ │   │ 3 tiers  │      │ + State      │      │ Final Score  │       │
+ │   │ 2 vars   │      │ Tracking     │      │              │       │
+ │   └──────────┘      └──────────────┘      └──────────────┘       │
+ │                             │                                    │
+ │                             ▼                                    │
+ │                      ┌──────────────┐                            │
+ │                      │ models.py    │                            │
+ │                      │ Action /     │                            │
+ │                      │ Observation  │                            │
+ │                      │ (Pydantic)   │                            │
+ │                      └──────────────┘                            │
+ │                             │                                    │
+ │                             ▼                                    │
+ │   ┌──────────────┐      ┌──────────────────┐                     │
+ │   │ server/      │◀─────│ inference.py     │                     │
+ │   │ app.py       │      │ Reference Agent  │                     │
+ │   │ (FastAPI)    │      │ / Evaluator      │                     │
+ │   └──────────────┘      └──────────────────┘                     │
+ └──────────────────────────────────────────────────────────────────┘
+ ```
+
+ Every layer is cleanly separated — the environment knows nothing about the HTTP layer; the grader knows nothing about environment internals. Each component is independently testable and swappable.
 
  ---
 
+ ## 📁 Repository Layout
+
+ ```
+ DataOps-GYM/
+ ├── env.py                      # Core RL environment: reset / step / observe / metrics
+ ├── task.py                     # Task factories: easy / medium / hard (2 variants each)
+ ├── grader.py                   # Per-step reward math + final task score formula
+ ├── models.py                   # Pydantic schemas: Action, Observation
+ ├── inference.py                # Reference baseline agent + evaluator script
+ ├── server/
+ │   └── app.py                  # FastAPI HTTP server (/reset, /step, /state, /health)
+ ├── utils/                      # Shared helper utilities
+ ├── .dataops_policy_cache.json  # Cached policy artifacts
+ ├── Dockerfile                  # Container definition (port 7860, HF Spaces-ready)
+ ├── .dockerignore
+ ├── openenv.yaml                # HuggingFace Spaces metadata
+ ├── pyproject.toml              # Project metadata & build configuration
+ ├── requirements.txt            # Python dependencies
+ └── uv.lock                     # Reproducible lock file for uv package manager
+ ```
 
  ---
 
+ ## ⚙️ The Environment Model
+
+ ### Episode Lifecycle
+
+ Every interaction follows the standard gym pattern:
+
+ ```python
+ # 1. Initialize a task episode (easy / medium / hard, seeded for reproducibility)
+ obs = env.reset(task_name="hard", seed=42)
+ done = False
+
+ # 2. Agent acts step-by-step until done
+ while not done:
+     action = agent.decide(obs)
+     obs, reward, done, info = env.step(action)
+
+ # 3. Retrieve the terminal score in [0, 1]
+ final_score = info["final_task_score"]
  ```
 
+ When `task_name` is not fixed, the environment randomly samples a difficulty tier and variant (both seeded), making the benchmark resistant to test-set memorization.
+
 
  ---
 
+ ### What the Agent Sees — `Observation`
+
+ The observation gives the agent everything it needs to reason — without ever revealing the hidden answer key:
+
+ | Field | Description |
+ |---|---|
+ | `dataset.original` | Immutable snapshot of the table at episode start |
+ | `dataset.modified` | Current working table reflecting all accepted fixes so far |
+ | `action_history` | Full sequence of all past actions taken this episode |
+ | `per_record_scores` | Cumulative score contribution per row ID |
+ | `current_iteration_score` | Score delta from the most recent step |
+ | `previous_iteration_score` | Score delta from the prior step (for trend awareness) |
+ | `steps_remaining` | Hard cap on remaining interactions |
+
+ > ⚠️ The agent **never** sees `hidden_issues`. All semantic evaluation is performed internally.
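As a rough sketch, the fields above map naturally onto a typed model. The project uses Pydantic for its real `models.py` schemas; the stdlib-dataclass version below is only an assumed shape derived from the table, not the repository's actual definitions:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Dataset:
    original: list[dict[str, Any]]   # immutable snapshot at episode start
    modified: list[dict[str, Any]]   # working table with accepted fixes applied

@dataclass
class Observation:
    dataset: Dataset
    action_history: list[dict[str, Any]] = field(default_factory=list)
    per_record_scores: dict[str, float] = field(default_factory=dict)
    current_iteration_score: float = 0.0
    previous_iteration_score: float = 0.0
    steps_remaining: int = 0

# Example: a fresh episode starts with modified as a copy of original
rows = [{"record_id": "C201", "email": None}]
obs = Observation(
    dataset=Dataset(original=rows, modified=[dict(r) for r in rows]),
    steps_remaining=20,
)
```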
 
 
  ---
 
+ ### Hidden Issues — What's Lurking in the Data
+
+ Each task defines a set of typed hidden issues the agent must discover and resolve:
+
+ | Issue Type | Description | Fixable? |
+ |---|---|---|
+ | `duplicate` | Two rows represent the same real entity | ❌ Not by `fix_value` alone |
+ | `missing_value` | A required field is null | ✅ Yes |
+ | `invalid_format` | Email / phone / date doesn't match the expected pattern | ✅ Yes |
+ | `inconsistent_casing` | Name or city uses the wrong casing convention | ✅ Yes |
+ | `conflict` | Same customer has contradictory field values across rows | ❌ Irreconcilable |
+ | `constraint_violation` | Two distinct rows violate a uniqueness constraint (e.g., same email) | ❌ Requires judgment |
+ | `valid_trap` | Row looks suspicious but is actually correct — **do not touch** | N/A |
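The "Fixable?" column amounts to a small lookup, which is also how an agent might gate its own `fix_value` attempts. A minimal sketch, assuming the table above is the full taxonomy (the real issue definitions live inside the environment):

```python
# Which hidden-issue types a direct fix_value action can resolve, per the table.
# valid_trap rows should never be acted on at all.
FIXABLE_BY_FIX_VALUE = {
    "duplicate": False,
    "missing_value": True,
    "invalid_format": True,
    "inconsistent_casing": True,
    "conflict": False,
    "constraint_violation": False,
    "valid_trap": False,
}

def can_fix_directly(issue_type: str) -> bool:
    """True only for issue types the table marks as fixable by fix_value."""
    return FIXABLE_BY_FIX_VALUE.get(issue_type, False)
```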
 
  ---
 
+ ## 🎮 Action Protocol
+
+ Agents interact through a strict, typed JSON protocol validated by Pydantic:
+
+ ```json
+ {
+   "action_type": "fix_value",
+   "record_id": "C201",
+   "field": "email",
+   "value": "evan.cole@example.com",
+   "confidence": 0.92
+ }
+ ```
+
+ ### Action Types
+
+ | Action | When to Use | Reward Signal |
+ |---|---|---|
+ | `detect_issue` | Flag a problem without yet resolving it | Low positive — passive identification only |
+ | `fix_value` | Apply a concrete correction to a specific field | High positive if correct; severe penalty if hallucinated |
+ | `cannot_determine` | Abstain when a conflict is genuinely irreconcilable | Rewarded when `fixable: false`; penalized otherwise |
+ | `skip` | Explicitly pass on a record/field | Penalized if a real issue existed there |
+
+ ### Protocol Validation Rules
+
+ - `value` is **required** for `fix_value` and **forbidden** for all other action types
+ - `record_id` and `field` must be non-empty strings
+ - `confidence` must be a float in `[0.0, 1.0]`
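The three rules above can be expressed directly as model validation. The project itself validates with Pydantic; this stdlib-dataclass sketch only mirrors the documented rules and is not the repository's `Action` class:

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = {"detect_issue", "fix_value", "cannot_determine", "skip"}

@dataclass
class Action:
    """Sketch of the protocol validation rules listed above."""
    action_type: str
    record_id: str
    field: str
    confidence: float
    value: Optional[str] = None

    def __post_init__(self) -> None:
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if not self.record_id or not self.field:
            raise ValueError("record_id and field must be non-empty strings")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0.0, 1.0]")
        # value is required for fix_value and forbidden everywhere else
        if self.action_type == "fix_value" and self.value is None:
            raise ValueError("value is required for fix_value")
        if self.action_type != "fix_value" and self.value is not None:
            raise ValueError("value is forbidden for non-fix_value actions")
```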
 
+ ### Behavioral Discipline
+
+ The environment enforces **follow-through discipline** across steps:
+
+ - After `detect_issue`, the agent must follow up on that same record/field before moving on — or receive a `passive_penalty`
+ - Handling a duplicate/conflict pair inconsistently (different strategies for related rows) triggers an `inconsistent_handling` penalty
+ - Re-flagging an already-detected issue yields a `repeated_detection` penalty
 
  ---
 
+ ## 📊 Task Difficulty Tiers
+
+ ### 🟢 Easy — `easy_cleaning_task`
+
+ **Scenarios:** `easy_customer_master`, `easy_vendor_onboarding`
+
+ **Goal:** Foundational hygiene — deduplicate obvious duplicate rows and fill required missing values, without deleting rows just because they are incomplete.
+
+ **Issues planted:**
+ - Exact duplicate rows (identical across all fields)
+ - Missing required values (`city`, `email`)
+
+ **Agent strategy:** Detect duplicates → deduplicate → fill missing fields. No traps. No ambiguity.
 
  ---
 
+ ### 🟡 Medium — `medium_normalization_task`
+
+ **Scenarios:** `medium_customer_normalization`, `medium_partner_directory`
+
+ **Goal:** Normalize — consistent casing, valid email shapes, deduplication where needed.
+
+ **Issues planted:**
+ - Duplicate rows
+ - Inconsistent casing on `name` and `city` (e.g., `"OMAR HASSAN"` → `"Omar Hassan"`)
+ - Invalid email tokens (e.g., `[at]` instead of `@`, or a missing `@` entirely)
+
+ **Agent strategy:** Normalize casing to `title_case`, repair malformed emails, deduplicate. Validators check format correctness, not just non-null values.
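The casing and email repairs for this tier can be sketched in a few lines — an illustration of the cleaning logic an agent might apply, not the environment's own validators, and `repair_email` only handles the `[at]` token mentioned above:

```python
def normalize_name(value: str) -> str:
    """'OMAR HASSAN' -> 'Omar Hassan' (simple title-casing, collapsing whitespace)."""
    return " ".join(part.capitalize() for part in value.split())

def repair_email(value: str) -> str:
    """Replace the '[at]' token and strip stray spaces; anything still
    missing '@' is left for the agent to judge (possibly cannot_determine)."""
    return value.replace("[at]", "@").replace(" ", "")
```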
 
 
 
  ---
 
+ ### 🔴 Hard — `hard_conflict_resolution_task`
+
+ **Scenarios:** `hard_customer_conflicts`, `hard_account_merges`
+
+ **Goal:** Multi-way reasoning under adversarial traps — deduplicate, handle irreconcilable conflicts, enforce unique constraints, fix formats, and **leave valid-looking unusual rows completely untouched**.
+
+ **Issues planted:**
+ - Exact duplicates
+ - **Irreconcilable conflicts** — the same customer ID with contradictory `age` values (e.g., `250` vs `45`). Correct answer: `cannot_determine`
+ - Invalid email and phone formats
+ - **Unique constraint violations** — two distinct customers sharing the same email address
+ - **`valid_trap` rows** — rows that look suspicious but are correct:
+   - `q.xu+vip@example.com` — a valid RFC-compliant plus-address
+   - `A. J. Brown` — a valid abbreviated name
+
+ **Agent strategy:** Nuanced multi-step reasoning, cross-record constraint checking, confident abstention, and deliberate non-intervention on valid traps.
 
  ---
 
+ ## 🏆 Scoring & Reward System
+
+ ### Per-Step Reward — `grade_step_details`
+
+ Each step produces a composite scalar reward (no clamping — scores can go negative):
+
+ | Component | Condition | Δ Score |
+ |---|---|---|
+ | **Classification** | Correct action type for the situation | `+0.10` (detect) / `+0.20` (fix or cannot_determine) |
+ | **Classification** | Wrong action type | `−0.20` |
+ | **Issue Detection** | Correctly identified a real issue | `+0.05` (detect) / `+0.15` (fix or cannot_determine) |
+ | **Issue Detection** | Missed a real issue | `−0.15` |
+ | **Issue Detection** | False positive (no issue there) | `−0.05` |
+ | **Decision** | Correct fix (passes `validate_fix`) | `+0.25` |
+ | **Decision** | Correct `cannot_determine` on a non-fixable issue | `+0.25` |
+ | **Decision** | Hallucinated fix (no matching issue) | `−0.50` |
+ | **Decision** | Wrong fix (fails validation) | `−0.40` |
+ | **Decision** | Wrong `cannot_determine` (abstained when fixable) | `−0.20` |
+ | **Cross-record Consistency** | Consistent handling of a related row pair | `+0.20` |
+ | **Cross-record Consistency** | Inconsistent handling of a related row pair | `−0.30` |
+ | **Confidence Calibration** | Confidence > 0.7 and correct | `+0.05` |
+ | **Confidence Calibration** | Confidence > 0.7 and wrong | `−0.10` |
+ | **Confident Hallucination** | Confidence > 0.8 and hallucinated fix | `−0.20` (amplifier) |
+ | **Resolution Reward** | Previously detected issue now resolved | `+0.15` |
+ | **Passive Penalty** | Unresolved detection + off-topic action | `−0.05` |
+ | **Overcorrection** | Extra fields modified unintentionally | `−0.05 × N` |
+ | **Repeated Detection** | Same issue flagged again | `−0.10` |
+
+ > The returned step reward is also adjusted by **±0.1** based on whether the sum of `per_record_scores` improved over the previous iteration.
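For instance, the confidence-related rows of the table read as a tiny function. The deltas are copied from the table above; the real `grade_step_details` combines many more components, so this is only a sketch of one slice of it:

```python
def calibration_delta(confidence: float, correct: bool, hallucinated: bool = False) -> float:
    """Confidence-calibration + confident-hallucination contribution, per the table."""
    delta = 0.0
    if confidence > 0.7:
        delta += 0.05 if correct else -0.10   # calibration bonus / penalty
    if confidence > 0.8 and hallucinated:
        delta += -0.20                        # confident-hallucination amplifier
    return round(delta, 2)
```

Note how a hallucinated fix at confidence 0.9 stacks both penalties: being wrong costs −0.10, and being *confidently* wrong costs a further −0.20.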
 
 
 
 
 
  ---
 
+ ### Final Task Score — `grade_task_result`
+
+ The terminal score is a weighted composite in the range **[0, 1]**:
+
+ ```
+ Final Score = 0.50 × normalized_record_score
+             + 0.20 × (1 − hallucination_rate)
+             + 0.15 × uncertainty_accuracy
+             + 0.15 × consistency_score
+ ```
+
+ Baseline results:
+
+ | Task | Difficulty | Score |
+ |---|---|---|
+ | `easy_vendor_onboarding` | 🟢 Easy | `0.73` |
+ | `medium_customer_normalization` | 🟡 Medium | `0.40` |
+ | `hard_customer_conflicts` | 🔴 Hard | `0.39` |
+
+ > Evaluated using `inference.py` with `Qwen/Qwen3-VL-30B-A3B-Instruct` via Novita.
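The formula is straightforward to reproduce. A minimal sketch, assuming each input metric is already normalized to [0, 1] (the actual `grade_task_result` may apply additional clamping):

```python
def final_task_score(normalized_record_score: float,
                     hallucination_rate: float,
                     uncertainty_accuracy: float,
                     consistency_score: float) -> float:
    """Weighted composite from the formula above; inputs each in [0, 1]."""
    return (0.50 * normalized_record_score
            + 0.20 * (1.0 - hallucination_rate)
            + 0.15 * uncertainty_accuracy
            + 0.15 * consistency_score)
```

One consequence of the weighting: an agent that fixes nothing but never hallucinates still earns the full 0.20 hallucination term, which is exactly the "better to abstain than to invent" incentive.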
+ ### Failure Telemetry
+
+ The `task_failure_messages` function surfaces structured, human-readable failure logs from the episode — making it straightforward to diagnose specific agent failure modes during evaluation and iteration.
 
  ---
 
+ ## 🌐 HTTP API Reference
+
+ The FastAPI server exposes a clean REST interface for agent integration:
+
+ | Endpoint | Method | Body / Params | Description |
+ |---|---|---|---|
+ | `/reset` | `POST` | `{ "seed": 42, "task_name": "hard" }` | Start a new episode |
+ | `/step` | `POST` | JSON matching the `Action` schema | Submit one agent action |
+ | `/state` | `GET` | — | Full internal state snapshot (debugging) |
+ | `/health` | `GET` | — | Liveness probe |
+ | `/docs` | `GET` | — | Interactive Swagger UI |
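A minimal client against these endpoints might look like the following sketch. The request shapes come from the table above; the response fields returned by the server are not documented here, so nothing below assumes them beyond "JSON comes back":

```python
import json
from urllib import request

BASE_URL = "http://localhost:7860"  # default app_port from the Space config

def make_reset_payload(seed: int = 42, task_name: str = "hard") -> dict:
    """Body for POST /reset, as shown in the endpoint table."""
    return {"seed": seed, "task_name": task_name}

def post(path: str, payload: dict) -> dict:
    """POST a JSON body to the server and decode the JSON response."""
    req = request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Requires a running server (e.g., the Docker container on port 7860).
    obs = post("/reset", make_reset_payload())
    step = post("/step", {
        "action_type": "detect_issue",
        "record_id": "C201",
        "field": "email",
        "confidence": 0.6,
    })
    print(step)
```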
 
+ <div align="center">
+
+ <br/>
+
+ **Built to make data-cleaning agents honest — not just accurate.**
+
+ <br/>
+
+ ⭐ **Star this repo** if DataOps GYM helped your research or evaluation work!
+
+ <br/>
+
+ </div>