---
title: Dataops Env
emoji: 🧼
colorFrom: indigo
colorTo: gray
sdk: docker
app_port: 7860
pinned: false
---
# ๐Ÿ‹๏ธ DataOps GYM ### *The Benchmark That Punishes Overconfidence โ€” Not Just Wrong Answers* **A semantic, step-based reinforcement learning environment for evaluating data-cleaning agents on tabular datasets**
[![Python](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python&logoColor=white)](https://python.org) [![FastAPI](https://img.shields.io/badge/FastAPI-REST_API-009688?logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com) [![Pydantic](https://img.shields.io/badge/Pydantic-Schema_Validation-E92063?logo=pydantic&logoColor=white)](https://docs.pydantic.dev) [![Docker](https://img.shields.io/badge/Docker-Containerized-2496ED?logo=docker&logoColor=white)](https://docker.com) [![HuggingFace](https://img.shields.io/badge/🤗_HuggingFace-Spaces_Compatible-FFD21E)](https://huggingface.co/spaces)
> **"Any model can clean data. Only a smart one knows when *not* to."** > > DataOps GYM is an interactive gym environment for training and benchmarking LLM-based data-cleaning agents โ€” > with dense per-step rewards, structured action protocols, and deliberate adversarial traps > designed to expose hallucination, overcorrection, and overconfidence. > **The first benchmark that penalizes an LLM for being too confident about dirty data โ€” not just for being wrong.**
---

## 📌 Table of Contents

- [Why DataOps GYM Exists](#-why-dataops-gym-exists)
- [Core Philosophy](#-core-philosophy)
- [Architecture Overview](#-architecture-overview)
- [Repository Layout](#-repository-layout)
- [The Environment Model](#-the-environment-model)
- [Action Protocol](#-action-protocol)
- [Task Difficulty Tiers](#-task-difficulty-tiers)
- [Scoring & Reward System](#-scoring--reward-system)
- [HTTP API Reference](#-http-api-reference)

---

## 🔍 Why DataOps GYM Exists

Real-world data pipelines fail silently. Automated cleaners and LLM agents frequently:

- **Hallucinate corrections**: inventing plausible-sounding values with no evidentiary basis
- **Over-correct valid data**: mistaking unusual-but-correct formats for errors *(e.g., `q.xu+vip@example.com` is a valid plus-address; don't touch it)*
- **Flatten genuine ambiguity**: making irreversible decisions where `cannot_determine` was the right call
- **Ignore cross-record consistency**: fixing one row while silently creating a new constraint violation in another

**DataOps GYM was built to measure all of these failure modes simultaneously**, forcing agents to balance **precision, restraint, and consistency**, not just produce a tidy-looking output table.
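The over-correction failure mode is easy to reproduce with a toy regex cleaner. The patterns below are hypothetical illustrations (not from this repo): a naive email check whose local part lacks `+` wrongly flags a perfectly valid plus-address as dirty.

```python
import re

# Hypothetical naive pattern: the local part has no "+", so valid
# plus-addresses are wrongly flagged as dirty.
NAIVE_EMAIL = re.compile(r"^[\w.]+@[\w.]+\.\w+$")

# A slightly more permissive check that allows "+" tagging.
LENIENT_EMAIL = re.compile(r"^[\w.+-]+@[\w.-]+\.\w+$")

addr = "q.xu+vip@example.com"
print(bool(NAIVE_EMAIL.match(addr)))    # False: a naive cleaner would "fix" a valid address
print(bool(LENIENT_EMAIL.match(addr)))  # True: leave it untouched
```

An agent trained only to maximize "fixes applied" behaves like the naive pattern; DataOps GYM scores that behavior as a hallucinated fix.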
---

## 🧠 Core Philosophy

| Traditional Benchmark | DataOps GYM |
|---|---|
| Compares final table to ground truth | Evaluates **every step** semantically |
| Rewards correct fixes | Also **penalizes hallucination** and **rewards appropriate abstention** |
| Single-pass evaluation | Multi-turn, stateful episode loop |
| No cross-record awareness | Tracks **consistency across related rows** |
| Ignores agent confidence | **Confidence calibration** affects reward directly |
| `cannot_determine` = failure | `cannot_determine` = **first-class correct action** |

> DataOps GYM is purpose-built around the insight that **knowing when not to act is as important as knowing how to act.**

---

## 🏗 Architecture Overview

```
┌──────────────────────────────────────────────────────────────────────┐
│                             DataOps GYM                              │
│                                                                      │
│   ┌──────────┐        ┌──────────────┐        ┌──────────────┐       │
│   │ task.py  │───────▶│    env.py    │───────▶│  grader.py   │       │
│   │          │        │              │        │              │       │
│   │ Task     │        │ Episode      │        │ Per-step     │       │
│   │ Factory  │        │ Lifecycle    │        │ Reward +     │       │
│   │ 3 tiers  │        │ + State      │        │ Final Score  │       │
│   │ 2 vars   │        │ Tracking     │        │              │       │
│   └──────────┘        └──────────────┘        └──────────────┘       │
│                              │                                       │
│                              ▼                                       │
│                       ┌──────────────┐                               │
│                       │  models.py   │                               │
│                       │  Action /    │                               │
│                       │  Observation │                               │
│                       │  (Pydantic)  │                               │
│                       └──────────────┘                               │
│                              │                                       │
│          ┌───────────────────┘                                       │
│          ▼                                                           │
│   ┌──────────────┐     ┌────────────────────┐                        │
│   │  server/     │◀────│   inference.py     │                        │
│   │  app.py      │     │  Reference Agent   │                        │
│   │  (FastAPI)   │     │  / Evaluator       │                        │
│   └──────────────┘     └────────────────────┘                        │
└──────────────────────────────────────────────────────────────────────┘
```

Every layer is cleanly separated: the environment knows nothing about the HTTP layer, and the grader knows nothing about environment internals. Each component is independently testable and swappable.

---

## 📁 Repository Layout

```
DataOps-GYM/
│
├── env.py                      # Core RL environment: reset / step / observe / metrics
├── task.py                     # Task factories: easy / medium / hard (2 variants each)
├── grader.py                   # Per-step reward math + final task score formula
├── models.py                   # Pydantic schemas: Action, Observation
├── inference.py                # Reference baseline agent + evaluator script
│
├── server/
│   └── app.py                  # FastAPI HTTP server (/reset, /step, /state, /health)
│
├── utils/                      # Shared helper utilities
├── .dataops_policy_cache.json  # Cached policy artifacts
│
├── Dockerfile                  # Container definition (port 7860, HF Spaces-ready)
├── .dockerignore
├── openenv.yaml                # HuggingFace Spaces metadata
├── pyproject.toml              # Project metadata & build configuration
├── requirements.txt            # Python dependencies
└── uv.lock                     # Reproducible lock file for uv package manager
```

---

## ⚙️ The Environment Model

### Episode Lifecycle

Every interaction follows the standard gym pattern:

```python
# 1. Initialize a task episode (easy / medium / hard, seeded for reproducibility)
obs = env.reset(task_name="hard", seed=42)

# 2. Agent acts step-by-step until done
done = False
while not done:
    action = agent.decide(obs)
    obs, reward, done, info = env.step(action)

# 3. Retrieve terminal score in range (0, 1)
final_score = info["final_task_score"]
```

When `task_name` is not fixed, the environment randomly samples a difficulty tier and variant (both seeded), making the benchmark resistant to test-set memorization.

---

### What the Agent Sees: `Observation`

The observation gives the agent everything it needs to reason, without ever revealing the hidden answer key:

| Field | Description |
|---|---|
| `dataset.original` | Immutable snapshot of the table at episode start |
| `dataset.modified` | Current working table reflecting all accepted fixes so far |
| `action_history` | Full sequence of all past actions taken this episode |
| `per_record_scores` | Cumulative score contribution per row ID |
| `current_iteration_score` | Score delta from the most recent step |
| `previous_iteration_score` | Score delta from the prior step (for trend awareness) |
| `steps_remaining` | Hard cap on remaining interactions |

> ⚠️ The agent **never** sees `hidden_issues`. All semantic evaluation is performed internally.

---

### Hidden Issues: What's Lurking in the Data

Each task defines a set of typed hidden issues the agent must discover and resolve:

| Issue Type | Description | Fixable? |
|---|---|---|
| `duplicate` | Two rows represent the same real entity | ❌ Not by `fix_value` alone |
| `missing_value` | A required field is null | ✅ Yes |
| `invalid_format` | Email / phone / date doesn't match expected pattern | ✅ Yes |
| `inconsistent_casing` | Name or city uses wrong casing convention | ✅ Yes |
| `conflict` | Same customer has contradictory field values across rows | ❌ Irreconcilable |
| `constraint_violation` | Two distinct rows violate a uniqueness constraint (e.g., same email) | ❌ Requires judgment |
| `valid_trap` | Row looks suspicious but is actually correct; **do not touch** | N/A |

---

## 🎮 Action Protocol

Agents interact through a strict, typed JSON protocol validated by Pydantic:

```json
{
  "action_type": "fix_value",
  "record_id": "C201",
  "field": "email",
  "value": "evan.cole@example.com",
  "confidence": 0.92
}
```

### Action Types

| Action | When to Use | Reward Signal |
|---|---|---|
| `detect_issue` | Flag a problem without yet resolving it | Low positive; passive identification only |
| `fix_value` | Apply a concrete correction to a specific field | High positive if correct; severe penalty if hallucinated |
| `cannot_determine` | Abstain when a conflict is genuinely irreconcilable | Rewarded when `fixable: false`; penalized otherwise |
| `skip` | Explicitly pass on a record/field | Penalized if a real issue existed there |

### Protocol Validation Rules

- `value` is **required** for `fix_value` and **forbidden** for all other action types
- `record_id` and `field` must be non-empty strings
- `confidence` must be a float in `[0.0, 1.0]`

### Behavioral Discipline

The environment enforces **follow-through discipline** across steps:

- After `detect_issue`, the agent must follow up on that same record/field before moving on, or receive a `passive_penalty`
- Handling a duplicate/conflict pair inconsistently (different strategies for related rows) triggers an `inconsistent_handling` penalty
- Re-flagging an already-detected issue yields a `repeated_detection` penalty

---

## 📊 Task Difficulty Tiers

### 🟢 Easy: `easy_cleaning_task`

**Scenarios:** `easy_customer_master`, `easy_vendor_onboarding`

**Goal:** Foundational hygiene: deduplicate obvious duplicate rows and fill required missing values without deleting rows just because they are incomplete.

**Issues planted:**

- Exact duplicate rows (identical across all fields)
- Missing required values (`city`, `email`)

**Agent strategy:** Detect duplicates → deduplicate → fill missing fields. No traps. No ambiguity.

---

### 🟡 Medium: `medium_normalization_task`

**Scenarios:** `medium_customer_normalization`, `medium_partner_directory`

**Goal:** Normalize: consistent casing, valid email shapes, deduplication where needed.

**Issues planted:**

- Duplicate rows
- Inconsistent casing on `name` and `city` (e.g., `"OMAR HASSAN"` → `"Omar Hassan"`)
- Invalid email tokens (e.g., `[at]` instead of `@`, missing `@` entirely)

**Agent strategy:** Normalize casing to `title_case`, repair malformed emails, deduplicate. Validators check format correctness, not just non-null values.

---

### 🔴 Hard: `hard_conflict_resolution_task`

**Scenarios:** `hard_customer_conflicts`, `hard_account_merges`

**Goal:** Multi-way reasoning under adversarial traps: deduplicate, handle irreconcilable conflicts, enforce unique constraints, fix formats, and **leave valid-looking unusual rows completely untouched**.

**Issues planted:**

- Exact duplicates
- **Irreconcilable conflicts**: same customer ID with contradictory `age` values (e.g., `250` vs `45`). Correct answer: `cannot_determine`
- Invalid email and phone formats
- **Unique constraint violations**: two distinct customers sharing the same email address
- **`valid_trap` rows**: rows that look suspicious but are correct:
  - `q.xu+vip@example.com`: a valid RFC-compliant plus-address
  - `A. J. Brown`: a valid abbreviated name

**Agent strategy:** Nuanced multi-step reasoning, cross-record constraint checking, confident abstention, and deliberate non-intervention on valid traps.

---

## 🏆 Scoring & Reward System

### Per-Step Reward: `grade_step_details`

Each step produces a composite scalar reward (no clamping; scores can go negative):

| Component | Condition | Δ Score |
|---|---|---|
| **Classification** | Correct action type for the situation | `+0.10` (detect) / `+0.20` (fix or cd) |
| **Classification** | Wrong action type | `−0.20` |
| **Issue Detection** | Correctly identified a real issue | `+0.05` (detect) / `+0.15` (fix or cd) |
| **Issue Detection** | Missed a real issue | `−0.15` |
| **Issue Detection** | False positive (no issue there) | `−0.05` |
| **Decision** | Correct fix (passes `validate_fix`) | `+0.25` |
| **Decision** | Correct `cannot_determine` on a non-fixable issue | `+0.25` |
| **Decision** | Hallucinated fix (no matching issue) | `−0.50` |
| **Decision** | Wrong fix (fails validation) | `−0.40` |
| **Decision** | Wrong `cannot_determine` (abstained when fixable) | `−0.20` |
| **Cross-record Consistency** | Consistent handling of a related row pair | `+0.20` |
| **Cross-record Consistency** | Inconsistent handling of a related row pair | `−0.30` |
| **Confidence Calibration** | Confidence > 0.7 and correct | `+0.05` |
| **Confidence Calibration** | Confidence > 0.7 and wrong | `−0.10` |
| **Confident Hallucination** | Confidence > 0.8 and hallucinated fix | `−0.20` (amplifier) |
| **Resolution Reward** | Previously detected issue now resolved | `+0.15` |
| **Passive Penalty** | Unresolved detection + off-topic action | `−0.05` |
| **Overcorrection** | Extra fields modified unintentionally | `−0.05 × N` |
| **Repeated Detection** | Same issue flagged again | `−0.10` |

> The returned step reward is also adjusted by **±0.1** depending on whether the sum of `per_record_scores` improved over the previous iteration.
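The confidence-calibration rows in the table above compose in a specific way: the calibration penalty and the confident-hallucination amplifier stack. The helper below is a toy sketch of just those rows, not the repo's `grade_step_details`.

```python
def calibration_delta(confidence: float, correct: bool,
                      hallucinated: bool = False) -> float:
    """Toy sketch of the confidence-related reward components.

    Reproduces only the calibration bonus/penalty and the
    confident-hallucination amplifier from the table above.
    """
    delta = 0.0
    if confidence > 0.7:
        # Calibration: high confidence is rewarded only when correct.
        delta += 0.05 if correct else -0.10
    if confidence > 0.8 and hallucinated:
        # Amplifier stacks on top of the base -0.10 calibration penalty.
        delta += -0.20
    return delta

# Confidently wrong AND hallucinating: -0.10 + (-0.20)
print(round(calibration_delta(0.92, correct=False, hallucinated=True), 2))  # -> -0.3
```

Note the asymmetry: being right at high confidence earns only `+0.05`, while being confidently wrong costs up to `−0.30`, which is exactly the incentive structure that discourages overconfidence.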
---

### Final Task Score: `grade_task_result`

The terminal score is a weighted composite guaranteed to lie in the open interval **(0, 1)**:

```
Final Score = 0.50 × normalized_record_score
            + 0.20 × (1 − hallucination_rate)
            + 0.15 × uncertainty_accuracy
            + 0.15 × consistency_score
```

**Baseline results:**

| Task | Difficulty | Score |
|---|---|---|
| `easy_vendor_onboarding` | 🟢 Easy | `0.73` |
| `medium_customer_normalization` | 🟡 Medium | `0.40` |
| `hard_customer_conflicts` | 🔴 Hard | `0.39` |

> Evaluated using `inference.py` with `Qwen/Qwen3-VL-30B-A3B-Instruct` via Novita.

### Failure Telemetry

The `task_failure_messages` function surfaces structured, human-readable failure logs from the episode, making it straightforward to diagnose specific agent failure modes during evaluation and iteration.

---

## 🌐 HTTP API Reference

The FastAPI server exposes a clean REST interface for agent integration:

| Endpoint | Method | Body / Params | Description |
|---|---|---|---|
| `/reset` | `POST` | `{ "seed": 42, "task_name": "hard" }` | Start a new episode |
| `/step` | `POST` | JSON matching the `Action` schema | Submit one agent action |
| `/state` | `GET` | — | Full internal state snapshot (debugging) |
| `/health` | `GET` | — | Liveness probe |
| `/docs` | `GET` | — | Interactive Swagger UI |
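The endpoints above can be driven with a few lines of stdlib Python. This is a minimal sketch, assuming the server is running locally on the HF Spaces port; the request bodies follow the `Action` schema and `/reset` parameters shown in this README, and the shape of the JSON responses is not asserted here.

```python
import json
from urllib import request

# Assumption: the server is running locally on port 7860 (the HF Spaces port).
BASE = "http://localhost:7860"

def build_request(path: str, payload: dict) -> request.Request:
    """Build a JSON POST request for one of the endpoints above."""
    return request.Request(
        BASE + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

def post(path: str, payload: dict) -> dict:
    """Send the request and decode the JSON response."""
    with request.urlopen(build_request(path, payload)) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Start an episode, then submit a single illustrative action;
    # a real agent would loop on /step until the episode ends.
    obs = post("/reset", {"seed": 42, "task_name": "easy"})
    result = post("/step", {
        "action_type": "detect_issue",
        "record_id": "C201",
        "field": "email",
        "confidence": 0.8,
    })
```

The network calls sit under `if __name__ == "__main__":` so the helpers can be imported and unit-tested without a live server.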

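For a quick sanity check, the terminal-score weights from the Scoring section reduce to a one-line helper. This reproduces only the published formula; it is not the repo's `grade_task_result`, and the example inputs are made up.

```python
def final_task_score(normalized_record_score: float,
                     hallucination_rate: float,
                     uncertainty_accuracy: float,
                     consistency_score: float) -> float:
    """Weighted composite from the published formula; inputs assumed in [0, 1]."""
    return (0.50 * normalized_record_score
            + 0.20 * (1.0 - hallucination_rate)
            + 0.15 * uncertainty_accuracy
            + 0.15 * consistency_score)

# Hypothetical episode: strong records, slight hallucination, middling calibration.
print(round(final_task_score(0.8, 0.1, 0.6, 0.7), 3))  # -> 0.775
```

The 0.20 weight on `(1 − hallucination_rate)` means a cleaner that hallucinates on every fix forfeits a fifth of the achievable score outright, independent of how the table ends up looking.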
**Built to make data-cleaning agents honest, not just accurate.**
โญ **Star this repo** if DataOps GYM helped your research or evaluation work!