---
title: Dataops Env
emoji: ๐งผ
colorFrom: indigo
colorTo: gray
sdk: docker
app_port: 7860
pinned: false
---
# ๐๏ธ DataOps GYM
### *The Benchmark That Punishes Overconfidence โ Not Just Wrong Answers*
**A semantic, step-based reinforcement learning environment for evaluating data-cleaning agents on tabular datasets**
[](https://python.org)
[](https://fastapi.tiangolo.com)
[](https://docs.pydantic.dev)
[](https://docker.com)
[](https://huggingface.co/spaces)
> **"Any model can clean data. Only a smart one knows when *not* to."**
>
> DataOps GYM is an interactive gym environment for training and benchmarking LLM-based data-cleaning agents โ
> with dense per-step rewards, structured action protocols, and deliberate adversarial traps
> designed to expose hallucination, overcorrection, and overconfidence.
> **The first benchmark that penalizes an LLM for being too confident about dirty data โ not just for being wrong.**
---
## ๐ Table of Contents
- [Why DataOps GYM Exists](#-why-dataops-gym-exists)
- [Core Philosophy](#-core-philosophy)
- [Architecture Overview](#-architecture-overview)
- [Repository Layout](#-repository-layout)
- [The Environment Model](#-the-environment-model)
- [Action Protocol](#-action-protocol)
- [Task Difficulty Tiers](#-task-difficulty-tiers)
- [Scoring & Reward System](#-scoring--reward-system)
- [HTTP API Reference](#-http-api-reference)
---
## ๐ Why DataOps GYM Exists
Real-world data pipelines fail silently. Automated cleaners and LLM agents frequently:
- **Hallucinate corrections** โ inventing plausible-sounding values with no evidentiary basis
- **Over-correct valid data** โ mistaking unusual-but-correct formats as errors *(e.g., `q.xu+vip@example.com` is a valid plus-address โ don't touch it)*
- **Flatten genuine ambiguity** โ making irreversible decisions where `cannot_determine` was the right call
- **Ignore cross-record consistency** โ fixing one row while silently creating a new constraint violation in another
**DataOps GYM was built to measure all of these failure modes simultaneously**, forcing agents to balance **precision, restraint, and consistency** โ not just produce a tidy-looking output table.
---
## ๐ง Core Philosophy
| Traditional Benchmark | DataOps GYM |
|---|---|
| Compares final table to ground truth | Evaluates **every step** semantically |
| Rewards correct fixes | Also **penalizes hallucination** and **rewards appropriate abstention** |
| Single-pass evaluation | Multi-turn, stateful episode loop |
| No cross-record awareness | Tracks **consistency across related rows** |
| Ignores agent confidence | **Confidence calibration** affects reward directly |
| `cannot_determine` = failure | `cannot_determine` = **first-class correct action** |
> DataOps GYM is purpose-built around the insight that **knowing when not to act is as important as knowing how to act.**
---
## ๐ Architecture Overview
```
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ DataOps GYM โ
โ โ
โ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โ
โ โ task.py โโโโโโโถโ env.py โโโโโโโถโ grader.py โ โ
โ โ โ โ โ โ โ โ
โ โ Task โ โ Episode โ โ Per-step โ โ
โ โ Factory โ โ Lifecycle โ โ Reward + โ โ
โ โ 3 tiers โ โ + State โ โ Final Score โ โ
โ โ 2 vars โ โ Tracking โ โ โ โ
โ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โ
โ โ โ
โ โผ โ
โ โโโโโโโโโโโโโโโโ โ
โ โ models.py โ โ
โ โ Action / โ โ
โ โ Observation โ โ
โ โ (Pydantic) โ โ
โ โโโโโโโโโโโโโโโโ โ
โ โ โ
โ โผ โ
โ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โ
โ โ server/ โโโโโโโโ inference.py โ โ
โ โ app.py โ โ Reference Agent โ โ
โ โ (FastAPI) โ โ / Evaluator โ โ
โ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
```
Every layer is cleanly separated โ the environment knows nothing about the HTTP layer; the grader knows nothing about environment internals. Each component is independently testable and swappable.
---
## ๐ Repository Layout
```
DataOps-GYM/
โ
โโโ env.py # Core RL environment: reset / step / observe / metrics
โโโ task.py # Task factories: easy / medium / hard (2 variants each)
โโโ grader.py # Per-step reward math + final task score formula
โโโ models.py # Pydantic schemas: Action, Observation
โโโ inference.py # Reference baseline agent + evaluator script
โ
โโโ server/
โ โโโ app.py # FastAPI HTTP server (/reset, /step, /state, /health)
โ
โโโ utils/ # Shared helper utilities
โโโ .dataops_policy_cache.json # Cached policy artifacts
โ
โโโ Dockerfile # Container definition (port 7860, HF Spaces-ready)
โโโ .dockerignore
โโโ openenv.yaml # HuggingFace Spaces metadata
โโโ pyproject.toml # Project metadata & build configuration
โโโ requirements.txt # Python dependencies
โโโ uv.lock # Reproducible lock file for uv package manager
```
---
## โ๏ธ The Environment Model
### Episode Lifecycle
Every interaction follows the standard gym pattern:
```python
# 1. Initialize a task episode (easy / medium / hard, seeded for reproducibility)
obs = env.reset(task_name="hard", seed=42)
# 2. Agent acts step-by-step until done
while not done:
action = agent.decide(obs)
obs, reward, done, info = env.step(action)
# 3. Retrieve terminal score in range (0, 1)
final_score = info["final_task_score"]
```
When `task_name` is not fixed, the environment randomly samples a difficulty tier and variant (both seeded), making the benchmark resistant to test-set memorization.
---
### What the Agent Sees โ `Observation`
The observation gives the agent everything it needs to reason โ without ever revealing the hidden answer key:
| Field | Description |
|---|---|
| `dataset.original` | Immutable snapshot of the table at episode start |
| `dataset.modified` | Current working table reflecting all accepted fixes so far |
| `action_history` | Full sequence of all past actions taken this episode |
| `per_record_scores` | Cumulative score contribution per row ID |
| `current_iteration_score` | Score delta from the most recent step |
| `previous_iteration_score` | Score delta from the prior step (for trend awareness) |
| `steps_remaining` | Hard cap on remaining interactions |
> โ ๏ธ The agent **never** sees `hidden_issues`. All semantic evaluation is performed internally.
---
### Hidden Issues โ What's Lurking in the Data
Each task defines a set of typed hidden issues the agent must discover and resolve:
| Issue Type | Description | Fixable? |
|---|---|---|
| `duplicate` | Two rows represent the same real entity | โ Not by `fix_value` alone |
| `missing_value` | A required field is null | โ
Yes |
| `invalid_format` | Email / phone / date doesn't match expected pattern | โ
Yes |
| `inconsistent_casing` | Name or city uses wrong casing convention | โ
Yes |
| `conflict` | Same customer has contradictory field values across rows | โ Irreconcilable |
| `constraint_violation` | Two distinct rows violate a uniqueness constraint (e.g., same email) | โ Requires judgment |
| `valid_trap` | Row looks suspicious but is actually correct โ **do not touch** | N/A |
---
## ๐ฎ Action Protocol
Agents interact through a strict, typed JSON protocol validated by Pydantic:
```json
{
"action_type": "fix_value",
"record_id": "C201",
"field": "email",
"value": "evan.cole@example.com",
"confidence": 0.92
}
```
### Action Types
| Action | When to Use | Reward Signal |
|---|---|---|
| `detect_issue` | Flag a problem without yet resolving it | Low positive โ passive identification only |
| `fix_value` | Apply a concrete correction to a specific field | High positive if correct; severe penalty if hallucinated |
| `cannot_determine` | Abstain when conflict is genuinely irreconcilable | Rewarded when `fixable: false`; penalized otherwise |
| `skip` | Explicitly pass on a record/field | Penalized if a real issue existed there |
### Protocol Validation Rules
- `value` is **required** for `fix_value` and **forbidden** for all other action types
- `record_id` and `field` must be non-empty strings
- `confidence` must be a float in `[0.0, 1.0]`
### Behavioral Discipline
The environment enforces **follow-through discipline** across steps:
- After `detect_issue`, the agent must follow up on that same record/field before moving on โ or receive a `passive_penalty`
- Handling a duplicate/conflict pair inconsistently (different strategies for related rows) triggers `inconsistent_handling` penalty
- Re-flagging an already-detected issue yields `repeated_detection` penalty
---
## ๐ Task Difficulty Tiers
### ๐ข Easy โ `easy_cleaning_task`
**Scenarios:** `easy_customer_master`, `easy_vendor_onboarding`
**Goal:** Foundational hygiene โ deduplicate obvious duplicate rows and fill required missing values without deleting rows just because they are incomplete.
**Issues planted:**
- Exact duplicate rows (identical across all fields)
- Missing required values (`city`, `email`)
**Agent strategy:** Detect duplicates โ deduplicate โ fill missing fields. No traps. No ambiguity.
---
### ๐ก Medium โ `medium_normalization_task`
**Scenarios:** `medium_customer_normalization`, `medium_partner_directory`
**Goal:** Normalize โ consistent casing, valid email shapes, deduplication where needed.
**Issues planted:**
- Duplicate rows
- Inconsistent casing on `name` and `city` (e.g., `"OMAR HASSAN"` โ `"Omar Hassan"`)
- Invalid email tokens (e.g., `[at]` instead of `@`, missing `@` entirely)
**Agent strategy:** Normalize casing to `title_case`, repair malformed emails, deduplicate. Validators check format correctness, not just non-null values.
---
### ๐ด Hard โ `hard_conflict_resolution_task`
**Scenarios:** `hard_customer_conflicts`, `hard_account_merges`
**Goal:** Multi-way reasoning under adversarial traps โ deduplicate, handle irreconcilable conflicts, enforce unique constraints, fix formats, and **leave valid-looking unusual rows completely untouched**.
**Issues planted:**
- Exact duplicates
- **Irreconcilable conflicts** โ same customer ID with contradictory `age` values (e.g., `250` vs `45`). Correct answer: `cannot_determine`
- Invalid email and phone formats
- **Unique constraint violations** โ two distinct customers sharing the same email address
- **`valid_trap` rows** โ rows that look suspicious but are correct:
- `q.xu+vip@example.com` โ a valid RFC-compliant plus-address
- `A. J. Brown` โ a valid abbreviated name
**Agent strategy:** Nuanced multi-step reasoning, cross-record constraint checking, confident abstention, and deliberate non-intervention on valid traps.
---
## ๐ Scoring & Reward System
### Per-Step Reward โ `grade_step_details`
Each step produces a composite scalar reward (no clamping โ scores can go negative):
| Component | Condition | ฮ Score |
|---|---|---|
| **Classification** | Correct action type for the situation | `+0.1` (detect) / `+0.2` (fix or cd) |
| **Classification** | Wrong action type | `โ0.20` |
| **Issue Detection** | Correctly identified real issue | `+0.05` (detect) / `+0.15` (fix or cd) |
| **Issue Detection** | Missed a real issue | `โ0.15` |
| **Issue Detection** | False positive (no issue there) | `โ0.05` |
| **Decision** | Correct fix (passes `validate_fix`) | `+0.25` |
| **Decision** | Correct `cannot_determine` on non-fixable issue | `+0.25` |
| **Decision** | Hallucinated fix (no matching issue) | `โ0.50` |
| **Decision** | Wrong fix (fails validation) | `โ0.40` |
| **Decision** | Wrong `cannot_determine` (abstained when fixable) | `โ0.20` |
| **Cross-record Consistency** | Consistent handling of related row pair | `+0.20` |
| **Cross-record Consistency** | Inconsistent handling of related row pair | `โ0.30` |
| **Confidence Calibration** | confidence > 0.7 AND correct | `+0.05` |
| **Confidence Calibration** | confidence > 0.7 AND wrong | `โ0.10` |
| **Confident Hallucination** | confidence > 0.8 AND hallucinated fix | `โ0.20` (amplifier) |
| **Resolution Reward** | Previously detected issue now resolved | `+0.15` |
| **Passive Penalty** | Unresolved detection + off-topic action | `โ0.05` |
| **Overcorrection** | Extra fields modified unintentionally | `โ0.05 ร N` |
| **Repeated Detection** | Same issue flagged again | `โ0.10` |
> The returned step reward also adjusts by **ยฑ0.1** based on whether the sum of `per_record_scores` improved over the previous iteration.
---
### Final Task Score โ `grade_task_result`
Terminal score is a weighted composite guaranteed in the open interval **(0, 1)**:
```
Final Score = 0.50 ร normalized_record_score
+ 0.20 ร (1 โ hallucination_rate)
+ 0.15 ร uncertainty_accuracy
+ 0.15 ร consistency_score
```
| Task | Difficulty | Score |
|---|---|---|
| `easy_vendor_onboarding` | ๐ข Easy | `0.73` |
| `medium_customer_normalization` | ๐ก Medium | `0.40` |
| `hard_customer_conflicts` | ๐ด Hard | `0.39` |
> Evaluated using `inference.py` with `Qwen/Qwen3-VL-30B-A3B-Instruct` via Novita.
### Failure Telemetry
The `task_failure_messages` function surfaces structured, human-readable failure logs from the episode โ making it straightforward to diagnose specific agent failure modes during evaluation and iteration.
---
## ๐ HTTP API Reference
The FastAPI server exposes a clean REST interface for agent integration:
| Endpoint | Method | Body / Params | Description |
|---|---|---|---|
| `/reset` | `POST` | `{ "seed": 42, "task_name": "hard" }` | Start a new episode |
| `/step` | `POST` | JSON matching `Action` schema | Submit one agent action |
| `/state` | `GET` | โ | Full internal state snapshot (debugging) |
| `/health` | `GET` | โ | Liveness probe |
| `/docs` | `GET` | โ | Interactive Swagger UI |
**Built to make data-cleaning agents honest โ not just accurate.**
โญ **Star this repo** if DataOps GYM helped your research or evaluation work!