graheetphartyal committed on
Commit b07d8dc · verified · 1 Parent(s): 4d67e80

Update README.md

Files changed (1): README.md +256 -191

README.md CHANGED
@@ -1,5 +1,4 @@
  ---
-
  title: Dataops Env
  emoji: 🧼
  colorFrom: indigo
@@ -7,306 +6,372 @@ colorTo: gray
  sdk: docker
  app_port: 7860
  pinned: false
-
  ---
 
- # ✨ DataOps Gym
-
- ### ⚡ The First Hallucination-Aware Data Cleaning Environment
-
- > ❌ Most systems ask: *“Did you fix the data?”*
- > ✅ We ask: *“Did you think before fixing?”*
 
- ---
-
- # 🚨 THE PROBLEM
 
- **60–80% of a data scientist’s time is spent cleaning data.**
 
- But current systems:
 
- * blindly fix values
- * hallucinate corrections
- * ignore contradictions
- * break real-world logic
-
- ---
 
- > 💡 **Wrong data is worse than missing data.**
 
- ---
 
- # 🧠 WHAT THIS PROJECT DOES
 
- DataOps Gym is a **step-based OpenEnv environment** where an AI agent:
 
- 1. Detects semantic inconsistencies
- 2. Fixes data **only when confident**
- 3. Outputs **"cannot determine"** when uncertain
- 4. Maintains **cross-record consistency**
- 5. Learns through **reward-based feedback**
 
  ---
 
- Each step teaches the agent:
 
- * when to fix
- * when to abstain ⚠️
- * when to say “I don’t know” 🧠
 
  ---
- # 🧩 ACTION SPACE
 
- All actions must follow strict JSON format:
-
- ```json
- {
-   "action_type": "detect_issue | fix_value | cannot_determine | skip",
-   "record_id": "string",
-   "field": "string",
-   "value": "string",
-   "confidence": 0.0
- }
- ```
-
- ---
 
- ## 🔥 Key Innovation
 
- 👉 `cannot_determine` is a **first-class action**
 
  ---
 
- # 🧠 WHY THIS IS DIFFERENT
 
- | Traditional Systems | DataOps Gym            |
- | ------------------- | ---------------------- |
- | Fix everything      | Fix only when safe     |
- | Always answer       | Can abstain            |
- | Ignore confidence   | Confidence-aware       |
- | Single-row logic    | Cross-record reasoning |
- | Output-based        | Behavior-based         |
 
- ---
-
- # 💰 REWARD SYSTEM
 
  ---
 
- ## Rewards
 
- * correct reasoning
- * safe corrections
- * correct uncertainty
- * consistency across records
 
  ---
 
- ## Penalties
 
- * hallucinated fixes 🚫
- * overconfidence 🚫
- * over-correction 🚫
- * inconsistency 🚫
 
  ---
 
- ### 🔥 Core Principle
 
- > **“Better to not fix than to fix incorrectly.”**
 
- ---
 
- # 📊 FINAL SCORING (0–1)
 
- ```text
- task_score =
-     0.5  * normalized_record_score
-   + 0.2  * (1 - hallucination_rate)
-   + 0.15 * uncertainty_accuracy
-   + 0.15 * consistency_score
  ```
  ---
 
- # 📉 METRICS
 
- | Metric                  | Description            |
- | ----------------------- | ---------------------- |
- | 🧠 Hallucination Rate   | Wrong invented fixes   |
- | ⚖️ Uncertainty Accuracy | Correct abstentions    |
- | 🔗 Consistency Score    | Cross-record reasoning |
 
- ---
 
- # 🧪 TASKS
-
- > ⚡ Each task is carefully designed to evaluate **reasoning, restraint, and reliability** — not just accuracy.
 
  ---
 
- ## 🟢 EASY — *Foundational Data Hygiene*
 
- <p align="left">
-   <b>“Can the agent fix obvious issues without breaking anything?”</b>
- </p>
 
- * Basic inconsistencies
- * Missing values
- * Duplicate records
 
  ---
 
- ## 🟡 MEDIUM — *Contextual Reasoning & Ambiguity*
 
- <p align="left">
-   <b>“Can the agent reason across records and handle uncertainty?”</b>
- </p>
 
- * Cross-table inconsistencies
- * Identity ambiguity
- * Data normalization
 
- ---
 
- ## 🔴 HARD — *Real-World Data Chaos*
 
- <p align="left">
-   <b>“Can the agent survive contradictions, missing context, and unsolvable data?”</b>
- </p>
 
- * Multi-table conflicts
- * Temporal inconsistencies
- * Non-fixable contradictions
 
- ---
 
- > 🔥 **Difficulty is not about complexity — it's about uncertainty.**
 
- | Level     | Focus                             |
- |-----------|-----------------------------------|
- | 🟢 Easy   | Precision on clear signals        |
- | 🟡 Medium | Reasoning under ambiguity         |
- | 🔴 Hard   | Decision-making under uncertainty |
 
  ---
- # 🧪 EXAMPLE FAILURE LOG
-
- ```json
- {
-   "record_id": "T3",
-   "error_type": "hallucination",
-   "details": "assigned value without evidence",
-   "confidence": 0.9
- }
- ```
 
- ---
 
- # 🚀 QUICK START
 
- ---
 
- ## Install
 
- ```bash
- pip install -r requirements.txt
- ```
 
  ---
 
- ## Run Server
 
- ```bash
- python -m server.app
- ```
 
- ---
 
- ## Run Baseline
 
- ```bash
- python inference.py
- ```
 
  ---
 
- ## Example Output
 
- ```text
- easy   → 0.73
- medium → 0.55
- hard   → 0.38
- ```
-
- > ⚠️ Replace with your actual results
 
- ---
 
- # 🌐 API ENDPOINTS
 
- | Endpoint  | Description       |
- | --------- | ----------------- |
- | `/reset`  | Start new episode |
- | `/step`   | Take action       |
- | `/state`  | Get current state |
- | `/health` | Health check      |
 
  ---
 
- # 🐳 DOCKER
 
- ```bash
- docker build -t dataops-gym .
- docker run -p 7860:7860 dataops-gym
- ```
 
- ---
 
- # 🧠 DESIGN PRINCIPLES
 
- 1. Prefer uncertainty over hallucination
- 2. Penalize confident mistakes
- 3. Avoid over-correction
- 4. Enforce cross-record consistency
- 5. Reward safe reasoning
 
  ---
 
- # 🏆 BENCHMARK (EXPECTED)
 
- | Task   | Score       |
- | ------ | ----------- |
- | Easy   | 0.65 – 0.85 |
- | Medium | 0.45 – 0.65 |
- | Hard   | 0.05 – 0.40 |
 
- ---
 
- # 📌 USE CASES
 
- * AI data pipelines
- * automated ETL validation
- * financial data cleaning
- * healthcare record validation
- * LLM safety benchmarking
 
  ---
 
- # 🏁 FINAL TAKEAWAY
 
- > 🧠 **The future of AI is not about answering everything.**
- > ⚡ **It’s about knowing when NOT to answer.**
 
- ---
 
- # 🔥 TAGLINE
 
- > **“We built a system that teaches AI when NOT to change data.”**
 
- ---
 
  ---
  title: Dataops Env
  emoji: 🧼
  colorFrom: indigo
  colorTo: gray
  sdk: docker
  app_port: 7860
  pinned: false
  ---
 
+ <div align="center">
+
+ # 🏋️ DataOps GYM
+
+ ### *The Benchmark That Punishes Overconfidence, Not Just Wrong Answers*
+
+ **A semantic, step-based reinforcement learning environment for evaluating data-cleaning agents on tabular datasets**
+
+ <br/>
+
+ [![Python](https://img.shields.io/badge/Python-3.10%2B-blue?logo=python&logoColor=white)](https://python.org)
+ [![FastAPI](https://img.shields.io/badge/FastAPI-REST_API-009688?logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com)
+ [![Pydantic](https://img.shields.io/badge/Pydantic-Schema_Validation-E92063?logo=pydantic&logoColor=white)](https://docs.pydantic.dev)
+ [![Docker](https://img.shields.io/badge/Docker-Containerized-2496ED?logo=docker&logoColor=white)](https://docker.com)
+ [![HuggingFace](https://img.shields.io/badge/🤗_HuggingFace-Spaces_Compatible-FFD21E)](https://huggingface.co/spaces)
+
+ <br/>
+
+ > **"Any model can clean data. Only a smart one knows when *not* to."**
+ >
+ > DataOps GYM is an interactive gym environment for training and benchmarking LLM-based data-cleaning agents —
+ > with dense per-step rewards, structured action protocols, and deliberate adversarial traps
+ > designed to expose hallucination, overcorrection, and overconfidence.
+ > **The first benchmark that penalizes an LLM for being too confident about dirty data — not just for being wrong.**
+
+ <br/>
+
+ </div>
 
  ---
 
+ ## 📌 Table of Contents
+
+ - [Why DataOps GYM Exists](#-why-dataops-gym-exists)
+ - [Core Philosophy](#-core-philosophy)
+ - [Architecture Overview](#-architecture-overview)
+ - [Repository Layout](#-repository-layout)
+ - [The Environment Model](#-the-environment-model)
+ - [Action Protocol](#-action-protocol)
+ - [Task Difficulty Tiers](#-task-difficulty-tiers)
+ - [Scoring & Reward System](#-scoring--reward-system)
+ - [HTTP API Reference](#-http-api-reference)
 
  ---
 
+ ## 🔍 Why DataOps GYM Exists
+
+ Real-world data pipelines fail silently. Automated cleaners and LLM agents frequently:
+
+ - **Hallucinate corrections** — inventing plausible-sounding values with no evidentiary basis
+ - **Over-correct valid data** — mistaking unusual-but-correct formats for errors *(e.g., `q.xu+vip@example.com` is a valid plus-address — don't touch it)*
+ - **Flatten genuine ambiguity** — making irreversible decisions where `cannot_determine` was the right call
+ - **Ignore cross-record consistency** — fixing one row while silently creating a new constraint violation in another
+
+ **DataOps GYM was built to measure all of these failure modes simultaneously**, forcing agents to balance **precision, restraint, and consistency** — not just produce a tidy-looking output table.
 
  ---
 
+ ## 🧠 Core Philosophy
+
+ | Traditional Benchmark | DataOps GYM |
+ |---|---|
+ | Compares final table to ground truth | Evaluates **every step** semantically |
+ | Rewards correct fixes | Also **penalizes hallucination** and **rewards appropriate abstention** |
+ | Single-pass evaluation | Multi-turn, stateful episode loop |
+ | No cross-record awareness | Tracks **consistency across related rows** |
+ | Ignores agent confidence | **Confidence calibration** affects reward directly |
+ | `cannot_determine` = failure | `cannot_determine` = **first-class correct action** |
+
+ > DataOps GYM is purpose-built around the insight that **knowing when not to act is as important as knowing how to act.**
 
  ---
 
+ ## 🏗 Architecture Overview
+
+ ```
+ ┌──────────────────────────────────────────────────────────────────┐
+ │                           DataOps GYM                            │
+ │                                                                  │
+ │   ┌──────────┐      ┌──────────────┐      ┌──────────────┐       │
+ │   │ task.py  │─────▶│ env.py       │─────▶│ grader.py    │       │
+ │   │          │      │              │      │              │       │
+ │   │ Task     │      │ Episode      │      │ Per-step     │       │
+ │   │ Factory  │      │ Lifecycle    │      │ Reward +     │       │
+ │   │ 3 tiers  │      │ + State      │      │ Final Score  │       │
+ │   │ 2 vars   │      │ Tracking     │      │              │       │
+ │   └──────────┘      └──────────────┘      └──────────────┘       │
+ │                             │                                    │
+ │                             ▼                                    │
+ │                      ┌──────────────┐                            │
+ │                      │ models.py    │                            │
+ │                      │ Action /     │                            │
+ │                      │ Observation  │                            │
+ │                      │ (Pydantic)   │                            │
+ │                      └──────────────┘                            │
+ │                             │                                    │
+ │                             ▼                                    │
+ │   ┌──────────────┐      ┌──────────────────┐                     │
+ │   │ server/      │◀─────│ inference.py     │                     │
+ │   │ app.py       │      │ Reference Agent  │                     │
+ │   │ (FastAPI)    │      │ / Evaluator      │                     │
+ │   └──────────────┘      └──────────────────┘                     │
+ └──────────────────────────────────────────────────────────────────┘
+ ```
+
+ Every layer is cleanly separated — the environment knows nothing about the HTTP layer; the grader knows nothing about environment internals. Each component is independently testable and swappable.
 
  ---
 
+ ## 📁 Repository Layout
+
+ ```
+ DataOps-GYM/
+ ├── env.py                      # Core RL environment: reset / step / observe / metrics
+ ├── task.py                     # Task factories: easy / medium / hard (2 variants each)
+ ├── grader.py                   # Per-step reward math + final task score formula
+ ├── models.py                   # Pydantic schemas: Action, Observation
+ ├── inference.py                # Reference baseline agent + evaluator script
+ ├── server/
+ │   └── app.py                  # FastAPI HTTP server (/reset, /step, /state, /health)
+ ├── utils/                      # Shared helper utilities
+ ├── .dataops_policy_cache.json  # Cached policy artifacts
+ ├── Dockerfile                  # Container definition (port 7860, HF Spaces-ready)
+ ├── .dockerignore
+ ├── openenv.yaml                # HuggingFace Spaces metadata
+ ├── pyproject.toml              # Project metadata & build configuration
+ ├── requirements.txt            # Python dependencies
+ └── uv.lock                     # Reproducible lock file for uv package manager
+ ```
 
  ---
 
+ ## ⚙️ The Environment Model
+
+ ### Episode Lifecycle
+
+ Every interaction follows the standard gym pattern:
+
+ ```python
+ # 1. Initialize a task episode (easy / medium / hard, seeded for reproducibility)
+ obs = env.reset(task_name="hard", seed=42)
+ done = False
+
+ # 2. Agent acts step-by-step until done
+ while not done:
+     action = agent.decide(obs)
+     obs, reward, done, info = env.step(action)
+
+ # 3. Retrieve the terminal score in [0, 1]
+ final_score = info["final_task_score"]
  ```
 
+ When `task_name` is not fixed, the environment randomly samples a difficulty tier and variant (both seeded), making the benchmark resistant to test-set memorization.
+
 
  ---
 
+ ### What the Agent Sees — `Observation`
+
+ The observation gives the agent everything it needs to reason — without ever revealing the hidden answer key:
+
+ | Field | Description |
+ |---|---|
+ | `dataset.original` | Immutable snapshot of the table at episode start |
+ | `dataset.modified` | Current working table reflecting all accepted fixes so far |
+ | `action_history` | Full sequence of all past actions taken this episode |
+ | `per_record_scores` | Cumulative score contribution per row ID |
+ | `current_iteration_score` | Score delta from the most recent step |
+ | `previous_iteration_score` | Score delta from the prior step (for trend awareness) |
+ | `steps_remaining` | Hard cap on remaining interactions |
+
+ > ⚠️ The agent **never** sees `hidden_issues`. All semantic evaluation is performed internally.
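As a rough sketch, the fields above map naturally onto a typed model. The project uses Pydantic for its real `models.py` schemas; the stdlib-dataclass version below is only an assumed shape derived from the table, not the repository's actual definitions:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Dataset:
    original: list[dict[str, Any]]   # immutable snapshot at episode start
    modified: list[dict[str, Any]]   # working table with accepted fixes applied

@dataclass
class Observation:
    dataset: Dataset
    action_history: list[dict[str, Any]] = field(default_factory=list)
    per_record_scores: dict[str, float] = field(default_factory=dict)
    current_iteration_score: float = 0.0
    previous_iteration_score: float = 0.0
    steps_remaining: int = 0

# Example: a fresh episode starts with modified as a copy of original
rows = [{"record_id": "C201", "email": None}]
obs = Observation(
    dataset=Dataset(original=rows, modified=[dict(r) for r in rows]),
    steps_remaining=20,
)
```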
 
 
  ---
 
+ ### Hidden Issues — What's Lurking in the Data
+
+ Each task defines a set of typed hidden issues the agent must discover and resolve:
+
+ | Issue Type | Description | Fixable? |
+ |---|---|---|
+ | `duplicate` | Two rows represent the same real entity | ❌ Not by `fix_value` alone |
+ | `missing_value` | A required field is null | ✅ Yes |
+ | `invalid_format` | Email / phone / date doesn't match the expected pattern | ✅ Yes |
+ | `inconsistent_casing` | Name or city uses the wrong casing convention | ✅ Yes |
+ | `conflict` | Same customer has contradictory field values across rows | ❌ Irreconcilable |
+ | `constraint_violation` | Two distinct rows violate a uniqueness constraint (e.g., same email) | ❌ Requires judgment |
+ | `valid_trap` | Row looks suspicious but is actually correct — **do not touch** | N/A |
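The "Fixable?" column amounts to a small lookup, which is also how an agent might gate its own `fix_value` attempts. A minimal sketch, assuming the table above is the full taxonomy (the real issue definitions live inside the environment):

```python
# Which hidden-issue types a direct fix_value action can resolve, per the table.
# valid_trap rows should never be acted on at all.
FIXABLE_BY_FIX_VALUE = {
    "duplicate": False,
    "missing_value": True,
    "invalid_format": True,
    "inconsistent_casing": True,
    "conflict": False,
    "constraint_violation": False,
    "valid_trap": False,
}

def can_fix_directly(issue_type: str) -> bool:
    """True only for issue types the table marks as fixable by fix_value."""
    return FIXABLE_BY_FIX_VALUE.get(issue_type, False)
```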
 
  ---
 
+ ## 🎮 Action Protocol
+
+ Agents interact through a strict, typed JSON protocol validated by Pydantic:
+
+ ```json
+ {
+   "action_type": "fix_value",
+   "record_id": "C201",
+   "field": "email",
+   "value": "evan.cole@example.com",
+   "confidence": 0.92
+ }
+ ```
+
+ ### Action Types
+
+ | Action | When to Use | Reward Signal |
+ |---|---|---|
+ | `detect_issue` | Flag a problem without yet resolving it | Low positive — passive identification only |
+ | `fix_value` | Apply a concrete correction to a specific field | High positive if correct; severe penalty if hallucinated |
+ | `cannot_determine` | Abstain when a conflict is genuinely irreconcilable | Rewarded when `fixable: false`; penalized otherwise |
+ | `skip` | Explicitly pass on a record/field | Penalized if a real issue existed there |
+
+ ### Protocol Validation Rules
+
+ - `value` is **required** for `fix_value` and **forbidden** for all other action types
+ - `record_id` and `field` must be non-empty strings
+ - `confidence` must be a float in `[0.0, 1.0]`
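The three rules above can be expressed directly as model validation. The project itself validates with Pydantic; this stdlib-dataclass sketch only mirrors the documented rules and is not the repository's `Action` class:

```python
from dataclasses import dataclass
from typing import Optional

ACTION_TYPES = {"detect_issue", "fix_value", "cannot_determine", "skip"}

@dataclass
class Action:
    """Sketch of the protocol validation rules listed above."""
    action_type: str
    record_id: str
    field: str
    confidence: float
    value: Optional[str] = None

    def __post_init__(self) -> None:
        if self.action_type not in ACTION_TYPES:
            raise ValueError(f"unknown action_type: {self.action_type!r}")
        if not self.record_id or not self.field:
            raise ValueError("record_id and field must be non-empty strings")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be in [0.0, 1.0]")
        # value is required for fix_value and forbidden everywhere else
        if self.action_type == "fix_value" and self.value is None:
            raise ValueError("value is required for fix_value")
        if self.action_type != "fix_value" and self.value is not None:
            raise ValueError("value is forbidden for non-fix_value actions")
```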
 
+ ### Behavioral Discipline
+
+ The environment enforces **follow-through discipline** across steps:
+
+ - After `detect_issue`, the agent must follow up on that same record/field before moving on — or receive a `passive_penalty`
+ - Handling a duplicate/conflict pair inconsistently (different strategies for related rows) triggers an `inconsistent_handling` penalty
+ - Re-flagging an already-detected issue yields a `repeated_detection` penalty
 
  ---
 
+ ## 📊 Task Difficulty Tiers
+
+ ### 🟢 Easy — `easy_cleaning_task`
+
+ **Scenarios:** `easy_customer_master`, `easy_vendor_onboarding`
+
+ **Goal:** Foundational hygiene — deduplicate obvious duplicate rows and fill required missing values, without deleting rows just because they are incomplete.
+
+ **Issues planted:**
+ - Exact duplicate rows (identical across all fields)
+ - Missing required values (`city`, `email`)
+
+ **Agent strategy:** Detect duplicates → deduplicate → fill missing fields. No traps. No ambiguity.
 
  ---
 
+ ### 🟡 Medium — `medium_normalization_task`
+
+ **Scenarios:** `medium_customer_normalization`, `medium_partner_directory`
+
+ **Goal:** Normalize — consistent casing, valid email shapes, deduplication where needed.
+
+ **Issues planted:**
+ - Duplicate rows
+ - Inconsistent casing on `name` and `city` (e.g., `"OMAR HASSAN"` → `"Omar Hassan"`)
+ - Invalid email tokens (e.g., `[at]` instead of `@`, or a missing `@` entirely)
+
+ **Agent strategy:** Normalize casing to `title_case`, repair malformed emails, deduplicate. Validators check format correctness, not just non-null values.
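The casing and email repairs for this tier can be sketched in a few lines — an illustration of the cleaning logic an agent might apply, not the environment's own validators, and `repair_email` only handles the `[at]` token mentioned above:

```python
def normalize_name(value: str) -> str:
    """'OMAR HASSAN' -> 'Omar Hassan' (simple title-casing, collapsing whitespace)."""
    return " ".join(part.capitalize() for part in value.split())

def repair_email(value: str) -> str:
    """Replace the '[at]' token and strip stray spaces; anything still
    missing '@' is left for the agent to judge (possibly cannot_determine)."""
    return value.replace("[at]", "@").replace(" ", "")
```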
 
 
 
  ---
 
+ ### 🔴 Hard — `hard_conflict_resolution_task`
+
+ **Scenarios:** `hard_customer_conflicts`, `hard_account_merges`
+
+ **Goal:** Multi-way reasoning under adversarial traps — deduplicate, handle irreconcilable conflicts, enforce unique constraints, fix formats, and **leave valid-looking unusual rows completely untouched**.
+
+ **Issues planted:**
+ - Exact duplicates
+ - **Irreconcilable conflicts** — the same customer ID with contradictory `age` values (e.g., `250` vs `45`). Correct answer: `cannot_determine`
+ - Invalid email and phone formats
+ - **Unique constraint violations** — two distinct customers sharing the same email address
+ - **`valid_trap` rows** — rows that look suspicious but are correct:
+   - `q.xu+vip@example.com` — a valid RFC-compliant plus-address
+   - `A. J. Brown` — a valid abbreviated name
+
+ **Agent strategy:** Nuanced multi-step reasoning, cross-record constraint checking, confident abstention, and deliberate non-intervention on valid traps.
 
  ---
 
+ ## 🏆 Scoring & Reward System
+
+ ### Per-Step Reward — `grade_step_details`
+
+ Each step produces a composite scalar reward (no clamping — scores can go negative):
+
+ | Component | Condition | Δ Score |
+ |---|---|---|
+ | **Classification** | Correct action type for the situation | `+0.10` (detect) / `+0.20` (fix or cannot_determine) |
+ | **Classification** | Wrong action type | `−0.20` |
+ | **Issue Detection** | Correctly identified a real issue | `+0.05` (detect) / `+0.15` (fix or cannot_determine) |
+ | **Issue Detection** | Missed a real issue | `−0.15` |
+ | **Issue Detection** | False positive (no issue there) | `−0.05` |
+ | **Decision** | Correct fix (passes `validate_fix`) | `+0.25` |
+ | **Decision** | Correct `cannot_determine` on a non-fixable issue | `+0.25` |
+ | **Decision** | Hallucinated fix (no matching issue) | `−0.50` |
+ | **Decision** | Wrong fix (fails validation) | `−0.40` |
+ | **Decision** | Wrong `cannot_determine` (abstained when fixable) | `−0.20` |
+ | **Cross-record Consistency** | Consistent handling of a related row pair | `+0.20` |
+ | **Cross-record Consistency** | Inconsistent handling of a related row pair | `−0.30` |
+ | **Confidence Calibration** | Confidence > 0.7 and correct | `+0.05` |
+ | **Confidence Calibration** | Confidence > 0.7 and wrong | `−0.10` |
+ | **Confident Hallucination** | Confidence > 0.8 and hallucinated fix | `−0.20` (amplifier) |
+ | **Resolution Reward** | Previously detected issue now resolved | `+0.15` |
+ | **Passive Penalty** | Unresolved detection + off-topic action | `−0.05` |
+ | **Overcorrection** | Extra fields modified unintentionally | `−0.05 × N` |
+ | **Repeated Detection** | Same issue flagged again | `−0.10` |
+
+ > The returned step reward is also adjusted by **±0.1** based on whether the sum of `per_record_scores` improved over the previous iteration.
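For instance, the confidence-related rows of the table read as a tiny function. The deltas are copied from the table above; the real `grade_step_details` combines many more components, so this is only a sketch of one slice of it:

```python
def calibration_delta(confidence: float, correct: bool, hallucinated: bool = False) -> float:
    """Confidence-calibration + confident-hallucination contribution, per the table."""
    delta = 0.0
    if confidence > 0.7:
        delta += 0.05 if correct else -0.10   # calibration bonus / penalty
    if confidence > 0.8 and hallucinated:
        delta += -0.20                        # confident-hallucination amplifier
    return round(delta, 2)
```

Note how a hallucinated fix at confidence 0.9 stacks both penalties: being wrong costs −0.10, and being *confidently* wrong costs a further −0.20.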
 
 
 
 
 
  ---
 
+ ### Final Task Score — `grade_task_result`
+
+ The terminal score is a weighted composite in the range **[0, 1]**:
+
+ ```
+ Final Score = 0.50 × normalized_record_score
+             + 0.20 × (1 − hallucination_rate)
+             + 0.15 × uncertainty_accuracy
+             + 0.15 × consistency_score
+ ```
+
+ Baseline results:
+
+ | Task | Difficulty | Score |
+ |---|---|---|
+ | `easy_vendor_onboarding` | 🟢 Easy | `0.73` |
+ | `medium_customer_normalization` | 🟡 Medium | `0.40` |
+ | `hard_customer_conflicts` | 🔴 Hard | `0.39` |
+
+ > Evaluated using `inference.py` with `Qwen/Qwen3-VL-30B-A3B-Instruct` via Novita.
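The formula is straightforward to reproduce. A minimal sketch, assuming each input metric is already normalized to [0, 1] (the actual `grade_task_result` may apply additional clamping):

```python
def final_task_score(normalized_record_score: float,
                     hallucination_rate: float,
                     uncertainty_accuracy: float,
                     consistency_score: float) -> float:
    """Weighted composite from the formula above; inputs each in [0, 1]."""
    return (0.50 * normalized_record_score
            + 0.20 * (1.0 - hallucination_rate)
            + 0.15 * uncertainty_accuracy
            + 0.15 * consistency_score)
```

One consequence of the weighting: an agent that fixes nothing but never hallucinates still earns the full 0.20 hallucination term, which is exactly the "better to abstain than to invent" incentive.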
+ ### Failure Telemetry
+
+ The `task_failure_messages` function surfaces structured, human-readable failure logs from the episode — making it straightforward to diagnose specific agent failure modes during evaluation and iteration.
 
  ---
 
+ ## 🌐 HTTP API Reference
+
+ The FastAPI server exposes a clean REST interface for agent integration:
+
+ | Endpoint | Method | Body / Params | Description |
+ |---|---|---|---|
+ | `/reset` | `POST` | `{ "seed": 42, "task_name": "hard" }` | Start a new episode |
+ | `/step` | `POST` | JSON matching the `Action` schema | Submit one agent action |
+ | `/state` | `GET` | — | Full internal state snapshot (debugging) |
+ | `/health` | `GET` | — | Liveness probe |
+ | `/docs` | `GET` | — | Interactive Swagger UI |
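A minimal client against these endpoints might look like the following sketch. The request shapes come from the table above; the response fields returned by the server are not documented here, so nothing below assumes them beyond "JSON comes back":

```python
import json
from urllib import request

BASE_URL = "http://localhost:7860"  # default app_port from the Space config

def make_reset_payload(seed: int = 42, task_name: str = "hard") -> dict:
    """Body for POST /reset, as shown in the endpoint table."""
    return {"seed": seed, "task_name": task_name}

def post(path: str, payload: dict) -> dict:
    """POST a JSON body to the server and decode the JSON response."""
    req = request.Request(
        BASE_URL + path,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

if __name__ == "__main__":
    # Requires a running server (e.g., the Docker container on port 7860).
    obs = post("/reset", make_reset_payload())
    step = post("/step", {
        "action_type": "detect_issue",
        "record_id": "C201",
        "field": "email",
        "confidence": 0.6,
    })
    print(step)
```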
 
+ <div align="center">
+
+ <br/>
+
+ **Built to make data-cleaning agents honest — not just accurate.**
+
+ <br/>
+
+ ⭐ **Star this repo** if DataOps GYM helped your research or evaluation work!
+
+ <br/>
+
+ </div>