varb15 committed on
Commit 346b0b1 · verified · 1 Parent(s): 6c1b2ac

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +267 -21

README.md CHANGED
• Duplicate "merge sort" instruction across rows
```

> The interactive replay UI with color-coded dataset visualization is available on the HF Space.

## Environment API

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start a new episode with a corrupted dataset |
| `/step` | POST | Submit identified issues + proposed fixes |
| `/state` | GET | Get current episode state |
| `/health` | GET | Health check |

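As a concrete sketch of the request bodies (field names follow the `DataQAAction` model described under Key Modules; the exact endpoint schemas are an assumption, not a spec):

```python
import json

# Hypothetical request bodies for /reset and /step. Field names follow the
# DataQAAction model (issues, fixes, task_id); treat them as illustrative.
reset_body = {"task_id": "easy"}
step_body = {
    "task_id": "easy",
    "issues": ["row:3,col:salary,issue:out_of_range"],
    "fixes": ["row:3,col:salary,fix:55000"],
}

# With a running server you would POST these JSON bodies, e.g.
#   requests.post("http://localhost:8000/step", json=step_body)
print(json.dumps(step_body))
```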
## Tasks

**Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (e.g., total != qty * price) and set-membership checks. Hard issues require ML domain knowledge (e.g., val_loss < train_loss suggests data leakage). Expert tasks (alignment, coding, toolcalling, moderation) require domain expertise, semantic reasoning, and cross-row comparison.

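The cross-column check can be sketched in a few lines (the column names `qty`, `price`, `total` and the sample rows are illustrative, not from the shipped tasks):

```python
import csv
import io

# Illustrative data: the second row violates total == qty * price.
data = "qty,price,total\n2,5.0,10.0\n3,4.0,11.0"
rows = list(csv.DictReader(io.StringIO(data)))

# Emit issue keys in the environment's action format (1-indexed rows).
flags = [
    f"row:{i},col:total,issue:inconsistent_value"
    for i, row in enumerate(rows, start=1)
    if float(row["total"]) != float(row["qty"]) * float(row["price"])
]
print(flags)
```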
### Alignment Task: LLM Training Data Quality (Expert)

Built on **real data from [NVIDIA HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)**: 30 human-annotated prompt-response pairs with quality scores (helpfulness, correctness, coherence, complexity, verbosity on a 0-4 scale).

This task targets a critical real-world problem: **catching quality issues in LLM fine-tuning data before it corrupts model training**. The 12 planted issues represent failure modes actually seen in production data pipelines:

| Issue | Difficulty | Why It's Hard |
|---|---|---|
| Subtle factual error (*Cerasus* vs *Prunus serrulata*) | 3.0 | Old taxonomic synonym; sounds plausible, requires domain knowledge |
| Plausible wrong numbers ($400.3M at Sotheby's vs $450.3M at Christie's) | 3.0 | Right painting, wrong price by $50M and wrong auction house |
| Self-contradictory reasoning ("does NOT learn via backprop" then describes backprop) | 3.0 | Response negates its own conclusion; trains confused models |
| Hallucinated citation (fake Nature paper by fake Dr. Sarah Chen) | 3.0 | Fabricated study with specific fake statistics; most dangerous for training |
| Harmful coding advice ("use bare except everywhere") with high quality scores | 3.0 | Teaches dangerous practices if used for fine-tuning |
| Toxic/biased response scored as helpful | 3.0 | Gender-biased stereotypes with helpfulness=4; poisons alignment training |
| Leaked system prompt (`[SYSTEM] You are a helpful AI...`) in response | 2.5 | Data pipeline failed to strip prompt template |
| Semantic near-duplicate prompt (rephrased, not exact copy) | 2.5 | Requires semantic similarity detection, not just string matching |
| Truncated response (cut mid-sentence) | 2.5 | `max_length` truncation without sentence boundary detection |
| Response in French for English prompt | 2.0 | Language contamination from multilingual training data |
| Response plagiarized from another row | 2.0 | Data pipeline shuffling/dedup failure |
| Whitespace-only prompt | 2.0 | Empty training example from pipeline artifact |

### Coding Task: Code Quality (Expert)

A 20-row dataset of code instruction-response pairs (Python algorithms, data structures, web, design patterns). 10 planted issues:

- Syntax errors in "correct" code (unbalanced parens)
- Logic bugs marked `is_correct=true` (binary search off-by-one infinite loop)
- Security vulnerabilities (`eval()` on user input) marked correct
- Language mismatches (JavaScript response labeled Python)
- Truncated code, difficulty label mismatches, duplicate instructions, wrong categories, missing test cases

### Tool-Calling Task: Function Schema Quality (Expert)

A 20-row dataset of function definitions with parameter schemas, example calls, and outputs. 10 planted issues:

- Function name mismatch between definition and example call
- Missing required parameters in example call
- Hallucinated parameters not in schema
- Type mismatches (string "high" for an integer quality parameter)
- Invalid JSON, duplicate function names, misleading descriptions, wrong categories

### Moderation Task: Content Label Quality (Expert)

A 30-row dataset modeled on content moderation pipelines. 10 planted issues:

- Mislabeled hate speech and violence (unflagged toxic content)
- False positives on clean text (idioms flagged as hate)
- Subset rule violations (`hate_threatening` without `hate` flag)
- Out-of-range label values

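The subset rule (a fine-grained flag implies its parent category) can be checked mechanically; a sketch with illustrative label rows:

```python
# Illustrative binary label rows; the second violates the subset rule
# (`hate_threatening` set without the parent `hate` flag).
rows = [
    {"hate": 1, "hate_threatening": 1},
    {"hate": 0, "hate_threatening": 1},
    {"hate": 0, "hate_threatening": 0},
]

# Collect 1-indexed row numbers that break the implication.
violations = [
    i for i, r in enumerate(rows, start=1)
    if r["hate_threatening"] == 1 and r["hate"] == 0
]
print(violations)
```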
## Two-Phase Action Space

### Phase 1: Identify Issues

Submit issues in the format `row:<row_number>,col:<column_name>,issue:<issue_type>`, where:

- `row_number`: 1-indexed data row position (after the header)
- `column_name`: exact column header name, lowercase
- `issue_type`: one of the supported types below

### Phase 2: Propose Fixes

Submit fixes in the format `row:<row_number>,col:<column_name>,fix:<corrected_value>`.

The agent proposes the **correct value** that should replace the corrupted data. Fixes are graded against the original clean dataset.

Both phases can be submitted in the same step or across multiple steps.

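A minimal parser for both action formats above (a sketch; the environment's own parsing is more lenient about case and delimiter variations):

```python
import re

# Matches both "row:3,col:salary,issue:out_of_range"
# and "row:3,col:salary,fix:55000".
ACTION_RE = re.compile(r"row:(\d+),col:([^,]+),(issue|fix):(.+)")

def parse_action(s: str) -> dict:
    m = ACTION_RE.fullmatch(s.strip())
    if m is None:
        raise ValueError(f"unrecognized action: {s!r}")
    row, col, kind, value = m.groups()
    # kind is either "issue" or "fix"; use it as the key.
    return {"row": int(row), "col": col, kind: value}

print(parse_action("row:3,col:salary,issue:out_of_range"))
```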
**Supported Issue Types:**

| Type | Description | Example |
|------|-------------|---------|
| `missing_value` | Null, empty, or whitespace-only | Empty name field |
| `wrong_type` | Value doesn't match expected type | Salary as "seventy-five thousand" |
| `duplicate_row` | Exact duplicate or duplicate key | Two rows with same employee_id |
| `out_of_range` | Value outside valid range | Salary of 5000 when min is 50000 |
| `format_violation` | Wrong format or invalid enum | Date as DD/MM/YYYY instead of YYYY-MM-DD |
| `inconsistent_value` | Computed field mismatch, logical inconsistency | total != qty * price |
| `statistical_outlier` | Unreasonable value given context | resnet18 using 42.5GB GPU |
| `referential_integrity` | Foreign key violation | (available for custom tasks) |

## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `dataset_csv` | str | The corrupted dataset in CSV format |
| `schema_description` | str | Column types, ranges, and constraints |
| `validation_rules` | str | Business rules the data must satisfy |
| `task_description` | str | Task context and instructions |
| `feedback` | str | Per-step results: TP/FP/FN, precision/recall, fix scores |
| `num_issues_hint` | int | Exact count of planted issues |
| `max_steps` | int | Maximum attempts allowed |
| `done` | bool | Whether the episode has terminated |
| `reward` | float | Best combined reward so far (strict 0-1 range) |

**Observation Metadata** (per step):

- Identify: `identify_f1`, `identify_score`, `precision`, `recall`, `tp`, `fp`, `fn`
- Fix: `fix_score`, `fixes_correct`, `fixes_partial`, `fixes_wrong`, `fixes_attempted`
- Combined: `combined_reward`, `difficulty_found`, `difficulty_missed`

## Reward Function

### Combined Reward

```
combined_reward = 0.6 * identify_score + 0.4 * fix_score
```

If no fixes are submitted, `combined_reward = identify_score` (no penalty; backward compatible).

### Identify Score (Difficulty-Weighted F1)

Each planted issue has a **difficulty weight** (1.0-3.0):

| Weight | Category | Examples |
|--------|----------|----------|
| 1.0 | Easy | Missing values, obvious out-of-range, wrong type |
| 1.5-2.0 | Medium | Duplicate keys, format violations, cross-column checks |
| 2.5-3.0 | Hard | Data leakage, statistical outliers, hallucinated citations |

- **Weighted recall** = (difficulty of found issues) / (total difficulty)
- **Weighted precision** = penalizes false positives in proportion to average difficulty
- **Weighted F1** = harmonic mean of weighted precision and recall

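A sketch of the difficulty-weighted F1 (the false-positive weighting is paraphrased from the bullet above and may differ from the actual implementation):

```python
def weighted_identify_score(found, missed, num_false_positives):
    """found/missed: difficulty weights of planted issues that were / weren't flagged.
    num_false_positives: count of flagged cells with no planted issue."""
    total = sum(found) + sum(missed)
    if total == 0:
        return 0.0
    recall = sum(found) / total
    # Assumption: each false positive is weighted by the average issue difficulty.
    avg_difficulty = total / (len(found) + len(missed))
    denom = sum(found) + num_false_positives * avg_difficulty
    precision = sum(found) / denom if denom else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(weighted_identify_score([3.0, 1.0], [2.0], num_false_positives=0))
```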
### Fix Score (Tiered Grading by Issue Type)

Each proposed fix is graded with tiered scoring that gives partial credit for reasonable attempts:

| Fix Quality | Score | Description |
|-------------|-------|-------------|
| Exact match | 1.0 | Case-insensitive, whitespace-stripped match with the clean value |
| Valid fix | 0.8 | Right type/range, addresses the issue (e.g., any non-empty value for a missing field) |
| Partially valid | 0.4 | Reasonable attempt, right direction (e.g., numeric in the right ballpark) |
| Right cell, wrong value | 0.1 | Targets the correct cell, but the fix doesn't address the issue |
| Non-issue cell | 0.0 | Fix targets a cell with no issue |

Fix score = (sum over issues of best fix score × difficulty weight) / (total difficulty weight)

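The formula reduces to a difficulty-weighted average of each issue's best fix score; a sketch:

```python
def fix_score(best_scores, difficulty_weights):
    """best_scores: best tier score (0.0-1.0) achieved for each planted issue,
    0.0 for issues with no attempted fix. Implements the formula above."""
    total = sum(difficulty_weights)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in zip(best_scores, difficulty_weights)) / total

# Three issues: exact match, valid fix, no fix attempted.
print(fix_score([1.0, 0.8, 0.0], [1.0, 2.0, 3.0]))
```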
### Reward Properties

| Property | Detail |
|----------|--------|
| Penalizes guessing | False positives reduce precision, fixing non-issues scores 0 |
| Multi-step improvement | Detailed feedback enables learning across attempts |

### Episode Boundaries

- Each task allows up to 3 steps (attempts)
- An episode ends when F1 >= 0.999 (perfect identification) or the step limit is reached
- The agent receives detailed feedback after each step to improve its next attempt

## Extensibility

### Custom Contamination Rules

```python
from dataqa_env import register_contamination_rule
from dataqa_env.server.tasks import PlantedIssue

def swap_digits(rows, header, col_idx, row_idx, rng):
    val = rows[row_idx][col_idx]
    corrupted = val[::-1]
    issue = PlantedIssue(
        row=row_idx + 1, col=header[col_idx],
        issue_type="format_violation",
        description=f"Digits swapped in {header[col_idx]}",
        difficulty=2.0,
    )
    return corrupted, issue

register_contamination_rule("swap_digits", swap_digits)
```

### Custom Tasks from Config

```python
from dataqa_env import create_task_from_config, register_task

task = create_task_from_config(
    task_id="custom",
    name="Custom Validation",
    description="Find quality issues in this dataset.",
    schema_description="id: int, name: str, score: int (0-100)",
    validation_rules="No missing values. Scores must be 0-100.",
    clean_csv="id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92",
    contaminations=[
        {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
        {"rule": "negative_value", "row": 2, "col": 2, "difficulty": 1.5},
    ],
)
register_task("custom", lambda seed: task)
```

### Built-in Contamination Rules

| Rule | Effect | Default Difficulty |
|------|--------|--------------------|
| `missing_value` | Sets field to empty string | 1.0 |
| `whitespace_value` | Sets field to a single space | 2.5 |
| `wrong_type_text` | Replaces value with random text | 1.0 |
| `negative_value` | Negates numeric value | 1.0 |

## Setup & Quick Start

```bash
# Install
pip install -e .

# Run server locally
uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000

# Run inference (set your API credentials)
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=your-token \
python inference.py
```

## Docker

```bash
docker build -t dataqa-env .
docker run -p 8000:8000 dataqa-env
```

## Testing

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

128 tests covering:

- Task creation, corruption, and difficulty weights for all 7 tasks
- Issue key and fix parsing (standard, lenient, edge cases)
- F1, weighted reward, and fix quality computation
- Full environment lifecycle (identify-only and identify+fix)
- Combined reward calculation and weight verification
- Inference script parsing and prompt building
- Structured log format (`[START]`, `[STEP]`, `[END]`)
- Score bounds (strict 0-1), best-score monotonicity
- Extensibility API (custom rules, custom tasks)
- Moderation task determinism and label consistency

## Validation

```bash
# OpenEnv spec validation
openenv validate .

# Pre-submission validation (requires HF Space URL)
./prevalidation_script.sh https://your-space.hf.space
```

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
| `HF_TOKEN` | Hugging Face token / API key | - |
| `ENV_URL` | Environment server URL | `http://localhost:8000` |

## Architecture

```
dataqa_env/
├── __init__.py             # Public API + extensibility exports
├── models.py               # Pydantic: DataQAAction (issues + fixes), DataQAObservation, DataQAState
├── client.py               # EnvClient for WebSocket connections
├── server/
│   ├── environment.py      # Two-phase DataQAEnvironment (identify + fix + combined reward)
│   ├── tasks.py            # 7 task definitions + contamination rules + extensibility API
│   ├── gradio_ui.py        # Interactive web UI with agent trajectory replay
│   ├── app.py              # FastAPI server (via openenv-core create_app)
│   └── Dockerfile
tests/
├── test_tasks.py           # Task creation, corruption, difficulty weights (all 7 tasks)
├── test_environment.py     # Identify scoring, fix grading, combined reward, lifecycle
├── test_inference.py       # LLM response parsing, fix parsing, prompt building, log format
└── test_extensibility.py   # Custom rules, custom tasks, registration API
inference.py                # Two-phase baseline agent (identify then fix)
openenv.yaml                # OpenEnv/HF Spaces spec
pyproject.toml              # Package metadata and dependencies
Dockerfile                  # Production container
```

### Key Modules

**`dataqa_env/server/tasks.py`** is the core of the environment. Each task function (`create_task_easy`, `create_task_coding`, etc.) builds a clean CSV dataset, injects corruptions as `PlantedIssue` objects with row/col/type/difficulty, and returns a `Task` dataclass. The `TASK_REGISTRY` dict maps task IDs to factory functions. The extensibility API (`register_task`, `register_contamination_rule`, `create_task_from_config`) lets users add new domains without modifying the source.

**`dataqa_env/server/environment.py`** defines the `DataQAEnvironment` class, which inherits from OpenEnv's `Environment` base. `reset()` loads a task by ID and returns the corrupted CSV + schema. `step()` parses issue keys and fix proposals from the action, computes difficulty-weighted F1 for identification, grades fixes with tiered scoring by issue type, and returns the combined reward with detailed feedback. It handles HTTP statelessness via auto-reset from `action.task_id`.

**`dataqa_env/models.py`** holds the Pydantic models for the OpenEnv interface. `DataQAAction` carries `issues: List[str]`, `fixes: List[str]`, and `task_id: str`. `DataQAObservation` carries the CSV, schema, rules, feedback, and scoring metadata. `DataQAState` tracks episode progress.

**`inference.py`** is the baseline LLM agent, built on an OpenAI-compatible API. It runs all 7 tasks sequentially with 3 steps each. Lenient regex parsing handles case variations and delimiter differences in LLM output. Structured logging in `[START]/[STEP]/[END]` format supports evaluation.