varb15 committed on
Commit 346b0b1 · verified · 1 Parent(s): 6c1b2ac

Upload README.md with huggingface_hub

Files changed (1):
  1. README.md +267 -21

README.md CHANGED
• Duplicate "merge sort" instruction across rows
```

> The interactive replay UI with color-coded dataset visualization is available on the HF Space.

## Environment API

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/reset` | POST | Start a new episode with a corrupted dataset |
| `/step` | POST | Submit identified issues + proposed fixes |
| `/state` | GET | Get current episode state |
| `/health` | GET | Health check |

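As a concrete sketch of the request bodies (field names follow the `DataQAAction` model described under Key Modules; the exact endpoint schemas are an assumption, not a spec):

```python
import json

# Hypothetical request bodies for /reset and /step. Field names follow the
# DataQAAction model (issues, fixes, task_id); treat them as illustrative.
reset_body = {"task_id": "easy"}
step_body = {
    "task_id": "easy",
    "issues": ["row:3,col:salary,issue:out_of_range"],
    "fixes": ["row:3,col:salary,fix:55000"],
}

# With a running server you would POST these JSON bodies, e.g.
#   requests.post("http://localhost:8000/step", json=step_body)
print(json.dumps(step_body))
```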
## Tasks

**Difficulty progression**: Easy issues are individually obvious (empty fields, text in numeric columns). Medium issues require cross-column reasoning (e.g., total != qty * price) and set-membership checks. Hard issues require ML domain knowledge (e.g., val_loss < train_loss suggests data leakage). Expert tasks (alignment, coding, toolcalling, moderation) require domain expertise, semantic reasoning, and cross-row comparison.

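The cross-column check can be sketched in a few lines (the column names `qty`, `price`, `total` and the sample rows are illustrative, not from the shipped tasks):

```python
import csv
import io

# Illustrative data: the second row violates total == qty * price.
data = "qty,price,total\n2,5.0,10.0\n3,4.0,11.0"
rows = list(csv.DictReader(io.StringIO(data)))

# Emit issue keys in the environment's action format (1-indexed rows).
flags = [
    f"row:{i},col:total,issue:inconsistent_value"
    for i, row in enumerate(rows, start=1)
    if float(row["total"]) != float(row["qty"]) * float(row["price"])
]
print(flags)
```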
### Alignment Task: LLM Training Data Quality (Expert)

Built on **real data from [NVIDIA HelpSteer](https://huggingface.co/datasets/nvidia/HelpSteer)**: 30 human-annotated prompt-response pairs with quality scores (helpfulness, correctness, coherence, complexity, verbosity on a 0-4 scale).

This task targets a critical real-world problem: **catching quality issues in LLM fine-tuning data before it corrupts model training**. The 12 planted issues represent failure modes actually seen in production data pipelines:

| Issue | Difficulty | Why It's Hard |
|---|---|---|
| Subtle factual error (*Cerasus* vs *Prunus serrulata*) | 3.0 | Old taxonomic synonym; sounds plausible, requires domain knowledge |
| Plausible wrong numbers ($400.3M at Sotheby's vs $450.3M at Christie's) | 3.0 | Right painting, wrong price by $50M and wrong auction house |
| Self-contradictory reasoning ("does NOT learn via backprop" then describes backprop) | 3.0 | Response negates its own conclusion; trains confused models |
| Hallucinated citation (fake Nature paper by fake Dr. Sarah Chen) | 3.0 | Fabricated study with specific fake statistics; most dangerous for training |
| Harmful coding advice ("use bare except everywhere") with high quality scores | 3.0 | Teaches dangerous practices if used for fine-tuning |
| Toxic/biased response scored as helpful | 3.0 | Gender-biased stereotypes with helpfulness=4; poisons alignment training |
| Leaked system prompt (`[SYSTEM] You are a helpful AI...`) in response | 2.5 | Data pipeline failed to strip prompt template |
| Semantic near-duplicate prompt (rephrased, not exact copy) | 2.5 | Requires semantic similarity detection, not just string matching |
| Truncated response (cut mid-sentence) | 2.5 | `max_length` truncation without sentence boundary detection |
| Response in French for English prompt | 2.0 | Language contamination from multilingual training data |
| Response plagiarized from another row | 2.0 | Data pipeline shuffling/dedup failure |
| Whitespace-only prompt | 2.0 | Empty training example from pipeline artifact |

### Coding Task: Code Quality (Expert)

A 20-row dataset of code instruction-response pairs (Python algorithms, data structures, web, design patterns). 10 planted issues:

- Syntax errors in "correct" code (unbalanced parens)
- Logic bugs marked `is_correct=true` (binary search off-by-one infinite loop)
- Security vulnerabilities (`eval()` on user input) marked correct
- Language mismatches (JavaScript response labeled Python)
- Truncated code, difficulty label mismatches, duplicate instructions, wrong categories, missing test cases

### Tool-Calling Task: Function Schema Quality (Expert)

A 20-row dataset of function definitions with parameter schemas, example calls, and outputs. 10 planted issues:

- Function name mismatch between definition and example call
- Missing required parameters in example call
- Hallucinated parameters not in schema
- Type mismatches (string "high" for an integer quality parameter)
- Invalid JSON, duplicate function names, misleading descriptions, wrong categories

### Moderation Task: Content Label Quality (Expert)

A 30-row dataset modeled on content moderation pipelines. 10 planted issues:

- Mislabeled hate speech and violence (unflagged toxic content)
- False positives on clean text (idioms flagged as hate)
- Subset rule violations (`hate_threatening` without `hate` flag)
- Out-of-range label values

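The subset rule (a fine-grained flag implies its parent category) can be checked mechanically; a sketch with illustrative label rows:

```python
# Illustrative binary label rows; the second violates the subset rule
# (`hate_threatening` set without the parent `hate` flag).
rows = [
    {"hate": 1, "hate_threatening": 1},
    {"hate": 0, "hate_threatening": 1},
    {"hate": 0, "hate_threatening": 0},
]

# Collect 1-indexed row numbers that break the implication.
violations = [
    i for i, r in enumerate(rows, start=1)
    if r["hate_threatening"] == 1 and r["hate"] == 0
]
print(violations)
```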
## Two-Phase Action Space

### Phase 1: Identify Issues

Submit issues in the format `row:<row_number>,col:<column_name>,issue:<issue_type>`, where:

- `row_number`: 1-indexed data row position (after the header)
- `column_name`: exact column header name, lowercase
- `issue_type`: one of the supported types below

### Phase 2: Propose Fixes

Submit fixes in the format `row:<row_number>,col:<column_name>,fix:<corrected_value>`.

The agent proposes the **correct value** that should replace the corrupted data. Fixes are graded against the original clean dataset.

Both phases can be submitted in the same step or across multiple steps.

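A minimal parser for both action formats above (a sketch; the environment's own parsing is more lenient about case and delimiter variations):

```python
import re

# Matches both "row:3,col:salary,issue:out_of_range"
# and "row:3,col:salary,fix:55000".
ACTION_RE = re.compile(r"row:(\d+),col:([^,]+),(issue|fix):(.+)")

def parse_action(s: str) -> dict:
    m = ACTION_RE.fullmatch(s.strip())
    if m is None:
        raise ValueError(f"unrecognized action: {s!r}")
    row, col, kind, value = m.groups()
    # kind is either "issue" or "fix"; use it as the key.
    return {"row": int(row), "col": col, kind: value}

print(parse_action("row:3,col:salary,issue:out_of_range"))
```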
**Supported Issue Types:**

| Type | Description | Example |
|------|-------------|---------|
| `missing_value` | Null, empty, or whitespace-only | Empty name field |
| `wrong_type` | Value doesn't match expected type | Salary as "seventy-five thousand" |
| `duplicate_row` | Exact duplicate or duplicate key | Two rows with same employee_id |
| `out_of_range` | Value outside valid range | Salary of 5000 when min is 50000 |
| `format_violation` | Wrong format or invalid enum | Date as DD/MM/YYYY instead of YYYY-MM-DD |
| `inconsistent_value` | Computed field mismatch, logical inconsistency | total != qty * price |
| `statistical_outlier` | Unreasonable value given context | resnet18 using 42.5GB GPU |
| `referential_integrity` | Foreign key violation | (available for custom tasks) |

## Observation Space

| Field | Type | Description |
|-------|------|-------------|
| `dataset_csv` | str | The corrupted dataset in CSV format |
| `schema_description` | str | Column types, ranges, and constraints |
| `validation_rules` | str | Business rules the data must satisfy |
| `task_description` | str | Task context and instructions |
| `feedback` | str | Per-step results: TP/FP/FN, precision/recall, fix scores |
| `num_issues_hint` | int | Exact count of planted issues |
| `max_steps` | int | Maximum attempts allowed |
| `done` | bool | Whether the episode has terminated |
| `reward` | float | Best combined reward so far (strict 0-1 range) |

**Observation Metadata** (per step):

- Identify: `identify_f1`, `identify_score`, `precision`, `recall`, `tp`, `fp`, `fn`
- Fix: `fix_score`, `fixes_correct`, `fixes_partial`, `fixes_wrong`, `fixes_attempted`
- Combined: `combined_reward`, `difficulty_found`, `difficulty_missed`

## Reward Function

### Combined Reward

```
combined_reward = 0.6 * identify_score + 0.4 * fix_score
```

If no fixes are submitted, `combined_reward = identify_score` (no penalty; backward compatible).

### Identify Score (Difficulty-Weighted F1)

Each planted issue has a **difficulty weight** (1.0-3.0):

| Weight | Category | Examples |
|--------|----------|----------|
| 1.0 | Easy | Missing values, obvious out-of-range, wrong type |
| 1.5-2.0 | Medium | Duplicate keys, format violations, cross-column checks |
| 2.5-3.0 | Hard | Data leakage, statistical outliers, hallucinated citations |

- **Weighted recall** = (difficulty of found issues) / (total difficulty)
- **Weighted precision** = penalizes false positives in proportion to average difficulty
- **Weighted F1** = harmonic mean of weighted precision and recall

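A sketch of the difficulty-weighted F1 (the false-positive weighting is paraphrased from the bullet above and may differ from the actual implementation):

```python
def weighted_identify_score(found, missed, num_false_positives):
    """found/missed: difficulty weights of planted issues that were / weren't flagged.
    num_false_positives: count of flagged cells with no planted issue."""
    total = sum(found) + sum(missed)
    if total == 0:
        return 0.0
    recall = sum(found) / total
    # Assumption: each false positive is weighted by the average issue difficulty.
    avg_difficulty = total / (len(found) + len(missed))
    denom = sum(found) + num_false_positives * avg_difficulty
    precision = sum(found) / denom if denom else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(weighted_identify_score([3.0, 1.0], [2.0], num_false_positives=0))
```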
### Fix Score (Tiered Grading by Issue Type)

Each proposed fix is graded with tiered scoring that gives partial credit for reasonable attempts:

| Fix Quality | Score | Description |
|-------------|-------|-------------|
| Exact match | 1.0 | Case-insensitive, whitespace-stripped match with the clean value |
| Valid fix | 0.8 | Right type/range, addresses the issue (e.g., any non-empty value for a missing field) |
| Partially valid | 0.4 | Reasonable attempt, right direction (e.g., numeric in the right ballpark) |
| Right cell, wrong value | 0.1 | Targets the correct cell, but the fix doesn't address the issue |
| Non-issue cell | 0.0 | Fix targets a cell with no issue |

Fix score = (sum over issues of best fix score × difficulty weight) / (total difficulty weight)

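The formula reduces to a difficulty-weighted average of each issue's best fix score; a sketch:

```python
def fix_score(best_scores, difficulty_weights):
    """best_scores: best tier score (0.0-1.0) achieved for each planted issue,
    0.0 for issues with no attempted fix. Implements the formula above."""
    total = sum(difficulty_weights)
    if total == 0:
        return 0.0
    return sum(s * w for s, w in zip(best_scores, difficulty_weights)) / total

# Three issues: exact match, valid fix, no fix attempted.
print(fix_score([1.0, 0.8, 0.0], [1.0, 2.0, 3.0]))
```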
### Reward Properties

| Property | Detail |
|----------|--------|
| Penalizes guessing | False positives reduce precision, fixing non-issues scores 0 |
| Multi-step improvement | Detailed feedback enables learning across attempts |

### Episode Boundaries

- Each task allows up to 3 steps (attempts)
- An episode ends when F1 >= 0.999 (perfect identification) or the step limit is reached
- The agent receives detailed feedback after each step to improve its next attempt

## Extensibility

### Custom Contamination Rules

```python
from dataqa_env import register_contamination_rule
from dataqa_env.server.tasks import PlantedIssue

def swap_digits(rows, header, col_idx, row_idx, rng):
    val = rows[row_idx][col_idx]
    corrupted = val[::-1]
    issue = PlantedIssue(
        row=row_idx + 1, col=header[col_idx],
        issue_type="format_violation",
        description=f"Digits swapped in {header[col_idx]}",
        difficulty=2.0,
    )
    return corrupted, issue

register_contamination_rule("swap_digits", swap_digits)
```

### Custom Tasks from Config

```python
from dataqa_env import create_task_from_config, register_task

task = create_task_from_config(
    task_id="custom",
    name="Custom Validation",
    description="Find quality issues in this dataset.",
    schema_description="id: int, name: str, score: int (0-100)",
    validation_rules="No missing values. Scores must be 0-100.",
    clean_csv="id,name,score\n1,Alice,95\n2,Bob,87\n3,Carol,92",
    contaminations=[
        {"rule": "missing_value", "row": 0, "col": 1, "difficulty": 1.0},
        {"rule": "negative_value", "row": 2, "col": 2, "difficulty": 1.5},
    ],
)
register_task("custom", lambda seed: task)
```

### Built-in Contamination Rules

| Rule | Effect | Default Difficulty |
|------|--------|--------------------|
| `missing_value` | Sets field to empty string | 1.0 |
| `whitespace_value` | Sets field to a single space | 2.5 |
| `wrong_type_text` | Replaces value with random text | 1.0 |
| `negative_value` | Negates numeric value | 1.0 |

## Setup & Quick Start

```bash
# Install
pip install -e .

# Run server locally
uvicorn dataqa_env.server.app:app --host 0.0.0.0 --port 8000

# Run inference (set your API credentials)
API_BASE_URL=https://router.huggingface.co/v1 \
MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
HF_TOKEN=your-token \
python inference.py
```

## Docker

```bash
docker build -t dataqa-env .
docker run -p 8000:8000 dataqa-env
```

## Testing

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

128 tests covering:

- Task creation, corruption, and difficulty weights for all 7 tasks
- Issue key and fix parsing (standard, lenient, edge cases)
- F1, weighted reward, and fix quality computation
- Full environment lifecycle (identify-only and identify+fix)
- Combined reward calculation and weight verification
- Inference script parsing and prompt building
- Structured log format (`[START]`, `[STEP]`, `[END]`)
- Score bounds (strict 0-1), best-score monotonicity
- Extensibility API (custom rules, custom tasks)
- Moderation task determinism and label consistency

## Validation

```bash
# OpenEnv spec validation
openenv validate .

# Pre-submission validation (requires HF Space URL)
./prevalidation_script.sh https://your-space.hf.space
```

## Environment Variables

| Variable | Description | Default |
|----------|-------------|---------|
| `API_BASE_URL` | LLM API endpoint | `https://router.huggingface.co/v1` |
| `MODEL_NAME` | Model identifier | `Qwen/Qwen2.5-72B-Instruct` |
| `HF_TOKEN` | Hugging Face token / API key | - |
| `ENV_URL` | Environment server URL | `http://localhost:8000` |

## Architecture

```
dataqa_env/
├── __init__.py             # Public API + extensibility exports
├── models.py               # Pydantic: DataQAAction (issues + fixes), DataQAObservation, DataQAState
├── client.py               # EnvClient for WebSocket connections
├── server/
│   ├── environment.py      # Two-phase DataQAEnvironment (identify + fix + combined reward)
│   ├── tasks.py            # 7 task definitions + contamination rules + extensibility API
│   ├── gradio_ui.py        # Interactive web UI with agent trajectory replay
│   ├── app.py              # FastAPI server (via openenv-core create_app)
│   └── Dockerfile
tests/
├── test_tasks.py           # Task creation, corruption, difficulty weights (all 7 tasks)
├── test_environment.py     # Identify scoring, fix grading, combined reward, lifecycle
├── test_inference.py       # LLM response parsing, fix parsing, prompt building, log format
└── test_extensibility.py   # Custom rules, custom tasks, registration API
inference.py                # Two-phase baseline agent (identify then fix)
openenv.yaml                # OpenEnv/HF Spaces spec
pyproject.toml              # Package metadata and dependencies
Dockerfile                  # Production container
```

### Key Modules

**`dataqa_env/server/tasks.py`** is the core of the environment. Each task function (`create_task_easy`, `create_task_coding`, etc.) builds a clean CSV dataset, injects corruptions as `PlantedIssue` objects with row/col/type/difficulty, and returns a `Task` dataclass. The `TASK_REGISTRY` dict maps task IDs to factory functions. The extensibility API (`register_task`, `register_contamination_rule`, `create_task_from_config`) lets users add new domains without modifying the source.

**`dataqa_env/server/environment.py`** defines the `DataQAEnvironment` class, which inherits from OpenEnv's `Environment` base. `reset()` loads a task by ID and returns the corrupted CSV + schema. `step()` parses issue keys and fix proposals from the action, computes difficulty-weighted F1 for identification, grades fixes with tiered scoring by issue type, and returns the combined reward with detailed feedback. It handles HTTP statelessness via auto-reset from `action.task_id`.

**`dataqa_env/models.py`** holds the Pydantic models for the OpenEnv interface. `DataQAAction` carries `issues: List[str]`, `fixes: List[str]`, and `task_id: str`. `DataQAObservation` carries the CSV, schema, rules, feedback, and scoring metadata. `DataQAState` tracks episode progress.

**`inference.py`** is the baseline LLM agent, built on an OpenAI-compatible API. It runs all 7 tasks sequentially with 3 steps each. Lenient regex parsing handles case variations and delimiter differences in LLM output. Structured logging in `[START]/[STEP]/[END]` format supports evaluation.