rampluto committed on
Commit
fd09b74
·
verified ·
1 Parent(s): ea782b7

Upload folder using huggingface_hub

Dockerfile ADDED
@@ -0,0 +1,43 @@
+ # HF Space root-level Dockerfile — targets port 7860 (HF default).
+ # This file lives at envs/medusa_env/Dockerfile and is the file
+ # HF Spaces uses when deploying a Docker Space from this directory.
+
+ FROM python:3.12-slim
+
+ WORKDIR /app
+
+ # Install uv for fast dependency resolution
+ RUN pip install uv --no-cache-dir
+
+ # Copy environment code
+ COPY . /app/env
+
+ WORKDIR /app/env
+
+ # Install all dependencies including openenv-core + pandas + numpy
+ RUN uv pip install --system --no-cache \
+     "openenv-core[core]>=0.2.2" \
+     fastapi \
+     "uvicorn[standard]" \
+     pydantic \
+     pandas \
+     numpy \
+     websockets
+
+ # Install the medusa package itself (so medusa_env.* imports resolve)
+ RUN uv pip install --system --no-cache -e .
+
+ # HF Spaces requires port 7860
+ ENV PORT=7860
+ EXPOSE 7860
+
+ # PYTHONPATH so imports resolve correctly when running from /app/env
+ ENV PYTHONPATH="/app/env:$PYTHONPATH"
+ ENV ENABLE_WEB_INTERFACE=true
+
+ # Health check on HF port
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
+     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')" || exit 1
+
+ # Run on port 7860 — HF Space requirement
+ CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2026 Ram Janam Yadav
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,12 +1,219 @@
---
- title: Medusa Env
- emoji: 🏢
- colorFrom: pink
colorTo: blue
sdk: docker
pinned: false
- license: mit
- short_description: 'Reinforcement Learning developed using OpenEnv '
---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
---
+ title: MEDUSA Environment
+ emoji: 🦑
+ colorFrom: purple
colorTo: blue
sdk: docker
pinned: false
+ tags:
+ - openenv
+ - reinforcement-learning
+ - data-engineering
+ app_port: 7860
+ base_path: /web
---

+ # MEDUSA
+
+ **Medallion-Engineered Deterministic Unified Storage Agent**
+
+ An OpenEnv reinforcement learning environment that trains agents to act as *Relational Controllers* — orchestrating multi-source Bronze→Silver data integration pipelines inside a Medallion Architecture.
+
+ ---
+
+ ## Problem
+
+ Modern data platforms fail not because they can't clean a single table, but because they can't reliably integrate **multiple shifting sources**. The Bronze→Silver transition is a minefield of:
+
+ - **Stale data** — processing yesterday's snapshot wastes compute and produces wrong results
+ - **Schema drift** — new columns appear in sources that Silver doesn't know about yet
+ - **Dirty join keys** — NULLs and whitespace cause 0-row joins and silent data loss
+ - **Cartesian explosions** — joining on non-unique Dimension keys multiplies rows catastrophically
+ - **Orphaned records** — unmatched Fact rows must be quarantined, not silently dropped
+
+ MEDUSA trains an agent to detect and handle all of these autonomously.
+
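The dirty-key failure mode is easy to reproduce with plain pandas. A minimal sketch with made-up two-row data (not the environment's actual generator):

```python
import pandas as pd

# Fact keys carry stray whitespace; the dimension is clean.
fact = pd.DataFrame({"cust_id": [" C1 ", "C2"], "amount": [10.0, 20.0]})
dim = pd.DataFrame({"cust_id": ["C1", "C2"], "region": ["EU", "US"]})

# Joining on the raw keys silently drops the whitespace row.
dirty = fact.merge(dim, on="cust_id", how="inner")
print(len(dirty))  # 1 — a row was lost without any error

# Stripping keys first (conceptually what PREP_KEYS does) recovers it.
fact["cust_id"] = fact["cust_id"].str.strip()
clean = fact.merge(dim, on="cust_id", how="inner")
print(len(clean))  # 2
```

The loss is silent in both directions: an inner join shrinks, a left join emits NaN-filled rows, and neither raises.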
36
+ ---
37
+
38
+ ## Environment Overview
39
+
40
+ ```
41
+ Bronze A (Fact) ──┐
42
+ β”œβ”€β”€β–Ί [Agent] ──► Silver + /quarantine
43
+ Bronze B (Dim) β”€β”€β”˜
44
+ ```
45
+
46
+ The agent observes data quality signals and selects ETL actions step-by-step. At the end it issues `COMMIT`, triggering a deterministic grader audit.
47
+
48
+ ---
49
+
50
+ ## The MDP
51
+
52
+ ### Observation Space
53
+
54
+ A **16-element normalised float vector** `[0, 1]`:
55
+
56
+ | Index | Feature | Description |
57
+ |-------|---------|-------------|
58
+ | 0–1 | `time_delta_a/b_norm` | Source freshness (hours / 48h ceiling) |
59
+ | 2–3 | `is_stale_a/b` | Binary staleness flag |
60
+ | 4–5 | `null_ratio_key_a/b` | Fraction of null join keys |
61
+ | 6–7 | `uniqueness_a/b` | Key uniqueness ratio (1.0 = fully unique) |
62
+ | 8 | `match_rate` | % of Fact keys found in Dimension |
63
+ | 9–10 | `new_cols_a/b_norm` | Schema drift columns pending |
64
+ | 11 | `schema_compat` | Key type compatibility score |
65
+ | 12–14 | `did_prep_a/b`, `did_dedup_b` | Prerequisite action flags |
66
+ | 15 | `step_frac` | Episode progress (step / max_steps) |
67
+
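When logging or debugging, it helps to label the raw vector. A convenience sketch (the names follow the table above; the environment itself returns only the bare list):

```python
# Names for the 16 observation features, in index order.
FEATURE_NAMES = [
    "time_delta_a_norm", "time_delta_b_norm",
    "is_stale_a", "is_stale_b",
    "null_ratio_key_a", "null_ratio_key_b",
    "uniqueness_a", "uniqueness_b",
    "match_rate",
    "new_cols_a_norm", "new_cols_b_norm",
    "schema_compat",
    "did_prep_a", "did_prep_b", "did_dedup_b",
    "step_frac",
]

def decode(features: list[float]) -> dict[str, float]:
    """Label a raw 16-float observation vector for logging/debugging."""
    assert len(features) == len(FEATURE_NAMES)
    return dict(zip(FEATURE_NAMES, features))
```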
+ ### Action Space
+
+ 11 discrete actions:
+
+ | Action | Description |
+ |--------|-------------|
+ | `SYNC_CHECK` | Verify freshness of both sources |
+ | `EVOLVE_SCHEMA` | Add new columns from A/B into Silver schema |
+ | `PREP_KEYS_A` | Cast, strip, null-fill join key in Source A |
+ | `PREP_KEYS_B` | Cast, strip, null-fill join key in Source B |
+ | `DEDUPLICATE_B` | Ensure Dimension (B) is unique on the join key |
+ | `EXECUTE_JOIN_INNER` | Inner join A ⋈ B |
+ | `EXECUTE_JOIN_LEFT` | Left join A ⋈ B (orphans → quarantine) |
+ | `EXECUTE_JOIN_ANTI` | Anti-join: extract rows in A with no match in B |
+ | `APPLY_SCD_1` | Overwrite Silver records (SCD Type 1) |
+ | `APPLY_SCD_2` | Close old records, insert new with timestamps (SCD Type 2) |
+ | `COMMIT` | Finalise pipeline; triggers grader audit |
+
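The SCD Type 2 close-and-insert pattern behind `APPLY_SCD_2` can be sketched in plain pandas. A hypothetical example with made-up column names, not the environment's actual operator:

```python
import pandas as pd

silver = pd.DataFrame({
    "cust_id": ["C1"],
    "region": ["EU"],
    "valid_from": [pd.Timestamp("2024-01-01")],
    "valid_to": [pd.NaT],
    "is_current": [True],
})
now = pd.Timestamp("2024-06-01")

# SCD Type 2: close the current version, then append the new one.
silver.loc[silver["is_current"], "valid_to"] = now
silver["is_current"] = False
new_version = pd.DataFrame({
    "cust_id": ["C1"], "region": ["US"],
    "valid_from": [now], "valid_to": [pd.NaT], "is_current": [True],
})
silver = pd.concat([silver, new_version], ignore_index=True)

# Two versions, exactly one current, histories do not overlap.
assert len(silver) == 2 and int(silver["is_current"].sum()) == 1
```

SCD Type 1 (`APPLY_SCD_1`) would instead overwrite `region` in place, keeping one row and losing the history.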
+ ### Reward Model
+
+ | Event | Reward | Trigger |
+ |-------|--------|---------|
+ | High-Match Join | **+25.0** | `match_rate > 90%` after join |
+ | Quarantine Precision | **+10.0** | Orphaned rows correctly isolated |
+ | Correct SCD-2 | **+5.0** | SCD-2 applied on a tracked column |
+ | Grader All-Pass Bonus | **+15.0** | All 4 post-commit checks pass |
+ | Row Explosion | **−100.0** | Join output > 105% of Fact row count |
+ | Join on Dirty Keys | **−30.0** | Join without PREP_KEYS → 0-row result |
+ | Stale Processing | **−15.0** | Action taken while a source is stale and SYNC_CHECK was never called |
+ | Step Penalty | **−0.2** | Applied every step (efficiency incentive) |
+
+ ---
+
+ ## Post-Commit Grader
+
+ After `COMMIT` the deterministic grader runs 4 checks:
+
+ | Check | Pass Condition |
+ |-------|---------------|
+ | **Volume** | `Silver rows ≤ Source A rows` (for left joins) |
+ | **Integrity** | Quarantine holds only true orphans (not keys that could have joined if cleaned) |
+ | **Schema** | Silver contains the union of all required columns from A and B |
+ | **History** | SCD-2 `valid_from`/`valid_to` timestamps are non-overlapping |
+
+ All 4 pass → **+15.0** bonus. Each failure costs **−5.0**.
+
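The partial-credit arithmetic between those extremes (matching the constants defined in `grader.py` later in this commit: +3 per passing check, −5 per failing check) can be sketched as:

```python
def grader_bonus(passed: int, total: int = 4) -> float:
    """Terminal bonus for the post-commit audit: +15.0 when every check
    passes, -20.0 when every check fails, otherwise +3.0 per passing
    check and -5.0 per failing check."""
    failed = total - passed
    if failed == 0:
        return 15.0
    if passed == 0:
        return -20.0
    return passed * 3.0 - failed * 5.0

print(grader_bonus(3))  # 4.0  (3*3 - 1*5)
```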
+ ---
+
+ ## Episode Scenarios
+
+ Four canonical scenarios (selectable by seed):
+
+ | Seed | Scenario | Challenge |
+ |------|----------|-----------|
+ | 0 | `clean` | Fresh, unique keys, ~100% match rate. Baseline. |
+ | 1 | `dirty_keys` | NULLs + whitespace in join keys. Must PREP first. |
+ | 2 | `stale` | Source A is 8–24h old. Must SYNC_CHECK first. |
+ | 3 | `schema_drift` | New columns in A and B not yet in Silver. Must EVOLVE first. |
+
+ Random seeds produce blended variants.
+
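When scripting evaluations across scenarios, the canonical seeds can be kept as a small lookup (a convenience sketch; the names come from the table above):

```python
# Seed → scenario for the four canonical MEDUSA episodes.
CANONICAL_SCENARIOS = {0: "clean", 1: "dirty_keys", 2: "stale", 3: "schema_drift"}

def scenario_name(seed: int) -> str:
    """Named scenario for canonical seeds; any other seed blends variants."""
    return CANONICAL_SCENARIOS.get(seed, "blended")
```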
+ ---
+
+ ## Setup
+
+ ```bash
+ # Clone / navigate to the repo
+ cd /path/to/OpenEnv
+
+ # Create a venv and install all deps (including pandas, numpy)
+ uv sync
+
+ # Activate
+ source .venv/bin/activate
+ ```
+
+ ---
+
+ ## Running
+
+ ### Start the FastAPI server
+
+ ```bash
+ uvicorn envs.medusa_env.server.app:app --reload --host 0.0.0.0 --port 8000
+ ```
+
+ API docs are available at `http://localhost:8000/docs`.
+
+ ### Run tests
+
+ ```bash
+ python -m pytest tests/envs/test_medusa_environment.py -v
+ # 39 passed in ~4s
+ ```
+
+ ### Run a manual episode (Python)
+
+ ```python
+ from envs.medusa_env import MedusaEnv, MedusaAction
+ from envs.medusa_env.models import MedusaActionType
+
+ env = MedusaEnv(n_fact_rows=200, n_dim_rows=150)
+ obs = env.reset(seed=0)  # seed 0 = clean scenario
+ print(obs.message)
+
+ for action_type in [
+     MedusaActionType.SYNC_CHECK,
+     MedusaActionType.EVOLVE_SCHEMA,
+     MedusaActionType.PREP_KEYS_A,
+     MedusaActionType.PREP_KEYS_B,
+     MedusaActionType.DEDUPLICATE_B,
+     MedusaActionType.EXECUTE_JOIN_LEFT,
+     MedusaActionType.APPLY_SCD_2,
+     MedusaActionType.COMMIT,
+ ]:
+     obs = env.step(MedusaAction(action=action_type))
+     print(f"{action_type.value:25s} reward={obs.reward:+.1f} done={obs.done}")
+
+ print(f"\nGrader: {env.state.grader_report}")
+ ```
+
+ ---
+
+ ## Architecture
+
+ ```
+ envs/medusa_env/
+ ├── __init__.py       # Package exports
+ ├── medusa_env.py     # MedusaEnv — reset / step / commit loop
+ ├── models.py         # MedusaAction, MedusaObservation, MedusaState (Pydantic)
+ ├── scenarios.py      # ScenarioGenerator — procedural Bronze A/B DataFrames
+ ├── operators.py      # Stateless ETL functions (sync_check, prep_keys, execute_join, apply_scd …)
+ ├── rewards.py        # RewardEngine — per-step reward computation
+ ├── grader.py         # Grader — post-commit deterministic audit
+ ├── openenv.yaml      # OpenEnv environment manifest
+ └── server/
+     └── app.py        # FastAPI app via create_app()
+
+ tests/envs/
+ └── test_medusa_environment.py   # 39 tests across 6 test classes
+ ```
+
+ **Stack:** Python 3.10+ · Pandas · Pydantic v2 · FastAPI · OpenEnv
+
+ ---
+
+ ## Technical Notes
+
+ - **No external data required.** All Bronze tables are generated procedurally per episode.
+ - **No Spark or Delta Lake required.** All logic uses Pandas — identical semantics, zero cluster setup.
+ - The grader is fully deterministic: the same Silver + quarantine tables always produce the same audit result.
+ - The governance log (accessible at `env._tables.governance_log`) records every agent decision with its reward and operator metrics.
__init__.py ADDED
@@ -0,0 +1,34 @@
+ """MEDUSA (Medallion-Engineered Deterministic Unified Storage Agent) environment.
+
+ Full Bronze→Silver integration controller with:
+ - Multi-source join orchestration (inner / left / anti)
+ - Schema drift handling (EVOLVE_SCHEMA)
+ - Key preparation and deduplication
+ - SCD-1 and SCD-2 merge logic
+ - Per-step RL reward engine
+ - Deterministic post-commit grader
+ """
+
+ from .client import medusa_env
+ from .grader import Grader, GraderResult
+ from .models import MedusaAction, MedusaActionType, MedusaObservation, MedusaState
+ from .rewards import RewardEngine
+ from .scenarios import Scenario, ScenarioGenerator
+ from .tasks import TASKS, Task, TaskResult, score_episode
+
+ __all__ = [
+     "medusa_env",
+     "MedusaAction",
+     "MedusaActionType",
+     "MedusaObservation",
+     "MedusaState",
+     "Scenario",
+     "ScenarioGenerator",
+     "RewardEngine",
+     "Grader",
+     "GraderResult",
+     "TASKS",
+     "Task",
+     "TaskResult",
+     "score_episode",
+ ]
client.py ADDED
@@ -0,0 +1,127 @@
+ """MEDUSA Environment Client.
+
+ Connects to a running MEDUSA server via WebSocket for persistent sessions.
+
+ Example:
+     >>> # Connect to a running server
+     >>> with medusa_env(base_url="http://localhost:8000") as client:
+     ...     result = client.reset(seed=0)
+     ...     print(result.observation.message)
+     ...
+     ...     from envs.medusa_env.models import MedusaActionType
+     ...     result = client.step(MedusaAction(action=MedusaActionType.SYNC_CHECK))
+     ...     print(f"Reward: {result.reward}")
+
+ Example with Docker:
+     >>> client = medusa_env.from_docker_image("medusa_env:latest")
+     >>> try:
+     ...     result = client.reset()
+     ...     result = client.step(MedusaAction(action=MedusaActionType.COMMIT))
+     ... finally:
+     ...     client.close()
+ """
+
+ from typing import Any, Dict
+
+ # Support both in-repo and standalone imports
+ try:
+     from openenv.core.client_types import StepResult
+     from openenv.core.env_client import EnvClient
+
+     from .models import MedusaAction, MedusaObservation, MedusaState
+ except ImportError:
+     from models import MedusaAction, MedusaObservation, MedusaState
+
+     from openenv.core.client_types import StepResult
+     from openenv.core.env_client import EnvClient
+
+
+ class medusa_env(EnvClient[MedusaAction, MedusaObservation, MedusaState]):
+     """Client for the MEDUSA Bronze→Silver integration environment.
+
+     Maintains a persistent WebSocket connection to the MEDUSA server.
+     Each client instance has its own dedicated environment session.
+
+     The agent observes a 16-float data quality feature vector and chooses
+     from 11 discrete ETL actions to build a correct Silver entity from
+     two Bronze sources (Fact + Dimension).
+
+     Example:
+         >>> with medusa_env(base_url="http://localhost:8000") as env:
+         ...     result = env.reset(seed=0)  # clean scenario
+         ...     result = env.step(MedusaAction(action=MedusaActionType.SYNC_CHECK))
+         ...     result = env.step(MedusaAction(action=MedusaActionType.PREP_KEYS_A))
+         ...     result = env.step(MedusaAction(action=MedusaActionType.PREP_KEYS_B))
+         ...     result = env.step(MedusaAction(action=MedusaActionType.DEDUPLICATE_B))
+         ...     result = env.step(MedusaAction(action=MedusaActionType.EXECUTE_JOIN_LEFT))
+         ...     result = env.step(MedusaAction(action=MedusaActionType.APPLY_SCD_2))
+         ...     result = env.step(MedusaAction(action=MedusaActionType.COMMIT))
+         ...     print(result.reward)
+     """
+
+     def _step_payload(self, action: MedusaAction) -> Dict[str, Any]:
+         """Convert a MedusaAction to the JSON payload for the step request."""
+         return {
+             "action": action.action.value,
+             "params": action.params,
+         }
+
+     def _parse_result(self, payload: Dict[str, Any]) -> StepResult[MedusaObservation]:
+         """Parse the server response into a StepResult[MedusaObservation]."""
+         obs_data = payload.get("observation", {})
+         observation = MedusaObservation(
+             message=obs_data.get("message", ""),
+             features=obs_data.get("features", []),
+             metrics=obs_data.get("metrics", {}),
+             metadata=obs_data.get("metadata", {}),
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: Dict[str, Any]) -> MedusaState:
+         """Parse the server response into a MedusaState."""
+         return MedusaState(
+             run_id=payload.get("run_id"),
+             seed=payload.get("seed"),
+             scenario_id=payload.get("scenario_id"),
+             step_idx=payload.get("step_idx", 0),
+             stage=payload.get("stage", "init"),
+             # Freshness
+             time_delta_a=payload.get("time_delta_a", 0.0),
+             time_delta_b=payload.get("time_delta_b", 0.0),
+             is_stale_a=payload.get("is_stale_a", False),
+             is_stale_b=payload.get("is_stale_b", False),
+             did_sync_check=payload.get("did_sync_check", False),
+             # Key health
+             null_ratio_key_a=payload.get("null_ratio_key_a", 0.0),
+             null_ratio_key_b=payload.get("null_ratio_key_b", 0.0),
+             uniqueness_a=payload.get("uniqueness_a", 1.0),
+             uniqueness_b=payload.get("uniqueness_b", 1.0),
+             did_prep_a=payload.get("did_prep_a", False),
+             did_prep_b=payload.get("did_prep_b", False),
+             did_dedup_b=payload.get("did_dedup_b", False),
+             # Join
+             match_rate=payload.get("match_rate", 0.0),
+             did_join=payload.get("did_join", False),
+             join_type=payload.get("join_type"),
+             join_row_count=payload.get("join_row_count", 0),
+             explosion_detected=payload.get("explosion_detected", False),
+             # SCD
+             did_scd=payload.get("did_scd", False),
+             scd_type=payload.get("scd_type"),
+             scd_inserts=payload.get("scd_inserts", 0),
+             scd_updates=payload.get("scd_updates", 0),
+             # Silver / Quarantine
+             silver_row_count=payload.get("silver_row_count", 0),
+             quarantine_row_count=payload.get("quarantine_row_count", 0),
+             source_a_row_count=payload.get("source_a_row_count", 0),
+             # Grader
+             grader_passed=payload.get("grader_passed", False),
+             grader_report=payload.get("grader_report", ""),
+             cumulative_reward=payload.get("cumulative_reward", 0.0),
+         )
grader.py ADDED
@@ -0,0 +1,179 @@
+ """MEDUSA deterministic post-commit grader.
+
+ Runs a four-check audit after the agent issues COMMIT and returns a
+ ``GraderResult`` that feeds a bonus/penalty into the terminal reward.
+ """
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass, field
+ from typing import TYPE_CHECKING, List
+
+ import pandas as pd
+
+ if TYPE_CHECKING:
+     from .scenarios import Scenario
+
+
+ # ---------------------------------------------------------------------------
+ # GraderResult
+ # ---------------------------------------------------------------------------
+
+ @dataclass
+ class GraderResult:
+     """Outcome of the post-commit audit."""
+
+     passed: bool = False
+     volume_ok: bool = False     # Silver rows ≤ Source A rows (no duplicates from join)
+     integrity_ok: bool = False  # Quarantine holds only true orphans
+     schema_ok: bool = False     # Silver has union of required columns
+     history_ok: bool = False    # SCD-2 timestamps non-overlapping
+     failures: List[str] = field(default_factory=list)
+     bonus_reward: float = 0.0
+     report: str = ""
+
+
+ # Reward tuning
+ _BONUS_ALL_PASS = +15.0
+ _PENALTY_ALL_FAIL = -20.0
+ _BONUS_PER_CHECK = +3.0
+ _PENALTY_PER_FAIL = -5.0
+
+
+ # ---------------------------------------------------------------------------
+ # Grader
+ # ---------------------------------------------------------------------------
+
+ class Grader:
+     """Post-commit deterministic audit following MEDUSA spec §4."""
+
+     def audit(
+         self,
+         silver: pd.DataFrame,
+         quarantine: pd.DataFrame,
+         bronze_a: pd.DataFrame,
+         bronze_b: pd.DataFrame,
+         join_key: str,
+         join_type: str,
+         scd_type: int,
+         scenario: "Scenario",
+     ) -> GraderResult:
+         """Run all four grader checks and compute the bonus reward.
+
+         Args:
+             silver: The final Silver DataFrame after the SCD merge.
+             quarantine: Rows from A that did not match B.
+             bronze_a: Original fact source (pre-cleaning).
+             bronze_b: Original dimension source (pre-cleaning).
+             join_key: Column used for the join.
+             join_type: "inner" | "left" | "anti"
+             scd_type: 1 or 2
+             scenario: The current episode's scenario (has tracked_cols etc.)
+
+         Returns:
+             GraderResult with individual check statuses and bonus_reward.
+         """
+         result = GraderResult()
+
+         # ── 1. Volume Check ──────────────────────────────────────────────
+         # For left joins, Silver should not exceed the Source A row count.
+         if join_type == "left":
+             source_a_rows = len(bronze_a.dropna(subset=[join_key]))
+             if "is_current" in silver.columns:
+                 # Only count current records when SCD-2 history is present.
+                 silver_rows = int((silver["is_current"] == True).sum())  # noqa: E712
+             else:
+                 silver_rows = len(silver)
+             result.volume_ok = silver_rows <= source_a_rows * 1.05  # 5% tolerance
+             if not result.volume_ok:
+                 result.failures.append(
+                     f"VOLUME_FAIL: Silver {silver_rows} rows > Source A {source_a_rows} rows"
+                 )
+         else:
+             result.volume_ok = True  # Not applicable for inner/anti joins
+
+         # ── 2. Integrity Check ───────────────────────────────────────────
+         # Quarantine rows should be true orphans (no match in B even after cleaning).
+         if not quarantine.empty and join_key in quarantine.columns:
+             dim_keys = set(bronze_b[join_key].dropna().astype(str).str.strip())
+             quarantine_keys = set(quarantine[join_key].dropna().astype(str).str.strip())
+             # Orphan = quarantine key truly not in dim
+             could_join = quarantine_keys & dim_keys
+             if could_join:
+                 result.integrity_ok = False
+                 result.failures.append(
+                     f"INTEGRITY_FAIL: {len(could_join)} quarantine row(s) could have "
+                     f"been joined if keys were cleaned."
+                 )
+             else:
+                 result.integrity_ok = True
+         else:
+             result.integrity_ok = True  # Empty quarantine is fine
+
+         # ── 3. Schema Check ──────────────────────────────────────────────
+         # Silver must contain all required columns from A and B.
+         required_from_a = [c for c in bronze_a.columns if c != join_key]
+         required_from_b = [c for c in bronze_b.columns if c != join_key]
+         required = set(required_from_a + required_from_b + scenario.new_cols_a + scenario.new_cols_b)
+         silver_cols = set(silver.columns)
+         missing = required - silver_cols
+         if missing:
+             result.schema_ok = False
+             result.failures.append(f"SCHEMA_FAIL: Missing columns in Silver: {sorted(missing)}")
+         else:
+             result.schema_ok = True
+
+         # ── 4. History Check (SCD-2 only) ────────────────────────────────
+         if scd_type == 2 and "valid_from" in silver.columns and "valid_to" in silver.columns:
+             overlap_found = False
+             for _key_val, group in silver.groupby(join_key):
+                 if len(group) < 2:
+                     continue
+                 closed = group[group["valid_to"].notna()].sort_values("valid_from")
+                 for i in range(len(closed) - 1):
+                     vt_i = closed.iloc[i]["valid_to"]
+                     vf_next = closed.iloc[i + 1]["valid_from"]
+                     if pd.notna(vt_i) and pd.notna(vf_next) and vt_i > vf_next:
+                         overlap_found = True
+                         break
+                 if overlap_found:
+                     break
+             if overlap_found:
+                 result.history_ok = False
+                 result.failures.append("HISTORY_FAIL: SCD-2 timestamps overlap for some keys.")
+             else:
+                 result.history_ok = True
+         else:
+             result.history_ok = True  # Not applicable for SCD-1
+
+         # ── Compute bonus ────────────────────────────────────────────────
+         checks = [result.volume_ok, result.integrity_ok, result.schema_ok, result.history_ok]
+         passed_count = sum(checks)
+         failed_count = len(checks) - passed_count
+
+         result.passed = all(checks)
+
+         if result.passed:
+             result.bonus_reward = _BONUS_ALL_PASS
+         elif failed_count == len(checks):
+             result.bonus_reward = _PENALTY_ALL_FAIL
+         else:
+             # _PENALTY_PER_FAIL is already negative, so it is added per failed check.
+             result.bonus_reward = passed_count * _BONUS_PER_CHECK + failed_count * _PENALTY_PER_FAIL
+
+         result.report = _build_report(result)
+         return result
+
+
+ # ---------------------------------------------------------------------------
+ # Internal helpers
+ # ---------------------------------------------------------------------------
+
+ def _build_report(result: GraderResult) -> str:
+     lines = ["=== MEDUSA Grader Audit ==="]
+     lines.append(f"  Volume OK:    {'✓' if result.volume_ok else '✗'}")
+     lines.append(f"  Integrity OK: {'✓' if result.integrity_ok else '✗'}")
+     lines.append(f"  Schema OK:    {'✓' if result.schema_ok else '✗'}")
+     lines.append(f"  History OK:   {'✓' if result.history_ok else '✗'}")
+     lines.append(f"  Bonus Reward: {result.bonus_reward:+.1f}")
+     if result.failures:
+         lines.append("  Failures:")
+         for f in result.failures:
+             lines.append(f"    - {f}")
+     lines.append(f"  {'PASS ✓' if result.passed else 'FAIL ✗'}")
+     return "\n".join(lines)
models.py ADDED
@@ -0,0 +1,115 @@
+ from __future__ import annotations
+
+ from enum import Enum
+ from typing import Any, Dict, List, Optional
+
+ from pydantic import Field
+
+ from openenv.core.env_server.types import Action, Observation, State
+
+
+ class MedusaActionType(str, Enum):
+     """Discrete action set for the MEDUSA controller."""
+
+     SYNC_CHECK = "SYNC_CHECK"
+     EVOLVE_SCHEMA = "EVOLVE_SCHEMA"
+     PREP_KEYS_A = "PREP_KEYS_A"
+     PREP_KEYS_B = "PREP_KEYS_B"
+     DEDUPLICATE_B = "DEDUPLICATE_B"
+     EXECUTE_JOIN_INNER = "EXECUTE_JOIN_INNER"
+     EXECUTE_JOIN_LEFT = "EXECUTE_JOIN_LEFT"
+     EXECUTE_JOIN_ANTI = "EXECUTE_JOIN_ANTI"
+     APPLY_SCD_1 = "APPLY_SCD_1"
+     APPLY_SCD_2 = "APPLY_SCD_2"
+     COMMIT = "COMMIT"
+
+
+ class MedusaAction(Action):
+     """One controller action (enum + optional params for future use)."""
+
+     action: MedusaActionType
+     params: Dict[str, Any] = Field(default_factory=dict)
+
+
+ class MedusaState(State):
+     """Full pipeline controller state.
+
+     Tracks every book-keeping flag needed by the reward engine and grader.
+     """
+
+     run_id: Optional[str] = None
+     seed: Optional[int] = None
+     scenario_id: Optional[str] = None
+     max_steps: int = 20
+
+     step_idx: int = 0
+     stage: str = "init"  # init | running | committed | failed
+
+     # --- Freshness ---
+     time_delta_a: float = 0.0  # Hours since Source A last updated
+     time_delta_b: float = 0.0
+     is_stale_a: bool = False
+     is_stale_b: bool = False
+     did_sync_check: bool = False
+
+     # --- Schema ---
+     did_evolve_schema: bool = False
+     new_cols_a: int = 0  # Number of new columns in A not yet in Silver
+     new_cols_b: int = 0
+     schema_compat: float = 1.0  # 0-1 key-type compatibility score
+
+     # --- Key Health ---
+     null_ratio_key_a: float = 0.0
+     null_ratio_key_b: float = 0.0
+     uniqueness_a: float = 1.0  # 1.0 = fully unique
+     uniqueness_b: float = 1.0
+     did_prep_a: bool = False
+     did_prep_b: bool = False
+     did_dedup_b: bool = False
+
+     # --- Referential Integrity ---
+     match_rate: float = 0.0  # % of Key_A values found in Key_B
+
+     # --- Join Result ---
+     did_join: bool = False
+     join_type: Optional[str] = None
+     join_row_count: int = 0
+     explosion_detected: bool = False
+
+     # --- SCD ---
+     did_scd: bool = False
+     scd_type: Optional[str] = None
+     scd_inserts: int = 0
+     scd_updates: int = 0
+
+     # --- Silver / Quarantine ---
+     silver_row_count: int = 0
+     quarantine_row_count: int = 0
+     source_a_row_count: int = 0
+
+     # --- Grader ---
+     grader_passed: bool = False
+     grader_report: str = ""
+
+     # --- Governance ---
+     cumulative_reward: float = 0.0
+
+
+ class MedusaObservation(Observation):
+     """Observation returned to the agent after every step.
+
+     ``features`` is a 16-element normalised float vector suitable as
+     direct RL input::
+
+         [time_delta_a_norm, time_delta_b_norm, is_stale_a, is_stale_b,
+          null_ratio_key_a, null_ratio_key_b, uniqueness_a, uniqueness_b,
+          match_rate, new_cols_a_norm, new_cols_b_norm, schema_compat,
+          did_prep_a, did_prep_b, did_dedup_b, step_frac]
+     """
+
+     message: str = ""
+     features: List[float] = Field(default_factory=list)
+     metrics: Dict[str, Any] = Field(default_factory=dict)
+     metadata: Dict[str, Any] = Field(default_factory=dict)
+     reward: Optional[float] = None
+     done: bool = False
openenv.yaml ADDED
@@ -0,0 +1,6 @@
+ spec_version: 1
+ name: medusa_env
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
openenv_medusa.egg-info/PKG-INFO ADDED
@@ -0,0 +1,16 @@
+ Metadata-Version: 2.4
+ Name: openenv-medusa
+ Version: 0.2.0
+ Summary: MEDUSA: Medallion-Engineered Deterministic Unified Storage Agent — Bronze→Silver RL environment for OpenEnv
+ Requires-Python: >=3.10
+ License-File: LICENSE
+ Requires-Dist: openenv-core[core]>=0.2.2
+ Requires-Dist: fastapi>=0.115.0
+ Requires-Dist: pydantic>=2.0.0
+ Requires-Dist: uvicorn>=0.24.0
+ Requires-Dist: pandas>=2.0.0
+ Requires-Dist: numpy>=1.24.0
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
+ Requires-Dist: pytest-asyncio>=0.23.0; extra == "dev"
+ Dynamic: license-file
openenv_medusa.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,25 @@
+ LICENSE
+ README.md
+ pyproject.toml
+ ./__init__.py
+ ./client.py
+ ./grader.py
+ ./models.py
+ ./openenv.yaml
+ ./operators.py
+ ./rewards.py
+ ./scenarios.py
+ ./tasks.py
+ ./server/__init__.py
+ ./server/app.py
+ ./server/medusa_env.py
+ openenv_medusa.egg-info/PKG-INFO
+ openenv_medusa.egg-info/SOURCES.txt
+ openenv_medusa.egg-info/dependency_links.txt
+ openenv_medusa.egg-info/entry_points.txt
+ openenv_medusa.egg-info/requires.txt
+ openenv_medusa.egg-info/top_level.txt
+ server/__init__.py
+ server/app.py
+ server/medusa_env.py
+ tests/test_medusa_environment.py
openenv_medusa.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
openenv_medusa.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = server.app:main
openenv_medusa.egg-info/requires.txt ADDED
@@ -0,0 +1,10 @@
+ openenv-core[core]>=0.2.2
+ fastapi>=0.115.0
+ pydantic>=2.0.0
+ uvicorn>=0.24.0
+ pandas>=2.0.0
+ numpy>=1.24.0
+
+ [dev]
+ pytest>=8.0.0
+ pytest-asyncio>=0.23.0
openenv_medusa.egg-info/top_level.txt ADDED
@@ -0,0 +1,2 @@
+ medusa_env
+ server
operators.py ADDED
@@ -0,0 +1,315 @@
+ """MEDUSA ETL operators.
+
+ Each operator is a stateless function that takes DataFrame(s) and returns a
+ (result_df_or_None, metrics_dict) tuple. The environment calls these from
+ ``step()`` and passes the metrics to the reward engine.
+ """
+
+ from __future__ import annotations
+
+ import datetime
+ from typing import Any, Dict, Optional, Tuple
+
+ import pandas as pd
+
+
+ # ---------------------------------------------------------------------------
+ # Type aliases
+ # ---------------------------------------------------------------------------
+
+ Metrics = Dict[str, Any]
+ OpResult = Tuple[Optional[pd.DataFrame], Metrics]
+
+
+ # ---------------------------------------------------------------------------
+ # Operator: sync_check
+ # ---------------------------------------------------------------------------
+
+ def sync_check(
+     bronze_a: pd.DataFrame,
+     bronze_b: pd.DataFrame,
+     time_delta_a: float,
+     time_delta_b: float,
+     stale_threshold_hours: float = 6.0,
+ ) -> OpResult:
+     """Inspect the freshness of both sources.
+
+     Returns metrics about staleness without modifying any data.
+     """
+     is_stale_a = time_delta_a > stale_threshold_hours
+     is_stale_b = time_delta_b > stale_threshold_hours
+     metrics: Metrics = {
+         "time_delta_a": time_delta_a,
+         "time_delta_b": time_delta_b,
+         "is_stale_a": is_stale_a,
+         "is_stale_b": is_stale_b,
+         "rows_a": len(bronze_a),
+         "rows_b": len(bronze_b),
+     }
+     return None, metrics
+
+
+ # ---------------------------------------------------------------------------
+ # Operator: evolve_schema
+ # ---------------------------------------------------------------------------
+
+ def evolve_schema(
+     silver: pd.DataFrame,
+     bronze_a: pd.DataFrame,
+     bronze_b: pd.DataFrame,
+     new_cols_a: list[str],
+     new_cols_b: list[str],
+ ) -> OpResult:
+     """Add new columns (from schema drift) to the Silver DataFrame.
+
+     Fills missing historical rows with NA.
+     """
+     added: list[str] = []
+     result = silver.copy()
+
+     for col in new_cols_a + new_cols_b:
+         if col not in result.columns:
+             result[col] = pd.NA
+             added.append(col)
+
+     metrics: Metrics = {
+         "cols_added": added,
+         "new_cols_count": len(added),
+         "silver_col_count": len(result.columns),
+     }
+     return result, metrics
+
+
+ # ---------------------------------------------------------------------------
+ # Operator: prep_keys
+ # ---------------------------------------------------------------------------
+
+ def prep_keys(df: pd.DataFrame, key_col: str) -> OpResult:
+     """Cast, strip whitespace, and null-normalise the join key column.
+
+     Returns a cleaned copy of ``df`` with metrics about how many rows were
+     affected.
+     """
+     result = df.copy()
+     original_nulls = int(result[key_col].isna().sum())
+     original_len = len(result)
+
+     # Strip whitespace (treat blank strings as nulls)
+     result[key_col] = result[key_col].astype(str).str.strip()
+     result[key_col] = result[key_col].replace({"None": pd.NA, "nan": pd.NA, "": pd.NA})
+
+     # Cast to string (uniform type for the join)
+     result[key_col] = result[key_col].astype("string")
+
+     after_nulls = int(result[key_col].isna().sum())
+     null_ratio_before = original_nulls / max(original_len, 1)
+     null_ratio_after = after_nulls / max(original_len, 1)
+
+     metrics: Metrics = {
+         "null_ratio_before": null_ratio_before,
+         "null_ratio_after": null_ratio_after,
+         "rows_trimmed": original_len - after_nulls,
+         "null_rows_dropped": 0,  # We do NOT drop nulls; the grader catches orphans
+     }
+     return result, metrics
+
+
+ # ---------------------------------------------------------------------------
+ # Operator: deduplicate
+ # ---------------------------------------------------------------------------
+
+ def deduplicate(df: pd.DataFrame, key_col: str) -> OpResult:
+     """Ensure the Dimension (Source B) is unique on ``key_col``.
+
+     Keeps the last occurrence so the most-recent record wins.
+     """
+     original_len = len(df)
+     result = df.drop_duplicates(subset=[key_col], keep="last").reset_index(drop=True)
+     dupes_removed = original_len - len(result)
+
+     non_null = result[key_col].notna().sum()
+     uniqueness = non_null / max(len(result), 1)
+
+     metrics: Metrics = {
+         "dupes_removed": dupes_removed,
+         "uniqueness": float(uniqueness),
+         "rows_after": len(result),
+     }
+     return result, metrics
+
+
+ # ---------------------------------------------------------------------------
+ # Operator: execute_join
+ # ---------------------------------------------------------------------------
+
+ _EXPLOSION_MULTIPLIER = 1.05  # > 5% extra rows triggers the explosion alert
+
+
+ def execute_join(
+     fact: pd.DataFrame,
+     dim: pd.DataFrame,
+     key_col: str,
+     join_type: str,  # "inner" | "left" | "anti"
+ ) -> Tuple[pd.DataFrame, pd.DataFrame, Metrics]:
+     """Join Fact (A) with Dimension (B).
+
+     Returns ``(joined_df, quarantine_df, metrics)``.
+     ``quarantine_df`` contains rows from A that did not match B (orphans).
+     """
+     # Drop null-keyed rows from both sides before joining
+     fact_clean = fact.dropna(subset=[key_col])
+     dim_clean = dim.dropna(subset=[key_col])
+
+     # Compute the match rate before joining
+     fact_keys = set(fact_clean[key_col].astype(str))
+     dim_keys = set(dim_clean[key_col].astype(str))
+     overlap = fact_keys & dim_keys
+     match_rate = len(overlap) / max(len(fact_keys), 1)
+
+     if join_type == "anti":
+         # Anti-join: rows in A NOT in B go to quarantine
+         mask = ~fact_clean[key_col].astype(str).isin(dim_keys)
+         joined = pd.DataFrame(columns=list(fact_clean.columns) + [
+             c for c in dim_clean.columns if c != key_col
+         ])
+         quarantine = fact_clean[mask].copy()
+     elif join_type == "inner":
+         merged = fact_clean.merge(dim_clean, on=key_col, how="inner",
+                                   suffixes=("_a", "_b"))
+         quarantine = fact_clean[~fact_clean[key_col].astype(str).isin(dim_keys)].copy()
+         joined = merged
+     else:  # left
+         merged = fact_clean.merge(dim_clean, on=key_col, how="left",
+                                   suffixes=("_a", "_b"))
+         # Quarantine = rows where the dim columns are NaN (no match)
+         dim_cols = [c for c in dim_clean.columns if c != key_col]
+         if dim_cols:
+             no_match_mask = merged[dim_cols[0]].isna()
+         else:
+             no_match_mask = pd.Series(False, index=merged.index)
+         quarantine = merged[no_match_mask][[key_col]].copy()
+         joined = merged
+
+     # Explosion detection
+     explosion = len(joined) > len(fact_clean) * _EXPLOSION_MULTIPLIER
+
+     metrics: Metrics = {
+         "join_type": join_type,
+         "fact_rows": len(fact_clean),
+         "dim_rows": len(dim_clean),
+         "join_rows": len(joined),
+         "quarantine_rows": len(quarantine),
+         "match_rate": match_rate,
+         "explosion_detected": explosion,
+     }
+     return joined, quarantine, metrics
+
+
+ # ---------------------------------------------------------------------------
+ # Operator: apply_scd
+ # ---------------------------------------------------------------------------
+
+ def apply_scd(
+     silver: pd.DataFrame,
+     joined: pd.DataFrame,
+     key_col: str,
+     tracked_col: str,
+     scd_type: int,  # 1 or 2
+ ) -> OpResult:
+     """Merge the ``joined`` result into Silver using SCD-1 or SCD-2.
+
+     SCD-1: overwrite existing records.
+     SCD-2: close old records (valid_to = now) and insert new ones with
+     a fresh valid_from and valid_to = None (open record).
+     """
+     now = datetime.datetime.now(datetime.timezone.utc)  # timezone.utc for 3.10 compat
+     inserts = 0
+     updates = 0
+
+     if joined.empty:
+         metrics: Metrics = {
+             "scd_type": scd_type,
+             "inserts": 0,
+             "updates": 0,
+             "silver_rows": len(silver),
+         }
+         return silver, metrics
+
+     if silver.empty:
+         # First load — treat everything as inserts
+         result = joined.copy()
+         if scd_type == 2:
+             result["valid_from"] = now
+             result["valid_to"] = pd.NaT
+             result["is_current"] = True
+         inserts = len(result)
+         metrics = {
+             "scd_type": scd_type,
+             "inserts": inserts,
+             "updates": 0,
+             "silver_rows": len(result),
+         }
+         return result, metrics
+
+     if scd_type == 1:
+         # Upsert: overwrite matching records
+         exists_mask = silver[key_col].isin(joined[key_col])
+         new_keys_mask = ~joined[key_col].isin(silver[key_col])
+
+         result = silver[~exists_mask].copy()
+         result = pd.concat([result, joined], ignore_index=True)
+
+         updates = int(exists_mask.sum())
+         inserts = int(new_keys_mask.sum())
+
+     else:  # SCD-2
+         # Ensure Silver has the timestamp columns
+         if "valid_from" not in silver.columns:
+             silver = silver.copy()
+             silver["valid_from"] = now - datetime.timedelta(days=30)
+             silver["valid_to"] = pd.NaT
+             silver["is_current"] = True
+
+         silver_result = silver.copy()
+         new_rows: list[pd.DataFrame] = []
+
+         for _, new_row in joined.iterrows():
+             key_val = new_row[key_col]
+             current_mask = (silver_result[key_col] == key_val) & (silver_result["is_current"] == True)  # noqa: E712
+             current_rows = silver_result[current_mask]
+
+             if current_rows.empty:
+                 # New record
+                 row_df = pd.DataFrame([new_row])
+                 row_df["valid_from"] = now
+                 row_df["valid_to"] = pd.NaT
+                 row_df["is_current"] = True
+                 new_rows.append(row_df)
+                 inserts += 1
+             else:
+                 # Check whether the tracked column changed
+                 old_val = current_rows.iloc[0].get(tracked_col)
+                 new_val = new_row.get(tracked_col)
+                 if old_val != new_val:
+                     # Close the old record
+                     silver_result.loc[current_mask, "valid_to"] = now
+                     silver_result.loc[current_mask, "is_current"] = False
+                     # Insert the new record
+                     row_df = pd.DataFrame([new_row])
+                     row_df["valid_from"] = now
+                     row_df["valid_to"] = pd.NaT
+                     row_df["is_current"] = True
+                     new_rows.append(row_df)
+                     updates += 1
+
+         if new_rows:
+             silver_result = pd.concat([silver_result] + new_rows, ignore_index=True)
+         result = silver_result
+
+     metrics = {
+         "scd_type": scd_type,
+         "inserts": inserts,
+         "updates": updates,
+         "silver_rows": len(result),
+     }
+     return result, metrics
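The left-join quarantine path in `execute_join` above can be sketched with plain pandas. This is a minimal toy example, not the environment's actual tables; the `entity_id` values and the `dim_name` column are made up for illustration:

```python
import pandas as pd

# Toy Fact (A) and Dimension (B) tables; "K003" has no Dimension match.
fact = pd.DataFrame({"entity_id": ["K001", "K002", "K003"],
                     "fact_value": [10.0, 20.0, 30.0]})
dim = pd.DataFrame({"entity_id": ["K001", "K002"],
                    "dim_name": ["Name_K001", "Name_K002"]})

# A left join keeps every Fact row; unmatched rows get NaN dimension columns.
merged = fact.merge(dim, on="entity_id", how="left")

# Orphans (rows with no Dimension match) are routed to quarantine.
quarantine = merged[merged["dim_name"].isna()][["entity_id"]]

print(len(merged), len(quarantine))  # 3 1
```

The same NaN-probe on the first dimension column is what the operator uses to build its `quarantine_rows` metric.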
pyproject.toml ADDED
@@ -0,0 +1,37 @@
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-medusa"
+ version = "0.2.0"
+ description = "MEDUSA: Medallion-Engineered Deterministic Unified Storage Agent — Bronze→Silver RL environment for OpenEnv"
+ requires-python = ">=3.10"
+ dependencies = [
+     # Core OpenEnv dependencies
+     "openenv-core[core]>=0.2.2",
+     "fastapi>=0.115.0",
+     "pydantic>=2.0.0",
+     "uvicorn>=0.24.0",
+     # Data pipeline dependencies
+     "pandas>=2.0.0",
+     "numpy>=1.24.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-asyncio>=0.23.0",
+ ]
+
+ [project.scripts]
+ # Enables: uv run server (from the medusa_env directory)
+ server = "server.app:main"
+
+ [tool.setuptools]
+ include-package-data = true
+ packages = ["medusa_env", "medusa_env.server", "server"]
+ package-dir = { "medusa_env" = ".", "server" = "server" }
+
+ [tool.setuptools.package-data]
+ medusa_env = ["**/*.yaml", "**/*.yml"]
rewards.py ADDED
@@ -0,0 +1,107 @@
+ """MEDUSA reward engine.
+
+ Reward model as defined in the MEDUSA blueprint. All reward logic lives in a
+ single ``RewardEngine`` class so it can be unit-tested in isolation from the
+ environment.
+ """
+
+ from __future__ import annotations
+
+ from typing import Any, Dict
+
+
+ # ---------------------------------------------------------------------------
+ # Reward table (blueprint §3)
+ # ---------------------------------------------------------------------------
+
+ REWARD_TABLE: Dict[str, float] = {
+     "high_match_join": +25.0,       # match_rate > 0.90
+     "correct_scd2": +5.0,           # SCD-2 used on a tracked column
+     "quarantine_precision": +10.0,  # Orphaned rows correctly moved to quarantine
+     "row_explosion": -100.0,        # Cartesian product detected
+     "dirty_join": -30.0,            # Join attempted without PREP_KEYS → 0-row result
+     "stale_processing": -15.0,      # Action taken while a source is stale (not synced first)
+     "step_penalty": -0.2,           # Per-step efficiency penalty
+ }
+
+ HIGH_MATCH_THRESHOLD = 0.90
+
+
+ # ---------------------------------------------------------------------------
+ # RewardEngine
+ # ---------------------------------------------------------------------------
+
+ class RewardEngine:
+     """Compute the per-step reward from the action context and operator metrics."""
+
+     def evaluate(
+         self,
+         action_type: str,
+         metrics: Dict[str, Any],
+         state_before: Any,  # MedusaState snapshot before the step
+     ) -> float:
+         """Return the scalar reward for a single step.
+
+         Args:
+             action_type: The ``MedusaActionType`` value string (e.g. "SYNC_CHECK").
+             metrics: Dictionary returned by the corresponding operator.
+             state_before: State object *before* this step was applied.
+
+         Returns:
+             Scalar float reward.
+         """
+         reward = REWARD_TABLE["step_penalty"]  # always applied
+
+         if action_type == "SYNC_CHECK":
+             # No positive/negative signal from sync_check itself
+             pass
+
+         elif action_type in ("PREP_KEYS_A", "PREP_KEYS_B"):
+             # Neutral — prep is just a prerequisite
+             pass
+
+         elif action_type == "DEDUPLICATE_B":
+             pass
+
+         elif action_type == "EVOLVE_SCHEMA":
+             pass
+
+         elif action_type in ("EXECUTE_JOIN_INNER", "EXECUTE_JOIN_LEFT", "EXECUTE_JOIN_ANTI"):
+             explosion = metrics.get("explosion_detected", False)
+             if explosion:
+                 reward += REWARD_TABLE["row_explosion"]
+             else:
+                 join_rows = metrics.get("join_rows", 0)
+                 fact_rows = metrics.get("fact_rows", 1)
+                 # "Dirty join" = a join executed without PREP_KEYS that produced
+                 # 0 rows even though the source was non-empty
+                 if join_rows == 0 and fact_rows > 0:
+                     if not state_before.did_prep_a or not state_before.did_prep_b:
+                         reward += REWARD_TABLE["dirty_join"]
+                 else:
+                     match_rate = metrics.get("match_rate", 0.0)
+                     if match_rate >= HIGH_MATCH_THRESHOLD:
+                         reward += REWARD_TABLE["high_match_join"]
+
+             # Quarantine precision: reward if orphans were quarantined
+             quarantine_rows = metrics.get("quarantine_rows", 0)
+             if quarantine_rows > 0 and action_type == "EXECUTE_JOIN_LEFT":
+                 reward += REWARD_TABLE["quarantine_precision"]
+
+             # Stale processing: ran the join while a source was stale (never synced)
+             if (state_before.is_stale_a or state_before.is_stale_b) and not state_before.did_sync_check:
+                 reward += REWARD_TABLE["stale_processing"]
+
+         elif action_type in ("APPLY_SCD_1", "APPLY_SCD_2"):
+             if action_type == "APPLY_SCD_2":
+                 # Tracked columns always need history here, so SCD-2 earns the bonus
+                 reward += REWARD_TABLE["correct_scd2"]
+
+             if (state_before.is_stale_a or state_before.is_stale_b) and not state_before.did_sync_check:
+                 reward += REWARD_TABLE["stale_processing"]
+
+         elif action_type == "COMMIT":
+             # Base commit — the grader adds its bonus/penalty separately
+             pass
+
+         return reward
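The join-branch arithmetic above can be checked with a standalone sketch that replicates a subset of `REWARD_TABLE`. The helper `join_reward` is hypothetical, introduced only to illustrate how the per-step penalty combines with the join bonuses and penalties; it is not part of the module:

```python
# Subset of the MEDUSA reward table relevant to join outcomes.
REWARD_TABLE = {
    "high_match_join": 25.0,
    "row_explosion": -100.0,
    "step_penalty": -0.2,
}

HIGH_MATCH_THRESHOLD = 0.90


def join_reward(match_rate: float, explosion: bool) -> float:
    """Hypothetical helper: reward for one join step (sketch, not the real API)."""
    reward = REWARD_TABLE["step_penalty"]  # the per-step penalty is always applied
    if explosion:
        reward += REWARD_TABLE["row_explosion"]
    elif match_rate >= HIGH_MATCH_THRESHOLD:
        reward += REWARD_TABLE["high_match_join"]
    return reward


print(join_reward(0.95, False))  # 24.8
print(join_reward(0.50, True))   # -100.2
```

Note that the step penalty means even a perfect join never yields the full +25.0, which nudges the agent toward shorter episodes.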
scenarios.py ADDED
@@ -0,0 +1,215 @@
+ """MEDUSA scenario generator.
+
+ Produces randomised Bronze A (Fact) and Bronze B (Dimension) DataFrames to
+ drive each training episode. Four canonical scenarios cover the failure
+ modes described in the MEDUSA blueprint.
+ """
+
+ from __future__ import annotations
+
+ import random
+ from dataclasses import dataclass, field
+ from typing import List, Optional
+
+ import numpy as np
+ import pandas as pd
+
+
+ # ---------------------------------------------------------------------------
+ # Scenario dataclass
+ # ---------------------------------------------------------------------------
+
+ @dataclass
+ class Scenario:
+     """One episode's worth of Bronze source data plus configuration."""
+
+     id: str
+     bronze_a: pd.DataFrame   # Fact table (source of truth for volume)
+     bronze_b: pd.DataFrame   # Dimension table (must be unique on key)
+     join_key: str            # Column name used to join A and B
+     tracked_cols: List[str]  # Columns in B that require SCD-2 history
+     is_stale_a: bool         # Whether Source A is past the freshness threshold
+     is_stale_b: bool
+     time_delta_a: float      # Hours since Source A was last refreshed
+     time_delta_b: float
+     new_cols_a: List[str]    # Extra columns in A not in Silver yet
+     new_cols_b: List[str]    # Extra columns in B not in Silver yet
+     description: str = ""
+
+
+ # ---------------------------------------------------------------------------
+ # Internal helpers
+ # ---------------------------------------------------------------------------
+
+ _STALE_THRESHOLD_HOURS = 6.0
+
+
+ def _make_fact(
+     rng: random.Random,
+     n_rows: int,
+     key_col: str,
+     null_ratio: float = 0.0,
+     extra_cols: Optional[List[str]] = None,
+ ) -> pd.DataFrame:
+     """Create a synthetic Fact (Bronze A) DataFrame."""
+     keys = [f"K{i:04d}" for i in rng.sample(range(1, n_rows * 2), n_rows)]
+
+     # Inject nulls into the key
+     null_mask = rng.sample(range(n_rows), int(n_rows * null_ratio))
+     for idx in null_mask:
+         keys[idx] = None  # type: ignore[call-overload]
+
+     data = {
+         key_col: keys,
+         "fact_value": [rng.uniform(0, 1000) for _ in range(n_rows)],
+         "fact_category": [rng.choice(["A", "B", "C"]) for _ in range(n_rows)],
+         "created_at": pd.date_range("2024-01-01", periods=n_rows, freq="h"),
+     }
+     for col in (extra_cols or []):
+         data[col] = [rng.uniform(0, 100) for _ in range(n_rows)]
+
+     return pd.DataFrame(data)
+
+
+ def _make_dim(
+     rng: random.Random,
+     n_rows: int,
+     key_col: str,
+     null_ratio: float = 0.0,
+     uniqueness: float = 1.0,                 # < 1.0 means some keys are duplicated
+     match_keys: Optional[List[str]] = None,  # If given, use these as the key pool
+     extra_cols: Optional[List[str]] = None,
+     tracked_cols: Optional[List[str]] = None,
+ ) -> pd.DataFrame:
+     """Create a synthetic Dimension (Bronze B) DataFrame."""
+     if match_keys:
+         # Choose from the overlap pool to control referential integrity
+         available = list(match_keys)
+         keys = [rng.choice(available) for _ in range(n_rows)]
+     else:
+         keys = [f"K{i:04d}" for i in rng.sample(range(1, n_rows * 3), n_rows)]
+
+     # Inject duplicates (lower uniqueness)
+     if uniqueness < 1.0:
+         n_dupes = int(n_rows * (1 - uniqueness))
+         for i in rng.sample(range(n_rows), n_dupes):
+             keys[i] = keys[rng.randint(0, i - 1)] if i > 0 else keys[0]
+
+     # Inject nulls
+     null_mask = rng.sample(range(n_rows), int(n_rows * null_ratio))
+     for idx in null_mask:
+         keys[idx] = None  # type: ignore[call-overload]
+
+     data: dict = {key_col: keys, "dim_name": [f"Name_{k}" for k in keys]}
+     for col in (tracked_cols or []):
+         data[col] = [rng.choice(["x", "y", "z"]) for _ in range(n_rows)]
+     for col in (extra_cols or []):
+         data[col] = [rng.uniform(0, 100) for _ in range(n_rows)]
+
+     return pd.DataFrame(data)
+
+
+ # ---------------------------------------------------------------------------
+ # Scenario Generator
+ # ---------------------------------------------------------------------------
+
+ class ScenarioGenerator:
+     """Generates Bronze A/B DataFrames for MEDUSA episodes."""
+
+     STALE_THRESHOLD = _STALE_THRESHOLD_HOURS
+     JOIN_KEY = "entity_id"
+     TRACKED_COLS = ["dim_status"]
+
+     # Four canonical scenario types
+     CANONICAL: List[str] = ["clean", "dirty_keys", "stale", "schema_drift"]
+
+     def __init__(self, n_fact_rows: int = 200, n_dim_rows: int = 150):
+         self.n_fact_rows = n_fact_rows
+         self.n_dim_rows = n_dim_rows
+
+     def generate(self, seed: Optional[int] = None) -> Scenario:
+         """Generate a random scenario. Seeds 0-3 map onto the canonical scenarios."""
+         rng = random.Random(seed)
+         if seed is not None and 0 <= seed < len(self.CANONICAL):
+             return self._canonical(self.CANONICAL[seed], seed)
+         variant = rng.choice(self.CANONICAL)
+         return self._canonical(variant, seed)
+
+     def _canonical(self, variant: str, seed: Optional[int]) -> Scenario:
+         rng = random.Random(seed)
+         np_rng = np.random.default_rng(seed)
+         key = self.JOIN_KEY
+         n_a = self.n_fact_rows
+         n_b = self.n_dim_rows
+
+         if variant == "clean":
+             # Fresh, unique keys, ~100% match rate
+             fact = _make_fact(rng, n_a, key, null_ratio=0.0)
+             valid_keys = fact[key].dropna().tolist()
+             dim = _make_dim(rng, n_b, key, null_ratio=0.0, uniqueness=1.0,
+                             match_keys=valid_keys, tracked_cols=self.TRACKED_COLS)
+             return Scenario(
+                 id=f"clean_{seed}",
+                 bronze_a=fact, bronze_b=dim,
+                 join_key=key, tracked_cols=self.TRACKED_COLS,
+                 is_stale_a=False, is_stale_b=False,
+                 time_delta_a=1.0, time_delta_b=2.0,
+                 new_cols_a=[], new_cols_b=[],
+                 description="Clean scenario: fresh, unique keys, high match rate.",
+             )
+
+         elif variant == "dirty_keys":
+             # High null ratio in keys, no trimming / type-casting yet
+             fact = _make_fact(rng, n_a, key, null_ratio=0.25)
+             fact[key] = fact[key].apply(
+                 lambda k: f" {k} " if k and rng.random() < 0.3 else k  # whitespace noise
+             )
+             dim = _make_dim(rng, n_b, key, null_ratio=0.15, uniqueness=0.85,
+                             tracked_cols=self.TRACKED_COLS)
+             return Scenario(
+                 id=f"dirty_keys_{seed}",
+                 bronze_a=fact, bronze_b=dim,
+                 join_key=key, tracked_cols=self.TRACKED_COLS,
+                 is_stale_a=False, is_stale_b=False,
+                 time_delta_a=2.0, time_delta_b=3.0,
+                 new_cols_a=[], new_cols_b=[],
+                 description="Dirty keys: nulls + whitespace in join keys.",
+             )
+
+         elif variant == "stale":
+             # One or both sources have not refreshed recently
+             fact = _make_fact(rng, n_a, key, null_ratio=0.0)
+             valid_keys = fact[key].dropna().tolist()
+             dim = _make_dim(rng, n_b, key, null_ratio=0.0, uniqueness=1.0,
+                             match_keys=valid_keys, tracked_cols=self.TRACKED_COLS)
+             td_a = rng.uniform(8.0, 24.0)  # definitely stale
+             td_b = rng.uniform(0.5, 4.0)
+             return Scenario(
+                 id=f"stale_{seed}",
+                 bronze_a=fact, bronze_b=dim,
+                 join_key=key, tracked_cols=self.TRACKED_COLS,
+                 is_stale_a=td_a > self.STALE_THRESHOLD,
+                 is_stale_b=td_b > self.STALE_THRESHOLD,
+                 time_delta_a=td_a, time_delta_b=td_b,
+                 new_cols_a=[], new_cols_b=[],
+                 description=f"Stale scenario: Source A is {td_a:.1f}h old.",
+             )
+
+         else:  # schema_drift
+             # New columns in A and/or B not yet registered in Silver
+             extra_a = ["new_metric_a"]
+             extra_b = ["new_attr_b"]
+             fact = _make_fact(rng, n_a, key, null_ratio=0.0, extra_cols=extra_a)
+             valid_keys = fact[key].dropna().tolist()
+             dim = _make_dim(rng, n_b, key, null_ratio=0.0, uniqueness=1.0,
+                             match_keys=valid_keys,
+                             tracked_cols=self.TRACKED_COLS, extra_cols=extra_b)
+             return Scenario(
+                 id=f"schema_drift_{seed}",
+                 bronze_a=fact, bronze_b=dim,
+                 join_key=key, tracked_cols=self.TRACKED_COLS,
+                 is_stale_a=False, is_stale_b=False,
+                 time_delta_a=1.0, time_delta_b=1.5,
+                 new_cols_a=extra_a, new_cols_b=extra_b,
+                 description="Schema drift: new columns in A and B.",
+             )
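The seed-to-variant mapping in `ScenarioGenerator.generate` can be illustrated standalone. This is a sketch mirroring the logic above; `pick_variant` is a hypothetical stand-in for the method, shown only to demonstrate that seeds 0-3 deterministically select the canonical scenarios while other seeds still choose reproducibly:

```python
import random

# Same canonical order as ScenarioGenerator.CANONICAL.
CANONICAL = ["clean", "dirty_keys", "stale", "schema_drift"]


def pick_variant(seed):
    """Hypothetical stand-in for ScenarioGenerator.generate's variant choice."""
    rng = random.Random(seed)
    # Seeds 0-3 index the canonical list directly; other seeds pick randomly
    # but reproducibly, since random.Random(seed) is deterministic.
    if seed is not None and 0 <= seed < len(CANONICAL):
        return CANONICAL[seed]
    return rng.choice(CANONICAL)


print(pick_variant(2))                      # stale
print(pick_variant(7) == pick_variant(7))   # True
```

Deterministic seeding is what lets the tasks in `tasks.py` pin each difficulty level to a fixed scenario.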
scripts/inference.py ADDED
@@ -0,0 +1,288 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """MEDUSA inference script β€” OpenEnv Hackathon submission.
2
+
3
+ Runs an LLM agent (via OpenAI-compatible API) against all three MEDUSA tasks
4
+ and reports per-task scores (0.0–1.0).
5
+
6
+ Required environment variables:
7
+ API_BASE_URL The API endpoint for the LLM (OpenAI-compatible).
8
+ MODEL_NAME The model identifier to use for inference.
9
+ HF_TOKEN Your Hugging Face / API key (used as the API key).
10
+
11
+ Usage:
12
+ export API_BASE_URL="https://api.openai.com/v1"
13
+ export MODEL_NAME="gpt-4o-mini"
14
+ export HF_TOKEN="hf-..."
15
+ python inference.py
16
+
17
+ Output:
18
+ Prints per-task results and a final summary table to stdout.
19
+ Exits with code 0 if all tasks score >= 0.35, else 1.
20
+ """
21
+
22
+ from __future__ import annotations
23
+
24
+ import json
25
+ import os
26
+ import sys
27
+ import textwrap
28
+ import time
29
+ from typing import List, Optional
30
+
31
+ # ---------------------------------------------------------------------------
32
+ # Validate required environment variables before anything else
33
+ # ---------------------------------------------------------------------------
34
+
35
+ API_BASE_URL = os.environ.get("API_BASE_URL", "").rstrip("/")
36
+ MODEL_NAME = os.environ.get("MODEL_NAME", "")
37
+ HF_TOKEN = os.environ.get("HF_TOKEN", "")
38
+
39
+ _missing = [k for k, v in {
40
+ "API_BASE_URL": API_BASE_URL,
41
+ "MODEL_NAME": MODEL_NAME,
42
+ "HF_TOKEN": HF_TOKEN,
43
+ }.items() if not v]
44
+
45
+ if _missing:
46
+ print(f"ERROR: Missing required environment variables: {', '.join(_missing)}", file=sys.stderr)
47
+ print("Set them before running:", file=sys.stderr)
48
+ for k in _missing:
49
+ print(f" export {k}=<value>", file=sys.stderr)
50
+ sys.exit(1)
51
+
52
+ # ---------------------------------------------------------------------------
53
+ # OpenAI client (uses API_BASE_URL + HF_TOKEN as the key)
54
+ # ---------------------------------------------------------------------------
55
+
56
+ from openai import OpenAI # noqa: E402
57
+
58
+ client = OpenAI(
59
+ base_url=API_BASE_URL,
60
+ api_key=HF_TOKEN,
61
+ )
62
+
63
+ # ---------------------------------------------------------------------------
64
+ # MEDUSA environment imports
65
+ # ---------------------------------------------------------------------------
66
+
67
+ from pathlib import Path
68
+
69
+ # Dynamically add the OpenEnv repo root to sys.path so absolute imports work
70
+ # no matter where this script is executed from.
71
+ repo_root = str(Path(__file__).resolve().parent.parent.parent)
72
+ if repo_root not in sys.path:
73
+ sys.path.insert(0, repo_root)
74
+
75
+ try:
76
+ # In-repo
77
+ from envs.medusa_env import MedusaEnv
78
+ from envs.medusa_env.models import MedusaAction, MedusaActionType
79
+ from envs.medusa_env.tasks import TASKS, TaskResult, score_episode
80
+ except ImportError:
81
+ # Standalone (running from inside envs/medusa_env/ installation)
82
+ from medusa_env import MedusaEnv # type: ignore
83
+ from models import MedusaAction, MedusaActionType # type: ignore
84
+ from tasks import TASKS, TaskResult, score_episode # type: ignore
85
+
86
+ # ---------------------------------------------------------------------------
87
+ # System prompt
88
+ # ---------------------------------------------------------------------------
89
+
90
+ SYSTEM_PROMPT = textwrap.dedent("""
91
+ You are a data integration agent controlling a Bronze→Silver ETL pipeline.
92
+
93
+ You observe a 16-float feature vector describing data quality signals, and
94
+ you must choose one action per step from the list below.
95
+
96
+ ACTIONS (respond with ONLY the action name β€” nothing else):
97
+ SYNC_CHECK β€” Verify source freshness before processing
98
+ EVOLVE_SCHEMA β€” Add new columns from sources into Silver schema
99
+ PREP_KEYS_A β€” Clean and normalise join keys in Source A (Fact)
100
+ PREP_KEYS_B β€” Clean and normalise join keys in Source B (Dimension)
101
+ DEDUPLICATE_B β€” Remove duplicate keys from Source B
102
+ EXECUTE_JOIN_INNER β€” Inner join A β‹ˆ B
103
+ EXECUTE_JOIN_LEFT β€” Left join A β‹ˆ B (keeps all Fact rows; orphans β†’ quarantine)
104
+ EXECUTE_JOIN_ANTI β€” Anti-join: extract Fact rows with no Dimension match
105
+ APPLY_SCD_1 β€” Overwrite Silver records (SCD Type 1)
106
+ APPLY_SCD_2 β€” Close old records and insert new with timestamps (SCD Type 2)
107
+ COMMIT β€” Finalise pipeline and trigger audit
108
+
109
+ STRATEGY:
110
+ 1. Always call SYNC_CHECK first to verify freshness.
111
+ 2. If schema drift signals are non-zero (features[9] or [10] > 0), call EVOLVE_SCHEMA.
112
+ 3. If null key ratios (features[4] or [5] > 0), call PREP_KEYS_A and/or PREP_KEYS_B.
113
+ 4. If Dimension uniqueness (features[7]) < 1.0, call DEDUPLICATE_B.
114
+ 5. Prefer EXECUTE_JOIN_LEFT to preserve all Fact rows.
115
+ 6. Prefer APPLY_SCD_2 for tracked history.
116
+ 7. Call COMMIT when pipeline is complete.
117
+
118
+ The feature vector indices:
119
+ [0] time_delta_a_norm [1] time_delta_b_norm
120
+ [2] is_stale_a [3] is_stale_b
121
+ [4] null_ratio_key_a [5] null_ratio_key_b
122
+ [6] uniqueness_a [7] uniqueness_b
123
+ [8] match_rate [9] new_cols_a_norm
124
+ [10] new_cols_b_norm [11] schema_compat
125
+ [12] did_prep_a [13] did_prep_b
126
+ [14] did_dedup_b [15] step_frac
127
+ """).strip()
128
+
129
+ # ---------------------------------------------------------------------------
130
+ # LLM action chooser
131
+ # ---------------------------------------------------------------------------
132
+
133
+ VALID_ACTIONS = {a.value for a in MedusaActionType}
134
+
135
+
136
+ def choose_action(
137
+ features: List[float],
138
+ history: List[dict],
139
+ step: int,
140
+ ) -> str:
141
+ """Ask the LLM to choose the next action given the current observation."""
142
+ feature_str = ", ".join(f"{v:.3f}" for v in features)
143
+ user_msg = (
144
+ f"Step {step}. Feature vector: [{feature_str}]\n"
145
+ "What is the single best next action? Respond with ONLY the action name."
146
+ )
147
+
148
+ messages = [{"role": "system", "content": SYSTEM_PROMPT}]
149
+ # Include the last 4 steps of history for context (keep prompt short)
150
+ for h in history[-4:]:
151
+ messages.append({"role": "user", "content": h["user"]})
152
+ messages.append({"role": "assistant", "content": h["assistant"]})
153
+ messages.append({"role": "user", "content": user_msg})
154
+
155
+ response = client.chat.completions.create(
156
+ model=MODEL_NAME,
157
+ messages=messages,
158
+ max_tokens=20,
159
+ temperature=0.0,
160
+ )
161
+ raw = (response.choices[0].message.content or "").strip().upper().replace(" ", "_")
162
+
163
+ # Fuzzy match: accept if the response contains a valid action name
164
+ for action in VALID_ACTIONS:
165
+ if action in raw:
166
+ return action
167
+
168
+ # Fallback: extract the longest matching token
169
+ for action in sorted(VALID_ACTIONS, key=len, reverse=True):
170
+ if action.replace("_", "") in raw.replace("_", ""):
171
+ return action
172
+
173
+ # Hard fallback: commit to end gracefully
174
+ return MedusaActionType.COMMIT.value
175
+
176
+
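The two-stage fuzzy matching above is easy to exercise in isolation. A self-contained sketch that mirrors the parsing logic with a reduced, hard-coded action set (no SDK or environment imports needed):

```python
# Reduced action set for illustration; the real set comes from MedusaActionType.
VALID_ACTIONS = {"SYNC_CHECK", "EXECUTE_JOIN_LEFT", "APPLY_SCD_2", "COMMIT"}

def parse_action(text: str) -> str:
    """Normalise an LLM reply and resolve it to a valid action name."""
    raw = text.strip().upper().replace(" ", "_")
    # Stage 1: direct substring match against known action names.
    for action in VALID_ACTIONS:
        if action in raw:
            return action
    # Stage 2: underscore-insensitive match, longest names first.
    for action in sorted(VALID_ACTIONS, key=len, reverse=True):
        if action.replace("_", "") in raw.replace("_", ""):
            return action
    # Hard fallback: commit to end the episode gracefully.
    return "COMMIT"
```

Replies like `"execute join left"` or `"I choose SYNC_CHECK"` resolve cleanly; unrecognised text falls through to COMMIT.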
177
+ # ---------------------------------------------------------------------------
178
+ # Run one task
179
+ # ---------------------------------------------------------------------------
180
+
181
+ def run_task(task_id: str, max_steps: int = 15) -> TaskResult:
182
+ """Run the LLM agent for one MEDUSA task. Returns the TaskResult."""
183
+ task = TASKS[task_id]
184
+ print(f"\n{'='*60}")
185
+ print(f"TASK: {task.name} [{task.difficulty.upper()}] (seed={task.seed})")
186
+ print(f" {task.description}")
187
+ print(f"{'='*60}")
188
+
189
+ env = MedusaEnv(n_fact_rows=200, n_dim_rows=150, max_steps=max_steps)
190
+ obs = env.reset(seed=task.seed)
191
+
192
+ history: List[dict] = []
193
+ step = 0
194
+ t0 = time.time()
195
+
196
+ while not obs.done and step < max_steps:
197
+ step += 1
198
+ action_str = choose_action(obs.features, history, step)
199
+ action_type = MedusaActionType(action_str)
200
+ action = MedusaAction(action=action_type)
201
+
202
+ obs = env.step(action)
203
+ reward = obs.reward or 0.0
204
+
205
+ print(f" Step {step:2d}: {action_str:25s} reward={reward:+7.2f} "
206
+ f"cumulative={env.state.cumulative_reward:+8.2f}")
207
+
208
+ history.append({
209
+ "user": (f"Step {step}. Features: [{', '.join(f'{v:.3f}' for v in obs.features)}]"
210
+ " What action?"),
211
+ "assistant": action_str,
212
+ })
213
+
214
+ elapsed = time.time() - t0
215
+ result = score_episode(task_id, env.state, env._tables)
216
+
217
+ print(f"\n → Score: {result.score:.4f} Grade: {result.grade} "
218
+ f"Passed: {result.passed} ({elapsed:.1f}s)")
219
+ if result.notes:
220
+ for note in result.notes:
221
+ print(f" ⚠ {note}")
222
+ print(f" → Breakdown: " +
223
+ ", ".join(f"{k}={v:.2f}" for k, v in result.breakdown.items()))
224
+ return result
225
+
226
+
227
+ # ---------------------------------------------------------------------------
228
+ # Main
229
+ # ---------------------------------------------------------------------------
230
+
231
+ def main() -> None:
232
+ print("MEDUSA — Baseline Inference")
233
+ print(f"Model: {MODEL_NAME}")
234
+ print(f"API: {API_BASE_URL}")
235
+ print()
236
+
237
+ task_ids = ["clean_pipeline", "dirty_integration", "full_medallion"]
238
+ results: dict[str, TaskResult] = {}
239
+ total_start = time.time()
240
+
241
+ for task_id in task_ids:
242
+ result = run_task(task_id)
243
+ results[task_id] = result
244
+
245
+ total_elapsed = time.time() - total_start
246
+
247
+ # Summary
248
+ print(f"\n{'='*60}")
249
+ print("SUMMARY")
250
+ print(f"{'='*60}")
251
+ print(f"{'Task':<25} {'Difficulty':<8} {'Score':>6} {'Grade':>5} {'Pass?':>5}")
252
+ print("-" * 60)
253
+ all_passed = True
254
+ for task_id, result in results.items():
255
+ task = TASKS[task_id]
256
+ print(f"{task.name:<25} {task.difficulty:<8} "
257
+ f"{result.score:>6.4f} {result.grade:>5} {'YES' if result.passed else 'NO':>5}")
258
+ if not result.passed:
259
+ all_passed = False
260
+
261
+ print("-" * 60)
262
+ avg = sum(r.score for r in results.values()) / len(results)
263
+ print(f"{'Average':<25} {'':8} {avg:>6.4f}")
264
+ print(f"\nTotal time: {total_elapsed:.1f}s")
265
+
266
+ # Machine-readable output for the evaluator
267
+ output = {
268
+ "model": MODEL_NAME,
269
+ "tasks": {
270
+ tid: {
271
+ "score": r.score,
272
+ "grade": r.grade,
273
+ "passed": r.passed,
274
+ "breakdown": r.breakdown,
275
+ }
276
+ for tid, r in results.items()
277
+ },
278
+ "average_score": avg,
279
+ "all_passed": all_passed,
280
+ }
281
+ print("\n--- JSON RESULTS ---")
282
+ print(json.dumps(output, indent=2))
283
+
284
+ sys.exit(0 if all_passed else 1)
285
+
286
+
287
+ if __name__ == "__main__":
288
+ main()
server/__init__.py ADDED
@@ -0,0 +1,6 @@
1
+ """FastAPI server package for medusa_env."""
2
+ from .medusa_env import MedusaEnv
3
+
4
+ __all__ = [
5
+ "MedusaEnv"
6
+ ]
server/app.py ADDED
@@ -0,0 +1,37 @@
1
+ """FastAPI server for the MEDUSA environment.
2
+
3
+ Usage:
4
+ # Development:
5
+ uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
6
+
7
+ # Via openenv CLI:
8
+ openenv serve medusa_env
9
+ """
10
+
11
+ from __future__ import annotations
12
+
13
+ # Imports resolve via the installed ``medusa_env`` package. The Dockerfile
14
+ # adds /app/env to PYTHONPATH, so the same absolute imports also work when
15
+ # running directly from the environment directory or via the openenv CLI.
16
+ # (A single import style avoids fragile relative/bare-module fallbacks.)
17
+ from openenv.core.env_server.http_server import create_app
18
+ from medusa_env.server import MedusaEnv
19
+ from medusa_env.models import MedusaAction, MedusaObservation
20
+
21
+ app = create_app(
22
+ MedusaEnv,
23
+ MedusaAction,
24
+ MedusaObservation,
25
+ env_name="medusa_env",
26
+ )
27
+
28
+
29
+ def main() -> None:
30
+ """Entry point for direct execution."""
31
+ import uvicorn
32
+
33
+ uvicorn.run(app, host="0.0.0.0", port=8000)
34
+
35
+
36
+ if __name__ == "__main__":
37
+ main()
server/medusa_env.py ADDED
@@ -0,0 +1,514 @@
1
+ """MEDUSA — full environment implementation.
2
+
3
+ Replaces the Phase-1 skeleton with a complete reset/step pipeline that:
4
+ • Generates Bronze A/B data from ``ScenarioGenerator``
5
+ • Dispatches each action to the appropriate operator
6
+ • Computes per-step rewards via ``RewardEngine``
7
+ • Runs the deterministic grader on COMMIT
8
+ • Builds a 16-float normalized feature vector for the RL agent
9
+ • Maintains a governance log of every decision
10
+ """
11
+
12
+ from __future__ import annotations
13
+
14
+ import copy
15
+ import time
16
+ import uuid
17
+ from dataclasses import dataclass, field
18
+ from typing import Any, Dict, List, Optional
19
+
20
+ import pandas as pd
21
+
22
+ from openenv.core.env_server.interfaces import Environment
23
+ from openenv.core.env_server.types import EnvironmentMetadata
24
+
25
+ from medusa_env.grader import Grader
26
+ from medusa_env.models import MedusaAction, MedusaActionType, MedusaObservation, MedusaState
27
+ from medusa_env.operators import (
28
+ apply_scd,
29
+ deduplicate,
30
+ evolve_schema,
31
+ execute_join,
32
+ prep_keys,
33
+ sync_check,
34
+ )
35
+ from medusa_env.rewards import RewardEngine
36
+ from medusa_env.scenarios import Scenario, ScenarioGenerator
37
+
38
+
39
+ # ---------------------------------------------------------------------------
40
+ # Internal episode tables
41
+ # ---------------------------------------------------------------------------
42
+
43
+ @dataclass
44
+ class _EpisodeTables:
45
+ """In-memory tables for one episode."""
46
+
47
+ bronze_a: pd.DataFrame = field(default_factory=pd.DataFrame)
48
+ bronze_a_prepped: pd.DataFrame = field(default_factory=pd.DataFrame)
49
+ bronze_b: pd.DataFrame = field(default_factory=pd.DataFrame)
50
+ bronze_b_prepped: pd.DataFrame = field(default_factory=pd.DataFrame)
51
+ joined: pd.DataFrame = field(default_factory=pd.DataFrame)
52
+ silver: pd.DataFrame = field(default_factory=pd.DataFrame)
53
+ quarantine: pd.DataFrame = field(default_factory=pd.DataFrame)
54
+ governance_log: List[Dict[str, Any]] = field(default_factory=list)
55
+
56
+
57
+ # ---------------------------------------------------------------------------
58
+ # Feature vector builder
59
+ # ---------------------------------------------------------------------------
60
+
61
+ _MAX_TIME_DELTA = 48.0 # Normalisation ceiling (hours)
62
+ _MAX_COLS = 10.0 # Normalisation ceiling (new columns)
63
+
64
+
65
+ def _build_features(state: MedusaState) -> List[float]:
66
+ """Build the 16-float normalised observation vector."""
67
+ return [
68
+ min(state.time_delta_a / _MAX_TIME_DELTA, 1.0),
69
+ min(state.time_delta_b / _MAX_TIME_DELTA, 1.0),
70
+ float(state.is_stale_a),
71
+ float(state.is_stale_b),
72
+ state.null_ratio_key_a,
73
+ state.null_ratio_key_b,
74
+ state.uniqueness_a,
75
+ state.uniqueness_b,
76
+ state.match_rate,
77
+ min(state.new_cols_a / _MAX_COLS, 1.0),
78
+ min(state.new_cols_b / _MAX_COLS, 1.0),
79
+ state.schema_compat,
80
+ float(state.did_prep_a),
81
+ float(state.did_prep_b),
82
+ float(state.did_dedup_b),
83
+ min(state.step_idx / max(state.max_steps, 1), 1.0),
84
+ ]
85
+
86
+
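Every unbounded raw signal enters the vector through the same clamp, `min(value / ceiling, 1.0)`. A quick sketch of that normalisation, reusing the two ceilings defined above:

```python
MAX_TIME_DELTA = 48.0  # hours, matches _MAX_TIME_DELTA above
MAX_COLS = 10.0        # new columns, matches _MAX_COLS above

def clamp_norm(value: float, ceiling: float) -> float:
    """Scale a non-negative raw signal by its ceiling and clamp to [0, 1]."""
    return min(value / ceiling, 1.0)
```

A 24-hour staleness maps to 0.5; anything at or beyond the 48-hour ceiling saturates at 1.0, so the agent sees a bounded input regardless of how stale a source gets.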
87
+ # ---------------------------------------------------------------------------
88
+ # Main environment
89
+ # ---------------------------------------------------------------------------
90
+
91
+ class MedusaEnv(Environment[MedusaAction, MedusaObservation, MedusaState]):
92
+ """MEDUSA: Medallion-Engineered Deterministic Unified Storage Agent.
93
+
94
+ Simulates a Bronze→Silver data integration pipeline. The agent observes
95
+ data quality signals and chooses ETL actions to produce a correct,
96
+ historically consistent Silver entity.
97
+
98
+ Args:
99
+ scenario_seed: Fixed seed for deterministic episodes. ``None`` = random.
100
+ max_steps: Maximum steps per episode before forced termination.
101
+ stale_threshold_hours: Age (hours) at which a source is deemed stale.
102
+ n_fact_rows: Size of the Fact / Source A table.
103
+ n_dim_rows: Size of the Dimension / Source B table.
104
+ """
105
+
106
+ SUPPORTS_CONCURRENT_SESSIONS = True
107
+
108
+ def __init__(
109
+ self,
110
+ scenario_seed: Optional[int] = None,
111
+ max_steps: int = 20,
112
+ stale_threshold_hours: float = 6.0,
113
+ n_fact_rows: int = 200,
114
+ n_dim_rows: int = 150,
115
+ **kwargs: Any,
116
+ ):
117
+ super().__init__(**kwargs)
118
+ self._scenario_seed = scenario_seed
119
+ self._max_steps = max_steps
120
+ self._stale_threshold = stale_threshold_hours
121
+
122
+ self._generator = ScenarioGenerator(
123
+ n_fact_rows=n_fact_rows, n_dim_rows=n_dim_rows
124
+ )
125
+ self._reward_engine = RewardEngine()
126
+ self._grader = Grader()
127
+
128
+ self._state = MedusaState()
129
+ self._tables = _EpisodeTables()
130
+ self._scenario: Optional[Scenario] = None
131
+
132
+ # ------------------------------------------------------------------
133
+ # Metadata
134
+ # ------------------------------------------------------------------
135
+
136
+ def get_metadata(self) -> EnvironmentMetadata:
137
+ return EnvironmentMetadata(
138
+ name="medusa_env",
139
+ description=(
140
+ "MEDUSA: simulated Bronze→Silver integration controller for "
141
+ "multi-source joins, schema drift, and SCD merges."
142
+ ),
143
+ version="0.2.0",
144
+ documentation="envs/medusa_env/README.md",
145
+ )
146
+
147
+ # ------------------------------------------------------------------
148
+ # State
149
+ # ------------------------------------------------------------------
150
+
151
+ @property
152
+ def state(self) -> MedusaState:
153
+ return self._state
154
+
155
+ # ------------------------------------------------------------------
156
+ # Reset
157
+ # ------------------------------------------------------------------
158
+
159
+ def reset(
160
+ self,
161
+ seed: Optional[int] = None,
162
+ episode_id: Optional[str] = None,
163
+ **kwargs: Any,
164
+ ) -> MedusaObservation:
165
+ self._reset_rubric()
166
+
167
+ effective_seed = seed if seed is not None else self._scenario_seed
168
+ run_id = episode_id or str(uuid.uuid4())
169
+
170
+ # Generate scenario
171
+ self._scenario = self._generator.generate(seed=effective_seed)
172
+ scen = self._scenario
173
+
174
+ # Initialise tables
175
+ self._tables = _EpisodeTables(
176
+ bronze_a=scen.bronze_a.copy(),
177
+ bronze_a_prepped=scen.bronze_a.copy(),
178
+ bronze_b=scen.bronze_b.copy(),
179
+ bronze_b_prepped=scen.bronze_b.copy(),
180
+ )
181
+
182
+ # Compute initial key health metrics from raw Bronze
183
+ na_a = scen.bronze_a[scen.join_key].isna().sum()
184
+ na_b = scen.bronze_b[scen.join_key].isna().sum()
185
+ null_ratio_a = na_a / max(len(scen.bronze_a), 1)
186
+ null_ratio_b = na_b / max(len(scen.bronze_b), 1)
187
+
188
+ # Uniqueness of raw keys
189
+ nna_a = scen.bronze_a[scen.join_key].dropna()
190
+ nna_b = scen.bronze_b[scen.join_key].dropna()
191
+ uniq_a = nna_a.nunique() / max(len(nna_a), 1)
192
+ uniq_b = nna_b.nunique() / max(len(nna_b), 1)
193
+
194
+ # Match rate on raw keys
195
+ keys_a = set(nna_a.astype(str))
196
+ keys_b = set(nna_b.astype(str))
197
+ match_rate = len(keys_a & keys_b) / max(len(keys_a), 1)
198
+
199
+ self._state = MedusaState(
200
+ run_id=run_id,
201
+ seed=effective_seed,
202
+ scenario_id=scen.id,
203
+ max_steps=self._max_steps,
204
+ step_idx=0,
205
+ stage="running",
206
+ time_delta_a=scen.time_delta_a,
207
+ time_delta_b=scen.time_delta_b,
208
+ is_stale_a=scen.is_stale_a,
209
+ is_stale_b=scen.is_stale_b,
210
+ null_ratio_key_a=float(null_ratio_a),
211
+ null_ratio_key_b=float(null_ratio_b),
212
+ uniqueness_a=float(uniq_a),
213
+ uniqueness_b=float(uniq_b),
214
+ match_rate=float(match_rate),
215
+ new_cols_a=len(scen.new_cols_a),
216
+ new_cols_b=len(scen.new_cols_b),
217
+ source_a_row_count=len(scen.bronze_a),
218
+ )
219
+
220
+ features = _build_features(self._state)
221
+ obs = MedusaObservation(
222
+ message=(
223
+ f"MEDUSA episode started. Scenario: {scen.id}. "
224
+ f"{scen.description} "
225
+ f"Source A: {len(scen.bronze_a)} rows | "
226
+ f"Source B: {len(scen.bronze_b)} rows."
227
+ ),
228
+ features=features,
229
+ metrics={
230
+ "scenario_id": scen.id,
231
+ "null_ratio_key_a": null_ratio_a,
232
+ "null_ratio_key_b": null_ratio_b,
233
+ "match_rate": match_rate,
234
+ "is_stale_a": scen.is_stale_a,
235
+ "is_stale_b": scen.is_stale_b,
236
+ "new_cols_a": scen.new_cols_a,
237
+ "new_cols_b": scen.new_cols_b,
238
+ },
239
+ metadata={"run_id": run_id, "seed": effective_seed},
240
+ reward=None,
241
+ done=False,
242
+ )
243
+ return self._apply_transform(obs)
244
+
245
+ # ------------------------------------------------------------------
246
+ # Step
247
+ # ------------------------------------------------------------------
248
+
249
+ def step(
250
+ self,
251
+ action: MedusaAction,
252
+ timeout_s: Optional[float] = None,
253
+ **kwargs: Any,
254
+ ) -> MedusaObservation:
255
+ if self._state.stage != "running":
256
+ return self._apply_transform(MedusaObservation(
257
+ message=f"Episode not running (stage={self._state.stage}). Call reset().",
258
+ done=True,
259
+ reward=0.0,
260
+ features=_build_features(self._state),
261
+ metadata={"run_id": self._state.run_id},
262
+ ))
263
+
264
+ # Snapshot state *before* applying action (for reward evaluation)
265
+ state_before = copy.copy(self._state)
266
+ self._state.step_idx += 1
267
+
268
+ action_type = action.action
269
+ metrics: dict = {}
270
+ step_message = ""
271
+
272
+ scen = self._scenario
273
+ assert scen is not None, "reset() must be called before step()"
274
+
275
+ # ── Dispatch ──────────────────────────────────────────────────
276
+ try:
277
+ if action_type == MedusaActionType.SYNC_CHECK:
278
+ _, metrics = sync_check(
279
+ self._tables.bronze_a,
280
+ self._tables.bronze_b,
281
+ scen.time_delta_a,
282
+ scen.time_delta_b,
283
+ self._stale_threshold,
284
+ )
285
+ self._state.did_sync_check = True
286
+ step_message = (
287
+ f"SYNC_CHECK: A={scen.time_delta_a:.1f}h "
288
+ f"{'[STALE]' if scen.is_stale_a else '[FRESH]'} | "
289
+ f"B={scen.time_delta_b:.1f}h "
290
+ f"{'[STALE]' if scen.is_stale_b else '[FRESH]'}"
291
+ )
292
+
293
+ elif action_type == MedusaActionType.EVOLVE_SCHEMA:
294
+ result_df, metrics = evolve_schema(
295
+ self._tables.silver,
296
+ self._tables.bronze_a,
297
+ self._tables.bronze_b,
298
+ scen.new_cols_a,
299
+ scen.new_cols_b,
300
+ )
301
+ if result_df is not None:
302
+ self._tables.silver = result_df
303
+ self._state.did_evolve_schema = True
304
+ step_message = f"EVOLVE_SCHEMA: added {metrics.get('new_cols_count', 0)} column(s)."
305
+
306
+ elif action_type == MedusaActionType.PREP_KEYS_A:
307
+ result_df, metrics = prep_keys(
308
+ self._tables.bronze_a_prepped, scen.join_key
309
+ )
310
+ if result_df is not None:
311
+ self._tables.bronze_a_prepped = result_df
312
+ self._state.did_prep_a = True
313
+ self._state.null_ratio_key_a = float(metrics.get("null_ratio_after", 0.0))
314
+ step_message = (
315
+ f"PREP_KEYS_A: null ratio {metrics.get('null_ratio_before', 0):.2%}"
316
+ f"→{metrics.get('null_ratio_after', 0):.2%}."
317
+ )
318
+
319
+ elif action_type == MedusaActionType.PREP_KEYS_B:
320
+ result_df, metrics = prep_keys(
321
+ self._tables.bronze_b_prepped, scen.join_key
322
+ )
323
+ if result_df is not None:
324
+ self._tables.bronze_b_prepped = result_df
325
+ self._state.did_prep_b = True
326
+ self._state.null_ratio_key_b = float(metrics.get("null_ratio_after", 0.0))
327
+ step_message = (
328
+ f"PREP_KEYS_B: null ratio {metrics.get('null_ratio_before', 0):.2%}"
329
+ f"→{metrics.get('null_ratio_after', 0):.2%}."
330
+ )
331
+
332
+ elif action_type == MedusaActionType.DEDUPLICATE_B:
333
+ result_df, metrics = deduplicate(
334
+ self._tables.bronze_b_prepped, scen.join_key
335
+ )
336
+ if result_df is not None:
337
+ self._tables.bronze_b_prepped = result_df
338
+ self._state.did_dedup_b = True
339
+ self._state.uniqueness_b = float(metrics.get("uniqueness", 1.0))
340
+ step_message = f"DEDUPLICATE_B: removed {metrics.get('dupes_removed', 0)} duplicate(s)."
341
+
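Uniqueness here is the fraction of non-null key values that are distinct, the same ratio `reset()` computes from the raw Bronze keys. A minimal stdlib sketch of that metric (illustrative; the operator itself works on a pandas Series):

```python
def key_uniqueness(keys: list) -> float:
    """Fraction of non-null key values that are distinct (1.0 = no duplicates).

    Mirrors the reset() computation: an all-null or empty column yields 0.0
    because the denominator is clamped to at least 1.
    """
    non_null = [k for k in keys if k is not None]
    return len(set(non_null)) / max(len(non_null), 1)
```

A column like `["a", "a", "b", None]` scores 2/3, signalling that deduplication is needed before a safe join.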
342
+ elif action_type in {
343
+ MedusaActionType.EXECUTE_JOIN_INNER,
344
+ MedusaActionType.EXECUTE_JOIN_LEFT,
345
+ MedusaActionType.EXECUTE_JOIN_ANTI,
346
+ }:
347
+ join_map = {
348
+ MedusaActionType.EXECUTE_JOIN_INNER: "inner",
349
+ MedusaActionType.EXECUTE_JOIN_LEFT: "left",
350
+ MedusaActionType.EXECUTE_JOIN_ANTI: "anti",
351
+ }
352
+ join_type_str = join_map[action_type]
353
+ joined, quarantine, metrics = execute_join(
354
+ self._tables.bronze_a_prepped,
355
+ self._tables.bronze_b_prepped,
356
+ scen.join_key,
357
+ join_type_str,
358
+ )
359
+ self._tables.joined = joined
360
+ self._tables.quarantine = quarantine
361
+ self._state.did_join = True
362
+ self._state.join_type = join_type_str
363
+ self._state.join_row_count = int(metrics.get("join_rows", 0))
364
+ self._state.explosion_detected = bool(metrics.get("explosion_detected", False))
365
+ self._state.match_rate = float(metrics.get("match_rate", 0.0))
366
+ self._state.quarantine_row_count = len(quarantine)
367
+ step_message = (
368
+ f"EXECUTE_JOIN ({join_type_str.upper()}): "
369
+ f"{self._state.join_row_count} rows | "
370
+ f"match_rate={self._state.match_rate:.1%} | "
371
+ f"quarantine={self._state.quarantine_row_count} | "
372
+ f"{'⚠ EXPLOSION' if self._state.explosion_detected else 'OK'}"
373
+ )
374
+
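At the key level, the three join modes differ only in which Fact keys survive and which are flagged as orphans. A stdlib sketch of those semantics over plain key sets (illustrative; `execute_join` itself operates on DataFrames):

```python
def join_keys(keys_a: set, keys_b: set, how: str):
    """Return (kept_keys, quarantined_keys) for each join mode at key level."""
    matched = keys_a & keys_b
    orphans = keys_a - keys_b  # Fact keys with no Dimension match
    if how == "inner":
        return matched, set()   # orphans silently dropped
    if how == "left":
        return keys_a, orphans  # all Fact keys kept; orphans quarantined
    if how == "anti":
        return orphans, set()   # only the unmatched Fact keys
    raise ValueError(f"unknown join type: {how}")
```

This is why the system prompt prefers LEFT: it is the only mode that both preserves every Fact key and surfaces the orphans for quarantine.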
375
+ elif action_type in {MedusaActionType.APPLY_SCD_1, MedusaActionType.APPLY_SCD_2}:
376
+ scd_type_int = 1 if action_type == MedusaActionType.APPLY_SCD_1 else 2
377
+ tracked_col = scen.tracked_cols[0] if scen.tracked_cols else scen.join_key
378
+ result_df, metrics = apply_scd(
379
+ self._tables.silver,
380
+ self._tables.joined,
381
+ scen.join_key,
382
+ tracked_col,
383
+ scd_type_int,
384
+ )
385
+ if result_df is not None:
386
+ self._tables.silver = result_df
387
+ self._state.did_scd = True
388
+ self._state.scd_type = f"SCD-{scd_type_int}"
389
+ self._state.scd_inserts = int(metrics.get("inserts", 0))
390
+ self._state.scd_updates = int(metrics.get("updates", 0))
391
+ self._state.silver_row_count = int(metrics.get("silver_rows", 0))
392
+ step_message = (
393
+ f"APPLY_SCD-{scd_type_int}: "
394
+ f"{self._state.scd_inserts} inserts, "
395
+ f"{self._state.scd_updates} updates → "
396
+ f"Silver {self._state.silver_row_count} rows."
397
+ )
398
+
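SCD Type 2 never overwrites history: the open record is closed with a timestamp and a new version is appended. A minimal close-and-insert sketch over plain dicts (illustrative; `apply_scd` works on the Silver DataFrame):

```python
def scd2_upsert(history: list, key: str, new_value: str, ts: int) -> list:
    """Close the open record for `key` if its value changed, then append a new version."""
    current = next(
        (r for r in history if r["key"] == key and r["valid_to"] is None), None
    )
    if current is not None:
        if current["value"] == new_value:
            return history  # no change: keep the open record as-is
        current["valid_to"] = ts  # close the old version
    history.append(
        {"key": key, "value": new_value, "valid_from": ts, "valid_to": None}
    )
    return history
```

After two distinct values for the same key, the history holds both versions, with only the latest left open (`valid_to is None`); SCD Type 1 would have kept just one row.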
399
+ elif action_type == MedusaActionType.COMMIT:
400
+ return self._do_commit(state_before)
401
+
402
+ except Exception as exc: # noqa: BLE001
403
+ step_message = f"ERROR in {action_type}: {exc}"
404
+ metrics = {"error": str(exc)}
405
+
406
+ # ── Reward ────────────────────────────────────────────────────
407
+ reward = self._reward_engine.evaluate(
408
+ action_type=action_type.value,
409
+ metrics=metrics,
410
+ state_before=state_before,
411
+ )
412
+ self._state.cumulative_reward += reward
413
+
414
+ # ── Governance log ────────────────────────────────────────────
415
+ self._tables.governance_log.append({
416
+ "step": self._state.step_idx,
417
+ "action": action_type.value,
418
+ "reward": reward,
419
+ "cumulative_reward": self._state.cumulative_reward,
420
+ "metrics": metrics,
421
+ "timestamp": time.time(),
422
+ })
423
+
424
+ # Check step limit
425
+ done = self._state.step_idx >= self._state.max_steps
426
+ if done:
427
+ self._state.stage = "failed"
428
+ step_message += " [MAX STEPS REACHED]"
429
+
430
+ features = _build_features(self._state)
431
+ obs = MedusaObservation(
432
+ message=step_message,
433
+ features=features,
434
+ metrics=metrics,
435
+ metadata={
436
+ "run_id": self._state.run_id,
437
+ "step": self._state.step_idx,
438
+ "cumulative_reward": self._state.cumulative_reward,
439
+ },
440
+ reward=reward,
441
+ done=done,
442
+ )
443
+ return self._apply_transform(obs)
444
+
445
+ # ------------------------------------------------------------------
446
+ # Commit (terminal step)
447
+ # ------------------------------------------------------------------
448
+
449
+ def _do_commit(self, state_before: MedusaState) -> MedusaObservation:
450
+ """Run grader then finalise the episode."""
451
+ scen = self._scenario
452
+ assert scen is not None
453
+
454
+ # Base step reward
455
+ reward = self._reward_engine.evaluate(
456
+ action_type=MedusaActionType.COMMIT.value,
457
+ metrics={},
458
+ state_before=state_before,
459
+ )
460
+
461
+ # Grader audit
462
+ grader_result = self._grader.audit(
463
+ silver=self._tables.silver,
464
+ quarantine=self._tables.quarantine,
465
+ bronze_a=scen.bronze_a,
466
+ bronze_b=scen.bronze_b,
467
+ join_key=scen.join_key,
468
+ join_type=self._state.join_type or "left",
469
+ scd_type=int(self._state.scd_type[-1]) if self._state.scd_type else 1,
470
+ scenario=scen,
471
+ )
472
+ reward += grader_result.bonus_reward
473
+ self._state.grader_passed = grader_result.passed
474
+ self._state.grader_report = grader_result.report
475
+ self._state.cumulative_reward += reward
476
+ self._state.silver_row_count = len(self._tables.silver)
477
+ self._state.quarantine_row_count = len(self._tables.quarantine)
478
+ self._state.stage = "committed"
479
+
480
+ self._tables.governance_log.append({
481
+ "step": self._state.step_idx,
482
+ "action": "COMMIT",
483
+ "reward": reward,
484
+ "cumulative_reward": self._state.cumulative_reward,
485
+ "grader_passed": grader_result.passed,
486
+ "grader_report": grader_result.report,
487
+ "timestamp": time.time(),
488
+ })
489
+
490
+ features = _build_features(self._state)
491
+ obs = MedusaObservation(
492
+ message=(
493
+ f"COMMIT: episode finalized. "
494
+ f"{'Grader: PASS ✓' if grader_result.passed else 'Grader: FAIL ✗'} "
495
+ f"Bonus: {grader_result.bonus_reward:+.1f} | "
496
+ f"Total reward: {self._state.cumulative_reward:.1f}"
497
+ ),
498
+ features=features,
499
+ metrics={
500
+ "grader_passed": grader_result.passed,
501
+ "grader_report": grader_result.report,
502
+ "silver_rows": self._state.silver_row_count,
503
+ "quarantine_rows": self._state.quarantine_row_count,
504
+ "governance_log_entries": len(self._tables.governance_log),
505
+ },
506
+ metadata={
507
+ "run_id": self._state.run_id,
508
+ "steps": self._state.step_idx,
509
+ "cumulative_reward": self._state.cumulative_reward,
510
+ },
511
+ reward=reward,
512
+ done=True,
513
+ )
514
+ return self._apply_transform(obs)
tasks.py ADDED
@@ -0,0 +1,286 @@
1
+ """MEDUSA Task Definitions.
2
+
3
+ Three formally graded tasks covering the easy → medium → hard spectrum.
4
+ Each task returns a deterministic score in [0.0, 1.0] after COMMIT.
5
+
6
+ Usage::
7
+
8
+ from envs.medusa_env.tasks import TASKS, score_episode
9
+
10
+ task = TASKS["clean_pipeline"] # easy
11
+ env = MedusaEnv(n_fact_rows=200, n_dim_rows=150)
12
+ obs = env.reset(seed=task.seed)
13
+
14
+ # ... agent takes actions ...
15
+ obs = env.step(MedusaAction(action=MedusaActionType.COMMIT))
16
+
17
+ result = score_episode(task.id, env.state, env._tables)
18
+ print(f"Score: {result.score:.2f} ({result.grade})")
19
+ """
20
+
21
+ from __future__ import annotations
22
+
23
+ from dataclasses import dataclass, field
24
+ from typing import TYPE_CHECKING, Dict, List, Optional
25
+
26
+ if TYPE_CHECKING:
27
+ from .medusa_env import _EpisodeTables
28
+ from .models import MedusaState
29
+
30
+
31
+ # ---------------------------------------------------------------------------
32
+ # Task definition
33
+ # ---------------------------------------------------------------------------
34
+
35
+ @dataclass
36
+ class Task:
37
+ """A MEDUSA task definition."""
38
+
39
+ id: str
40
+ name: str
41
+ difficulty: str # "easy" | "medium" | "hard"
42
+ seed: int # Controls ScenarioGenerator variant
43
+ description: str
44
+ success_criteria: List[str]
45
+ scoring_rubric: Dict[str, float]
46
+
47
+
48
+ # ---------------------------------------------------------------------------
49
+ # Scoring result
50
+ # ---------------------------------------------------------------------------
51
+
52
+ @dataclass
53
+ class TaskResult:
54
+ """Outcome of scoring a completed episode against a task."""
55
+
56
+ task_id: str
57
+ score: float # 0.0 – 1.0
58
+ grade: str # "S" | "A" | "B" | "C" | "F"
59
+ breakdown: Dict[str, float] # per-criterion scores
60
+ passed: bool
61
+ notes: List[str] = field(default_factory=list)
62
+
63
+
64
+ def _grade(score: float) -> str:
65
+ if score >= 0.90:
66
+ return "S"
67
+ if score >= 0.75:
68
+ return "A"
69
+ if score >= 0.55:
70
+ return "B"
71
+ if score >= 0.35:
72
+ return "C"
73
+ return "F"
74
+
75
+
76
+ # ---------------------------------------------------------------------------
77
+ # Task catalogue
78
+ # ---------------------------------------------------------------------------
79
+
80
+ TASKS: Dict[str, Task] = {
81
+
82
+ # ── EASY: Clean Pipeline ────────────────────────────────────────────────
83
+ "clean_pipeline": Task(
84
+ id="clean_pipeline",
85
+ name="Clean Pipeline",
86
+ difficulty="easy",
87
+ seed=0,
88
+ description=(
89
+ "Both sources are fresh. Join keys are clean and unique. "
90
+ "The agent must verify freshness, prepare keys, join, apply SCD, "
91
+ "and commit without triggering a row explosion."
92
+ ),
93
+ success_criteria=[
94
+ "COMMIT issued (episode finalized)",
95
+ "No Cartesian explosion detected",
96
+ "Silver row count ≤ Source A row count",
97
+ "match_rate > 0.80 after join",
98
+ ],
99
+ scoring_rubric={
100
+ "committed": 0.20, # Agent issued COMMIT
101
+ "no_explosion": 0.25, # No row explosion
102
+ "volume_ok": 0.20, # Silver ≀ Source A rows
103
+ "high_match": 0.20, # match_rate > 0.80
104
+ "grader_pass": 0.15, # All 4 grader checks pass
105
+ },
106
+ ),
107
+
108
+ # ── MEDIUM: Dirty Integration ───────────────────────────────────────────
109
+ "dirty_integration": Task(
110
+ id="dirty_integration",
111
+ name="Dirty Key Integration",
112
+ difficulty="medium",
113
+ seed=1,
114
+ description=(
115
+ "Source A has NULLs and whitespace in join keys. "
116
+ "Source B has duplicate keys that can cause row explosion. "
117
+ "The agent must PREP_KEYS and DEDUPLICATE before joining, "
118
+ "and correctly quarantine unresolvable orphans."
119
+ ),
120
+ success_criteria=[
121
+ "PREP_KEYS_A issued before EXECUTE_JOIN",
122
+ "PREP_KEYS_B issued before EXECUTE_JOIN",
123
+ "DEDUPLICATE_B issued before EXECUTE_JOIN",
124
+ "No row explosion",
125
+ "Quarantine integrity check passes",
126
+ ],
127
+ scoring_rubric={
128
+ "committed": 0.10,
129
+ "prepped_before_join": 0.20, # Both PREP_KEYS before join
130
+ "deduped_before_join": 0.20, # DEDUP before join
131
+ "no_explosion": 0.25,
132
+ "integrity_ok": 0.15, # Quarantine holds true orphans only
133
+ "grader_pass": 0.10,
134
+ },
135
+ ),
136
+
137
+ # ── HARD: Full Medallion Integration ────────────────────────────────────
138
+ "full_medallion": Task(
139
+ id="full_medallion",
140
+ name="Full Medallion Integration",
141
+ difficulty="hard",
142
+ seed=2,
143
+ description=(
144
+ "Source A is stale (>6h old). Source B has new schema columns "
145
+ "not registered in Silver. The agent must: check freshness, "
146
+ "evolve the schema, clean keys, deduplicate, execute a left join, "
147
+ "apply SCD-2 for tracked columns, and pass all grader checks."
148
+ ),
149
+ success_criteria=[
150
+ "SYNC_CHECK issued before any join",
151
+ "EVOLVE_SCHEMA issued before COMMIT",
152
+ "SCD-2 applied (not SCD-1) for tracked column",
153
+ "Silver schema contains new columns from drift",
154
+ "All 4 grader checks pass",
155
+ ],
156
+ scoring_rubric={
157
+ "committed": 0.05,
158
+ "sync_checked": 0.15, # SYNC_CHECK before join
159
+ "schema_evolved": 0.15, # EVOLVE_SCHEMA called
160
+ "used_scd2": 0.20, # Chose SCD-2 over SCD-1
161
+ "schema_ok": 0.20, # Silver has all required columns
162
+ "grader_pass": 0.25, # All 4 grader checks pass
163
+ },
164
+ ),
165
+ }
166
+
167
+
168
+ # ---------------------------------------------------------------------------
169
+ # Scoring engine
170
+ # ---------------------------------------------------------------------------
171
+
172
+ def score_episode(
173
+ task_id: str,
174
+ state: "MedusaState",
175
+ tables: "Optional[_EpisodeTables]" = None,
176
+ ) -> TaskResult:
177
+ """Score a completed MEDUSA episode against the named task.
178
+
179
+ Args:
180
+ task_id: One of "clean_pipeline", "dirty_integration", "full_medallion".
181
+ state: Final ``MedusaState`` after the episode ended.
182
+ tables: Episode tables (used for schema checks). Optional.
183
+
184
+ Returns:
185
+ TaskResult with score in [0.0, 1.0].
186
+ """
187
+ task = TASKS.get(task_id)
188
+ if task is None:
189
+ raise ValueError(f"Unknown task_id={task_id!r}. Valid: {list(TASKS)}")
190
+
191
+ if state.stage not in ("committed", "failed"):
192
+ return TaskResult(
193
+ task_id=task_id, score=0.0, grade="F",
194
+ breakdown={}, passed=False,
195
+ notes=["Episode not finished — COMMIT was never issued."],
196
+ )
197
+
198
+ breakdown: Dict[str, float] = {}
199
+ notes: List[str] = []
200
+ rubric = task.scoring_rubric
201
+ committed = state.stage == "committed"
202
+
203
+ # ── Shared criteria ──────────────────────────────────────────────────
204
+ if "committed" in rubric:
205
+ breakdown["committed"] = rubric["committed"] if committed else 0.0
206
+
207
+ if "no_explosion" in rubric:
208
+ ok = not state.explosion_detected
209
+ breakdown["no_explosion"] = rubric["no_explosion"] if ok else 0.0
210
+ if not ok:
211
+ notes.append("Row explosion was detected — heavy penalty applied.")
212
+
213
+ if "grader_pass" in rubric:
214
+ breakdown["grader_pass"] = rubric["grader_pass"] if state.grader_passed else 0.0
215
+
216
+ # ── Task-specific criteria ────────────────────────────────────────────
217
+
218
+ if task_id == "clean_pipeline":
219
+ volume_ok = (
220
+ state.silver_row_count <= state.source_a_row_count * 1.05
221
+ and state.silver_row_count > 0
222
+ )
223
+ breakdown["volume_ok"] = rubric["volume_ok"] if volume_ok else 0.0
224
+ breakdown["high_match"] = rubric["high_match"] if state.match_rate >= 0.80 else 0.0
225
+ if state.match_rate < 0.80:
226
+ notes.append(f"match_rate={state.match_rate:.1%} — target >80%.")
227
+
228
+ elif task_id == "dirty_integration":
229
+ # Both PREP_KEYS before join
230
+ prepped = state.did_prep_a and state.did_prep_b and state.did_join
231
+ breakdown["prepped_before_join"] = rubric["prepped_before_join"] if prepped else 0.0
232
+ # DEDUP before join
233
+ deduped = state.did_dedup_b and state.did_join
234
+ breakdown["deduped_before_join"] = rubric["deduped_before_join"] if deduped else 0.0
235
+ # Integrity check: grader_passed is the proxy (quarantine holds true orphans only)
236
+ breakdown["integrity_ok"] = rubric["integrity_ok"] if state.grader_passed else 0.0
241
+ if not prepped:
242
+ notes.append("Agent joined without prepping keys first.")
243
+ if not deduped:
244
+ notes.append("Agent joined without deduplicating Dimension.")
245
+
246
+ elif task_id == "full_medallion":
247
+ breakdown["sync_checked"] = rubric["sync_checked"] if state.did_sync_check else 0.0
248
+ breakdown["schema_evolved"] = rubric["schema_evolved"] if state.did_evolve_schema else 0.0
249
+ used_scd2 = state.scd_type == "SCD-2"
250
+ breakdown["used_scd2"] = rubric["used_scd2"] if used_scd2 else 0.0
251
+ breakdown["schema_ok"] = rubric["schema_ok"] if state.grader_passed else 0.0
252
+ if not state.did_sync_check:
253
+ notes.append("SYNC_CHECK was never called — stale source not verified.")
254
+ if not state.did_evolve_schema:
255
+ notes.append("EVOLVE_SCHEMA never called — new columns may be missing from Silver.")
256
+ if not used_scd2:
257
+ notes.append(f"Used SCD-1 instead of SCD-2 (scd_type={state.scd_type!r}).")
258
+
259
+ # ── Final score ───────────────────────────────────────────────────────
260
+ total = sum(breakdown.values())
261
+ # Clip to [0, 1] (row explosion can make total negative from reward engine)
262
+ score = max(0.0, min(1.0, total))
263
+ passed = score >= 0.55
264
+
265
+ return TaskResult(
266
+ task_id=task_id,
267
+ score=round(score, 4),
268
+ grade=_grade(score),
269
+ breakdown=breakdown,
270
+ passed=passed,
271
+ notes=notes,
272
+ )
273
+
274
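A minimal sketch of the final-score aggregation used in score_episode: sum the weighted breakdown, clip to [0, 1], pass at score >= 0.55. The helper name `finalize` is illustrative, not part of the package.

```python
def finalize(breakdown: dict) -> tuple:
    # Sum weighted components, clip to [0, 1] (reward-engine penalties can
    # otherwise push the total negative), pass at the 0.55 threshold.
    score = max(0.0, min(1.0, sum(breakdown.values())))
    return round(score, 4), score >= 0.55

print(finalize({"committed": 0.10, "no_explosion": 0.25, "grader_pass": 0.10}))
# (0.45, False) -- below the 0.55 pass threshold
```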
+
275
+ # ---------------------------------------------------------------------------
276
+ # Convenience: score all tasks
277
+ # ---------------------------------------------------------------------------
278
+
279
+ def score_all_tasks(
280
+ results: Dict[str, tuple],  # task_id → (state, tables)
281
+ ) -> Dict[str, TaskResult]:
282
+ """Score multiple completed episodes, one per task."""
283
+ return {
284
+ task_id: score_episode(task_id, state, tables)
285
+ for task_id, (state, tables) in results.items()
286
+ }
tests/test_medusa_environment.py ADDED
@@ -0,0 +1,591 @@
1
+ """Tests for the MEDUSA environment.
2
+
3
+ Covers: models, scenario generator, operators, reward engine, grader,
4
+ and full end-to-end environment episodes.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ import pytest
10
+
11
+ # ---------------------------------------------------------------------------
12
+ # Models
13
+ # ---------------------------------------------------------------------------
14
+
15
+ from medusa_env.models import (
16
+ MedusaAction,
17
+ MedusaActionType,
18
+ MedusaObservation,
19
+ MedusaState,
20
+ )
21
+
22
+
23
+ class TestMedusaModels:
24
+ def test_action_creation(self):
25
+ a = MedusaAction(action=MedusaActionType.SYNC_CHECK)
26
+ assert a.action == MedusaActionType.SYNC_CHECK
27
+ assert a.params == {}
28
+
29
+ def test_state_defaults(self):
30
+ s = MedusaState()
31
+ assert s.stage == "init"
32
+ assert s.step_idx == 0
33
+ assert s.did_sync_check is False
34
+ assert s.explosion_detected is False
35
+ assert s.grader_passed is False
36
+
37
+ def test_observation_defaults(self):
38
+ obs = MedusaObservation()
39
+ assert obs.done is False
40
+ assert obs.reward is None
41
+ assert obs.features == []
42
+
43
+
44
+ # ---------------------------------------------------------------------------
45
+ # Scenario Generator
46
+ # ---------------------------------------------------------------------------
47
+
48
+ import pandas as pd
49
+
50
+ from medusa_env.scenarios import Scenario, ScenarioGenerator
51
+
52
+
53
+ class TestMedusaScenarios:
54
+ @pytest.fixture
55
+ def gen(self):
56
+ return ScenarioGenerator(n_fact_rows=50, n_dim_rows=40)
57
+
58
+ def test_canonical_clean(self, gen):
59
+ scen = gen.generate(seed=0)
60
+ assert scen.id.startswith("clean")
61
+ assert isinstance(scen.bronze_a, pd.DataFrame)
62
+ assert len(scen.bronze_a) == 50
63
+ assert not scen.is_stale_a
64
+ assert not scen.is_stale_b
65
+ assert scen.new_cols_a == []
66
+
67
+ def test_canonical_dirty_keys(self, gen):
68
+ scen = gen.generate(seed=1)
69
+ assert "dirty_keys" in scen.id
70
+ # Dirty scenario should have actual null or whitespace keys
71
+ has_issues = (
72
+ scen.bronze_a[scen.join_key].isna().any()
73
+ or scen.bronze_a[scen.join_key].astype(str).str.contains(r"^\s|\s$").any()
74
+ )
75
+ assert has_issues
76
+
77
+ def test_canonical_stale(self, gen):
78
+ scen = gen.generate(seed=2)
79
+ assert "stale" in scen.id
80
+ assert scen.is_stale_a # Source A should be stale
81
+
82
+ def test_canonical_schema_drift(self, gen):
83
+ scen = gen.generate(seed=3)
84
+ assert "schema_drift" in scen.id
85
+ assert len(scen.new_cols_a) > 0
86
+ assert len(scen.new_cols_b) > 0
87
+
88
+ def test_random_seed_produces_scenario(self, gen):
89
+ scen = gen.generate(seed=999)
90
+ assert isinstance(scen, Scenario)
91
+ assert scen.join_key in scen.bronze_a.columns
92
+ assert scen.join_key in scen.bronze_b.columns
93
+
94
+
95
+ # ---------------------------------------------------------------------------
96
+ # Operators
97
+ # ---------------------------------------------------------------------------
98
+
99
+ from medusa_env.operators import (
100
+ apply_scd,
101
+ deduplicate,
102
+ evolve_schema,
103
+ execute_join,
104
+ prep_keys,
105
+ sync_check,
106
+ )
107
+
108
+
109
+ class TestMedusaOperators:
110
+ def test_sync_check_fresh(self):
111
+ a = pd.DataFrame({"id": [1, 2]})
112
+ b = pd.DataFrame({"id": [1, 2]})
113
+ _, m = sync_check(a, b, time_delta_a=1.0, time_delta_b=2.0)
114
+ assert m["is_stale_a"] is False
115
+ assert m["is_stale_b"] is False
116
+
117
+ def test_sync_check_stale(self):
118
+ a = pd.DataFrame({"id": [1]})
119
+ b = pd.DataFrame({"id": [1]})
120
+ _, m = sync_check(a, b, time_delta_a=10.0, time_delta_b=1.0)
121
+ assert m["is_stale_a"] is True
122
+ assert m["is_stale_b"] is False
123
+
124
+ def test_prep_keys_strips_whitespace(self):
125
+ df = pd.DataFrame({"key": [" K001 ", "K002", None]})
126
+ result, m = prep_keys(df, "key")
127
+ # Stripped key should have no leading/trailing spaces
128
+ non_null = result["key"].dropna().tolist()
129
+ assert all(v.strip() == v for v in non_null)
130
+ assert m["null_ratio_before"] > 0
131
+
132
+ def test_deduplicate_removes_dupes(self):
133
+ df = pd.DataFrame({"key": ["A", "A", "B"], "val": [1, 2, 3]})
134
+ result, m = deduplicate(df, "key")
135
+ assert m["dupes_removed"] == 1
136
+ assert len(result) == 2
137
+
138
+ def test_execute_join_left_basic(self):
139
+ fact = pd.DataFrame({"key": ["K001", "K002", "K003"], "val": [1, 2, 3]})
140
+ dim = pd.DataFrame({"key": ["K001", "K002"], "dim_name": ["A", "B"]})
141
+ joined, quarantine, m = execute_join(fact, dim, "key", "left")
142
+ assert m["join_rows"] == 3 # left join keeps all fact rows
143
+ assert m["match_rate"] == pytest.approx(2 / 3, abs=0.01)
144
+ assert len(quarantine) >= 1 # K003 should be quarantined
145
+
146
+ def test_execute_join_detects_explosion(self):
147
+ # Non-unique dim key → Cartesian explosion
148
+ fact = pd.DataFrame({"key": ["K001"] * 10, "val": list(range(10))})
149
+ dim = pd.DataFrame({"key": ["K001"] * 20, "dim_name": ["X"] * 20})
150
+ joined, quarantine, m = execute_join(fact, dim, "key", "inner")
151
+ assert m["explosion_detected"] is True
152
+
153
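The explosion case above relies only on standard pandas join semantics; a standalone sketch (no medusa_env imports) of why duplicated keys on both sides multiply rows:

```python
import pandas as pd

# 10 fact rows and 20 dimension rows sharing one key: an inner join
# yields the full cross product, 10 * 20 = 200 rows.
fact = pd.DataFrame({"key": ["K001"] * 10, "val": list(range(10))})
dim = pd.DataFrame({"key": ["K001"] * 20, "dim_name": ["X"] * 20})
joined = fact.merge(dim, on="key", how="inner")
print(len(joined))  # 200
```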
+ def test_execute_join_anti(self):
154
+ fact = pd.DataFrame({"key": ["K001", "K002", "K999"], "val": [1, 2, 3]})
155
+ dim = pd.DataFrame({"key": ["K001", "K002"], "name": ["A", "B"]})
156
+ joined, quarantine, m = execute_join(fact, dim, "key", "anti")
157
+ assert len(joined) == 0 # Anti-join: no rows in joined
158
+ assert len(quarantine) == 1 # K999 goes to quarantine
159
+
160
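The anti-join behaviour tested above can be reproduced with plain pandas via the merge indicator; this sketch is illustrative and independent of the operators module:

```python
import pandas as pd

fact = pd.DataFrame({"key": ["K001", "K002", "K999"], "val": [1, 2, 3]})
dim = pd.DataFrame({"key": ["K001", "K002"], "name": ["A", "B"]})

# indicator=True adds a "_merge" column; "left_only" rows are the orphans
# an anti-join would quarantine.
merged = fact.merge(dim, on="key", how="left", indicator=True)
orphans = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(orphans["key"].tolist())  # ['K999']
```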
+ def test_apply_scd1_upsert(self):
161
+ silver = pd.DataFrame({"key": ["K001"], "val": [10], "status": ["old"]})
162
+ joined = pd.DataFrame({"key": ["K001", "K002"], "val": [99, 20], "status": ["new", "new"]})
163
+ result, m = apply_scd(silver, joined, "key", "status", scd_type=1)
164
+ assert m["scd_type"] == 1
165
+ assert m["inserts"] + m["updates"] > 0
166
+ # K001 should be updated to val=99
167
+ k1_row = result[result["key"] == "K001"]
168
+ assert not k1_row.empty
169
+
170
+ def test_apply_scd2_adds_history(self):
171
+ silver = pd.DataFrame()
172
+ joined = pd.DataFrame({"key": ["K001"], "status": ["active"]})
173
+ result, m = apply_scd(silver, joined, "key", "status", scd_type=2)
174
+ assert "valid_from" in result.columns
175
+ assert m["inserts"] == 1
176
+
177
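A hypothetical SCD-2 insert mirroring the columns the test above expects; `valid_from`/`valid_to` handling here is an assumption for illustration, not the package's apply_scd implementation:

```python
import pandas as pd

def scd2_insert(silver, incoming, now="2024-01-01"):
    # SCD-2 keeps history: new versions are appended with validity bounds
    # rather than overwriting the prior row (which would be SCD-1).
    incoming = incoming.copy()
    incoming["valid_from"] = now
    incoming["valid_to"] = pd.NaT  # open-ended: this is the current version
    if silver is None or silver.empty:
        return incoming
    return pd.concat([silver, incoming], ignore_index=True)

out = scd2_insert(pd.DataFrame(), pd.DataFrame({"key": ["K001"], "status": ["active"]}))
print(len(out), "valid_from" in out.columns)  # 1 True
```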
+ def test_evolve_schema_adds_columns(self):
178
+ silver = pd.DataFrame({"key": ["K001"], "val": [1]})
179
+ a = pd.DataFrame({"key": ["K001"], "new_metric": [42]})
180
+ b = pd.DataFrame({"key": ["K001"]})
181
+ result, m = evolve_schema(silver, a, b, ["new_metric"], [])
182
+ assert "new_metric" in result.columns
183
+ assert m["new_cols_count"] == 1
184
+
185
+
186
+ # ---------------------------------------------------------------------------
187
+ # Reward Engine
188
+ # ---------------------------------------------------------------------------
189
+
190
+ from medusa_env.rewards import RewardEngine
191
+
192
+
193
+ class TestMedusaRewards:
194
+ @pytest.fixture
195
+ def engine(self):
196
+ return RewardEngine()
197
+
198
+ def _clean_state(self):
199
+ s = MedusaState()
200
+ s.did_prep_a = True
201
+ s.did_prep_b = True
202
+ s.did_sync_check = True
203
+ return s
204
+
205
+ def test_step_penalty_always_applied(self, engine):
206
+ r = engine.evaluate("SYNC_CHECK", {}, MedusaState())
207
+ assert r == pytest.approx(-0.2, abs=0.01)
208
+
209
+ def test_high_match_join_reward(self, engine):
210
+ r = engine.evaluate(
211
+ "EXECUTE_JOIN_LEFT",
212
+ {"match_rate": 0.95, "join_rows": 100, "fact_rows": 100,
213
+ "explosion_detected": False, "quarantine_rows": 5},
214
+ self._clean_state(),
215
+ )
216
+ assert r > 0.0 # +25 - 0.2 + 10 (quarantine) = +34.8
217
+
218
+ def test_row_explosion_heavy_penalty(self, engine):
219
+ r = engine.evaluate(
220
+ "EXECUTE_JOIN_INNER",
221
+ {"explosion_detected": True, "join_rows": 1000, "fact_rows": 100,
222
+ "match_rate": 1.0, "quarantine_rows": 0},
223
+ self._clean_state(),
224
+ )
225
+ assert r < -50.0
226
+
227
+ def test_dirty_join_penalty(self, engine):
228
+ # No PREP_KEYS β†’ dirty join penalty
229
+ state = MedusaState()
230
+ state.did_prep_a = False
231
+ state.did_prep_b = False
232
+ r = engine.evaluate(
233
+ "EXECUTE_JOIN_LEFT",
234
+ {"explosion_detected": False, "join_rows": 0, "fact_rows": 50,
235
+ "match_rate": 0.0, "quarantine_rows": 0},
236
+ state,
237
+ )
238
+ assert r < -20.0
239
+
240
+ def test_scd2_extra_reward(self, engine):
241
+ r = engine.evaluate("APPLY_SCD_2", {}, self._clean_state())
242
+ # +5 for SCD-2 - 0.2 step penalty
243
+ assert r == pytest.approx(4.8, abs=0.01)
244
+
245
+ def test_stale_processing_penalty(self, engine):
246
+ state = MedusaState()
247
+ state.is_stale_a = True
248
+ state.did_sync_check = False # Never checked freshness
249
+ state.did_prep_a = True
250
+ state.did_prep_b = True
251
+ r = engine.evaluate(
252
+ "EXECUTE_JOIN_LEFT",
253
+ {"explosion_detected": False, "join_rows": 100, "fact_rows": 100,
254
+ "match_rate": 0.95, "quarantine_rows": 0},
255
+ state,
256
+ )
257
+ # Should include stale penalty on top of positive join reward
258
+ assert r < 25.0 # Stale penalty reduces it
259
+
260
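The additive reward composition implied by the tests above; the component values (step penalty -0.2, SCD-2 bonus +5) are read off the expected test values, not taken from RewardEngine internals:

```python
# Assumed shaping constants, inferred from the test expectations above.
STEP_PENALTY = -0.2   # applied to every action
SCD2_BONUS = 5.0      # extra reward for choosing SCD-2

# An APPLY_SCD_2 step with no other events composes additively.
reward = SCD2_BONUS + STEP_PENALTY
print(round(reward, 2))  # 4.8
```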
+
261
+ # ---------------------------------------------------------------------------
262
+ # Grader
263
+ # ---------------------------------------------------------------------------
264
+
265
+ from medusa_env.grader import Grader
266
+ from medusa_env.scenarios import Scenario
267
+
268
+
269
+ class TestMedusaGrader:
270
+ @pytest.fixture
271
+ def grader(self):
272
+ return Grader()
273
+
274
+ def _make_scenario(self):
275
+ a = pd.DataFrame({"entity_id": ["K1", "K2", "K3"], "val": [1, 2, 3],
276
+ "fact_category": ["A", "B", "C"],
277
+ "fact_value": [1.0, 2.0, 3.0],
278
+ "created_at": pd.date_range("2024-01-01", periods=3, freq="h")})
279
+ b = pd.DataFrame({"entity_id": ["K1", "K2"], "dim_name": ["N1", "N2"], "dim_status": ["x", "y"]})
280
+ return a, b
281
+
282
+ def test_volume_check_pass(self, grader):
283
+ a, b = self._make_scenario()
284
+ silver = pd.DataFrame({"entity_id": ["K1", "K2"], "val": [1, 2]})
285
+ scen = ScenarioGenerator(n_fact_rows=3, n_dim_rows=2).generate(seed=0)
286
+ r = grader.audit(silver, pd.DataFrame(), a, b, "entity_id", "left", 1, scen)
287
+ assert r.volume_ok is True
288
+
289
+ def test_volume_check_fail(self, grader):
290
+ a, b = self._make_scenario()
291
+ # Silver has way more rows than source A → violation
292
+ silver = pd.DataFrame({"entity_id": ["K1"] * 100})
293
+ scen = ScenarioGenerator(n_fact_rows=3, n_dim_rows=2).generate(seed=0)
294
+ r = grader.audit(silver, pd.DataFrame(), a, b, "entity_id", "left", 1, scen)
295
+ assert r.volume_ok is False
296
+
297
+ def test_integrity_check_quarantine_true_orphans(self, grader):
298
+ a, b = self._make_scenario()
299
+ # K3 is not in B → true orphan
300
+ quarantine = pd.DataFrame({"entity_id": ["K3"]})
301
+ scen = ScenarioGenerator(n_fact_rows=3, n_dim_rows=2).generate(seed=0)
302
+ silver = pd.DataFrame({"entity_id": ["K1", "K2"]})
303
+ r = grader.audit(silver, quarantine, a, b, "entity_id", "left", 1, scen)
304
+ assert r.integrity_ok is True
305
+
306
+ def test_integrity_check_fail_dirty_quarantine(self, grader):
307
+ a, b = self._make_scenario()
308
+ # K1 IS in B but ends up in quarantine (agent failed to clean it)
309
+ quarantine = pd.DataFrame({"entity_id": ["K1"]})
310
+ scen = ScenarioGenerator(n_fact_rows=3, n_dim_rows=2).generate(seed=0)
311
+ silver = pd.DataFrame({"entity_id": ["K2"]})
312
+ r = grader.audit(silver, quarantine, a, b, "entity_id", "left", 1, scen)
313
+ assert r.integrity_ok is False
314
+
315
+ def test_all_pass_gives_bonus(self, grader):
316
+ gen = ScenarioGenerator(n_fact_rows=3, n_dim_rows=2)
317
+ scen = gen.generate(seed=0)
318
+ a, b = scen.bronze_a, scen.bronze_b
319
+ # Simulate a perfect run
320
+ silver = a.merge(b, on="entity_id", how="left")
321
+ r = grader.audit(silver, pd.DataFrame(), a, b, "entity_id", "left", 1, scen)
322
+ assert r.bonus_reward > 0
323
+
324
+
325
+ # ---------------------------------------------------------------------------
326
+ # Full environment integration
327
+ # ---------------------------------------------------------------------------
328
+
329
+ from medusa_env.server import MedusaEnv
330
+ from medusa_env.models import MedusaActionType
331
+
332
+
333
+ class TestMedusaEnvironment:
334
+ @pytest.fixture
335
+ def env(self):
336
+ return MedusaEnv(n_fact_rows=50, n_dim_rows=40)
337
+
338
+ def test_reset_returns_observation(self, env):
339
+ obs = env.reset(seed=0)
340
+ assert isinstance(obs, MedusaObservation)
341
+ assert obs.done is False
342
+ assert len(obs.features) == 16
343
+ assert obs.reward is None
344
+
345
+ def test_state_after_reset(self, env):
346
+ env.reset(seed=0)
347
+ state = env.state
348
+ assert state.stage == "running"
349
+ assert state.step_idx == 0
350
+ assert state.source_a_row_count == 50
351
+
352
+ def test_happy_path_episode(self, env):
353
+ """Full pipeline: sync → evolve → prep both → dedup → join → scd → commit."""
354
+ env.reset(seed=0) # clean scenario
355
+
356
+ actions = [
357
+ MedusaActionType.SYNC_CHECK,
358
+ MedusaActionType.EVOLVE_SCHEMA,
359
+ MedusaActionType.PREP_KEYS_A,
360
+ MedusaActionType.PREP_KEYS_B,
361
+ MedusaActionType.DEDUPLICATE_B,
362
+ MedusaActionType.EXECUTE_JOIN_LEFT,
363
+ MedusaActionType.APPLY_SCD_2,
364
+ MedusaActionType.COMMIT,
365
+ ]
366
+ obs = None
367
+ for act_type in actions:
368
+ obs = env.step(MedusaAction(action=act_type))
369
+
370
+ assert obs is not None
371
+ assert obs.done is True
372
+ assert env.state.stage == "committed"
373
+ assert env.state.grader_passed # Clean scenario should pass grader
374
+
375
+ def test_row_explosion_gives_heavy_penalty(self, env):
376
+ """Joining on non-unique B keys should trigger explosion penalty."""
377
+ env.reset(seed=1)  # dirty_keys scenario — B has duplicate keys
378
+
379
+ # Skip prep & dedup — go straight to join
380
+ env.step(MedusaAction(action=MedusaActionType.SYNC_CHECK))
381
+
382
+ # Force the dimension to have many duplicates so explosion fires
383
+ import pandas as _pd
384
+
385
+ env._tables.bronze_b_prepped = _pd.DataFrame({
386
+ "entity_id": ["K001"] * 30,
387
+ "dim_name": ["X"] * 30,
388
+ "dim_status": ["x"] * 30,
389
+ })
390
+ env._tables.bronze_a_prepped = _pd.DataFrame({
391
+ "entity_id": ["K001"] * 10,
392
+ "fact_value": list(range(10)),
393
+ "fact_category": ["A"] * 10,
394
+ "created_at": _pd.date_range("2024-01-01", periods=10, freq="h"),
395
+ })
396
+
397
+ obs = env.step(MedusaAction(action=MedusaActionType.EXECUTE_JOIN_INNER))
398
+ assert obs.reward is not None
399
+ assert obs.reward < -50.0
400
+ assert env.state.explosion_detected is True
401
+
402
+ def test_dirty_join_penalty(self, env):
403
+ """Skipping PREP_KEYS and joining on null-heavy keys → dirty join."""
404
+ env.reset(seed=1) # dirty_keys scenario
405
+
406
+ # Skip PREP — join directly
407
+ obs = env.step(MedusaAction(action=MedusaActionType.EXECUTE_JOIN_LEFT))
408
+ # If all fact keys are null/non-matching → 0-row join → dirty join penalty
409
+ # (reward < base -0.2 if dirty join fired)
410
+ assert obs.reward is not None
411
+
412
+ def test_step_idx_increments(self, env):
413
+ env.reset(seed=0)
414
+ for _ in range(3):
415
+ env.step(MedusaAction(action=MedusaActionType.SYNC_CHECK))
416
+ assert env.state.step_idx == 3
417
+
418
+ def test_max_steps_terminates_episode(self):
419
+ env = MedusaEnv(n_fact_rows=10, n_dim_rows=10, max_steps=3)
420
+ env.reset(seed=0)
421
+ obs = None
422
+ for _ in range(4): # more than max_steps
423
+ obs = env.step(MedusaAction(action=MedusaActionType.SYNC_CHECK))
424
+ assert obs is not None
425
+ assert obs.done is True
426
+
427
+ def test_commit_without_join_grader_fails(self, env):
428
+ """Committing without joining should make the grader fail."""
429
+ env.reset(seed=0)
430
+ env.step(MedusaAction(action=MedusaActionType.SYNC_CHECK))
431
+ obs = env.step(MedusaAction(action=MedusaActionType.COMMIT))
432
+ assert obs.done is True
433
+ # Silver will be empty → schema check should fail or volume check fail
434
+ assert env.state.grader_report != ""
435
+
436
+ def test_features_vector_length(self, env):
437
+ env.reset(seed=0)
438
+ obs = env.step(MedusaAction(action=MedusaActionType.SYNC_CHECK))
439
+ assert len(obs.features) == 16
440
+ assert all(0.0 <= f <= 1.0 for f in obs.features)
441
+
442
+ def test_governance_log_populated(self, env):
443
+ env.reset(seed=0)
444
+ env.step(MedusaAction(action=MedusaActionType.SYNC_CHECK))
445
+ env.step(MedusaAction(action=MedusaActionType.PREP_KEYS_A))
446
+ log = env._tables.governance_log
447
+ assert len(log) == 2
448
+ assert log[0]["action"] == "SYNC_CHECK"
449
+
450
+
451
+ # ---------------------------------------------------------------------------
452
+ # Task Scorer
453
+ # ---------------------------------------------------------------------------
454
+
455
+ from medusa_env.tasks import TASKS, score_episode
456
+
457
+
458
+ class TestMedusaTasks:
459
+ """Tests for the 3 formal task definitions and 0.0–1.0 scorer."""
460
+
461
+ def test_three_tasks_defined(self):
462
+ assert "clean_pipeline" in TASKS
463
+ assert "dirty_integration" in TASKS
464
+ assert "full_medallion" in TASKS
465
+
466
+ def test_task_difficulties(self):
467
+ assert TASKS["clean_pipeline"].difficulty == "easy"
468
+ assert TASKS["dirty_integration"].difficulty == "medium"
469
+ assert TASKS["full_medallion"].difficulty == "hard"
470
+
471
+ def test_task_seeds_match_scenarios(self):
472
+ assert TASKS["clean_pipeline"].seed == 0
473
+ assert TASKS["dirty_integration"].seed == 1
474
+ assert TASKS["full_medallion"].seed == 2
475
+
476
+ def _run_happy_path(self, seed: int) -> MedusaState:
477
+ """Run the optimal action sequence for the given seed and return final state."""
478
+ env = MedusaEnv(n_fact_rows=50, n_dim_rows=40)
479
+ env.reset(seed=seed)
480
+ for act in [
481
+ MedusaActionType.SYNC_CHECK,
482
+ MedusaActionType.EVOLVE_SCHEMA,
483
+ MedusaActionType.PREP_KEYS_A,
484
+ MedusaActionType.PREP_KEYS_B,
485
+ MedusaActionType.DEDUPLICATE_B,
486
+ MedusaActionType.EXECUTE_JOIN_LEFT,
487
+ MedusaActionType.APPLY_SCD_2,
488
+ MedusaActionType.COMMIT,
489
+ ]:
490
+ env.step(MedusaAction(action=act))
491
+ return env.state
492
+
493
+ # ── clean_pipeline (easy) ───────────────────────────────────────────────
494
+
495
+ def test_clean_pipeline_score_is_in_range(self):
496
+ state = self._run_happy_path(seed=0)
497
+ result = score_episode("clean_pipeline", state)
498
+ assert 0.0 <= result.score <= 1.0
499
+
500
+ def test_clean_pipeline_happy_path_passes(self):
501
+ state = self._run_happy_path(seed=0)
502
+ result = score_episode("clean_pipeline", state)
503
+ assert result.passed is True
504
+ assert result.grade in ("S", "A", "B")
505
+
506
+ def test_clean_pipeline_uncommitted_scores_zero(self):
507
+ state = MedusaState(stage="running")
508
+ result = score_episode("clean_pipeline", state)
509
+ assert result.score == 0.0
510
+ assert result.grade == "F"
511
+
512
+ def test_clean_pipeline_explosion_detected_lowers_score(self):
513
+ state = MedusaState(
514
+ stage="committed",
515
+ explosion_detected=True,
516
+ silver_row_count=0,
517
+ source_a_row_count=50,
518
+ match_rate=0.0,
519
+ grader_passed=False,
520
+ )
521
+ result = score_episode("clean_pipeline", state)
522
+ assert result.breakdown["no_explosion"] == 0.0
523
+
524
+ # ── dirty_integration (medium) ──────────────────────────────────────────
525
+
526
+ def test_dirty_integration_score_is_in_range(self):
527
+ state = self._run_happy_path(seed=1)
528
+ result = score_episode("dirty_integration", state)
529
+ assert 0.0 <= result.score <= 1.0
530
+
531
+ def test_dirty_integration_without_prep_penalized(self):
532
+ state = MedusaState(
533
+ stage="committed",
534
+ did_prep_a=False,
535
+ did_prep_b=False,
536
+ did_dedup_b=False,
537
+ did_join=True,
538
+ explosion_detected=False,
539
+ grader_passed=False,
540
+ )
541
+ result = score_episode("dirty_integration", state)
542
+ assert result.breakdown["prepped_before_join"] == 0.0
543
+ assert result.breakdown["deduped_before_join"] == 0.0
544
+
545
+ def test_dirty_integration_with_all_prereqs_scores_higher(self):
546
+ state_no_prep = MedusaState(
547
+ stage="committed", did_prep_a=False, did_prep_b=False,
548
+ did_dedup_b=False, did_join=True, explosion_detected=False, grader_passed=False,
549
+ )
550
+ state_prepped = MedusaState(
551
+ stage="committed", did_prep_a=True, did_prep_b=True,
552
+ did_dedup_b=True, did_join=True, explosion_detected=False, grader_passed=True,
553
+ )
554
+ no_prep = score_episode("dirty_integration", state_no_prep)
555
+ prepped = score_episode("dirty_integration", state_prepped)
556
+ assert prepped.score > no_prep.score
557
+
558
+ # ── full_medallion (hard) ───────────────────────────────────────────────
559
+
560
+ def test_full_medallion_score_is_in_range(self):
561
+ state = self._run_happy_path(seed=2)
562
+ result = score_episode("full_medallion", state)
563
+ assert 0.0 <= result.score <= 1.0
564
+
565
+ def test_full_medallion_without_sync_penalized(self):
566
+ state = MedusaState(
567
+ stage="committed",
568
+ did_sync_check=False,
569
+ did_evolve_schema=True,
570
+ scd_type="SCD-2",
571
+ grader_passed=True,
572
+ )
573
+ result = score_episode("full_medallion", state)
574
+ assert result.breakdown["sync_checked"] == 0.0
575
+
576
+ def test_full_medallion_scd1_penalized(self):
577
+ state_scd1 = MedusaState(
578
+ stage="committed", did_sync_check=True,
579
+ did_evolve_schema=True, scd_type="SCD-1", grader_passed=False,
580
+ )
581
+ state_scd2 = MedusaState(
582
+ stage="committed", did_sync_check=True,
583
+ did_evolve_schema=True, scd_type="SCD-2", grader_passed=True,
584
+ )
585
+ r1 = score_episode("full_medallion", state_scd1)
586
+ r2 = score_episode("full_medallion", state_scd2)
587
+ assert r2.score > r1.score
588
+
589
+ def test_unknown_task_raises(self):
590
+ with pytest.raises(ValueError, match="Unknown task_id"):
591
+ score_episode("nonexistent_task", MedusaState(stage="committed"))
uv.lock ADDED
The diff for this file is too large to render. See raw diff