kushalExplores committed on
Commit e415506 · verified · 1 Parent(s): a5fe7ab

Upload folder using huggingface_hub

.dockerignore ADDED
@@ -0,0 +1,14 @@
+ .git
+ .pytest_cache
+ .ruff_cache
+ .mypy_cache
+ .DS_Store
+ .env
+ .env.*
+ .venv
+ __pycache__
+ *.pyc
+ *.pyo
+ *.pyd
+ tests
+ openenv_metric_tracker_rl.egg-info
.gitignore ADDED
@@ -0,0 +1,25 @@
+ # Secrets / local env
+ .env
+ .env.*
+
+ # Virtual environments
+ .venv/
+ venv/
+
+ # Python cache
+ __pycache__/
+ *.py[cod]
+
+ # Test / tool cache
+ .pytest_cache/
+ .mypy_cache/
+ .ruff_cache/
+ .coverage
+
+ # Build / packaging artifacts
+ build/
+ dist/
+ *.egg-info/
+
+ # OS files
+ .DS_Store
Dockerfile ADDED
@@ -0,0 +1,31 @@
+ FROM python:3.11-slim
+
+ ENV PYTHONDONTWRITEBYTECODE=1 \
+     PYTHONUNBUFFERED=1 \
+     PIP_NO_CACHE_DIR=1 \
+     UV_PROJECT_ENVIRONMENT=/opt/venv \
+     PATH="/opt/venv/bin:/root/.local/bin:${PATH}" \
+     PYTHONPATH=/app \
+     PORT=8000 \
+     ENABLE_WEB_INTERFACE=true
+
+ WORKDIR /app
+
+ RUN apt-get update && \
+     apt-get install -y --no-install-recommends curl git && \
+     rm -rf /var/lib/apt/lists/*
+
+ RUN pip install --no-cache-dir uv
+
+ COPY pyproject.toml uv.lock README.md openenv.yaml /app/
+ COPY __init__.py analysis_tools.py client.py evaluation.py inference.py models.py payload_generation.py tasks.py /app/
+ COPY server /app/server
+
+ RUN uv sync --frozen --no-dev
+
+ EXPOSE 8000
+
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=20s --retries=3 \
+     CMD curl -fsS "http://127.0.0.1:${PORT}/health" || exit 1
+
+ CMD ["python", "-m", "uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
README.md CHANGED
@@ -1,10 +1,228 @@
  ---
- title: Metric Tracker Rl
- emoji: 😻
- colorFrom: indigo
- colorTo: blue
+ title: Metric Tracker RL
+ emoji: 📈
+ colorFrom: blue
+ colorTo: green
  sdk: docker
+ app_port: 8000
  pinned: false
+ tags:
+ - openenv
+ - reinforcement-learning
+ - analytics
+ - anomaly-detection
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Metric Tracker RL
+
+ `metric_tracker_rl` is an OpenEnv benchmark for investigating synthetic product-funnel metrics and submitting a structured anomaly report. It is designed to run as a containerized Hugging Face Space and exposes the same environment through both an OpenEnv-compatible HTTP API and a Gradio debugger.
+
+ ## Environment Description And Motivation
+
+ This environment models a common analytics workflow: a team notices a KPI shift, inspects daily and hourly aggregates, compares observed values to historical baselines, and decides which anomalies are real enough to report. The benchmark focuses on disciplined investigation rather than raw generation. Agents must use safe analysis tools, avoid over-submitting, and produce a precise anomaly payload that matches hidden seeded ground truth.
+
+ The motivation for the benchmark is to test whether an agent can:
+
+ - navigate a realistic tabular analytics task without direct oracle access
+ - combine count-based, rate-based, funnel, and hourly reasoning
+ - preserve precision when multiple anomaly families may be present
+ - translate evidence into a stable machine-graded submission format
+
+ Each reset creates a deterministic four-week synthetic dataset with daily and hourly funnel aggregates. Hidden anomaly labels are derived from the reset configuration, so tasks are reproducible and programmatically graded.
+
+ ## Action Space
+
+ The environment accepts `MetricTrackerRlAction` with four fields:
+
+ - `classifications`: final anomaly rows to grade
+ - `analysis_method`: optional safe method name to call instead of grading
+ - `analysis_args`: arguments for the selected analysis method
+ - `payload_generators`: optional declarative generator methods that create submission rows inside the environment
+
+ Each `classifications` row must include the following fields; an example follows the list.
+
+ - `date`: ISO date in `YYYY-MM-DD`
+ - `entity_type`: one of the stable families such as `conversion_rate`, `event_count`, `funnel_step`, `hourly_mix`, or `data_quality`
+ - `entity_name`: stable metric or entity identifier
+ - `anomaly_type`: anomaly family identifier
+ - `detection_method`: analysis method used to justify the row
+ - `baseline_value`: historical reference value
+ - `observed_value`: measured anomalous value
+ - `delta_value`: `observed_value - baseline_value`
+ - `severity`: one of `low`, `medium`, `high`, or `critical`
+
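+ As a purely illustrative sketch (the values are hypothetical, and it assumes the other `MetricTrackerRlAction` fields default to empty), a grading action with one row looks like this:
+
+ ```python
+ from metric_tracker_rl import MetricSubmissionRow, MetricTrackerRlAction
+
+ # Hypothetical example row; real values come out of the analysis methods.
+ row = MetricSubmissionRow(
+     date="2024-01-15",
+     entity_type="event_count",
+     entity_name="app_opens",
+     anomaly_type="absolute_spike_in_event_count",
+     detection_method="compare_count_to_median",
+     baseline_value=10000.0,
+     observed_value=14200.0,
+     delta_value=4200.0,
+     severity="high",
+ )
+ action = MetricTrackerRlAction(classifications=[row])
+ ```
+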
+ ## Observation Space
+
+ The environment returns `MetricTrackerRlObservation`, which includes:
+
+ - task metadata: `task_id`, `instruction`, `status`, and visible episode config
+ - method surface: `available_methods` and `available_synthetic_generator_methods`
+ - task catalog: `available_tasks`
+ - metric definitions: `conversion_metric_definitions`
+ - latest tool output: `analysis_result`
+ - latest submission output: `generated_rows`, `submitted_rows`, `submission_preview`, `submission_issues`, and `reward_breakdown`
+ - progress counters: `expected_row_count` and `correct_row_count`
+
+ In standard benchmark mode, raw `daily_metrics`, raw `hourly_metrics`, and hidden debug payloads are not exposed directly. Agents are expected to inspect the data through the read-only shared analysis methods instead.
+
+ ## Shared Analysis Surface
+
+ Humans in the Gradio debugger and agents in `inference.py` use the same read-only analysis surface:
+
+ - `task_overview`
+ - `list_dates`
+ - `list_entities`
+ - `rows_for_date`
+ - `hourly_rows_for_date`
+ - `compare_rate_to_median`
+ - `compare_count_to_median`
+ - `detect_funnel_break`
+ - `check_impossible_counts`
+ - `list_suspicious_dates`
+ - `preview_submission`
+ - payload-generator helpers such as `get_median_filter_rows`
+
+ This keeps the benchmark focused on investigation quality rather than privileged access.
+
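+ A minimal client-side sketch of driving one of these methods through the action's `analysis_method` field. The `base_url` constructor argument and the example date are assumptions for illustration, and the client is assumed to expose the usual OpenEnv `reset`/`step` surface:
+
+ ```python
+ from metric_tracker_rl import MetricTrackerRlAction, MetricTrackerRlEnv
+
+ # Assumption: the EnvClient base class accepts the server URL like this.
+ env = MetricTrackerRlEnv(base_url="http://localhost:8000")
+ env.reset()
+
+ # Run a read-only analysis step instead of submitting rows for grading.
+ result = env.step(
+     MetricTrackerRlAction(
+         analysis_method="compare_count_to_median",
+         analysis_args={"date": "2024-01-15", "entity_name": "app_opens"},
+     )
+ )
+ print(result.observation.analysis_result)
+ ```
+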
+ ## Tasks And Expected Difficulty
+
+ The benchmark ships with three named deterministic tasks:
+
+ 1. `easy_single_spike`
+    Expected difficulty: easy.
+    One obvious event-count spike is present. A careful single-method investigation should usually be enough.
+ 2. `medium_mixed_pair`
+    Expected difficulty: medium.
+    Three anomalies are present across mixed count and rate signals. Precision matters because over-submission is penalized.
+ 3. `hard_mixed_multi`
+    Expected difficulty: hard.
+    Five anomalies are present with higher density and weaker signal separation. Agents need broader exploration and tighter filtering.
+
+ Supported anomaly families across resets:
+
+ - `rate_drop_from_median`
+ - `rate_spike_from_median`
+ - `absolute_drop_in_event_count`
+ - `absolute_spike_in_event_count`
+ - `funnel_break`
+ - `hourly_traffic_mix_shift`
+ - `instrumentation_data_quality_issue`
+
+ ## Reward And Grading
+
+ Grading is deterministic and normalized to `[0, 1]`. The evaluator rewards:
+
+ - precision
+ - recall
+ - correct `anomaly_type`
+ - correct `detection_method`
+ - numeric accuracy for `baseline_value`, `observed_value`, and `delta_value` within tolerance
+ - correct `severity`
+
+ Penalties apply for:
+
+ - extra rows
+ - duplicate rows
+ - invalid rows
+ - exploit-style mass submission patterns
+
+ The observation exposes `submission_preview`, `submission_issues`, and `reward_breakdown` after a graded step.
+
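+ As a rough illustration only, not the evaluator's actual formula, a normalized score of this general shape blends precision and recall and subtracts penalties before clipping to `[0, 1]`:
+
+ ```python
+ def illustrative_score(tp: int, fp: int, fn: int, penalty: float) -> float:
+     """Toy precision/recall blend; the real evaluator also scores
+     anomaly_type, detection_method, numeric tolerance, and severity."""
+     precision = tp / (tp + fp) if (tp + fp) else 0.0
+     recall = tp / (tp + fn) if (tp + fn) else 0.0
+     return max(0.0, min(1.0, 0.5 * precision + 0.5 * recall - penalty))
+
+ print(illustrative_score(tp=2, fp=1, fn=1, penalty=0.05))  # ~0.6167
+ ```
+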
+ ## Baseline Scores
+
+ Reference scores below were measured locally with a deterministic scripted payload-generator baseline that submits:
+
+ - `easy_single_spike`: `get_absolute_spike_in_event_count_rows(threshold_multiplier=2.0)`
+ - `medium_mixed_pair`: `get_median_filter_rows(threshold_multiplier=2.0)`
+ - `hard_mixed_multi`: `get_median_filter_rows(threshold_multiplier=2.0)`
+
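+ A sketch of invoking the same generators through the action's `payload_generators` field (the argument shape is assumed to match `PayloadGeneratorMethod`, which the shared toolkit also accepts as plain dicts):
+
+ ```python
+ from metric_tracker_rl import MetricTrackerRlAction
+
+ # Declarative generator request; the environment builds the rows server-side.
+ action = MetricTrackerRlAction(
+     payload_generators=[
+         {"method_name": "get_absolute_spike_in_event_count_rows", "threshold_multiplier": 2.0},
+     ]
+ )
+ ```
+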
+ Measured normalized scores:
+
+ - `easy_single_spike`: `1.000000`
+ - `medium_mixed_pair`: `0.662500`
+ - `hard_mixed_multi`: `0.421818`
+ - average across named tasks: `0.694773`
+
+ These numbers are useful as a simple non-LLM reference point, not as a ceiling. A perfect submission still scores `1.0` on each task.
+
+ ## Hugging Face Space Deployment
+
+ This repository is configured for a containerized Hugging Face Space:
+
+ - `README.md` frontmatter sets `sdk: docker`
+ - the Space is tagged with `openenv`
+ - [`openenv.yaml`](openenv.yaml) points to `server.app:app`
+ - [`Dockerfile`](Dockerfile) starts the OpenEnv HTTP server on port `8000`
+
+ ## Setup
+
+ ### Local Python Setup
+
+ ```bash
+ cd metric_tracker_rl
+ uv sync
+ ```
+
+ ### Run The Environment Locally
+
+ ```bash
+ cd metric_tracker_rl
+ uv run python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
+ ```
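+
+ A quick smoke test against the same `/health` endpoint the Dockerfile healthcheck probes:
+
+ ```bash
+ curl -fsS http://localhost:8000/health
+ ```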
+
+ ### Run The Inference Baseline
+
+ Set credentials in [`.env.inference`](.env.inference), then run:
+
+ ```bash
+ cd metric_tracker_rl
+ source .env.inference
+ uv run python inference.py
+ ```
+
+ The inference baseline runs:
+
+ - `easy_single_spike`
+ - `medium_mixed_pair`
+ - `hard_mixed_multi`
+
+ It prints one score per task and an overall average benchmark score.
+
+ ## Container Build And Run
+
+ Build the image:
+
+ ```bash
+ cd metric_tracker_rl
+ docker build -t metric-tracker-rl .
+ ```
+
+ Run the container:
+
+ ```bash
+ docker run --rm -p 8000:8000 metric-tracker-rl
+ ```
+
+ Once running, the Space-compatible server is available at `http://localhost:8000`.
+
+ ## Validation
+
+ Useful checks:
+
+ ```bash
+ cd metric_tracker_rl
+ openenv validate .
+ python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
+ ```
+
+ ## Manual Debugging UI
+
+ The bundled Gradio UI exposes:
+
+ - named-task selection
+ - reset controls for `seed`, `scenario_family`, `difficulty`, and `anomaly_density`
+ - the same shared analysis methods used by the agent baseline
+ - payload preview and submission feedback
+ - charts for daily counts, rates, hourly metrics, and funnel shape
+
+ Debug mode can expose expected rows and anomaly schedules for development, but that view is intentionally gated and is not part of standard benchmark play.
__init__.py ADDED
@@ -0,0 +1,26 @@
+ """Metric Tracker Rl Environment."""
+
+ from .client import MetricTrackerRlEnv
+ from .models import MetricSubmissionRow, MetricTrackerRlAction, MetricTrackerRlObservation
+ from .payload_generation import (
+     available_analysis_methods,
+     available_payload_generation_methods,
+     available_synthetic_generator_methods,
+ )
+ from .tasks import DEFAULT_TASK_ID, DEFAULT_TASK_ORDER, TASKS, TaskSpec, available_task_specs, get_task_spec
+
+ __all__ = [
+     "MetricSubmissionRow",
+     "MetricTrackerRlAction",
+     "MetricTrackerRlObservation",
+     "MetricTrackerRlEnv",
+     "available_analysis_methods",
+     "available_payload_generation_methods",
+     "available_synthetic_generator_methods",
+     "TaskSpec",
+     "TASKS",
+     "DEFAULT_TASK_ID",
+     "DEFAULT_TASK_ORDER",
+     "get_task_spec",
+     "available_task_specs",
+ ]
analysis_tools.py ADDED
@@ -0,0 +1,1229 @@
+ """Shared safe analysis methods for agents and the manual UI."""
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ import math
+ from statistics import median
+ from typing import Any
+
+ try:
+     from .models import (
+         ConversionMetricDefinition,
+         MethodSpec,
+         MetricRecord,
+         MetricSubmissionRow,
+         PayloadGeneratorMethod,
+         SubmissionIssue,
+         SubmissionPreview,
+     )
+ except ImportError:
+     from models import (
+         ConversionMetricDefinition,
+         MethodSpec,
+         MetricRecord,
+         MetricSubmissionRow,
+         PayloadGeneratorMethod,
+         SubmissionIssue,
+         SubmissionPreview,
+     )
+
+
+ FUNNEL_STEPS: tuple[tuple[str, str], ...] = (
+     ("menu_opens", "app_opens"),
+     ("product_added_to_cart", "menu_opens"),
+     ("orders_placed", "product_added_to_cart"),
+     ("payment_successful", "orders_placed"),
+ )
+
+ COUNT_METRICS: tuple[str, ...] = (
+     "app_opens",
+     "menu_opens",
+     "product_added_to_cart",
+     "orders_placed",
+     "payment_successful",
+ )
+
+ DEFAULT_METHOD_SPECS: tuple[MethodSpec, ...] = (
+     MethodSpec(
+         name="task_overview",
+         description="Return compact task context, config, entity catalog, and payload schema.",
+     ),
+     MethodSpec(name="list_dates", description="List all dates in the dataset."),
+     MethodSpec(
+         name="list_entities",
+         description="List count, rate, funnel, hourly mix, and data quality entities.",
+     ),
+     MethodSpec(
+         name="rows_for_date",
+         description="Return daily counts and derived rates for one date.",
+         parameters=["date"],
+     ),
+     MethodSpec(
+         name="hourly_rows_for_date",
+         description="Return hourly rows and traffic-share summaries for one date.",
+         parameters=["date"],
+     ),
+     MethodSpec(
+         name="compare_rate_to_median",
+         description="Compare one conversion rate against its daily median baseline.",
+         parameters=["date", "entity_name"],
+     ),
+     MethodSpec(
+         name="compare_count_to_median",
+         description="Compare one event count against its daily median baseline.",
+         parameters=["date", "entity_name"],
+     ),
+     MethodSpec(
+         name="detect_funnel_break",
+         description="Inspect funnel-step rates and monotonicity for a date.",
+         parameters=["date"],
+     ),
+     MethodSpec(
+         name="check_impossible_counts",
+         description="Find impossible daily or hourly count relationships for a date.",
+         parameters=["date"],
+     ),
+     MethodSpec(
+         name="list_suspicious_dates",
+         description="Rank dates by anomaly suspicion using shared heuristics.",
+         parameters=["limit"],
+     ),
+     MethodSpec(
+         name="preview_submission",
+         description="Validate candidate payload rows without revealing ground truth.",
+         parameters=["rows"],
+     ),
+     MethodSpec(
+         name="show_raw_data",
+         description="Return a head() style view of daily aggregate rows with count and rate metrics.",
+         parameters=["limit"],
+     ),
+     MethodSpec(
+         name="get_metric_median",
+         description="Return the median for a count metric or conversion metric.",
+         parameters=["metric_name"],
+     ),
+     MethodSpec(
+         name="get_metric_std_dev_from_median",
+         description="Return sqrt(mean((value - median)^2)) for a metric.",
+         parameters=["metric_name"],
+     ),
+     MethodSpec(
+         name="get_rows_with_abs_diff_from_median_gt",
+         description="Return all dates where abs(value - median) is greater than a threshold.",
+         parameters=["metric_name", "threshold"],
+     ),
+     MethodSpec(
+         name="get_median_filter_rows",
+         description="Build payload rows where abs(value - median) > threshold_multiplier * std_from_median.",
+         parameters=["metric_name", "threshold_multiplier"],
+     ),
+     MethodSpec(
+         name="get_rate_drop_from_median_rows",
+         description="Build conversion-rate payload rows where median-filtered values drop below baseline.",
+         parameters=["metric_name", "threshold_multiplier"],
+     ),
+     MethodSpec(
+         name="get_rate_spike_from_median_rows",
+         description="Build conversion-rate payload rows where median-filtered values spike above baseline.",
+         parameters=["metric_name", "threshold_multiplier"],
+     ),
+     MethodSpec(
+         name="get_absolute_drop_in_event_count_rows",
+         description="Build event-count payload rows where median-filtered values drop below baseline.",
+         parameters=["metric_name", "threshold_multiplier"],
+     ),
+     MethodSpec(
+         name="get_absolute_spike_in_event_count_rows",
+         description="Build event-count payload rows where median-filtered values spike above baseline.",
+         parameters=["metric_name", "threshold_multiplier"],
+     ),
+     MethodSpec(
+         name="get_funnel_break_rows",
+         description="Build payload rows for funnel-step breaks by scanning dates for large funnel-rate drops.",
+         parameters=["threshold_multiplier"],
+     ),
+     MethodSpec(
+         name="get_hourly_traffic_mix_shift_rows",
+         description="Build payload rows for dates with unusual app_open daytime-share shifts.",
+         parameters=["threshold_multiplier"],
+     ),
+     MethodSpec(
+         name="get_instrumentation_data_quality_issue_rows",
+         description="Build payload rows for dates with impossible count relationships or instrumentation issues.",
+         parameters=["threshold_multiplier"],
+     ),
+     MethodSpec(
+         name="payload_generator",
+         description="Run a list of payload generation methods and merge the generated rows.",
+         parameters=["generator_methods"],
+     ),
+ )
+
+
+ def available_analysis_methods() -> list[MethodSpec]:
+     """Return the shared safe method surface."""
+     return list(DEFAULT_METHOD_SPECS)
+
+
+ @dataclass
+ class AnalysisContext:
+     """Structured input for the shared method implementation."""
+
+     daily_metrics: list[MetricRecord]
+     hourly_metrics: list[MetricRecord]
+     conversion_definitions: list[ConversionMetricDefinition]
+     instruction: str = ""
+     config: dict[str, Any] | None = None
+
+
+ class SharedAnalysisToolkit:
+     """Shared method implementation for agents and the manual UI."""
+
+     def __init__(self, context: AnalysisContext) -> None:
+         self._context = context
+         self._daily_by_date = {row.date: row for row in context.daily_metrics}
+         self._hourly_by_date: dict[str, list[MetricRecord]] = {}
+         for row in context.hourly_metrics:
+             self._hourly_by_date.setdefault(row.date, []).append(row)
+         for rows in self._hourly_by_date.values():
+             rows.sort(key=lambda item: item.hour if item.hour is not None else -1)
+         self._dates = sorted(self._daily_by_date)
+         self._conversion_map = {item.name: item for item in context.conversion_definitions}
+
+     def task_overview(self) -> dict[str, Any]:
+         """Return a compact overview of the task and available entities."""
+         return {
+             "instruction": self._context.instruction,
+             "config": self._context.config or {},
+             "date_count": len(self._dates),
+             "dates": self._dates,
+             "threshold_search_space": {
+                 "rate_delta_pct_points": [3.0, 4.5, 6.0, 8.0],
+                 "count_delta_pct": [10.0, 15.0, 22.0, 30.0],
+                 "funnel_delta_pct_points": [3.5, 5.0, 7.0, 10.0],
+                 "hourly_mix_delta_pct": [8.0, 12.0, 18.0, 25.0],
+             },
+             "payload_schema": [
+                 "date",
+                 "entity_type",
+                 "entity_name",
+                 "anomaly_type",
+                 "detection_method",
+                 "baseline_value",
+                 "observed_value",
+                 "delta_value",
+                 "severity",
+             ],
+             "available_methods": [item.model_dump() for item in available_analysis_methods()],
+             "entities": self.list_entities()["entities"],
+         }
+
+     def list_dates(self) -> dict[str, Any]:
+         return {"dates": self._dates}
+
+     def list_entities(self) -> dict[str, Any]:
+         conversions = [
+             {
+                 "entity_type": "conversion_rate",
+                 "entity_name": item.name,
+                 "formula": item.description,
+             }
+             for item in self._context.conversion_definitions
+         ]
+         counts = [
+             {
+                 "entity_type": "event_count",
+                 "entity_name": metric_name,
+             }
+             for metric_name in COUNT_METRICS
+         ]
+         funnels = [
+             {
+                 "entity_type": "funnel_step",
+                 "entity_name": f"{numerator}_from_{denominator}",
+             }
+             for numerator, denominator in FUNNEL_STEPS
+         ]
+         quality = [
+             {
+                 "entity_type": "data_quality",
+                 "entity_name": f"{numerator}_lte_{denominator}",
+             }
+             for numerator, denominator in FUNNEL_STEPS
+         ]
+         hourly = [
+             {
+                 "entity_type": "hourly_mix",
+                 "entity_name": "app_opens:daytime_share",
+             }
+         ]
+         return {"entities": conversions + counts + funnels + quality + hourly}
+
+     def rows_for_date(self, date: str) -> dict[str, Any]:
+         row = self._daily_by_date.get(date)
+         if row is None:
+             return {"found": False, "date": date, "error": "Date not found."}
+         derived_rates = self._conversion_rates(row)
+         return {
+             "found": True,
+             "date": date,
+             "daily_metrics": row.model_dump(),
+             "derived_rates": derived_rates,
+         }
+
+     def hourly_rows_for_date(self, date: str) -> dict[str, Any]:
+         rows = self._hourly_by_date.get(date, [])
+         if not rows:
+             return {"found": False, "date": date, "error": "Date not found."}
+         total = sum(item.app_opens for item in rows) or 1
+         daytime_hours = {8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19}
+         daytime_share = round(
+             sum(item.app_opens for item in rows if item.hour in daytime_hours) / total,
+             4,
+         )
+         return {
+             "found": True,
+             "date": date,
+             "summary": {
+                 "daytime_share": daytime_share,
+                 "top_hours": sorted(
+                     (
+                         {
+                             "hour": item.hour,
+                             "app_opens": item.app_opens,
+                             "share": round(item.app_opens / total, 4),
+                         }
+                         for item in rows
+                     ),
+                     key=lambda item: item["app_opens"],
+                     reverse=True,
+                 )[:5],
+             },
+             "rows": [item.model_dump() for item in rows],
+         }
+
+     def compare_rate_to_median(self, date: str, entity_name: str) -> dict[str, Any]:
+         record = self._daily_by_date.get(date)
+         definition = self._conversion_map.get(entity_name)
+         if record is None or definition is None:
+             return {
+                 "found": False,
+                 "date": date,
+                 "entity_name": entity_name,
+                 "error": "Date or conversion entity not found.",
+             }
+         series = [self._rate_for_record(item, definition) for item in self._context.daily_metrics]
+         baseline = round(median(series), 4)
+         observed = round(self._rate_for_record(record, definition), 4)
+         delta = round(observed - baseline, 4)
+         anomaly_type = "normal"
+         if delta <= -self._rate_threshold():
+             anomaly_type = "rate_drop_from_median"
+         elif delta >= self._rate_threshold():
+             anomaly_type = "rate_spike_from_median"
+         return {
+             "found": True,
+             "date": date,
+             "entity_type": "conversion_rate",
+             "entity_name": entity_name,
+             "detection_method": "compare_rate_to_median",
+             "baseline_value": baseline,
+             "observed_value": observed,
+             "delta_value": delta,
+             "anomaly_type": anomaly_type,
+             "severity": self._severity(abs(delta), medium=4.0, high=8.0, critical=12.0),
+         }
+
+     def compare_count_to_median(self, date: str, entity_name: str) -> dict[str, Any]:
+         record = self._daily_by_date.get(date)
+         if record is None or entity_name not in COUNT_METRICS:
+             return {
+                 "found": False,
+                 "date": date,
+                 "entity_name": entity_name,
+                 "error": "Date or count entity not found.",
+             }
+         series = [float(getattr(item, entity_name)) for item in self._context.daily_metrics]
+         baseline = round(median(series), 4)
+         observed = round(float(getattr(record, entity_name)), 4)
+         delta = round(observed - baseline, 4)
+         threshold = max(50.0, baseline * self._count_threshold_fraction())
+         anomaly_type = "normal"
+         if delta <= -threshold:
+             anomaly_type = "absolute_drop_in_event_count"
+         elif delta >= threshold:
+             anomaly_type = "absolute_spike_in_event_count"
+         return {
+             "found": True,
+             "date": date,
+             "entity_type": "event_count",
+             "entity_name": entity_name,
+             "detection_method": "compare_count_to_median",
+             "baseline_value": baseline,
+             "observed_value": observed,
+             "delta_value": delta,
+             "anomaly_type": anomaly_type,
+             "severity": self._severity(
+                 abs(delta) / max(baseline, 1.0) * 100.0,
+                 medium=12.0,
+                 high=22.0,
+                 critical=35.0,
+             ),
+         }
+
+     def detect_funnel_break(self, date: str) -> dict[str, Any]:
+         record = self._daily_by_date.get(date)
+         if record is None:
+             return {"found": False, "date": date, "error": "Date not found."}
+         candidates: list[dict[str, Any]] = []
+         for numerator, denominator in FUNNEL_STEPS:
+             entity_name = f"{numerator}_from_{denominator}"
+             baseline_series = [
+                 self._ratio(getattr(item, numerator), getattr(item, denominator)) * 100.0
+                 for item in self._context.daily_metrics
+             ]
+             baseline = round(median(baseline_series), 4)
+             observed = round(
+                 self._ratio(getattr(record, numerator), getattr(record, denominator)) * 100.0,
+                 4,
+             )
+             delta = round(observed - baseline, 4)
+             issue = {
+                 "entity_type": "funnel_step",
+                 "entity_name": entity_name,
+                 "detection_method": "detect_funnel_break",
+                 "baseline_value": baseline,
+                 "observed_value": observed,
+                 "delta_value": delta,
+                 "monotonicity_broken": getattr(record, numerator) > getattr(record, denominator),
+                 "severity": self._severity(abs(delta), medium=5.0, high=10.0, critical=15.0),
+             }
+             if issue["monotonicity_broken"] or delta <= -self._funnel_threshold():
+                 issue["anomaly_type"] = "funnel_break"
+                 candidates.append(issue)
+         return {"found": True, "date": date, "candidates": candidates}
+
+     def check_impossible_counts(self, date: str) -> dict[str, Any]:
+         daily = self._daily_by_date.get(date)
+         hourly_rows = self._hourly_by_date.get(date, [])
+         if daily is None:
+             return {"found": False, "date": date, "error": "Date not found."}
+         issues = []
+         issues.extend(self._impossible_issues(daily, scope="daily"))
+         for row in hourly_rows:
+             issues.extend(self._impossible_issues(row, scope=f"hour_{row.hour:02d}"))
+         total_excess = round(sum(item["excess_value"] for item in issues), 4)
+         return {
+             "found": True,
+             "date": date,
+             "issue_count": len(issues),
+             "total_excess": total_excess,
+             "issues": issues,
+         }
+
+     def list_suspicious_dates(self, limit: int = 10) -> dict[str, Any]:
+         ranked = []
+         hourly_baseline = self._median_daytime_share()
+         for date in self._dates:
+             rate_signal = 0.0
+             for definition in self._context.conversion_definitions:
+                 comparison = self.compare_rate_to_median(date, definition.name)
+                 rate_signal = max(rate_signal, abs(comparison["delta_value"]))
+             count_signal = 0.0
+             for metric_name in COUNT_METRICS:
+                 comparison = self.compare_count_to_median(date, metric_name)
+                 baseline = max(comparison["baseline_value"], 1.0)
+                 count_signal = max(
+                     count_signal,
+                     abs(comparison["delta_value"]) / baseline * 100.0,
+                 )
+             funnel_candidates = self.detect_funnel_break(date)["candidates"]
+             impossible = self.check_impossible_counts(date)
+             hourly_share = self.hourly_rows_for_date(date)["summary"]["daytime_share"]
+             hourly_signal = abs(hourly_share - hourly_baseline) * 100.0
+             suspicion_score = round(
+                 rate_signal + count_signal + hourly_signal + impossible["total_excess"] * 0.05
+                 + len(funnel_candidates) * 6.0,
+                 4,
+             )
+             ranked.append(
+                 {
+                     "date": date,
+                     "suspicion_score": suspicion_score,
+                     "max_rate_delta": round(rate_signal, 4),
+                     "max_count_delta_pct": round(count_signal, 4),
+                     "hourly_mix_delta_pct": round(hourly_signal, 4),
+                     "funnel_candidate_count": len(funnel_candidates),
+                     "impossible_issue_count": impossible["issue_count"],
+                 }
+             )
+         ranked.sort(key=lambda item: (item["suspicion_score"], item["date"]), reverse=True)
+         return {"dates": ranked[: max(limit, 1)]}
+
+     def preview_submission(self, rows: list[dict[str, Any]] | list[MetricSubmissionRow]) -> dict[str, Any]:
+         preview = preview_submission_rows(rows)
+         return preview.model_dump()
+
+     def show_raw_data(self, limit: int = 5) -> dict[str, Any]:
+         rows = []
+         for record in self._context.daily_metrics[: max(limit, 1)]:
+             row = record.model_dump()
+             row.update(self._conversion_rates(record))
+             rows.append(row)
+         return {
+             "row_count": len(self._context.daily_metrics),
+             "returned_rows": len(rows),
+             "rows": rows,
+         }
+
+     def get_metric_median(self, metric_name: str) -> dict[str, Any]:
+         descriptor = self._metric_descriptor(metric_name)
+         values = descriptor["values"]
+         metric_median = round(median(values), 4) if values else 0.0
+         return {
+             "metric_name": metric_name,
+             "metric_type": descriptor["metric_type"],
+             "median_value": metric_median,
+             "sample_size": len(values),
+         }
+
+     def get_metric_median_multi(
+         self,
+         metric_name: str | None = None,
+         metric_names: list[str] | None = None,
+     ) -> dict[str, Any]:
+         resolved_metrics = self._resolve_metric_names(metric_name=metric_name, metric_names=metric_names)
+         results = [self.get_metric_median(name) for name in resolved_metrics]
+         return {
+             "metric_name": metric_name,
+             "metric_names": resolved_metrics,
+             "results": results,
+         }
+
+     def get_metric_std_dev_from_median(self, metric_name: str) -> dict[str, Any]:
+         descriptor = self._metric_descriptor(metric_name)
+         values = descriptor["values"]
+         metric_median = median(values) if values else 0.0
+         std_from_median = math.sqrt(
+             sum((value - metric_median) ** 2 for value in values) / len(values)
+         ) if values else 0.0
+         return {
+             "metric_name": metric_name,
+             "metric_type": descriptor["metric_type"],
+             "median_value": round(metric_median, 4),
+             "std_dev_from_median": round(std_from_median, 4),
+             "sample_size": len(values),
+         }
+
+     def get_metric_std_dev_from_median_multi(
+         self,
+         metric_name: str | None = None,
+         metric_names: list[str] | None = None,
+     ) -> dict[str, Any]:
+         resolved_metrics = self._resolve_metric_names(metric_name=metric_name, metric_names=metric_names)
+         results = [self.get_metric_std_dev_from_median(name) for name in resolved_metrics]
+         return {
+             "metric_name": metric_name,
+             "metric_names": resolved_metrics,
+             "results": results,
+         }
+
+     def get_rows_with_abs_diff_from_median_gt(self, metric_name: str, threshold: float) -> dict[str, Any]:
+         descriptor = self._metric_descriptor(metric_name)
+         metric_median = median(descriptor["values"]) if descriptor["values"] else 0.0
+         matches = []
+         for date_key, value in descriptor["per_date_values"].items():
+             abs_diff = abs(value - metric_median)
+             if abs_diff <= threshold:
+                 continue
+             row = {
+                 "date": date_key,
+                 "metric_name": metric_name,
+                 "metric_type": descriptor["metric_type"],
+                 "median_value": round(metric_median, 4),
+                 "observed_value": round(value, 4),
+                 "abs_diff": round(abs_diff, 4),
+             }
+             suggested = self._build_submission_row_for_metric(
+                 metric_name=metric_name,
+                 date=date_key,
+                 baseline_value=float(metric_median),
+                 observed_value=float(value),
+             )
+             if suggested is not None:
+                 row["suggested_payload_row"] = suggested.model_dump()
+             matches.append(row)
+         return {
+             "metric_name": metric_name,
+             "threshold": threshold,
+             "match_count": len(matches),
+             "rows": matches,
+         }
+
+     def get_rows_with_abs_diff_from_median_gt_multi(
+         self,
+         metric_name: str | None = None,
+         metric_names: list[str] | None = None,
+         threshold: float = 0.0,
+     ) -> dict[str, Any]:
+         resolved_metrics = self._resolve_metric_names(metric_name=metric_name, metric_names=metric_names)
+         results = [
+             self.get_rows_with_abs_diff_from_median_gt(name, threshold)
+             for name in resolved_metrics
+         ]
+         return {
+             "metric_name": metric_name,
+             "metric_names": resolved_metrics,
+             "threshold": threshold,
+             "results": results,
+         }
+
+     def get_median_filter_rows(self, metric_name: str, threshold_multiplier: float) -> dict[str, Any]:
+         return self.get_median_filter_rows_multi(
+             metric_name=metric_name,
+             metric_names=[],
+             threshold_multiplier=threshold_multiplier,
+         )
+
+     def get_median_filter_rows_multi(
+         self,
+         metric_name: str | None = None,
+         metric_names: list[str] | None = None,
+         threshold_multiplier: float = 2.0,
+     ) -> dict[str, Any]:
+         resolved_metrics = self._resolve_metric_names(metric_name=metric_name, metric_names=metric_names)
+         details = []
+         generated: dict[str, dict[str, Any]] = {}
+         total_matches = 0
+         for resolved_metric in resolved_metrics:
+             stats = self.get_metric_std_dev_from_median(resolved_metric)
+             threshold = stats["std_dev_from_median"] * threshold_multiplier
+             rows_result = self.get_rows_with_abs_diff_from_median_gt(resolved_metric, threshold)
+             payload_rows = [
+                 row["suggested_payload_row"]
+                 for row in rows_result["rows"]
+                 if row.get("suggested_payload_row")
+             ]
+             total_matches += rows_result["match_count"]
+             for row in payload_rows:
+                 submission_row = MetricSubmissionRow(**row)
+                 generated[submission_row_key(submission_row)] = submission_row.model_dump()
+             details.append(
+                 {
+                     "metric_name": resolved_metric,
+                     "threshold": round(threshold, 4),
+                     "match_count": rows_result["match_count"],
+                     "rows": rows_result["rows"],
+                     "generated_rows": payload_rows,
+                 }
+             )
+         return {
+             "method_name": "get_median_filter_rows",
+             "metric_name": metric_name,
+             "metric_names": resolved_metrics,
+             "threshold_multiplier": threshold_multiplier,
+             "match_count": total_matches,
+             "generated_rows": list(generated.values()),
+             "details": details,
+         }
+
+     def get_rate_drop_from_median_rows(
+         self,
+         metric_name: str | None = None,
+         metric_names: list[str] | None = None,
+         threshold_multiplier: float = 2.0,
+     ) -> dict[str, Any]:
+         return self._metric_family_filter_rows(
+             method_name="get_rate_drop_from_median_rows",
+             metric_name=metric_name,
+             metric_names=metric_names,
+             threshold_multiplier=threshold_multiplier,
+             metric_type="conversion_rate",
+             allowed_anomaly_types={"rate_drop_from_median"},
+         )
+
+     def get_rate_spike_from_median_rows(
+         self,
+         metric_name: str | None = None,
+         metric_names: list[str] | None = None,
+         threshold_multiplier: float = 2.0,
+     ) -> dict[str, Any]:
+         return self._metric_family_filter_rows(
+             method_name="get_rate_spike_from_median_rows",
+             metric_name=metric_name,
+             metric_names=metric_names,
+             threshold_multiplier=threshold_multiplier,
+             metric_type="conversion_rate",
+             allowed_anomaly_types={"rate_spike_from_median"},
+         )
+
+     def get_absolute_drop_in_event_count_rows(
+         self,
+         metric_name: str | None = None,
+         metric_names: list[str] | None = None,
+         threshold_multiplier: float = 2.0,
+     ) -> dict[str, Any]:
+         return self._metric_family_filter_rows(
+             method_name="get_absolute_drop_in_event_count_rows",
+             metric_name=metric_name,
+             metric_names=metric_names,
+             threshold_multiplier=threshold_multiplier,
+             metric_type="event_count",
+             allowed_anomaly_types={"absolute_drop_in_event_count"},
+         )
+
+     def get_absolute_spike_in_event_count_rows(
+         self,
+         metric_name: str | None = None,
+         metric_names: list[str] | None = None,
+         threshold_multiplier: float = 2.0,
+     ) -> dict[str, Any]:
+         return self._metric_family_filter_rows(
+             method_name="get_absolute_spike_in_event_count_rows",
+             metric_name=metric_name,
+             metric_names=metric_names,
+             threshold_multiplier=threshold_multiplier,
+             metric_type="event_count",
+             allowed_anomaly_types={"absolute_spike_in_event_count"},
+         )
+
+     def get_funnel_break_rows(self, threshold_multiplier: float = 2.0) -> dict[str, Any]:
+         details = []
+         generated: dict[str, dict[str, Any]] = {}
+         total_matches = 0
+         for numerator, denominator in FUNNEL_STEPS:
+             entity_name = f"{numerator}_from_{denominator}"
+             per_date_values = {
+                 date_key: round(
+                     self._ratio(getattr(record, numerator), getattr(record, denominator)) * 100.0,
+                     4,
+                 )
+                 for date_key, record in self._daily_by_date.items()
+             }
+             values = list(per_date_values.values())
+             baseline = median(values) if values else 0.0
+             std_from_median = math.sqrt(
+                 sum((value - baseline) ** 2 for value in values) / len(values)
+             ) if values else 0.0
+             threshold = max(std_from_median * float(threshold_multiplier), self._funnel_threshold())
+             rows = []
+             generated_rows = []
+             for date_key, observed_value in per_date_values.items():
+                 delta_value = round(observed_value - baseline, 4)
+                 if delta_value > -threshold:
+                     continue
+                 row = {
+                     "date": date_key,
+                     "entity_type": "funnel_step",
+                     "entity_name": entity_name,
+                     "anomaly_type": "funnel_break",
+                     "detection_method": "detect_funnel_break",
+                     "baseline_value": round(baseline, 4),
+                     "observed_value": round(observed_value, 4),
+                     "delta_value": delta_value,
+                     "severity": self._severity(abs(delta_value), medium=5.0, high=10.0, critical=15.0),
+                 }
+                 total_matches += 1
+                 rows.append(row)
+                 submission_row = MetricSubmissionRow(**row)
+                 generated[submission_row_key(submission_row)] = submission_row.model_dump()
+                 generated_rows.append(submission_row.model_dump())
+             details.append(
+                 {
+                     "entity_name": entity_name,
+                     "threshold": round(threshold, 4),
+                     "match_count": len(rows),
+                     "rows": rows,
+                     "generated_rows": generated_rows,
+                 }
+             )
+         return {
+             "method_name": "get_funnel_break_rows",
+             "threshold_multiplier": threshold_multiplier,
+             "match_count": total_matches,
+             "generated_rows": list(generated.values()),
+             "details": details,
+         }
+
+     def get_hourly_traffic_mix_shift_rows(self, threshold_multiplier: float = 2.0) -> dict[str, Any]:
+         per_date_values = {}
+         for date_key in self._dates:
+             summary = self.hourly_rows_for_date(date_key).get("summary", {})
+             per_date_values[date_key] = float(summary.get("daytime_share", 0.0))
+         values = list(per_date_values.values())
+         baseline = median(values) if values else 0.0
+         std_from_median = math.sqrt(
+             sum((value - baseline) ** 2 for value in values) / len(values)
+         ) if values else 0.0
+         threshold = std_from_median * float(threshold_multiplier)
+         rows = []
+         generated_rows = []
+         for date_key, observed_value in per_date_values.items():
+             delta_value = round(observed_value - baseline, 4)
+             if abs(delta_value) <= threshold:
+                 continue
+             row = {
+                 "date": date_key,
+                 "entity_type": "hourly_mix",
+                 "entity_name": "app_opens:daytime_share",
+                 "anomaly_type": "hourly_traffic_mix_shift",
+                 "detection_method": "hourly_rows_for_date",
+                 "baseline_value": round(baseline, 4),
+                 "observed_value": round(observed_value, 4),
+                 "delta_value": delta_value,
+                 "severity": self._severity(abs(delta_value) * 100.0, medium=10.0, high=18.0, critical=25.0),
+             }
+             rows.append(row)
+             generated_rows.append(row)
+         return {
+             "method_name": "get_hourly_traffic_mix_shift_rows",
+             "threshold_multiplier": threshold_multiplier,
+             "match_count": len(rows),
+             "generated_rows": generated_rows,
+             "details": [
+                 {
+                     "entity_name": "app_opens:daytime_share",
+                     "threshold": round(threshold, 4),
+                     "match_count": len(rows),
+                     "rows": rows,
+                     "generated_rows": generated_rows,
+                 }
+             ],
+         }
+
+     def get_instrumentation_data_quality_issue_rows(
+         self,
+         threshold_multiplier: float = 2.0,
+     ) -> dict[str, Any]:
+         per_date_totals = {
+             date_key: float(self.check_impossible_counts(date_key).get("total_excess", 0.0))
+             for date_key in self._dates
+         }
+         values = list(per_date_totals.values())
+         baseline = median(values) if values else 0.0
+         std_from_median = math.sqrt(
+             sum((value - baseline) ** 2 for value in values) / len(values)
+         ) if values else 0.0
+         threshold = std_from_median * float(threshold_multiplier)
+         generated: dict[str, dict[str, Any]] = {}
+         details = []
+         total_matches = 0
+         for numerator, denominator in FUNNEL_STEPS:
+             entity_name = f"{numerator}_lte_{denominator}"
+             rows = []
+             generated_rows = []
+             for date_key in self._dates:
+                 result = self.check_impossible_counts(date_key)
+                 issue_names = {item["entity_name"] for item in result.get("issues", [])}
+                 observed_value = float(result.get("total_excess", 0.0))
+                 if entity_name not in issue_names or observed_value <= threshold:
+                     continue
+                 row = {
+                     "date": date_key,
+                     "entity_type": "data_quality",
+                     "entity_name": entity_name,
+                     "anomaly_type": "instrumentation_data_quality_issue",
+                     "detection_method": "check_impossible_counts",
+                     "baseline_value": round(baseline, 4),
+                     "observed_value": round(observed_value, 4),
+                     "delta_value": round(observed_value - baseline, 4),
+                     "severity": self._severity(observed_value, medium=20.0, high=60.0, critical=120.0),
+                 }
+                 total_matches += 1
+                 rows.append(row)
+                 submission_row = MetricSubmissionRow(**row)
+                 generated[submission_row_key(submission_row)] = submission_row.model_dump()
+                 generated_rows.append(submission_row.model_dump())
+             details.append(
+                 {
+                     "entity_name": entity_name,
+                     "threshold": round(threshold, 4),
+                     "match_count": len(rows),
+                     "rows": rows,
+                     "generated_rows": generated_rows,
+                 }
+             )
+         return {
+             "method_name": "get_instrumentation_data_quality_issue_rows",
+             "threshold_multiplier": threshold_multiplier,
+             "match_count": total_matches,
+             "generated_rows": list(generated.values()),
+             "details": details,
+         }
+
+     def payload_generator(
+         self,
+         generator_methods: list[dict[str, Any]] | list[PayloadGeneratorMethod],
+     ) -> dict[str, Any]:
+         methods = [
+             item if isinstance(item, PayloadGeneratorMethod) else PayloadGeneratorMethod(**item)
+             for item in generator_methods
+         ]
+         generated: dict[str, MetricSubmissionRow] = {}
+         details = []
+         for method in methods:
+             result = self._run_payload_generator_method(method)
+             if "error" in result:
+                 details.append(result)
+                 continue
+             for row in result["generated_rows"]:
+                 submission_row = MetricSubmissionRow(**row)
+                 generated[submission_row_key(submission_row)] = submission_row
+             details.append(result)
+         return {
+             "generator_methods": [item.model_dump() for item in methods],
+             "generated_row_count": len(generated),
+             "generated_rows": [row.model_dump() for row in generated.values()],
+             "details": details,
+         }
+
882
+ def _run_payload_generator_method(self, method: PayloadGeneratorMethod) -> dict[str, Any]:
883
+ if method.method_name == "get_median_filter_rows":
884
+ return self.get_median_filter_rows(
885
+ metric_name=method.metric_name,
886
+ threshold_multiplier=method.threshold_multiplier,
887
+ ) if not method.metric_names else self.get_median_filter_rows_multi(
888
+ metric_name=method.metric_name,
889
+ metric_names=method.metric_names,
890
+ threshold_multiplier=method.threshold_multiplier,
891
+ )
892
+ if method.method_name == "get_rate_drop_from_median_rows":
893
+ return self.get_rate_drop_from_median_rows(
894
+ metric_name=method.metric_name,
895
+ metric_names=method.metric_names,
896
+ threshold_multiplier=method.threshold_multiplier,
897
+ )
898
+ if method.method_name == "get_rate_spike_from_median_rows":
899
+ return self.get_rate_spike_from_median_rows(
900
+ metric_name=method.metric_name,
901
+ metric_names=method.metric_names,
902
+ threshold_multiplier=method.threshold_multiplier,
903
+ )
904
+ if method.method_name == "get_absolute_drop_in_event_count_rows":
905
+ return self.get_absolute_drop_in_event_count_rows(
906
+ metric_name=method.metric_name,
907
+ metric_names=method.metric_names,
908
+ threshold_multiplier=method.threshold_multiplier,
909
+ )
910
+ if method.method_name == "get_absolute_spike_in_event_count_rows":
911
+ return self.get_absolute_spike_in_event_count_rows(
912
+ metric_name=method.metric_name,
913
+ metric_names=method.metric_names,
914
+ threshold_multiplier=method.threshold_multiplier,
915
+ )
916
+ if method.method_name == "get_funnel_break_rows":
917
+ return self.get_funnel_break_rows(threshold_multiplier=method.threshold_multiplier)
918
+ if method.method_name == "get_hourly_traffic_mix_shift_rows":
919
+ return self.get_hourly_traffic_mix_shift_rows(threshold_multiplier=method.threshold_multiplier)
920
+ if method.method_name == "get_instrumentation_data_quality_issue_rows":
921
+ return self.get_instrumentation_data_quality_issue_rows(threshold_multiplier=method.threshold_multiplier)
922
+ return {
923
+ "method": method.model_dump(),
924
+ "error": "Unsupported payload generator method.",
925
+ }
926
+
927
+ def build_row_from_analysis(self, analysis_result: dict[str, Any]) -> dict[str, Any] | None:
928
+ """Extract a payload row when an analysis result directly maps to one."""
929
+ required_fields = {
930
+ "date",
931
+ "entity_type",
932
+ "entity_name",
933
+ "anomaly_type",
934
+ "detection_method",
935
+ "baseline_value",
936
+ "observed_value",
937
+ "delta_value",
938
+ "severity",
939
+ }
940
+ if required_fields.issubset(analysis_result) and analysis_result.get("anomaly_type") != "normal":
941
+ return {field: analysis_result[field] for field in required_fields}
942
+ return None
943
+
944
+ def _conversion_rates(self, record: MetricRecord) -> dict[str, float]:
945
+ return {
946
+ item.name: round(self._rate_for_record(record, item), 4)
947
+ for item in self._context.conversion_definitions
948
+ }
949
+
950
+ def _metric_descriptor(self, metric_name: str) -> dict[str, Any]:
951
+ if metric_name in COUNT_METRICS:
952
+ values = [float(getattr(item, metric_name)) for item in self._context.daily_metrics]
953
+ per_date_values = {
954
+ item.date: float(getattr(item, metric_name))
955
+ for item in self._context.daily_metrics
956
+ }
957
+ return {
958
+ "metric_type": "event_count",
959
+ "values": values,
960
+ "per_date_values": per_date_values,
961
+ }
962
+ definition = self._conversion_map.get(metric_name)
963
+ if definition is None:
964
+ raise ValueError(f"Unknown metric_name: {metric_name}")
965
+ values = [self._rate_for_record(item, definition) for item in self._context.daily_metrics]
966
+ per_date_values = {
967
+ item.date: self._rate_for_record(item, definition)
968
+ for item in self._context.daily_metrics
969
+ }
970
+ return {
971
+ "metric_type": "conversion_rate",
972
+ "values": values,
973
+ "per_date_values": per_date_values,
974
+ }
975
+
976
+ def _resolve_metric_names(
977
+ self,
978
+ *,
979
+ metric_name: str | None,
980
+ metric_names: list[str] | None,
981
+ ) -> list[str]:
982
+ names = [item for item in (metric_names or []) if item]
983
+ if metric_name:
984
+ names.append(metric_name)
985
+ if not names:
986
+ names = list(COUNT_METRICS) + list(self._conversion_map.keys())
987
+ deduped = []
988
+ seen = set()
989
+ for item in names:
990
+ if item in seen:
991
+ continue
992
+ seen.add(item)
993
+ deduped.append(item)
994
+ return deduped
995
+
996
+ def _resolve_metric_names_for_type(
997
+ self,
998
+ *,
999
+ metric_name: str | None,
1000
+ metric_names: list[str] | None,
1001
+ metric_type: str,
1002
+ ) -> list[str]:
1003
+ resolved = self._resolve_metric_names(metric_name=metric_name, metric_names=metric_names)
1004
+ return [
1005
+ item
1006
+ for item in resolved
1007
+ if self._metric_descriptor(item)["metric_type"] == metric_type
1008
+ ]
1009
+
1010
+ def _metric_family_filter_rows(
1011
+ self,
1012
+ *,
1013
+ method_name: str,
1014
+ metric_name: str | None,
1015
+ metric_names: list[str] | None,
1016
+ threshold_multiplier: float,
1017
+ metric_type: str,
1018
+ allowed_anomaly_types: set[str],
1019
+ ) -> dict[str, Any]:
1020
+ resolved_metrics = self._resolve_metric_names_for_type(
1021
+ metric_name=metric_name,
1022
+ metric_names=metric_names,
1023
+ metric_type=metric_type,
1024
+ )
1025
+ raw_result = self.get_median_filter_rows_multi(
1026
+ metric_name=None,
1027
+ metric_names=resolved_metrics,
1028
+ threshold_multiplier=threshold_multiplier,
1029
+ )
1030
+ generated: dict[str, dict[str, Any]] = {}
1031
+ details = []
1032
+ total_matches = 0
1033
+ for detail in raw_result["details"]:
1034
+ filtered_rows = []
1035
+ filtered_generated = []
1036
+ for row in detail["rows"]:
1037
+ suggested = row.get("suggested_payload_row")
1038
+ if not suggested or suggested.get("anomaly_type") not in allowed_anomaly_types:
1039
+ continue
1040
+ filtered_rows.append(row)
1041
+ submission_row = MetricSubmissionRow(**suggested)
1042
+                 generated[submission_row_key(submission_row)] = submission_row.model_dump()
+                 filtered_generated.append(submission_row.model_dump())
+             total_matches += len(filtered_rows)
+             details.append(
+                 {
+                     **detail,
+                     "match_count": len(filtered_rows),
+                     "rows": filtered_rows,
+                     "generated_rows": filtered_generated,
+                 }
+             )
+         return {
+             "method_name": method_name,
+             "metric_name": metric_name,
+             "metric_names": resolved_metrics,
+             "threshold_multiplier": threshold_multiplier,
+             "match_count": total_matches,
+             "generated_rows": list(generated.values()),
+             "details": details,
+         }
+
+     def _build_submission_row_for_metric(
+         self,
+         *,
+         metric_name: str,
+         date: str,
+         baseline_value: float,
+         observed_value: float,
+     ) -> MetricSubmissionRow | None:
+         delta_value = round(observed_value - baseline_value, 4)
+         if metric_name in COUNT_METRICS:
+             threshold = max(50.0, baseline_value * self._count_threshold_fraction())
+             if abs(delta_value) <= threshold:
+                 return None
+             anomaly_type = (
+                 "absolute_spike_in_event_count"
+                 if delta_value > 0
+                 else "absolute_drop_in_event_count"
+             )
+             return MetricSubmissionRow(
+                 date=date,
+                 entity_type="event_count",
+                 entity_name=metric_name,
+                 anomaly_type=anomaly_type,
+                 detection_method="compare_count_to_median",
+                 baseline_value=round(baseline_value, 4),
+                 observed_value=round(observed_value, 4),
+                 delta_value=delta_value,
+                 severity=self._severity(
+                     abs(delta_value) / max(baseline_value, 1.0) * 100.0,
+                     medium=12.0,
+                     high=22.0,
+                     critical=35.0,
+                 ),
+             )
+         threshold = self._rate_threshold()
+         if abs(delta_value) <= threshold:
+             return None
+         anomaly_type = "rate_spike_from_median" if delta_value > 0 else "rate_drop_from_median"
+         return MetricSubmissionRow(
+             date=date,
+             entity_type="conversion_rate",
+             entity_name=metric_name,
+             anomaly_type=anomaly_type,
+             detection_method="compare_rate_to_median",
+             baseline_value=round(baseline_value, 4),
+             observed_value=round(observed_value, 4),
+             delta_value=delta_value,
+             severity=self._severity(abs(delta_value), medium=4.0, high=8.0, critical=12.0),
+         )
+
+     def _impossible_issues(self, row: MetricRecord, scope: str) -> list[dict[str, Any]]:
+         issues = []
+         for numerator, denominator in FUNNEL_STEPS:
+             numerator_value = getattr(row, numerator)
+             denominator_value = getattr(row, denominator)
+             if numerator_value > denominator_value:
+                 issues.append(
+                     {
+                         "scope": scope,
+                         "entity_name": f"{numerator}_lte_{denominator}",
+                         "numerator": numerator_value,
+                         "denominator": denominator_value,
+                         "excess_value": round(float(numerator_value - denominator_value), 4),
+                     }
+                 )
+         return issues
+
+     def _median_daytime_share(self) -> float:
+         shares = []
+         for date in self._dates:
+             hourly_data = self.hourly_rows_for_date(date)
+             shares.append(hourly_data["summary"]["daytime_share"])
+         return round(median(shares), 4) if shares else 0.0
+
+     @staticmethod
+     def _ratio(numerator: int, denominator: int) -> float:
+         if denominator <= 0:
+             return 0.0
+         return numerator / denominator
+
+     def _rate_for_record(
+         self,
+         record: MetricRecord,
+         definition: ConversionMetricDefinition,
+     ) -> float:
+         return self._ratio(
+             getattr(record, definition.numerator),
+             getattr(record, definition.denominator),
+         ) * 100.0
+
+     def _rate_threshold(self) -> float:
+         difficulty = (self._context.config or {}).get("difficulty", "medium")
+         return {"easy": 6.0, "medium": 4.5, "hard": 3.0}.get(difficulty, 4.5)
+
+     def _funnel_threshold(self) -> float:
+         difficulty = (self._context.config or {}).get("difficulty", "medium")
+         return {"easy": 7.0, "medium": 5.0, "hard": 3.5}.get(difficulty, 5.0)
+
+     def _count_threshold_fraction(self) -> float:
+         difficulty = (self._context.config or {}).get("difficulty", "medium")
+         return {"easy": 0.22, "medium": 0.15, "hard": 0.10}.get(difficulty, 0.15)
+
+     @staticmethod
+     def _severity(value: float, *, medium: float, high: float, critical: float) -> str:
+         if value >= critical:
+             return "critical"
+         if value >= high:
+             return "high"
+         if value >= medium:
+             return "medium"
+         return "low"
+
+
+ def preview_submission_rows(
+     rows: list[dict[str, Any]] | list[MetricSubmissionRow],
+ ) -> SubmissionPreview:
+     """Validate submission rows without using ground truth."""
+     normalized_rows: list[MetricSubmissionRow] = []
+     issues: list[SubmissionIssue] = []
+     seen: set[str] = set()
+     duplicate_rows = 0
+     invalid_rows = 0
+
+     for index, row in enumerate(rows):
+         try:
+             normalized = row if isinstance(row, MetricSubmissionRow) else MetricSubmissionRow(**row)
+         except Exception as exc:
+             invalid_rows += 1
+             issues.append(
+                 SubmissionIssue(
+                     row_key=f"row_{index}",
+                     issue_type="invalid_row",
+                     message=f"Row could not be parsed: {exc}",
+                     submitted_row=row if isinstance(row, dict) else None,
+                 )
+             )
+             continue
+
+         row_key = submission_row_key(normalized)
+         if row_key in seen:
+             duplicate_rows += 1
+             issues.append(
+                 SubmissionIssue(
+                     row_key=row_key,
+                     issue_type="duplicate_row",
+                     message="Duplicate date/entity row detected.",
+                     submitted_row=normalized.model_dump(),
+                 )
+             )
+             continue
+
+         seen.add(row_key)
+         normalized_rows.append(normalized)
+
+     return SubmissionPreview(
+         valid_rows=len(normalized_rows),
+         invalid_rows=invalid_rows,
+         duplicate_rows=duplicate_rows,
+         unique_keys=len(seen),
+         issues=issues,
+         normalized_rows=normalized_rows,
+     )
+
+
+ def submission_row_key(row: MetricSubmissionRow) -> str:
+     """Stable row key for matching submissions and expectations."""
+     return f"{row.date}|{row.entity_type}|{row.entity_name}"
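For orientation, this is how the preview helper above behaves on a small payload. A minimal sketch, assuming the repository root is importable as the `metric_tracker_rl` package (the `pyproject.toml` below maps it that way); the row values are made up for illustration:

```python
from metric_tracker_rl.analysis_tools import preview_submission_rows, submission_row_key

# Two copies of the same date/entity row: the second is flagged as a duplicate.
rows = [
    {
        "date": "2024-05-01",
        "entity_type": "conversion_rate",
        "entity_name": "app_open_to_order_placed",
        "anomaly_type": "rate_drop_from_median",
        "detection_method": "compare_rate_to_median",
        "baseline_value": 12.5,
        "observed_value": 6.1,
        "delta_value": -6.4,
        "severity": "medium",
    },
] * 2

preview = preview_submission_rows(rows)
print(preview.valid_rows)      # 1
print(preview.duplicate_rows)  # 1
print(submission_row_key(preview.normalized_rows[0]))
# 2024-05-01|conversion_rate|app_open_to_order_placed
```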
client.py ADDED
@@ -0,0 +1,35 @@
+ """Client for the metric tracker RL environment."""
+
+ from typing import Dict
+
+ from openenv.core import EnvClient
+ from openenv.core.client_types import StepResult
+ from openenv.core.env_server.types import State
+
+ from .models import MetricTrackerRlAction, MetricTrackerRlObservation
+
+
+ class MetricTrackerRlEnv(
+     EnvClient[MetricTrackerRlAction, MetricTrackerRlObservation, State]
+ ):
+     """Typed client for the metric tracking environment."""
+
+     def _step_payload(self, action: MetricTrackerRlAction) -> Dict:
+         """Serialize the action as JSON for the environment server."""
+         return action.model_dump()
+
+     def _parse_result(self, payload: Dict) -> StepResult[MetricTrackerRlObservation]:
+         """Parse environment responses into a typed observation."""
+         observation = MetricTrackerRlObservation(**payload.get("observation", {}))
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: Dict) -> State:
+         """Parse environment state payloads."""
+         return State(
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+         )
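A minimal sketch of driving this client against a running server, matching how `inference.py` below uses it; the URL is a placeholder, and `reset`/`step`/`close` are awaited as in that script:

```python
import asyncio

from metric_tracker_rl.client import MetricTrackerRlEnv
from metric_tracker_rl.models import MetricTrackerRlAction


async def main() -> None:
    env = MetricTrackerRlEnv(base_url="http://localhost:8000")  # placeholder URL
    reset_result = await env.reset()
    print(reset_result.observation.instruction)

    # Run one safe analysis method instead of grading a submission.
    step = await env.step(MetricTrackerRlAction(analysis_method="list_dates"))
    print(step.observation.analysis_result)

    await env.close()


asyncio.run(main())
```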
evaluation.py ADDED
@@ -0,0 +1,256 @@
+ """Deterministic grading for the metric tracker RL environment."""
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+
+ try:
+     from .analysis_tools import preview_submission_rows, submission_row_key
+     from .models import MetricSubmissionRow, RewardBreakdown, SubmissionIssue, SubmissionPreview
+ except ImportError:
+     from analysis_tools import preview_submission_rows, submission_row_key
+     from models import MetricSubmissionRow, RewardBreakdown, SubmissionIssue, SubmissionPreview
+
+
+ @dataclass(frozen=True)
+ class EvaluationConfig:
+     """Tunable parameters for deterministic scoring."""
+
+     value_tolerance: float = 0.06
+     delta_tolerance: float = 0.06
+     precision_weight: float = 0.30
+     recall_weight: float = 0.30
+     anomaly_type_weight: float = 0.12
+     detection_method_weight: float = 0.10
+     value_weight: float = 0.12
+     severity_weight: float = 0.06
+     extra_row_penalty: float = 0.03
+     duplicate_row_penalty: float = 0.04
+     invalid_row_penalty: float = 0.05
+     exploit_row_multiplier: float = 3.0
+     exploit_penalty: float = 0.15
+
+
+ @dataclass
+ class EvaluationResult:
+     """Complete scoring result."""
+
+     preview: SubmissionPreview
+     issues: list[SubmissionIssue]
+     reward_breakdown: RewardBreakdown
+     matched_rows: int
+     is_perfect: bool
+
+
+ def evaluate_submission(
+     submitted_rows: list[dict] | list[MetricSubmissionRow],
+     expected_rows: list[MetricSubmissionRow],
+     config: EvaluationConfig | None = None,
+     *,
+     include_debug_expected: bool = False,
+ ) -> EvaluationResult:
+     """Grade one submission against deterministic expectations."""
+     cfg = config or EvaluationConfig()
+     preview = preview_submission_rows(submitted_rows)
+     expected_map = {submission_row_key(row): row for row in expected_rows}
+     submitted_map = {submission_row_key(row): row for row in preview.normalized_rows}
+
+     issues = list(preview.issues)
+     matched_keys = [key for key in submitted_map if key in expected_map]
+     extra_keys = [key for key in submitted_map if key not in expected_map]
+     missing_keys = [key for key in expected_map if key not in submitted_map]
+
+     anomaly_type_hits = 0
+     detection_method_hits = 0
+     value_hits = 0.0
+     severity_hits = 0
+
+     for key in matched_keys:
+         submitted = submitted_map[key]
+         expected = expected_map[key]
+         field_issues = _field_issues(submitted, expected, cfg, include_debug_expected)
+         issues.extend(field_issues)
+         if submitted.anomaly_type == expected.anomaly_type:
+             anomaly_type_hits += 1
+         if submitted.detection_method == expected.detection_method:
+             detection_method_hits += 1
+         value_hits += _value_match_score(submitted, expected, cfg)
+         if submitted.severity == expected.severity:
+             severity_hits += 1
+
+     for key in extra_keys:
+         submitted = submitted_map[key]
+         issues.append(
+             SubmissionIssue(
+                 row_key=key,
+                 issue_type="extra_row",
+                 message="Row is not expected for this episode.",
+                 submitted_row=submitted.model_dump(),
+                 expected_row=None,
+             )
+         )
+
+     for key in missing_keys:
+         expected = expected_map[key]
+         issues.append(
+             SubmissionIssue(
+                 row_key=key,
+                 issue_type="missing_row",
+                 message="Expected anomaly row is missing from the submission.",
+                 submitted_row=None,
+                 expected_row=expected.model_dump() if include_debug_expected else None,
+             )
+         )
+
+     valid_submitted = len(preview.normalized_rows)
+     matched_count = len(matched_keys)
+     expected_count = len(expected_rows)
+     precision = matched_count / valid_submitted if valid_submitted else 0.0
+     recall = matched_count / expected_count if expected_count else 1.0
+     denominator = max(matched_count, 1)
+     anomaly_type_accuracy = anomaly_type_hits / denominator if matched_count else 0.0
+     detection_method_accuracy = detection_method_hits / denominator if matched_count else 0.0
+     value_accuracy = value_hits / denominator if matched_count else 0.0
+     severity_accuracy = severity_hits / denominator if matched_count else 0.0
+
+     extra_penalty = min(0.5, len(extra_keys) * cfg.extra_row_penalty)
+     duplicate_penalty = min(0.4, preview.duplicate_rows * cfg.duplicate_row_penalty)
+     invalid_penalty = min(0.4, preview.invalid_rows * cfg.invalid_row_penalty)
+     exploit_penalty = 0.0
+     exploit_limit = max(6, int(expected_count * cfg.exploit_row_multiplier))
+     if valid_submitted > exploit_limit:
+         exploit_penalty = cfg.exploit_penalty
+
+     total_score = (
+         precision * cfg.precision_weight
+         + recall * cfg.recall_weight
+         + anomaly_type_accuracy * cfg.anomaly_type_weight
+         + detection_method_accuracy * cfg.detection_method_weight
+         + value_accuracy * cfg.value_weight
+         + severity_accuracy * cfg.severity_weight
+         - extra_penalty
+         - duplicate_penalty
+         - invalid_penalty
+         - exploit_penalty
+     )
+     total_score = max(0.0, min(1.0, round(total_score, 6)))
+
+     breakdown = RewardBreakdown(
+         precision=round(precision, 6),
+         recall=round(recall, 6),
+         anomaly_type_accuracy=round(anomaly_type_accuracy, 6),
+         detection_method_accuracy=round(detection_method_accuracy, 6),
+         value_accuracy=round(value_accuracy, 6),
+         severity_accuracy=round(severity_accuracy, 6),
+         extra_row_penalty=round(extra_penalty, 6),
+         duplicate_penalty=round(duplicate_penalty, 6),
+         invalid_row_penalty=round(invalid_penalty, 6),
+         exploit_penalty=round(exploit_penalty, 6),
+         total_score=total_score,
+         matched_rows=matched_count,
+         expected_rows=expected_count,
+         submitted_rows=len(submitted_rows),
+         valid_submitted_rows=valid_submitted,
+         extra_rows=len(extra_keys),
+         duplicate_rows=preview.duplicate_rows,
+         invalid_rows=preview.invalid_rows,
+         missing_rows=len(missing_keys),
+     )
+     is_perfect = total_score >= 0.999999 and not issues
+     return EvaluationResult(
+         preview=preview,
+         issues=issues,
+         reward_breakdown=breakdown,
+         matched_rows=matched_count,
+         is_perfect=is_perfect,
+     )
+
+
+ def _field_issues(
+     submitted: MetricSubmissionRow,
+     expected: MetricSubmissionRow,
+     cfg: EvaluationConfig,
+     include_debug_expected: bool,
+ ) -> list[SubmissionIssue]:
+     issues: list[SubmissionIssue] = []
+     row_key = submission_row_key(expected)
+     expected_dump = expected.model_dump() if include_debug_expected else None
+     if submitted.anomaly_type != expected.anomaly_type:
+         issues.append(
+             SubmissionIssue(
+                 row_key=row_key,
+                 issue_type="wrong_anomaly_type",
+                 message=f"Expected anomaly_type={expected.anomaly_type}.",
+                 submitted_row=submitted.model_dump(),
+                 expected_row=expected_dump,
+             )
+         )
+     if submitted.detection_method != expected.detection_method:
+         issues.append(
+             SubmissionIssue(
+                 row_key=row_key,
+                 issue_type="wrong_detection_method",
+                 message=f"Expected detection_method={expected.detection_method}.",
+                 submitted_row=submitted.model_dump(),
+                 expected_row=expected_dump,
+             )
+         )
+     if not _close(submitted.baseline_value, expected.baseline_value, cfg.value_tolerance):
+         issues.append(
+             SubmissionIssue(
+                 row_key=row_key,
+                 issue_type="wrong_baseline_value",
+                 message="Baseline value is outside tolerance.",
+                 submitted_row=submitted.model_dump(),
+                 expected_row=expected_dump,
+             )
+         )
+     if not _close(submitted.observed_value, expected.observed_value, cfg.value_tolerance):
+         issues.append(
+             SubmissionIssue(
+                 row_key=row_key,
+                 issue_type="wrong_observed_value",
+                 message="Observed value is outside tolerance.",
+                 submitted_row=submitted.model_dump(),
+                 expected_row=expected_dump,
+             )
+         )
+     if not _close(submitted.delta_value, expected.delta_value, cfg.delta_tolerance):
+         issues.append(
+             SubmissionIssue(
+                 row_key=row_key,
+                 issue_type="wrong_delta_value",
+                 message="Delta value is outside tolerance.",
+                 submitted_row=submitted.model_dump(),
+                 expected_row=expected_dump,
+             )
+         )
+     if submitted.severity != expected.severity:
+         issues.append(
+             SubmissionIssue(
+                 row_key=row_key,
+                 issue_type="wrong_severity",
+                 message=f"Expected severity={expected.severity}.",
+                 submitted_row=submitted.model_dump(),
+                 expected_row=expected_dump,
+             )
+         )
+     return issues
+
+
+ def _value_match_score(
+     submitted: MetricSubmissionRow,
+     expected: MetricSubmissionRow,
+     cfg: EvaluationConfig,
+ ) -> float:
+     checks = [
+         _close(submitted.baseline_value, expected.baseline_value, cfg.value_tolerance),
+         _close(submitted.observed_value, expected.observed_value, cfg.value_tolerance),
+         _close(submitted.delta_value, expected.delta_value, cfg.delta_tolerance),
+     ]
+     return sum(1.0 for ok in checks if ok) / len(checks)
+
+
+ def _close(submitted: float, expected: float, tolerance: float) -> bool:
+     allowed = max(tolerance, abs(expected) * tolerance)
+     return abs(submitted - expected) <= allowed
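Note that the six component weights sum to 0.30 + 0.30 + 0.12 + 0.10 + 0.12 + 0.06 = 1.00, so a submission that matches every expected row exactly and incurs no penalties scores exactly 1.0. A minimal sketch; the row values are illustrative, since real expected rows come from the environment's hidden seed:

```python
from metric_tracker_rl.evaluation import evaluate_submission
from metric_tracker_rl.models import MetricSubmissionRow

expected = [
    MetricSubmissionRow(
        date="2024-05-01",
        entity_type="event_count",
        entity_name="orders_placed",
        anomaly_type="absolute_drop_in_event_count",
        detection_method="compare_count_to_median",
        baseline_value=5200.0,
        observed_value=3100.0,
        delta_value=-2100.0,
        severity="critical",
    )
]

# Submitting the expected row verbatim matches every weighted component.
result = evaluate_submission([row.model_dump() for row in expected], expected)
print(result.reward_breakdown.total_score)  # 1.0
print(result.is_perfect)                    # True
```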
inference.py ADDED
@@ -0,0 +1,586 @@
+ """Tool-driven inference for the metric tracker RL environment."""
+
+ from __future__ import annotations
+
+ import asyncio
+ import json
+ import os
+ import textwrap
+ from dataclasses import dataclass, field
+ from typing import Any
+
+ from openai import APIStatusError, OpenAI
+
+ from metric_tracker_rl import DEFAULT_TASK_ORDER, MetricTrackerRlAction, MetricTrackerRlEnv, get_task_spec
+ from metric_tracker_rl.analysis_tools import available_analysis_methods
+ from metric_tracker_rl.models import (
+     MetricSubmissionRow,
+     MetricTrackerRlObservation,
+     PayloadGeneratorMethod,
+ )
+
+
+ IMAGE_NAME = os.getenv("IMAGE_NAME") or "metric_tracker:latest"
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY") or os.getenv("API_KEY")
+ API_BASE_URL = (
+     os.getenv("API_BASE_URL")
+     or os.getenv("OPENAI_BASE_URL")
+     or "https://router.huggingface.co/v1"
+ )
+ MODEL_NAME = os.getenv("MODEL_NAME") or os.getenv("OPENAI_MODEL") or "Qwen/Qwen2.5-72B-Instruct"
+ BASE_URL = os.getenv("BASE_URL")
+ TASK_NAME = os.getenv("MetricTrackerRl_TASK", "multi_task_agent_baseline")
+ BENCHMARK = os.getenv("MetricTrackerRl_BENCHMARK", "metric_tracker_rl")
+ TEMPERATURE = float(os.getenv("TEMPERATURE", "0"))
+ MAX_TOKENS = min(int(os.getenv("MAX_TOKENS", "1000")), 4096)
+ MAX_TOOL_ROUNDS = int(os.getenv("MAX_TOOL_ROUNDS", "16"))
+
+ SYSTEM_PROMPT = textwrap.dedent(
+     """
+     You are solving a multi-anomaly analytics benchmark with tool use.
+
+     Rules:
+     - Use only the shared safe analysis methods.
+     - Do not request full hidden answers or assume direct access to ground truth.
+     - Prefer declarative payload generators over manual row construction.
+     - Start from the default reset observation only.
+     - Start by trying `get_median_filter_rows` across different metrics to learn which metrics produce useful anomaly rows.
+     - Compare candidate metrics, then refine with raw-data inspection and median/std methods only when needed.
+     - Prefer: task_overview -> get_median_filter_rows on several metrics -> compare useful results -> payload_generator -> submit_payload_generator.
+     - Keep notes brief and factual.
+     """
+ ).strip()
+
+
+ @dataclass
+ class ToolRuntimeState:
+     """Mutable state shared across tool calls."""
+
+     method_log: list[dict[str, Any]] = field(default_factory=list)
+     last_preview: dict[str, Any] | None = None
+
+
+ def log_start(task: str, env: str, model: str) -> None:
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_method(tool_name: str, arguments: dict[str, Any], note: str) -> None:
+     print(
+         f"[METHOD] name={tool_name} args={json.dumps(arguments, sort_keys=True)} why={note}",
+         flush=True,
+     )
+
+
+ def log_payload_generator_methods(tool_name: str, generator_methods: list[dict[str, Any]]) -> None:
+     print(
+         f"[PAYLOAD_GENERATOR_METHODS] source={tool_name} methods={json.dumps(generator_methods, sort_keys=True)}",
+         flush=True,
+     )
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: str | None) -> None:
+     error_val = error if error else "null"
+     print(
+         f"[STEP] step={step} action={action} reward={reward:.3f} done={str(done).lower()} error={error_val}",
+         flush=True,
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, method_log: list[dict[str, Any]]) -> None:
+     print(
+         f"[END] success={str(success).lower()} steps={steps} score={score:.3f} methods={len(method_log)}",
+         flush=True,
+     )
+     print(json.dumps({"method_log": method_log}, indent=2), flush=True)
+
+
+ def log_task_boundary(task_id: str, difficulty: str, phase: str) -> None:
+     print(f"[TASK_{phase}] task_id={task_id} difficulty={difficulty}", flush=True)
+
+
+ def tool_schemas() -> list[dict[str, Any]]:
+     """OpenAI-compatible tool definitions."""
+     shared_schemas = []
+     for spec in available_analysis_methods():
+         properties = {}
+         required = []
+         if spec.name in {"rows_for_date", "hourly_rows_for_date", "detect_funnel_break", "check_impossible_counts"}:
+             properties = {"date": {"type": "string"}}
+             required = ["date"]
+         elif spec.name in {"compare_rate_to_median", "compare_count_to_median"}:
+             properties = {
+                 "date": {"type": "string"},
+                 "entity_name": {"type": "string"},
+             }
+             required = ["date", "entity_name"]
+         elif spec.name == "list_suspicious_dates":
+             properties = {"limit": {"type": "integer", "default": 10}}
+         elif spec.name == "preview_submission":
+             properties = {
+                 "rows": {
+                     "type": "array",
+                     "items": {"type": "object"},
+                 }
+             }
+         elif spec.name == "show_raw_data":
+             properties = {"limit": {"type": "integer", "default": 5}}
+         elif spec.name in {"get_metric_median", "get_metric_std_dev_from_median"}:
+             properties = {
+                 "metric_name": {"type": "string"},
+                 "metric_names": {"type": "array", "items": {"type": "string"}},
+             }
+         elif spec.name == "get_rows_with_abs_diff_from_median_gt":
+             properties = {
+                 "metric_name": {"type": "string"},
+                 "metric_names": {"type": "array", "items": {"type": "string"}},
+                 "threshold": {"type": "number"},
+             }
+             required = ["threshold"]
+         elif spec.name in {
+             "get_median_filter_rows",
+             "get_rate_drop_from_median_rows",
+             "get_rate_spike_from_median_rows",
+             "get_absolute_drop_in_event_count_rows",
+             "get_absolute_spike_in_event_count_rows",
+         }:
+             properties = {
+                 "metric_name": {"type": "string"},
+                 "metric_names": {"type": "array", "items": {"type": "string"}},
+                 "threshold_multiplier": {"type": "number"},
+             }
+             required = ["threshold_multiplier"]
+         elif spec.name in {
+             "get_funnel_break_rows",
+             "get_hourly_traffic_mix_shift_rows",
+             "get_instrumentation_data_quality_issue_rows",
+         }:
+             properties = {
+                 "threshold_multiplier": {"type": "number"},
+             }
+             required = ["threshold_multiplier"]
+         elif spec.name == "payload_generator":
+             properties = {
+                 "generator_methods": {
+                     "type": "array",
+                     "items": {"type": "object"},
+                 }
+             }
+             required = ["generator_methods"]
+         shared_schemas.append(
+             {
+                 "type": "function",
+                 "function": {
+                     "name": spec.name,
+                     "description": spec.description,
+                     "parameters": {
+                         "type": "object",
+                         "properties": properties,
+                         "required": required,
+                         "additionalProperties": False,
+                     },
+                 },
+             }
+         )
+     shared_schemas.append(
+         {
+             "type": "function",
+             "function": {
+                 "name": "submit_payload_generator",
+                 "description": "Submit declarative payload generator methods for environment-side payload generation and grading.",
+                 "parameters": {
+                     "type": "object",
+                     "properties": {
+                         "generator_methods": {
+                             "type": "array",
+                             "items": {"type": "object"},
+                         }
+                     },
+                     "required": ["generator_methods"],
+                     "additionalProperties": False,
+                 },
+             },
+         }
+     )
+     shared_schemas.append(
+         {
+             "type": "function",
+             "function": {
+                 "name": "submit_solution",
+                 "description": "Submit the final anomaly payload to the environment.",
+                 "parameters": {
+                     "type": "object",
+                     "properties": {
+                         "rows": {
+                             "type": "array",
+                             "items": {"type": "object"},
+                         }
+                     },
+                     "required": ["rows"],
+                     "additionalProperties": False,
+                 },
+             },
+         }
+     )
+     return shared_schemas
+
+
+ def build_initial_user_prompt(observation: MetricTrackerRlObservation) -> str:
+     return textwrap.dedent(
+         f"""
+         Solve the RL environment with tools.
+
+         Initial observation:
+         {json.dumps(observation.model_dump(exclude={"debug"}), indent=2)}
+
+         Prefer building a payload generator first, then submit it.
+         Start by calling `get_median_filter_rows` on several different metrics and see which ones return useful anomaly rows.
+         If a metric returns nothing or low-signal rows, try another metric.
+         For funnel, hourly mix, or data-quality tasks, use the family-specific generator methods instead.
+
+         Final payload rows use:
+         `date`, `entity_type`, `entity_name`, `anomaly_type`, `detection_method`,
+         `baseline_value`, `observed_value`, `delta_value`, `severity`.
+
+         Supported generator method example:
+         `{{"method_name":"get_median_filter_rows","threshold_multiplier":2.0}}`
+         or
+         `{{"method_name":"get_median_filter_rows","metric_names":["app_open_to_order_placed","orders_placed"],"threshold_multiplier":2.0}}`
+
+         Use shared analysis methods only. Prefer `submit_payload_generator` over `submit_solution`.
+         """
+     ).strip()
+
+
+ def create_chat_completion(client: OpenAI, **kwargs):
+     try:
+         return client.chat.completions.create(**kwargs)
+     except APIStatusError as exc:
+         if exc.status_code == 402:
+             raise RuntimeError(
+                 "The configured inference provider rejected the request with HTTP 402. "
+                 "Your Hugging Face router credits are depleted. Update `.env.inference` "
+                 "with a working provider/key, or switch `API_BASE_URL`/`MODEL_NAME`."
+             ) from exc
+         raise
+
+
+ def decode_arguments(raw_arguments: str | None) -> dict[str, Any]:
+     if not raw_arguments:
+         return {}
+     return json.loads(raw_arguments)
+
+
+ def preview_text(text: str, limit: int = 220) -> str:
+     return text.replace("\n", " ")[:limit]
+
+
+ async def connect_env() -> MetricTrackerRlEnv:
+     if BASE_URL:
+         return MetricTrackerRlEnv(base_url=BASE_URL)
+     return await MetricTrackerRlEnv.from_docker_image(IMAGE_NAME)
+
+
+ async def execute_tool_call(
+     env: MetricTrackerRlEnv,
+     observation: MetricTrackerRlObservation,
+     runtime_state: ToolRuntimeState,
+     tool_name: str,
+     arguments: dict[str, Any],
+ ) -> tuple[dict[str, Any], Any | None, MetricTrackerRlObservation]:
+     """Execute one model-requested tool locally."""
+     if tool_name == "submit_payload_generator":
+         methods = [
+             PayloadGeneratorMethod(**item)
+             for item in arguments.get("generator_methods", [])
+         ]
+         runtime_state.method_log.append(
+             {
+                 "tool_name": tool_name,
+                 "arguments": arguments,
+                 "generator_methods": [item.model_dump() for item in methods],
+                 "note": _tool_note(tool_name, arguments),
+             }
+         )
+         result = await env.step(MetricTrackerRlAction(payload_generators=methods))
+         return (
+             {
+                 "status": result.observation.status,
+                 "message": result.observation.message,
+                 "reward": result.reward,
+                 "done": result.done,
+                 "generated_rows": [row.model_dump() for row in result.observation.generated_rows],
+                 "submission_issues": [issue.model_dump() for issue in result.observation.submission_issues],
+                 "reward_breakdown": (
+                     result.observation.reward_breakdown.model_dump()
+                     if result.observation.reward_breakdown
+                     else None
+                 ),
+             },
+             result,
+             result.observation,
+         )
+     if tool_name == "submit_solution":
+         rows = [MetricSubmissionRow(**row) for row in arguments.get("rows", [])]
+         result = await env.step(MetricTrackerRlAction(classifications=rows))
+         return (
+             {
+                 "status": result.observation.status,
+                 "message": result.observation.message,
+                 "reward": result.reward,
+                 "done": result.done,
+                 "reward_breakdown": (
+                     result.observation.reward_breakdown.model_dump()
+                     if result.observation.reward_breakdown
+                     else None
+                 ),
+                 "issue_count": len(result.observation.submission_issues),
+                 "correct_row_count": result.observation.correct_row_count,
+             },
+             result,
+             result.observation,
+         )
+
+     result = await env.step(
+         MetricTrackerRlAction(
+             analysis_method=tool_name,
+             analysis_args=arguments,
+         )
+     )
+     output = result.observation.analysis_result or {
+         "method": tool_name,
+         "arguments": arguments,
+         "result": None,
+     }
+     log_arguments = {
+         "tool_name": tool_name,
+         "arguments": arguments,
+         "note": _tool_note(tool_name, arguments),
+     }
+     if tool_name == "payload_generator":
+         log_arguments["generator_methods"] = arguments.get("generator_methods", [])
+     runtime_state.method_log.append(
+         log_arguments
+     )
+     if tool_name == "preview_submission":
+         runtime_state.last_preview = output
+     return output, None, result.observation
+
+
+ def _tool_note(tool_name: str, arguments: dict[str, Any]) -> str:
+     notes = {
+         "task_overview": "bootstrap the task and payload schema",
+         "list_dates": "confirm the date range",
+         "list_entities": "confirm valid entities",
+         "rows_for_date": "inspect daily counts on one date",
+         "hourly_rows_for_date": "inspect hourly traffic shape",
+         "compare_rate_to_median": "check a conversion-rate anomaly against median baseline",
+         "compare_count_to_median": "check an absolute count anomaly against median baseline",
+         "detect_funnel_break": "test whether a funnel step is broken",
+         "check_impossible_counts": "test for instrumentation or impossible count issues",
+         "list_suspicious_dates": "prioritize dates worth deeper inspection",
+         "preview_submission": "validate payload structure before submit",
+         "show_raw_data": "inspect daily aggregate rows in head() form",
+         "get_metric_median": "measure a baseline median for one metric",
+         "get_metric_std_dev_from_median": "measure metric spread around the median",
+         "get_rows_with_abs_diff_from_median_gt": "inspect dates outside a chosen absolute-difference threshold",
+         "get_median_filter_rows": "generate candidate anomaly rows using median and std-from-median filtering",
+         "get_rate_drop_from_median_rows": "generate candidate conversion-rate drop rows using median and std-from-median filtering",
+         "get_rate_spike_from_median_rows": "generate candidate conversion-rate spike rows using median and std-from-median filtering",
+         "get_absolute_drop_in_event_count_rows": "generate candidate event-count drop rows using median and std-from-median filtering",
+         "get_absolute_spike_in_event_count_rows": "generate candidate event-count spike rows using median and std-from-median filtering",
+         "get_funnel_break_rows": "generate candidate funnel-break rows across funnel steps",
+         "get_hourly_traffic_mix_shift_rows": "generate candidate hourly traffic mix shift rows across dates",
+         "get_instrumentation_data_quality_issue_rows": "generate candidate impossible-count or instrumentation-issue rows across dates",
+         "payload_generator": "merge multiple generator methods into one candidate payload",
+         "submit_payload_generator": "submit generator methods for environment-side generation and grading",
+     }
+     return notes.get(tool_name, f"run {tool_name} with {arguments}")
+
+
+ async def run_agent_loop(
+     client: OpenAI,
+     env: MetricTrackerRlEnv,
+     observation: MetricTrackerRlObservation,
+ ) -> tuple[Any, str, int, list[dict[str, Any]]]:
+     """Run a tool-calling loop until the env is solved or the round limit is hit."""
+     runtime_state = ToolRuntimeState()
+     current_observation = observation
+     messages: list[dict[str, Any]] = [
+         {"role": "system", "content": SYSTEM_PROMPT},
+         {"role": "user", "content": build_initial_user_prompt(current_observation)},
+     ]
+     last_result = None
+     final_text = ""
+     tool_rounds = 0
+
+     for _ in range(MAX_TOOL_ROUNDS):
+         completion = create_chat_completion(
+             client,
+             model=MODEL_NAME,
+             messages=messages,
+             tools=tool_schemas(),
+             tool_choice="auto",
+             temperature=TEMPERATURE,
+             max_tokens=MAX_TOKENS,
+             stream=False,
+         )
+         message = completion.choices[0].message
+         assistant_payload: dict[str, Any] = {
+             "role": "assistant",
+             "content": message.content or "",
+         }
+         if message.tool_calls:
+             assistant_payload["tool_calls"] = [
+                 {
+                     "id": tool_call.id,
+                     "type": tool_call.type,
+                     "function": {
+                         "name": tool_call.function.name,
+                         "arguments": tool_call.function.arguments,
+                     },
+                 }
+                 for tool_call in message.tool_calls
+             ]
+         messages.append(assistant_payload)
+
+         if not message.tool_calls:
+             final_text = (message.content or "").strip()
+             break
+
+         tool_rounds += 1
+         for tool_call in message.tool_calls:
+             tool_name = tool_call.function.name
+             arguments = decode_arguments(tool_call.function.arguments)
+             if tool_name != "submit_solution":
+                 log_method(tool_name, arguments, _tool_note(tool_name, arguments))
+             if tool_name in {"payload_generator", "submit_payload_generator"}:
+                 log_payload_generator_methods(
+                     tool_name,
+                     arguments.get("generator_methods", []),
+                 )
+             tool_output, maybe_result, current_observation = await execute_tool_call(
+                 env,
+                 current_observation,
+                 runtime_state,
+                 tool_name,
+                 arguments,
+             )
+             messages.append(
+                 {
+                     "role": "tool",
+                     "tool_call_id": tool_call.id,
+                     "content": json.dumps(tool_output),
+                 }
+             )
+             if maybe_result is not None:
+                 last_result = maybe_result
+
+         if last_result is not None:
+             completion = create_chat_completion(
+                 client,
+                 model=MODEL_NAME,
+                 messages=messages,
+                 temperature=TEMPERATURE,
+                 max_tokens=MAX_TOKENS,
+                 stream=False,
+             )
+             final_text = (completion.choices[0].message.content or "").strip()
+             break
+
+     return last_result, final_text, tool_rounds, runtime_state.method_log
+
+
+ async def run_single_task(
+     client: OpenAI,
+     env: MetricTrackerRlEnv,
+     task_id: str,
+ ) -> dict[str, Any]:
+     """Run one named benchmark task and return a reproducible summary."""
+     task_spec = get_task_spec(task_id)
+     log_task_boundary(task_spec.task_id, task_spec.difficulty, "START")
+     reset_result = await env.reset(task_id=task_spec.task_id)
+     final_result, final_text, tool_rounds, method_log = await run_agent_loop(
+         client,
+         env,
+         reset_result.observation,
+     )
+     if final_result is None:
+         raise RuntimeError(f"The model never submitted a graded action for task `{task_spec.task_id}`.")
+
+     reward = float(final_result.reward or 0.0)
+     success = bool(final_result.done and reward >= 0.999999)
+     log_step(
+         step=1,
+         action=preview_text(final_text or "graded_submission"),
+         reward=reward,
+         done=bool(final_result.done),
+         error=None,
+     )
+     log_end(success=success, steps=1, score=reward, method_log=method_log)
+     log_task_boundary(task_spec.task_id, task_spec.difficulty, "END")
+     return {
+         "task_id": task_spec.task_id,
+         "difficulty": task_spec.difficulty,
+         "objective": task_spec.objective,
+         "grader_name": task_spec.grader_name,
+         "normalized_score": max(0.0, min(1.0, reward)),
+         "done": final_result.done,
+         "success": success,
+         "final_status": final_result.observation.status,
+         "final_message": final_result.observation.message,
+         "issue_count": len(final_result.observation.submission_issues),
+         "correct_row_count": final_result.observation.correct_row_count,
+         "expected_row_count": final_result.observation.expected_row_count,
+         "tool_rounds": tool_rounds,
+         "assistant_summary": final_text,
+         "reward_breakdown": (
+             final_result.observation.reward_breakdown.model_dump()
+             if final_result.observation.reward_breakdown
+             else None
+         ),
+     }
+
+
+ async def main() -> None:
+     if not API_KEY:
+         raise RuntimeError("Set OPENAI_API_KEY, HF_TOKEN, or API_KEY.")
+
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+     env = await connect_env()
+     task_summaries: list[dict[str, Any]] = []
+
+     log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
+
+     try:
+         for task_id in DEFAULT_TASK_ORDER:
+             task_summaries.append(await run_single_task(client, env, task_id))
+     finally:
+         try:
+             await env.close()
+         except Exception:
+             pass
+
+     average_score = (
+         round(sum(item["normalized_score"] for item in task_summaries) / len(task_summaries), 6)
+         if task_summaries
+         else 0.0
+     )
+     print(
+         json.dumps(
+             {
+                 "benchmark": BENCHMARK,
+                 "model": MODEL_NAME,
+                 "task_count": len(task_summaries),
+                 "task_ids": [item["task_id"] for item in task_summaries],
+                 "average_score": average_score,
+                 "successful_tasks": sum(1 for item in task_summaries if item["success"]),
+                 "tasks": task_summaries,
+             },
+             indent=2,
+         ),
+         flush=True,
+     )
+
+
+ if __name__ == "__main__":
+     asyncio.run(main())
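The entry point above is configured entirely through environment variables that are read at import time. A minimal sketch of driving it against an already-running server; the token value is a placeholder, and setting `BASE_URL` makes `connect_env` skip the Docker path:

```python
# sketch_run_inference.py -- illustrative only; the token is a placeholder.
import asyncio
import os

# inference.py reads these at import time, so set them before importing it.
os.environ["HF_TOKEN"] = "hf_xxx_placeholder"
os.environ["BASE_URL"] = "http://localhost:8000"  # reuse a running server instead of Docker
os.environ["MAX_TOOL_ROUNDS"] = "8"

import inference  # noqa: E402

asyncio.run(inference.main())
```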
models.py ADDED
@@ -0,0 +1,324 @@
+ """Data models for the metric tracker RL environment."""
+
+ from __future__ import annotations
+
+ from typing import Any, Literal
+
+ from pydantic import BaseModel, Field
+
+ from openenv.core.env_server.types import Action, Observation
+
+
+ class MetricRecord(BaseModel):
+     """Hourly or daily aggregate metrics for the app funnel."""
+
+     date: str = Field(..., description="ISO date in YYYY-MM-DD format.")
+     hour: int | None = Field(
+         default=None,
+         description="Hour bucket in 24h format. Null for daily aggregates.",
+     )
+     app_opens: int = Field(default=0, description="Count of app_open events.")
+     menu_opens: int = Field(default=0, description="Count of menu_open events.")
+     product_added_to_cart: int = Field(
+         default=0,
+         description="Count of product_added_to_cart events.",
+     )
+     orders_placed: int = Field(default=0, description="Count of order_placed events.")
+     payment_successful: int = Field(
+         default=0,
+         description="Count of payment_successful events.",
+     )
+
+
+ class ConversionMetricDefinition(BaseModel):
+     """Definition for a conversion metric that the agent can cite."""
+
+     name: str = Field(..., description="Stable conversion metric identifier.")
+     numerator: str = Field(..., description="Numerator event.")
+     denominator: str = Field(..., description="Denominator event.")
+     description: str = Field(..., description="Human-readable formula.")
+
+
+ class MethodSpec(BaseModel):
+     """Description of a shared safe analysis method."""
+
+     name: str = Field(..., description="Method name.")
+     description: str = Field(..., description="What the method does.")
+     parameters: list[str] = Field(
+         default_factory=list,
+         description="Ordered parameter names for the method.",
+     )
+
+
+ class MetricSubmissionRow(BaseModel):
+     """Submitted anomaly row."""
+
+     date: str = Field(..., description="ISO date in YYYY-MM-DD format.")
+     entity_type: str = Field(
+         ...,
+         description=(
+             "Stable entity family such as conversion_rate, event_count, funnel_step, "
+             "hourly_mix, or data_quality."
+         ),
+     )
+     entity_name: str = Field(..., description="Stable entity identifier.")
+     anomaly_type: str = Field(..., description="Stable anomaly type identifier.")
+     detection_method: str = Field(..., description="Shared analysis method used.")
+     baseline_value: float = Field(..., description="Reference baseline value.")
+     observed_value: float = Field(..., description="Observed anomalous value.")
+     delta_value: float = Field(..., description="Observed minus baseline.")
+     severity: Literal["low", "medium", "high", "critical"] = Field(
+         ...,
+         description="Severity label.",
+     )
+
+
+ class PayloadGeneratorMethod(BaseModel):
+     """A declarative payload generation method."""
+
+     method_name: str = Field(
+         ...,
+         description="Generator method name, for example get_median_filter_rows.",
+     )
+     metric_name: str | None = Field(
+         default=None,
+         description="Single count metric or conversion metric name. Optional.",
+     )
+     metric_names: list[str] = Field(
+         default_factory=list,
+         description="Optional list of metrics to run. Empty means all metrics.",
+     )
+     threshold_multiplier: float = Field(
+         ...,
+         description="Multiplier applied to the metric std-from-median value.",
+     )
+
+
+ class SyntheticAnomalyGenerator(BaseModel):
+     """A declarative reset-time synthetic anomaly generator."""
+
+     method_name: str = Field(
+         default="metric_stddev_shift",
+         description="Synthetic generator method name.",
+     )
+     metric_name: str | None = Field(
+         default=None,
+         description="Single count metric or conversion metric name. Optional.",
+     )
+     metric_names: list[str] = Field(
+         default_factory=list,
+         description="Optional list of metrics to generate on. Empty means use metric_name.",
+     )
+     date: str | None = Field(
+         default=None,
+         description="Single ISO date to inject on. Optional.",
+     )
+     dates: list[str] = Field(
+         default_factory=list,
+         description="Optional list of ISO dates to inject on.",
+     )
+     stddev_factor: float = Field(
+         default=2.0,
+         description="Multiplier applied to std_dev_from_median when creating the target value.",
+     )
+     direction: Literal["up", "down", "auto"] = Field(
+         default="auto",
+         description="Whether to shift the metric upward or downward.",
+     )
+
+
+ class SyntheticGeneratorApplication(BaseModel):
+     """Resolved synthetic generator application used for the active episode."""
+
+     method_name: str = Field(..., description="Synthetic generator method used.")
+     date: str = Field(..., description="ISO date the generator was applied to.")
+     metric_name: str = Field(..., description="Metric name used by the generator.")
+     metric_type: Literal["event_count", "conversion_rate"] = Field(
+         ...,
+         description="Resolved metric family.",
+     )
+     direction: Literal["up", "down"] = Field(..., description="Resolved direction.")
+     anomaly_type: str = Field(..., description="Expected anomaly type generated.")
+     detection_method: str = Field(..., description="Shared analysis method that should detect it.")
+     baseline_value: float = Field(..., description="Median baseline used during generation.")
+     pre_applied_value: float = Field(..., description="Metric value before generation.")
+     std_dev_from_median: float = Field(..., description="Std-from-median used during generation.")
+     stddev_factor: float = Field(..., description="Configured stddev factor.")
+     threshold_value: float = Field(..., description="stddev_factor * std_dev_from_median.")
+     target_value: float = Field(..., description="Requested target value before rebalancing.")
+     actual_value: float = Field(..., description="Observed value after generation.")
+     formula: str = Field(..., description="Human-readable formula used for generation.")
+
+
+ class SubmissionIssue(BaseModel):
+     """Feedback about a submitted row or missing expectation."""
+
+     row_key: str = Field(..., description="Stable key in date|entity_type|entity_name form.")
+     issue_type: str = Field(..., description="Issue class.")
+     message: str = Field(..., description="Human-readable explanation.")
+     submitted_row: dict[str, Any] | None = Field(
+         default=None,
+         description="Submitted row fragment when relevant.",
+     )
+     expected_row: dict[str, Any] | None = Field(
+         default=None,
+         description="Expected row fragment when debug is enabled.",
+     )
+
+
+ class RewardBreakdown(BaseModel):
+     """Deterministic grading components."""
+
+     precision: float = 0.0
+     recall: float = 0.0
+     anomaly_type_accuracy: float = 0.0
+     detection_method_accuracy: float = 0.0
+     value_accuracy: float = 0.0
+     severity_accuracy: float = 0.0
+     extra_row_penalty: float = 0.0
+     duplicate_penalty: float = 0.0
+     invalid_row_penalty: float = 0.0
+     exploit_penalty: float = 0.0
+     total_score: float = 0.0
+     matched_rows: int = 0
+     expected_rows: int = 0
+     submitted_rows: int = 0
+     valid_submitted_rows: int = 0
+     extra_rows: int = 0
+     duplicate_rows: int = 0
+     invalid_rows: int = 0
+     missing_rows: int = 0
+
+
+ class SubmissionPreview(BaseModel):
+     """Safe preview of a candidate submission before grading."""
+
+     valid_rows: int = 0
+     invalid_rows: int = 0
+     duplicate_rows: int = 0
+     unique_keys: int = 0
+     issues: list[SubmissionIssue] = Field(default_factory=list)
+     normalized_rows: list[MetricSubmissionRow] = Field(default_factory=list)
+
+
+ class BenchmarkTaskSpec(BaseModel):
+     """Public metadata for a benchmark task."""
+
+     task_id: str = Field(..., description="Stable benchmark task identifier.")
+     difficulty: Literal["easy", "medium", "hard"] = Field(
+         ...,
+         description="Canonical task difficulty.",
+     )
+     instruction: str = Field(..., description="Task instruction shown to the agent.")
+     objective: str = Field(..., description="Concrete success objective.")
+     scenario_family: str = Field(..., description="Scenario family used to generate the task episode.")
+     anomaly_density: str = Field(..., description="Relative anomaly density for the task episode.")
+     anomaly_count: int = Field(..., description="Number of anomalous rows expected for the task.")
+     grader_name: str = Field(..., description="Programmatic grader used for the task.")
+
+
+ class MetricTrackerRlAction(Action):
+     """Submitted anomaly payload for the current episode."""
+
+     classifications: list[MetricSubmissionRow] = Field(
+         default_factory=list,
+         description="Submitted anomaly rows for the dataset.",
+     )
+     analysis_method: str | None = Field(
+         default=None,
+         description="Optional shared analysis method to call instead of grading a submission.",
+     )
+     analysis_args: dict[str, Any] = Field(
+         default_factory=dict,
+         description="Arguments for the selected analysis method.",
+     )
+     payload_generators: list[PayloadGeneratorMethod] = Field(
+         default_factory=list,
+         description="Declarative payload generation methods to run inside the environment.",
+     )
+
+
+ class MetricTrackerRlObservation(Observation):
+     """Observation containing the dataset and analysis surface."""
+
+     task_id: str = Field(
+         default="",
+         description="Stable identifier for the active benchmark task.",
+     )
+     status: str = Field(
+         default="ready",
+         description="Episode status: ready, in_progress, evaluated, or completed.",
+     )
+     message: str = Field(default="", description="Human-readable environment feedback.")
+     instruction: str = Field(
+         default="",
+         description="Task presented to the model for the current episode.",
+     )
+     conversion_metric_definitions: list[ConversionMetricDefinition] = Field(
+         default_factory=list,
+         description="Conversion formulas the model may cite.",
+     )
+     available_synthetic_generator_methods: list[MethodSpec] = Field(
+         default_factory=list,
+         description="Reset-time synthetic generator methods available for seeded data creation.",
+     )
+     applied_synthetic_generators: list[SyntheticGeneratorApplication] = Field(
+         default_factory=list,
+         description="Resolved synthetic generator applications used for the active episode.",
+     )
+     available_methods: list[MethodSpec] = Field(
+         default_factory=list,
+         description="Safe shared analysis methods available to agents and humans.",
+     )
+     available_tasks: list[BenchmarkTaskSpec] = Field(
+         default_factory=list,
+         description="Catalog of benchmark tasks available in this environment.",
+     )
+     daily_metrics: list[MetricRecord] = Field(
+         default_factory=list,
+         description="Deprecated raw daily data field. Kept empty in standard mode.",
+     )
+     hourly_metrics: list[MetricRecord] = Field(
+         default_factory=list,
+         description="Deprecated raw hourly data field. Kept empty in standard mode.",
+     )
+     analysis_result: dict[str, Any] | None = Field(
+         default=None,
+         description="Result of the latest analysis-method call.",
+     )
+     generated_rows: list[MetricSubmissionRow] = Field(
+         default_factory=list,
+         description="Rows generated from payload generator methods, if used.",
+     )
+     submitted_rows: list[MetricSubmissionRow] = Field(
+         default_factory=list,
+         description="Most recent submitted anomaly rows.",
+     )
+     submission_preview: SubmissionPreview | None = Field(
+         default=None,
+         description="Safe preview information for the latest submitted payload.",
+     )
+     submission_issues: list[SubmissionIssue] = Field(
+         default_factory=list,
+         description="Feedback for the latest submitted payload.",
+     )
+     reward_breakdown: RewardBreakdown | None = Field(
+         default=None,
+         description="Deterministic reward components for the latest step.",
+     )
+     expected_row_count: int = Field(
+         default=0,
+         description="Number of expected anomaly rows in the current episode.",
+     )
+     correct_row_count: int = Field(
+         default=0,
+         description="Number of matched anomaly rows in the latest step.",
+     )
+     config: dict[str, Any] = Field(
+         default_factory=dict,
+         description="Episode configuration visible in standard mode.",
+     )
+     debug: dict[str, Any] | None = Field(
+         default=None,
+         description="Developer-only debug payload. Hidden in standard mode.",
+     )
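To make the action surface concrete, here is a minimal sketch of building a declarative payload-generator action from these models; the metric names mirror the examples used in `inference.py` and are otherwise illustrative:

```python
from metric_tracker_rl.models import MetricTrackerRlAction, PayloadGeneratorMethod

# Flag dates whose metric sits more than 2x the std-from-median away from baseline.
action = MetricTrackerRlAction(
    payload_generators=[
        PayloadGeneratorMethod(
            method_name="get_median_filter_rows",
            metric_names=["app_open_to_order_placed", "orders_placed"],
            threshold_multiplier=2.0,
        )
    ]
)
print(action.model_dump()["payload_generators"][0]["method_name"])
# get_median_filter_rows
```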
openenv.yaml ADDED
@@ -0,0 +1,7 @@
+ spec_version: 1
+ name: metric_tracker_rl
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
+
payload_generation.py ADDED
@@ -0,0 +1,46 @@
+ """Shared method registry and submission preview helpers."""
+
+ from __future__ import annotations
+
+ from typing import Any
+
+ try:
+     from .analysis_tools import (
+         SharedAnalysisToolkit,
+         available_analysis_methods,
+         preview_submission_rows,
+         submission_row_key,
+     )
+     from .models import MetricSubmissionRow, SubmissionPreview
+     from .server.data_generator import available_synthetic_generator_methods
+ except ImportError:
+     from analysis_tools import (
+         SharedAnalysisToolkit,
+         available_analysis_methods,
+         preview_submission_rows,
+         submission_row_key,
+     )
+     from models import MetricSubmissionRow, SubmissionPreview
+     from server.data_generator import available_synthetic_generator_methods
+
+
+ def available_payload_generation_methods():
+     """Backward-compatible alias for the shared analysis method list."""
+     return available_analysis_methods()
+
+
+ def preview_submission(
+     rows: list[MetricSubmissionRow] | list[dict[str, Any]],
+ ) -> SubmissionPreview:
+     """Validate a submission without using hidden labels."""
+     return preview_submission_rows(rows)
+
+
+ __all__ = [
+     "SharedAnalysisToolkit",
+     "available_analysis_methods",
+     "available_payload_generation_methods",
+     "available_synthetic_generator_methods",
+     "preview_submission",
+     "submission_row_key",
+ ]
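A minimal sketch of the backward-compatible alias above; the exact method list depends on what `available_analysis_methods()` registers, so the output is not reproduced here:

```python
from metric_tracker_rl.payload_generation import available_payload_generation_methods

# The alias returns the same MethodSpec list as available_analysis_methods().
for spec in available_payload_generation_methods():
    print(f"{spec.name}: {spec.description}")
```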
pyproject.toml ADDED
@@ -0,0 +1,49 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-metric_tracker_rl"
+ version = "0.1.0"
+ description = "Metric Tracker Rl environment for OpenEnv"
+ requires-python = ">=3.10"
+ dependencies = [
+     # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+     # install from github
+     # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+     "openenv-core[core]>=0.2.1",
+     # Environment-specific dependencies
+     # Add all dependencies needed for your environment here
+     # Examples:
+     # "numpy>=1.19.0",
+     # "torch>=2.0.0",
+     # "gymnasium>=0.29.0",
+     # "openspiel>=1.0.0",
+     # "smolagents>=1.22.0,<2",
+     "gradio>=5.0.0",
+     "pandas>=2.2.0",
+     "plotly>=5.24.0",
+     "openai>=1.0.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ # Server entry point - enables running via: uv run --project . server
+ # or: python -m metric_tracker_rl.server.app
+ server = "metric_tracker_rl.server.app:main"
+
+ [tool.setuptools]
+ include-package-data = true
+ packages = ["metric_tracker_rl", "metric_tracker_rl.server"]
+ package-dir = { "metric_tracker_rl" = ".", "metric_tracker_rl.server" = "server" }
server/__init__.py ADDED
@@ -0,0 +1,11 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Metric Tracker Rl environment server components."""
+
+ from .metric_tracker_rl_environment import MetricTrackerRlEnvironment
+
+ __all__ = ["MetricTrackerRlEnvironment"]
server/app.py ADDED
@@ -0,0 +1,82 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ FastAPI application for the Metric Tracker RL Environment.
+
+ This module creates an HTTP server that exposes the MetricTrackerRlEnvironment
+ over HTTP and WebSocket endpoints, compatible with EnvClient.
+
+ Endpoints:
+ - POST /reset: Reset the environment
+ - POST /step: Execute an action
+ - GET /state: Get current environment state
+ - GET /schema: Get action/observation schemas
+ - WS /ws: WebSocket endpoint for persistent sessions
+
+ Usage:
+     # Development (with auto-reload):
+     uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+
+     # Production:
+     uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
+
+     # Or run directly:
+     python -m server.app
+ """
+
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:  # pragma: no cover
+     raise ImportError(
+         "openenv is required for the web interface. Install dependencies with 'uv sync'."
+     ) from e
+
+ try:
+     from ..models import MetricTrackerRlAction, MetricTrackerRlObservation
+     from .gradio_ui import build_metric_tracker_gradio_app
+     from .metric_tracker_rl_environment import MetricTrackerRlEnvironment
+ except ImportError:
+     from models import MetricTrackerRlAction, MetricTrackerRlObservation
+     from server.gradio_ui import build_metric_tracker_gradio_app
+     from server.metric_tracker_rl_environment import MetricTrackerRlEnvironment
+
+
+ # Create the app with web interface and README integration
+ app = create_app(
+     MetricTrackerRlEnvironment,
+     MetricTrackerRlAction,
+     MetricTrackerRlObservation,
+     env_name="metric_tracker_rl",
+     gradio_builder=build_metric_tracker_gradio_app,
+     max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
+ )
+
+
+ def main(host: str = "0.0.0.0", port: int = 8000):
+     """
+     Entry point for direct execution via uv run or python -m.
+
+     This function enables running the server without Docker:
+         uv run --project . server
+         uv run --project . server --port 8001
+         python -m metric_tracker_rl.server.app
+
+     Args:
+         host: Host address to bind to (default: "0.0.0.0")
+         port: Port number to listen on (default: 8000)
+
+     For production deployments, consider using uvicorn directly with
+     multiple workers:
+         uvicorn metric_tracker_rl.server.app:app --workers 4
+     """
+     import uvicorn
+
+     uvicorn.run(app, host=host, port=port)
+
+
+ if __name__ == "__main__":
+     main()
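For quick manual checks against the endpoints listed in the module docstring, here is a minimal sketch. It assumes the server is running locally on port 8000 and that `requests` is installed; the step body mirrors the action fields used by the Gradio debugger (`analysis_method`, `analysis_args`, `classifications`, `payload_generators`), while the exact request envelope is defined by openenv's `create_app`, so treat `GET /schema` as authoritative:

```python
# Sketch: exercise /reset and /step over plain HTTP. Field names mirror the
# Gradio debugger's payloads; verify the envelope against GET /schema.
import requests

BASE = "http://127.0.0.1:8000"

print(requests.get(f"{BASE}/schema").json())  # action/observation schemas

requests.post(f"{BASE}/reset", json={"seed": 0, "scenario_family": "mixed"})
step = requests.post(
    f"{BASE}/step",
    json={
        "analysis_method": "show_raw_data",
        "analysis_args": {"limit": 5},
        "classifications": [],
        "payload_generators": [],
    },
)
print(step.json().get("observation", {}).get("message"))
```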
server/data_generator.py ADDED
@@ -0,0 +1,1016 @@
+ """Synthetic multi-anomaly data generator for the metric tracker RL environment."""
+
+ from __future__ import annotations
+
+ import random
+ from dataclasses import dataclass, field
+ from datetime import date, timedelta
+ from statistics import median
+
+ try:
+     from ..analysis_tools import COUNT_METRICS, FUNNEL_STEPS, SharedAnalysisToolkit, AnalysisContext
+     from ..models import (
+         ConversionMetricDefinition,
+         MethodSpec,
+         MetricRecord,
+         MetricSubmissionRow,
+         SyntheticAnomalyGenerator,
+         SyntheticGeneratorApplication,
+     )
+ except ImportError:
+     from analysis_tools import COUNT_METRICS, FUNNEL_STEPS, SharedAnalysisToolkit, AnalysisContext
+     from models import (
+         ConversionMetricDefinition,
+         MethodSpec,
+         MetricRecord,
+         MetricSubmissionRow,
+         SyntheticAnomalyGenerator,
+         SyntheticGeneratorApplication,
+     )
+
+
+ ALL_SCENARIO_FAMILIES: tuple[str, ...] = (
+     "mixed",
+     "rate_drop_from_median",
+     "rate_spike_from_median",
+     "absolute_drop_in_event_count",
+     "absolute_spike_in_event_count",
+     "funnel_break",
+     "hourly_traffic_mix_shift",
+     "instrumentation_data_quality_issue",
+ )
+
+ SYNTHETIC_GENERATOR_METHOD_SPECS: tuple[MethodSpec, ...] = (
+     MethodSpec(
+         name="metric_stddev_shift",
+         description=(
+             "Inject a count or conversion anomaly on specific dates by setting the metric to "
+             "median +/- stddev_factor * std_dev_from_median."
+         ),
+         parameters=["metric_name", "metric_names", "date", "dates", "stddev_factor", "direction"],
+     ),
+ )
+
+
+ def available_synthetic_generator_methods() -> list[MethodSpec]:
+     """Return supported reset-time synthetic generator methods."""
+     return list(SYNTHETIC_GENERATOR_METHOD_SPECS)
+
+
+ @dataclass(frozen=True)
+ class GeneratorConfig:
+     """Configurable parameters for synthetic metric generation."""
+
+     conversion_definitions: tuple[ConversionMetricDefinition, ...] = (
+         ConversionMetricDefinition(
+             name="app_open_to_menu_open",
+             numerator="menu_opens",
+             denominator="app_opens",
+             description="menu_opens / app_opens * 100",
+         ),
+         ConversionMetricDefinition(
+             name="menu_open_to_product_added_to_cart",
+             numerator="product_added_to_cart",
+             denominator="menu_opens",
+             description="product_added_to_cart / menu_opens * 100",
+         ),
+         ConversionMetricDefinition(
+             name="product_added_to_cart_to_order_placed",
+             numerator="orders_placed",
+             denominator="product_added_to_cart",
+             description="orders_placed / product_added_to_cart * 100",
+         ),
+         ConversionMetricDefinition(
+             name="order_placed_to_payment_successful",
+             numerator="payment_successful",
+             denominator="orders_placed",
+             description="payment_successful / orders_placed * 100",
+         ),
+         ConversionMetricDefinition(
+             name="app_open_to_order_placed",
+             numerator="orders_placed",
+             denominator="app_opens",
+             description="orders_placed / app_opens * 100",
+         ),
+         ConversionMetricDefinition(
+             name="app_open_to_payment_successful",
+             numerator="payment_successful",
+             denominator="app_opens",
+             description="payment_successful / app_opens * 100",
+         ),
+     )
+     num_weeks: int = 4
+     end_date_offset_days: int = 1
+     base_daily_app_opens: int = 18000
+     weekday_factors: tuple[float, ...] = (0.95, 1.0, 1.02, 1.04, 1.06, 1.12, 1.08)
+     hourly_weights: tuple[float, ...] = (
+         0.010,
+         0.008,
+         0.007,
+         0.007,
+         0.010,
+         0.018,
+         0.028,
+         0.040,
+         0.050,
+         0.055,
+         0.058,
+         0.060,
+         0.058,
+         0.056,
+         0.054,
+         0.052,
+         0.054,
+         0.060,
+         0.072,
+         0.078,
+         0.075,
+         0.060,
+         0.038,
+         0.025,
+     )
+     baseline_rates: dict[str, float] = field(
+         default_factory=lambda: {
+             "menu_opens": 0.63,
+             "product_added_to_cart": 0.29,
+             "orders_placed": 0.44,
+             "payment_successful": 0.91,
+         }
+     )
+
+     @property
+     def num_days(self) -> int:
+         return self.num_weeks * 7
+
+
+ @dataclass(frozen=True)
+ class EpisodeConfig:
+     """Per-episode configuration."""
+
+     seed: int = 0
+     scenario_family: str = "mixed"
+     difficulty: str = "medium"
+     anomaly_density: str = "medium"
+     anomaly_count: int = 3
+     anomaly_generators: tuple[SyntheticAnomalyGenerator, ...] = ()
+
+     def normalized(self) -> "EpisodeConfig":
+         family = self.scenario_family if self.scenario_family in ALL_SCENARIO_FAMILIES else "mixed"
+         difficulty = self.difficulty if self.difficulty in {"easy", "medium", "hard"} else "medium"
+         density = self.anomaly_density if self.anomaly_density in {"low", "medium", "high"} else "medium"
+         return EpisodeConfig(
+             seed=int(self.seed),
+             scenario_family=family,
+             difficulty=difficulty,
+             anomaly_density=density,
+             anomaly_count=max(1, int(self.anomaly_count or 3)),
+             anomaly_generators=tuple(self.anomaly_generators or ()),
+         )
+
+
+ @dataclass
+ class PlannedAnomaly:
+     """Internal anomaly schedule item."""
+
+     date: str
+     anomaly_type: str
+     entity_type: str
+     entity_name: str
+     detection_method: str
+     details: dict[str, str]
+
+
+ @dataclass
+ class EpisodeData:
+     """Synthetic dataset and ground truth used for one episode."""
+
+     config: EpisodeConfig
+     scenario_label: str
+     daily_metrics: list[MetricRecord]
+     hourly_metrics: list[MetricRecord]
+     expected_rows: list[MetricSubmissionRow]
+     anomaly_schedule: list[dict[str, str]]
+     applied_synthetic_generators: list[SyntheticGeneratorApplication]
+
+
+ class MetricDataGenerator:
+     """Reusable synthetic data generator used by the env and custom UI."""
+
+     def __init__(self, config: GeneratorConfig | None = None, seed: int | None = None) -> None:
+         self.config = config or GeneratorConfig()
+         self._default_seed = int(seed or 0)
+
+     def generate_episode(self, episode_config: EpisodeConfig | None = None) -> EpisodeData:
+         """Generate one seeded episode."""
+         config = (episode_config or EpisodeConfig(seed=self._default_seed)).normalized()
+         rng = random.Random(config.seed)
+         end_date = date.today() - timedelta(days=self.config.end_date_offset_days)
+         start_date = end_date - timedelta(days=self.config.num_days - 1)
+         base_hourly = self._generate_base_hourly_metrics(start_date, rng, config)
+         applied_synthetic_generators: list[SyntheticGeneratorApplication] = []
+         if self._use_synthetic_metric_generators(config):
+             anomaly_plan, applied_synthetic_generators = self._apply_metric_generators(
+                 base_hourly,
+                 rng,
+                 config,
+             )
+         else:
+             anomaly_plan = self._plan_anomalies(base_hourly, rng, config)
+             self._apply_anomalies(base_hourly, anomaly_plan, rng, config)
+         daily_metrics, hourly_metrics = self._materialize_metrics(base_hourly)
+         if applied_synthetic_generators:
+             self._refresh_applied_generator_actuals(
+                 applied_synthetic_generators,
+                 daily_metrics,
+             )
+         expected_rows = self._build_expected_rows(daily_metrics, hourly_metrics, anomaly_plan, config)
+         anomaly_schedule = [
+             {
+                 "date": item.date,
+                 "anomaly_type": item.anomaly_type,
+                 "entity_type": item.entity_type,
+                 "entity_name": item.entity_name,
+                 "detection_method": item.detection_method,
+             }
+             for item in anomaly_plan
+         ]
+         return EpisodeData(
+             config=config,
+             scenario_label=config.scenario_family,
+             daily_metrics=daily_metrics,
+             hourly_metrics=hourly_metrics,
+             expected_rows=expected_rows,
+             anomaly_schedule=anomaly_schedule,
+             applied_synthetic_generators=applied_synthetic_generators,
+         )
+
+     def _use_synthetic_metric_generators(self, episode_config: EpisodeConfig) -> bool:
+         if episode_config.anomaly_generators:
+             return True
+         return episode_config.scenario_family in {
+             "mixed",
+             "rate_drop_from_median",
+             "rate_spike_from_median",
+             "absolute_drop_in_event_count",
+             "absolute_spike_in_event_count",
+         }
+
+     def _generate_base_hourly_metrics(
+         self,
+         start_date: date,
+         rng: random.Random,
+         episode_config: EpisodeConfig,
+     ) -> dict[str, list[MetricRecord]]:
+         hourly: dict[str, list[MetricRecord]] = {}
+         difficulty_noise = {"easy": 0.015, "medium": 0.025, "hard": 0.035}[episode_config.difficulty]
+         for day_index in range(self.config.num_days):
+             current_date = start_date + timedelta(days=day_index)
+             date_key = current_date.isoformat()
+             weekday_factor = self.config.weekday_factors[current_date.weekday()]
+             trend_factor = 1.0 + day_index * 0.0025
+             noise_factor = 1.0 + rng.uniform(-0.02, 0.02)
+             total_app_opens = round(
+                 self.config.base_daily_app_opens * weekday_factor * trend_factor * noise_factor
+             )
+             weights = self._hour_weights(current_date.weekday(), rng)
+             hourly_app_opens = self._allocate_total(total_app_opens, weights, rng)
+             day_rows: list[MetricRecord] = []
+             for hour, app_opens in enumerate(hourly_app_opens):
+                 menu_rate = self._bounded(
+                     self.config.baseline_rates["menu_opens"] * (1.0 + rng.uniform(-difficulty_noise, difficulty_noise)),
+                     0.50,
+                     0.80,
+                 )
+                 cart_rate = self._bounded(
+                     self.config.baseline_rates["product_added_to_cart"]
+                     * (1.0 + rng.uniform(-difficulty_noise * 1.2, difficulty_noise * 1.2)),
+                     0.18,
+                     0.42,
+                 )
+                 order_rate = self._bounded(
+                     self.config.baseline_rates["orders_placed"]
+                     * (1.0 + rng.uniform(-difficulty_noise * 1.2, difficulty_noise * 1.2)),
+                     0.28,
+                     0.62,
+                 )
+                 payment_rate = self._bounded(
+                     self.config.baseline_rates["payment_successful"]
+                     * (1.0 + rng.uniform(-difficulty_noise, difficulty_noise)),
+                     0.76,
+                     0.99,
+                 )
+                 menu_opens = round(app_opens * menu_rate)
+                 carts = round(menu_opens * cart_rate)
+                 orders = round(carts * order_rate)
+                 payments = round(orders * payment_rate)
+                 day_rows.append(
+                     MetricRecord(
+                         date=date_key,
+                         hour=hour,
+                         app_opens=app_opens,
+                         menu_opens=menu_opens,
+                         product_added_to_cart=carts,
+                         orders_placed=orders,
+                         payment_successful=payments,
+                     )
+                 )
+             hourly[date_key] = day_rows
+         return hourly
+
+     def _plan_anomalies(
+         self,
+         base_hourly: dict[str, list[MetricRecord]],
+         rng: random.Random,
+         episode_config: EpisodeConfig,
+     ) -> list[PlannedAnomaly]:
+         dates = sorted(base_hourly)
+         candidate_dates = dates[3:-2] if len(dates) > 8 else dates
+         family_pool = (
+             list(ALL_SCENARIO_FAMILIES[1:])
+             if episode_config.scenario_family == "mixed"
+             else [episode_config.scenario_family]
+         )
+         target_count = max(
+             1,
+             int(
+                 episode_config.anomaly_count
+                 or {"low": 3, "medium": 5, "high": 7}[episode_config.anomaly_density]
+             ),
+         )
+         plan: list[PlannedAnomaly] = []
+         used_pairs: set[tuple[str, str, str]] = set()
+         family_order = family_pool[:]
+         rng.shuffle(family_order)
+         family_index = 0
+
+         while len(plan) < target_count:
+             if family_index >= len(family_order):
+                 family_order = family_pool[:]
+                 rng.shuffle(family_order)
+                 family_index = 0
+             anomaly_type = family_order[family_index]
+             family_index += 1
+             date_key = rng.choice(candidate_dates)
+             entity_type, entity_name, detection_method, details = self._pick_entity_for_family(
+                 anomaly_type,
+                 rng,
+             )
+             dedupe_key = (date_key, entity_type, entity_name)
+             if dedupe_key in used_pairs:
+                 continue
+             used_pairs.add(dedupe_key)
+             plan.append(
+                 PlannedAnomaly(
+                     date=date_key,
+                     anomaly_type=anomaly_type,
+                     entity_type=entity_type,
+                     entity_name=entity_name,
+                     detection_method=detection_method,
+                     details=details,
+                 )
+             )
+         plan.sort(key=lambda item: (item.date, item.entity_type, item.entity_name))
+         return plan
+
+     def _pick_entity_for_family(
+         self,
+         anomaly_type: str,
+         rng: random.Random,
+     ) -> tuple[str, str, str, dict[str, str]]:
+         if anomaly_type in {"rate_drop_from_median", "rate_spike_from_median"}:
+             definition = rng.choice(list(self.config.conversion_definitions))
+             return (
+                 "conversion_rate",
+                 definition.name,
+                 "compare_rate_to_median",
+                 {"conversion_name": definition.name},
+             )
+         if anomaly_type in {"absolute_drop_in_event_count", "absolute_spike_in_event_count"}:
+             metric_name = rng.choice(list(COUNT_METRICS))
+             return (
+                 "event_count",
+                 metric_name,
+                 "compare_count_to_median",
+                 {"metric_name": metric_name},
+             )
+         if anomaly_type == "funnel_break":
+             numerator, denominator = rng.choice(list(FUNNEL_STEPS))
+             return (
+                 "funnel_step",
+                 f"{numerator}_from_{denominator}",
+                 "detect_funnel_break",
+                 {"numerator": numerator, "denominator": denominator},
+             )
+         if anomaly_type == "hourly_traffic_mix_shift":
+             return (
+                 "hourly_mix",
+                 "app_opens:daytime_share",
+                 "hourly_rows_for_date",
+                 {},
+             )
+         numerator, denominator = rng.choice(list(FUNNEL_STEPS))
+         return (
+             "data_quality",
+             f"{numerator}_lte_{denominator}",
+             "check_impossible_counts",
+             {"numerator": numerator, "denominator": denominator},
+         )
+
+     def _apply_anomalies(
+         self,
+         hourly: dict[str, list[MetricRecord]],
+         plan: list[PlannedAnomaly],
+         rng: random.Random,
+         episode_config: EpisodeConfig,
+     ) -> None:
+         difficulty = episode_config.difficulty
+         for item in plan:
+             rows = hourly[item.date]
+             if item.anomaly_type == "rate_drop_from_median":
+                 self._apply_rate_change(rows, item.details["conversion_name"], rng, difficulty, direction="down")
+             elif item.anomaly_type == "rate_spike_from_median":
+                 self._apply_rate_change(rows, item.details["conversion_name"], rng, difficulty, direction="up")
+             elif item.anomaly_type == "absolute_drop_in_event_count":
+                 self._apply_count_change(rows, item.details["metric_name"], rng, difficulty, direction="down")
+             elif item.anomaly_type == "absolute_spike_in_event_count":
+                 self._apply_count_change(rows, item.details["metric_name"], rng, difficulty, direction="up")
+             elif item.anomaly_type == "funnel_break":
+                 self._apply_funnel_break(rows, item.details["numerator"], item.details["denominator"], rng, difficulty)
+             elif item.anomaly_type == "hourly_traffic_mix_shift":
+                 self._apply_hourly_mix_shift(rows, rng, difficulty)
+             elif item.anomaly_type == "instrumentation_data_quality_issue":
+                 self._apply_data_quality_issue(rows, item.details["numerator"], item.details["denominator"], rng, difficulty)
+
+     def _apply_rate_change(
+         self,
+         rows: list[MetricRecord],
+         conversion_name: str,
+         rng: random.Random,
+         difficulty: str,
+         *,
+         direction: str,
+     ) -> None:
+         definition = next(item for item in self.config.conversion_definitions if item.name == conversion_name)
+         multipliers = {
+             "easy": (0.74, 1.32),
+             "medium": (0.82, 1.22),
+             "hard": (0.88, 1.15),
+         }[difficulty]
+         multiplier = multipliers[0] if direction == "down" else multipliers[1]
+         for row in rows:
+             denominator_value = getattr(row, definition.denominator)
+             observed = round(denominator_value * multiplier * self._base_rate_from_metric(definition.numerator))
+             setattr_value = min(max(observed, 0), denominator_value)
+             self._set_metric_and_rebalance(row, definition.numerator, setattr_value)
+
+     def _apply_count_change(
+         self,
+         rows: list[MetricRecord],
+         metric_name: str,
+         rng: random.Random,
+         difficulty: str,
+         *,
+         direction: str,
+     ) -> None:
+         multipliers = {
+             "easy": (0.58, 1.42),
+             "medium": (0.72, 1.28),
+             "hard": (0.82, 1.18),
+         }[difficulty]
+         multiplier = multipliers[0] if direction == "down" else multipliers[1]
+         for row in rows:
+             original = getattr(row, metric_name)
+             updated = max(0, round(original * multiplier))
+             self._set_metric_and_rebalance(row, metric_name, updated)
+
+     def _apply_funnel_break(
+         self,
+         rows: list[MetricRecord],
+         numerator: str,
+         denominator: str,
+         rng: random.Random,
+         difficulty: str,
+     ) -> None:
+         if numerator == "menu_opens":
+             return
+         drop = {"easy": 0.45, "medium": 0.58, "hard": 0.7}[difficulty]
+         for row in rows:
+             denominator_value = getattr(row, denominator)
+             broken_value = max(0, round(denominator_value * drop))
+             self._set_metric_and_rebalance(row, numerator, broken_value)
+
+     def _apply_hourly_mix_shift(
+         self,
+         rows: list[MetricRecord],
+         rng: random.Random,
+         difficulty: str,
+     ) -> None:
+         total = sum(row.app_opens for row in rows)
+         if total <= 0:
+             return
+         shift = {"easy": 0.28, "medium": 0.20, "hard": 0.14}[difficulty]
+         boosted_hours = {0, 1, 2, 3, 4, 21, 22, 23}
+         weights = []
+         for row in rows:
+             base = row.app_opens / total
+             if row.hour in boosted_hours:
+                 base *= 1.0 + shift
+             elif 9 <= (row.hour or 0) <= 18:
+                 base *= max(0.2, 1.0 - shift)
+             weights.append(base)
+         normalized = [value / sum(weights) for value in weights]
+         redistributed = self._allocate_total(total, normalized, rng)
+         for row, app_opens in zip(rows, redistributed, strict=False):
+             row.app_opens = app_opens
+             menu_rate = self._ratio(row.menu_opens, max(row.app_opens, 1))
+             row.menu_opens = min(row.app_opens, round(app_opens * menu_rate))
+             cart_rate = self._ratio(row.product_added_to_cart, max(row.menu_opens, 1))
+             row.product_added_to_cart = min(row.menu_opens, round(row.menu_opens * cart_rate))
+             order_rate = self._ratio(row.orders_placed, max(row.product_added_to_cart, 1))
+             row.orders_placed = min(row.product_added_to_cart, round(row.product_added_to_cart * order_rate))
+             payment_rate = self._ratio(row.payment_successful, max(row.orders_placed, 1))
+             row.payment_successful = min(row.orders_placed, round(row.orders_placed * payment_rate))
+
+     def _apply_data_quality_issue(
+         self,
+         rows: list[MetricRecord],
+         numerator: str,
+         denominator: str,
+         rng: random.Random,
+         difficulty: str,
+     ) -> None:
+         affected_hours = {"easy": 5, "medium": 4, "hard": 3}[difficulty]
+         for row in rng.sample(rows, k=min(affected_hours, len(rows))):
+             denominator_value = getattr(row, denominator)
+             violation = max(1, round(denominator_value * {"easy": 0.12, "medium": 0.08, "hard": 0.05}[difficulty]))
+             setattr(row, numerator, denominator_value + violation)
+             self._rebalance_downstream_from(row, numerator)
+
+     def _apply_metric_generators(
+         self,
+         hourly: dict[str, list[MetricRecord]],
+         rng: random.Random,
+         episode_config: EpisodeConfig,
+     ) -> tuple[list[PlannedAnomaly], list[SyntheticGeneratorApplication]]:
+         generator_specs = self._resolve_metric_generators(hourly, rng, episode_config)
+         if not generator_specs:
+             return [], []
+
+         daily_metrics, hourly_metrics = self._materialize_metrics(hourly)
+         toolkit = SharedAnalysisToolkit(
+             AnalysisContext(
+                 daily_metrics=daily_metrics,
+                 hourly_metrics=hourly_metrics,
+                 conversion_definitions=list(self.config.conversion_definitions),
+                 config=episode_config.__dict__,
+             )
+         )
+
+         anomaly_plan: list[PlannedAnomaly] = []
+         applications: list[SyntheticGeneratorApplication] = []
+         seen_pairs: set[tuple[str, str]] = set()
+         for spec in generator_specs:
+             for date_key in self._resolve_generator_dates(spec, hourly, rng):
+                 for metric_name in self._resolve_generator_metrics(spec):
+                     dedupe_key = (date_key, metric_name)
+                     if dedupe_key in seen_pairs:
+                         continue
+                     seen_pairs.add(dedupe_key)
+                     application = self._build_metric_generator_application(
+                         toolkit=toolkit,
+                         date_key=date_key,
+                         metric_name=metric_name,
+                         spec=spec,
+                         rng=rng,
+                     )
+                     self._apply_metric_generator_application(hourly[date_key], application)
+                     applications.append(application)
+                     anomaly_plan.append(
+                         PlannedAnomaly(
+                             date=date_key,
+                             anomaly_type=application.anomaly_type,
+                             entity_type=application.metric_type,
+                             entity_name=metric_name,
+                             detection_method=application.detection_method,
+                             details={"metric_name": metric_name},
+                         )
+                     )
+         applications.sort(key=lambda item: (item.date, item.metric_name))
+         anomaly_plan.sort(key=lambda item: (item.date, item.entity_type, item.entity_name))
+         return anomaly_plan, applications
+
+     def _resolve_metric_generators(
+         self,
+         hourly: dict[str, list[MetricRecord]],
+         rng: random.Random,
+         episode_config: EpisodeConfig,
+     ) -> list[SyntheticAnomalyGenerator]:
+         if episode_config.anomaly_generators:
+             return list(episode_config.anomaly_generators)
+
+         dates = sorted(hourly)
+         candidate_dates = dates[3:-2] if len(dates) > 8 else dates
+         metric_pool = self._metric_pool_for_family(episode_config.scenario_family)
+         if not metric_pool:
+             return []
+
+         used_pairs: set[tuple[str, str]] = set()
+         generated: list[SyntheticAnomalyGenerator] = []
+         default_stddev = {"easy": 2.6, "medium": 2.2, "hard": 1.8}[episode_config.difficulty]
+         while len(generated) < max(1, episode_config.anomaly_count):
+             date_key = rng.choice(candidate_dates)
+             metric_name = rng.choice(metric_pool)
+             if (date_key, metric_name) in used_pairs:
+                 continue
+             used_pairs.add((date_key, metric_name))
+             generated.append(
+                 SyntheticAnomalyGenerator(
+                     method_name="metric_stddev_shift",
+                     metric_name=metric_name,
+                     date=date_key,
+                     stddev_factor=default_stddev,
+                     direction=self._default_direction_for_family(episode_config.scenario_family, rng),
+                 )
+             )
+         return generated
+
+     def _metric_pool_for_family(self, scenario_family: str) -> list[str]:
+         conversion_metrics = [item.name for item in self.config.conversion_definitions]
+         if scenario_family in {"rate_drop_from_median", "rate_spike_from_median"}:
+             return conversion_metrics
+         if scenario_family in {"absolute_drop_in_event_count", "absolute_spike_in_event_count"}:
+             return list(COUNT_METRICS)
+         if scenario_family == "mixed":
+             return list(COUNT_METRICS) + conversion_metrics
+         return []
+
+     @staticmethod
+     def _default_direction_for_family(scenario_family: str, rng: random.Random) -> str:
+         if scenario_family in {"rate_drop_from_median", "absolute_drop_in_event_count"}:
+             return "down"
+         if scenario_family in {"rate_spike_from_median", "absolute_spike_in_event_count"}:
+             return "up"
+         return "down" if rng.random() < 0.5 else "up"
+
+     def _resolve_generator_dates(
+         self,
+         spec: SyntheticAnomalyGenerator,
+         hourly: dict[str, list[MetricRecord]],
+         rng: random.Random,
+     ) -> list[str]:
+         dates = [item for item in spec.dates if item in hourly]
+         if spec.date and spec.date in hourly:
+             dates.append(spec.date)
+         if not dates:
+             dates = [rng.choice(sorted(hourly))]
+         seen = set()
+         deduped = []
+         for item in dates:
+             if item in seen:
+                 continue
+             seen.add(item)
+             deduped.append(item)
+         return deduped
+
+     def _resolve_generator_metrics(self, spec: SyntheticAnomalyGenerator) -> list[str]:
+         metrics = [item for item in spec.metric_names if item]
+         if spec.metric_name:
+             metrics.append(spec.metric_name)
+         if not metrics:
+             metrics = list(COUNT_METRICS) + [item.name for item in self.config.conversion_definitions]
+         seen = set()
+         deduped = []
+         for item in metrics:
+             if item in seen:
+                 continue
+             seen.add(item)
+             deduped.append(item)
+         return deduped
+
+     def _build_metric_generator_application(
+         self,
+         *,
+         toolkit: SharedAnalysisToolkit,
+         date_key: str,
+         metric_name: str,
+         spec: SyntheticAnomalyGenerator,
+         rng: random.Random,
+     ) -> SyntheticGeneratorApplication:
+         stats = toolkit.get_metric_std_dev_from_median(metric_name)
+         descriptor = toolkit._metric_descriptor(metric_name)
+         baseline_value = float(stats["median_value"])
+         std_dev_from_median = float(stats["std_dev_from_median"])
+         pre_applied_value = float(descriptor["per_date_values"][date_key])
+         direction = spec.direction if spec.direction != "auto" else ("down" if rng.random() < 0.5 else "up")
+         sign = -1.0 if direction == "down" else 1.0
+         threshold_value = round(std_dev_from_median * float(spec.stddev_factor), 4)
+         metric_type = "event_count" if metric_name in COUNT_METRICS else "conversion_rate"
+         if metric_type == "event_count":
+             minimum_shift = max(50.0, baseline_value * toolkit._count_threshold_fraction()) * 1.05
+             applied_shift = max(threshold_value, round(minimum_shift, 4))
+             target_value = max(0.0, baseline_value + sign * applied_shift)
+             anomaly_type = "absolute_spike_in_event_count" if sign > 0 else "absolute_drop_in_event_count"
+             detection_method = "compare_count_to_median"
+         else:
+             applied_shift = max(threshold_value, round(toolkit._rate_threshold() * 1.05, 4))
+             target_value = self._bounded(baseline_value + sign * applied_shift, 0.0, 100.0)
+             anomaly_type = "rate_spike_from_median" if sign > 0 else "rate_drop_from_median"
+             detection_method = "compare_rate_to_median"
+         return SyntheticGeneratorApplication(
+             method_name=spec.method_name,
+             date=date_key,
+             metric_name=metric_name,
+             metric_type=metric_type,
+             direction="up" if sign > 0 else "down",
+             anomaly_type=anomaly_type,
+             detection_method=detection_method,
+             baseline_value=round(baseline_value, 4),
+             pre_applied_value=round(pre_applied_value, 4),
+             std_dev_from_median=round(std_dev_from_median, 4),
+             stddev_factor=round(float(spec.stddev_factor), 4),
+             threshold_value=threshold_value,
+             target_value=round(target_value, 4),
+             actual_value=round(target_value, 4),
+             formula=(
+                 f"{metric_name} = median {'+' if sign > 0 else '-'} "
+                 "max(stddev_factor * std_dev_from_median, detector_threshold)"
+             ),
+         )
+
+     def _apply_metric_generator_application(
+         self,
+         rows: list[MetricRecord],
+         application: SyntheticGeneratorApplication,
+     ) -> None:
+         if application.metric_type == "event_count":
+             self._apply_daily_count_target(
+                 rows,
+                 application.metric_name,
+                 int(round(application.target_value)),
+             )
+             return
+         self._apply_daily_conversion_target(
+             rows,
+             application.metric_name,
+             float(application.target_value),
+         )
+
+     def _apply_daily_count_target(
+         self,
+         rows: list[MetricRecord],
+         metric_name: str,
+         target_total: int,
+     ) -> None:
+         target_total = max(0, target_total)
+         current_values = [max(0, getattr(row, metric_name)) for row in rows]
+         current_total = sum(current_values)
+         if current_total > 0:
+             weights = [value / current_total for value in current_values]
+         else:
+             app_total = sum(max(0, row.app_opens) for row in rows) or len(rows)
+             weights = [max(0, row.app_opens) / app_total for row in rows]
+         allocated = self._allocate_total(target_total, weights, random.Random(target_total + len(rows)))
+         for row, value in zip(rows, allocated, strict=False):
+             self._set_metric_and_rebalance(row, metric_name, value)
+
+     def _apply_daily_conversion_target(
+         self,
+         rows: list[MetricRecord],
+         conversion_name: str,
+         target_rate_pct: float,
+     ) -> None:
+         definition = next(item for item in self.config.conversion_definitions if item.name == conversion_name)
+         bounded_rate = self._bounded(target_rate_pct / 100.0, 0.0, 1.0)
+         for row in rows:
+             denominator_value = getattr(row, definition.denominator)
+             numerator_target = round(denominator_value * bounded_rate)
+             self._set_metric_and_rebalance(row, definition.numerator, numerator_target)
+
+     def _refresh_applied_generator_actuals(
+         self,
+         applications: list[SyntheticGeneratorApplication],
+         daily_metrics: list[MetricRecord],
+     ) -> None:
+         by_date = {row.date: row for row in daily_metrics}
+         conversion_map = {item.name: item for item in self.config.conversion_definitions}
+         for application in applications:
+             record = by_date.get(application.date)
+             if record is None:
+                 continue
+             if application.metric_type == "event_count":
+                 actual_value = float(getattr(record, application.metric_name))
+             else:
+                 definition = conversion_map[application.metric_name]
+                 denominator = getattr(record, definition.denominator)
+                 actual_value = round(
+                     (getattr(record, definition.numerator) / denominator * 100.0)
+                     if denominator > 0
+                     else 0.0,
+                     4,
+                 )
+             application.actual_value = round(actual_value, 4)
+
+     def _build_expected_rows(
+         self,
+         daily_metrics: list[MetricRecord],
+         hourly_metrics: list[MetricRecord],
+         plan: list[PlannedAnomaly],
+         episode_config: EpisodeConfig,
+     ) -> list[MetricSubmissionRow]:
+         toolkit = SharedAnalysisToolkit(
+             AnalysisContext(
+                 daily_metrics=daily_metrics,
+                 hourly_metrics=hourly_metrics,
+                 conversion_definitions=list(self.config.conversion_definitions),
+                 config={
+                     "seed": episode_config.seed,
+                     "scenario_family": episode_config.scenario_family,
+                     "difficulty": episode_config.difficulty,
+                     "anomaly_density": episode_config.anomaly_density,
+                     "anomaly_count": episode_config.anomaly_count,
+                 },
+             )
+         )
+         rows: list[MetricSubmissionRow] = []
+         for item in plan:
+             if item.detection_method == "compare_rate_to_median":
+                 result = toolkit.compare_rate_to_median(item.date, item.entity_name)
+             elif item.detection_method == "compare_count_to_median":
+                 result = toolkit.compare_count_to_median(item.date, item.entity_name)
+             elif item.detection_method == "detect_funnel_break":
+                 candidates = toolkit.detect_funnel_break(item.date)["candidates"]
+                 result = next((row for row in candidates if row["entity_name"] == item.entity_name), None)
+                 if result is None:
+                     numerator = item.details["numerator"]
+                     denominator = item.details["denominator"]
+                     daily_row = next(row for row in daily_metrics if row.date == item.date)
+                     baseline_series = [
+                         (getattr(row, numerator) / getattr(row, denominator) * 100.0)
+                         if getattr(row, denominator) > 0
+                         else 0.0
+                         for row in daily_metrics
+                     ]
+                     baseline = round(median(baseline_series), 4)
+                     observed = round(
+                         (getattr(daily_row, numerator) / getattr(daily_row, denominator) * 100.0)
+                         if getattr(daily_row, denominator) > 0
+                         else 0.0,
+                         4,
+                     )
+                     delta = round(observed - baseline, 4)
+                     result = {
+                         "entity_type": item.entity_type,
+                         "entity_name": item.entity_name,
+                         "baseline_value": baseline,
+                         "observed_value": observed,
+                         "delta_value": delta,
+                         "severity": self._severity_from_ratio(abs(delta), 5.0, 10.0, 15.0),
+                     }
+             elif item.detection_method == "check_impossible_counts":
+                 impossible = toolkit.check_impossible_counts(item.date)
+                 result = {
+                     "date": item.date,
+                     "entity_type": item.entity_type,
+                     "entity_name": item.entity_name,
+                     "anomaly_type": item.anomaly_type,
+                     "detection_method": item.detection_method,
+                     "baseline_value": 0.0,
+                     "observed_value": round(impossible["total_excess"], 4),
+                     "delta_value": round(impossible["total_excess"], 4),
+                     "severity": self._severity_from_ratio(impossible["total_excess"], 20.0, 60.0, 120.0),
+                 }
+             else:
+                 observed_share = toolkit.hourly_rows_for_date(item.date)["summary"]["daytime_share"]
+                 baseline_share = toolkit._median_daytime_share()
+                 delta = round(observed_share - baseline_share, 4)
+                 result = {
+                     "date": item.date,
+                     "entity_type": item.entity_type,
+                     "entity_name": item.entity_name,
+                     "anomaly_type": item.anomaly_type,
+                     "detection_method": item.detection_method,
+                     "baseline_value": round(baseline_share, 4),
+                     "observed_value": round(observed_share, 4),
+                     "delta_value": delta,
+                     "severity": self._severity_from_ratio(abs(delta) * 100.0, 10.0, 18.0, 25.0),
+                 }
+
+             if not result:
+                 continue
+             normalized = dict(result)
+             normalized["date"] = item.date
+             normalized["anomaly_type"] = item.anomaly_type
+             normalized["detection_method"] = item.detection_method
+             rows.append(MetricSubmissionRow(**normalized))
+         deduped = {f"{row.date}|{row.entity_type}|{row.entity_name}": row for row in rows}
+         return sorted(deduped.values(), key=lambda row: (row.date, row.entity_type, row.entity_name))
+
+     def _materialize_metrics(
+         self,
+         base_hourly: dict[str, list[MetricRecord]],
+     ) -> tuple[list[MetricRecord], list[MetricRecord]]:
+         hourly_metrics = []
+         daily_metrics = []
+         for date_key in sorted(base_hourly):
+             rows = base_hourly[date_key]
+             hourly_metrics.extend(rows)
+             daily_metrics.append(
+                 MetricRecord(
+                     date=date_key,
+                     hour=None,
+                     app_opens=sum(item.app_opens for item in rows),
+                     menu_opens=sum(item.menu_opens for item in rows),
+                     product_added_to_cart=sum(item.product_added_to_cart for item in rows),
+                     orders_placed=sum(item.orders_placed for item in rows),
+                     payment_successful=sum(item.payment_successful for item in rows),
+                 )
+             )
+         return daily_metrics, hourly_metrics
+
+     def _set_metric_and_rebalance(self, row: MetricRecord, metric_name: str, value: int) -> None:
+         caps = {
+             "app_opens": None,
+             "menu_opens": row.app_opens,
+             "product_added_to_cart": row.menu_opens,
+             "orders_placed": row.product_added_to_cart,
+             "payment_successful": row.orders_placed,
+         }
+         cap = caps.get(metric_name)
+         bounded = max(0, value if cap is None else min(value, cap))
+         setattr(row, metric_name, bounded)
+         self._rebalance_downstream_from(row, metric_name)
+         self._rebalance_upstream_to(row, metric_name)
+
+     def _rebalance_downstream_from(self, row: MetricRecord, metric_name: str) -> None:
+         order = list(COUNT_METRICS)
+         start_index = order.index(metric_name)
+         for index in range(start_index + 1, len(order)):
+             parent_name = order[index - 1]
+             current_name = order[index]
+             parent_value = getattr(row, parent_name)
+             current_value = min(getattr(row, current_name), parent_value)
+             setattr(row, current_name, max(0, current_value))
+
+     def _rebalance_upstream_to(self, row: MetricRecord, metric_name: str) -> None:
+         order = list(COUNT_METRICS)
+         start_index = order.index(metric_name)
+         for index in range(start_index - 1, -1, -1):
+             child_name = order[index + 1]
+             current_name = order[index]
+             child_value = getattr(row, child_name)
+             current_value = max(getattr(row, current_name), child_value)
+             setattr(row, current_name, current_value)
+
+     def _base_rate_from_metric(self, metric_name: str) -> float:
+         if metric_name == "menu_opens":
+             return self.config.baseline_rates["menu_opens"]
+         if metric_name == "product_added_to_cart":
+             return self.config.baseline_rates["product_added_to_cart"]
+         if metric_name == "orders_placed":
+             return self.config.baseline_rates["orders_placed"]
+         if metric_name == "payment_successful":
+             return self.config.baseline_rates["payment_successful"]
+         return 1.0
+
+     def _hour_weights(self, weekday: int, rng: random.Random) -> list[float]:
+         weekend_multiplier = 1.12 if weekday >= 5 else 1.0
+         weights = [
+             max(0.001, value * weekend_multiplier * (1.0 + rng.uniform(-0.08, 0.08)))
+             for value in self.config.hourly_weights
+         ]
+         total = sum(weights)
+         return [value / total for value in weights]
+
+     @staticmethod
+     def _allocate_total(total: int, weights: list[float], rng: random.Random) -> list[int]:
+         raw = [total * weight for weight in weights]
+         integers = [int(value) for value in raw]
+         remainder = total - sum(integers)
+         ranked = sorted(
+             range(len(weights)),
+             key=lambda index: (raw[index] - integers[index], rng.random()),
+             reverse=True,
+         )
+         for index in ranked[:remainder]:
+             integers[index] += 1
+         return integers
+
+     @staticmethod
+     def _ratio(numerator: int, denominator: int) -> float:
+         if denominator <= 0:
+             return 0.0
+         return numerator / denominator
+
+     @staticmethod
+     def _bounded(value: float, lower: float, upper: float) -> float:
+         return min(max(value, lower), upper)
+
+     @staticmethod
+     def _severity_from_ratio(value: float, medium: float, high: float, critical: float) -> str:
+         if value >= critical:
+             return "critical"
+         if value >= high:
+             return "high"
+         if value >= medium:
+             return "medium"
+         return "low"
server/gradio_ui.py ADDED
@@ -0,0 +1,728 @@
+ """Custom Gradio UI for testing the metric tracker RL environment."""
+
+ from __future__ import annotations
+
+ import json
+
+ import pandas as pd
+
+ try:
+     from ..analysis_tools import available_analysis_methods
+     from ..tasks import DEFAULT_TASK_ORDER, available_task_specs, get_task_spec
+ except ImportError:
+     from analysis_tools import available_analysis_methods
+     from tasks import DEFAULT_TASK_ORDER, available_task_specs, get_task_spec
+
+ try:
+     import gradio as gr
+ except ImportError:  # pragma: no cover
+     gr = None
+
+
+ GENERATOR_METHODS = [
+     "get_median_filter_rows",
+     "get_rate_drop_from_median_rows",
+     "get_rate_spike_from_median_rows",
+     "get_absolute_drop_in_event_count_rows",
+     "get_absolute_spike_in_event_count_rows",
+     "get_funnel_break_rows",
+     "get_hourly_traffic_mix_shift_rows",
+     "get_instrumentation_data_quality_issue_rows",
+ ]
+ METHOD_CHOICES = [item.name for item in available_analysis_methods()]
+ TASK_CHOICES = list(DEFAULT_TASK_ORDER)
+ TASK_SUMMARIES = {
+     item.task_id: item.model_dump()
+     for item in available_task_specs()
+ }
+ METRIC_CHOICES = [
+     "app_opens",
+     "menu_opens",
+     "product_added_to_cart",
+     "orders_placed",
+     "payment_successful",
+     "app_open_to_menu_open",
+     "menu_open_to_product_added_to_cart",
+     "product_added_to_cart_to_order_placed",
+     "order_placed_to_payment_successful",
+     "app_open_to_order_placed",
+     "app_open_to_payment_successful",
+ ]
+
+
+ def build_metric_tracker_gradio_app(
+     web_manager,
+     action_fields,
+     metadata,
+     is_chat_env,
+     title,
+     quick_start_md,
+ ):
+     """Build a method-driven and generator-driven debugger."""
+     del action_fields, metadata, is_chat_env, quick_start_md
+     if gr is None:  # pragma: no cover
+         raise ImportError("gradio is required to build the custom metric tracker UI.")
+
+     with gr.Blocks() as demo:
+         gr.Markdown(
+             f"""
+ # {title} Generator Debugger
+
+ The UI now supports the same named benchmark tasks used by the agent baseline.
+ Pick a task to load its canonical easy, medium, or hard setup, then optionally
+ override the reset controls for custom debugging.
+
+ Standard mode exposes method calls only. You inspect data through methods like
+ `show_raw_data`, `get_metric_median`, `get_metric_std_dev_from_median`,
+ and `get_rows_with_abs_diff_from_median_gt`, then assemble payload generators
+ such as `get_median_filter_rows(metric_name, threshold_multiplier)`.
+ """
+         )
+
+         session_state = gr.State(_empty_state())
+
+         with gr.Row():
+             task_id = gr.Dropdown(
+                 label="Named Task",
+                 choices=TASK_CHOICES,
+                 value=TASK_CHOICES[0],
+             )
+             task_details = gr.JSON(
+                 label="Selected Task Details",
+                 value=TASK_SUMMARIES[TASK_CHOICES[0]],
+             )
+
+         with gr.Row():
+             initial_task = get_task_spec(TASK_CHOICES[0])
+             seed = gr.Number(label="Seed", value=initial_task.seed, precision=0)
+             scenario_family = gr.Dropdown(
+                 label="Scenario Family",
+                 choices=[
+                     "mixed",
+                     "rate_drop_from_median",
+                     "rate_spike_from_median",
+                     "absolute_drop_in_event_count",
+                     "absolute_spike_in_event_count",
+                     "funnel_break",
+                     "hourly_traffic_mix_shift",
+                     "instrumentation_data_quality_issue",
+                 ],
+                 value=initial_task.scenario_family,
+             )
+             difficulty = gr.Dropdown(
+                 label="Difficulty",
+                 choices=["easy", "medium", "hard"],
+                 value=initial_task.difficulty,
+             )
+             anomaly_density = gr.Dropdown(
+                 label="Anomaly Density",
+                 choices=["low", "medium", "high"],
+                 value=initial_task.anomaly_density,
+             )
+             anomaly_count = gr.Number(label="Anomaly Count", value=initial_task.anomaly_count, precision=0)
+             debug_mode = gr.Checkbox(label="Debug Mode", value=False)
+
+         reset_anomalies = gr.Code(
+             label="Reset Anomalies JSON",
+             language="json",
+             value="[]",
+             interactive=True,
+         )
+
+         with gr.Row():
+             reset_btn = gr.Button("Reset Episode", variant="primary")
+             preview_btn = gr.Button("Preview Generator Payload", variant="secondary")
+             submit_btn = gr.Button("Submit Generator Payload", variant="secondary")
+             get_state_btn = gr.Button("Get State", variant="secondary")
+
+         gr.Markdown("## Methods")
+         gr.Markdown(
+             "Run a method after reset to fetch exactly the daily aggregate data, median, "
+             "std-from-median, filtered rows, or generated payload rows you want."
+         )
+         with gr.Row():
+             method_name = gr.Dropdown(
+                 label="Method",
+                 choices=METHOD_CHOICES,
+                 value="show_raw_data",
+             )
+             method_metric = gr.Dropdown(
+                 label="metrics",
+                 choices=METRIC_CHOICES,
+                 value=[],
+                 multiselect=True,
+             )
+             method_threshold = gr.Number(label="threshold / multiplier", value=2.0)
+             method_limit = gr.Number(label="limit", value=5, precision=0)
+             run_method_btn = gr.Button("Run Method", variant="secondary")
+         with gr.Row():
+             method_date = gr.Textbox(label="date", placeholder="YYYY-MM-DD")
+             method_entity = gr.Textbox(label="entity_name", placeholder="orders_placed or app_open_to_order_placed")
+             method_rows_json = gr.Code(
+                 label="rows JSON for preview_submission",
+                 language="json",
+                 value="[]",
+                 interactive=True,
+             )
+         analysis_result = gr.JSON(label="Last Method Results")
+
+         with gr.Tab("Method Data"):
+             gr.Markdown(
+                 "This panel shows only method-returned data. Use `show_raw_data` for daily "
+                 "aggregate rows, then median/std/filter methods to inspect candidate anomalies."
+             )
+             method_rows = gr.Dataframe(label="Method Rows", interactive=False)
+
+         gr.Markdown("## Payload Generators")
+         gr.Markdown(
+             "Add generator methods here, then preview or submit using the buttons at the top."
+         )
+         generator_methods_df = gr.Dataframe(
+             headers=["method_name", "metric_name", "metric_names", "threshold_multiplier"],
+             datatype=["str", "str", "str", "number"],
+             label="Generator Methods",
+             interactive=True,
+         )
+         payload_generator_methods = gr.JSON(label="Methods Passed to Payload Generator")
+         with gr.Row():
+             generator_method_name = gr.Dropdown(label="method_name", choices=GENERATOR_METHODS, value="get_median_filter_rows")
+             generator_metric_name = gr.Dropdown(
+                 label="metrics",
+                 choices=METRIC_CHOICES,
+                 value=[],
+                 multiselect=True,
+             )
+             generator_multiplier = gr.Number(label="threshold_multiplier", value=2.0)
+         with gr.Row():
+             add_generator_btn = gr.Button("Add / Update Generator", variant="secondary")
+             remove_generator_btn = gr.Button("Remove Generator", variant="secondary")
+             clear_generators_btn = gr.Button("Clear Generators", variant="secondary")
+
+         status = gr.Textbox(label="Status", interactive=False)
+         summary = gr.JSON(label="Episode Summary")
+         active_task = gr.JSON(label="Active Task", value=TASK_SUMMARIES[TASK_CHOICES[0]])
+         task_catalog = gr.JSON(label="Available Tasks", value=list(TASK_SUMMARIES.values()))
+         synthetic_methods = gr.JSON(label="Synthetic Generator Methods")
+         applied_synthetic_generators = gr.Dataframe(label="Applied Synthetic Generators", interactive=False)
+         available_methods = gr.JSON(label="Shared Methods")
+         submission_feedback = gr.JSON(label="Submission Feedback")
+         reward_breakdown = gr.JSON(label="Reward Breakdown")
+         generated_rows = gr.Dataframe(label="Generated Payload Rows", interactive=False)
+         raw_json = gr.Code(label="Latest Environment Response", language="json", interactive=False)
+         debug_snapshot = gr.JSON(label="Debug Snapshot")
+
+         def apply_task_defaults(selected_task_id: str):
+             task = get_task_spec(selected_task_id)
+             return (
+                 task.seed,
+                 task.scenario_family,
+                 task.difficulty,
+                 task.anomaly_density,
+                 task.anomaly_count,
+                 task.to_model().model_dump(),
+             )
+
+         async def reset_episode(selected_task_id, seed_value, family, level, density, anomaly_count_value, reset_anomalies_json, debug_enabled):
+             try:
+                 parsed_anomalies = json.loads(reset_anomalies_json or "[]")
+                 if not isinstance(parsed_anomalies, list):
+                     raise ValueError("Reset anomalies JSON must be a list.")
+             except Exception as exc:
+                 return (
+                     _empty_state(),
+                     f"Invalid reset anomalies JSON: {exc}",
+                     {"status": "error"},
+                     {},
+                     list(TASK_SUMMARIES.values()),
+                     [],
+                     _generator_frame([]),
+                     [],
+                     {},
+                     {},
+                     {},
+                     _generator_frame([]),
+                     _generator_frame([]),
+                     "",
+                     _debug_snapshot(web_manager, debug_enabled),
+                     _generator_frame([]),
+                     [],
+                 )
+             web_manager.env.set_debug_mode(bool(debug_enabled))
+             data = await web_manager.reset_environment(
+                 {
+                     "task_id": selected_task_id,
+                     "seed": int(seed_value or 0),
+                     "scenario_family": family,
+                     "difficulty": level,
+                     "anomaly_density": density,
+                     "anomaly_count": int(anomaly_count_value or 3),
+                     "anomalies": parsed_anomalies,
+                 }
+             )
+             method_data = await web_manager.step_environment(
+                 {
+                     "analysis_method": "show_raw_data",
+                     "analysis_args": {"limit": 5},
+                     "classifications": [],
+                     "payload_generators": [],
+                 }
+             )
+             state = _state_from_response(data)
+             state["latest_response"] = method_data
+             state["last_method_result"] = method_data.get("observation", {}).get("analysis_result")
+             obs = data.get("observation", {})
+             method_result = state["last_method_result"] or {}
+             available_tasks = obs.get("available_tasks") or list(TASK_SUMMARIES.values())
+             active_task_payload = next(
+                 (item for item in available_tasks if item.get("task_id") == obs.get("task_id")),
+                 {
+                     "task_id": obs.get("task_id"),
+                     "instruction": obs.get("instruction"),
+                     "objective": obs.get("message"),
+                     "difficulty": (obs.get("config") or {}).get("difficulty"),
+                     "grader_name": (obs.get("config") or {}).get("grader_name"),
+                 },
+             )
+             return (
+                 state,
+                 obs.get("message", ""),
+                 {
+                     "task_id": obs.get("task_id"),
+                     "status": obs.get("status"),
+                     "config": obs.get("config"),
+                     "expected_row_count": obs.get("expected_row_count"),
+                 },
+                 active_task_payload,
+                 available_tasks,
+                 [item for item in obs.get("available_synthetic_generator_methods", [])],
+                 pd.DataFrame([item for item in obs.get("applied_synthetic_generators", [])]),
+                 [item for item in obs.get("available_methods", [])],
+                 method_result,
+                 obs.get("submission_issues") or [],
+                 obs.get("reward_breakdown") or {},
+                 _method_frame(method_result),
+                 pd.DataFrame(),
+                 json.dumps(method_data, indent=2),
+                 _debug_snapshot(web_manager, debug_enabled),
+                 _generator_frame(state["payload_generators"]),
+                 state["payload_generators"],
+             )
+
+         async def run_method(
+             payload: dict,
+             selected_method: str,
+             metric_names: list[str],
+             method_date_value: str,
+             method_entity_value: str,
+             method_rows_value: str,
+             threshold: float,
+             limit_value: int,
+         ):
+             if not payload.get("active"):
+                 return payload, {"error": "Reset the environment first."}, "", gr.skip(), gr.skip(), gr.skip()
+             args = _method_args(
+                 selected_method,
+                 metric_names,
+                 method_date_value,
+                 method_entity_value,
+                 method_rows_value,
+                 threshold,
+                 limit_value,
+                 payload["payload_generators"],
+             )
+             data = await web_manager.step_environment(
+                 {
+                     "analysis_method": selected_method,
+                     "analysis_args": args,
+                     "classifications": [],
+                     "payload_generators": [],
+                 }
+             )
+             payload["latest_response"] = data
+             payload["last_method_result"] = data.get("observation", {}).get("analysis_result")
+             method_result = payload["last_method_result"] or {}
+             generated = method_result.get("result", {}).get("generated_rows", [])
+             method_frame = _method_frame(method_result)
+             return (
+                 payload,
+                 method_result,
+                 data.get("observation", {}).get("message", ""),
+                 method_frame,
+                 pd.DataFrame(generated),
+                 json.dumps(data, indent=2),
+             )
+
+         def add_or_update_generator(payload: dict, method_name_value: str, metric_names: list[str], threshold_multiplier: float):
+             if not payload.get("active"):
+                 return payload, _generator_frame([]), []
+             metric_names = [item for item in (metric_names or []) if item]
+             row = {
+                 "method_name": method_name_value,
+                 "metric_name": metric_names[0] if len(metric_names) == 1 else None,
+                 "metric_names": metric_names,
+                 "threshold_multiplier": float(threshold_multiplier),
+             }
+             keyed = {
+                 _generator_row_key(item): item
+                 for item in payload["payload_generators"]
+             }
+             keyed[_generator_row_key(row)] = row
+             payload["payload_generators"] = list(keyed.values())
+             return payload, _generator_frame(payload["payload_generators"]), payload["payload_generators"]
+
+         def remove_generator(payload: dict, method_name_value: str, metric_names: list[str]):
+             if not payload.get("active"):
+                 return payload, _generator_frame([]), []
+             metric_names = [item for item in (metric_names or []) if item]
+             payload["payload_generators"] = [
+                 item
+                 for item in payload["payload_generators"]
+                 if not (
+                     item.get("method_name") == method_name_value
+                     and [name for name in item.get("metric_names", []) if name] == metric_names
+                 )
+             ]
+             return payload, _generator_frame(payload["payload_generators"]), payload["payload_generators"]
+
+         def clear_generators(payload: dict):
+             payload["payload_generators"] = []
+             return payload, _generator_frame([]), []
+
+         def sync_generator_rows(payload: dict, generator_rows):
+             normalized = _normalize_generator_rows(generator_rows)
+             payload["payload_generators"] = normalized
+             return payload, _generator_frame(normalized), normalized
+
+         async def preview_payload(payload: dict, generator_rows):
+             if not payload.get("active"):
+                 return payload, {"error": "Reset the environment first."}, _generator_frame([]), []
+             payload["payload_generators"] = _normalize_generator_rows(generator_rows)
+             if not payload.get("payload_generators"):
+                 return payload, {"error": "Add at least one payload generator first."}, _generator_frame([]), []
+             data = await web_manager.step_environment(
+                 {
+                     "analysis_method": "payload_generator",
+                     "analysis_args": {"generator_methods": payload["payload_generators"]},
+                     "classifications": [],
+                     "payload_generators": [],
+                 }
+             )
+             payload["latest_response"] = data
+             payload["last_method_result"] = data.get("observation", {}).get("analysis_result")
+             result = payload["last_method_result"] or {}
+             return payload, result, pd.DataFrame(result.get("result", {}).get("generated_rows", [])), payload["payload_generators"]
+
+         async def submit_payload(payload: dict, debug_enabled: bool, generator_rows):
+             if not payload.get("active"):
+                 return payload, "Reset the environment first.", gr.skip(), gr.skip(), gr.skip(), "", gr.skip(), gr.skip(), gr.skip()
+             payload["payload_generators"] = _normalize_generator_rows(generator_rows)
+             if not payload.get("payload_generators"):
+                 return (
+                     payload,
+                     "Add at least one payload generator before submitting.",
+                     {
+                         "status": "ready",
+                         "generator_count": 0,
+                     },
+                     {"error": "No payload generators configured."},
+                     {},
+                     "",
+                     _debug_snapshot(web_manager, debug_enabled),
+                     _generator_frame([]),
+                     [],
+                 )
+             data = await web_manager.step_environment(
+                 {
+                     "payload_generators": payload["payload_generators"],
+                     "classifications": [],
+                 }
+             )
+             payload["latest_response"] = data
+             obs = data.get("observation", {})
+             summary = {
+                 "task_id": obs.get("task_id"),
+                 "status": obs.get("status"),
+                 "message": obs.get("message"),
+                 "config": obs.get("config"),
+                 "expected_row_count": obs.get("expected_row_count"),
+                 "correct_row_count": obs.get("correct_row_count"),
+                 "generated_row_count": len(obs.get("generated_rows") or []),
+                 "submitted_row_count": len(obs.get("submitted_rows") or []),
+                 "issue_count": len(obs.get("submission_issues") or []),
+                 "reward": data.get("reward", 0.0),
+                 "done": data.get("done", False),
+             }
+             feedback = {
+                 "message": obs.get("message", ""),
+                 "issue_count": len(obs.get("submission_issues") or []),
+                 "issues": obs.get("submission_issues") or [],
+                 "generated_row_count": len(obs.get("generated_rows") or []),
+                 "generator_count": len(payload.get("payload_generators") or []),
+             }
+             return (
+                 payload,
+                 obs.get("message", ""),
+                 summary,
+                 feedback,
+                 obs.get("reward_breakdown") or {},
+                 json.dumps(data, indent=2),
+                 _debug_snapshot(web_manager, debug_enabled),
+                 pd.DataFrame([row for row in obs.get("generated_rows", [])]),
+                 payload["payload_generators"],
+             )
+
+         def get_state_sync():
+             return json.dumps(web_manager.get_state(), indent=2)
+
+         reset_btn.click(
+             fn=reset_episode,
+             inputs=[task_id, seed, scenario_family, difficulty, anomaly_density, anomaly_count, reset_anomalies, debug_mode],
+             outputs=[
+                 session_state,
+                 status,
+                 summary,
+                 active_task,
+                 task_catalog,
+                 synthetic_methods,
+                 applied_synthetic_generators,
+                 available_methods,
+                 analysis_result,
+                 submission_feedback,
+                 reward_breakdown,
+                 method_rows,
+                 generated_rows,
+                 raw_json,
+                 debug_snapshot,
+                 generator_methods_df,
+                 payload_generator_methods,
+             ],
+         )
+         task_id.change(
+             fn=apply_task_defaults,
+             inputs=[task_id],
+             outputs=[seed, scenario_family, difficulty, anomaly_density, anomaly_count, task_details],
+         )
+         run_method_btn.click(
+             fn=run_method,
+             inputs=[
+                 session_state,
+                 method_name,
+                 method_metric,
+                 method_date,
+                 method_entity,
+                 method_rows_json,
+                 method_threshold,
+                 method_limit,
+             ],
+             outputs=[session_state, analysis_result, status, method_rows, generated_rows, raw_json],
518
+ )
519
+ add_generator_btn.click(
520
+ fn=add_or_update_generator,
521
+ inputs=[session_state, generator_method_name, generator_metric_name, generator_multiplier],
522
+ outputs=[session_state, generator_methods_df, payload_generator_methods],
523
+ )
524
+ remove_generator_btn.click(
525
+ fn=remove_generator,
526
+ inputs=[session_state, generator_method_name, generator_metric_name],
527
+ outputs=[session_state, generator_methods_df, payload_generator_methods],
528
+ )
529
+ clear_generators_btn.click(
530
+ fn=clear_generators,
531
+ inputs=[session_state],
532
+ outputs=[session_state, generator_methods_df, payload_generator_methods],
533
+ )
534
+ generator_methods_df.change(
535
+ fn=sync_generator_rows,
536
+ inputs=[session_state, generator_methods_df],
537
+ outputs=[session_state, generator_methods_df, payload_generator_methods],
538
+ )
539
+ preview_btn.click(
540
+ fn=preview_payload,
541
+ inputs=[session_state, generator_methods_df],
542
+ outputs=[session_state, analysis_result, generated_rows, payload_generator_methods],
543
+ )
544
+ submit_btn.click(
545
+ fn=submit_payload,
546
+ inputs=[session_state, debug_mode, generator_methods_df],
547
+ outputs=[session_state, status, summary, submission_feedback, reward_breakdown, raw_json, debug_snapshot, generated_rows, payload_generator_methods],
548
+ )
549
+ get_state_btn.click(fn=get_state_sync, outputs=[raw_json])
550
+
551
+ return demo
552
+
553
+
554
+ def _method_args(
555
+ method_name: str,
556
+ metric_names: list[str],
557
+ method_date: str,
558
+ method_entity: str,
559
+ method_rows_json: str,
560
+ threshold: float,
561
+ limit_value: int,
562
+ payload_generators: list[dict],
563
+ ) -> dict:
564
+ selected = [item for item in (metric_names or []) if item]
565
+ resolved_date = (method_date or "").strip()
566
+ resolved_entity = (method_entity or "").strip()
567
+ if method_name == "show_raw_data":
568
+ return {"limit": int(limit_value or 5)}
569
+ if method_name in {"rows_for_date", "hourly_rows_for_date", "detect_funnel_break", "check_impossible_counts"}:
570
+ return {"date": resolved_date}
571
+ if method_name in {"compare_rate_to_median", "compare_count_to_median"}:
572
+ return {
573
+ "date": resolved_date,
574
+ "entity_name": resolved_entity,
575
+ }
576
+ if method_name in {"get_metric_median", "get_metric_std_dev_from_median"}:
577
+ return {
578
+ "metric_name": selected[0] if len(selected) == 1 else None,
579
+ "metric_names": selected,
580
+ }
581
+ if method_name == "get_rows_with_abs_diff_from_median_gt":
582
+ return {
583
+ "metric_name": selected[0] if len(selected) == 1 else None,
584
+ "metric_names": selected,
585
+ "threshold": float(threshold),
586
+ }
587
+ if method_name in {
588
+ "get_median_filter_rows",
589
+ "get_rate_drop_from_median_rows",
590
+ "get_rate_spike_from_median_rows",
591
+ "get_absolute_drop_in_event_count_rows",
592
+ "get_absolute_spike_in_event_count_rows",
593
+ }:
594
+ return {
595
+ "metric_name": selected[0] if len(selected) == 1 else None,
596
+ "metric_names": selected,
597
+ "threshold_multiplier": float(threshold),
598
+ }
599
+ if method_name in {
600
+ "get_funnel_break_rows",
601
+ "get_hourly_traffic_mix_shift_rows",
602
+ "get_instrumentation_data_quality_issue_rows",
603
+ }:
604
+ return {"threshold_multiplier": float(threshold)}
605
+ if method_name == "payload_generator":
606
+ return {"generator_methods": payload_generators}
607
+ if method_name == "list_suspicious_dates":
608
+ return {"limit": int(limit_value or 10)}
609
+ if method_name == "preview_submission":
610
+ return {"rows": _parse_rows_json(method_rows_json)}
611
+ return {}
612
+
613
+
614
+ def _parse_rows_json(raw_value: str) -> list[dict]:
615
+ if not raw_value or not raw_value.strip():
616
+ return []
617
+ parsed = json.loads(raw_value)
618
+ if not isinstance(parsed, list):
619
+ raise ValueError("rows JSON must be a list.")
620
+ return [item for item in parsed if isinstance(item, dict)]
621
+
622
+
623
+ def _method_frame(method_result: dict) -> pd.DataFrame:
624
+ result = (method_result or {}).get("result") or {}
625
+ if isinstance(result, dict):
626
+ if isinstance(result.get("results"), list):
627
+ rows = []
628
+ for item in result["results"]:
629
+ if isinstance(item, dict) and isinstance(item.get("rows"), list):
630
+ for row in item["rows"]:
631
+ enriched = dict(row)
632
+ enriched["metric_name"] = item.get("metric_name", enriched.get("metric_name"))
633
+ rows.append(enriched)
634
+ elif isinstance(item, dict):
635
+ rows.append(item)
636
+ return pd.DataFrame(rows)
637
+ if isinstance(result.get("rows"), list):
638
+ return pd.DataFrame(result["rows"])
639
+ if isinstance(result.get("dates"), list):
640
+ return pd.DataFrame(result["dates"])
641
+ if isinstance(result.get("generated_rows"), list):
642
+ return pd.DataFrame(result["generated_rows"])
643
+ return pd.DataFrame()
644
+
645
+
646
+ def _state_from_response(data: dict) -> dict:
647
+ return {
648
+ "active": True,
649
+ "payload_generators": [],
650
+ "last_method_result": data.get("observation", {}).get("analysis_result"),
651
+ "latest_response": data,
652
+ }
653
+
654
+
655
+ def _normalize_generator_rows(generator_rows) -> list[dict]:
656
+ if generator_rows is None:
657
+ return []
658
+ if isinstance(generator_rows, pd.DataFrame):
659
+ rows = generator_rows.to_dict(orient="records")
660
+ elif isinstance(generator_rows, list):
661
+ rows = generator_rows
662
+ else:
663
+ return []
664
+
665
+ normalized = []
666
+ for row in rows:
667
+ if not isinstance(row, dict):
668
+ continue
669
+ metric_names = row.get("metric_names", [])
670
+ if isinstance(metric_names, str):
671
+ metric_names = [item for item in metric_names.split(",") if item]
672
+ elif not isinstance(metric_names, list):
673
+ metric_names = []
674
+ normalized.append(
675
+ {
676
+ "method_name": row.get("method_name"),
677
+ "metric_name": row.get("metric_name"),
678
+ "metric_names": metric_names,
679
+ "threshold_multiplier": float(row.get("threshold_multiplier", 0.0)),
680
+ }
681
+ )
682
+ return normalized
683
+
684
+
685
+ def _generator_row_key(row: dict) -> str:
686
+ metric_names = [item for item in (row.get("metric_names") or []) if item]
687
+ return (
688
+ f"{row.get('method_name') or ''}"
689
+ f"|{','.join(metric_names)}"
690
+ f"|{row.get('metric_name') or ''}"
691
+ f"|{float(row.get('threshold_multiplier', 0.0)):.6f}"
692
+ )
693
+
694
+
695
+ def _generator_frame(rows: list[dict]) -> pd.DataFrame:
696
+ normalized = []
697
+ for row in rows or []:
698
+ metric_names = [item for item in (row.get("metric_names") or []) if item]
699
+ normalized.append(
700
+ {
701
+ "method_name": row.get("method_name") or "",
702
+ "metric_name": row.get("metric_name") or "",
703
+ "metric_names": ",".join(metric_names),
704
+ "threshold_multiplier": float(row.get("threshold_multiplier", 0.0)),
705
+ }
706
+ )
707
+ return pd.DataFrame(
708
+ normalized,
709
+ columns=["method_name", "metric_name", "metric_names", "threshold_multiplier"],
710
+ )
711
+
712
+
713
+ def _debug_snapshot(web_manager, debug_enabled: bool) -> dict:
714
+ if not debug_enabled:
715
+ return {}
716
+ try:
717
+ return web_manager.env.export_debug_snapshot()
718
+ except Exception as exc:
719
+ return {"error": str(exc)}
720
+
721
+
722
+ def _empty_state() -> dict:
723
+ return {
724
+ "active": False,
725
+ "payload_generators": [],
726
+ "last_method_result": None,
727
+ "latest_response": None,
728
+ }
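
The handlers above all speak the same one-action-per-step wire format: a plain JSON dict carrying `analysis_method`/`analysis_args` for read-only exploration, or `payload_generators` for a graded submission. A minimal sketch of both shapes, assuming a `web_manager` that exposes the same async `step_environment` coroutine the app uses (the metric name is taken from the tests further down):

```python
# Minimal sketch of the two action shapes sent by the Gradio handlers above.
# Assumes `web_manager.step_environment` is the same coroutine the app wires up.
async def demo_actions(web_manager) -> None:
    # Read-only analysis step: no rows are submitted, reward stays 0.0.
    analysis = await web_manager.step_environment(
        {
            "analysis_method": "list_suspicious_dates",
            "analysis_args": {"limit": 3},
            "classifications": [],
            "payload_generators": [],
        }
    )
    print(analysis["observation"]["analysis_result"]["result"]["dates"])

    # Graded submission: the server expands generators into rows and scores them.
    graded = await web_manager.step_environment(
        {
            "payload_generators": [
                {
                    "method_name": "get_median_filter_rows",
                    "metric_name": "app_open_to_order_placed",
                    "metric_names": ["app_open_to_order_placed"],
                    "threshold_multiplier": 2.0,
                }
            ],
            "classifications": [],
        }
    )
    print(graded.get("reward"), graded.get("done"))
```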
server/metric_tracker_rl_environment.py ADDED
@@ -0,0 +1,417 @@
+ """Metric tracking RL environment."""
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from uuid import uuid4
+
+ from openenv.core.env_server.interfaces import Environment
+ from openenv.core.env_server.types import State
+
+ try:
+     from ..analysis_tools import AnalysisContext, SharedAnalysisToolkit, available_analysis_methods
+     from ..evaluation import EvaluationConfig
+     from ..models import (
+         MetricTrackerRlAction,
+         MetricTrackerRlObservation,
+         MetricSubmissionRow,
+         SyntheticAnomalyGenerator,
+     )
+     from ..tasks import DEFAULT_TASK_ID, available_task_specs, get_task_spec
+     from .data_generator import (
+         EpisodeConfig,
+         EpisodeData,
+         MetricDataGenerator,
+         available_synthetic_generator_methods,
+     )
+ except ImportError:
+     from analysis_tools import AnalysisContext, SharedAnalysisToolkit, available_analysis_methods
+     from models import (
+         MetricTrackerRlAction,
+         MetricTrackerRlObservation,
+         MetricSubmissionRow,
+         SyntheticAnomalyGenerator,
+     )
+     from tasks import DEFAULT_TASK_ID, available_task_specs, get_task_spec
+     from server.data_generator import (
+         EpisodeConfig,
+         EpisodeData,
+         MetricDataGenerator,
+         available_synthetic_generator_methods,
+     )
+     from evaluation import EvaluationConfig
+
+
+ @dataclass(frozen=True)
+ class RewardConfig:
+     """Compatibility wrapper around the evaluator configuration."""
+
+     evaluation: EvaluationConfig = EvaluationConfig()
+
+
+ class MetricTrackerRlEnvironment(Environment):
+     """Iterative multi-anomaly benchmark with safe analysis methods."""
+
+     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+
+     def __init__(
+         self,
+         generator: MetricDataGenerator | None = None,
+         reward_config: RewardConfig | None = None,
+     ) -> None:
+         initial_task = get_task_spec(DEFAULT_TASK_ID)
+         self._generator = generator or MetricDataGenerator()
+         self._reward_config = reward_config or RewardConfig()
+         self._state = State(episode_id=str(uuid4()), step_count=0)
+         self._episode: EpisodeData | None = None
+         self._completed = False
+         self._debug_mode = False
+         self._active_task = initial_task
+         self._next_task_id = initial_task.task_id
+         self._next_reset_config = initial_task.build_episode_config()
+         self._last_analysis_result: dict | None = None
+         self._expose_applied_generators = False
+
+     def configure_next_reset(
+         self,
+         *,
+         task_id: str | None = None,
+         seed: int | None = None,
+         scenario_family: str | None = None,
+         difficulty: str | None = None,
+         anomaly_density: str | None = None,
+         anomaly_count: int | None = None,
+         anomalies: list[dict] | list[SyntheticAnomalyGenerator] | None = None,
+     ) -> None:
+         """Update the configuration used for the next reset."""
+         base_task = get_task_spec(task_id or self._next_task_id)
+         base_config = base_task.build_episode_config() if task_id else self._next_reset_config
+         anomaly_generators = tuple(
+             item if isinstance(item, SyntheticAnomalyGenerator) else SyntheticAnomalyGenerator(**item)
+             for item in (anomalies or [])
+         )
+         self._next_task_id = base_task.task_id
+         self._next_reset_config = EpisodeConfig(
+             seed=base_config.seed if seed is None else seed,
+             scenario_family=base_config.scenario_family if scenario_family is None else scenario_family,
+             difficulty=base_config.difficulty if difficulty is None else difficulty,
+             anomaly_density=base_config.anomaly_density if anomaly_density is None else anomaly_density,
+             anomaly_count=base_config.anomaly_count if anomaly_count is None else anomaly_count,
+             anomaly_generators=anomaly_generators or base_config.anomaly_generators,
+         ).normalized()
+
+     def set_debug_mode(self, enabled: bool) -> None:
+         """Enable or disable debug-only environment views."""
+         self._debug_mode = bool(enabled)
+
+     def export_debug_snapshot(self) -> dict:
+         """Return a developer-only debug snapshot for the active episode."""
+         if not self._debug_mode:
+             raise RuntimeError("Debug mode is disabled.")
+         if self._episode is None:
+             return {}
+         return {
+             "config": self._episode.config.__dict__,
+             "expected_payload": [row.model_dump() for row in self._episode.expected_rows],
+             "anomaly_schedule": self._episode.anomaly_schedule,
+             "applied_synthetic_generators": [
+                 row.model_dump() for row in self._episode.applied_synthetic_generators
+             ],
+         }
+
+     def reset(
+         self,
+         task_id: str | None = None,
+         seed: int | None = None,
+         scenario_family: str | None = None,
+         difficulty: str | None = None,
+         anomaly_density: str | None = None,
+         anomaly_count: int | None = None,
+         anomalies: list[dict] | list[SyntheticAnomalyGenerator] | None = None,
+     ) -> MetricTrackerRlObservation:
+         """Generate a fresh dataset and hidden target payload."""
+         if any(value is not None for value in (task_id, seed, scenario_family, difficulty, anomaly_density, anomaly_count)) or anomalies is not None:
+             self.configure_next_reset(
+                 task_id=task_id,
+                 seed=seed,
+                 scenario_family=scenario_family,
+                 difficulty=difficulty,
+                 anomaly_density=anomaly_density,
+                 anomaly_count=anomaly_count,
+                 anomalies=anomalies,
+             )
+         self._state = State(episode_id=str(uuid4()), step_count=0)
+         self._active_task = get_task_spec(self._next_task_id)
+         self._episode = self._generator.generate_episode(self._next_reset_config)
+         self._completed = False
+         self._last_analysis_result = None
+         self._expose_applied_generators = anomalies is not None
+         return self._build_observation(
+             status="ready",
+             message=self._active_task.objective,
+             reward=0.0,
+             done=False,
+         )
+
+     def step(self, action: MetricTrackerRlAction) -> MetricTrackerRlObservation:  # type: ignore[override]
+         """Evaluate a submitted payload and return deterministic feedback."""
+         if self._episode is None:
+             return self.reset()
+         if self._completed:
+             return self._build_observation(
+                 status="completed",
+                 message="Dataset already solved. Call reset() to create a new dataset.",
+                 reward=1.0,
+                 done=True,
+                 submitted_rows=action.classifications,
+             )
+
+         if action.analysis_method:
+             self._state.step_count += 1
+             analysis_result = self._run_analysis(action.analysis_method, action.analysis_args)
+             self._last_analysis_result = analysis_result
+             return self._build_observation(
+                 status="analyzed",
+                 message=f"Ran analysis method `{action.analysis_method}`.",
+                 reward=0.0,
+                 done=False,
+                 analysis_result=analysis_result,
+             )
+
+         submitted_rows = action.classifications
+         generated_rows: list[MetricSubmissionRow] = []
+         if action.payload_generators:
+             generator_result = self._run_analysis(
+                 "payload_generator",
+                 {"generator_methods": [item.model_dump() for item in action.payload_generators]},
+             )
+             self._last_analysis_result = generator_result
+             generated_rows = [
+                 MetricSubmissionRow(**row)
+                 for row in generator_result["result"]["generated_rows"]
+             ]
+             submitted_rows = generated_rows
+
+         self._state.step_count += 1
+         result = self._active_task.grade_submission(
+             submitted_rows,
+             self._episode.expected_rows,
+             config=self._reward_config.evaluation,
+             include_debug_expected=self._debug_mode,
+         )
+         self._completed = result.is_perfect
+         reward = result.reward_breakdown.total_score
+         message = self._submission_message(result)
+         return self._build_observation(
+             status="evaluated" if result.is_perfect else "in_progress",
+             message=message,
+             reward=reward,
+             done=result.is_perfect,
+             submitted_rows=result.preview.normalized_rows,
+             reward_breakdown=result.reward_breakdown,
+             submission_preview=result.preview,
+             issues=result.issues,
+             correct_row_count=result.matched_rows,
+             analysis_result=self._last_analysis_result,
+             generated_rows=generated_rows,
+         )
+
+     @property
+     def state(self) -> State:
+         """Return current episode state."""
+         return self._state
+
+     def _build_observation(
+         self,
+         *,
+         status: str,
+         message: str,
+         reward: float,
+         done: bool,
+         submitted_rows=None,
+         reward_breakdown=None,
+         submission_preview=None,
+         issues=None,
+         correct_row_count: int = 0,
+         analysis_result=None,
+         generated_rows=None,
+     ) -> MetricTrackerRlObservation:
+         assert self._episode is not None
+         metadata = {
+             "step": self._state.step_count,
+             "current_state": self.state.model_dump(),
+             "task_id": self._active_task.task_id,
+             "objective": self._active_task.objective,
+             "grader_name": self._active_task.grader_name,
+             "seed": self._episode.config.seed,
+             "scenario_family": self._episode.config.scenario_family,
+             "difficulty": self._episode.config.difficulty,
+             "anomaly_density": self._episode.config.anomaly_density,
+             "anomaly_count": self._episode.config.anomaly_count,
+         }
+         return MetricTrackerRlObservation(
+             task_id=self._active_task.task_id,
+             status=status,
+             message=message,
+             instruction=self._active_task.instruction,
+             conversion_metric_definitions=list(self._generator.config.conversion_definitions),
+             available_synthetic_generator_methods=available_synthetic_generator_methods(),
+             applied_synthetic_generators=(
+                 self._episode.applied_synthetic_generators
+                 if self._debug_mode or self._expose_applied_generators
+                 else []
+             ),
+             available_methods=available_analysis_methods(),
+             available_tasks=available_task_specs(),
+             daily_metrics=[],
+             hourly_metrics=[],
+             analysis_result=analysis_result,
+             generated_rows=generated_rows or [],
+             submitted_rows=submitted_rows or [],
+             submission_preview=submission_preview,
+             submission_issues=issues or [],
+             reward_breakdown=reward_breakdown,
+             expected_row_count=len(self._episode.expected_rows),
+             correct_row_count=correct_row_count,
+             reward=reward,
+             done=done,
+             config=metadata,
+             debug=(
+                 {
+                     "task_id": self._active_task.task_id,
+                     "expected_payload": [row.model_dump() for row in self._episode.expected_rows],
+                     "anomaly_schedule": self._episode.anomaly_schedule,
+                     "reward_breakdown": reward_breakdown.model_dump() if reward_breakdown else None,
+                     "issues": [item.model_dump() for item in (issues or [])],
+                 }
+                 if self._debug_mode
+                 else None
+             ),
+         )
+
+     def _run_analysis(self, method_name: str, arguments: dict) -> dict:
+         toolkit = SharedAnalysisToolkit(
+             AnalysisContext(
+                 daily_metrics=self._episode.daily_metrics,
+                 hourly_metrics=self._episode.hourly_metrics,
+                 conversion_definitions=list(self._generator.config.conversion_definitions),
+                 instruction=self._active_task.instruction,
+                 config={
+                     "task_id": self._active_task.task_id,
+                     "objective": self._active_task.objective,
+                     "grader_name": self._active_task.grader_name,
+                     **self._episode.config.__dict__,
+                 },
+             )
+         )
+         if method_name == "task_overview":
+             result = toolkit.task_overview()
+         elif method_name == "list_dates":
+             result = toolkit.list_dates()
+         elif method_name == "list_entities":
+             result = toolkit.list_entities()
+         elif method_name == "rows_for_date":
+             result = toolkit.rows_for_date(arguments["date"])
+         elif method_name == "hourly_rows_for_date":
+             result = toolkit.hourly_rows_for_date(arguments["date"])
+         elif method_name == "compare_rate_to_median":
+             result = toolkit.compare_rate_to_median(arguments["date"], arguments["entity_name"])
+         elif method_name == "compare_count_to_median":
+             result = toolkit.compare_count_to_median(arguments["date"], arguments["entity_name"])
+         elif method_name == "detect_funnel_break":
+             result = toolkit.detect_funnel_break(arguments["date"])
+         elif method_name == "check_impossible_counts":
+             result = toolkit.check_impossible_counts(arguments["date"])
+         elif method_name == "list_suspicious_dates":
+             result = toolkit.list_suspicious_dates(limit=arguments.get("limit", 10))
+         elif method_name == "preview_submission":
+             result = toolkit.preview_submission(arguments.get("rows", []))
+         elif method_name == "show_raw_data":
+             result = toolkit.show_raw_data(limit=arguments.get("limit", 5))
+         elif method_name == "get_metric_median":
+             result = toolkit.get_metric_median_multi(
+                 metric_name=arguments.get("metric_name"),
+                 metric_names=arguments.get("metric_names", []),
+             )
+         elif method_name == "get_metric_std_dev_from_median":
+             result = toolkit.get_metric_std_dev_from_median_multi(
+                 metric_name=arguments.get("metric_name"),
+                 metric_names=arguments.get("metric_names", []),
+             )
+         elif method_name == "get_rows_with_abs_diff_from_median_gt":
+             result = toolkit.get_rows_with_abs_diff_from_median_gt_multi(
+                 metric_name=arguments.get("metric_name"),
+                 metric_names=arguments.get("metric_names", []),
+                 threshold=float(arguments["threshold"]),
+             )
+         elif method_name == "get_median_filter_rows":
+             result = toolkit.get_median_filter_rows_multi(
+                 metric_name=arguments.get("metric_name"),
+                 metric_names=arguments.get("metric_names", []),
+                 threshold_multiplier=float(arguments["threshold_multiplier"]),
+             )
+         elif method_name == "get_rate_drop_from_median_rows":
+             result = toolkit.get_rate_drop_from_median_rows(
+                 metric_name=arguments.get("metric_name"),
+                 metric_names=arguments.get("metric_names", []),
+                 threshold_multiplier=float(arguments["threshold_multiplier"]),
+             )
+         elif method_name == "get_rate_spike_from_median_rows":
+             result = toolkit.get_rate_spike_from_median_rows(
+                 metric_name=arguments.get("metric_name"),
+                 metric_names=arguments.get("metric_names", []),
+                 threshold_multiplier=float(arguments["threshold_multiplier"]),
+             )
+         elif method_name == "get_absolute_drop_in_event_count_rows":
+             result = toolkit.get_absolute_drop_in_event_count_rows(
+                 metric_name=arguments.get("metric_name"),
+                 metric_names=arguments.get("metric_names", []),
+                 threshold_multiplier=float(arguments["threshold_multiplier"]),
+             )
+         elif method_name == "get_absolute_spike_in_event_count_rows":
+             result = toolkit.get_absolute_spike_in_event_count_rows(
+                 metric_name=arguments.get("metric_name"),
+                 metric_names=arguments.get("metric_names", []),
+                 threshold_multiplier=float(arguments["threshold_multiplier"]),
+             )
+         elif method_name == "get_funnel_break_rows":
+             result = toolkit.get_funnel_break_rows(
+                 threshold_multiplier=float(arguments["threshold_multiplier"]),
+             )
+         elif method_name == "get_hourly_traffic_mix_shift_rows":
+             result = toolkit.get_hourly_traffic_mix_shift_rows(
+                 threshold_multiplier=float(arguments["threshold_multiplier"]),
+             )
+         elif method_name == "get_instrumentation_data_quality_issue_rows":
+             result = toolkit.get_instrumentation_data_quality_issue_rows(
+                 threshold_multiplier=float(arguments["threshold_multiplier"]),
+             )
+         elif method_name == "payload_generator":
+             result = toolkit.payload_generator(arguments.get("generator_methods", []))
+         else:
+             raise ValueError(f"Unsupported analysis method: {method_name}")
+
+         return {
+             "method": method_name,
+             "arguments": arguments,
+             "result": result,
+         }
+
+     @staticmethod
+     def _submission_message(result) -> str:
+         if result.is_perfect:
+             return "Submission is fully correct."
+         extra_issues = [issue for issue in result.issues if issue.issue_type == "extra_row"]
+         missing_count = result.reward_breakdown.missing_rows
+         if not extra_issues and missing_count > 0:
+             return (
+                 "All submitted rows are anomalies, but a few are missing. "
+                 f"Missing value count: {missing_count}."
+             )
+         if extra_issues:
+             first = extra_issues[0]
+             return f"Specific row is not an anomaly: {first.row_key}."
+         return (
+             f"Matched {result.reward_breakdown.matched_rows}/"
+             f"{result.reward_breakdown.expected_rows} expected rows. Review the feedback."
+         )
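
For orientation, here is a minimal in-process episode against the environment above, mirroring the test suite at the end of this commit; the import paths assume the project is importable as the `metric_tracker_rl` package:

```python
# Minimal in-process episode sketch; import paths assume the `metric_tracker_rl` package.
from metric_tracker_rl import MetricTrackerRlAction
from metric_tracker_rl.models import PayloadGeneratorMethod
from metric_tracker_rl.server.metric_tracker_rl_environment import MetricTrackerRlEnvironment

env = MetricTrackerRlEnvironment()
env.reset(task_id="easy_single_spike")

# Step 1: a safe, read-only analysis method (reward 0.0, episode continues).
analyzed = env.step(
    MetricTrackerRlAction(analysis_method="list_suspicious_dates", analysis_args={"limit": 3})
)
print(analyzed.analysis_result["result"]["dates"])

# Step 2: a generator-backed submission the environment expands into rows and grades.
result = env.step(
    MetricTrackerRlAction(
        payload_generators=[
            PayloadGeneratorMethod(
                method_name="get_median_filter_rows",
                metric_name="app_open_to_order_placed",
                threshold_multiplier=2.0,
            )
        ]
    )
)
print(result.status, result.reward, result.done)
```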
server/requirements.txt ADDED
@@ -0,0 +1,6 @@
+ openenv[core]>=0.2.0
+ fastapi>=0.115.0
+ uvicorn>=0.24.0
+ gradio>=5.0.0
+ pandas>=2.2.0
+ plotly>=5.24.0
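
These pins cover the HTTP API, the Gradio debugger, and the dataframe/plotting views. One way to exercise them locally, sketched under the assumption that the FastAPI application object is exposed as `server.app:app`:

```python
# Local launch sketch; assumes the FastAPI app object lives at server.app:app.
import uvicorn

if __name__ == "__main__":
    uvicorn.run("server.app:app", host="0.0.0.0", port=8000)
```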
tasks.py ADDED
@@ -0,0 +1,141 @@
+ """Named benchmark tasks and deterministic task graders."""
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass, field
+
+ try:
+     from .evaluation import EvaluationConfig, EvaluationResult, evaluate_submission
+     from .models import BenchmarkTaskSpec, MetricSubmissionRow
+     from .server.data_generator import EpisodeConfig
+ except ImportError:
+     from evaluation import EvaluationConfig, EvaluationResult, evaluate_submission
+     from models import BenchmarkTaskSpec, MetricSubmissionRow
+     from server.data_generator import EpisodeConfig
+
+
+ DEFAULT_GRADER_NAME = "deterministic_exact_match"
+
+
+ @dataclass(frozen=True)
+ class TaskSpec:
+     """A concrete benchmark task that an agent can solve and be graded on."""
+
+     task_id: str
+     difficulty: str
+     instruction: str
+     objective: str
+     seed: int
+     scenario_family: str
+     anomaly_density: str
+     anomaly_count: int
+     grader_name: str = DEFAULT_GRADER_NAME
+     evaluation_config: EvaluationConfig = field(default_factory=EvaluationConfig)
+
+     def build_episode_config(self) -> EpisodeConfig:
+         """Return the canonical episode configuration for this task."""
+         return EpisodeConfig(
+             seed=self.seed,
+             scenario_family=self.scenario_family,
+             difficulty=self.difficulty,
+             anomaly_density=self.anomaly_density,
+             anomaly_count=self.anomaly_count,
+         ).normalized()
+
+     def grade_submission(
+         self,
+         submitted_rows: list[dict] | list[MetricSubmissionRow],
+         expected_rows: list[MetricSubmissionRow],
+         *,
+         config: EvaluationConfig | None = None,
+         include_debug_expected: bool = False,
+     ) -> EvaluationResult:
+         """Grade one candidate submission for this task."""
+         return evaluate_submission(
+             submitted_rows,
+             expected_rows,
+             config=config or self.evaluation_config,
+             include_debug_expected=include_debug_expected,
+         )
+
+     def to_model(self) -> BenchmarkTaskSpec:
+         """Return a typed summary safe to expose in observations."""
+         return BenchmarkTaskSpec(
+             task_id=self.task_id,
+             difficulty=self.difficulty,
+             instruction=self.instruction,
+             objective=self.objective,
+             scenario_family=self.scenario_family,
+             anomaly_density=self.anomaly_density,
+             anomaly_count=self.anomaly_count,
+             grader_name=self.grader_name,
+         )
+
+
+ TASKS: dict[str, TaskSpec] = {
+     "easy_single_spike": TaskSpec(
+         task_id="easy_single_spike",
+         difficulty="easy",
+         instruction=(
+             "Investigate the seeded funnel dataset and submit the single anomalous row. "
+             "Use the shared analysis methods before submitting."
+         ),
+         objective=(
+             "Find the one obvious anomaly and submit exactly one correctly populated anomaly row."
+         ),
+         seed=11,
+         scenario_family="absolute_spike_in_event_count",
+         anomaly_density="low",
+         anomaly_count=1,
+     ),
+     "medium_mixed_pair": TaskSpec(
+         task_id="medium_mixed_pair",
+         difficulty="medium",
+         instruction=(
+             "Investigate the seeded funnel dataset and submit every anomalous row. "
+             "Expect both event-count and conversion-rate reasoning."
+         ),
+         objective=(
+             "Find the full set of medium-difficulty anomalies without submitting extras."
+         ),
+         seed=23,
+         scenario_family="mixed",
+         anomaly_density="medium",
+         anomaly_count=3,
+     ),
+     "hard_mixed_multi": TaskSpec(
+         task_id="hard_mixed_multi",
+         difficulty="hard",
+         instruction=(
+             "Investigate the seeded funnel dataset and submit every anomalous row. "
+             "Some anomalies are subtle, so use the analysis methods carefully and avoid over-submitting."
+         ),
+         objective=(
+             "Recover the complete set of hard mixed anomalies while preserving precision."
+         ),
+         seed=37,
+         scenario_family="mixed",
+         anomaly_density="high",
+         anomaly_count=5,
+     ),
+ }
+
+ DEFAULT_TASK_ORDER: tuple[str, ...] = (
+     "easy_single_spike",
+     "medium_mixed_pair",
+     "hard_mixed_multi",
+ )
+ DEFAULT_TASK_ID = DEFAULT_TASK_ORDER[0]
+
+
+ def get_task_spec(task_id: str) -> TaskSpec:
+     """Return the task spec for a known task id."""
+     try:
+         return TASKS[task_id]
+     except KeyError as exc:
+         raise ValueError(f"Unsupported task_id: {task_id}") from exc
+
+
+ def available_task_specs() -> list[BenchmarkTaskSpec]:
+     """Return typed summaries for all named benchmark tasks."""
+     return [TASKS[task_id].to_model() for task_id in DEFAULT_TASK_ORDER]
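
The registry above plugs directly into the deterministic grader. A short sketch, under the same package-name assumption as the tests below: grading the hidden expected rows against themselves must score a perfect 1.0.

```python
# Grade a perfect submission for a named task; mirrors test_task_grader_scores_perfect_submission.
from metric_tracker_rl.server.data_generator import MetricDataGenerator
from metric_tracker_rl.tasks import get_task_spec

task = get_task_spec("medium_mixed_pair")
episode = MetricDataGenerator().generate_episode(task.build_episode_config())

# Submitting the expected rows themselves is the upper bound on the reward.
result = task.grade_submission(episode.expected_rows, episode.expected_rows)
assert result.is_perfect
assert result.reward_breakdown.total_score == 1.0
```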
tests/test_metric_tracker_rl.py ADDED
@@ -0,0 +1,367 @@
+ from __future__ import annotations
+
+ from metric_tracker_rl import MetricTrackerRlAction
+ from metric_tracker_rl.analysis_tools import AnalysisContext, SharedAnalysisToolkit
+ from metric_tracker_rl.evaluation import evaluate_submission
+ from metric_tracker_rl.models import MetricSubmissionRow, PayloadGeneratorMethod
+ from metric_tracker_rl.server.data_generator import ALL_SCENARIO_FAMILIES, EpisodeConfig, MetricDataGenerator
+ from metric_tracker_rl.server.metric_tracker_rl_environment import MetricTrackerRlEnvironment
+ from metric_tracker_rl.tasks import DEFAULT_TASK_ORDER, TASKS, get_task_spec
+
+
+ def _toolkit_for(seed: int = 11, scenario_family: str = "mixed") -> tuple[SharedAnalysisToolkit, list[MetricSubmissionRow]]:
+     generator = MetricDataGenerator()
+     episode = generator.generate_episode(
+         EpisodeConfig(
+             seed=seed,
+             scenario_family=scenario_family,
+             difficulty="medium",
+             anomaly_density="medium",
+             anomaly_count=5,
+         )
+     )
+     toolkit = SharedAnalysisToolkit(
+         AnalysisContext(
+             daily_metrics=episode.daily_metrics,
+             hourly_metrics=episode.hourly_metrics,
+             conversion_definitions=list(generator.config.conversion_definitions),
+             config=episode.config.__dict__,
+         )
+     )
+     return toolkit, episode.expected_rows
+
+
+ def test_seed_reproducibility():
+     generator = MetricDataGenerator()
+     config = EpisodeConfig(seed=17, scenario_family="mixed", difficulty="hard", anomaly_density="high")
+     first = generator.generate_episode(config)
+     second = generator.generate_episode(config)
+
+     assert [row.model_dump() for row in first.daily_metrics] == [row.model_dump() for row in second.daily_metrics]
+     assert [row.model_dump() for row in first.hourly_metrics] == [row.model_dump() for row in second.hourly_metrics]
+     assert [row.model_dump() for row in first.expected_rows] == [row.model_dump() for row in second.expected_rows]
+
+
+ def test_anomaly_variety():
+     generator = MetricDataGenerator()
+     family_results = {}
+     for family in ALL_SCENARIO_FAMILIES[1:]:
+         episode = generator.generate_episode(
+             EpisodeConfig(
+                 seed=7,
+                 scenario_family=family,
+                 difficulty="medium",
+                 anomaly_density="medium",
+                 anomaly_count=5,
+             )
+         )
+         family_results[family] = {row.anomaly_type for row in episode.expected_rows}
+
+     assert family_results["rate_drop_from_median"] == {"rate_drop_from_median"}
+     assert family_results["rate_spike_from_median"] == {"rate_spike_from_median"}
+     assert family_results["absolute_drop_in_event_count"] == {"absolute_drop_in_event_count"}
+     assert family_results["absolute_spike_in_event_count"] == {"absolute_spike_in_event_count"}
+     assert family_results["funnel_break"] == {"funnel_break"}
+     assert family_results["hourly_traffic_mix_shift"] == {"hourly_traffic_mix_shift"}
+     assert family_results["instrumentation_data_quality_issue"] == {"instrumentation_data_quality_issue"}
+
+     mixed = generator.generate_episode(
+         EpisodeConfig(
+             seed=7,
+             scenario_family="mixed",
+             difficulty="medium",
+             anomaly_density="medium",
+             anomaly_count=5,
+         )
+     )
+     assert len(mixed.expected_rows) == 5
+     assert {row.anomaly_type for row in mixed.expected_rows}.issubset(
+         {
+             "rate_drop_from_median",
+             "rate_spike_from_median",
+             "absolute_drop_in_event_count",
+             "absolute_spike_in_event_count",
+         }
+     )
+     assert len({row.anomaly_type for row in mixed.expected_rows}) >= 2
+
+
+ def test_evaluator_scores_perfect_submission():
+     _, expected_rows = _toolkit_for()
+     result = evaluate_submission(expected_rows, expected_rows)
+
+     assert result.is_perfect is True
+     assert result.reward_breakdown.total_score == 1.0
+     assert result.reward_breakdown.extra_rows == 0
+     assert result.reward_breakdown.duplicate_rows == 0
+     assert result.reward_breakdown.invalid_rows == 0
+
+
+ def test_named_task_registry_covers_easy_medium_hard():
+     assert DEFAULT_TASK_ORDER == (
+         "easy_single_spike",
+         "medium_mixed_pair",
+         "hard_mixed_multi",
+     )
+     assert len(TASKS) == 3
+     assert {TASKS[task_id].difficulty for task_id in DEFAULT_TASK_ORDER} == {"easy", "medium", "hard"}
+     assert all(TASKS[task_id].grader_name for task_id in DEFAULT_TASK_ORDER)
+
+
+ def test_task_grader_scores_perfect_submission():
+     generator = MetricDataGenerator()
+     task = get_task_spec("medium_mixed_pair")
+     episode = generator.generate_episode(task.build_episode_config())
+
+     result = task.grade_submission(episode.expected_rows, episode.expected_rows)
+
+     assert result.is_perfect is True
+     assert result.reward_breakdown.total_score == 1.0
+
+
+ def test_duplicate_and_extra_rows_are_penalized():
+     _, expected_rows = _toolkit_for()
+     extra_row = MetricSubmissionRow(
+         date=expected_rows[0].date,
+         entity_type="event_count",
+         entity_name="nonexistent_metric",
+         anomaly_type="absolute_spike_in_event_count",
+         detection_method="compare_count_to_median",
+         baseline_value=100.0,
+         observed_value=120.0,
+         delta_value=20.0,
+         severity="low",
+     )
+     submitted = [expected_rows[0], expected_rows[0], extra_row]
+     result = evaluate_submission(submitted, expected_rows)
+
+     assert result.is_perfect is False
+     assert result.reward_breakdown.duplicate_rows == 1
+     assert result.reward_breakdown.extra_rows == 1
+     assert result.reward_breakdown.total_score < 1.0
+
+
+ def test_shared_methods_behave_consistently():
+     toolkit, expected_rows = _toolkit_for(seed=3, scenario_family="mixed")
+     overview = toolkit.task_overview()
+     suspicious = toolkit.list_suspicious_dates(limit=5)
+     first_row = expected_rows[0]
+
+     assert overview["payload_schema"][0] == "date"
+     method_names = {item["name"] for item in overview["available_methods"]}
+     assert "show_raw_data" in method_names
+     assert "get_median_filter_rows" in method_names
+     assert "get_funnel_break_rows" in method_names
+     assert "get_hourly_traffic_mix_shift_rows" in method_names
+     assert "get_instrumentation_data_quality_issue_rows" in method_names
+     assert "payload_generator" in method_names
+     assert len(suspicious["dates"]) == 5
+
+     if first_row.detection_method == "compare_rate_to_median":
+         result = toolkit.compare_rate_to_median(first_row.date, first_row.entity_name)
+         assert result["anomaly_type"] == first_row.anomaly_type
+     elif first_row.detection_method == "compare_count_to_median":
+         result = toolkit.compare_count_to_median(first_row.date, first_row.entity_name)
+         assert result["anomaly_type"] == first_row.anomaly_type
+     elif first_row.detection_method == "detect_funnel_break":
+         result = toolkit.detect_funnel_break(first_row.date)
+         assert any(item["entity_name"] == first_row.entity_name for item in result["candidates"])
+     elif first_row.detection_method == "check_impossible_counts":
+         result = toolkit.check_impossible_counts(first_row.date)
+         assert result["issue_count"] > 0
+     else:
+         result = toolkit.hourly_rows_for_date(first_row.date)
+         assert result["found"] is True
+
+     raw = toolkit.show_raw_data(limit=3)
+     assert raw["returned_rows"] == 3
+     median_stats = toolkit.get_metric_median("app_open_to_order_placed")
+     std_stats = toolkit.get_metric_std_dev_from_median("app_open_to_order_placed")
+     assert median_stats["sample_size"] > 0
+     assert std_stats["std_dev_from_median"] >= 0
+
+
+ def test_debug_mode_is_gated():
+     env = MetricTrackerRlEnvironment()
+     observation = env.reset()
+
+     assert observation.debug is None
+     assert observation.daily_metrics == []
+     assert observation.hourly_metrics == []
+
+     try:
+         env.export_debug_snapshot()
+     except RuntimeError as exc:
+         assert "Debug mode is disabled" in str(exc)
+     else:
+         raise AssertionError("Expected debug snapshot to be gated.")
+
+     env.set_debug_mode(True)
+     debug_observation = env.reset()
+     snapshot = env.export_debug_snapshot()
+
+     assert debug_observation.debug is not None
+     assert "expected_payload" in snapshot
+     assert "applied_synthetic_generators" in snapshot
+
+
+ def test_reset_exposes_synthetic_generator_metadata():
+     env = MetricTrackerRlEnvironment()
+     observation = env.reset()
+
+     assert observation.task_id == "easy_single_spike"
+     assert len(observation.available_tasks) == 3
+     assert observation.available_synthetic_generator_methods
+     assert observation.available_synthetic_generator_methods[0].name == "metric_stddev_shift"
+     assert observation.applied_synthetic_generators == []
+
+
+ def test_named_task_reset_updates_instruction_and_config():
+     env = MetricTrackerRlEnvironment()
+     observation = env.reset(task_id="hard_mixed_multi")
+
+     assert observation.task_id == "hard_mixed_multi"
+     assert observation.config["task_id"] == "hard_mixed_multi"
+     assert observation.config["grader_name"] == "deterministic_exact_match"
+     assert observation.config["difficulty"] == "hard"
+     assert observation.instruction == get_task_spec("hard_mixed_multi").instruction
+
+
+ def test_custom_reset_anomalies_support_specific_dates_and_stddev_factor():
+     env = MetricTrackerRlEnvironment()
+     observation = env.reset(
+         seed=21,
+         scenario_family="mixed",
+         anomaly_count=2,
+         anomalies=[
+             {
+                 "method_name": "metric_stddev_shift",
+                 "metric_name": "orders_placed",
+                 "date": "2026-03-20",
+                 "stddev_factor": 2.5,
+                 "direction": "down",
+             },
+             {
+                 "method_name": "metric_stddev_shift",
+                 "metric_name": "app_open_to_order_placed",
+                 "date": "2026-03-25",
+                 "stddev_factor": 2.0,
+                 "direction": "up",
+             },
+         ],
+     )
+
+     applied = {item.date: item for item in observation.applied_synthetic_generators}
+     assert "2026-03-20" in applied
+     assert "2026-03-25" in applied
+     assert applied["2026-03-20"].metric_name == "orders_placed"
+     assert applied["2026-03-20"].stddev_factor == 2.5
+     assert applied["2026-03-20"].threshold_value == round(
+         applied["2026-03-20"].std_dev_from_median * 2.5,
+         4,
+     )
+     assert applied["2026-03-25"].metric_type == "conversion_rate"
+
+
+ def test_analysis_methods_run_through_step_api():
+     env = MetricTrackerRlEnvironment()
+     env.reset()
+     analyzed = env.step(
+         MetricTrackerRlAction(
+             analysis_method="list_suspicious_dates",
+             analysis_args={"limit": 3},
+         )
+     )
+
+     assert analyzed.analysis_result is not None
+     assert analyzed.analysis_result["method"] == "list_suspicious_dates"
+     assert len(analyzed.analysis_result["result"]["dates"]) == 3
+
+
+ def test_payload_generator_method_creates_rows():
+     toolkit, _ = _toolkit_for(seed=5, scenario_family="mixed")
+     result = toolkit.get_median_filter_rows("app_open_to_order_placed", 2.0)
+     assert result["details"][0]["threshold"] >= 0
+     assert isinstance(result["generated_rows"], list)
+
+
+ def test_payload_generator_method_without_metric_runs_all_metrics():
+     toolkit, _ = _toolkit_for(seed=5, scenario_family="mixed")
+     result = toolkit.get_median_filter_rows_multi(metric_name=None, metric_names=[], threshold_multiplier=2.0)
+     assert "app_opens" in result["metric_names"]
+     assert "app_open_to_order_placed" in result["metric_names"]
+     assert isinstance(result["generated_rows"], list)
+
+
+ def test_family_specific_generator_methods_create_matching_anomaly_types():
+     cases = [
+         ("rate_drop_from_median", "get_rate_drop_from_median_rows", 1.5),
+         ("rate_spike_from_median", "get_rate_spike_from_median_rows", 1.5),
+         ("absolute_drop_in_event_count", "get_absolute_drop_in_event_count_rows", 1.5),
+         ("absolute_spike_in_event_count", "get_absolute_spike_in_event_count_rows", 1.5),
+         ("funnel_break", "get_funnel_break_rows", 1.0),
+         ("hourly_traffic_mix_shift", "get_hourly_traffic_mix_shift_rows", 1.0),
+         ("instrumentation_data_quality_issue", "get_instrumentation_data_quality_issue_rows", 1.0),
+     ]
+
+     for family, method_name, threshold_multiplier in cases:
+         toolkit, _ = _toolkit_for(seed=7, scenario_family=family)
+         method = getattr(toolkit, method_name)
+         if "rate_" in method_name or "event_count" in method_name:
+             result = method(metric_name=None, metric_names=[], threshold_multiplier=threshold_multiplier)
+         else:
+             result = method(threshold_multiplier=threshold_multiplier)
+         assert result["generated_rows"], method_name
+         assert {row["anomaly_type"] for row in result["generated_rows"]} == {family}
+
+
+ def test_metric_summary_methods_without_metric_run_all_metrics():
+     toolkit, _ = _toolkit_for(seed=5, scenario_family="mixed")
+     medians = toolkit.get_metric_median_multi(metric_name=None, metric_names=[])
+     stds = toolkit.get_metric_std_dev_from_median_multi(metric_name=None, metric_names=[])
+     diffs = toolkit.get_rows_with_abs_diff_from_median_gt_multi(
+         metric_name=None,
+         metric_names=[],
+         threshold=1.0,
+     )
+     assert "app_opens" in medians["metric_names"]
+     assert "app_open_to_order_placed" in stds["metric_names"]
+     assert len(medians["results"]) == len(medians["metric_names"])
+     assert len(stds["results"]) == len(stds["metric_names"])
+     assert len(diffs["results"]) == len(diffs["metric_names"])
+
+
+ def test_generator_submission_path_runs():
+     env = MetricTrackerRlEnvironment()
+     env.reset()
+     result = env.step(
+         MetricTrackerRlAction(
+             payload_generators=[
+                 PayloadGeneratorMethod(
+                     method_name="get_median_filter_rows",
+                     metric_name="app_open_to_order_placed",
+                     threshold_multiplier=2.0,
+                 )
+             ]
+         )
+     )
+     assert result.generated_rows is not None
+     assert result.status in {"evaluated", "in_progress", "completed"}
+
+
+ def test_generator_submission_path_supports_family_specific_methods():
+     env = MetricTrackerRlEnvironment()
+     env.reset(task_id="hard_mixed_multi", scenario_family="funnel_break")
+     result = env.step(
+         MetricTrackerRlAction(
+             payload_generators=[
+                 PayloadGeneratorMethod(
+                     method_name="get_funnel_break_rows",
+                     threshold_multiplier=1.0,
+                 )
+             ]
+         )
+     )
+     assert result.analysis_result is not None
+     assert result.analysis_result["result"]["generated_rows"] is not None
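
The suite assumes the repository is importable as `metric_tracker_rl` (for example via an editable install). A programmatic way to run it, equivalent to invoking pytest on this file:

```python
# Programmatic equivalent of `pytest -q tests/test_metric_tracker_rl.py`.
import sys

import pytest

sys.exit(pytest.main(["-q", "tests/test_metric_tracker_rl.py"]))
```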
uv.lock ADDED
The diff for this file is too large to render. See raw diff