kushalExplores committed on
Commit
eb340e3
·
verified ·
1 Parent(s): 9ff23e0

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +101 -2
  2. inference.py +5 -5
  3. server/gradio_ui.py +171 -15
  4. tasks.py +5 -5
README.md CHANGED
@@ -84,19 +84,118 @@ Humans in the Gradio debugger and agents in `inference.py` use the same read-onl
84
 
85
  This keeps the benchmark focused on investigation quality rather than privileged access.
86
 
87
  ## Tasks And Expected Difficulty
88
 
89
  The benchmark ships with three named deterministic tasks:
90
 
91
  1. `easy_single_spike`
92
  Expected difficulty: easy.
93
- One obvious event-count spike is present. A careful single-method investigation should usually be enough.
94
  2. `medium_mixed_pair`
95
  Expected difficulty: medium.
96
  Three anomalies are present across mixed count and rate signals. Precision matters because over-submission is penalized.
97
  3. `hard_mixed_multi`
98
  Expected difficulty: hard.
99
- Five anomalies are present with higher density and weaker signal separation. Agents need broader exploration and tighter filtering.
100
 
101
  Supported anomaly families across resets:
102
 
 
84
 
85
  This keeps the benchmark focused on investigation quality rather than privileged access.
86
 
87
+ ## How The Agent Should Choose Methods
88
+
89
+ The intended agent behavior is not "call every tool and submit everything." The benchmark rewards selecting the narrowest useful method for the anomaly family that the evidence supports.
90
+
91
+ Practical method-selection rules:
92
+
93
+ - start with `task_overview` to confirm the task shape, expected payload format, and visible config
94
+ - use broad discovery methods first when the anomaly family is unclear:
95
+ - `get_median_filter_rows`
96
+ - `list_suspicious_dates`
97
+ - `rows_for_date`
98
+ - use targeted confirmation methods once a candidate anomaly is visible:
99
+ - `compare_count_to_median` for event-count spikes or drops
100
+ - `compare_rate_to_median` for conversion-rate shifts
101
+ - `detect_funnel_break` for step-level funnel problems
102
+ - `check_impossible_counts` for instrumentation or impossible-value issues
103
+ - use family-specific generator methods when the anomaly family is already clear:
104
+ - `get_absolute_spike_in_event_count_rows`
105
+ - `get_absolute_drop_in_event_count_rows`
106
+ - `get_rate_spike_from_median_rows`
107
+ - `get_rate_drop_from_median_rows`
108
+ - `get_funnel_break_rows`
109
+ - `get_hourly_traffic_mix_shift_rows`
110
+ - `get_instrumentation_data_quality_issue_rows`
111
+ - prefer fewer high-confidence rows over broad over-submission because extra rows are penalized
112
+ - use `preview_submission` before final submission when manually building rows
113
+
114
+ In practice, a strong agent usually follows this pattern:
115
+
116
+ 1. Identify which metric family is likely broken.
117
+ 2. Confirm the exact date and entity with a comparison tool.
118
+ 3. Generate the smallest plausible payload.
119
+ 4. Submit only when the evidence is specific enough to justify the row.
120
+
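The selection rules above can be sketched as a small lookup. This is an illustrative sketch only: the confirmation method names come from the list above, but the family keys and the discovery fallback are assumptions, not part of the benchmark API.

```python
# Hedged sketch: map a suspected anomaly family to the narrowest
# confirmation method named above. The family keys and the fallback
# choice are illustrative assumptions, not benchmark identifiers.
CONFIRMATION_METHOD_BY_FAMILY = {
    "event_count_spike": "compare_count_to_median",
    "event_count_drop": "compare_count_to_median",
    "rate_shift": "compare_rate_to_median",
    "funnel_break": "detect_funnel_break",
    "data_quality": "check_impossible_counts",
}


def choose_confirmation_method(family: str) -> str:
    """Pick the targeted method for a known family; fall back to a
    broad discovery method when the family is still unclear."""
    return CONFIRMATION_METHOD_BY_FAMILY.get(family, "list_suspicious_dates")
```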
121
+ ## How Payload Generation Works In The Server
122
+
123
+ The server supports two final submission paths:
124
+
125
+ - direct row submission with `classifications`
126
+ - declarative server-side generation with `payload_generators`
127
+
128
+ The payload-generator path is usually simpler and more stable because the model chooses methods and thresholds, and the server constructs the final anomaly rows.
129
+
130
+ Simple flow:
131
+
132
+ ```text
133
+ LLM
134
+ -> choose analysis method from available_methods
135
+ -> inspect evidence from analysis_result
136
+ -> choose one or more payload generator methods
137
+ -> submit payload_generators
138
+
139
+ Server
140
+ -> run payload_generator inside the environment
141
+ -> create normalized submission rows
142
+ -> grade submitted_rows against hidden expected_rows
143
+ -> return reward_breakdown, submission_issues, generated_rows
144
+ ```
145
+
146
+ At the server level, the path is:
147
+
148
+ ```text
149
+ MetricTrackerRlAction(payload_generators=[...])
150
+ -> environment step
151
+ -> _run_analysis("payload_generator", ...)
152
+ -> generated_rows
153
+ -> grade_submission(submitted_rows, expected_rows)
154
+ -> observation.reward_breakdown + observation.submission_issues
155
+ ```
156
+
157
+ This means the LLM is responsible for choosing the right generator method, but the server is responsible for turning that declarative request into actual payload rows and grading them.
158
+
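The grading step at the end of this path can be approximated as set matching with an over-submission penalty. A minimal sketch, assuming rows are keyed by `(date, entity_name)` and an illustrative 0.5 penalty weight; the real `grade_submission` signature and weights live inside the server:

```python
def grade_submission_sketch(submitted: list[dict], expected: list[dict]) -> dict:
    """Minimal sketch of the grading idea described above: reward
    matched rows, penalize extras. The row key and the 0.5 penalty
    weight are assumptions for illustration."""
    expected_keys = {(row["date"], row["entity_name"]) for row in expected}
    submitted_keys = {(row["date"], row["entity_name"]) for row in submitted}
    matched = expected_keys & submitted_keys
    extra = submitted_keys - expected_keys
    missing = expected_keys - submitted_keys
    return {
        "matched": len(matched),
        "extra": len(extra),
        "missing": len(missing),
        "reward": len(matched) - 0.5 * len(extra),
    }
```

This captures why over-submission hurts: each extra row subtracts from the reward that matched rows earn.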
159
+ ## Example Decision Path
160
+
161
+ Suppose the agent suspects a conversion-rate drop but does not yet know which metric is responsible.
162
+
163
+ ```text
164
+ 1. task_overview()
165
+ 2. get_median_filter_rows(metric_names=["app_open_to_order_placed", "app_open_to_payment_successful"], threshold_multiplier=2.0)
166
+ 3. compare_rate_to_median(date="2026-03-19", entity_name="app_open_to_payment_successful")
167
+ 4. payload_generator(generator_methods=[
168
+ {
169
+ "method_name": "get_rate_drop_from_median_rows",
170
+ "metric_name": "app_open_to_payment_successful",
171
+ "threshold_multiplier": 2.0
172
+ }
173
+ ])
174
+ 5. submit_payload_generator(...)
175
+ ```
176
+
177
+ What happens conceptually:
178
+
179
+ - the first broad method narrows the search space
180
+ - the comparison method confirms the metric/date pair with a baseline and observed value
181
+ - the generator submission asks the server to build the final row in the benchmark's required schema
182
+ - the grader scores the generated payload against the hidden expected anomalies
183
+
184
+ If the feedback reports extra rows or missing rows, the agent should refine the generator choice or threshold rather than blindly adding more methods.
185
+
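That refinement rule can be expressed as a tiny adjustment function. A sketch under stated assumptions: the step sizes and the 1.0 floor are illustrative, and the feedback keys mirror the extra/missing counts reported above rather than an exact server schema.

```python
def refine_threshold(threshold: float, feedback: dict) -> float:
    """Sketch of the refinement rule above: extra rows mean the filter
    is too loose, so raise the multiplier; missing rows mean it is too
    strict, so lower it. Step sizes and the 1.0 floor are assumptions."""
    if feedback.get("extra", 0) > 0:
        return threshold * 1.25
    if feedback.get("missing", 0) > 0:
        return max(threshold * 0.8, 1.0)
    return threshold
```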
186
  ## Tasks And Expected Difficulty
187
 
188
  The benchmark ships with three named deterministic tasks:
189
 
190
  1. `easy_single_spike`
191
  Expected difficulty: easy.
192
+ Two rate-spike anomalies are present. A careful targeted investigation should usually be enough.
193
  2. `medium_mixed_pair`
194
  Expected difficulty: medium.
195
  Three anomalies are present across mixed count and rate signals. Precision matters because over-submission is penalized.
196
  3. `hard_mixed_multi`
197
  Expected difficulty: hard.
198
+ Four anomalies are present with higher density and weaker signal separation. Agents need broader exploration and tighter filtering.
199
 
200
  Supported anomaly families across resets:
201
 
inference.py CHANGED
@@ -52,9 +52,9 @@ SYSTEM_PROMPT = textwrap.dedent(
52
  - Do not request full hidden answers or assume direct access to ground truth.
53
  - Prefer declarative payload generators over manual row construction.
54
  - Start from the default reset observation only.
55
- - Start by trying `get_median_filter_rows` across different metrics to learn which metrics produce useful anomaly rows.
56
  - Compare candidate metrics, then refine with raw-data inspection and median/std methods only when needed.
57
- - Prefer: task_overview -> get_median_filter_rows on several metrics -> compare useful results -> payload_generator -> submit_payload_generator.
58
  - Keep notes brief and factual.
59
  """
60
  ).strip()
@@ -245,7 +245,7 @@ def build_initial_user_prompt(observation: MetricTrackerRlObservation) -> str:
245
  {json.dumps(observation.model_dump(exclude={"debug"}), indent=2)}
246
 
247
  Prefer building a payload generator first, then submit it.
248
- Start by calling `get_median_filter_rows` on several different metrics and see which ones return useful anomaly rows.
249
  If a metric returns nothing or low-signal rows, try another metric.
250
  For funnel, hourly mix, or data-quality tasks, use the family-specific generator methods instead.
251
 
@@ -254,9 +254,9 @@ def build_initial_user_prompt(observation: MetricTrackerRlObservation) -> str:
254
  `baseline_value`, `observed_value`, `delta_value`, `severity`.
255
 
256
  Supported generator method example:
257
- `{{"method_name":"get_median_filter_rows","threshold_multiplier":2.0}}`
258
  or
259
- `{{"method_name":"get_median_filter_rows","metric_names":["app_open_to_order_placed","orders_placed"],"threshold_multiplier":2.0}}`
260
 
261
  Use shared analysis methods only. Prefer `submit_payload_generator` over `submit_solution`.
262
  """
 
52
  - Do not request full hidden answers or assume direct access to ground truth.
53
  - Prefer declarative payload generators over manual row construction.
54
  - Start from the default reset observation only.
55
+ - Start by trying `get_rate_spike_from_median_rows` across different conversion metrics to learn which metrics produce useful anomaly rows.
56
  - Compare candidate metrics, then refine with raw-data inspection and median/std methods only when needed.
57
+ - Prefer: task_overview -> get_rate_spike_from_median_rows on several conversion metrics -> compare useful results -> payload_generator -> submit_payload_generator.
58
  - Keep notes brief and factual.
59
  """
60
  ).strip()
 
245
  {json.dumps(observation.model_dump(exclude={"debug"}), indent=2)}
246
 
247
  Prefer building a payload generator first, then submit it.
248
+ Start by calling `get_rate_spike_from_median_rows` on several conversion metrics and see which ones return useful anomaly rows.
249
  If a metric returns nothing or low-signal rows, try another metric.
250
  For funnel, hourly mix, or data-quality tasks, use the family-specific generator methods instead.
251
 
 
254
  `baseline_value`, `observed_value`, `delta_value`, `severity`.
255
 
256
  Supported generator method example:
257
+ `{{"method_name":"get_rate_spike_from_median_rows","threshold_multiplier":2.0}}`
258
  or
259
+ `{{"method_name":"get_rate_spike_from_median_rows","metric_names":["app_open_to_order_placed","app_open_to_payment_successful"],"threshold_multiplier":2.0}}`
260
 
261
  Use shared analysis methods only. Prefer `submit_payload_generator` over `submit_solution`.
262
  """
server/gradio_ui.py CHANGED
@@ -3,6 +3,8 @@
3
  from __future__ import annotations
4
 
5
  import json
6
 
7
  import pandas as pd
8
 
@@ -48,6 +50,26 @@ METRIC_CHOICES = [
48
  "app_open_to_order_placed",
49
  "app_open_to_payment_successful",
50
  ]
51
 
52
 
53
  def build_metric_tracker_gradio_app(
@@ -75,7 +97,7 @@ def build_metric_tracker_gradio_app(
75
  Standard mode exposes method calls only. You inspect data through methods like
76
  `show_raw_data`, `get_metric_median`, `get_metric_std_dev_from_median`,
77
  and `get_rows_with_abs_diff_from_median_gt`, then assemble payload generators
78
- such as `get_median_filter_rows(metric_name, threshold_multiplier)`.
79
  """
80
  )
81
 
@@ -137,19 +159,19 @@ def build_metric_tracker_gradio_app(
137
 
138
  gr.Markdown("## Methods")
139
  gr.Markdown(
140
- "Run a method after reset to fetch exactly the daily aggregate data, median, "
141
- "std-from-median, filtered rows, or generated payload rows you want."
142
  )
143
  with gr.Row():
144
  method_name = gr.Dropdown(
145
  label="Method",
146
  choices=METHOD_CHOICES,
147
- value="show_raw_data",
148
  )
149
  method_metric = gr.Dropdown(
150
  label="metrics",
151
  choices=METRIC_CHOICES,
152
- value=[],
153
  multiselect=True,
154
  )
155
  method_threshold = gr.Number(label="threshold / multiplier", value=2.0)
@@ -164,12 +186,26 @@ def build_metric_tracker_gradio_app(
164
  value="[]",
165
  interactive=True,
166
  )
167
  analysis_result = gr.JSON(label="Last Method Results")
168
 
169
  with gr.Tab("Method Data"):
170
  gr.Markdown(
171
- "This panel shows only method-returned data. Use `show_raw_data` for daily "
172
- "aggregate rows, then median/std/filter methods to inspect candidate anomalies."
173
  )
174
  method_rows = gr.Dataframe(label="Method Rows", interactive=False)
175
 
@@ -185,11 +221,11 @@ def build_metric_tracker_gradio_app(
185
  )
186
  payload_generator_methods = gr.JSON(label="Methods Passed to Payload Generator")
187
  with gr.Row():
188
- generator_method_name = gr.Dropdown(label="method_name", choices=GENERATOR_METHODS, value="get_median_filter_rows")
189
  generator_metric_name = gr.Dropdown(
190
  label="metrics",
191
  choices=METRIC_CHOICES,
192
- value=[],
193
  multiselect=True,
194
  )
195
  generator_multiplier = gr.Number(label="threshold_multiplier", value=2.0)
@@ -207,7 +243,6 @@ def build_metric_tracker_gradio_app(
207
  available_methods = gr.JSON(label="Shared Methods")
208
  submission_feedback = gr.JSON(label="Submission Feedback")
209
  reward_breakdown = gr.JSON(label="Reward Breakdown")
210
- generated_rows = gr.Dataframe(label="Generated Payload Rows", interactive=False)
211
  raw_json = gr.Code(label="Latest Environment Response", language="json", interactive=False)
212
  debug_snapshot = gr.JSON(label="Debug Snapshot")
213
 
@@ -222,7 +257,17 @@ def build_metric_tracker_gradio_app(
222
  task.to_model().model_dump(),
223
  )
224
 
225
- async def reset_episode(selected_task_id, seed_value, family, level, density, anomaly_count_value, reset_anomalies_json, debug_enabled):
226
  try:
227
  parsed_anomalies = json.loads(reset_anomalies_json or "[]")
228
  if not isinstance(parsed_anomalies, list):
@@ -240,6 +285,7 @@ def build_metric_tracker_gradio_app(
240
  {},
241
  {},
242
  {},
 
243
  _generator_frame([]),
244
  _generator_frame([]),
245
  "",
@@ -262,7 +308,7 @@ def build_metric_tracker_gradio_app(
262
  method_data = await web_manager.step_environment(
263
  {
264
  "analysis_method": "show_raw_data",
265
- "analysis_args": {"limit": 5},
266
  "classifications": [],
267
  "payload_generators": [],
268
  }
@@ -270,8 +316,11 @@ def build_metric_tracker_gradio_app(
270
  state = _state_from_response(data)
271
  state["latest_response"] = method_data
272
  state["last_method_result"] = method_data.get("observation", {}).get("analysis_result")
273
  obs = data.get("observation", {})
274
  method_result = state["last_method_result"] or {}
 
275
  available_tasks = obs.get("available_tasks") or list(TASK_SUMMARIES.values())
276
  active_task_payload = next(
277
  (item for item in available_tasks if item.get("task_id") == obs.get("task_id")),
@@ -300,6 +349,7 @@ def build_metric_tracker_gradio_app(
300
  method_result,
301
  obs.get("submission_issues") or [],
302
  obs.get("reward_breakdown") or {},
 
303
  _method_frame(method_result),
304
  pd.DataFrame(),
305
  json.dumps(method_data, indent=2),
@@ -317,9 +367,10 @@ def build_metric_tracker_gradio_app(
317
  method_rows_value: str,
318
  threshold: float,
319
  limit_value: int,
 
320
  ):
321
  if not payload.get("active"):
322
- return payload, {"error": "Reset the environment first."}, "", gr.skip(), gr.skip(), gr.skip()
323
  args = _method_args(
324
  selected_method,
325
  metric_names,
@@ -340,13 +391,20 @@ def build_metric_tracker_gradio_app(
340
  )
341
  payload["latest_response"] = data
342
  payload["last_method_result"] = data.get("observation", {}).get("analysis_result")
343
  method_result = payload["last_method_result"] or {}
344
  generated = method_result.get("result", {}).get("generated_rows", [])
345
  method_frame = _method_frame(method_result)
 
346
  return (
347
  payload,
348
  method_result,
349
  data.get("observation", {}).get("message", ""),
 
350
  method_frame,
351
  pd.DataFrame(generated),
352
  json.dumps(data, indent=2),
@@ -474,9 +532,16 @@ def build_metric_tracker_gradio_app(
474
  def get_state_sync():
475
  return json.dumps(web_manager.get_state(), indent=2)
476
 
  reset_btn.click(
478
  fn=reset_episode,
479
- inputs=[task_id, seed, scenario_family, difficulty, anomaly_density, anomaly_count, reset_anomalies, debug_mode],
480
  outputs=[
481
  session_state,
482
  status,
@@ -489,6 +554,7 @@ def build_metric_tracker_gradio_app(
489
  analysis_result,
490
  submission_feedback,
491
  reward_breakdown,
 
492
  method_rows,
493
  generated_rows,
494
  raw_json,
@@ -513,8 +579,14 @@ def build_metric_tracker_gradio_app(
513
  method_rows_json,
514
  method_threshold,
515
  method_limit,
 
516
  ],
517
- outputs=[session_state, analysis_result, status, method_rows, generated_rows, raw_json],
518
  )
519
  add_generator_btn.click(
520
  fn=add_or_update_generator,
@@ -649,6 +721,8 @@ def _state_from_response(data: dict) -> dict:
649
  "payload_generators": [],
650
  "last_method_result": data.get("observation", {}).get("analysis_result"),
651
  "latest_response": data,
652
  }
653
 
654
 
@@ -725,4 +799,86 @@ def _empty_state() -> dict:
725
  "payload_generators": [],
726
  "last_method_result": None,
727
  "latest_response": None,
728
  }
3
  from __future__ import annotations
4
 
5
  import json
6
+ import math
7
+ from statistics import median
8
 
9
  import pandas as pd
10
 
 
50
  "app_open_to_order_placed",
51
  "app_open_to_payment_successful",
52
  ]
53
+ CONVERSION_METRICS = [
54
+ "app_open_to_menu_open",
55
+ "menu_open_to_product_added_to_cart",
56
+ "product_added_to_cart_to_order_placed",
57
+ "order_placed_to_payment_successful",
58
+ "app_open_to_order_placed",
59
+ "app_open_to_payment_successful",
60
+ ]
61
+ THRESHOLD_METHODS = {
62
+ "get_rows_with_abs_diff_from_median_gt",
63
+ "get_median_filter_rows",
64
+ "get_rate_drop_from_median_rows",
65
+ "get_rate_spike_from_median_rows",
66
+ "get_absolute_drop_in_event_count_rows",
67
+ "get_absolute_spike_in_event_count_rows",
68
+ }
69
+ MEDIAN_METHODS = THRESHOLD_METHODS | {
70
+ "get_metric_median",
71
+ "get_metric_std_dev_from_median",
72
+ }
73
 
74
 
75
  def build_metric_tracker_gradio_app(
 
97
  Standard mode exposes method calls only. You inspect data through methods like
98
  `show_raw_data`, `get_metric_median`, `get_metric_std_dev_from_median`,
99
  and `get_rows_with_abs_diff_from_median_gt`, then assemble payload generators
100
+ such as `get_rate_spike_from_median_rows(metric_name, threshold_multiplier)`.
101
  """
102
  )
103
 
 
159
 
160
  gr.Markdown("## Methods")
161
  gr.Markdown(
162
+ "Run a method after reset to inspect daily aggregate data and then start from "
163
+ "rate-spike detection on conversion metrics before broadening the search."
164
  )
165
  with gr.Row():
166
  method_name = gr.Dropdown(
167
  label="Method",
168
  choices=METHOD_CHOICES,
169
+ value="get_rate_spike_from_median_rows",
170
  )
171
  method_metric = gr.Dropdown(
172
  label="metrics",
173
  choices=METRIC_CHOICES,
174
+ value=CONVERSION_METRICS,
175
  multiselect=True,
176
  )
177
  method_threshold = gr.Number(label="threshold / multiplier", value=2.0)
 
186
  value="[]",
187
  interactive=True,
188
  )
189
+ plot_metrics = gr.Dropdown(
190
+ label="Plot Metrics",
191
+ choices=METRIC_CHOICES,
192
+ value=CONVERSION_METRICS,
193
+ multiselect=True,
194
+ )
195
+ metric_plot = gr.LinePlot(
196
+ label="Metric Plot",
197
+ x="date",
198
+ y="value",
199
+ color="series",
200
+ tooltip=["date", "series", "value"],
201
+ )
202
+ generated_rows = gr.Dataframe(label="Payload Rows For Current Method", interactive=False)
203
  analysis_result = gr.JSON(label="Last Method Results")
204
 
205
  with gr.Tab("Method Data"):
206
  gr.Markdown(
207
+ "This panel shows only method-returned data. The chart already loads all daily "
208
+ "rows on reset, so use this table to inspect the exact rows returned by the current method."
209
  )
210
  method_rows = gr.Dataframe(label="Method Rows", interactive=False)
211
 
 
221
  )
222
  payload_generator_methods = gr.JSON(label="Methods Passed to Payload Generator")
223
  with gr.Row():
224
+ generator_method_name = gr.Dropdown(label="method_name", choices=GENERATOR_METHODS, value="get_rate_spike_from_median_rows")
225
  generator_metric_name = gr.Dropdown(
226
  label="metrics",
227
  choices=METRIC_CHOICES,
228
+ value=CONVERSION_METRICS,
229
  multiselect=True,
230
  )
231
  generator_multiplier = gr.Number(label="threshold_multiplier", value=2.0)
 
243
  available_methods = gr.JSON(label="Shared Methods")
244
  submission_feedback = gr.JSON(label="Submission Feedback")
245
  reward_breakdown = gr.JSON(label="Reward Breakdown")
 
246
  raw_json = gr.Code(label="Latest Environment Response", language="json", interactive=False)
247
  debug_snapshot = gr.JSON(label="Debug Snapshot")
248
 
 
257
  task.to_model().model_dump(),
258
  )
259
 
260
+ async def reset_episode(
261
+ selected_task_id,
262
+ seed_value,
263
+ family,
264
+ level,
265
+ density,
266
+ anomaly_count_value,
267
+ reset_anomalies_json,
268
+ debug_enabled,
269
+ selected_plot_metrics,
270
+ ):
271
  try:
272
  parsed_anomalies = json.loads(reset_anomalies_json or "[]")
273
  if not isinstance(parsed_anomalies, list):
 
285
  {},
286
  {},
287
  {},
288
+ _plot_frame([], selected_plot_metrics, None),
289
  _generator_frame([]),
290
  _generator_frame([]),
291
  "",
 
308
  method_data = await web_manager.step_environment(
309
  {
310
  "analysis_method": "show_raw_data",
311
+ "analysis_args": {"limit": 10000},
312
  "classifications": [],
313
  "payload_generators": [],
314
  }
 
316
  state = _state_from_response(data)
317
  state["latest_response"] = method_data
318
  state["last_method_result"] = method_data.get("observation", {}).get("analysis_result")
319
+ state["raw_rows"] = ((state["last_method_result"] or {}).get("result") or {}).get("rows", [])
320
+ state["last_plot_context"] = None
321
  obs = data.get("observation", {})
322
  method_result = state["last_method_result"] or {}
323
+ plot_frame = _plot_frame(state["raw_rows"], selected_plot_metrics, state["last_plot_context"])
324
  available_tasks = obs.get("available_tasks") or list(TASK_SUMMARIES.values())
325
  active_task_payload = next(
326
  (item for item in available_tasks if item.get("task_id") == obs.get("task_id")),
 
349
  method_result,
350
  obs.get("submission_issues") or [],
351
  obs.get("reward_breakdown") or {},
352
+ plot_frame,
353
  _method_frame(method_result),
354
  pd.DataFrame(),
355
  json.dumps(method_data, indent=2),
 
367
  method_rows_value: str,
368
  threshold: float,
369
  limit_value: int,
370
+ selected_plot_metrics: list[str],
371
  ):
372
  if not payload.get("active"):
373
+ return payload, {"error": "Reset the environment first."}, "", gr.skip(), gr.skip(), gr.skip(), gr.skip()
374
  args = _method_args(
375
  selected_method,
376
  metric_names,
 
391
  )
392
  payload["latest_response"] = data
393
  payload["last_method_result"] = data.get("observation", {}).get("analysis_result")
394
+ payload["last_plot_context"] = {
395
+ "method_name": selected_method,
396
+ "metric_names": [item for item in (metric_names or []) if item],
397
+ "threshold": float(threshold),
398
+ }
399
  method_result = payload["last_method_result"] or {}
400
  generated = method_result.get("result", {}).get("generated_rows", [])
401
  method_frame = _method_frame(method_result)
402
+ plot_frame = _plot_frame(payload.get("raw_rows", []), selected_plot_metrics, payload["last_plot_context"])
403
  return (
404
  payload,
405
  method_result,
406
  data.get("observation", {}).get("message", ""),
407
+ plot_frame,
408
  method_frame,
409
  pd.DataFrame(generated),
410
  json.dumps(data, indent=2),
 
532
  def get_state_sync():
533
  return json.dumps(web_manager.get_state(), indent=2)
534
 
535
+ def update_plot(payload: dict, selected_plot_metrics: list[str]):
536
+ return _plot_frame(
537
+ payload.get("raw_rows", []),
538
+ selected_plot_metrics,
539
+ payload.get("last_plot_context"),
540
+ )
541
+
542
  reset_btn.click(
543
  fn=reset_episode,
544
+ inputs=[task_id, seed, scenario_family, difficulty, anomaly_density, anomaly_count, reset_anomalies, debug_mode, plot_metrics],
545
  outputs=[
546
  session_state,
547
  status,
 
554
  analysis_result,
555
  submission_feedback,
556
  reward_breakdown,
557
+ metric_plot,
558
  method_rows,
559
  generated_rows,
560
  raw_json,
 
579
  method_rows_json,
580
  method_threshold,
581
  method_limit,
582
+ plot_metrics,
583
  ],
584
+ outputs=[session_state, analysis_result, status, metric_plot, method_rows, generated_rows, raw_json],
585
+ )
586
+ plot_metrics.change(
587
+ fn=update_plot,
588
+ inputs=[session_state, plot_metrics],
589
+ outputs=[metric_plot],
590
  )
591
  add_generator_btn.click(
592
  fn=add_or_update_generator,
 
721
  "payload_generators": [],
722
  "last_method_result": data.get("observation", {}).get("analysis_result"),
723
  "latest_response": data,
724
+ "raw_rows": [],
725
+ "last_plot_context": None,
726
  }
727
 
728
 
 
799
  "payload_generators": [],
800
  "last_method_result": None,
801
  "latest_response": None,
802
+ "raw_rows": [],
803
+ "last_plot_context": None,
804
  }
805
+
806
+
807
+ def _plot_frame(raw_rows: list[dict], selected_metrics: list[str], plot_context: dict | None) -> pd.DataFrame:
808
+ if not raw_rows:
809
+ return pd.DataFrame(columns=["date", "value", "series"])
810
+ frame = pd.DataFrame(raw_rows)
811
+ if "date" not in frame.columns:
812
+ return pd.DataFrame(columns=["date", "value", "series"])
813
+ metrics = [item for item in (selected_metrics or []) if item in frame.columns]
814
+ if not metrics:
815
+ return pd.DataFrame(columns=["date", "value", "series"])
816
+
817
+ rows: list[dict] = []
818
+ for metric_name in metrics:
819
+ values = pd.to_numeric(frame[metric_name], errors="coerce")
820
+ for date_value, numeric_value in zip(frame["date"], values):
821
+ if pd.isna(numeric_value):
822
+ continue
823
+ rows.append(
824
+ {
825
+ "date": date_value,
826
+ "value": float(numeric_value),
827
+ "series": metric_name,
828
+ }
829
+ )
830
+ rows.extend(_overlay_rows(frame, metric_name, plot_context))
831
+ return pd.DataFrame(rows, columns=["date", "value", "series"])
832
+
833
+
834
+ def _overlay_rows(frame: pd.DataFrame, metric_name: str, plot_context: dict | None) -> list[dict]:
835
+ if not plot_context:
836
+ return []
837
+ selected_metrics = [item for item in (plot_context.get("metric_names") or []) if item]
838
+ method_name = plot_context.get("method_name")
839
+ threshold = float(plot_context.get("threshold", 0.0))
840
+ if metric_name not in selected_metrics or method_name not in MEDIAN_METHODS:
841
+ return []
842
+
843
+ values = [float(item) for item in pd.to_numeric(frame[metric_name], errors="coerce").dropna().tolist()]
844
+ if not values:
845
+ return []
846
+ dates = frame["date"].tolist()
847
+ metric_median = float(median(values))
848
+ rows = [
849
+ {"date": date_value, "value": metric_median, "series": f"{metric_name} median"}
850
+ for date_value in dates
851
+ ]
852
+ threshold_value = None
853
+ if method_name == "get_metric_std_dev_from_median":
854
+ threshold_value = _std_from_median(values)
855
+ elif method_name == "get_rows_with_abs_diff_from_median_gt":
856
+ threshold_value = threshold
857
+ elif method_name in THRESHOLD_METHODS:
858
+ threshold_value = _std_from_median(values) * threshold
859
+
860
+ if threshold_value is None:
861
+ return rows
862
+ upper = metric_median + threshold_value
863
+ lower = metric_median - threshold_value
864
+ suffix = (
865
+ f"{threshold:.2f}*std"
866
+ if method_name in THRESHOLD_METHODS and method_name != "get_rows_with_abs_diff_from_median_gt"
867
+ else "threshold"
868
+ )
869
+ rows.extend(
870
+ {"date": date_value, "value": upper, "series": f"{metric_name} + {suffix}"}
871
+ for date_value in dates
872
+ )
873
+ rows.extend(
874
+ {"date": date_value, "value": lower, "series": f"{metric_name} - {suffix}"}
875
+ for date_value in dates
876
+ )
877
+ return rows
878
+
879
+
880
+ def _std_from_median(values: list[float]) -> float:
881
+ if not values:
882
+ return 0.0
883
+ metric_median = median(values)
884
+ return math.sqrt(sum((value - metric_median) ** 2 for value in values) / len(values))
tasks.py CHANGED
@@ -77,16 +77,16 @@ TASKS: dict[str, TaskSpec] = {
77
  task_id="easy_single_spike",
78
  difficulty="easy",
79
  instruction=(
80
- "Investigate the seeded funnel dataset and submit the single anomalous row. "
81
  "Use the shared analysis methods before submitting."
82
  ),
83
  objective=(
84
- "Find the one obvious anomaly and submit exactly one correctly populated anomaly row."
85
  ),
86
  seed=11,
87
- scenario_family="absolute_spike_in_event_count",
88
  anomaly_density="low",
89
- anomaly_count=1,
90
  ),
91
  "medium_mixed_pair": TaskSpec(
92
  task_id="medium_mixed_pair",
@@ -116,7 +116,7 @@ TASKS: dict[str, TaskSpec] = {
116
  seed=37,
117
  scenario_family="mixed",
118
  anomaly_density="high",
119
- anomaly_count=5,
120
  ),
121
  }
122
 
 
77
  task_id="easy_single_spike",
78
  difficulty="easy",
79
  instruction=(
80
+ "Investigate the seeded funnel dataset and submit every anomalous row. "
81
  "Use the shared analysis methods before submitting."
82
  ),
83
  objective=(
84
+ "Find all anomalies and submit every correctly populated anomaly row."
85
  ),
86
  seed=11,
87
+ scenario_family="rate_spike_from_median",
88
  anomaly_density="low",
89
+ anomaly_count=2,
90
  ),
91
  "medium_mixed_pair": TaskSpec(
92
  task_id="medium_mixed_pair",
 
116
  seed=37,
117
  scenario_family="mixed",
118
  anomaly_density="high",
119
+ anomaly_count=4,
120
  ),
121
  }
122