Spaces (status: Sleeping)
Commit: Upload folder using huggingface_hub

Files changed:
- README.md +101 -2
- inference.py +5 -5
- server/gradio_ui.py +171 -15
- tasks.py +5 -5
README.md CHANGED

@@ -84,19 +84,118 @@ Humans in the Gradio debugger and agents in `inference.py` use the same read-only

 This keeps the benchmark focused on investigation quality rather than privileged access.
+## How The Agent Should Choose Methods
+
+The intended agent behavior is not "call every tool and submit everything." The benchmark rewards selecting the narrowest useful method for the anomaly family that the evidence supports.
+
+Practical method-selection rules:
+
+- start with `task_overview` to confirm the task shape, expected payload format, and visible config
+- use broad discovery methods first when the anomaly family is unclear:
+  - `get_median_filter_rows`
+  - `list_suspicious_dates`
+  - `rows_for_date`
+- use targeted confirmation methods once a candidate anomaly is visible:
+  - `compare_count_to_median` for event-count spikes or drops
+  - `compare_rate_to_median` for conversion-rate shifts
+  - `detect_funnel_break` for step-level funnel problems
+  - `check_impossible_counts` for instrumentation or impossible-value issues
+- use family-specific generator methods when the anomaly family is already clear:
+  - `get_absolute_spike_in_event_count_rows`
+  - `get_absolute_drop_in_event_count_rows`
+  - `get_rate_spike_from_median_rows`
+  - `get_rate_drop_from_median_rows`
+  - `get_funnel_break_rows`
+  - `get_hourly_traffic_mix_shift_rows`
+  - `get_instrumentation_data_quality_issue_rows`
+- prefer fewer high-confidence rows over broad over-submission, because extra rows are penalized
+- use `preview_submission` before final submission when manually building rows
+
+In practice, a strong agent usually follows this pattern:
+
+1. Identify which metric family is likely broken.
+2. Confirm the exact date and entity with a comparison tool.
+3. Generate the smallest plausible payload.
+4. Submit only when the evidence is specific enough to justify the row.
+
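The selection rules above amount to a small dispatch from suspected anomaly family to confirmation method. A minimal sketch, assuming a hypothetical client around the environment (`FAMILY_TO_METHOD` keys are illustrative labels; only the method names come from the documented list):

```python
# Hedged sketch: pick the narrowest confirmation method for a suspected
# anomaly family, falling back to broad discovery when the family is unclear.
# The family labels are hypothetical; the method names are from the README.
FAMILY_TO_METHOD = {
    "count_spike_or_drop": "compare_count_to_median",
    "rate_shift": "compare_rate_to_median",
    "funnel_break": "detect_funnel_break",
    "data_quality": "check_impossible_counts",
}


def choose_confirmation_method(suspected_family: str) -> str:
    # Unknown family -> start broad instead of guessing a targeted tool.
    return FAMILY_TO_METHOD.get(suspected_family, "get_median_filter_rows")
```

This mirrors the "targeted confirmation once a candidate is visible, broad discovery otherwise" rule without hard-coding a tool-calling loop.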
+## How Payload Generation Works In The Server
+
+The server supports two final submission paths:
+
+- direct row submission with `classifications`
+- declarative server-side generation with `payload_generators`
+
+The payload-generator path is usually simpler and more stable because the model chooses methods and thresholds, and the server constructs the final anomaly rows.
+
+Simple flow:
+
+```text
+LLM
+  -> choose analysis method from available_methods
+  -> inspect evidence from analysis_result
+  -> choose one or more payload generator methods
+  -> submit payload_generators
+
+Server
+  -> run payload_generator inside the environment
+  -> create normalized submission rows
+  -> grade submitted_rows against hidden expected_rows
+  -> return reward_breakdown, submission_issues, generated_rows
+```
+
+At the server level, the path is:
+
+```text
+MetricTrackerRlAction(payload_generators=[...])
+  -> environment step
+  -> _run_analysis("payload_generator", ...)
+  -> generated_rows
+  -> grade_submission(submitted_rows, expected_rows)
+  -> observation.reward_breakdown + observation.submission_issues
+```
+
+This means the LLM is responsible for choosing the right generator method, but the server is responsible for turning that declarative request into actual payload rows and grading them.
+
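The grading step can be illustrated with a toy precision-penalized scorer. The actual `grade_submission` formula is not shown in this diff, so the set-matching scheme and the 0.5 penalty weight below are assumptions chosen only to show why over-submission hurts:

```python
def toy_grade(submitted: list[tuple], expected: list[tuple]) -> dict:
    """Toy scorer: rewards matched rows, penalizes extra rows.

    Rows are (date, entity_name) pairs. The 0.5 extra-row penalty is an
    illustrative assumption, not the benchmark's real reward formula.
    """
    expected_set, submitted_set = set(expected), set(submitted)
    matched = submitted_set & expected_set
    extra = submitted_set - expected_set
    missing = expected_set - submitted_set
    denom = max(len(expected_set), 1)
    reward = len(matched) / denom - 0.5 * len(extra) / denom
    return {
        "matched": len(matched),
        "extra": len(extra),
        "missing": len(missing),
        "reward": reward,
    }
```

Under this toy scheme, submitting one correct row scores 1.0, while padding it with a wrong row drops the score to 0.5 — the same incentive the README describes.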
+## Example Decision Path
+
+Suppose the agent suspects a conversion-rate drop but does not yet know which metric is responsible.
+
+```text
+1. task_overview()
+2. get_median_filter_rows(metric_names=["app_open_to_order_placed", "app_open_to_payment_successful"], threshold_multiplier=2.0)
+3. compare_rate_to_median(date="2026-03-19", entity_name="app_open_to_payment_successful")
+4. payload_generator(generator_methods=[
+     {
+       "method_name": "get_rate_drop_from_median_rows",
+       "metric_name": "app_open_to_payment_successful",
+       "threshold_multiplier": 2.0
+     }
+   ])
+5. submit_payload_generator(...)
+```
+
+What happens conceptually:
+
+- the first broad method narrows the search space
+- the comparison method confirms the metric/date pair with a baseline and observed value
+- the generator submission asks the server to build the final row in the benchmark's required schema
+- the grader scores the generated payload against the hidden expected anomalies
+
+If the feedback reports extra rows or missing rows, the agent should refine the generator choice or threshold rather than blindly adding more methods.
+
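The step-4 generator request above can also be written as the raw action body that the `payload_generators` path accepts. The inner fields mirror the README example; the assumption is only that the outer action is a JSON object with `payload_generators` and `classifications` keys, matching the `MetricTrackerRlAction(payload_generators=[...])` flow described in this commit:

```python
import json

# The generator request from step 4, as a JSON-serializable action body.
# Inner fields are taken verbatim from the README's example decision path.
action = {
    "payload_generators": [
        {
            "method_name": "get_rate_drop_from_median_rows",
            "metric_name": "app_open_to_payment_successful",
            "threshold_multiplier": 2.0,
        }
    ],
    "classifications": [],  # empty: the server builds the rows declaratively
}
serialized = json.dumps(action)
```

Sending one narrowly scoped generator like this, rather than several broad ones, is exactly the over-submission discipline the section above recommends.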
 ## Tasks And Expected Difficulty
 
 The benchmark ships with three named deterministic tasks:
 
 1. `easy_single_spike`
    Expected difficulty: easy.
-
+   Two rate-spike anomalies are present. A careful targeted investigation should usually be enough.
 2. `medium_mixed_pair`
    Expected difficulty: medium.
    Three anomalies are present across mixed count and rate signals. Precision matters because over-submission is penalized.
 3. `hard_mixed_multi`
    Expected difficulty: hard.
-
+   Four anomalies are present with higher density and weaker signal separation. Agents need broader exploration and tighter filtering.
 
 Supported anomaly families across resets:
inference.py CHANGED

@@ -52,9 +52,9 @@ SYSTEM_PROMPT = textwrap.dedent(

 - Do not request full hidden answers or assume direct access to ground truth.
 - Prefer declarative payload generators over manual row construction.
 - Start from the default reset observation only.
-- Start by trying `
+- Start by trying `get_rate_spike_from_median_rows` across different conversion metrics to learn which metrics produce useful anomaly rows.
 - Compare candidate metrics, then refine with raw-data inspection and median/std methods only when needed.
-- Prefer: task_overview ->
+- Prefer: task_overview -> get_rate_spike_from_median_rows on several conversion metrics -> compare useful results -> payload_generator -> submit_payload_generator.
 - Keep notes brief and factual.
 """
 ).strip()

@@ -245,7 +245,7 @@ def build_initial_user_prompt(observation: MetricTrackerRlObservation) -> str:

 {json.dumps(observation.model_dump(exclude={"debug"}), indent=2)}
 
 Prefer building a payload generator first, then submit it.
-Start by calling `
+Start by calling `get_rate_spike_from_median_rows` on several conversion metrics and see which ones return useful anomaly rows.
 If a metric returns nothing or low-signal rows, try another metric.
 For funnel, hourly mix, or data-quality tasks, use the family-specific generator methods instead.

@@ -254,9 +254,9 @@ def build_initial_user_prompt(observation: MetricTrackerRlObservation) -> str:

 `baseline_value`, `observed_value`, `delta_value`, `severity`.
 
 Supported generator method example:
-`{{"method_name":"
+`{{"method_name":"get_rate_spike_from_median_rows","threshold_multiplier":2.0}}`
 or
-`{{"method_name":"
+`{{"method_name":"get_rate_spike_from_median_rows","metric_names":["app_open_to_order_placed","app_open_to_payment_successful"],"threshold_multiplier":2.0}}`
 
 Use shared analysis methods only. Prefer `submit_payload_generator` over `submit_solution`.
 """
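The two generator-method examples in the prompt (the doubled braces there are `str.format` escapes for literal `{`/`}`) can be sanity-checked before submission. A small sketch; the set of accepted keys below is inferred only from the two examples and is not an exhaustive schema:

```python
import json

# Keys observed in the prompt's two examples; the real server may accept more.
ALLOWED_KEYS = {"method_name", "metric_name", "metric_names", "threshold_multiplier"}


def parse_generator_spec(raw: str) -> dict:
    """Parse one generator-method JSON string and sanity-check its keys."""
    spec = json.loads(raw)
    unknown = set(spec) - ALLOWED_KEYS
    if "method_name" not in spec or unknown:
        raise ValueError(f"bad generator spec: unknown keys {unknown or 'none'}, "
                         f"method_name present: {'method_name' in spec}")
    return spec


spec = parse_generator_spec(
    '{"method_name":"get_rate_spike_from_median_rows","threshold_multiplier":2.0}'
)
```

Validating locally like this catches malformed specs before they cost a graded submission step.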
server/gradio_ui.py CHANGED

@@ -3,6 +3,8 @@

 from __future__ import annotations
 
 import json
+import math
+from statistics import median
 
 import pandas as pd
 

@@ -48,6 +50,26 @@ METRIC_CHOICES = [

     "app_open_to_order_placed",
     "app_open_to_payment_successful",
 ]
+CONVERSION_METRICS = [
+    "app_open_to_menu_open",
+    "menu_open_to_product_added_to_cart",
+    "product_added_to_cart_to_order_placed",
+    "order_placed_to_payment_successful",
+    "app_open_to_order_placed",
+    "app_open_to_payment_successful",
+]
+THRESHOLD_METHODS = {
+    "get_rows_with_abs_diff_from_median_gt",
+    "get_median_filter_rows",
+    "get_rate_drop_from_median_rows",
+    "get_rate_spike_from_median_rows",
+    "get_absolute_drop_in_event_count_rows",
+    "get_absolute_spike_in_event_count_rows",
+}
+MEDIAN_METHODS = THRESHOLD_METHODS | {
+    "get_metric_median",
+    "get_metric_std_dev_from_median",
+}
 
 
 def build_metric_tracker_gradio_app(

@@ -75,7 +97,7 @@ def build_metric_tracker_gradio_app(

     Standard mode exposes method calls only. You inspect data through methods like
     `show_raw_data`, `get_metric_median`, `get_metric_std_dev_from_median`,
     and `get_rows_with_abs_diff_from_median_gt`, then assemble payload generators
-    such as `
+    such as `get_rate_spike_from_median_rows(metric_name, threshold_multiplier)`.
     """
     )
 

@@ -137,19 +159,19 @@

 
     gr.Markdown("## Methods")
     gr.Markdown(
-        "Run a method after reset to
-        "
+        "Run a method after reset to inspect daily aggregate data and then start from "
+        "rate-spike detection on conversion metrics before broadening the search."
     )
     with gr.Row():
         method_name = gr.Dropdown(
             label="Method",
             choices=METHOD_CHOICES,
-            value="
+            value="get_rate_spike_from_median_rows",
         )
         method_metric = gr.Dropdown(
             label="metrics",
             choices=METRIC_CHOICES,
-            value=
+            value=CONVERSION_METRICS,
             multiselect=True,
         )
     method_threshold = gr.Number(label="threshold / multiplier", value=2.0)

@@ -164,12 +186,26 @@

         value="[]",
         interactive=True,
     )
+    plot_metrics = gr.Dropdown(
+        label="Plot Metrics",
+        choices=METRIC_CHOICES,
+        value=CONVERSION_METRICS,
+        multiselect=True,
+    )
+    metric_plot = gr.LinePlot(
+        label="Metric Plot",
+        x="date",
+        y="value",
+        color="series",
+        tooltip=["date", "series", "value"],
+    )
+    generated_rows = gr.Dataframe(label="Payload Rows For Current Method", interactive=False)
     analysis_result = gr.JSON(label="Last Method Results")
 
     with gr.Tab("Method Data"):
         gr.Markdown(
-            "This panel shows only method-returned data.
-            "
+            "This panel shows only method-returned data. The chart already loads all daily "
+            "rows on reset, so use this table to inspect the exact rows returned by the current method."
         )
         method_rows = gr.Dataframe(label="Method Rows", interactive=False)
 

@@ -185,11 +221,11 @@

     )
     payload_generator_methods = gr.JSON(label="Methods Passed to Payload Generator")
     with gr.Row():
-        generator_method_name = gr.Dropdown(label="method_name", choices=GENERATOR_METHODS, value="
+        generator_method_name = gr.Dropdown(label="method_name", choices=GENERATOR_METHODS, value="get_rate_spike_from_median_rows")
         generator_metric_name = gr.Dropdown(
             label="metrics",
             choices=METRIC_CHOICES,
-            value=
+            value=CONVERSION_METRICS,
             multiselect=True,
         )
     generator_multiplier = gr.Number(label="threshold_multiplier", value=2.0)

@@ -207,7 +243,6 @@

     available_methods = gr.JSON(label="Shared Methods")
     submission_feedback = gr.JSON(label="Submission Feedback")
     reward_breakdown = gr.JSON(label="Reward Breakdown")
-    generated_rows = gr.Dataframe(label="Generated Payload Rows", interactive=False)
     raw_json = gr.Code(label="Latest Environment Response", language="json", interactive=False)
     debug_snapshot = gr.JSON(label="Debug Snapshot")
 

@@ -222,7 +257,17 @@

         task.to_model().model_dump(),
     )
 
-    async def reset_episode(
+    async def reset_episode(
+        selected_task_id,
+        seed_value,
+        family,
+        level,
+        density,
+        anomaly_count_value,
+        reset_anomalies_json,
+        debug_enabled,
+        selected_plot_metrics,
+    ):
         try:
             parsed_anomalies = json.loads(reset_anomalies_json or "[]")
             if not isinstance(parsed_anomalies, list):

@@ -240,6 +285,7 @@

                 {},
                 {},
                 {},
+                _plot_frame([], selected_plot_metrics, None),
                 _generator_frame([]),
                 _generator_frame([]),
                 "",

@@ -262,7 +308,7 @@

         method_data = await web_manager.step_environment(
             {
                 "analysis_method": "show_raw_data",
-                "analysis_args": {"limit":
+                "analysis_args": {"limit": 10000},
                 "classifications": [],
                 "payload_generators": [],
             }

@@ -270,8 +316,11 @@

         state = _state_from_response(data)
         state["latest_response"] = method_data
        state["last_method_result"] = method_data.get("observation", {}).get("analysis_result")
+        state["raw_rows"] = ((state["last_method_result"] or {}).get("result") or {}).get("rows", [])
+        state["last_plot_context"] = None
         obs = data.get("observation", {})
         method_result = state["last_method_result"] or {}
+        plot_frame = _plot_frame(state["raw_rows"], selected_plot_metrics, state["last_plot_context"])
         available_tasks = obs.get("available_tasks") or list(TASK_SUMMARIES.values())
         active_task_payload = next(
             (item for item in available_tasks if item.get("task_id") == obs.get("task_id")),

@@ -300,6 +349,7 @@

             method_result,
             obs.get("submission_issues") or [],
             obs.get("reward_breakdown") or {},
+            plot_frame,
             _method_frame(method_result),
             pd.DataFrame(),
             json.dumps(method_data, indent=2),

@@ -317,9 +367,10 @@

         method_rows_value: str,
         threshold: float,
         limit_value: int,
+        selected_plot_metrics: list[str],
     ):
         if not payload.get("active"):
-            return payload, {"error": "Reset the environment first."}, "", gr.skip(), gr.skip(), gr.skip()
+            return payload, {"error": "Reset the environment first."}, "", gr.skip(), gr.skip(), gr.skip(), gr.skip()
         args = _method_args(
             selected_method,
             metric_names,

@@ -340,13 +391,20 @@

         )
         payload["latest_response"] = data
         payload["last_method_result"] = data.get("observation", {}).get("analysis_result")
+        payload["last_plot_context"] = {
+            "method_name": selected_method,
+            "metric_names": [item for item in (metric_names or []) if item],
+            "threshold": float(threshold),
+        }
         method_result = payload["last_method_result"] or {}
         generated = method_result.get("result", {}).get("generated_rows", [])
         method_frame = _method_frame(method_result)
+        plot_frame = _plot_frame(payload.get("raw_rows", []), selected_plot_metrics, payload["last_plot_context"])
         return (
             payload,
             method_result,
             data.get("observation", {}).get("message", ""),
+            plot_frame,
             method_frame,
             pd.DataFrame(generated),
             json.dumps(data, indent=2),

@@ -474,9 +532,16 @@

     def get_state_sync():
         return json.dumps(web_manager.get_state(), indent=2)
 
+    def update_plot(payload: dict, selected_plot_metrics: list[str]):
+        return _plot_frame(
+            payload.get("raw_rows", []),
+            selected_plot_metrics,
+            payload.get("last_plot_context"),
+        )
+
     reset_btn.click(
         fn=reset_episode,
-        inputs=[task_id, seed, scenario_family, difficulty, anomaly_density, anomaly_count, reset_anomalies, debug_mode],
+        inputs=[task_id, seed, scenario_family, difficulty, anomaly_density, anomaly_count, reset_anomalies, debug_mode, plot_metrics],
         outputs=[
             session_state,
             status,

@@ -489,6 +554,7 @@

             analysis_result,
             submission_feedback,
             reward_breakdown,
+            metric_plot,
             method_rows,
             generated_rows,
             raw_json,

@@ -513,8 +579,14 @@

             method_rows_json,
             method_threshold,
             method_limit,
+            plot_metrics,
         ],
-        outputs=[session_state, analysis_result, status, method_rows, generated_rows, raw_json],
+        outputs=[session_state, analysis_result, status, metric_plot, method_rows, generated_rows, raw_json],
+    )
+    plot_metrics.change(
+        fn=update_plot,
+        inputs=[session_state, plot_metrics],
+        outputs=[metric_plot],
     )
     add_generator_btn.click(
         fn=add_or_update_generator,

@@ -649,6 +721,8 @@ def _state_from_response(data: dict) -> dict:

         "payload_generators": [],
         "last_method_result": data.get("observation", {}).get("analysis_result"),
         "latest_response": data,
+        "raw_rows": [],
+        "last_plot_context": None,
     }
 

@@ -725,4 +799,86 @@ def _empty_state() -> dict:

         "payload_generators": [],
         "last_method_result": None,
         "latest_response": None,
+        "raw_rows": [],
+        "last_plot_context": None,
     }
+
+
+def _plot_frame(raw_rows: list[dict], selected_metrics: list[str], plot_context: dict | None) -> pd.DataFrame:
+    if not raw_rows:
+        return pd.DataFrame(columns=["date", "value", "series"])
+    frame = pd.DataFrame(raw_rows)
+    if "date" not in frame.columns:
+        return pd.DataFrame(columns=["date", "value", "series"])
+    metrics = [item for item in (selected_metrics or []) if item in frame.columns]
+    if not metrics:
+        return pd.DataFrame(columns=["date", "value", "series"])
+
+    rows: list[dict] = []
+    for metric_name in metrics:
+        values = pd.to_numeric(frame[metric_name], errors="coerce")
+        for date_value, numeric_value in zip(frame["date"], values):
+            if pd.isna(numeric_value):
+                continue
+            rows.append(
+                {
+                    "date": date_value,
+                    "value": float(numeric_value),
+                    "series": metric_name,
+                }
+            )
+        rows.extend(_overlay_rows(frame, metric_name, plot_context))
+    return pd.DataFrame(rows, columns=["date", "value", "series"])
+
+
+def _overlay_rows(frame: pd.DataFrame, metric_name: str, plot_context: dict | None) -> list[dict]:
+    if not plot_context:
+        return []
+    selected_metrics = [item for item in (plot_context.get("metric_names") or []) if item]
+    method_name = plot_context.get("method_name")
+    threshold = float(plot_context.get("threshold", 0.0))
+    if metric_name not in selected_metrics or method_name not in MEDIAN_METHODS:
+        return []
+
+    values = [float(item) for item in pd.to_numeric(frame[metric_name], errors="coerce").dropna().tolist()]
+    if not values:
+        return []
+    dates = frame["date"].tolist()
+    metric_median = float(median(values))
+    rows = [
+        {"date": date_value, "value": metric_median, "series": f"{metric_name} median"}
+        for date_value in dates
+    ]
+    threshold_value = None
+    if method_name == "get_metric_std_dev_from_median":
+        threshold_value = _std_from_median(values)
+    elif method_name == "get_rows_with_abs_diff_from_median_gt":
+        threshold_value = threshold
+    elif method_name in THRESHOLD_METHODS:
+        threshold_value = _std_from_median(values) * threshold
+
+    if threshold_value is None:
+        return rows
+    upper = metric_median + threshold_value
+    lower = metric_median - threshold_value
+    suffix = (
+        f"{threshold:.2f}*std"
+        if method_name in THRESHOLD_METHODS and method_name != "get_rows_with_abs_diff_from_median_gt"
+        else "threshold"
+    )
+    rows.extend(
+        {"date": date_value, "value": upper, "series": f"{metric_name} + {suffix}"}
+        for date_value in dates
+    )
+    rows.extend(
+        {"date": date_value, "value": lower, "series": f"{metric_name} - {suffix}"}
+        for date_value in dates
+    )
+    return rows
+
+
+def _std_from_median(values: list[float]) -> float:
+    if not values:
+        return 0.0
+    metric_median = median(values)
+    return math.sqrt(sum((value - metric_median) ** 2 for value in values) / len(values))
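The `_std_from_median` helper added in this diff computes a population-style RMS deviation around the median rather than the mean (which is why the plot overlay labels say `std` relative to the median line). A standalone check of that formula, copied from the diff:

```python
import math
from statistics import median


def std_from_median(values: list[float]) -> float:
    # Same formula as the diff's _std_from_median: root-mean-square
    # deviation of the values from their median (not their mean).
    if not values:
        return 0.0
    m = median(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
```

For `[1, 2, 3]` the median is 2 and the deviations are 1, 0, 1, giving `sqrt(2/3)`; an outlier-heavy series keeps the median anchored, so the band width reflects the outlier without the center shifting.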
tasks.py CHANGED

@@ -77,16 +77,16 @@ TASKS: dict[str, TaskSpec] = {

         task_id="easy_single_spike",
         difficulty="easy",
         instruction=(
-            "Investigate the seeded funnel dataset and submit
+            "Investigate the seeded funnel dataset and submit every anomalous row. "
             "Use the shared analysis methods before submitting."
         ),
         objective=(
-            "Find
+            "Find all anomalies and submit every correctly populated anomaly row."
         ),
         seed=11,
-        scenario_family="
+        scenario_family="rate_spike_from_median",
         anomaly_density="low",
-        anomaly_count=
+        anomaly_count=2,
     ),
     "medium_mixed_pair": TaskSpec(
         task_id="medium_mixed_pair",

@@ -116,7 +116,7 @@ TASKS: dict[str, TaskSpec] = {

         seed=37,
         scenario_family="mixed",
         anomaly_density="high",
-        anomaly_count=
+        anomaly_count=4,
     ),
 }
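A minimal stand-in for the `TaskSpec` container used in the registry above, assuming it is a simple frozen dataclass keyed by the keyword arguments shown in the diff (the real class may carry extra fields and the `to_model()` method seen in `gradio_ui.py`):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    # Field names mirror the keyword arguments used in the TASKS registry;
    # the dataclass form itself is an assumption for illustration.
    task_id: str
    difficulty: str
    instruction: str
    objective: str
    seed: int
    scenario_family: str
    anomaly_density: str
    anomaly_count: int


easy = TaskSpec(
    task_id="easy_single_spike",
    difficulty="easy",
    instruction="Investigate the seeded funnel dataset and submit every anomalous row.",
    objective="Find all anomalies and submit every correctly populated anomaly row.",
    seed=11,
    scenario_family="rate_spike_from_median",
    anomaly_density="low",
    anomaly_count=2,
)
```

Freezing the spec keeps task definitions deterministic across resets, which matches the "named deterministic tasks" framing in the README.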