kushalExplores committed on
Commit
eb340e3
·
verified ·
1 Parent(s): 9ff23e0

Upload folder using huggingface_hub

Files changed (4)
  1. README.md +101 -2
  2. inference.py +5 -5
  3. server/gradio_ui.py +171 -15
  4. tasks.py +5 -5
README.md CHANGED
@@ -84,19 +84,118 @@ Humans in the Gradio debugger and agents in `inference.py` use the same read-onl
84
 
85
  This keeps the benchmark focused on investigation quality rather than privileged access.
86
 
87
  ## Tasks And Expected Difficulty
88
 
89
  The benchmark ships with three named deterministic tasks:
90
 
91
  1. `easy_single_spike`
92
  Expected difficulty: easy.
93
- One obvious event-count spike is present. A careful single-method investigation should usually be enough.
94
  2. `medium_mixed_pair`
95
  Expected difficulty: medium.
96
  Three anomalies are present across mixed count and rate signals. Precision matters because over-submission is penalized.
97
  3. `hard_mixed_multi`
98
  Expected difficulty: hard.
99
- Five anomalies are present with higher density and weaker signal separation. Agents need broader exploration and tighter filtering.
100
 
101
  Supported anomaly families across resets:
102
 
 
84
 
85
  This keeps the benchmark focused on investigation quality rather than privileged access.
86
 
87
+ ## How The Agent Should Choose Methods
88
+
89
+ The intended agent behavior is not "call every tool and submit everything." The benchmark rewards selecting the narrowest useful method for the anomaly family that the evidence supports.
90
+
91
+ Practical method-selection rules:
92
+
93
+ - start with `task_overview` to confirm the task shape, expected payload format, and visible config
94
+ - use broad discovery methods first when the anomaly family is unclear:
95
+ - `get_median_filter_rows`
96
+ - `list_suspicious_dates`
97
+ - `rows_for_date`
98
+ - use targeted confirmation methods once a candidate anomaly is visible:
99
+ - `compare_count_to_median` for event-count spikes or drops
100
+ - `compare_rate_to_median` for conversion-rate shifts
101
+ - `detect_funnel_break` for step-level funnel problems
102
+ - `check_impossible_counts` for instrumentation or impossible-value issues
103
+ - use family-specific generator methods when the anomaly family is already clear:
104
+ - `get_absolute_spike_in_event_count_rows`
105
+ - `get_absolute_drop_in_event_count_rows`
106
+ - `get_rate_spike_from_median_rows`
107
+ - `get_rate_drop_from_median_rows`
108
+ - `get_funnel_break_rows`
109
+ - `get_hourly_traffic_mix_shift_rows`
110
+ - `get_instrumentation_data_quality_issue_rows`
111
+ - prefer fewer high-confidence rows over broad over-submission because extra rows are penalized
112
+ - use `preview_submission` before final submission when manually building rows
113
+
114
+ In practice, a strong agent usually follows this pattern:
115
+
116
+ 1. Identify which metric family is likely broken.
117
+ 2. Confirm the exact date and entity with a comparison tool.
118
+ 3. Generate the smallest plausible payload.
119
+ 4. Submit only when the evidence is specific enough to justify the row.
120
+
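The selection rules above can be sketched as a small lookup. This is an illustrative sketch only: the confirmation method names come from the list above, but the family keys and the discovery fallback are assumptions, not part of the benchmark API.

```python
# Hedged sketch: map a suspected anomaly family to the narrowest
# confirmation method named above. The family keys and the fallback
# choice are illustrative assumptions, not benchmark identifiers.
CONFIRMATION_METHOD_BY_FAMILY = {
    "event_count_spike": "compare_count_to_median",
    "event_count_drop": "compare_count_to_median",
    "rate_shift": "compare_rate_to_median",
    "funnel_break": "detect_funnel_break",
    "data_quality": "check_impossible_counts",
}


def choose_confirmation_method(family: str) -> str:
    """Pick the targeted method for a known family; fall back to a
    broad discovery method when the family is still unclear."""
    return CONFIRMATION_METHOD_BY_FAMILY.get(family, "list_suspicious_dates")
```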
121
+ ## How Payload Generation Works In The Server
122
+
123
+ The server supports two final submission paths:
124
+
125
+ - direct row submission with `classifications`
126
+ - declarative server-side generation with `payload_generators`
127
+
128
+ The payload-generator path is usually simpler and more stable because the model chooses methods and thresholds, and the server constructs the final anomaly rows.
129
+
130
+ Simple flow:
131
+
132
+ ```text
133
+ LLM
134
+ -> choose analysis method from available_methods
135
+ -> inspect evidence from analysis_result
136
+ -> choose one or more payload generator methods
137
+ -> submit payload_generators
138
+
139
+ Server
140
+ -> run payload_generator inside the environment
141
+ -> create normalized submission rows
142
+ -> grade submitted_rows against hidden expected_rows
143
+ -> return reward_breakdown, submission_issues, generated_rows
144
+ ```
145
+
146
+ At the server level, the path is:
147
+
148
+ ```text
149
+ MetricTrackerRlAction(payload_generators=[...])
150
+ -> environment step
151
+ -> _run_analysis("payload_generator", ...)
152
+ -> generated_rows
153
+ -> grade_submission(submitted_rows, expected_rows)
154
+ -> observation.reward_breakdown + observation.submission_issues
155
+ ```
156
+
157
+ This means the LLM is responsible for choosing the right generator method, but the server is responsible for turning that declarative request into actual payload rows and grading them.
158
+
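The grading step at the end of this path can be approximated as set matching with an over-submission penalty. A minimal sketch, assuming rows are keyed by `(date, entity_name)` and an illustrative 0.5 penalty weight; the real `grade_submission` signature and weights live inside the server:

```python
def grade_submission_sketch(submitted: list[dict], expected: list[dict]) -> dict:
    """Minimal sketch of the grading idea described above: reward
    matched rows, penalize extras. The row key and the 0.5 penalty
    weight are assumptions for illustration."""
    expected_keys = {(row["date"], row["entity_name"]) for row in expected}
    submitted_keys = {(row["date"], row["entity_name"]) for row in submitted}
    matched = expected_keys & submitted_keys
    extra = submitted_keys - expected_keys
    missing = expected_keys - submitted_keys
    return {
        "matched": len(matched),
        "extra": len(extra),
        "missing": len(missing),
        "reward": len(matched) - 0.5 * len(extra),
    }
```

This captures why over-submission hurts: each extra row subtracts from the reward that matched rows earn.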
159
+ ## Example Decision Path
160
+
161
+ Suppose the agent suspects a conversion-rate drop but does not yet know which metric is responsible.
162
+
163
+ ```text
164
+ 1. task_overview()
165
+ 2. get_median_filter_rows(metric_names=["app_open_to_order_placed", "app_open_to_payment_successful"], threshold_multiplier=2.0)
166
+ 3. compare_rate_to_median(date="2026-03-19", entity_name="app_open_to_payment_successful")
167
+ 4. payload_generator(generator_methods=[
168
+ {
169
+ "method_name": "get_rate_drop_from_median_rows",
170
+ "metric_name": "app_open_to_payment_successful",
171
+ "threshold_multiplier": 2.0
172
+ }
173
+ ])
174
+ 5. submit_payload_generator(...)
175
+ ```
176
+
177
+ What happens conceptually:
178
+
179
+ - the first broad method narrows the search space
180
+ - the comparison method confirms the metric/date pair with a baseline and observed value
181
+ - the generator submission asks the server to build the final row in the benchmark's required schema
182
+ - the grader scores the generated payload against the hidden expected anomalies
183
+
184
+ If the feedback reports extra rows or missing rows, the agent should refine the generator choice or threshold rather than blindly adding more methods.
185
+
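That refinement rule can be expressed as a tiny adjustment function. A sketch under stated assumptions: the step sizes and the 1.0 floor are illustrative, and the feedback keys mirror the extra/missing counts reported above rather than an exact server schema.

```python
def refine_threshold(threshold: float, feedback: dict) -> float:
    """Sketch of the refinement rule above: extra rows mean the filter
    is too loose, so raise the multiplier; missing rows mean it is too
    strict, so lower it. Step sizes and the 1.0 floor are assumptions."""
    if feedback.get("extra", 0) > 0:
        return threshold * 1.25
    if feedback.get("missing", 0) > 0:
        return max(threshold * 0.8, 1.0)
    return threshold
```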
186
  ## Tasks And Expected Difficulty
187
 
188
  The benchmark ships with three named deterministic tasks:
189
 
190
  1. `easy_single_spike`
191
  Expected difficulty: easy.
192
+ Two rate-spike anomalies are present. A careful targeted investigation should usually be enough.
193
  2. `medium_mixed_pair`
194
  Expected difficulty: medium.
195
  Three anomalies are present across mixed count and rate signals. Precision matters because over-submission is penalized.
196
  3. `hard_mixed_multi`
197
  Expected difficulty: hard.
198
+ Four anomalies are present with higher density and weaker signal separation. Agents need broader exploration and tighter filtering.
199
 
200
  Supported anomaly families across resets:
201
 
inference.py CHANGED
@@ -52,9 +52,9 @@ SYSTEM_PROMPT = textwrap.dedent(
52
  - Do not request full hidden answers or assume direct access to ground truth.
53
  - Prefer declarative payload generators over manual row construction.
54
  - Start from the default reset observation only.
55
- - Start by trying `get_median_filter_rows` across different metrics to learn which metrics produce useful anomaly rows.
56
  - Compare candidate metrics, then refine with raw-data inspection and median/std methods only when needed.
57
- - Prefer: task_overview -> get_median_filter_rows on several metrics -> compare useful results -> payload_generator -> submit_payload_generator.
58
  - Keep notes brief and factual.
59
  """
60
  ).strip()
@@ -245,7 +245,7 @@ def build_initial_user_prompt(observation: MetricTrackerRlObservation) -> str:
245
  {json.dumps(observation.model_dump(exclude={"debug"}), indent=2)}
246
 
247
  Prefer building a payload generator first, then submit it.
248
- Start by calling `get_median_filter_rows` on several different metrics and see which ones return useful anomaly rows.
249
  If a metric returns nothing or low-signal rows, try another metric.
250
  For funnel, hourly mix, or data-quality tasks, use the family-specific generator methods instead.
251
 
@@ -254,9 +254,9 @@ def build_initial_user_prompt(observation: MetricTrackerRlObservation) -> str:
254
  `baseline_value`, `observed_value`, `delta_value`, `severity`.
255
 
256
  Supported generator method example:
257
- `{{"method_name":"get_median_filter_rows","threshold_multiplier":2.0}}`
258
  or
259
- `{{"method_name":"get_median_filter_rows","metric_names":["app_open_to_order_placed","orders_placed"],"threshold_multiplier":2.0}}`
260
 
261
  Use shared analysis methods only. Prefer `submit_payload_generator` over `submit_solution`.
262
  """
 
52
  - Do not request full hidden answers or assume direct access to ground truth.
53
  - Prefer declarative payload generators over manual row construction.
54
  - Start from the default reset observation only.
55
+ - Start by trying `get_rate_spike_from_median_rows` across different conversion metrics to learn which metrics produce useful anomaly rows.
56
  - Compare candidate metrics, then refine with raw-data inspection and median/std methods only when needed.
57
+ - Prefer: task_overview -> get_rate_spike_from_median_rows on several conversion metrics -> compare useful results -> payload_generator -> submit_payload_generator.
58
  - Keep notes brief and factual.
59
  """
60
  ).strip()
 
245
  {json.dumps(observation.model_dump(exclude={"debug"}), indent=2)}
246
 
247
  Prefer building a payload generator first, then submit it.
248
+ Start by calling `get_rate_spike_from_median_rows` on several conversion metrics and see which ones return useful anomaly rows.
249
  If a metric returns nothing or low-signal rows, try another metric.
250
  For funnel, hourly mix, or data-quality tasks, use the family-specific generator methods instead.
251
 
 
254
  `baseline_value`, `observed_value`, `delta_value`, `severity`.
255
 
256
  Supported generator method example:
257
+ `{{"method_name":"get_rate_spike_from_median_rows","threshold_multiplier":2.0}}`
258
  or
259
+ `{{"method_name":"get_rate_spike_from_median_rows","metric_names":["app_open_to_order_placed","app_open_to_payment_successful"],"threshold_multiplier":2.0}}`
260
 
261
  Use shared analysis methods only. Prefer `submit_payload_generator` over `submit_solution`.
262
  """
server/gradio_ui.py CHANGED
@@ -3,6 +3,8 @@
3
  from __future__ import annotations
4
 
5
  import json
6
 
7
  import pandas as pd
8
 
@@ -48,6 +50,26 @@ METRIC_CHOICES = [
48
  "app_open_to_order_placed",
49
  "app_open_to_payment_successful",
50
  ]
51
 
52
 
53
  def build_metric_tracker_gradio_app(
@@ -75,7 +97,7 @@ def build_metric_tracker_gradio_app(
75
  Standard mode exposes method calls only. You inspect data through methods like
76
  `show_raw_data`, `get_metric_median`, `get_metric_std_dev_from_median`,
77
  and `get_rows_with_abs_diff_from_median_gt`, then assemble payload generators
78
- such as `get_median_filter_rows(metric_name, threshold_multiplier)`.
79
  """
80
  )
81
 
@@ -137,19 +159,19 @@ def build_metric_tracker_gradio_app(
137
 
138
  gr.Markdown("## Methods")
139
  gr.Markdown(
140
- "Run a method after reset to fetch exactly the daily aggregate data, median, "
141
- "std-from-median, filtered rows, or generated payload rows you want."
142
  )
143
  with gr.Row():
144
  method_name = gr.Dropdown(
145
  label="Method",
146
  choices=METHOD_CHOICES,
147
- value="show_raw_data",
148
  )
149
  method_metric = gr.Dropdown(
150
  label="metrics",
151
  choices=METRIC_CHOICES,
152
- value=[],
153
  multiselect=True,
154
  )
155
  method_threshold = gr.Number(label="threshold / multiplier", value=2.0)
@@ -164,12 +186,26 @@ def build_metric_tracker_gradio_app(
164
  value="[]",
165
  interactive=True,
166
  )
167
  analysis_result = gr.JSON(label="Last Method Results")
168
 
169
  with gr.Tab("Method Data"):
170
  gr.Markdown(
171
- "This panel shows only method-returned data. Use `show_raw_data` for daily "
172
- "aggregate rows, then median/std/filter methods to inspect candidate anomalies."
173
  )
174
  method_rows = gr.Dataframe(label="Method Rows", interactive=False)
175
 
@@ -185,11 +221,11 @@ def build_metric_tracker_gradio_app(
185
  )
186
  payload_generator_methods = gr.JSON(label="Methods Passed to Payload Generator")
187
  with gr.Row():
188
- generator_method_name = gr.Dropdown(label="method_name", choices=GENERATOR_METHODS, value="get_median_filter_rows")
189
  generator_metric_name = gr.Dropdown(
190
  label="metrics",
191
  choices=METRIC_CHOICES,
192
- value=[],
193
  multiselect=True,
194
  )
195
  generator_multiplier = gr.Number(label="threshold_multiplier", value=2.0)
@@ -207,7 +243,6 @@ def build_metric_tracker_gradio_app(
207
  available_methods = gr.JSON(label="Shared Methods")
208
  submission_feedback = gr.JSON(label="Submission Feedback")
209
  reward_breakdown = gr.JSON(label="Reward Breakdown")
210
- generated_rows = gr.Dataframe(label="Generated Payload Rows", interactive=False)
211
  raw_json = gr.Code(label="Latest Environment Response", language="json", interactive=False)
212
  debug_snapshot = gr.JSON(label="Debug Snapshot")
213
 
@@ -222,7 +257,17 @@ def build_metric_tracker_gradio_app(
222
  task.to_model().model_dump(),
223
  )
224
 
225
- async def reset_episode(selected_task_id, seed_value, family, level, density, anomaly_count_value, reset_anomalies_json, debug_enabled):
226
  try:
227
  parsed_anomalies = json.loads(reset_anomalies_json or "[]")
228
  if not isinstance(parsed_anomalies, list):
@@ -240,6 +285,7 @@ def build_metric_tracker_gradio_app(
240
  {},
241
  {},
242
  {},
 
243
  _generator_frame([]),
244
  _generator_frame([]),
245
  "",
@@ -262,7 +308,7 @@ def build_metric_tracker_gradio_app(
262
  method_data = await web_manager.step_environment(
263
  {
264
  "analysis_method": "show_raw_data",
265
- "analysis_args": {"limit": 5},
266
  "classifications": [],
267
  "payload_generators": [],
268
  }
@@ -270,8 +316,11 @@ def build_metric_tracker_gradio_app(
270
  state = _state_from_response(data)
271
  state["latest_response"] = method_data
272
  state["last_method_result"] = method_data.get("observation", {}).get("analysis_result")
273
  obs = data.get("observation", {})
274
  method_result = state["last_method_result"] or {}
 
275
  available_tasks = obs.get("available_tasks") or list(TASK_SUMMARIES.values())
276
  active_task_payload = next(
277
  (item for item in available_tasks if item.get("task_id") == obs.get("task_id")),
@@ -300,6 +349,7 @@ def build_metric_tracker_gradio_app(
300
  method_result,
301
  obs.get("submission_issues") or [],
302
  obs.get("reward_breakdown") or {},
 
303
  _method_frame(method_result),
304
  pd.DataFrame(),
305
  json.dumps(method_data, indent=2),
@@ -317,9 +367,10 @@ def build_metric_tracker_gradio_app(
317
  method_rows_value: str,
318
  threshold: float,
319
  limit_value: int,
 
320
  ):
321
  if not payload.get("active"):
322
- return payload, {"error": "Reset the environment first."}, "", gr.skip(), gr.skip(), gr.skip()
323
  args = _method_args(
324
  selected_method,
325
  metric_names,
@@ -340,13 +391,20 @@ def build_metric_tracker_gradio_app(
340
  )
341
  payload["latest_response"] = data
342
  payload["last_method_result"] = data.get("observation", {}).get("analysis_result")
343
  method_result = payload["last_method_result"] or {}
344
  generated = method_result.get("result", {}).get("generated_rows", [])
345
  method_frame = _method_frame(method_result)
 
346
  return (
347
  payload,
348
  method_result,
349
  data.get("observation", {}).get("message", ""),
 
350
  method_frame,
351
  pd.DataFrame(generated),
352
  json.dumps(data, indent=2),
@@ -474,9 +532,16 @@ def build_metric_tracker_gradio_app(
474
  def get_state_sync():
475
  return json.dumps(web_manager.get_state(), indent=2)
476
 
  reset_btn.click(
478
  fn=reset_episode,
479
- inputs=[task_id, seed, scenario_family, difficulty, anomaly_density, anomaly_count, reset_anomalies, debug_mode],
480
  outputs=[
481
  session_state,
482
  status,
@@ -489,6 +554,7 @@ def build_metric_tracker_gradio_app(
489
  analysis_result,
490
  submission_feedback,
491
  reward_breakdown,
 
492
  method_rows,
493
  generated_rows,
494
  raw_json,
@@ -513,8 +579,14 @@ def build_metric_tracker_gradio_app(
513
  method_rows_json,
514
  method_threshold,
515
  method_limit,
 
516
  ],
517
- outputs=[session_state, analysis_result, status, method_rows, generated_rows, raw_json],
518
  )
519
  add_generator_btn.click(
520
  fn=add_or_update_generator,
@@ -649,6 +721,8 @@ def _state_from_response(data: dict) -> dict:
649
  "payload_generators": [],
650
  "last_method_result": data.get("observation", {}).get("analysis_result"),
651
  "latest_response": data,
652
  }
653
 
654
 
@@ -725,4 +799,86 @@ def _empty_state() -> dict:
725
  "payload_generators": [],
726
  "last_method_result": None,
727
  "latest_response": None,
728
  }
3
  from __future__ import annotations
4
 
5
  import json
6
+ import math
7
+ from statistics import median
8
 
9
  import pandas as pd
10
 
 
50
  "app_open_to_order_placed",
51
  "app_open_to_payment_successful",
52
  ]
53
+ CONVERSION_METRICS = [
54
+ "app_open_to_menu_open",
55
+ "menu_open_to_product_added_to_cart",
56
+ "product_added_to_cart_to_order_placed",
57
+ "order_placed_to_payment_successful",
58
+ "app_open_to_order_placed",
59
+ "app_open_to_payment_successful",
60
+ ]
61
+ THRESHOLD_METHODS = {
62
+ "get_rows_with_abs_diff_from_median_gt",
63
+ "get_median_filter_rows",
64
+ "get_rate_drop_from_median_rows",
65
+ "get_rate_spike_from_median_rows",
66
+ "get_absolute_drop_in_event_count_rows",
67
+ "get_absolute_spike_in_event_count_rows",
68
+ }
69
+ MEDIAN_METHODS = THRESHOLD_METHODS | {
70
+ "get_metric_median",
71
+ "get_metric_std_dev_from_median",
72
+ }
73
 
74
 
75
  def build_metric_tracker_gradio_app(
 
97
  Standard mode exposes method calls only. You inspect data through methods like
98
  `show_raw_data`, `get_metric_median`, `get_metric_std_dev_from_median`,
99
  and `get_rows_with_abs_diff_from_median_gt`, then assemble payload generators
100
+ such as `get_rate_spike_from_median_rows(metric_name, threshold_multiplier)`.
101
  """
102
  )
103
 
 
159
 
160
  gr.Markdown("## Methods")
161
  gr.Markdown(
162
+ "Run a method after reset to inspect daily aggregate data and then start from "
163
+ "rate-spike detection on conversion metrics before broadening the search."
164
  )
165
  with gr.Row():
166
  method_name = gr.Dropdown(
167
  label="Method",
168
  choices=METHOD_CHOICES,
169
+ value="get_rate_spike_from_median_rows",
170
  )
171
  method_metric = gr.Dropdown(
172
  label="metrics",
173
  choices=METRIC_CHOICES,
174
+ value=CONVERSION_METRICS,
175
  multiselect=True,
176
  )
177
  method_threshold = gr.Number(label="threshold / multiplier", value=2.0)
 
186
  value="[]",
187
  interactive=True,
188
  )
189
+ plot_metrics = gr.Dropdown(
190
+ label="Plot Metrics",
191
+ choices=METRIC_CHOICES,
192
+ value=CONVERSION_METRICS,
193
+ multiselect=True,
194
+ )
195
+ metric_plot = gr.LinePlot(
196
+ label="Metric Plot",
197
+ x="date",
198
+ y="value",
199
+ color="series",
200
+ tooltip=["date", "series", "value"],
201
+ )
202
+ generated_rows = gr.Dataframe(label="Payload Rows For Current Method", interactive=False)
203
  analysis_result = gr.JSON(label="Last Method Results")
204
 
205
  with gr.Tab("Method Data"):
206
  gr.Markdown(
207
+ "This panel shows only method-returned data. The chart already loads all daily "
208
+ "rows on reset, so use this table to inspect the exact rows returned by the current method."
209
  )
210
  method_rows = gr.Dataframe(label="Method Rows", interactive=False)
211
 
 
221
  )
222
  payload_generator_methods = gr.JSON(label="Methods Passed to Payload Generator")
223
  with gr.Row():
224
+ generator_method_name = gr.Dropdown(label="method_name", choices=GENERATOR_METHODS, value="get_rate_spike_from_median_rows")
225
  generator_metric_name = gr.Dropdown(
226
  label="metrics",
227
  choices=METRIC_CHOICES,
228
+ value=CONVERSION_METRICS,
229
  multiselect=True,
230
  )
231
  generator_multiplier = gr.Number(label="threshold_multiplier", value=2.0)
 
243
  available_methods = gr.JSON(label="Shared Methods")
244
  submission_feedback = gr.JSON(label="Submission Feedback")
245
  reward_breakdown = gr.JSON(label="Reward Breakdown")
 
246
  raw_json = gr.Code(label="Latest Environment Response", language="json", interactive=False)
247
  debug_snapshot = gr.JSON(label="Debug Snapshot")
248
 
 
257
  task.to_model().model_dump(),
258
  )
259
 
260
+ async def reset_episode(
261
+ selected_task_id,
262
+ seed_value,
263
+ family,
264
+ level,
265
+ density,
266
+ anomaly_count_value,
267
+ reset_anomalies_json,
268
+ debug_enabled,
269
+ selected_plot_metrics,
270
+ ):
271
  try:
272
  parsed_anomalies = json.loads(reset_anomalies_json or "[]")
273
  if not isinstance(parsed_anomalies, list):
 
285
  {},
286
  {},
287
  {},
288
+ _plot_frame([], selected_plot_metrics, None),
289
  _generator_frame([]),
290
  _generator_frame([]),
291
  "",
 
308
  method_data = await web_manager.step_environment(
309
  {
310
  "analysis_method": "show_raw_data",
311
+ "analysis_args": {"limit": 10000},
312
  "classifications": [],
313
  "payload_generators": [],
314
  }
 
316
  state = _state_from_response(data)
317
  state["latest_response"] = method_data
318
  state["last_method_result"] = method_data.get("observation", {}).get("analysis_result")
319
+ state["raw_rows"] = ((state["last_method_result"] or {}).get("result") or {}).get("rows", [])
320
+ state["last_plot_context"] = None
321
  obs = data.get("observation", {})
322
  method_result = state["last_method_result"] or {}
323
+ plot_frame = _plot_frame(state["raw_rows"], selected_plot_metrics, state["last_plot_context"])
324
  available_tasks = obs.get("available_tasks") or list(TASK_SUMMARIES.values())
325
  active_task_payload = next(
326
  (item for item in available_tasks if item.get("task_id") == obs.get("task_id")),
 
349
  method_result,
350
  obs.get("submission_issues") or [],
351
  obs.get("reward_breakdown") or {},
352
+ plot_frame,
353
  _method_frame(method_result),
354
  pd.DataFrame(),
355
  json.dumps(method_data, indent=2),
 
367
  method_rows_value: str,
368
  threshold: float,
369
  limit_value: int,
370
+ selected_plot_metrics: list[str],
371
  ):
372
  if not payload.get("active"):
373
+ return payload, {"error": "Reset the environment first."}, "", gr.skip(), gr.skip(), gr.skip(), gr.skip()
374
  args = _method_args(
375
  selected_method,
376
  metric_names,
 
391
  )
392
  payload["latest_response"] = data
393
  payload["last_method_result"] = data.get("observation", {}).get("analysis_result")
394
+ payload["last_plot_context"] = {
395
+ "method_name": selected_method,
396
+ "metric_names": [item for item in (metric_names or []) if item],
397
+ "threshold": float(threshold),
398
+ }
399
  method_result = payload["last_method_result"] or {}
400
  generated = method_result.get("result", {}).get("generated_rows", [])
401
  method_frame = _method_frame(method_result)
402
+ plot_frame = _plot_frame(payload.get("raw_rows", []), selected_plot_metrics, payload["last_plot_context"])
403
  return (
404
  payload,
405
  method_result,
406
  data.get("observation", {}).get("message", ""),
407
+ plot_frame,
408
  method_frame,
409
  pd.DataFrame(generated),
410
  json.dumps(data, indent=2),
 
532
  def get_state_sync():
533
  return json.dumps(web_manager.get_state(), indent=2)
534
 
535
+ def update_plot(payload: dict, selected_plot_metrics: list[str]):
536
+ return _plot_frame(
537
+ payload.get("raw_rows", []),
538
+ selected_plot_metrics,
539
+ payload.get("last_plot_context"),
540
+ )
541
+
542
  reset_btn.click(
543
  fn=reset_episode,
544
+ inputs=[task_id, seed, scenario_family, difficulty, anomaly_density, anomaly_count, reset_anomalies, debug_mode, plot_metrics],
545
  outputs=[
546
  session_state,
547
  status,
 
554
  analysis_result,
555
  submission_feedback,
556
  reward_breakdown,
557
+ metric_plot,
558
  method_rows,
559
  generated_rows,
560
  raw_json,
 
579
  method_rows_json,
580
  method_threshold,
581
  method_limit,
582
+ plot_metrics,
583
  ],
584
+ outputs=[session_state, analysis_result, status, metric_plot, method_rows, generated_rows, raw_json],
585
+ )
586
+ plot_metrics.change(
587
+ fn=update_plot,
588
+ inputs=[session_state, plot_metrics],
589
+ outputs=[metric_plot],
590
  )
591
  add_generator_btn.click(
592
  fn=add_or_update_generator,
 
721
  "payload_generators": [],
722
  "last_method_result": data.get("observation", {}).get("analysis_result"),
723
  "latest_response": data,
724
+ "raw_rows": [],
725
+ "last_plot_context": None,
726
  }
727
 
728
 
 
799
  "payload_generators": [],
800
  "last_method_result": None,
801
  "latest_response": None,
802
+ "raw_rows": [],
803
+ "last_plot_context": None,
804
  }
805
+
806
+
807
+ def _plot_frame(raw_rows: list[dict], selected_metrics: list[str], plot_context: dict | None) -> pd.DataFrame:
808
+ if not raw_rows:
809
+ return pd.DataFrame(columns=["date", "value", "series"])
810
+ frame = pd.DataFrame(raw_rows)
811
+ if "date" not in frame.columns:
812
+ return pd.DataFrame(columns=["date", "value", "series"])
813
+ metrics = [item for item in (selected_metrics or []) if item in frame.columns]
814
+ if not metrics:
815
+ return pd.DataFrame(columns=["date", "value", "series"])
816
+
817
+ rows: list[dict] = []
818
+ for metric_name in metrics:
819
+ values = pd.to_numeric(frame[metric_name], errors="coerce")
820
+ for date_value, numeric_value in zip(frame["date"], values):
821
+ if pd.isna(numeric_value):
822
+ continue
823
+ rows.append(
824
+ {
825
+ "date": date_value,
826
+ "value": float(numeric_value),
827
+ "series": metric_name,
828
+ }
829
+ )
830
+ rows.extend(_overlay_rows(frame, metric_name, plot_context))
831
+ return pd.DataFrame(rows, columns=["date", "value", "series"])
832
+
833
+
834
+ def _overlay_rows(frame: pd.DataFrame, metric_name: str, plot_context: dict | None) -> list[dict]:
835
+ if not plot_context:
836
+ return []
837
+ selected_metrics = [item for item in (plot_context.get("metric_names") or []) if item]
838
+ method_name = plot_context.get("method_name")
839
+ threshold = float(plot_context.get("threshold", 0.0))
840
+ if metric_name not in selected_metrics or method_name not in MEDIAN_METHODS:
841
+ return []
842
+
843
+ values = [float(item) for item in pd.to_numeric(frame[metric_name], errors="coerce").dropna().tolist()]
844
+ if not values:
845
+ return []
846
+ dates = frame["date"].tolist()
847
+ metric_median = float(median(values))
848
+ rows = [
849
+ {"date": date_value, "value": metric_median, "series": f"{metric_name} median"}
850
+ for date_value in dates
851
+ ]
852
+ threshold_value = None
853
+ if method_name == "get_metric_std_dev_from_median":
854
+ threshold_value = _std_from_median(values)
855
+ elif method_name == "get_rows_with_abs_diff_from_median_gt":
856
+ threshold_value = threshold
857
+ elif method_name in THRESHOLD_METHODS:
858
+ threshold_value = _std_from_median(values) * threshold
859
+
860
+ if threshold_value is None:
861
+ return rows
862
+ upper = metric_median + threshold_value
863
+ lower = metric_median - threshold_value
864
+ suffix = (
865
+ f"{threshold:.2f}*std"
866
+ if method_name in THRESHOLD_METHODS and method_name != "get_rows_with_abs_diff_from_median_gt"
867
+ else "threshold"
868
+ )
869
+ rows.extend(
870
+ {"date": date_value, "value": upper, "series": f"{metric_name} + {suffix}"}
871
+ for date_value in dates
872
+ )
873
+ rows.extend(
874
+ {"date": date_value, "value": lower, "series": f"{metric_name} - {suffix}"}
875
+ for date_value in dates
876
+ )
877
+ return rows
878
+
879
+
880
+ def _std_from_median(values: list[float]) -> float:
881
+ if not values:
882
+ return 0.0
883
+ metric_median = median(values)
884
+ return math.sqrt(sum((value - metric_median) ** 2 for value in values) / len(values))
tasks.py CHANGED
@@ -77,16 +77,16 @@ TASKS: dict[str, TaskSpec] = {
77
  task_id="easy_single_spike",
78
  difficulty="easy",
79
  instruction=(
80
- "Investigate the seeded funnel dataset and submit the single anomalous row. "
81
  "Use the shared analysis methods before submitting."
82
  ),
83
  objective=(
84
- "Find the one obvious anomaly and submit exactly one correctly populated anomaly row."
85
  ),
86
  seed=11,
87
- scenario_family="absolute_spike_in_event_count",
88
  anomaly_density="low",
89
- anomaly_count=1,
90
  ),
91
  "medium_mixed_pair": TaskSpec(
92
  task_id="medium_mixed_pair",
@@ -116,7 +116,7 @@ TASKS: dict[str, TaskSpec] = {
116
  seed=37,
117
  scenario_family="mixed",
118
  anomaly_density="high",
119
- anomaly_count=5,
120
  ),
121
  }
122
 
 
77
  task_id="easy_single_spike",
78
  difficulty="easy",
79
  instruction=(
80
+ "Investigate the seeded funnel dataset and submit every anomalous row. "
81
  "Use the shared analysis methods before submitting."
82
  ),
83
  objective=(
84
+ "Find all anomalies and submit every correctly populated anomaly row."
85
  ),
86
  seed=11,
87
+ scenario_family="rate_spike_from_median",
88
  anomaly_density="low",
89
+ anomaly_count=2,
90
  ),
91
  "medium_mixed_pair": TaskSpec(
92
  task_id="medium_mixed_pair",
 
116
  seed=37,
117
  scenario_family="mixed",
118
  anomaly_density="high",
119
+ anomaly_count=4,
120
  ),
121
  }
122