Aksel Joonas Reedi committed
Commit 3eec386 · unverified · 1 Parent(s): 0545e40

Fix CLI rendering corruption and split CLI/frontend model defaults (#121)


* Stabilize CLI rendering and make surface defaults explicit

The interactive CLI was interleaving live sub-agent redraws with streamed
markdown output, which corrupted ANSI rendering and leaked raw control
sequences into the terminal. The CLI and web app also shared one default
model config even though they need different Anthropic routing defaults.

Constraint: CLI default must use direct Anthropic credentials while web sessions must default to Bedrock Anthropic
Constraint: Interactive terminal output must remain readable while sub-agent progress is live
Rejected: Single shared config file with runtime overrides | keeps ownership of defaults implicit across surfaces
Rejected: Keep background redraw ticker | concurrent terminal writers still corrupt streamed output
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Keep CLI and frontend default models in separate config files unless both surfaces intentionally converge again
Tested: python -m compileall agent backend
Tested: ./frontend/node_modules/.bin/tsc -p frontend/tsconfig.json --noEmit
Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest python -m pytest -q tests/unit/test_cli_rendering.py
Not-tested: Full pytest suite (blocked by pre-existing tests/unit/test_llm_error_classification.py import error during collection)
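
A minimal sketch of the single-writer redraw discipline this change moves to (hypothetical class, not the shipped SubAgentDisplayManager): the status block repaints only inside state-mutating calls, so streamed markdown never races a background ticker for the terminal.

```python
import sys


class StatusBlock:
    """Repaints a small status region; the caller is the only writer."""

    def __init__(self) -> None:
        self._lines_on_screen = 0

    def update(self, lines: list[str]) -> None:
        # Erase the previous frame, then repaint — all inside one call,
        # so no timer task ever writes concurrently with streamed output.
        if self._lines_on_screen:
            sys.stdout.write(f"\033[{self._lines_on_screen}A\033[J")
        for line in lines:
            sys.stdout.write(line + "\n")
        sys.stdout.flush()
        self._lines_on_screen = len(lines)


block = StatusBlock()
block.update(["research: starting"])
block.update(["research: 1.2k tokens"])  # redraw driven by the event itself
```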

* Restore regression coverage and make the full test suite green

The earlier PR fixed the CLI rendering and model-default split, but the
full local suite exposed additional regressions in tool-result patching,
doom-loop polling detection, sandbox reuse messaging, and async test
support. This follow-up commit restores the missing helpers and updates
those production paths so the new regression tests pass for real.

Constraint: Provider message histories must keep tool_use/tool_result pairing valid across interrupted turns
Constraint: Legitimate polling with changing results must not trip doom-loop recovery
Rejected: Only fix the original collection blocker | leaves the full suite red and the PR note stale
Rejected: Silence the failing tests without restoring runtime helpers | would hide real production regressions
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Keep the local regression tests in sync with the production recovery paths they exercise
Tested: python -m compileall agent/context_manager/manager.py agent/core/agent_loop.py agent/core/doom_loop.py agent/tools/sandbox_tool.py backend/user_quotas.py
Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q tests/unit/test_dangling_tool_calls.py tests/unit/test_doom_loop_polling.py tests/unit/test_sandbox_already_active_message.py tests/unit/test_user_quotas.py
Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q
Not-tested: Remote CI environment parity
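
A schematic illustration (plain dicts, not the project's Message type) of the pairing invariant the first constraint names: every assistant tool_use id must be answered by a tool result immediately after it, before any other message reaches the provider.

```python
# A cancelled run leaves no tool message after the assistant turn;
# the patcher inserts a stub so the next request stays valid.
history = [
    {"role": "assistant", "tool_calls": [{"id": "call_1", "name": "research"}]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": "Tool was not executed (interrupted or error)."},
    {"role": "user", "content": "follow-up question"},
]

issued = {
    tc["id"]
    for m in history if m["role"] == "assistant"
    for tc in m.get("tool_calls", [])
}
answered = {m.get("tool_call_id") for m in history if m["role"] == "tool"}
assert issued <= answered  # no dangling tool_use ids remain
```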

* Tighten rate-limit retries and drop the orphaned shared config

The review was right about two follow-up issues: the old shared config file
was still present after the CLI/frontend split, and the Bedrock rate-limit
retry schedule still had a dead third entry because the loop only ever
consumed two retry delays. This commit removes the orphaned config and makes
the rate-limit schedule line up with the actual retry budget.

Constraint: Retry budget for Bedrock token throttling must exceed the provider's ~60s bucket recovery window in the retries that actually run
Rejected: Keep a third delay entry in the schedule | the current retry loop never reaches it
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep retry schedules aligned with the retry loop's real number of sleeps, not the raw retry constant count
Tested: python -m compileall agent/core/agent_loop.py tests/unit/test_llm_error_classification.py
Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q tests/unit/test_llm_error_classification.py
Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q
Not-tested: Remote CI environment parity
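
A sketch of the loop shape the directive refers to (generic retry helper, not the exact agent_loop control flow): three attempts mean only two sleeps, so a two-entry rate-limit schedule is the most the loop can consume, and 30 + 60 = 90s comfortably exceeds the ~60s bucket window.

```python
import time

MAX_RETRIES = 3
RATE_LIMIT_DELAYS = [30, 60]  # len == MAX_RETRIES - 1; a third entry would be dead


def call_with_retries(call):
    for attempt in range(MAX_RETRIES):
        try:
            return call()
        except TimeoutError:  # stand-in for the rate-limit classification
            if attempt >= MAX_RETRIES - 1:
                raise  # final attempt: no sleep left to take
            time.sleep(RATE_LIMIT_DELAYS[attempt])  # only indices 0 and 1 run
```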

README.md CHANGED
@@ -212,7 +212,8 @@ def create_builtin_tools() -> list[ToolSpec]:
 
 ### Adding MCP Servers
 
-Edit `configs/main_agent_config.json`:
+Edit `configs/cli_agent_config.json` for CLI defaults, or
+`configs/frontend_agent_config.json` for web-session defaults:
 
 ```json
 {
agent/context_manager/manager.py CHANGED
@@ -253,45 +253,49 @@ class ContextManager:
     def _patch_dangling_tool_calls(self) -> None:
         """Add stub tool results for any tool_calls that lack a matching result.
 
-        Scans backwards to find the last assistant message with tool_calls,
-        which may not be items[-1] if some tool results were already added.
+        Ensures each assistant message's tool_calls are followed immediately
+        by matching tool-result messages. This has to work across the whole
+        history, not just the most recent turn, because a cancelled tool use
+        in an earlier turn can still poison the next provider request.
         """
         if not self.items:
             return
 
-        # Find the last assistant message with tool_calls
-        assistant_msg = None
-        for i in range(len(self.items) - 1, -1, -1):
-            msg = self.items[i]
-            if getattr(msg, "role", None) == "assistant" and getattr(
-                msg, "tool_calls", None
-            ):
-                assistant_msg = msg
-                break
-            # Stop scanning once we hit a user message — anything before
-            # that belongs to a previous (complete) turn.
-            if getattr(msg, "role", None) == "user":
-                break
-
-        if not assistant_msg:
-            return
-
-        self._normalize_tool_calls(assistant_msg)
-        answered_ids = {
-            getattr(m, "tool_call_id", None)
-            for m in self.items
-            if getattr(m, "role", None) == "tool"
-        }
-        for tc in assistant_msg.tool_calls:
-            if tc.id not in answered_ids:
-                self.items.append(
-                    Message(
-                        role="tool",
-                        content="Tool was not executed (interrupted or error).",
-                        tool_call_id=tc.id,
-                        name=tc.function.name,
-                    )
-                )
+        i = 0
+        while i < len(self.items):
+            msg = self.items[i]
+            if getattr(msg, "role", None) != "assistant" or not getattr(msg, "tool_calls", None):
+                i += 1
+                continue
+
+            self._normalize_tool_calls(msg)
+
+            # Consume the contiguous tool-result block that immediately follows
+            # this assistant message. Any missing tool ids must be inserted
+            # before the next non-tool message to satisfy provider ordering.
+            j = i + 1
+            immediate_ids: set[str | None] = set()
+            while j < len(self.items) and getattr(self.items[j], "role", None) == "tool":
+                immediate_ids.add(getattr(self.items[j], "tool_call_id", None))
+                j += 1
+
+            missing: list[Message] = []
+            for tc in msg.tool_calls:
+                if tc.id not in immediate_ids:
+                    missing.append(
+                        Message(
+                            role="tool",
+                            content="Tool was not executed (interrupted or error).",
+                            tool_call_id=tc.id,
+                            name=tc.function.name,
+                        )
+                    )
+
+            if missing:
+                self.items[j:j] = missing
+                j += len(missing)
+
+            i = j
 
     def undo_last_turn(self) -> bool:
         """Remove the last complete turn (user msg + all assistant/tool msgs that follow).
agent/core/agent_loop.py CHANGED
@@ -25,6 +25,61 @@ logger = logging.getLogger(__name__)
 
 ToolCall = ChatCompletionMessageToolCall
 
+_MALFORMED_TOOL_PREFIX = "ERROR: Tool call to '"
+_MALFORMED_TOOL_SUFFIX = "' had malformed JSON arguments"
+
+
+def _malformed_tool_name(message: Message) -> str | None:
+    """Return the tool name for malformed-json tool-result messages."""
+    if getattr(message, "role", None) != "tool":
+        return None
+    content = getattr(message, "content", None)
+    if not isinstance(content, str):
+        return None
+    if not content.startswith(_MALFORMED_TOOL_PREFIX):
+        return None
+    end = content.find(_MALFORMED_TOOL_SUFFIX, len(_MALFORMED_TOOL_PREFIX))
+    if end == -1:
+        return None
+    return content[len(_MALFORMED_TOOL_PREFIX):end]
+
+
+def _detect_repeated_malformed(
+    items: list[Message], threshold: int = 2,
+) -> str | None:
+    """Return the repeated malformed tool name if the tail contains a streak.
+
+    Walk backward over the current conversation tail. A streak counts only
+    consecutive malformed tool-result messages for the same tool; any other
+    tool result breaks it.
+    """
+    if threshold <= 0:
+        return None
+
+    streak_tool: str | None = None
+    streak = 0
+
+    for item in reversed(items):
+        if getattr(item, "role", None) != "tool":
+            continue
+
+        malformed_tool = _malformed_tool_name(item)
+        if malformed_tool is None:
+            break
+
+        if streak_tool is None:
+            streak_tool = malformed_tool
+            streak = 1
+        elif malformed_tool == streak_tool:
+            streak += 1
+        else:
+            break
+
+    if streak >= threshold:
+        return streak_tool
+
+    return None
+
 
 def _validate_tool_args(tool_args: dict) -> tuple[bool, str | None]:
     """
@@ -121,6 +176,54 @@ def _needs_approval(
 # -- LLM retry constants --------------------------------------------------
 _MAX_LLM_RETRIES = 3
 _LLM_RETRY_DELAYS = [5, 15, 30]  # seconds between retries
+_LLM_RATE_LIMIT_RETRY_DELAYS = [30, 60]  # exceed Bedrock's ~60s TPM bucket window
+
+
+def _is_rate_limit_error(error: Exception) -> bool:
+    """Return True for rate-limit / quota-bucket style provider errors."""
+    err_str = str(error).lower()
+    rate_limit_patterns = [
+        "429",
+        "rate limit",
+        "rate_limit",
+        "too many requests",
+        "too many tokens",
+        "request limit",
+        "throttl",
+    ]
+    return any(pattern in err_str for pattern in rate_limit_patterns)
+
+
+def _is_context_overflow_error(error: Exception) -> bool:
+    """Return True when the prompt exceeded the model's context window."""
+    if isinstance(error, ContextWindowExceededError):
+        return True
+
+    err_str = str(error).lower()
+    overflow_patterns = [
+        "context window exceeded",
+        "maximum context length",
+        "max context length",
+        "prompt is too long",
+        "context length exceeded",
+        "too many input tokens",
+        "input is too long",
+    ]
+    return any(pattern in err_str for pattern in overflow_patterns)
+
+
+def _retry_delay_for(error: Exception, attempt_index: int) -> int | None:
+    """Return the delay for this retry attempt, or None if it should not retry."""
+    if _is_rate_limit_error(error):
+        schedule = _LLM_RATE_LIMIT_RETRY_DELAYS
+    elif _is_transient_error(error):
+        schedule = _LLM_RETRY_DELAYS
+    else:
+        return None
+
+    if attempt_index >= len(schedule):
+        return None
+    return schedule[attempt_index]
 
 
 def _is_transient_error(error: Exception) -> bool:
@@ -128,7 +231,6 @@ def _is_transient_error(error: Exception) -> bool:
     err_str = str(error).lower()
     transient_patterns = [
         "timeout", "timed out",
-        "429", "rate limit", "rate_limit",
         "503", "service unavailable",
         "502", "bad gateway",
         "500", "internal server error",
@@ -136,7 +238,7 @@ def _is_transient_error(error: Exception) -> bool:
         "connection reset", "connection refused", "connection error",
         "eof", "broken pipe",
     ]
-    return any(pattern in err_str for pattern in transient_patterns)
+    return _is_rate_limit_error(error) or any(pattern in err_str for pattern in transient_patterns)
 
 
 def _is_effort_config_error(error: Exception) -> bool:
@@ -317,6 +419,8 @@ async def _call_llm_streaming(session: Session, messages, tools, llm_params) ->
         except ContextWindowExceededError:
             raise
         except Exception as e:
+            if _is_context_overflow_error(e):
+                raise ContextWindowExceededError(str(e)) from e
             if not _healed_effort and _is_effort_config_error(e):
                 _healed_effort = True
                 llm_params = await _heal_effort_and_rebuild_params(session, e, llm_params)
@@ -325,8 +429,8 @@ async def _call_llm_streaming(session: Session, messages, tools, llm_params) ->
                     data={"tool": "system", "log": "Reasoning effort not supported for this model — adjusting and retrying."},
                 ))
                 continue
-            if _llm_attempt < _MAX_LLM_RETRIES - 1 and _is_transient_error(e):
-                _delay = _LLM_RETRY_DELAYS[_llm_attempt]
+            _delay = _retry_delay_for(e, _llm_attempt)
+            if _llm_attempt < _MAX_LLM_RETRIES - 1 and _delay is not None:
                 logger.warning(
                     "Transient LLM error (attempt %d/%d): %s — retrying in %ds",
                     _llm_attempt + 1, _MAX_LLM_RETRIES, e, _delay,
@@ -424,6 +528,8 @@ async def _call_llm_non_streaming(session: Session, messages, tools, llm_params)
         except ContextWindowExceededError:
             raise
         except Exception as e:
+            if _is_context_overflow_error(e):
+                raise ContextWindowExceededError(str(e)) from e
             if not _healed_effort and _is_effort_config_error(e):
                 _healed_effort = True
                 llm_params = await _heal_effort_and_rebuild_params(session, e, llm_params)
@@ -432,8 +538,8 @@ async def _call_llm_non_streaming(session: Session, messages, tools, llm_params)
                     data={"tool": "system", "log": "Reasoning effort not supported for this model — adjusting and retrying."},
                 ))
                 continue
-            if _llm_attempt < _MAX_LLM_RETRIES - 1 and _is_transient_error(e):
-                _delay = _LLM_RETRY_DELAYS[_llm_attempt]
+            _delay = _retry_delay_for(e, _llm_attempt)
+            if _llm_attempt < _MAX_LLM_RETRIES - 1 and _delay is not None:
                 logger.warning(
                     "Transient LLM error (attempt %d/%d): %s — retrying in %ds",
                     _llm_attempt + 1, _MAX_LLM_RETRIES, e, _delay,
@@ -585,6 +691,31 @@ class Handlers:
             )
         )
 
+        malformed_tool = _detect_repeated_malformed(session.context_manager.items)
+        if malformed_tool:
+            recovery_prompt = (
+                "[SYSTEM: Repeated malformed tool arguments detected for "
+                f"'{malformed_tool}'. Stop retrying the same tool call shape. "
+                "Use a different strategy that produces smaller, valid JSON. "
+                "For large file writes, prefer bash with a heredoc or split the "
+                "edit into multiple smaller tool calls.]"
+            )
+            session.context_manager.add_message(
+                Message(role="user", content=recovery_prompt)
+            )
+            await session.send_event(
+                Event(
+                    event_type="tool_log",
+                    data={
+                        "tool": "system",
+                        "log": (
+                            "Repeated malformed tool arguments detected — "
+                            f"forcing a different strategy for {malformed_tool}"
+                        ),
+                    },
+                )
+            )
+
         messages = session.context_manager.get_messages()
         tools = session.tool_router.get_tool_specs_for_llm()
         try:
agent/core/doom_loop.py CHANGED
@@ -17,10 +17,11 @@ logger = logging.getLogger(__name__)
 
 @dataclass(frozen=True)
 class ToolCallSignature:
-    """Hashable signature for a single tool call (name + args hash)."""
+    """Hashable signature for a single tool call plus its observed result."""
 
     name: str
     args_hash: str
+    result_hash: str | None = None
 
 
 def _hash_args(args_str: str) -> str:
@@ -31,11 +32,16 @@ def _hash_args(args_str: str) -> str:
 def extract_recent_tool_signatures(
     messages: list[Message], lookback: int = 30
 ) -> list[ToolCallSignature]:
-    """Extract tool call signatures from recent assistant messages."""
+    """Extract tool call signatures from recent assistant messages.
+
+    Includes the immediate tool result hash when present. This prevents
+    legitimate polling from being classified as a doom loop when the poll
+    arguments stay constant but the observed result keeps changing.
+    """
     signatures: list[ToolCallSignature] = []
     recent = messages[-lookback:] if len(messages) > lookback else messages
 
-    for msg in recent:
+    for idx, msg in enumerate(recent):
         if getattr(msg, "role", None) != "assistant":
             continue
         tool_calls = getattr(msg, "tool_calls", None)
@@ -47,7 +53,21 @@ def extract_recent_tool_signatures(
             continue
         name = getattr(fn, "name", "") or ""
         args_str = getattr(fn, "arguments", "") or ""
-        signatures.append(ToolCallSignature(name=name, args_hash=_hash_args(args_str)))
+        result_hash = None
+        for follow in recent[idx + 1:]:
+            role = getattr(follow, "role", None)
+            if role == "tool" and getattr(follow, "tool_call_id", None) == getattr(tc, "id", None):
+                result_hash = _hash_args(str(getattr(follow, "content", "") or ""))
+                break
+            if role in {"assistant", "user"}:
+                break
+        signatures.append(
+            ToolCallSignature(
+                name=name,
+                args_hash=_hash_args(args_str),
+                result_hash=result_hash,
+            )
+        )
 
     return signatures
agent/main.py CHANGED
@@ -50,6 +50,16 @@ litellm.drop_params = True
 # on every error — users don't need it, and our friendly errors cover the case.
 litellm.suppress_debug_info = True
 
+CLI_CONFIG_PATH = Path(__file__).parent.parent / "configs" / "cli_agent_config.json"
+
+
+def _configure_runtime_logging() -> None:
+    """Keep third-party warning spam from punching through the interactive UI."""
+    import logging
+
+    logging.getLogger("LiteLLM").setLevel(logging.ERROR)
+    logging.getLogger("litellm").setLevel(logging.ERROR)
+
 def _safe_get_args(arguments: dict) -> dict:
     """Safely extract args dict from arguments, handling cases where LLM passes string."""
     args = arguments.get("args", {})
@@ -846,8 +856,7 @@ async def main():
     ready_event = asyncio.Event()
 
     # Start agent loop in background
-    config_path = Path(__file__).parent.parent / "configs" / "main_agent_config.json"
-    config = load_config(config_path)
+    config = load_config(CLI_CONFIG_PATH)
 
     # Create tool router with local mode
     tool_router = ToolRouter(config.mcpServers, hf_token=hf_token, local_mode=True)
@@ -1037,6 +1046,7 @@ async def headless_main(
     import logging
 
     logging.basicConfig(level=logging.WARNING)
+    _configure_runtime_logging()
 
     hf_token = _get_hf_token()
     if not hf_token:
@@ -1045,8 +1055,7 @@ async def headless_main(
 
     print(f"HF token loaded", file=sys.stderr)
 
-    config_path = Path(__file__).parent.parent / "configs" / "main_agent_config.json"
-    config = load_config(config_path)
+    config = load_config(CLI_CONFIG_PATH)
     config.yolo_mode = True  # Auto-approve everything in headless mode
 
     if model:
@@ -1222,6 +1231,7 @@ def cli():
     import warnings
     # Suppress aiohttp "Unclosed client session" noise during event loop teardown
     _logging.getLogger("asyncio").setLevel(_logging.CRITICAL)
+    _configure_runtime_logging()
     # Suppress litellm pydantic deprecation warnings
     warnings.filterwarnings("ignore", category=DeprecationWarning, module="litellm")
     # Suppress whoosh invalid escape sequence warnings (third-party, unfixed upstream)
agent/tools/research_tool.py CHANGED
@@ -216,7 +216,9 @@ RESEARCH_TOOL_SPEC = {
 
 def _get_research_model(main_model: str) -> str:
     """Pick a cheaper model for research based on the main model."""
-    if "anthropic" in main_model:
+    if main_model.startswith("anthropic/"):
+        return "anthropic/claude-sonnet-4-6"
+    if main_model.startswith("bedrock/") and "anthropic" in main_model:
         return "bedrock/us.anthropic.claude-sonnet-4-6"
     # For non-Anthropic models (HF router etc.), use the same model
     return main_model
agent/tools/sandbox_tool.py CHANGED
@@ -213,16 +213,26 @@ async def sandbox_create_handler(
     args: dict[str, Any], session: Any = None
 ) -> tuple[str, bool]:
     """Handle sandbox_create tool calls."""
+    hardware = args.get("hardware", "cpu-basic")
+
     # If sandbox already exists, return its info
     if session and getattr(session, "sandbox", None):
         sb = session.sandbox
+        requested_hardware = args.get("hardware")
+        lockout_note = ""
+        if requested_hardware:
+            lockout_note = (
+                f"\nRequested hardware: {requested_hardware}\n"
+                "Hardware cannot be changed by calling sandbox_create again. "
+                "Delete the existing sandbox first if you need a different tier."
+            )
         return (
             f"Sandbox already active: {sb.space_id}\n"
            f"URL: {sb.url}\n"
+            f"{lockout_note}\n"
             f"Use bash/read/write/edit to interact with it."
         ), True
 
-    hardware = args.get("hardware", "cpu-basic")
     create_kwargs = {}
     if "private" in args:
         create_kwargs["private"] = args["private"]
agent/utils/terminal_display.py CHANGED
@@ -180,10 +180,8 @@ class SubAgentDisplayManager:
     def __init__(self):
         self._agents: dict[str, dict] = {}  # agent_id -> state dict
         self._lines_on_screen = 0
-        self._ticker_task = None
 
     def start(self, agent_id: str, label: str = "research") -> None:
-        import asyncio
         import time
         self._agents[agent_id] = {
             "label": label,
@@ -192,8 +190,6 @@ class SubAgentDisplayManager:
             "token_count": 0,
             "start_time": time.monotonic(),
         }
-        if not self._ticker_task:
-            self._ticker_task = asyncio.ensure_future(self._tick())
         self._redraw()
 
     def set_tokens(self, agent_id: str, tokens: int) -> None:
@@ -222,11 +218,7 @@ class SubAgentDisplayManager:
             _console.file.write(line + "\n")
             _console.file.flush()
         self._lines_on_screen = 0
-        if not self._agents:
-            if self._ticker_task:
-                self._ticker_task.cancel()
-                self._ticker_task = None
-        else:
+        if self._agents:
             self._redraw()
 
     @staticmethod
@@ -239,16 +231,6 @@ class SubAgentDisplayManager:
             line += f" \033[2m({stats})\033[0m"
         return line
 
-    async def _tick(self) -> None:
-        import asyncio
-        try:
-            while True:
-                await asyncio.sleep(1.0)
-                if self._agents:
-                    self._redraw()
-        except asyncio.CancelledError:
-            pass
-
     @staticmethod
     def _format_stats(agent: dict) -> str:
         import time
backend/session_manager.py CHANGED
@@ -15,7 +15,7 @@ from agent.core.tools import ToolRouter
 
 # Get project root (parent of backend directory)
 PROJECT_ROOT = Path(__file__).parent.parent
-DEFAULT_CONFIG_PATH = str(PROJECT_ROOT / "configs" / "main_agent_config.json")
+DEFAULT_CONFIG_PATH = str(PROJECT_ROOT / "configs" / "frontend_agent_config.json")
 
 
 # These dataclasses match agent/main.py structure
configs/{main_agent_config.json → cli_agent_config.json} RENAMED
File without changes
configs/frontend_agent_config.json ADDED
@@ -0,0 +1,14 @@
+{
+  "model_name": "bedrock/us.anthropic.claude-opus-4-6-v1",
+  "save_sessions": true,
+  "session_dataset_repo": "smolagents/ml-intern-sessions",
+  "yolo_mode": false,
+  "confirm_cpu_jobs": true,
+  "auto_file_upload": true,
+  "mcpServers": {
+    "hf-mcp-server": {
+      "transport": "http",
+      "url": "https://huggingface.co/mcp?login"
+    }
+  }
+}
frontend/src/components/Chat/ChatInput.tsx CHANGED
@@ -7,7 +7,7 @@ import { apiFetch } from '@/utils/api';
 import { useUserQuota } from '@/hooks/useUserQuota';
 import ClaudeCapDialog from '@/components/ClaudeCapDialog';
 import { useAgentStore } from '@/store/agentStore';
-import { FIRST_FREE_MODEL_PATH } from '@/utils/model';
+import { CLAUDE_MODEL_PATH, FIRST_FREE_MODEL_PATH, isClaudePath } from '@/utils/model';
 
 // Model configuration
 interface ModelOption {
@@ -37,7 +37,7 @@ const MODEL_OPTIONS: ModelOption[] = [
     id: 'claude-opus',
     name: 'Claude Opus 4.6',
     description: 'Anthropic',
-    modelPath: 'anthropic/claude-opus-4-6',
+    modelPath: CLAUDE_MODEL_PATH,
     avatarUrl: 'https://huggingface.co/api/avatars/Anthropic',
     recommended: true,
   },
@@ -70,7 +70,7 @@ interface ChatInputProps {
   placeholder?: string;
 }
 
-const isClaudeModel = (m: ModelOption) => m.modelPath.startsWith('anthropic/');
+const isClaudeModel = (m: ModelOption) => isClaudePath(m.modelPath);
 const firstFreeModel = () => MODEL_OPTIONS.find(m => !isClaudeModel(m)) ?? MODEL_OPTIONS[0];
 
 export default function ChatInput({ sessionId, onSend, onStop, isProcessing = false, disabled = false, placeholder = 'Ask anything...' }: ChatInputProps) {
frontend/src/utils/model.ts CHANGED
@@ -3,13 +3,12 @@
  * ClaudeCapDialog "Use a free model" escape hatch.
  *
  * Keep in sync with MODEL_OPTIONS in components/Chat/ChatInput.tsx and
- * AVAILABLE_MODELS in backend/routes/agent.py. Bare HF ids (no
- * `huggingface/` prefix) — matches upstream's auto-router.
+ * AVAILABLE_MODELS in backend/routes/agent.py.
  */
 
-export const CLAUDE_MODEL_PATH = 'anthropic/claude-opus-4-6';
+export const CLAUDE_MODEL_PATH = 'bedrock/us.anthropic.claude-opus-4-6-v1';
 export const FIRST_FREE_MODEL_PATH = 'moonshotai/Kimi-K2.6';
 
 export function isClaudePath(modelPath: string | undefined): boolean {
-  return !!modelPath && modelPath.startsWith('anthropic/');
+  return !!modelPath && modelPath.includes('anthropic');
 }
pyproject.toml CHANGED
@@ -42,6 +42,7 @@ eval = [
 # Development and testing dependencies
 dev = [
     "pytest>=9.0.2",
+    "pytest-asyncio>=0.26.0",
 ]
 
 # All dependencies (eval + dev)
@@ -61,3 +62,6 @@ include = ["agent*"]
 
 [tool.uv]
 package = true
+
+[tool.pytest.ini_options]
+asyncio_mode = "auto"
tests/unit/test_cli_rendering.py ADDED
@@ -0,0 +1,44 @@
+"""Regression tests for interactive CLI rendering and research model routing."""
+
+from io import StringIO
+from types import SimpleNamespace
+
+from agent.tools.research_tool import _get_research_model
+from agent.utils import terminal_display
+
+
+def test_direct_anthropic_research_model_stays_off_bedrock():
+    assert _get_research_model("anthropic/claude-opus-4-6") == "anthropic/claude-sonnet-4-6"
+
+
+def test_bedrock_anthropic_research_model_stays_on_bedrock():
+    assert (
+        _get_research_model("bedrock/us.anthropic.claude-opus-4-6-v1")
+        == "bedrock/us.anthropic.claude-sonnet-4-6"
+    )
+
+
+def test_non_anthropic_research_model_is_unchanged():
+    assert _get_research_model("openai/gpt-5.4") == "openai/gpt-5.4"
+
+
+def test_subagent_display_does_not_spawn_background_redraw(monkeypatch):
+    calls: list[object] = []
+
+    def _unexpected_future(*args, **kwargs):
+        calls.append((args, kwargs))
+        raise AssertionError("background redraw task should not be created")
+
+    monkeypatch.setattr("asyncio.ensure_future", _unexpected_future)
+    monkeypatch.setattr(
+        terminal_display,
+        "_console",
+        SimpleNamespace(file=StringIO(), width=100),
+    )
+
+    mgr = terminal_display.SubAgentDisplayManager()
+    mgr.start("agent-1", "research")
+    mgr.add_call("agent-1", "▸ hf_papers {\"operation\": \"search\"}")
+    mgr.clear("agent-1")
+
+    assert calls == []
tests/unit/test_dangling_tool_calls.py ADDED
@@ -0,0 +1,121 @@
+"""Regression tests for `_patch_dangling_tool_calls`.
+
+Reproduces the failure mode behind observatory sessions 8dd2ce30 and
+59c9e678 (2026-04-25): a tool call cancelled mid-execution leaves an
+orphan ``tool_use`` in history; the user types a follow-up; Bedrock
+rejects the next request with HTTP 400 ``messages.N: tool_use ids were
+found without tool_result blocks immediately after``.
+"""
+
+from litellm import ChatCompletionMessageToolCall, Message
+
+from agent.context_manager.manager import ContextManager
+
+
+def _tool_call(call_id: str, name: str = "research") -> ChatCompletionMessageToolCall:
+    return ChatCompletionMessageToolCall(
+        id=call_id,
+        type="function",
+        function={"name": name, "arguments": "{}"},
+    )
+
+
+def _make_cm() -> ContextManager:
+    cm = ContextManager.__new__(ContextManager)
+    cm.system_prompt = "system"
+    cm.model_max_tokens = 100_000
+    cm.compact_size = 1_000
+    cm.running_context_usage = 0
+    cm.untouched_messages = 5
+    cm.items = [Message(role="system", content="system")]
+    return cm
+
+
+def test_orphan_tool_use_followed_by_user_message_is_patched():
+    cm = _make_cm()
+    cm.items.extend([
+        Message(role="user", content="Research X"),
+        Message(
+            role="assistant",
+            content=None,
+            tool_calls=[_tool_call("call_abc", "research")],
+        ),
+        Message(role="user", content="??"),
+    ])
+    msgs = cm.get_messages()
+    tool_msgs = [m for m in msgs if getattr(m, "role", None) == "tool"]
+    assert len(tool_msgs) == 1
+    assert tool_msgs[0].tool_call_id == "call_abc"
+    assert "interrupted" in (tool_msgs[0].content or "").lower() or "not executed" in (tool_msgs[0].content or "").lower()
+
+
+def test_no_orphan_means_no_stub():
+    cm = _make_cm()
+    cm.items.extend([
+        Message(role="user", content="Research X"),
+        Message(
+            role="assistant",
+            content=None,
+            tool_calls=[_tool_call("call_abc", "research")],
+        ),
+        Message(role="tool", content="ok", tool_call_id="call_abc", name="research"),
+    ])
+    cm.get_messages()
+    tool_msgs = [m for m in cm.items if getattr(m, "role", None) == "tool"]
+    assert len(tool_msgs) == 1
+    assert tool_msgs[0].content == "ok"
+
+
+def test_multiple_dangling_tool_calls_in_one_assistant_message_are_all_patched():
+    cm = _make_cm()
+    cm.items.extend([
+        Message(role="user", content="do two things"),
+        Message(
+            role="assistant",
+            content=None,
+            tool_calls=[
+                _tool_call("call_1", "research"),
+                _tool_call("call_2", "bash"),
+            ],
+        ),
+        Message(role="user", content="follow up"),
+    ])
+    cm.get_messages()
+    tool_ids = {
+        getattr(m, "tool_call_id", None)
+        for m in cm.items
+        if getattr(m, "role", None) == "tool"
+    }
+    assert tool_ids == {"call_1", "call_2"}
+
+
+def test_orphan_in_earlier_turn_still_gets_patched():
+    """Two-turn history where the FIRST turn was interrupted.
+
+    Old patcher stopped at the first user msg encountered while scanning
+    backwards, so this case never got fixed and Bedrock rejected.
+    """
+    cm = _make_cm()
+    cm.items.extend([
+        Message(role="user", content="turn 1"),
+        Message(
+            role="assistant",
+            content=None,
+            tool_calls=[_tool_call("call_old", "research")],
+        ),
+        Message(role="user", content="turn 2 — please retry"),
+        Message(
+            role="assistant",
+            content=None,
+            tool_calls=[_tool_call("call_new", "bash")],
+        ),
+        Message(role="tool", content="ok", tool_call_id="call_new", name="bash"),
+    ])
+    cm.get_messages()
+    tool_ids = {
+        getattr(m, "tool_call_id", None)
+        for m in cm.items
+        if getattr(m, "role", None) == "tool"
+    }
+    assert "call_old" in tool_ids
+    assert "call_new" in tool_ids
tests/unit/test_doom_loop_polling.py ADDED
@@ -0,0 +1,96 @@
+"""Regression test for doom-loop false-positive on legitimate polling.
+
+Reproduces the failure mode in observatory sessions 40fcb414 ($32.59),
+8e90352e ($62.63), and 403178bf ($5.71) on 2026-04-25: the agent polled a
+long-running job with `bash sleep 300 && wc -l output` four times in a
+row. The arguments were byte-identical, but the results moved (27210 →
+36454 → 45770 → 55138 — actual progress). The detector hashed args only
+and false-fired DOOM LOOP, which made the agent abandon perfectly valid
+polling.
+
+After the fix the signature includes the tool result hash, so identical
+args + different results no longer trips the detector.
+"""
+
+from litellm import ChatCompletionMessageToolCall, Message
+
+from agent.core.doom_loop import check_for_doom_loop
+
+
+def _assistant(call_id: str, name: str, args: str) -> Message:
+    return Message(
+        role="assistant",
+        content=None,
+        tool_calls=[
+            ChatCompletionMessageToolCall(
+                id=call_id,
+                type="function",
+                function={"name": name, "arguments": args},
+            )
+        ],
+    )
+
+
+def _tool(call_id: str, name: str, content: str) -> Message:
+    return Message(role="tool", content=content, tool_call_id=call_id, name=name)
+
+
+_POLL_ARGS = '{"command": "sleep 300 && ls /app/images/ | wc -l"}'
+
+
+def test_polling_with_progressing_results_does_not_fire():
+    msgs = [
+        Message(role="user", content="run the job"),
+        _assistant("c1", "bash", _POLL_ARGS),
+        _tool("c1", "bash", "27210"),
+        _assistant("c2", "bash", _POLL_ARGS),
+        _tool("c2", "bash", "36454"),
+        _assistant("c3", "bash", _POLL_ARGS),
+        _tool("c3", "bash", "45770"),
+        _assistant("c4", "bash", _POLL_ARGS),
+        _tool("c4", "bash", "55138"),
+    ]
+    assert check_for_doom_loop(msgs) is None
+
+
+def test_truly_stuck_polling_with_identical_results_still_fires():
+    """If the same poll returns the same number, the job is genuinely
+    stuck and the detector SHOULD fire."""
+    msgs = [
+        _assistant("c1", "bash", _POLL_ARGS),
+        _tool("c1", "bash", "55138"),
+        _assistant("c2", "bash", _POLL_ARGS),
+        _tool("c2", "bash", "55138"),
+        _assistant("c3", "bash", _POLL_ARGS),
+        _tool("c3", "bash", "55138"),
+    ]
+    prompt = check_for_doom_loop(msgs)
+    assert prompt is not None
+    assert "DOOM LOOP" in prompt
+    assert "bash" in prompt
+
+
+def test_identical_calls_with_no_results_yet_still_fires():
+    """If three identical calls have no tool results (e.g. all cancelled
+    or errored before a result was recorded), treat as a real loop."""
+    msgs = [
+        _assistant("c1", "write", '{"path": "/tmp/x", "content": "..."}'),
+        _assistant("c2", "write", '{"path": "/tmp/x", "content": "..."}'),
+        _assistant("c3", "write", '{"path": "/tmp/x", "content": "..."}'),
+    ]
+    prompt = check_for_doom_loop(msgs)
+    assert prompt is not None
+    assert "DOOM LOOP" in prompt
+    assert "write" in prompt
+
+
+def test_different_args_does_not_fire():
+    msgs = [
+        _assistant("c1", "bash", '{"command": "ls /a"}'),
+        _tool("c1", "bash", "ok"),
+        _assistant("c2", "bash", '{"command": "ls /b"}'),
+        _tool("c2", "bash", "ok"),
+        _assistant("c3", "bash", '{"command": "ls /c"}'),
+        _tool("c3", "bash", "ok"),
+    ]
+    assert check_for_doom_loop(msgs) is None
tests/unit/test_llm_error_classification.py ADDED
@@ -0,0 +1,100 @@
+"""Tests for LLM error classification helpers in agent.core.agent_loop.
+
+Covers two regressions on 2026-04-25:
+
+1. Non-Anthropic context overflow (Kimi 365k > 262k) was not classified as
+   ``_is_context_overflow_error``, so the recovery path didn't fire and
+   session 62ccfdcb died with 68 wasted compaction events.
+
+2. Bedrock TPM rate limit (`Too many tokens, please wait before trying
+   again.`) needs the longer rate-limit retry schedule. The old schedule
+   ([5, 15, 30] = 50s) burned through 6 sessions costing >$2,400 combined
+   on the same day.
+"""
+
+from agent.core.agent_loop import (
+    _MAX_LLM_RETRIES,
+    _LLM_RATE_LIMIT_RETRY_DELAYS,
+    _LLM_RETRY_DELAYS,
+    _is_context_overflow_error,
+    _is_rate_limit_error,
+    _is_transient_error,
+    _retry_delay_for,
+)
+
+
+# ── context overflow ────────────────────────────────────────────────────
+
+
+def test_kimi_prompt_too_long_is_context_overflow():
+    # Verbatim error text from session 62ccfdcb (2026-04-25, Kimi K2.6).
+    err = Exception(
+        "litellm.BadRequestError: OpenAIException - The prompt is too long: "
+        "365407, model maximum context length: 262143"
+    )
+    assert _is_context_overflow_error(err)
+
+
+def test_openai_context_length_exceeded_is_context_overflow():
+    err = Exception("Error: This model's maximum context length is 8192 tokens.")
+    assert _is_context_overflow_error(err)
+
+
+def test_random_error_is_not_context_overflow():
+    err = Exception("connection reset by peer")
+    assert not _is_context_overflow_error(err)
+
+
+# ── rate limit ──────────────────────────────────────────────────────────
+
+
+def test_bedrock_too_many_tokens_is_rate_limit():
+    # Verbatim from sessions b37a3823, c4d7a831, b63c4933 (2026-04-25).
+    err = Exception(
+        'litellm.RateLimitError: BedrockException - {"message":"Too many '
+        'tokens, please wait before trying again."}'
+    )
+    assert _is_rate_limit_error(err)
+    # Rate-limit errors are also classified as transient.
+    assert _is_transient_error(err)
+
+
+def test_429_is_rate_limit():
+    err = Exception("HTTP 429 Too Many Requests")
+    assert _is_rate_limit_error(err)
+
+
+def test_timeout_is_transient_but_not_rate_limit():
+    err = Exception("Request timed out after 600s")
+    assert _is_transient_error(err)
+    assert not _is_rate_limit_error(err)
+
+
+# ── retry schedule selection ────────────────────────────────────────────
+
+
+def test_rate_limit_uses_longer_schedule():
+    err = Exception("Too many tokens, please wait before trying again.")
+    delays = [_retry_delay_for(err, i) for i in range(len(_LLM_RATE_LIMIT_RETRY_DELAYS))]
+    assert delays == _LLM_RATE_LIMIT_RETRY_DELAYS
+    # Just past the schedule → None (stop retrying).
+    assert _retry_delay_for(err, len(_LLM_RATE_LIMIT_RETRY_DELAYS)) is None
+
+
+def test_other_transient_uses_short_schedule():
+    err = Exception("503 service unavailable")
+    delays = [_retry_delay_for(err, i) for i in range(len(_LLM_RETRY_DELAYS))]
+    assert delays == _LLM_RETRY_DELAYS
+    assert _retry_delay_for(err, len(_LLM_RETRY_DELAYS)) is None
+
+
+def test_non_transient_returns_none():
+    err = Exception("invalid request: bad parameter")
+    assert _retry_delay_for(err, 0) is None
+
+
+def test_rate_limit_total_budget_covers_bedrock_bucket_recovery():
+    """The whole point of the rate-limit schedule: total wait time should
+    exceed the ~60s Bedrock TPM bucket recovery window."""
+    assert len(_LLM_RATE_LIMIT_RETRY_DELAYS) == _MAX_LLM_RETRIES - 1
+    assert sum(_LLM_RATE_LIMIT_RETRY_DELAYS) > 60
tests/unit/test_malformed_args_recovery.py ADDED
@@ -0,0 +1,66 @@
+"""Regression test for the malformed-JSON loop in observatory session
+7750e82f (2026-04-25): GLM-5.1 produced six consecutive ``write`` calls
+whose ``arguments`` strings JSON-parse-failed (truncated mid-stream by
+the provider). The soft retry hint didn't move the model. The detector
+in ``_detect_repeated_malformed`` looks for the streak so the agent loop
+can inject a hard system-prompt forcing a different strategy.
+"""
+
+from litellm import Message
+
+from agent.core.agent_loop import _detect_repeated_malformed
+
+
+def _malformed_tool_msg(name: str, call_id: str) -> Message:
+    return Message(
+        role="tool",
+        content=(
+            f"ERROR: Tool call to '{name}' had malformed JSON arguments and "
+            f"was NOT executed. Retry with smaller content — for 'write', "
+            f"split into multiple smaller writes using 'edit'."
+        ),
+        tool_call_id=call_id,
+        name=name,
+    )
+
+
+def test_two_consecutive_malformed_same_tool_triggers():
+    items = [
+        Message(role="user", content="write a big plan"),
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("write", "1"),
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("write", "2"),
+    ]
+    assert _detect_repeated_malformed(items, threshold=2) == "write"
+
+
+def test_one_malformed_does_not_trigger():
+    items = [
+        Message(role="user", content="write a plan"),
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("write", "1"),
+    ]
+    assert _detect_repeated_malformed(items, threshold=2) is None
+
+
+def test_two_malformed_different_tools_does_not_trigger():
+    items = [
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("write", "1"),
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("bash", "2"),
+    ]
+    assert _detect_repeated_malformed(items, threshold=2) is None
+
+
+def test_streak_broken_by_successful_tool_call_does_not_trigger():
+    items = [
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("write", "1"),
+        Message(role="assistant", content=None),
+        Message(role="tool", content="ok", tool_call_id="2", name="write"),
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("write", "3"),
+    ]
+    assert _detect_repeated_malformed(items, threshold=2) is None
tests/unit/test_sandbox_already_active_message.py ADDED
@@ -0,0 +1,47 @@
+"""Regression test for sandbox_create not surfacing the hardware lockout.
+
+In observatory session d6f8454c (2026-04-25) the agent called
+sandbox_create 18 times across 11 distinct hardware tiers (a10g-large,
+a100-large, t4-small, cpu-upgrade, cpu-basic, zero-a10g, l4x1, t4-medium,
+a10g-small, l40sx1, …). Every call returned 'Sandbox already active' for
+the same sandbox, but the message did not say that hardware can't be
+changed by re-calling, so the agent thought "still pending, retry with a
+different flavor" and burned 17 useless turns.
+
+The fix makes the response explicit when the requested hardware differs
+from what's already active.
+"""
+
+import asyncio
+from types import SimpleNamespace
+
+from agent.tools.sandbox_tool import sandbox_create_handler
+
+
+def _session_with_sandbox():
+    sb = SimpleNamespace(
+        space_id="user/sandbox-abc123",
+        url="https://huggingface.co/spaces/user/sandbox-abc123",
+    )
+    return SimpleNamespace(sandbox=sb)
+
+
+def test_already_active_with_different_hw_warns_about_lockout():
+    session = _session_with_sandbox()
+    out, ok = asyncio.run(
+        sandbox_create_handler({"hardware": "a100-large"}, session=session)
+    )
+    assert ok is True
+    # The message should mention the lockout AND the requested flavor.
+    assert "cannot be changed" in out.lower()
+    assert "a100-large" in out
+    assert "delete" in out.lower()
+
+
+def test_already_active_no_hw_request_just_returns_handle():
+    session = _session_with_sandbox()
+    out, ok = asyncio.run(sandbox_create_handler({}, session=session))
+    assert ok is True
+    assert "user/sandbox-abc123" in out
+    # No spurious lockout note when the agent didn't request a flavor.
+    assert "cannot be changed" not in out.lower()
uv.lock CHANGED
@@ -1799,10 +1799,12 @@ all = [
     { name = "inspect-ai" },
     { name = "pandas" },
     { name = "pytest" },
+    { name = "pytest-asyncio" },
     { name = "tenacity" },
 ]
 dev = [
     { name = "pytest" },
+    { name = "pytest-asyncio" },
 ]
 eval = [
     { name = "datasets" },
@@ -1830,6 +1832,7 @@ requires-dist = [
     { name = "prompt-toolkit", specifier = ">=3.0.0" },
     { name = "pydantic", specifier = ">=2.12.3" },
     { name = "pytest", marker = "extra == 'dev'", specifier = ">=9.0.2" },
+    { name = "pytest-asyncio", marker = "extra == 'dev'", specifier = ">=0.26.0" },
     { name = "python-dotenv", specifier = ">=1.2.1" },
     { name = "requests", specifier = ">=2.33.0" },
     { name = "rich", specifier = ">=13.0.0" },
@@ -2789,6 +2792,19 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/3b/ab/b3226f0bd7cdcf710fbede2b3548584366da3b19b5021e74f5bde2a8fa3f/pytest-9.0.2-py3-none-any.whl", hash = "sha256:711ffd45bf766d5264d487b917733b453d917afd2b0ad65223959f59089f875b", size = 374801, upload-time = "2025-12-06T21:30:49.154Z" },
 ]
 
+[[package]]
+name = "pytest-asyncio"
+version = "1.3.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "pytest" },
+    { name = "typing-extensions", marker = "python_full_version < '3.13'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/90/2c/8af215c0f776415f3590cac4f9086ccefd6fd463befeae41cd4d3f193e5a/pytest_asyncio-1.3.0.tar.gz", hash = "sha256:d7f52f36d231b80ee124cd216ffb19369aa168fc10095013c6b014a34d3ee9e5", size = 50087, upload-time = "2025-11-10T16:07:47.256Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/e5/35/f8b19922b6a25bc0880171a2f1a003eaeb93657475193ab516fd87cac9da/pytest_asyncio-1.3.0-py3-none-any.whl", hash = "sha256:611e26147c7f77640e6d0a92a38ed17c3e9848063698d5c93d5aa7aa11cebff5", size = 15075, upload-time = "2025-11-10T16:07:45.537Z" },
+]
+
 [[package]]
 name = "python-dateutil"
 version = "2.9.0.post0"