Aksel Joonas Reedi committed
Commit 3eec386 · unverified · 1 Parent(s): 0545e40

Fix CLI rendering corruption and split CLI/frontend model defaults (#121)


* Stabilize CLI rendering and make surface defaults explicit

The interactive CLI was interleaving live sub-agent redraws with streamed
markdown output, which corrupted ANSI rendering and leaked raw control
sequences into the terminal. The CLI and web app also shared one default
model config even though they need different Anthropic routing defaults.

Constraint: CLI default must use direct Anthropic credentials while web sessions must default to Bedrock Anthropic
Constraint: Interactive terminal output must remain readable while sub-agent progress is live
Rejected: Single shared config file with runtime overrides | keeps ownership of defaults implicit across surfaces
Rejected: Keep background redraw ticker | concurrent terminal writers still corrupt streamed output
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Keep CLI and frontend default models in separate config files unless both surfaces intentionally converge again
Tested: python -m compileall agent backend
Tested: ./frontend/node_modules/.bin/tsc -p frontend/tsconfig.json --noEmit
Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest python -m pytest -q tests/unit/test_cli_rendering.py
Not-tested: Full pytest suite (blocked by pre-existing tests/unit/test_llm_error_classification.py import error during collection)
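
A minimal sketch of the single-writer redraw discipline this change moves to (hypothetical class, not the shipped SubAgentDisplayManager): the status block repaints only inside state-mutating calls, so streamed markdown never races a background ticker for the terminal.

```python
import sys


class StatusBlock:
    """Repaints a small status region; the caller is the only writer."""

    def __init__(self) -> None:
        self._lines_on_screen = 0

    def update(self, lines: list[str]) -> None:
        # Erase the previous frame, then repaint — all inside one call,
        # so no timer task ever writes concurrently with streamed output.
        if self._lines_on_screen:
            sys.stdout.write(f"\033[{self._lines_on_screen}A\033[J")
        for line in lines:
            sys.stdout.write(line + "\n")
        sys.stdout.flush()
        self._lines_on_screen = len(lines)


block = StatusBlock()
block.update(["research: starting"])
block.update(["research: 1.2k tokens"])  # redraw driven by the event itself
```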

* Restore regression coverage and make the full test suite green

The earlier PR fixed the CLI rendering and model-default split, but the
full local suite exposed additional regressions in tool-result patching,
doom-loop polling detection, sandbox reuse messaging, and async test
support. This follow-up commit restores the missing helpers and updates
those production paths so the new regression tests pass for real.

Constraint: Provider message histories must keep tool_use/tool_result pairing valid across interrupted turns
Constraint: Legitimate polling with changing results must not trip doom-loop recovery
Rejected: Only fix the original collection blocker | leaves the full suite red and the PR note stale
Rejected: Silence the failing tests without restoring runtime helpers | would hide real production regressions
Confidence: high
Scope-risk: moderate
Reversibility: clean
Directive: Keep the local regression tests in sync with the production recovery paths they exercise
Tested: python -m compileall agent/context_manager/manager.py agent/core/agent_loop.py agent/core/doom_loop.py agent/tools/sandbox_tool.py backend/user_quotas.py
Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q tests/unit/test_dangling_tool_calls.py tests/unit/test_doom_loop_polling.py tests/unit/test_sandbox_already_active_message.py tests/unit/test_user_quotas.py
Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q
Not-tested: Remote CI environment parity
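
A schematic illustration (plain dicts, not the project's Message type) of the pairing invariant the first constraint names: every assistant tool_use id must be answered by a tool result immediately after it, before any other message reaches the provider.

```python
# A cancelled run leaves no tool message after the assistant turn;
# the patcher inserts a stub so the next request stays valid.
history = [
    {"role": "assistant", "tool_calls": [{"id": "call_1", "name": "research"}]},
    {"role": "tool", "tool_call_id": "call_1",
     "content": "Tool was not executed (interrupted or error)."},
    {"role": "user", "content": "follow-up question"},
]

issued = {
    tc["id"]
    for m in history if m["role"] == "assistant"
    for tc in m.get("tool_calls", [])
}
answered = {m.get("tool_call_id") for m in history if m["role"] == "tool"}
assert issued <= answered  # no dangling tool_use ids remain
```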

* Tighten rate-limit retries and drop the orphaned shared config

The review was right about two follow-up issues: the old shared config file
was still present after the CLI/frontend split, and the Bedrock rate-limit
retry schedule still had a dead third entry because the loop only ever
consumed two retry delays. This commit removes the orphaned config and makes
the rate-limit schedule line up with the actual retry budget.

Constraint: Retry budget for Bedrock token throttling must exceed the provider's ~60s bucket recovery window in the retries that actually run
Rejected: Keep a third delay entry in the schedule | the current retry loop never reaches it
Confidence: high
Scope-risk: narrow
Reversibility: clean
Directive: Keep retry schedules aligned with the retry loop's real number of sleeps, not the raw retry constant count
Tested: python -m compileall agent/core/agent_loop.py tests/unit/test_llm_error_classification.py
Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q tests/unit/test_llm_error_classification.py
Tested: UV_CACHE_DIR=/tmp/uv-cache uv run --with pytest --with pytest-asyncio python -m pytest -q
Not-tested: Remote CI environment parity
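
A sketch of the loop shape the directive refers to (generic retry helper, not the exact agent_loop control flow): three attempts mean only two sleeps, so a two-entry rate-limit schedule is the most the loop can consume, and 30 + 60 = 90s comfortably exceeds the ~60s bucket window.

```python
import time

MAX_RETRIES = 3
RATE_LIMIT_DELAYS = [30, 60]  # len == MAX_RETRIES - 1; a third entry would be dead


def call_with_retries(call):
    for attempt in range(MAX_RETRIES):
        try:
            return call()
        except TimeoutError:  # stand-in for the rate-limit classification
            if attempt >= MAX_RETRIES - 1:
                raise  # final attempt: no sleep left to take
            time.sleep(RATE_LIMIT_DELAYS[attempt])  # only indices 0 and 1 run
```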

README.md CHANGED
@@ -212,7 +212,8 @@ def create_builtin_tools() -> list[ToolSpec]:
 
 ### Adding MCP Servers
 
-Edit `configs/main_agent_config.json`:
+Edit `configs/cli_agent_config.json` for CLI defaults, or
+`configs/frontend_agent_config.json` for web-session defaults:
 
 ```json
 {
agent/context_manager/manager.py CHANGED
@@ -253,45 +253,49 @@ class ContextManager:
     def _patch_dangling_tool_calls(self) -> None:
         """Add stub tool results for any tool_calls that lack a matching result.
 
-        Scans backwards to find the last assistant message with tool_calls,
-        which may not be items[-1] if some tool results were already added.
+        Ensures each assistant message's tool_calls are followed immediately
+        by matching tool-result messages. This has to work across the whole
+        history, not just the most recent turn, because a cancelled tool use
+        in an earlier turn can still poison the next provider request.
         """
         if not self.items:
             return
 
-        # Find the last assistant message with tool_calls
-        assistant_msg = None
-        for i in range(len(self.items) - 1, -1, -1):
-            msg = self.items[i]
-            if getattr(msg, "role", None) == "assistant" and getattr(
-                msg, "tool_calls", None
-            ):
-                assistant_msg = msg
-                break
-            # Stop scanning once we hit a user message — anything before
-            # that belongs to a previous (complete) turn.
-            if getattr(msg, "role", None) == "user":
-                break
-
-        if not assistant_msg:
-            return
-
-        self._normalize_tool_calls(assistant_msg)
-        answered_ids = {
-            getattr(m, "tool_call_id", None)
-            for m in self.items
-            if getattr(m, "role", None) == "tool"
-        }
-        for tc in assistant_msg.tool_calls:
-            if tc.id not in answered_ids:
-                self.items.append(
-                    Message(
-                        role="tool",
-                        content="Tool was not executed (interrupted or error).",
-                        tool_call_id=tc.id,
-                        name=tc.function.name,
-                    )
-                )
+        i = 0
+        while i < len(self.items):
+            msg = self.items[i]
+            if getattr(msg, "role", None) != "assistant" or not getattr(msg, "tool_calls", None):
+                i += 1
+                continue
+
+            self._normalize_tool_calls(msg)
+
+            # Consume the contiguous tool-result block that immediately follows
+            # this assistant message. Any missing tool ids must be inserted
+            # before the next non-tool message to satisfy provider ordering.
+            j = i + 1
+            immediate_ids: set[str | None] = set()
+            while j < len(self.items) and getattr(self.items[j], "role", None) == "tool":
+                immediate_ids.add(getattr(self.items[j], "tool_call_id", None))
+                j += 1
+
+            missing: list[Message] = []
+            for tc in msg.tool_calls:
+                if tc.id not in immediate_ids:
+                    missing.append(
+                        Message(
+                            role="tool",
+                            content="Tool was not executed (interrupted or error).",
+                            tool_call_id=tc.id,
+                            name=tc.function.name,
+                        )
+                    )
+
+            if missing:
+                self.items[j:j] = missing
+                j += len(missing)
+
+            i = j
 
     def undo_last_turn(self) -> bool:
         """Remove the last complete turn (user msg + all assistant/tool msgs that follow).
agent/core/agent_loop.py CHANGED
@@ -25,6 +25,61 @@ logger = logging.getLogger(__name__)
 
 ToolCall = ChatCompletionMessageToolCall
 
+_MALFORMED_TOOL_PREFIX = "ERROR: Tool call to '"
+_MALFORMED_TOOL_SUFFIX = "' had malformed JSON arguments"
+
+
+def _malformed_tool_name(message: Message) -> str | None:
+    """Return the tool name for malformed-json tool-result messages."""
+    if getattr(message, "role", None) != "tool":
+        return None
+    content = getattr(message, "content", None)
+    if not isinstance(content, str):
+        return None
+    if not content.startswith(_MALFORMED_TOOL_PREFIX):
+        return None
+    end = content.find(_MALFORMED_TOOL_SUFFIX, len(_MALFORMED_TOOL_PREFIX))
+    if end == -1:
+        return None
+    return content[len(_MALFORMED_TOOL_PREFIX):end]
+
+
+def _detect_repeated_malformed(
+    items: list[Message], threshold: int = 2,
+) -> str | None:
+    """Return the repeated malformed tool name if the tail contains a streak.
+
+    Walk backward over the current conversation tail. A streak counts only
+    consecutive malformed tool-result messages for the same tool; any other
+    tool result breaks it.
+    """
+    if threshold <= 0:
+        return None
+
+    streak_tool: str | None = None
+    streak = 0
+
+    for item in reversed(items):
+        if getattr(item, "role", None) != "tool":
+            continue
+
+        malformed_tool = _malformed_tool_name(item)
+        if malformed_tool is None:
+            break
+
+        if streak_tool is None:
+            streak_tool = malformed_tool
+            streak = 1
+        elif malformed_tool == streak_tool:
+            streak += 1
+        else:
+            break
+
+    if streak >= threshold:
+        return streak_tool
+
+    return None
+
 
 def _validate_tool_args(tool_args: dict) -> tuple[bool, str | None]:
     """
@@ -121,6 +176,54 @@ def _needs_approval(
 # -- LLM retry constants --------------------------------------------------
 _MAX_LLM_RETRIES = 3
 _LLM_RETRY_DELAYS = [5, 15, 30]  # seconds between retries
+_LLM_RATE_LIMIT_RETRY_DELAYS = [30, 60]  # exceed Bedrock's ~60s TPM bucket window
+
+
+def _is_rate_limit_error(error: Exception) -> bool:
+    """Return True for rate-limit / quota-bucket style provider errors."""
+    err_str = str(error).lower()
+    rate_limit_patterns = [
+        "429",
+        "rate limit",
+        "rate_limit",
+        "too many requests",
+        "too many tokens",
+        "request limit",
+        "throttl",
+    ]
+    return any(pattern in err_str for pattern in rate_limit_patterns)
+
+
+def _is_context_overflow_error(error: Exception) -> bool:
+    """Return True when the prompt exceeded the model's context window."""
+    if isinstance(error, ContextWindowExceededError):
+        return True
+
+    err_str = str(error).lower()
+    overflow_patterns = [
+        "context window exceeded",
+        "maximum context length",
+        "max context length",
+        "prompt is too long",
+        "context length exceeded",
+        "too many input tokens",
+        "input is too long",
+    ]
+    return any(pattern in err_str for pattern in overflow_patterns)
+
+
+def _retry_delay_for(error: Exception, attempt_index: int) -> int | None:
+    """Return the delay for this retry attempt, or None if it should not retry."""
+    if _is_rate_limit_error(error):
+        schedule = _LLM_RATE_LIMIT_RETRY_DELAYS
+    elif _is_transient_error(error):
+        schedule = _LLM_RETRY_DELAYS
+    else:
+        return None
+
+    if attempt_index >= len(schedule):
+        return None
+    return schedule[attempt_index]
 
 
 def _is_transient_error(error: Exception) -> bool:
@@ -128,7 +231,6 @@ def _is_transient_error(error: Exception) -> bool:
     err_str = str(error).lower()
     transient_patterns = [
         "timeout", "timed out",
-        "429", "rate limit", "rate_limit",
         "503", "service unavailable",
         "502", "bad gateway",
         "500", "internal server error",
@@ -136,7 +238,7 @@ def _is_transient_error(error: Exception) -> bool:
         "connection reset", "connection refused", "connection error",
         "eof", "broken pipe",
     ]
-    return any(pattern in err_str for pattern in transient_patterns)
+    return _is_rate_limit_error(error) or any(pattern in err_str for pattern in transient_patterns)
 
 
 def _is_effort_config_error(error: Exception) -> bool:
@@ -317,6 +419,8 @@ async def _call_llm_streaming(session: Session, messages, tools, llm_params) ->
         except ContextWindowExceededError:
             raise
         except Exception as e:
+            if _is_context_overflow_error(e):
+                raise ContextWindowExceededError(str(e)) from e
             if not _healed_effort and _is_effort_config_error(e):
                 _healed_effort = True
                 llm_params = await _heal_effort_and_rebuild_params(session, e, llm_params)
@@ -325,8 +429,8 @@ async def _call_llm_streaming(session: Session, messages, tools, llm_params) ->
                     data={"tool": "system", "log": "Reasoning effort not supported for this model — adjusting and retrying."},
                 ))
                 continue
-            if _llm_attempt < _MAX_LLM_RETRIES - 1 and _is_transient_error(e):
-                _delay = _LLM_RETRY_DELAYS[_llm_attempt]
+            _delay = _retry_delay_for(e, _llm_attempt)
+            if _llm_attempt < _MAX_LLM_RETRIES - 1 and _delay is not None:
                 logger.warning(
                     "Transient LLM error (attempt %d/%d): %s — retrying in %ds",
                     _llm_attempt + 1, _MAX_LLM_RETRIES, e, _delay,
@@ -424,6 +528,8 @@ async def _call_llm_non_streaming(session: Session, messages, tools, llm_params)
         except ContextWindowExceededError:
             raise
         except Exception as e:
+            if _is_context_overflow_error(e):
+                raise ContextWindowExceededError(str(e)) from e
             if not _healed_effort and _is_effort_config_error(e):
                 _healed_effort = True
                 llm_params = await _heal_effort_and_rebuild_params(session, e, llm_params)
@@ -432,8 +538,8 @@ async def _call_llm_non_streaming(session: Session, messages, tools, llm_params)
                     data={"tool": "system", "log": "Reasoning effort not supported for this model — adjusting and retrying."},
                 ))
                 continue
-            if _llm_attempt < _MAX_LLM_RETRIES - 1 and _is_transient_error(e):
-                _delay = _LLM_RETRY_DELAYS[_llm_attempt]
+            _delay = _retry_delay_for(e, _llm_attempt)
+            if _llm_attempt < _MAX_LLM_RETRIES - 1 and _delay is not None:
                 logger.warning(
                     "Transient LLM error (attempt %d/%d): %s — retrying in %ds",
                     _llm_attempt + 1, _MAX_LLM_RETRIES, e, _delay,
@@ -585,6 +691,31 @@ class Handlers:
             )
         )
 
+        malformed_tool = _detect_repeated_malformed(session.context_manager.items)
+        if malformed_tool:
+            recovery_prompt = (
+                "[SYSTEM: Repeated malformed tool arguments detected for "
+                f"'{malformed_tool}'. Stop retrying the same tool call shape. "
+                "Use a different strategy that produces smaller, valid JSON. "
+                "For large file writes, prefer bash with a heredoc or split the "
+                "edit into multiple smaller tool calls.]"
+            )
+            session.context_manager.add_message(
+                Message(role="user", content=recovery_prompt)
+            )
+            await session.send_event(
+                Event(
+                    event_type="tool_log",
+                    data={
+                        "tool": "system",
+                        "log": (
+                            "Repeated malformed tool arguments detected — "
+                            f"forcing a different strategy for {malformed_tool}"
+                        ),
+                    },
+                )
+            )
+
         messages = session.context_manager.get_messages()
         tools = session.tool_router.get_tool_specs_for_llm()
         try:
agent/core/doom_loop.py CHANGED
@@ -17,10 +17,11 @@ logger = logging.getLogger(__name__)
 
 @dataclass(frozen=True)
 class ToolCallSignature:
-    """Hashable signature for a single tool call (name + args hash)."""
+    """Hashable signature for a single tool call plus its observed result."""
 
     name: str
     args_hash: str
+    result_hash: str | None = None
 
 
 def _hash_args(args_str: str) -> str:
@@ -31,11 +32,16 @@ def _hash_args(args_str: str) -> str:
 def extract_recent_tool_signatures(
     messages: list[Message], lookback: int = 30
 ) -> list[ToolCallSignature]:
-    """Extract tool call signatures from recent assistant messages."""
+    """Extract tool call signatures from recent assistant messages.
+
+    Includes the immediate tool result hash when present. This prevents
+    legitimate polling from being classified as a doom loop when the poll
+    arguments stay constant but the observed result keeps changing.
+    """
     signatures: list[ToolCallSignature] = []
     recent = messages[-lookback:] if len(messages) > lookback else messages
 
-    for msg in recent:
+    for idx, msg in enumerate(recent):
         if getattr(msg, "role", None) != "assistant":
             continue
         tool_calls = getattr(msg, "tool_calls", None)
@@ -47,7 +53,21 @@ def extract_recent_tool_signatures(
             continue
         name = getattr(fn, "name", "") or ""
         args_str = getattr(fn, "arguments", "") or ""
-        signatures.append(ToolCallSignature(name=name, args_hash=_hash_args(args_str)))
+        result_hash = None
+        for follow in recent[idx + 1:]:
+            role = getattr(follow, "role", None)
+            if role == "tool" and getattr(follow, "tool_call_id", None) == getattr(tc, "id", None):
+                result_hash = _hash_args(str(getattr(follow, "content", "") or ""))
+                break
+            if role in {"assistant", "user"}:
+                break
+        signatures.append(
+            ToolCallSignature(
+                name=name,
+                args_hash=_hash_args(args_str),
+                result_hash=result_hash,
+            )
+        )
 
     return signatures
agent/main.py CHANGED
@@ -50,6 +50,16 @@ litellm.drop_params = True
 # on every error — users don't need it, and our friendly errors cover the case.
 litellm.suppress_debug_info = True
 
+CLI_CONFIG_PATH = Path(__file__).parent.parent / "configs" / "cli_agent_config.json"
+
+
+def _configure_runtime_logging() -> None:
+    """Keep third-party warning spam from punching through the interactive UI."""
+    import logging
+
+    logging.getLogger("LiteLLM").setLevel(logging.ERROR)
+    logging.getLogger("litellm").setLevel(logging.ERROR)
+
 def _safe_get_args(arguments: dict) -> dict:
     """Safely extract args dict from arguments, handling cases where LLM passes string."""
     args = arguments.get("args", {})
@@ -846,8 +856,7 @@ async def main():
     ready_event = asyncio.Event()
 
     # Start agent loop in background
-    config_path = Path(__file__).parent.parent / "configs" / "main_agent_config.json"
-    config = load_config(config_path)
+    config = load_config(CLI_CONFIG_PATH)
 
     # Create tool router with local mode
     tool_router = ToolRouter(config.mcpServers, hf_token=hf_token, local_mode=True)
@@ -1037,6 +1046,7 @@ async def headless_main(
     import logging
 
     logging.basicConfig(level=logging.WARNING)
+    _configure_runtime_logging()
 
     hf_token = _get_hf_token()
     if not hf_token:
@@ -1045,8 +1055,7 @@ async def headless_main(
 
     print(f"HF token loaded", file=sys.stderr)
 
-    config_path = Path(__file__).parent.parent / "configs" / "main_agent_config.json"
-    config = load_config(config_path)
+    config = load_config(CLI_CONFIG_PATH)
     config.yolo_mode = True  # Auto-approve everything in headless mode
 
     if model:
@@ -1222,6 +1231,7 @@ def cli():
     import warnings
     # Suppress aiohttp "Unclosed client session" noise during event loop teardown
     _logging.getLogger("asyncio").setLevel(_logging.CRITICAL)
+    _configure_runtime_logging()
     # Suppress litellm pydantic deprecation warnings
     warnings.filterwarnings("ignore", category=DeprecationWarning, module="litellm")
     # Suppress whoosh invalid escape sequence warnings (third-party, unfixed upstream)
agent/tools/research_tool.py CHANGED
@@ -216,7 +216,9 @@ RESEARCH_TOOL_SPEC = {
 
 def _get_research_model(main_model: str) -> str:
     """Pick a cheaper model for research based on the main model."""
-    if "anthropic" in main_model:
+    if main_model.startswith("anthropic/"):
+        return "anthropic/claude-sonnet-4-6"
+    if main_model.startswith("bedrock/") and "anthropic" in main_model:
         return "bedrock/us.anthropic.claude-sonnet-4-6"
     # For non-Anthropic models (HF router etc.), use the same model
     return main_model
agent/tools/sandbox_tool.py CHANGED
@@ -213,16 +213,26 @@ async def sandbox_create_handler(
     args: dict[str, Any], session: Any = None
 ) -> tuple[str, bool]:
     """Handle sandbox_create tool calls."""
+    hardware = args.get("hardware", "cpu-basic")
+
     # If sandbox already exists, return its info
     if session and getattr(session, "sandbox", None):
         sb = session.sandbox
+        requested_hardware = args.get("hardware")
+        lockout_note = ""
+        if requested_hardware:
+            lockout_note = (
+                f"\nRequested hardware: {requested_hardware}\n"
+                "Hardware cannot be changed by calling sandbox_create again. "
+                "Delete the existing sandbox first if you need a different tier."
+            )
         return (
             f"Sandbox already active: {sb.space_id}\n"
            f"URL: {sb.url}\n"
+            f"{lockout_note}\n"
             f"Use bash/read/write/edit to interact with it."
         ), True
 
-    hardware = args.get("hardware", "cpu-basic")
     create_kwargs = {}
     if "private" in args:
         create_kwargs["private"] = args["private"]
agent/utils/terminal_display.py CHANGED
@@ -180,10 +180,8 @@ class SubAgentDisplayManager:
     def __init__(self):
         self._agents: dict[str, dict] = {}  # agent_id -> state dict
         self._lines_on_screen = 0
-        self._ticker_task = None
 
     def start(self, agent_id: str, label: str = "research") -> None:
-        import asyncio
         import time
         self._agents[agent_id] = {
             "label": label,
@@ -192,8 +190,6 @@ class SubAgentDisplayManager:
             "token_count": 0,
             "start_time": time.monotonic(),
         }
-        if not self._ticker_task:
-            self._ticker_task = asyncio.ensure_future(self._tick())
         self._redraw()
 
     def set_tokens(self, agent_id: str, tokens: int) -> None:
@@ -222,11 +218,7 @@ class SubAgentDisplayManager:
             _console.file.write(line + "\n")
             _console.file.flush()
         self._lines_on_screen = 0
-        if not self._agents:
-            if self._ticker_task:
-                self._ticker_task.cancel()
-                self._ticker_task = None
-        else:
+        if self._agents:
             self._redraw()
 
     @staticmethod
@@ -239,16 +231,6 @@ class SubAgentDisplayManager:
             line += f" \033[2m({stats})\033[0m"
         return line
 
-    async def _tick(self) -> None:
-        import asyncio
-        try:
-            while True:
-                await asyncio.sleep(1.0)
-                if self._agents:
-                    self._redraw()
-        except asyncio.CancelledError:
-            pass
-
     @staticmethod
     def _format_stats(agent: dict) -> str:
         import time
backend/session_manager.py CHANGED
@@ -15,7 +15,7 @@ from agent.core.tools import ToolRouter
 
 # Get project root (parent of backend directory)
 PROJECT_ROOT = Path(__file__).parent.parent
-DEFAULT_CONFIG_PATH = str(PROJECT_ROOT / "configs" / "main_agent_config.json")
+DEFAULT_CONFIG_PATH = str(PROJECT_ROOT / "configs" / "frontend_agent_config.json")
 
 
 # These dataclasses match agent/main.py structure
configs/{main_agent_config.json → cli_agent_config.json} RENAMED
File without changes
configs/frontend_agent_config.json ADDED
@@ -0,0 +1,14 @@
+{
+  "model_name": "bedrock/us.anthropic.claude-opus-4-6-v1",
+  "save_sessions": true,
+  "session_dataset_repo": "smolagents/ml-intern-sessions",
+  "yolo_mode": false,
+  "confirm_cpu_jobs": true,
+  "auto_file_upload": true,
+  "mcpServers": {
+    "hf-mcp-server": {
+      "transport": "http",
+      "url": "https://huggingface.co/mcp?login"
+    }
+  }
+}
frontend/src/components/Chat/ChatInput.tsx CHANGED
@@ -7,7 +7,7 @@ import { apiFetch } from '@/utils/api';
 import { useUserQuota } from '@/hooks/useUserQuota';
 import ClaudeCapDialog from '@/components/ClaudeCapDialog';
 import { useAgentStore } from '@/store/agentStore';
-import { FIRST_FREE_MODEL_PATH } from '@/utils/model';
+import { CLAUDE_MODEL_PATH, FIRST_FREE_MODEL_PATH, isClaudePath } from '@/utils/model';
 
 // Model configuration
 interface ModelOption {
@@ -37,7 +37,7 @@ const MODEL_OPTIONS: ModelOption[] = [
     id: 'claude-opus',
     name: 'Claude Opus 4.6',
     description: 'Anthropic',
-    modelPath: 'anthropic/claude-opus-4-6',
+    modelPath: CLAUDE_MODEL_PATH,
     avatarUrl: 'https://huggingface.co/api/avatars/Anthropic',
     recommended: true,
   },
@@ -70,7 +70,7 @@ interface ChatInputProps {
   placeholder?: string;
 }
 
-const isClaudeModel = (m: ModelOption) => m.modelPath.startsWith('anthropic/');
+const isClaudeModel = (m: ModelOption) => isClaudePath(m.modelPath);
 const firstFreeModel = () => MODEL_OPTIONS.find(m => !isClaudeModel(m)) ?? MODEL_OPTIONS[0];
 
 export default function ChatInput({ sessionId, onSend, onStop, isProcessing = false, disabled = false, placeholder = 'Ask anything...' }: ChatInputProps) {
frontend/src/utils/model.ts CHANGED
@@ -3,13 +3,12 @@
  * ClaudeCapDialog "Use a free model" escape hatch.
  *
  * Keep in sync with MODEL_OPTIONS in components/Chat/ChatInput.tsx and
- * AVAILABLE_MODELS in backend/routes/agent.py. Bare HF ids (no
- * `huggingface/` prefix) — matches upstream's auto-router.
+ * AVAILABLE_MODELS in backend/routes/agent.py.
  */
 
-export const CLAUDE_MODEL_PATH = 'anthropic/claude-opus-4-6';
+export const CLAUDE_MODEL_PATH = 'bedrock/us.anthropic.claude-opus-4-6-v1';
 export const FIRST_FREE_MODEL_PATH = 'moonshotai/Kimi-K2.6';
 
 export function isClaudePath(modelPath: string | undefined): boolean {
-  return !!modelPath && modelPath.startsWith('anthropic/');
+  return !!modelPath && modelPath.includes('anthropic');
 }
pyproject.toml CHANGED
@@ -42,6 +42,7 @@ eval = [
 # Development and testing dependencies
 dev = [
     "pytest>=9.0.2",
+    "pytest-asyncio>=0.26.0",
 ]
 
 # All dependencies (eval + dev)
@@ -61,3 +62,6 @@ include = ["agent*"]
 
 [tool.uv]
 package = true
+
+[tool.pytest.ini_options]
+asyncio_mode = "auto"
tests/unit/test_cli_rendering.py ADDED
@@ -0,0 +1,44 @@
+"""Regression tests for interactive CLI rendering and research model routing."""
+
+from io import StringIO
+from types import SimpleNamespace
+
+from agent.tools.research_tool import _get_research_model
+from agent.utils import terminal_display
+
+
+def test_direct_anthropic_research_model_stays_off_bedrock():
+    assert _get_research_model("anthropic/claude-opus-4-6") == "anthropic/claude-sonnet-4-6"
+
+
+def test_bedrock_anthropic_research_model_stays_on_bedrock():
+    assert (
+        _get_research_model("bedrock/us.anthropic.claude-opus-4-6-v1")
+        == "bedrock/us.anthropic.claude-sonnet-4-6"
+    )
+
+
+def test_non_anthropic_research_model_is_unchanged():
+    assert _get_research_model("openai/gpt-5.4") == "openai/gpt-5.4"
+
+
+def test_subagent_display_does_not_spawn_background_redraw(monkeypatch):
+    calls: list[object] = []
+
+    def _unexpected_future(*args, **kwargs):
+        calls.append((args, kwargs))
+        raise AssertionError("background redraw task should not be created")
+
+    monkeypatch.setattr("asyncio.ensure_future", _unexpected_future)
+    monkeypatch.setattr(
+        terminal_display,
+        "_console",
+        SimpleNamespace(file=StringIO(), width=100),
+    )
+
+    mgr = terminal_display.SubAgentDisplayManager()
+    mgr.start("agent-1", "research")
+    mgr.add_call("agent-1", "▸ hf_papers {\"operation\": \"search\"}")
+    mgr.clear("agent-1")
+
+    assert calls == []
tests/unit/test_dangling_tool_calls.py ADDED
@@ -0,0 +1,121 @@
+"""Regression tests for `_patch_dangling_tool_calls`.
+
+Reproduces the failure mode behind observatory sessions 8dd2ce30 and
+59c9e678 (2026-04-25): a tool call cancelled mid-execution leaves an
+orphan ``tool_use`` in history; the user types a follow-up; Bedrock
+rejects the next request with HTTP 400 ``messages.N: tool_use ids were
+found without tool_result blocks immediately after``.
+"""
+
+from litellm import ChatCompletionMessageToolCall, Message
+
+from agent.context_manager.manager import ContextManager
+
+
+def _tool_call(call_id: str, name: str = "research") -> ChatCompletionMessageToolCall:
+    return ChatCompletionMessageToolCall(
+        id=call_id,
+        type="function",
+        function={"name": name, "arguments": "{}"},
+    )
+
+
+def _make_cm() -> ContextManager:
+    cm = ContextManager.__new__(ContextManager)
+    cm.system_prompt = "system"
+    cm.model_max_tokens = 100_000
+    cm.compact_size = 1_000
+    cm.running_context_usage = 0
+    cm.untouched_messages = 5
+    cm.items = [Message(role="system", content="system")]
+    return cm
+
+
+def test_orphan_tool_use_followed_by_user_message_is_patched():
+    cm = _make_cm()
+    cm.items.extend([
+        Message(role="user", content="Research X"),
+        Message(
+            role="assistant",
+            content=None,
+            tool_calls=[_tool_call("call_abc", "research")],
+        ),
+        Message(role="user", content="??"),
+    ])
+    msgs = cm.get_messages()
+    tool_msgs = [m for m in msgs if getattr(m, "role", None) == "tool"]
+    assert len(tool_msgs) == 1
+    assert tool_msgs[0].tool_call_id == "call_abc"
+    assert "interrupted" in (tool_msgs[0].content or "").lower() or "not executed" in (tool_msgs[0].content or "").lower()
+
+
+def test_no_orphan_means_no_stub():
+    cm = _make_cm()
+    cm.items.extend([
+        Message(role="user", content="Research X"),
+        Message(
+            role="assistant",
+            content=None,
+            tool_calls=[_tool_call("call_abc", "research")],
+        ),
+        Message(role="tool", content="ok", tool_call_id="call_abc", name="research"),
+    ])
+    cm.get_messages()
+    tool_msgs = [m for m in cm.items if getattr(m, "role", None) == "tool"]
+    assert len(tool_msgs) == 1
+    assert tool_msgs[0].content == "ok"
+
+
+def test_multiple_dangling_tool_calls_in_one_assistant_message_are_all_patched():
+    cm = _make_cm()
+    cm.items.extend([
+        Message(role="user", content="do two things"),
+        Message(
+            role="assistant",
+            content=None,
+            tool_calls=[
+                _tool_call("call_1", "research"),
+                _tool_call("call_2", "bash"),
+            ],
+        ),
+        Message(role="user", content="follow up"),
+    ])
+    cm.get_messages()
+    tool_ids = {
+        getattr(m, "tool_call_id", None)
+        for m in cm.items
+        if getattr(m, "role", None) == "tool"
+    }
+    assert tool_ids == {"call_1", "call_2"}
+
+
+def test_orphan_in_earlier_turn_still_gets_patched():
+    """Two-turn history where the FIRST turn was interrupted.
+
+    Old patcher stopped at the first user msg encountered while scanning
+    backwards, so this case never got fixed and Bedrock rejected.
+    """
+    cm = _make_cm()
+    cm.items.extend([
+        Message(role="user", content="turn 1"),
+        Message(
+            role="assistant",
+            content=None,
+            tool_calls=[_tool_call("call_old", "research")],
+        ),
+        Message(role="user", content="turn 2 — please retry"),
+        Message(
+            role="assistant",
+            content=None,
+            tool_calls=[_tool_call("call_new", "bash")],
+        ),
+        Message(role="tool", content="ok", tool_call_id="call_new", name="bash"),
+    ])
+    cm.get_messages()
+    tool_ids = {
+        getattr(m, "tool_call_id", None)
+        for m in cm.items
+        if getattr(m, "role", None) == "tool"
+    }
+    assert "call_old" in tool_ids
+    assert "call_new" in tool_ids
tests/unit/test_doom_loop_polling.py ADDED
@@ -0,0 +1,96 @@
+"""Regression test for doom-loop false-positive on legitimate polling.
+
+Reproduces the failure mode in observatory sessions 40fcb414 ($32.59),
+8e90352e ($62.63), and 403178bf ($5.71) on 2026-04-25: the agent polled a
+long-running job with `bash sleep 300 && wc -l output` four times in a
+row. The arguments were byte-identical, but the results moved (27210 →
+36454 → 45770 → 55138 — actual progress). The detector hashed args only
+and false-fired DOOM LOOP, which made the agent abandon perfectly valid
+polling.
+
+After the fix the signature includes the tool result hash, so identical
+args + different results no longer trips the detector.
+"""
+
+from litellm import ChatCompletionMessageToolCall, Message
+
+from agent.core.doom_loop import check_for_doom_loop
+
+
+def _assistant(call_id: str, name: str, args: str) -> Message:
+    return Message(
+        role="assistant",
+        content=None,
+        tool_calls=[
+            ChatCompletionMessageToolCall(
+                id=call_id,
+                type="function",
+                function={"name": name, "arguments": args},
+            )
+        ],
+    )
+
+
+def _tool(call_id: str, name: str, content: str) -> Message:
+    return Message(role="tool", content=content, tool_call_id=call_id, name=name)
+
+
+_POLL_ARGS = '{"command": "sleep 300 && ls /app/images/ | wc -l"}'
+
+
+def test_polling_with_progressing_results_does_not_fire():
+    msgs = [
+        Message(role="user", content="run the job"),
+        _assistant("c1", "bash", _POLL_ARGS),
+        _tool("c1", "bash", "27210"),
+        _assistant("c2", "bash", _POLL_ARGS),
+        _tool("c2", "bash", "36454"),
+        _assistant("c3", "bash", _POLL_ARGS),
+        _tool("c3", "bash", "45770"),
+        _assistant("c4", "bash", _POLL_ARGS),
+        _tool("c4", "bash", "55138"),
+    ]
+    assert check_for_doom_loop(msgs) is None
+
+
+def test_truly_stuck_polling_with_identical_results_still_fires():
+    """If the same poll returns the same number, the job is genuinely
+    stuck and the detector SHOULD fire."""
+    msgs = [
+        _assistant("c1", "bash", _POLL_ARGS),
+        _tool("c1", "bash", "55138"),
+        _assistant("c2", "bash", _POLL_ARGS),
+        _tool("c2", "bash", "55138"),
+        _assistant("c3", "bash", _POLL_ARGS),
+        _tool("c3", "bash", "55138"),
+    ]
+    prompt = check_for_doom_loop(msgs)
+    assert prompt is not None
+    assert "DOOM LOOP" in prompt
+    assert "bash" in prompt
+
+
+def test_identical_calls_with_no_results_yet_still_fires():
+    """If three identical calls have no tool results (e.g. all cancelled
+    or errored before a result was recorded), treat as a real loop."""
+    msgs = [
+        _assistant("c1", "write", '{"path": "/tmp/x", "content": "..."}'),
+        _assistant("c2", "write", '{"path": "/tmp/x", "content": "..."}'),
+        _assistant("c3", "write", '{"path": "/tmp/x", "content": "..."}'),
+    ]
+    prompt = check_for_doom_loop(msgs)
+    assert prompt is not None
+    assert "DOOM LOOP" in prompt
+    assert "write" in prompt
+
+
+def test_different_args_does_not_fire():
+    msgs = [
+        _assistant("c1", "bash", '{"command": "ls /a"}'),
+        _tool("c1", "bash", "ok"),
+        _assistant("c2", "bash", '{"command": "ls /b"}'),
+        _tool("c2", "bash", "ok"),
+        _assistant("c3", "bash", '{"command": "ls /c"}'),
+        _tool("c3", "bash", "ok"),
+    ]
+    assert check_for_doom_loop(msgs) is None
tests/unit/test_llm_error_classification.py ADDED
@@ -0,0 +1,100 @@
+"""Tests for LLM error classification helpers in agent.core.agent_loop.
+
+Covers two regressions on 2026-04-25:
+
+1. Non-Anthropic context overflow (Kimi 365k > 262k) was not classified as
+   ``_is_context_overflow_error``, so the recovery path didn't fire and
+   session 62ccfdcb died with 68 wasted compaction events.
+
+2. Bedrock TPM rate limit (`Too many tokens, please wait before trying
+   again.`) needs the longer rate-limit retry schedule. The old schedule
+   ([5, 15, 30] = 50s) burned through 6 sessions costing >$2,400 combined
+   on the same day.
+"""
+
+from agent.core.agent_loop import (
+    _MAX_LLM_RETRIES,
+    _LLM_RATE_LIMIT_RETRY_DELAYS,
+    _LLM_RETRY_DELAYS,
+    _is_context_overflow_error,
+    _is_rate_limit_error,
+    _is_transient_error,
+    _retry_delay_for,
+)
+
+
+# ── context overflow ────────────────────────────────────────────────────
+
+
+def test_kimi_prompt_too_long_is_context_overflow():
+    # Verbatim error text from session 62ccfdcb (2026-04-25, Kimi K2.6).
+    err = Exception(
+        "litellm.BadRequestError: OpenAIException - The prompt is too long: "
+        "365407, model maximum context length: 262143"
+    )
+    assert _is_context_overflow_error(err)
+
+
+def test_openai_context_length_exceeded_is_context_overflow():
+    err = Exception("Error: This model's maximum context length is 8192 tokens.")
+    assert _is_context_overflow_error(err)
+
+
+def test_random_error_is_not_context_overflow():
+    err = Exception("connection reset by peer")
+    assert not _is_context_overflow_error(err)
+
+
+# ── rate limit ──────────────────────────────────────────────────────────
+
+
+def test_bedrock_too_many_tokens_is_rate_limit():
+    # Verbatim from sessions b37a3823, c4d7a831, b63c4933 (2026-04-25).
+    err = Exception(
+        'litellm.RateLimitError: BedrockException - {"message":"Too many '
+        'tokens, please wait before trying again."}'
+    )
+    assert _is_rate_limit_error(err)
+    # Rate-limit errors are also classified as transient.
+    assert _is_transient_error(err)
+
+
+def test_429_is_rate_limit():
+    err = Exception("HTTP 429 Too Many Requests")
+    assert _is_rate_limit_error(err)
+
+
+def test_timeout_is_transient_but_not_rate_limit():
+    err = Exception("Request timed out after 600s")
+    assert _is_transient_error(err)
+    assert not _is_rate_limit_error(err)
+
+
+# ── retry schedule selection ────────────────────────────────────────────
+
+
+def test_rate_limit_uses_longer_schedule():
+    err = Exception("Too many tokens, please wait before trying again.")
+    delays = [_retry_delay_for(err, i) for i in range(len(_LLM_RATE_LIMIT_RETRY_DELAYS))]
+    assert delays == _LLM_RATE_LIMIT_RETRY_DELAYS
+    # Just past the schedule → None (stop retrying).
+    assert _retry_delay_for(err, len(_LLM_RATE_LIMIT_RETRY_DELAYS)) is None
+
+
+def test_other_transient_uses_short_schedule():
+    err = Exception("503 service unavailable")
+    delays = [_retry_delay_for(err, i) for i in range(len(_LLM_RETRY_DELAYS))]
+    assert delays == _LLM_RETRY_DELAYS
+    assert _retry_delay_for(err, len(_LLM_RETRY_DELAYS)) is None
+
+
+def test_non_transient_returns_none():
+    err = Exception("invalid request: bad parameter")
+    assert _retry_delay_for(err, 0) is None
+
+
+def test_rate_limit_total_budget_covers_bedrock_bucket_recovery():
+    """The whole point of the rate-limit schedule: total wait time should
+    exceed the ~60s Bedrock TPM bucket recovery window."""
+    assert len(_LLM_RATE_LIMIT_RETRY_DELAYS) == _MAX_LLM_RETRIES - 1
+    assert sum(_LLM_RATE_LIMIT_RETRY_DELAYS) > 60
tests/unit/test_malformed_args_recovery.py ADDED
@@ -0,0 +1,66 @@
+"""Regression test for the malformed-JSON loop in observatory session
+7750e82f (2026-04-25): GLM-5.1 produced six consecutive ``write`` calls
+whose ``arguments`` strings JSON-parse-failed (truncated mid-stream by
+the provider). The soft retry hint didn't move the model. The detector
+in ``_detect_repeated_malformed`` looks for the streak so the agent loop
+can inject a hard system-prompt forcing a different strategy.
+"""
+
+from litellm import Message
+
+from agent.core.agent_loop import _detect_repeated_malformed
+
+
+def _malformed_tool_msg(name: str, call_id: str) -> Message:
+    return Message(
+        role="tool",
+        content=(
+            f"ERROR: Tool call to '{name}' had malformed JSON arguments and "
+            f"was NOT executed. Retry with smaller content — for 'write', "
+            f"split into multiple smaller writes using 'edit'."
+        ),
+        tool_call_id=call_id,
+        name=name,
+    )
+
+
+def test_two_consecutive_malformed_same_tool_triggers():
+    items = [
+        Message(role="user", content="write a big plan"),
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("write", "1"),
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("write", "2"),
+    ]
+    assert _detect_repeated_malformed(items, threshold=2) == "write"
+
+
+def test_one_malformed_does_not_trigger():
+    items = [
+        Message(role="user", content="write a plan"),
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("write", "1"),
+    ]
+    assert _detect_repeated_malformed(items, threshold=2) is None
+
+
+def test_two_malformed_different_tools_does_not_trigger():
+    items = [
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("write", "1"),
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("bash", "2"),
+    ]
+    assert _detect_repeated_malformed(items, threshold=2) is None
+
+
+def test_streak_broken_by_successful_tool_call_does_not_trigger():
+    items = [
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("write", "1"),
+        Message(role="assistant", content=None),
+        Message(role="tool", content="ok", tool_call_id="2", name="write"),
+        Message(role="assistant", content=None),
+        _malformed_tool_msg("write", "3"),
+    ]
+    assert _detect_repeated_malformed(items, threshold=2) is None
tests/unit/test_sandbox_already_active_message.py ADDED
@@ -0,0 +1,47 @@
+"""Regression test for sandbox_create not surfacing the hardware lockout.
+
+In observatory session d6f8454c (2026-04-25) the agent called
+sandbox_create 18 times across 11 distinct hardware tiers (a10g-large,
+a100-large, t4-small, cpu-upgrade, cpu-basic, zero-a10g, l4x1, t4-medium,
+a10g-small, l40sx1, …). Every call returned 'Sandbox already active' for
+the same sandbox, but the message did not say that hardware can't be
+changed by re-calling, so the agent thought "still pending, retry with a
+different flavor" and burned 17 useless turns.
+
+The fix makes the response explicit when the requested hardware differs
+from what's already active.
+"""
+
+import asyncio
+from types import SimpleNamespace
+
+from agent.tools.sandbox_tool import sandbox_create_handler
+
+
+def _session_with_sandbox():
+    sb = SimpleNamespace(
+        space_id="user/sandbox-abc123",
+        url="https://huggingface.co/spaces/user/sandbox-abc123",
+    )
+    return SimpleNamespace(sandbox=sb)
+
+
+def test_already_active_with_different_hw_warns_about_lockout():
+    session = _session_with_sandbox()
+    out, ok = asyncio.run(
+        sandbox_create_handler({"hardware": "a100-large"}, session=session)
+    )
+    assert ok is True
+    # The message should mention the lockout AND the requested flavor.
+    assert "cannot be changed" in out.lower()
+    assert "a100-large" in out
+    assert "delete" in out.lower()
+
+
+def test_already_active_no_hw_request_just_returns_handle():
+    session = _session_with_sandbox()
+    out, ok = asyncio.run(sandbox_create_handler({}, session=session))
+    assert ok is True
+    assert "user/sandbox-abc123" in out
+    # No spurious lockout note when the agent didn't request a flavor.
+    assert "cannot be changed" not in out.lower()
uv.lock CHANGED
@@ -1799,10 +1799,12 @@ all = [
     { name = "inspect-ai" },
     { name = "pandas" },
     { name = "pytest" },
+    { name = "pytest-asyncio" },
     { name = "tenacity" },
 ]
 dev = [
     { name = "pytest" },
+    { name = "pytest-asyncio" },
 ]
 eval = [
     { name = "datasets" },
@@ -1830,6 +1832,7 @@ requires-dist = [
     { name = "prompt-toolkit", specifier = ">=3.0.0" },
     { name = "pydantic", specifier = ">=2.12.3" },
     { name = "pytest", marker = "extra == 'dev'", specifier = ">=9.0.2" },
+    { name = "pytest-asyncio", marker = "extra == 'dev'", specifier = ">=0.26.0" },
     { name = "python-dotenv", specifier = ">=1.2.1" },
     { name = "requests", specifier = ">=2.33.0" },
     { name = "rich", specifier = ">=13.0.0" },
@@ -2789,6 +2792,19 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/3b/ab/b3226f0bd7cdcf710fbede2b3548584366da3b19b5021e74f5bde2a8fa3f/pytest-9.0.2-py3-none-any.whl", hash = "sha256:711ffd45bf766d5264d487b917733b453d917afd2b0ad65223959f59089f875b", size = 374801, upload-time = "2025-12-06T21:30:49.154Z" },
 ]
 
+[[package]]
+name = "pytest-asyncio"
+version = "1.3.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "pytest" },
+    { name = "typing-extensions", marker = "python_full_version < '3.13'" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/90/2c/8af215c0f776415f3590cac4f9086ccefd6fd463befeae41cd4d3f193e5a/pytest_asyncio-1.3.0.tar.gz", hash = "sha256:d7f52f36d231b80ee124cd216ffb19369aa168fc10095013c6b014a34d3ee9e5", size = 50087, upload-time = "2025-11-10T16:07:47.256Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/e5/35/f8b19922b6a25bc0880171a2f1a003eaeb93657475193ab516fd87cac9da/pytest_asyncio-1.3.0-py3-none-any.whl", hash = "sha256:611e26147c7f77640e6d0a92a38ed17c3e9848063698d5c93d5aa7aa11cebff5", size = 15075, upload-time = "2025-11-10T16:07:45.537Z" },
+]
+
 [[package]]
 name = "python-dateutil"
 version = "2.9.0.post0"