Spaces: smolagents/ml-intern

Aksel Joonas Reedi committed on Apr 22

Commit e2552e8 (unverified) · 1 Parent(s): f30ed48

fix /model switcher with effort params

* /model: probe effort combo on switch, stop passing reasoning_effort to litellm

Switching to claude-opus-4-7 with /effort set would 400 with
"thinking.type.enabled is not supported" because litellm 1.83.0's
Anthropic adapter substring-matches "4.6" to decide which thinking API
shape to send, and doesn't know about 4.7's adaptive + output_config.effort
contract. Rather than maintain a per-model capability table that rots
every time a new Claude family ships, this change trusts the API itself:

- llm_params.py: for anthropic/*, bypass litellm's reasoning_effort -> thinking
mapping and pass thinking={type: adaptive} plus output_config={effort: ...}
as top-level kwargs directly. litellm forwards unknown top-level params
into Anthropic request bodies (extra_body does NOT work here — Anthropic
rejects it as "Extra inputs are not permitted"). One localized monkey-patch
widens litellm 1.83's hardcoded _is_opus_4_6_model check so effort=max
isn't rejected synchronously on 4.7 — removable once litellm ships PR
#25867 upstream.

- effort_probe.py: new probe that fires a tiny ping (max_tokens=16) on
/model switch with the same params we'd use for real, walking a cascade
max -> xhigh -> high -> medium -> low until the provider stops rejecting.
Three outcomes: success (cache the level), thinking-unsupported (cache
None, strip on future calls), inconclusive (switch anyway with warning).
Persistent non-thinking 4xx (auth, model-not-found) bubbles up so
/model rejects the switch and keeps the current model.

- session.py: per-model effective_effort cache + effective_effort_for()
helper. Populated by the probe, read by the real LLM call so resolved
levels don't re-probe on every message. /effort change invalidates.

- agent_loop.py: safety net — if a real call 400s with thinking/effort
config errors mid-conversation (e.g. after /effort change without
re-probe), heal the cache and retry once before propagating.

- main.py: default reasoning_effort = "max" (was "high"); /model runs
the probe and prints (effort: X, Nms); /effort accepts xhigh and max
and shows per-model probed ceilings; SUGGESTED_MODELS includes Opus 4.7.

Live-tested against Opus 4.7, Haiku 4.5, DeepSeek-R1, Qwen3.5-9B,
Llama-3.1-8B, MiMo-V2-Flash, Gemma-4-31B, Arch-Router-1.5B, Kimi-K2.5,
and a non-existent id. All outcomes matched expectations.
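
The per-model cache from the session.py bullet can be sketched roughly as below. This is a minimal illustration, not the code in `agent/core/session.py` (that diff is not shown here): `SessionSketch` and `set_effort_preference` are hypothetical names, and only `model_effective_effort` and `effective_effort_for` come from the commit. The point it shows is that a cached `None` is a real verdict ("strip thinking") and must be told apart from "not probed yet".

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SessionSketch:
    """Hypothetical stand-in for Session; only the effort cache is modeled."""

    # The user's preference, i.e. the ceiling the cascade starts from.
    reasoning_effort: str | None = "max"
    # model id -> probed level; None means "strip thinking for this model".
    model_effective_effort: dict[str, str | None] = field(default_factory=dict)

    def effective_effort_for(self, model: str) -> str | None:
        # Test membership, not truthiness: a cached None is a valid result.
        if model in self.model_effective_effort:
            return self.model_effective_effort[model]
        # Not probed yet: fall back to the raw preference.
        return self.reasoning_effort

    def set_effort_preference(self, level: str | None) -> None:
        # An /effort change invalidates every probed result.
        self.reasoning_effort = level
        self.model_effective_effort.clear()


session = SessionSketch()
session.model_effective_effort["anthropic/claude-opus-4-7"] = "xhigh"
session.model_effective_effort["some-org/no-thinking-model"] = None
```

Using `dict.get(model, default)` with a plain default would also work; the explicit membership test just makes the three states (level, `None`, unprobed) easier to read.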

* Extract /model switcher logic into agent.core.model_switcher

main.py was accumulating model-switch specifics (suggested list, id
format check, HF routing info printer, probe-and-switch flow, commit
helper). Moving them to a dedicated module keeps the REPL dispatcher
focused on input parsing and makes the switcher independently testable.

Net: main.py down ~160 lines, /model handler is now a 10-line delegation.
No behavior change.

Files changed (9)

agent/config.py +9 -8
agent/core/agent_loop.py +74 -1
agent/core/effort_probe.py +229 -0
agent/core/llm_params.py +139 -24
agent/core/model_switcher.py +228 -0
agent/core/session.py +23 -0
agent/main.py +31 -139
agent/tools/research_tool.py +7 -1
agent/utils/terminal_display.py +1 -1

agent/config.py CHANGED

@@ -33,14 +33,15 @@ class Config(BaseModel):
     confirm_cpu_jobs: bool = True
     auto_file_upload: bool = False
-    # Reasoning effort for models that support it (GPT-5 / o-series, Claude
-    # extended thinking, HF reasoning models like MiniMax M2 / Kimi K2).
-    # Defaults to "high" — we'd rather spend tokens thinking than ship a
-    # wrong ML recipe. Users can dial down with `/effort low|medium|off`.
-    # "minimal" is an OpenAI-only level and is normalized to "low" for HF
-    # router models (MiniMax requires ≥low). Ignored for non-reasoning models.
-    # Valid values: None | "minimal" | "low" | "medium" | "high"
-    reasoning_effort: str | None = "high"
+    # Reasoning effort *preference* — the ceiling the user wants. The probe
+    # on `/model` walks a cascade down from here (``max`` → ``xhigh`` → ``high``
+    # → …) and caches per-model what the provider actually accepted in
+    # ``Session.model_effective_effort``. Default ``max`` because we'd rather
+    # burn tokens thinking than ship a wrong ML recipe; the cascade lands on
+    # whichever level the model supports (``high`` for GPT-5 / HF router,
+    # ``xhigh`` or ``max`` for Anthropic 4.6 / 4.7). ``None`` = thinking off.
+    # Valid values: None | "minimal" | "low" | "medium" | "high" | "xhigh" | "max"
+    reasoning_effort: str | None = "max"
 def substitute_env_vars(obj: Any) -> Any:

agent/core/agent_loop.py CHANGED

@@ -136,6 +136,58 @@ def _is_transient_error(error: Exception) -> bool:
     return any(pattern in err_str for pattern in transient_patterns)
 def _friendly_error_message(error: Exception) -> str | None:
     """Return a user-friendly message for known error types, or None to fall back to traceback."""
     err_str = str(error).lower()
@@ -243,6 +295,7 @@ class LLMResult:
 async def _call_llm_streaming(session: Session, messages, tools, llm_params) -> LLMResult:
     """Call the LLM with streaming, emitting assistant_chunk events."""
     response = None
     for _llm_attempt in range(_MAX_LLM_RETRIES):
         try:
             response = await acompletion(
@@ -258,6 +311,14 @@ async def _call_llm_streaming(session: Session, messages, tools, llm_params) ->
         except ContextWindowExceededError:
             raise
         except Exception as e:
             if _llm_attempt < _MAX_LLM_RETRIES - 1 and _is_transient_error(e):
                 _delay = _LLM_RETRY_DELAYS[_llm_attempt]
                 logger.warning(
@@ -328,6 +389,7 @@ async def _call_llm_streaming(session: Session, messages, tools, llm_params) ->
 async def _call_llm_non_streaming(session: Session, messages, tools, llm_params) -> LLMResult:
     """Call the LLM without streaming, emit assistant_message at the end."""
     response = None
     for _llm_attempt in range(_MAX_LLM_RETRIES):
         try:
             response = await acompletion(
@@ -342,6 +404,14 @@ async def _call_llm_non_streaming(session: Session, messages, tools, llm_params)
         except ContextWindowExceededError:
             raise
         except Exception as e:
             if _llm_attempt < _MAX_LLM_RETRIES - 1 and _is_transient_error(e):
                 _delay = _LLM_RETRY_DELAYS[_llm_attempt]
                 logger.warning(
@@ -490,10 +560,13 @@ class Handlers:
             tools = session.tool_router.get_tool_specs_for_llm()
             try:
                 # ── Call the LLM (streaming or non-streaming) ──
                 llm_params = _resolve_llm_params(
                     session.config.model_name,
                     session.hf_token,
-                    reasoning_effort=session.config.reasoning_effort,
                 )
                 if session.stream:
                     llm_result = await _call_llm_streaming(session, messages, tools, llm_params)

     return any(pattern in err_str for pattern in transient_patterns)
+def _is_effort_config_error(error: Exception) -> bool:
+    """Catch the two 400s the effort probe also handles — thinking
+    unsupported for this model, or the specific effort level invalid.
+    This is our safety net for the case where ``/effort`` was changed
+    mid-conversation (which clears the probe cache) and the new level
+    doesn't work for the current model. We heal the cache and retry once.
+    """
+    from agent.core.effort_probe import _is_invalid_effort, _is_thinking_unsupported
+    return _is_thinking_unsupported(error) or _is_invalid_effort(error)
+async def _heal_effort_and_rebuild_params(
+    session: Session, error: Exception, llm_params: dict,
+) -> dict:
+    """Update the session's effort cache based on ``error`` and return new
+    llm_params. Called only when ``_is_effort_config_error(error)`` is True.
+    Two branches:
+      • thinking-unsupported → cache ``None`` for this model, next call
+        strips thinking entirely
+      • invalid-effort → re-run the full cascade probe; the result lands
+        in the cache
+    """
+    from agent.core.effort_probe import ProbeInconclusive, _is_thinking_unsupported, probe_effort
+    model = session.config.model_name
+    if _is_thinking_unsupported(error):
+        session.model_effective_effort[model] = None
+        logger.info("healed: %s doesn't support thinking — stripped", model)
+    else:
+        try:
+            outcome = await probe_effort(
+                model, session.config.reasoning_effort, session.hf_token,
+            )
+            session.model_effective_effort[model] = outcome.effective_effort
+            logger.info(
+                "healed: %s effort cascade → %s", model, outcome.effective_effort,
+            )
+        except ProbeInconclusive:
+            # Transient during healing — strip thinking for safety, next
+            # call will either succeed or surface the real error.
+            session.model_effective_effort[model] = None
+            logger.info("healed: %s probe inconclusive — stripped", model)
+    return _resolve_llm_params(
+        model,
+        session.hf_token,
+        reasoning_effort=session.effective_effort_for(model),
+    )
 def _friendly_error_message(error: Exception) -> str | None:
     """Return a user-friendly message for known error types, or None to fall back to traceback."""
     err_str = str(error).lower()
 async def _call_llm_streaming(session: Session, messages, tools, llm_params) -> LLMResult:
     """Call the LLM with streaming, emitting assistant_chunk events."""
     response = None
+    _healed_effort = False  # one-shot safety net per call
     for _llm_attempt in range(_MAX_LLM_RETRIES):
         try:
             response = await acompletion(
         except ContextWindowExceededError:
             raise
         except Exception as e:
+            if not _healed_effort and _is_effort_config_error(e):
+                _healed_effort = True
+                llm_params = await _heal_effort_and_rebuild_params(session, e, llm_params)
+                await session.send_event(Event(
+                    event_type="tool_log",
+                    data={"tool": "system", "log": "Reasoning effort not supported for this model — adjusting and retrying."},
+                ))
+                continue
             if _llm_attempt < _MAX_LLM_RETRIES - 1 and _is_transient_error(e):
                 _delay = _LLM_RETRY_DELAYS[_llm_attempt]
                 logger.warning(
 async def _call_llm_non_streaming(session: Session, messages, tools, llm_params) -> LLMResult:
     """Call the LLM without streaming, emit assistant_message at the end."""
     response = None
+    _healed_effort = False
     for _llm_attempt in range(_MAX_LLM_RETRIES):
         try:
             response = await acompletion(
         except ContextWindowExceededError:
             raise
         except Exception as e:
+            if not _healed_effort and _is_effort_config_error(e):
+                _healed_effort = True
+                llm_params = await _heal_effort_and_rebuild_params(session, e, llm_params)
+                await session.send_event(Event(
+                    event_type="tool_log",
+                    data={"tool": "system", "log": "Reasoning effort not supported for this model — adjusting and retrying."},
+                ))
+                continue
             if _llm_attempt < _MAX_LLM_RETRIES - 1 and _is_transient_error(e):
                 _delay = _LLM_RETRY_DELAYS[_llm_attempt]
                 logger.warning(
             tools = session.tool_router.get_tool_specs_for_llm()
             try:
                 # ── Call the LLM (streaming or non-streaming) ──
+                # Pull the per-model probed effort from the session cache when
+                # available; fall back to the raw preference for models we
+                # haven't probed yet (e.g. research sub-model).
                 llm_params = _resolve_llm_params(
                     session.config.model_name,
                     session.hf_token,
+                    reasoning_effort=session.effective_effort_for(session.config.model_name),
                 )
                 if session.stream:
                     llm_result = await _call_llm_streaming(session, messages, tools, llm_params)
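
The one-shot safety net added to both call paths can be reduced to the pattern below. This is a sketch under stated assumptions, not the project's code: `EffortRejected`, `call_with_one_heal`, `fake_call`, and `fake_heal` are hypothetical names, and the real loop classifies errors by message text rather than by exception type. It shows why the flag matters: the heal runs at most once per call, so a second effort error propagates instead of looping.

```python
import asyncio


class EffortRejected(Exception):
    """Hypothetical error standing in for a provider 400 about effort."""


async def call_with_one_heal(call, params: dict, heal, max_retries: int = 3):
    """Retry loop with a one-shot heal step, like the _healed_effort flag."""
    healed = False
    for _attempt in range(max_retries):
        try:
            return await call(params)
        except EffortRejected as err:
            if healed:
                raise  # already healed once: surface the error
            healed = True
            # Note: the healed retry consumes one loop iteration.
            params = await heal(params, err)
    raise RuntimeError("retries exhausted")


async def _demo():
    seen = []

    async def fake_call(params):
        seen.append(params["effort"])
        if params["effort"] == "max":
            raise EffortRejected("Invalid effort value: max")
        return "ok"

    async def fake_heal(params, err):
        # Stand-in for re-probing: pretend the cascade landed on "high".
        return {**params, "effort": "high"}

    outcome = await call_with_one_heal(
        fake_call, {"model": "demo", "effort": "max"}, fake_heal
    )
    return outcome, seen


result, seen = asyncio.run(_demo())
```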

agent/core/effort_probe.py ADDED

	@@ -0,0 +1,229 @@

+"""Probe-and-cascade for reasoning effort on /model switch.
+We don't maintain a per-model capability table. Instead, the first time a
+user picks a model we fire a tiny ping (``max_tokens=16``) with the same
+params we'd use for real and walk down a cascade (``max`` → ``xhigh`` →
+``high`` → …)
+until the provider stops rejecting us. The result is cached per-model on
+the session, so real messages don't pay the probe cost again.
+Three outcomes, classified from the 400 error text:
+* success → cache the effort that worked
+* ``"thinking ... not supported"`` → model doesn't do thinking at all;
+  cache ``None`` so we stop sending thinking params
+* ``"effort ... invalid"`` / synonyms → cascade walks down and retries
+Transient errors (5xx, timeout, connection reset) bubble out as
+``ProbeInconclusive`` so the caller can complete the switch with a
+warning instead of blocking on a flaky provider.
+"""
+from __future__ import annotations
+import asyncio
+import logging
+from dataclasses import dataclass
+from litellm import acompletion
+from agent.core.llm_params import UnsupportedEffortError, _resolve_llm_params
+logger = logging.getLogger(__name__)
+# Cascade: for each user-stated preference, the ordered list of levels to
+# try. First success wins. ``max`` / ``xhigh`` are Anthropic-only; for other
+# providers ``_resolve_llm_params(strict=True)`` raises
+# ``UnsupportedEffortError`` synchronously (no wasted network round-trip)
+# and we advance to the next level.
+_EFFORT_CASCADE: dict[str, list[str]] = {
+    "max":     ["max", "xhigh", "high", "medium", "low"],
+    "xhigh":   ["xhigh", "high", "medium", "low"],
+    "high":    ["high", "medium", "low"],
+    "medium":  ["medium", "low"],
+    "minimal": ["minimal", "low"],
+    "low":     ["low"],
+}
+_PROBE_TIMEOUT = 15.0
+_PROBE_MAX_TOKENS = 16
+class ProbeInconclusive(Exception):
+    """The probe couldn't reach a verdict (transient network / provider error).
+    Caller should complete the switch with a warning — the next real call
+    will re-surface the error if it's persistent.
+    """
+@dataclass
+class ProbeOutcome:
+    """What the probe learned. ``effective_effort`` semantics match the cache:
+    * str → send this level
+    * None → model doesn't support thinking; strip it
+    """
+    effective_effort: str | None
+    attempts: int
+    elapsed_ms: int
+    note: str | None = None  # e.g. "max not supported, falling back"
+def _is_thinking_unsupported(e: Exception) -> bool:
+    """Model rejected any thinking config.
+    Matches Anthropic's 'thinking.type.enabled is not supported for this
+    model' as well as the adaptive variant. Substring-match because the
+    exact wording shifts across API versions.
+    """
+    s = str(e).lower()
+    return "thinking" in s and "not supported" in s
+def _is_invalid_effort(e: Exception) -> bool:
+    """The requested effort level isn't accepted for this model.
+    Covers both API responses (Anthropic/OpenAI 400 with "invalid", "must
+    be one of", etc.) and LiteLLM's local validation that fires *before*
+    the request (e.g. "effort='max' is only supported by Claude Opus 4.6"
+    — LiteLLM knows max is Opus-4.6-only and raises synchronously). The
+    cascade walks down on either.
+    Explicitly returns False when the message is really about thinking
+    itself (e.g. Anthropic's 4.7 error mentions ``output_config.effort``
+    in its fix hint, but the actual failure is ``thinking.type.enabled``
+    being unsupported). That case is caught by ``_is_thinking_unsupported``.
+    """
+    if _is_thinking_unsupported(e):
+        return False
+    s = str(e).lower()
+    if "effort" not in s and "output_config" not in s:
+        return False
+    return any(
+        phrase in s
+        for phrase in (
+            "invalid", "not supported", "must be one of", "not a valid",
+            "unrecognized", "unknown",
+            # LiteLLM's own pre-flight validation phrasing.
+            "only supported by", "is only supported",
+        )
+    )
+def _is_transient(e: Exception) -> bool:
+    """Network / provider-side flake. Keep in sync with agent_loop's list.
+    Also matches by type for ``asyncio.TimeoutError`` — its ``str(e)`` is
+    empty, so substring matching alone misses it.
+    """
+    if isinstance(e, (asyncio.TimeoutError, TimeoutError)):
+        return True
+    s = str(e).lower()
+    return any(
+        p in s
+        for p in (
+            "timeout", "timed out", "429", "rate limit",
+            "503", "service unavailable", "502", "bad gateway",
+            "500", "internal server error", "overloaded", "capacity",
+            "connection reset", "connection refused", "connection error",
+            "eof", "broken pipe",
+        )
+    )
+async def probe_effort(
+    model_name: str,
+    preference: str | None,
+    hf_token: str | None,
+) -> ProbeOutcome:
+    """Walk the cascade for ``preference`` on ``model_name``.
+    Returns the first effort the provider accepts, or ``None`` if it
+    rejects thinking altogether. Raises ``ProbeInconclusive`` only for
+    transient errors (5xx, timeout) — persistent 4xx that aren't thinking/
+    effort related bubble as the original exception so callers can surface
+    them (auth, model-not-found, quota, etc.).
+    """
+    loop = asyncio.get_event_loop()
+    start = loop.time()
+    attempts = 0
+    if not preference:
+        # User explicitly turned effort off — nothing to probe. A bare
+        # ping with no thinking params is pointless; just report "off".
+        return ProbeOutcome(effective_effort=None, attempts=0, elapsed_ms=0)
+    cascade = _EFFORT_CASCADE.get(preference, [preference])
+    skipped: list[str] = []  # levels the provider rejected synchronously
+    last_error: Exception | None = None
+    for effort in cascade:
+        try:
+            params = _resolve_llm_params(
+                model_name, hf_token, reasoning_effort=effort, strict=True,
+            )
+        except UnsupportedEffortError:
+            # Provider can't even accept this effort name (e.g. "max" on
+            # HF router). Skip without a network call.
+            skipped.append(effort)
+            continue
+        attempts += 1
+        try:
+            await asyncio.wait_for(
+                acompletion(
+                    messages=[{"role": "user", "content": "ping"}],
+                    max_tokens=_PROBE_MAX_TOKENS,
+                    stream=False,
+                    **params,
+                ),
+                timeout=_PROBE_TIMEOUT,
+            )
+        except Exception as e:
+            last_error = e
+            if _is_thinking_unsupported(e):
+                elapsed = int((loop.time() - start) * 1000)
+                return ProbeOutcome(
+                    effective_effort=None,
+                    attempts=attempts,
+                    elapsed_ms=elapsed,
+                    note="model doesn't support reasoning, dropped",
+                )
+            if _is_invalid_effort(e):
+                logger.debug("probe: %s rejected effort=%s, trying next", model_name, effort)
+                continue
+            if _is_transient(e):
+                raise ProbeInconclusive(str(e)) from e
+            # Persistent non-thinking 4xx (auth, quota, model-not-found) —
+            # let the caller classify & surface.
+            raise
+        else:
+            elapsed = int((loop.time() - start) * 1000)
+            note = None
+            if effort != preference:
+                note = f"{preference} not supported, using {effort}"
+            return ProbeOutcome(
+                effective_effort=effort,
+                attempts=attempts,
+                elapsed_ms=elapsed,
+                note=note,
+            )
+    # Cascade exhausted without a success. This only happens when every
+    # level was either rejected synchronously (``UnsupportedEffortError``,
+    # e.g. preference=max on HF and we also somehow filtered all others)
+    # or the provider 400'd ``invalid effort`` on every level.
+    elapsed = int((loop.time() - start) * 1000)
+    if last_error is not None and not _is_invalid_effort(last_error):
+        raise last_error
+    note = (
+        "no effort level accepted — proceeding without thinking"
+        if not skipped
+        else f"provider rejected all efforts ({', '.join(skipped)})"
+    )
+    return ProbeOutcome(
+        effective_effort=None,
+        attempts=attempts,
+        elapsed_ms=elapsed,
+        note=note,
+    )
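
To see the cascade and the error classification working together without a network call, here is a stripped-down sketch with a fake provider. It is an approximation: `walk_cascade` and `make_provider` are hypothetical helpers, the phrase list is shortened, and timing and transient handling are left out. The two error strings are the ones quoted in this commit.

```python
import asyncio

CASCADE = {
    "max": ["max", "xhigh", "high", "medium", "low"],
    "xhigh": ["xhigh", "high", "medium", "low"],
    "high": ["high", "medium", "low"],
    "medium": ["medium", "low"],
    "minimal": ["minimal", "low"],
    "low": ["low"],
}


def is_thinking_unsupported(err: Exception) -> bool:
    s = str(err).lower()
    return "thinking" in s and "not supported" in s


def is_invalid_effort(err: Exception) -> bool:
    # Thinking errors win, even if the message also mentions effort.
    if is_thinking_unsupported(err):
        return False
    s = str(err).lower()
    phrases = ("invalid", "not supported", "must be one of")
    return "effort" in s and any(p in s for p in phrases)


async def walk_cascade(preference, send):
    """Return (effective_effort, attempts); `send` is a fake provider call."""
    attempts = 0
    for level in CASCADE.get(preference, [preference]):
        attempts += 1
        try:
            await send(level)
        except Exception as err:
            if is_thinking_unsupported(err):
                return None, attempts  # model has no thinking at all
            if is_invalid_effort(err):
                continue  # try the next, lower level
            raise  # auth, model-not-found, etc.
        return level, attempts
    return None, attempts


def make_provider(accepted, thinking=True):
    async def send(level):
        if not thinking:
            raise ValueError("thinking.type.enabled is not supported for this model")
        if level not in accepted:
            raise ValueError(f"Invalid effort value: {level}")
    return send


high_only = asyncio.run(walk_cascade("max", make_provider({"low", "medium", "high"})))
no_thinking = asyncio.run(walk_cascade("max", make_provider(set(), thinking=False)))
```

With a provider that accepts only `low`/`medium`/`high`, a `max` preference settles on `high` after three attempts; a provider without thinking stops the walk on the first attempt.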

agent/core/llm_params.py CHANGED

@@ -8,41 +8,122 @@ creating circular imports.
 import os
-# HF router reasoning models only accept "low" | "medium" | "high" (e.g.
-# MiniMax M2 actually *requires* reasoning to be enabled). OpenAI's GPT-5
-# also accepts "minimal" for near-zero thinking. We map "minimal" to "low"
-# for HF so the user doesn't get a 400.
-_HF_ALLOWED_EFFORTS = {"low", "medium", "high"}
 def _resolve_llm_params(
     model_name: str,
     session_hf_token: str | None = None,
     reasoning_effort: str | None = None,
 ) -> dict:
     """
     Build LiteLLM kwargs for a given model id.
-    • ``anthropic/<model>`` / ``openai/<model>`` — passed straight through; the
-      user's own ``ANTHROPIC_API_KEY`` / ``OPENAI_API_KEY`` env vars are picked
-      up by LiteLLM. ``reasoning_effort`` is forwarded as a top-level param
-      (GPT-5 / o-series accept "minimal" | "low" | "medium" | "high"; Claude
-      extended-thinking models accept "low" | "medium" | "high" and LiteLLM
-      translates to the thinking config).
     • Anything else is treated as a HuggingFace router id. We hit the
       auto-routing OpenAI-compatible endpoint at
-      ``https://router.huggingface.co/v1``, which bypasses LiteLLM's stale
-      per-provider HF adapter entirely. The id can be bare or carry an HF
-      routing suffix:
-          MiniMaxAI/MiniMax-M2.7              # auto = fastest + failover
-          MiniMaxAI/MiniMax-M2.7:cheapest
-          moonshotai/Kimi-K2.6:novita         # pin a specific provider
-      A leading ``huggingface/`` is stripped for convenience. ``reasoning_effort``
-      is forwarded via ``extra_body`` (LiteLLM's OpenAI adapter refuses it as a
-      top-level kwarg for non-OpenAI models). "minimal" is normalized to "low".
     Token precedence (first non-empty wins):
       1. INFERENCE_TOKEN env — shared key on the hosted Space (inference is
@@ -50,10 +131,39 @@ def _resolve_llm_params(
       2. session.hf_token — the user's own token (CLI / OAuth / cache file).
       3. HF_TOKEN env — belt-and-suspenders fallback for CLI users.
     """
-    if model_name.startswith(("anthropic/", "openai/")):
         params: dict = {"model": model_name}
         if reasoning_effort:
-            params["reasoning_effort"] = reasoning_effort
         return params
     hf_model = model_name.removeprefix("huggingface/")
@@ -72,6 +182,11 @@ def _resolve_llm_params(
         params["extra_headers"] = {"X-HF-Bill-To": bill_to}
     if reasoning_effort:
         hf_level = "low" if reasoning_effort == "minimal" else reasoning_effort
-        if hf_level in _HF_ALLOWED_EFFORTS:
             params["extra_body"] = {"reasoning_effort": hf_level}
     return params

 import os
+def _patch_litellm_effort_validation() -> None:
+    """Neuter LiteLLM 1.83's hardcoded effort-level validation.
+    Context: at ``litellm/llms/anthropic/chat/transformation.py:~1443`` the
+    Anthropic adapter validates ``output_config.effort ∈ {high, medium,
+    low, max}`` and gates ``max`` behind an ``_is_opus_4_6_model`` check
+    that only matches the substring ``opus-4-6`` / ``opus_4_6``. Result:
+    * ``xhigh`` — valid on Anthropic's real API for Claude 4.7 — is
+      rejected pre-flight with "Invalid effort value: xhigh".
+    * ``max`` on Opus 4.7 is rejected with "effort='max' is only supported
+      by Claude Opus 4.6", even though Opus 4.7 accepts it in practice.
+    We don't want to maintain a parallel model table, so we let the
+    Anthropic API itself be the validator: widen ``_is_opus_4_6_model``
+    to also match the ``opus-4-7`` family. The patch below touches only
+    that check. If LiteLLM or Anthropic still rejects an effort level,
+    the cascade walks down, which is the behavior we want for any
+    future model family.
+    Removable once litellm ships 1.83.8-stable (which merges PR #25867,
+    "Litellm day 0 opus 4.7 support") — see commit 0868a82 on their main
+    branch. Until then, this one-time patch is the escape hatch.
+    """
+    try:
+        from litellm.llms.anthropic.chat import transformation as _t
+    except Exception:
+        return
+    cfg = getattr(_t, "AnthropicConfig", None)
+    if cfg is None:
+        return
+    original = getattr(cfg, "_is_opus_4_6_model", None)
+    if original is None or getattr(original, "_hf_agent_patched", False):
+        return
+    def _widened(model: str) -> bool:
+        m = model.lower()
+        # Original 4.6 match plus the Opus 4.7 spellings; a later family
+        # needs a new entry here. We only need this to return True for
+        # families where "max" / "xhigh" are acceptable at the API; the
+        # cascade handles the case when they're not.
+        return any(
+            v in m for v in (
+                "opus-4-6", "opus_4_6", "opus-4.6", "opus_4.6",
+                "opus-4-7", "opus_4_7", "opus-4.7", "opus_4.7",
+            )
+        )
+    _widened._hf_agent_patched = True  # type: ignore[attr-defined]
+    cfg._is_opus_4_6_model = staticmethod(_widened)
+_patch_litellm_effort_validation()
+# Effort levels accepted on the wire.
+#   Anthropic (4.6+):  low | medium | high | xhigh | max   (output_config.effort)
+#   OpenAI direct:     minimal | low | medium | high       (reasoning_effort top-level)
+#   HF router:         low | medium | high                 (extra_body.reasoning_effort)
+#
+# We validate *shape* here and let the probe cascade walk down on rejection;
+# we deliberately do NOT maintain a per-model capability table.
+_ANTHROPIC_EFFORTS = {"low", "medium", "high", "xhigh", "max"}
+_OPENAI_EFFORTS = {"minimal", "low", "medium", "high"}
+_HF_EFFORTS = {"low", "medium", "high"}
+class UnsupportedEffortError(ValueError):
+    """The requested effort isn't valid for this provider's API surface.
+    Raised synchronously before any network call so the probe cascade can
+    skip levels the provider can't accept (e.g. ``max`` on HF router).
+    """
 def _resolve_llm_params(
     model_name: str,
     session_hf_token: str | None = None,
     reasoning_effort: str | None = None,
+    strict: bool = False,
 ) -> dict:
     """
     Build LiteLLM kwargs for a given model id.
+    • ``anthropic/<model>`` — native thinking config. We bypass LiteLLM's
+      ``reasoning_effort`` → ``thinking`` mapping (which lags new Claude
+      releases like 4.7 and sends the wrong API shape). Instead we pass
+      both ``thinking={"type": "adaptive"}`` and ``output_config=
+      {"effort": <level>}`` as top-level kwargs — LiteLLM's Anthropic
+      adapter forwards unknown top-level kwargs into the request body
+      verbatim (confirmed by live probe; ``extra_body`` does NOT work
+      here because Anthropic's API rejects it as "Extra inputs are not
+      permitted"). This is the stable API for 4.6 and 4.7. Older
+      extended-thinking models that only accept ``thinking.type.enabled``
+      will reject this; the probe's cascade catches that and falls back
+      to no thinking.
+    • ``openai/<model>`` — ``reasoning_effort`` forwarded as a top-level
+      kwarg (GPT-5 / o-series). LiteLLM uses the user's ``OPENAI_API_KEY``.
     • Anything else is treated as a HuggingFace router id. We hit the
       auto-routing OpenAI-compatible endpoint at
+      ``https://router.huggingface.co/v1``. The id can be bare or carry an
+      HF routing suffix (``:fastest`` / ``:cheapest`` / ``:<provider>``).
+      A leading ``huggingface/`` is stripped. ``reasoning_effort`` is
+      forwarded via ``extra_body`` (LiteLLM's OpenAI adapter refuses it as
+      a top-level kwarg for non-OpenAI models). "minimal" normalizes to
+      "low".
+    ``strict=True`` raises ``UnsupportedEffortError`` when the requested
+    effort isn't in the provider's accepted set, instead of silently
+    dropping it. The probe cascade uses strict mode so it can walk down
+    (``max`` → ``xhigh`` → ``high`` …) without making an API call. Regular
+    runtime callers leave ``strict=False``, so a stale cached effort
+    can't crash a turn — it just doesn't get sent.
     Token precedence (first non-empty wins):
       1. INFERENCE_TOKEN env — shared key on the hosted Space (inference is
       2. session.hf_token — the user's own token (CLI / OAuth / cache file).
       3. HF_TOKEN env — belt-and-suspenders fallback for CLI users.
     """
+    if model_name.startswith("anthropic/"):
         params: dict = {"model": model_name}
         if reasoning_effort:
+            level = reasoning_effort
+            if level == "minimal":
+                level = "low"
+            if level not in _ANTHROPIC_EFFORTS:
+                if strict:
+                    raise UnsupportedEffortError(
+                        f"Anthropic doesn't accept effort={level!r}"
+                    )
+            else:
+                # Adaptive thinking + output_config.effort is the stable
+                # Anthropic API for Claude 4.6 / 4.7. Both kwargs are
+                # passed top-level: LiteLLM forwards unknown params into
+                # the request body for Anthropic, so ``output_config``
+                # reaches the API. ``extra_body`` does NOT work here —
+                # Anthropic rejects it as "Extra inputs are not
+                # permitted".
+                params["thinking"] = {"type": "adaptive"}
+                params["output_config"] = {"effort": level}
+        return params
+    if model_name.startswith("openai/"):
+        params = {"model": model_name}
+        if reasoning_effort:
+            if reasoning_effort not in _OPENAI_EFFORTS:
+                if strict:
+                    raise UnsupportedEffortError(
+                        f"OpenAI doesn't accept effort={reasoning_effort!r}"
+                    )
+            else:
+                params["reasoning_effort"] = reasoning_effort
         return params
     hf_model = model_name.removeprefix("huggingface/")
         params["extra_headers"] = {"X-HF-Bill-To": bill_to}
     if reasoning_effort:
         hf_level = "low" if reasoning_effort == "minimal" else reasoning_effort
+        if hf_level not in _HF_EFFORTS:
+            if strict:
+                raise UnsupportedEffortError(
+                    f"HF router doesn't accept effort={hf_level!r}"
+                )
+        else:
             params["extra_body"] = {"reasoning_effort": hf_level}
     return params
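
The three provider branches differ only in where the effort level ends up in the kwargs. The sketch below isolates that shaping; it is a simplification, not the real `_resolve_llm_params`. `effort_kwargs` and `UnsupportedEffort` are hypothetical names, `openai/gpt-5` is an illustrative model id, and tokens, `api_base`, and headers are omitted.

```python
from __future__ import annotations

ANTHROPIC = {"low", "medium", "high", "xhigh", "max"}
OPENAI = {"minimal", "low", "medium", "high"}
HF = {"low", "medium", "high"}


class UnsupportedEffort(ValueError):
    """Hypothetical counterpart of UnsupportedEffortError."""


def effort_kwargs(model: str, effort: str | None, strict: bool = False) -> dict:
    """Return only the effort-related kwargs for a model id."""
    if not effort:
        return {}
    if model.startswith("anthropic/"):
        level = "low" if effort == "minimal" else effort
        allowed = ANTHROPIC
        shaped = {"thinking": {"type": "adaptive"},
                  "output_config": {"effort": level}}
    elif model.startswith("openai/"):
        level, allowed = effort, OPENAI
        shaped = {"reasoning_effort": level}
    else:
        # Everything else is treated as an HF router id.
        level = "low" if effort == "minimal" else effort
        allowed = HF
        shaped = {"extra_body": {"reasoning_effort": level}}
    if level not in allowed:
        if strict:
            raise UnsupportedEffort(f"{model} doesn't accept effort={level!r}")
        return {}  # non-strict: silently send no effort params
    return shaped


anthropic_kwargs = effort_kwargs("anthropic/claude-opus-4-7", "max")
openai_kwargs = effort_kwargs("openai/gpt-5", "minimal")
hf_kwargs = effort_kwargs("MiniMaxAI/MiniMax-M2.7", "max")
```

Non-strict mode returning `{}` is what keeps a stale cached level from crashing a turn, while strict mode gives the probe a cheap local signal to skip a level.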

agent/core/model_switcher.py ADDED

	@@ -0,0 +1,228 @@

+"""Model-switching logic for the interactive CLI's ``/model`` command.
+Split out of ``agent.main`` so the REPL dispatcher stays focused on input
+parsing. Exposes:
+* ``SUGGESTED_MODELS`` — the short list shown by ``/model`` with no arg.
+* ``is_valid_model_id`` — loose format check on user input.
+* ``probe_and_switch_model`` — async: checks routing, fires a 1-token
+  probe to resolve the effort cascade, then commits the switch (or
+  rejects it on hard error).
+The probe's cascade lives in ``agent.core.effort_probe``; this module
+glues it to CLI output + session state.
+"""
+from __future__ import annotations
+from agent.core.effort_probe import ProbeInconclusive, probe_effort
+# Suggested models shown by `/model` (not a gate). Users can paste any HF
+# model id (e.g. "MiniMaxAI/MiniMax-M2.7") or an `anthropic/` / `openai/`
+# prefix for direct API access. For HF ids, append ":fastest" /
+# ":cheapest" / ":preferred" / ":<provider>" to override the default
+# routing policy (auto = fastest with failover).
+SUGGESTED_MODELS = [
+    {"id": "anthropic/claude-opus-4-7", "label": "Claude Opus 4.7"},
+    {"id": "anthropic/claude-opus-4-6", "label": "Claude Opus 4.6"},
+    {"id": "MiniMaxAI/MiniMax-M2.7", "label": "MiniMax M2.7"},
+    {"id": "moonshotai/Kimi-K2.6", "label": "Kimi K2.6"},
+    {"id": "zai-org/GLM-5.1", "label": "GLM 5.1"},
+]
+_ROUTING_POLICIES = {"fastest", "cheapest", "preferred"}
+def is_valid_model_id(model_id: str) -> bool:
+    """Loose format check — lets users pick any model id.
+    Accepts:
+      • anthropic/<model>
+      • openai/<model>
+      • <org>/<model>[:<tag>]            (HF router; tag = provider or policy)
+      • huggingface/<org>/<model>[:<tag>] (same, accepts legacy prefix)
+    Actual availability is verified against the HF router catalog on
+    switch, and by the provider on the probe's ping call.
+    """
+    if not model_id or "/" not in model_id:
+        return False
+    head = model_id.split(":", 1)[0]
+    parts = head.split("/")
+    return len(parts) >= 2 and all(parts)
+def _print_hf_routing_info(model_id: str, console) -> bool:
+    """Show HF router catalog info (providers, price, context, tool support)
+    for an HF-router model id. Returns ``True`` to signal the caller can
+    proceed with the switch, ``False`` to indicate a hard problem the user
+    should notice before we fire the effort probe. Every path currently
+    returns ``True``: catalog problems are printed as warnings only.
+    Anthropic / OpenAI ids return ``True`` without printing anything;
+    the probe below covers "does this model exist".
+    """
+    if model_id.startswith(("anthropic/", "openai/")):
+        return True
+    from agent.core import hf_router_catalog as cat
+    bare, _, tag = model_id.partition(":")
+    info = cat.lookup(bare)
+    if info is None:
+        console.print(
+            f"[bold red]Warning:[/bold red] '{bare}' isn't in the HF router "
+            "catalog. Checking anyway — first call may fail."
+        )
+        suggestions = cat.fuzzy_suggest(bare)
+        if suggestions:
+            console.print(f"[dim]Did you mean: {', '.join(suggestions)}[/dim]")
+        return True
+    live = info.live_providers
+    if not live:
+        console.print(
+            f"[bold red]Warning:[/bold red] '{bare}' has no live providers "
+            "right now. First call will likely fail."
+        )
+        return True
+    if tag and tag not in _ROUTING_POLICIES:
+        matched = [p for p in live if p.provider == tag]
+        if not matched:
+            names = ", ".join(p.provider for p in live)
+            console.print(
+                f"[bold red]Warning:[/bold red] provider '{tag}' doesn't serve "
+                f"'{bare}'. Live providers: {names}. Checking anyway."
+            )
+    if not info.any_supports_tools:
+        console.print(
+            f"[bold red]Warning:[/bold red] no provider for '{bare}' advertises "
+            "tool-call support. This agent relies on tool calls — expect errors."
+        )
+    if tag in _ROUTING_POLICIES:
+        policy = tag
+    elif tag:
+        policy = f"pinned to {tag}"
+    else:
+        policy = "auto (fastest)"
+    console.print(f"  [dim]routing: {policy}[/dim]")
+    for p in live:
+        price = (
+            f"${p.input_price:g}/${p.output_price:g} per M tok"
+            if p.input_price is not None and p.output_price is not None
+            else "price n/a"
+        )
+        ctx = f"{p.context_length:,} ctx" if p.context_length else "ctx n/a"
+        tools = "tools" if p.supports_tools else "no tools"
+        console.print(
+            f"  [dim]{p.provider}: {price}, {ctx}, {tools}[/dim]"
+        )
+    return True
+def print_model_listing(config, console) -> None:
+    """Render the default ``/model`` (no-arg) view: current + suggested."""
+    current = config.model_name if config else ""
+    console.print("[bold]Current model:[/bold]")
+    console.print(f"  {current}")
+    console.print("\n[bold]Suggested:[/bold]")
+    for m in SUGGESTED_MODELS:
+        marker = " [dim]<-- current[/dim]" if m["id"] == current else ""
+        console.print(f"  {m['id']}  [dim]({m['label']})[/dim]{marker}")
+    console.print(
+        "\n[dim]Paste any HF model id (e.g. 'MiniMaxAI/MiniMax-M2.7').\n"
+        "Add ':fastest', ':cheapest', ':preferred', or ':<provider>' to override routing.\n"
+        "Use 'anthropic/<model>' or 'openai/<model>' for direct API access.[/dim]"
+    )
+def print_invalid_id(arg: str, console) -> None:
+    console.print(f"[bold red]Invalid model id format:[/bold red] {arg}")
+    console.print(
+        "[dim]Expected:\n"
+        "  • <org>/<model>[:tag]    (HF router — paste from huggingface.co)\n"
+        "  • anthropic/<model>\n"
+        "  • openai/<model>[/dim]"
+    )
+async def probe_and_switch_model(
+    model_id: str,
+    config,
+    session,
+    console,
+    hf_token: str | None,
+) -> None:
+    """Validate model+effort with a 1-token ping, cache the effective effort,
+    then commit the switch.
+    Three visible outcomes:
+    * ✓ ``effort: <level>`` — model accepted the preferred effort (or a
+      fallback from the cascade; the note explains if so)
+    * ✓ ``effort: off`` — model doesn't support thinking; we'll strip it
+    * ✗ hard error (auth, model-not-found, quota) — we reject the switch
+      and keep the current model so the user isn't stranded
+    Transient errors (5xx, timeout) complete the switch with a yellow
+    warning; the next real call re-surfaces the error if it's persistent.
+    """
+    preference = config.reasoning_effort
+    if not _print_hf_routing_info(model_id, console):
+        return
+    if not preference:
+        # Nothing to validate with a ping that we couldn't validate on the
+        # first real call just as cheaply. Skip the probe entirely.
+        _commit_switch(model_id, config, session, effective=None, cache=False)
+        console.print(f"[green]Model switched to {model_id}[/green] [dim](effort: off)[/dim]")
+        return
+    console.print(f"[dim]checking {model_id} (effort: {preference})...[/dim]")
+    try:
+        outcome = await probe_effort(model_id, preference, hf_token)
+    except ProbeInconclusive as e:
+        _commit_switch(model_id, config, session, effective=None, cache=False)
+        console.print(
+            f"[yellow]Model switched to {model_id}[/yellow] "
+            f"[dim](couldn't validate: {e}; will verify on first message)[/dim]"
+        )
+        return
+    except Exception as e:
+        # Hard persistent error — auth, unknown model, quota. Don't switch.
+        console.print(f"[bold red]Switch failed:[/bold red] {e}")
+        console.print(f"[dim]Keeping current model: {config.model_name}[/dim]")
+        return
+    _commit_switch(
+        model_id, config, session,
+        effective=outcome.effective_effort, cache=True,
+    )
+    effort_label = outcome.effective_effort or "off"
+    suffix = f" — {outcome.note}" if outcome.note else ""
+    console.print(
+        f"[green]Model switched to {model_id}[/green] "
+        f"[dim](effort: {effort_label}{suffix}, {outcome.elapsed_ms}ms)[/dim]"
+    )
+def _commit_switch(model_id, config, session, effective, cache: bool) -> None:
+    """Apply the switch to the session (or bare config if no session yet).
+    ``effective`` is the probe's resolved effort; ``cache=True`` stores it
+    in the session's per-model cache so real calls use the resolved level
+    instead of re-probing. ``cache=False`` (inconclusive probe / effort
+    off) drops any stale entry for this model, so the next call falls back
+    to the raw preference.
+    """
+    if session is not None:
+        session.update_model(model_id)
+        if cache:
+            session.model_effective_effort[model_id] = effective
+        else:
+            session.model_effective_effort.pop(model_id, None)
+    else:
+        config.model_name = model_id
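
To make the id format concrete, here is a small standalone sketch of the loose validity check and the tag-to-routing-label mapping from the file above. `is_valid_model_id` is copied from the diff; `describe_routing` is a hypothetical helper that isolates the `policy` branch of `_print_hf_routing_info`, and `some-provider` is a made-up provider tag.

```python
ROUTING_POLICIES = {"fastest", "cheapest", "preferred"}


def is_valid_model_id(model_id: str) -> bool:
    """Loose structural check: at least <a>/<b>, no empty path segments."""
    if not model_id or "/" not in model_id:
        return False
    head = model_id.split(":", 1)[0]  # drop any :tag suffix
    parts = head.split("/")
    return len(parts) >= 2 and all(parts)


def describe_routing(model_id: str) -> str:
    """Return the routing label the switcher would print for this id."""
    _bare, _, tag = model_id.partition(":")
    if tag in ROUTING_POLICIES:
        return tag
    if tag:
        return f"pinned to {tag}"
    return "auto (fastest)"


if __name__ == "__main__":
    for candidate in (
        "anthropic/claude-opus-4-7",
        "MiniMaxAI/MiniMax-M2.7:cheapest",
        "zai-org/GLM-5.1:some-provider",
        "no-slash",
        "org/",
    ):
        ok = is_valid_model_id(candidate)
        label = describe_routing(candidate) if ok else "rejected"
        print(f"{candidate}: {label}")
```

The check is structural only: an id such as `a/b/c` passes, and whether the model exists is left to the catalog lookup and the probe.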

agent/core/session.py CHANGED

@@ -109,6 +109,16 @@ class Session:
         self.turn_count: int = 0
         self.last_auto_save_turn: int = 0
     async def send_event(self, event: Event) -> None:
         """Send event back to client and log to trajectory"""
         await self.event_queue.put(event)
@@ -139,6 +149,19 @@ class Session:
         self.config.model_name = model_name
         self.context_manager.model_max_tokens = _get_max_tokens_safe(model_name)
     def increment_turn(self) -> None:
         """Increment turn counter (called after each user interaction)"""
         self.turn_count += 1

         self.turn_count: int = 0
         self.last_auto_save_turn: int = 0
+        # Per-model probed reasoning-effort cache. Populated by the probe
+        # on /model switch, read by ``effective_effort_for`` below. Keys are
+        # raw model ids (including any ``:tag``). Values:
+        #   str  → the effort level to send (may be a downgrade from the
+        #          preference, e.g. "high" when user asked for "max")
+        #   None → model rejected all efforts in the cascade; send no
+        #          thinking params at all
+        # Key absent → not probed yet; fall back to the raw preference.
+        self.model_effective_effort: dict[str, str | None] = {}
     async def send_event(self, event: Event) -> None:
         """Send event back to client and log to trajectory"""
         await self.event_queue.put(event)
         self.config.model_name = model_name
         self.context_manager.model_max_tokens = _get_max_tokens_safe(model_name)
+    def effective_effort_for(self, model_name: str) -> str | None:
+        """Resolve the effort level to actually send for ``model_name``.
+        Returns the probed result when we have one (may be ``None`` meaning
+        "model doesn't do thinking, strip it"), else the raw preference.
+        Unknown-model case falls back to the preference so a stale cache
+        from a prior ``/model`` can't poison research sub-calls that use a
+        different model id.
+        """
+        if model_name in self.model_effective_effort:
+            return self.model_effective_effort[model_name]
+        return self.config.reasoning_effort
     def increment_turn(self) -> None:
         """Increment turn counter (called after each user interaction)"""
         self.turn_count += 1
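
The three cache states described in the comment above (string, `None`, key absent) are easy to mix up, so the following sketch exercises each one. `EffortCache` is a hypothetical stand-in for the relevant slice of `Session`; `set_preference` mirrors the cache invalidation that `/effort` performs in `agent/main.py`.

```python
from __future__ import annotations


class EffortCache:
    """Minimal stand-in for Session's per-model effective-effort cache."""

    def __init__(self, preference: str | None) -> None:
        self.preference = preference
        self.probed: dict[str, str | None] = {}

    def effective_effort_for(self, model_name: str) -> str | None:
        # Membership test, not .get(): a cached None is a real answer
        # ("send no thinking params") and must not fall through.
        if model_name in self.probed:
            return self.probed[model_name]
        return self.preference

    def set_preference(self, level: str | None) -> None:
        # A new preference may resolve differently, so drop every probed result.
        self.preference = level
        self.probed.clear()


if __name__ == "__main__":
    cache = EffortCache(preference="max")
    cache.probed["anthropic/claude-opus-4-7"] = "high"  # downgraded by the probe
    cache.probed["org/no-thinking-model"] = None        # thinking unsupported

    print(cache.effective_effort_for("anthropic/claude-opus-4-7"))
    print(cache.effective_effort_for("org/no-thinking-model"))
    print(cache.effective_effort_for("org/never-probed"))
```

Using `in` rather than `dict.get` is the design point: `get` would conflate "probed and unsupported" with "never probed" and resend the raw preference to a model that already rejected it.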

agent/main.py CHANGED

@@ -22,6 +22,7 @@ from prompt_toolkit import PromptSession
 from agent.config import load_config
 from agent.core.agent_loop import submission_loop
 from agent.core.session import OpType
 from agent.core.tools import ToolRouter
 from agent.utils.reliability_checks import check_training_script_save_pattern
@@ -49,39 +50,6 @@ litellm.drop_params = True
 # on every error — users don't need it, and our friendly errors cover the case.
 litellm.suppress_debug_info = True
-# ── Suggested models shown by `/model` (not a gate) ──────────────────────
-# Users can paste any HF model id (e.g. "MiniMaxAI/MiniMax-M2.7") or use one
-# of the `anthropic/` / `openai/` prefixes for direct API access. For HF ids,
-# append ":fastest" / ":cheapest" / ":preferred" / ":<provider>" to override
-# the default routing policy (auto = fastest with failover).
-SUGGESTED_MODELS = [
-    {"id": "anthropic/claude-opus-4-6", "label": "Claude Opus 4.6"},
-    {"id": "MiniMaxAI/MiniMax-M2.7", "label": "MiniMax M2.7"},
-    {"id": "moonshotai/Kimi-K2.6", "label": "Kimi K2.6"},
-    {"id": "zai-org/GLM-5.1", "label": "GLM 5.1"},
-]
-def _is_valid_model_id(model_id: str) -> bool:
-    """Loose format check — lets users pick any model id.
-    Accepts:
-      • anthropic/<model>
-      • openai/<model>
-      • <org>/<model>[:<tag>]            (HF router; tag = provider or policy)
-      • huggingface/<org>/<model>[:<tag>] (same, accepts legacy prefix)
-    Actual availability is verified against the HF router catalog on switch,
-    or by the provider on first call.
-    """
-    if not model_id or "/" not in model_id:
-        return False
-    # Strip :tag suffix before structural check
-    head = model_id.split(":", 1)[0]
-    parts = head.split("/")
-    return len(parts) >= 2 and all(parts)
 def _safe_get_args(arguments: dict) -> dict:
     """Safely extract args dict from arguments, handling cases where LLM passes string."""
     args = arguments.get("args", {})
@@ -91,80 +59,6 @@ def _safe_get_args(arguments: dict) -> dict:
     return args if isinstance(args, dict) else {}
-_ROUTING_POLICIES = {"fastest", "cheapest", "preferred"}
-def _print_model_preflight(model_id: str, console) -> None:
-    """Validate a model switch against the HF router catalog and show the
-    user what they're about to use (providers, price, context, tool support).
-    Anthropic/OpenAI ids skip the catalog — those are direct API calls.
-    For unknown HF ids we print a red warning with fuzzy suggestions but
-    still allow the switch (the catalog might be lagging).
-    """
-    if model_id.startswith(("anthropic/", "openai/")):
-        console.print(f"[green]Model switched to {model_id}[/green]")
-        return
-    from agent.core import hf_router_catalog as cat
-    bare, _, tag = model_id.partition(":")
-    info = cat.lookup(bare)
-    if info is None:
-        console.print(
-            f"[bold red]Warning:[/bold red] '{bare}' isn't in the HF router "
-            "catalog. Switching anyway — first call may fail."
-        )
-        suggestions = cat.fuzzy_suggest(bare)
-        if suggestions:
-            console.print(f"[dim]Did you mean: {', '.join(suggestions)}[/dim]")
-        return
-    live = info.live_providers
-    if not live:
-        console.print(
-            f"[bold red]Warning:[/bold red] '{bare}' has no live providers "
-            "right now. First call will likely fail."
-        )
-        return
-    if tag and tag not in _ROUTING_POLICIES:
-        matched = [p for p in live if p.provider == tag]
-        if not matched:
-            names = ", ".join(p.provider for p in live)
-            console.print(
-                f"[bold red]Warning:[/bold red] provider '{tag}' doesn't serve "
-                f"'{bare}'. Live providers: {names}. Switching anyway."
-            )
-            return
-    if not info.any_supports_tools:
-        console.print(
-            f"[bold red]Warning:[/bold red] no provider for '{bare}' advertises "
-            "tool-call support. This agent relies on tool calls — expect errors."
-        )
-    console.print(f"[green]Model switched to {model_id}[/green]")
-    if tag in _ROUTING_POLICIES:
-        policy = tag
-    elif tag:
-        policy = f"pinned to {tag}"
-    else:
-        policy = "auto (fastest)"
-    console.print(f"  [dim]routing: {policy}[/dim]")
-    for p in live:
-        price = (
-            f"${p.input_price:g}/${p.output_price:g} per M tok"
-            if p.input_price is not None and p.output_price is not None
-            else "price n/a"
-        )
-        ctx = f"{p.context_length:,} ctx" if p.context_length else "ctx n/a"
-        tools = "tools" if p.supports_tools else "no tools"
-        console.print(
-            f"  [dim]{p.provider}: {price}, {ctx}, {tools}[/dim]"
-        )
 def _get_hf_token() -> str | None:
     """Get HF token from environment, huggingface_hub API, or cached token file."""
     token = os.environ.get("HF_TOKEN")
@@ -807,7 +701,7 @@ async def get_user_input(prompt_session: PromptSession) -> str:
 # Slash commands are defined in terminal_display
-def _handle_slash_command(
     cmd: str,
     config,
     session_holder: list,
@@ -817,6 +711,9 @@ def _handle_slash_command(
     """
     Handle a slash command. Returns a Submission to enqueue, or None if
     the command was handled locally (caller should set turn_complete_event).
     """
     parts = cmd.strip().split(None, 1)
     command = parts[0].lower()
@@ -843,35 +740,16 @@ def _handle_slash_command(
     if command == "/model":
         console = get_console()
         if not arg:
-            current = config.model_name if config else ""
-            console.print("[bold]Current model:[/bold]")
-            console.print(f"  {current}")
-            console.print("\n[bold]Suggested:[/bold]")
-            for m in SUGGESTED_MODELS:
-                marker = " [dim]<-- current[/dim]" if m["id"] == current else ""
-                console.print(f"  {m['id']}  [dim]({m['label']})[/dim]{marker}")
-            console.print(
-                "\n[dim]Paste any HF model id (e.g. 'MiniMaxAI/MiniMax-M2.7').\n"
-                "Add ':fastest', ':cheapest', ':preferred', or ':<provider>' to override routing.\n"
-                "Use 'anthropic/<model>' or 'openai/<model>' for direct API access.[/dim]"
-            )
             return None
-        if not _is_valid_model_id(arg):
-            console.print(f"[bold red]Invalid model id format:[/bold red] {arg}")
-            console.print(
-                "[dim]Expected:\n"
-                "  • <org>/<model>[:tag]    (HF router — paste from huggingface.co)\n"
-                "  • anthropic/<model>\n"
-                "  • openai/<model>[/dim]"
-            )
             return None
         normalized = arg.removeprefix("huggingface/")
-        _print_model_preflight(normalized, console)
         session = session_holder[0] if session_holder else None
-        if session:
-            session.update_model(normalized)
-        else:
-            config.model_name = normalized
         return None
     if command == "/yolo":
@@ -882,14 +760,19 @@ def _handle_slash_command(
     if command == "/effort":
         console = get_console()
-        valid = {"minimal", "low", "medium", "high", "off"}
         if not arg:
             current = config.reasoning_effort or "off"
-            console.print(f"[bold]Reasoning effort:[/bold] {current}")
             console.print(
-                "[dim]Set with '/effort minimal|low|medium|high|off'. "
-                "Applies to models that support it (GPT-5 / o-series, Claude "
-                "extended thinking, HF reasoning models); dropped otherwise.[/dim]"
             )
             return None
         level = arg.lower()
@@ -898,7 +781,16 @@ def _handle_slash_command(
             console.print(f"[dim]Expected one of: {', '.join(sorted(valid))}[/dim]")
             return None
         config.reasoning_effort = None if level == "off" else level
         console.print(f"[green]Reasoning effort: {level}[/green]")
         return None
     if command == "/status":
@@ -1083,7 +975,7 @@ async def main():
             # Handle slash commands
             if user_input.strip().startswith("/"):
-                sub = _handle_slash_command(
                     user_input.strip(), config, session_holder, submission_queue, submission_id
                 )
                 if sub is None:

 from agent.config import load_config
 from agent.core.agent_loop import submission_loop
+from agent.core import model_switcher
 from agent.core.session import OpType
 from agent.core.tools import ToolRouter
 from agent.utils.reliability_checks import check_training_script_save_pattern
 # on every error — users don't need it, and our friendly errors cover the case.
 litellm.suppress_debug_info = True
 def _safe_get_args(arguments: dict) -> dict:
     """Safely extract args dict from arguments, handling cases where LLM passes string."""
     args = arguments.get("args", {})
     return args if isinstance(args, dict) else {}
 def _get_hf_token() -> str | None:
     """Get HF token from environment, huggingface_hub API, or cached token file."""
     token = os.environ.get("HF_TOKEN")
 # Slash commands are defined in terminal_display
+async def _handle_slash_command(
     cmd: str,
     config,
     session_holder: list,
     """
     Handle a slash command. Returns a Submission to enqueue, or None if
     the command was handled locally (caller should set turn_complete_event).
+    Async because ``/model`` fires a probe ping to validate the model+effort
+    combo before committing the switch.
     """
     parts = cmd.strip().split(None, 1)
     command = parts[0].lower()
     if command == "/model":
         console = get_console()
         if not arg:
+            model_switcher.print_model_listing(config, console)
             return None
+        if not model_switcher.is_valid_model_id(arg):
+            model_switcher.print_invalid_id(arg, console)
             return None
         normalized = arg.removeprefix("huggingface/")
         session = session_holder[0] if session_holder else None
+        await model_switcher.probe_and_switch_model(
+            normalized, config, session, console, _get_hf_token(),
+        )
         return None
     if command == "/yolo":
     if command == "/effort":
         console = get_console()
+        valid = {"minimal", "low", "medium", "high", "xhigh", "max", "off"}
+        session = session_holder[0] if session_holder else None
         if not arg:
             current = config.reasoning_effort or "off"
+            console.print(f"[bold]Reasoning effort preference:[/bold] {current}")
+            if session and session.model_effective_effort:
+                console.print("[dim]Probed per model:[/dim]")
+                for m, eff in session.model_effective_effort.items():
+                    console.print(f"  [dim]{m}: {eff or 'off'}[/dim]")
             console.print(
+                "[dim]Set with '/effort minimal|low|medium|high|xhigh|max|off'. "
+                "'max' and 'xhigh' are Anthropic-only; the cascade falls back "
+                "to whatever the model actually accepts.[/dim]"
             )
             return None
         level = arg.lower()
             console.print(f"[dim]Expected one of: {', '.join(sorted(valid))}[/dim]")
             return None
         config.reasoning_effort = None if level == "off" else level
+        # Drop the per-model probe cache — the new preference may resolve
+        # differently. Next ``/model`` (or the retry safety net) reprobes.
+        if session is not None:
+            session.model_effective_effort.clear()
         console.print(f"[green]Reasoning effort: {level}[/green]")
+        if session is not None:
+            console.print(
+                "[dim]run /model <current> to re-probe, or send a message — "
+                "the agent adjusts automatically if the new level isn't supported.[/dim]"
+            )
         return None
     if command == "/status":
             # Handle slash commands
             if user_input.strip().startswith("/"):
+                sub = await _handle_slash_command(
                     user_input.strip(), config, session_holder, submission_queue, submission_id
                 )
                 if sub is None:

agent/tools/research_tool.py CHANGED

@@ -246,10 +246,16 @@ async def research_handler(
     # Use a cheaper/faster model for research
     main_model = session.config.model_name
     research_model = _get_research_model(main_model)
     llm_params = _resolve_llm_params(
         research_model,
         getattr(session, "hf_token", None),
-        reasoning_effort=getattr(session.config, "reasoning_effort", None),
     )
     # Get read-only tool specs from the session's tool router

     # Use a cheaper/faster model for research
     main_model = session.config.model_name
     research_model = _get_research_model(main_model)
+    # Research is a cheap sub-call — cap the main session's effort at "high"
+    # so a user preference of ``max``/``xhigh`` (valid for Opus 4.6/4.7) doesn't
+    # propagate to a Sonnet research model that may not accept those levels.
+    # We also haven't probed this sub-model so we don't know its ceiling.
+    _pref = getattr(session.config, "reasoning_effort", None)
+    _capped = "high" if _pref in ("max", "xhigh") else _pref
     llm_params = _resolve_llm_params(
         research_model,
         getattr(session, "hf_token", None),
+        reasoning_effort=_capped,
     )
     # Get read-only tool specs from the session's tool router
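
The capping rule in this hunk can be isolated as a one-line pure function, sketched below. `cap_for_subcall` and its `ceiling` parameter are hypothetical names; the diff inlines the expression instead.

```python
from __future__ import annotations


def cap_for_subcall(pref: str | None, ceiling: str = "high") -> str | None:
    """Clamp Anthropic-only levels to a ceiling; pass everything else through."""
    return ceiling if pref in ("max", "xhigh") else pref


if __name__ == "__main__":
    for pref in ("max", "xhigh", "high", "minimal", None):
        print(f"{pref!r} -> {cap_for_subcall(pref)!r}")
```

Levels at or below the ceiling, including `None` (effort off), are forwarded unchanged, so the research sub-call never asks for more than the main session did.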

agent/utils/terminal_display.py CHANGED

@@ -440,7 +440,7 @@ HELP_TEXT = f"""\
 {_I}  [cyan]/undo[/cyan]            Undo last turn
 {_I}  [cyan]/compact[/cyan]         Compact context window
 {_I}  [cyan]/model[/cyan] [id]      Show available models or switch
-{_I}  [cyan]/effort[/cyan] [level]  Reasoning effort (minimal|low|medium|high|off)
 {_I}  [cyan]/yolo[/cyan]            Toggle auto-approve mode
 {_I}  [cyan]/status[/cyan]          Current model & turn count
 {_I}  [cyan]/quit[/cyan]            Exit"""

 {_I}  [cyan]/undo[/cyan]            Undo last turn
 {_I}  [cyan]/compact[/cyan]         Compact context window
 {_I}  [cyan]/model[/cyan] [id]      Show available models or switch
+{_I}  [cyan]/effort[/cyan] [level]  Reasoning effort (minimal|low|medium|high|xhigh|max|off)
 {_I}  [cyan]/yolo[/cyan]            Toggle auto-approve mode
 {_I}  [cyan]/status[/cyan]          Current model & turn count
 {_I}  [cyan]/quit[/cyan]            Exit"""