Aksel Joonas Reedi committed on
Commit
e2552e8
·
unverified ·
1 Parent(s): f30ed48

fix /model switcher with effort params


* /model: probe effort combo on switch, stop passing reasoning_effort to litellm

Switching to claude-opus-4-7 with /effort set would 400 with
"thinking.type.enabled is not supported" because litellm 1.83.0's
Anthropic adapter substring-matches "4.6" to decide which thinking API
shape to send, and doesn't know about 4.7's adaptive + output_config.effort
contract. Rather than maintain a per-model capability table that rots
every time a new Claude family ships, this change trusts the API itself:

- llm_params.py: for anthropic/*, bypass litellm's reasoning_effort -> thinking
mapping and pass thinking={type: adaptive} plus output_config={effort: ...}
as top-level kwargs directly. litellm forwards unknown top-level params
into Anthropic request bodies (extra_body does NOT work here — Anthropic
rejects it as "Extra inputs are not permitted"). One localized monkey-patch
widens litellm 1.83's hardcoded _is_opus_4_6_model check so effort=max
isn't rejected synchronously on 4.7 — removable once litellm ships PR
#25867 upstream.

- effort_probe.py: new probe that fires a 1-token ping on /model switch
with the same params we'd use for real, walking a cascade
max -> xhigh -> high -> medium -> low until the provider stops rejecting.
Three outcomes: success (cache the level), thinking-unsupported (cache
None, strip on future calls), inconclusive (switch anyway with warning).
Persistent non-thinking 4xx (auth, model-not-found) bubbles up so
/model rejects the switch and keeps the current model.

- session.py: per-model effective_effort cache + effective_effort_for()
helper. Populated by the probe, read by the real LLM call so resolved
levels don't re-probe on every message. /effort change invalidates.

- agent_loop.py: safety net — if a real call 400s with thinking/effort
config errors mid-conversation (e.g. after /effort change without
re-probe), heal the cache and retry once before propagating.

- main.py: default reasoning_effort = "max" (was "high"); /model runs
the probe and prints (effort: X, Nms); /effort accepts xhigh and max
and shows per-model probed ceilings; SUGGESTED_MODELS includes Opus 4.7.
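The localized monkey-patch mentioned in the llm_params.py bullet could look roughly like the sketch below. It is a minimal illustration only: `AdapterConfig` is a hypothetical stand-in for the third-party adapter class, and the `_patched` marker attribute is an assumed name used to keep the patch idempotent.

```python
class AdapterConfig:
    """Hypothetical stand-in for a third-party adapter class (not litellm)."""

    @staticmethod
    def _is_opus_4_6_model(model: str) -> bool:
        # Mimics a hardcoded substring check that only knows one family.
        return "opus-4-6" in model.lower()


def widen_model_check(cfg) -> bool:
    """Replace the narrow check with a wider one. Returns True if patched now."""
    original = getattr(cfg, "_is_opus_4_6_model", None)
    # Bail out if the attribute is gone (library changed) or already patched.
    if original is None or getattr(original, "_patched", False):
        return False

    def widened(model: str) -> bool:
        m = model.lower()
        return any(tag in m for tag in ("opus-4-6", "opus-4-7"))

    widened._patched = True  # marker so a second call is a no-op
    cfg._is_opus_4_6_model = staticmethod(widened)
    return True
```

Guarding on the marker attribute means the patch can run at import time from several modules without stacking wrappers, and guarding on a missing attribute means it silently disappears once the upstream library drops the method.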

Live-tested against Opus 4.7, Haiku 4.5, DeepSeek-R1, Qwen3.5-9B,
Llama-3.1-8B, MiMo-V2-Flash, Gemma-4-31B, Arch-Router-1.5B, Kimi-K2.5,
and a non-existent id. All outcomes matched expectations.
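The probe cascade described above can be sketched in a few lines. This is a simplified, synchronous illustration, not the code from effort_probe.py: the two exception classes and `make_fake_provider` are hypothetical stand-ins for real provider errors, and a single ordered list replaces the per-preference cascade table.

```python
from typing import Callable, Optional, Set

CASCADE = ["max", "xhigh", "high", "medium", "low"]


class ThinkingUnsupported(Exception):
    """Hypothetical: the provider rejects any thinking config."""


class InvalidEffort(Exception):
    """Hypothetical: the provider rejects this specific effort level."""


def walk_cascade(preference: str, ping: Callable[[str], None]) -> Optional[str]:
    """Return the first accepted level at or below `preference`, else None."""
    start = CASCADE.index(preference)
    for level in CASCADE[start:]:
        try:
            ping(level)
        except ThinkingUnsupported:
            return None          # cache None: stop sending thinking params
        except InvalidEffort:
            continue             # walk down to the next level
        return level             # success: cache this level
    return None                  # every level was rejected


def make_fake_provider(accepted: Set[str]) -> Callable[[str], None]:
    """Build a fake ping; an empty set models a non-thinking model."""
    def ping(level: str) -> None:
        if not accepted:
            raise ThinkingUnsupported("thinking is not supported")
        if level not in accepted:
            raise InvalidEffort(f"invalid effort: {level}")
    return ping
```

Any other exception raised by `ping` (auth failure, unknown model, timeout) propagates unchanged in this sketch, which corresponds to the "reject the switch" and "inconclusive" paths being decided by the caller.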

* Extract /model switcher logic into agent.core.model_switcher

main.py was accumulating model-switch specifics (suggested list, id
format check, HF routing info printer, probe-and-switch flow, commit
helper). Moving them to a dedicated module keeps the REPL dispatcher
focused on input parsing and makes the switcher independently testable.

Net: main.py down ~160 lines, /model handler is now a 10-line delegation.
No behavior change.
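As an illustration of what "independently testable" buys, the id format check that moved into the new module can be exercised without a REPL. The function body below mirrors `is_valid_model_id` as it appears in the model_switcher.py diff; the sample ids are illustrative only.

```python
def is_valid_model_id(model_id: str) -> bool:
    """Loose format check: at least <prefix>/<model>, optional :<tag> suffix."""
    if not model_id or "/" not in model_id:
        return False
    head = model_id.split(":", 1)[0]   # drop a routing tag such as ":cheapest"
    parts = head.split("/")
    return len(parts) >= 2 and all(parts)  # no empty path segments


SAMPLES = [
    "anthropic/claude-opus-4-7",
    "MiniMaxAI/MiniMax-M2.7:cheapest",
    "huggingface/moonshotai/Kimi-K2.6",
    "no-slash",
    "org/",
]
RESULTS = {sample: is_valid_model_id(sample) for sample in SAMPLES}
```

The check is deliberately permissive: it only rejects ids that cannot be routed at all, and leaves "does this model exist" to the router catalog and the probe.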

agent/config.py CHANGED
@@ -33,14 +33,15 @@ class Config(BaseModel):
33
  confirm_cpu_jobs: bool = True
34
  auto_file_upload: bool = False
35
 
36
- # Reasoning effort for models that support it (GPT-5 / o-series, Claude
37
- # extended thinking, HF reasoning models like MiniMax M2 / Kimi K2).
38
- # Defaults to "high" because we'd rather spend tokens thinking than ship a
39
- # wrong ML recipe. Users can dial down with `/effort low|medium|off`.
40
- # "minimal" is an OpenAI-only level and is normalized to "low" for HF
41
- # router models (MiniMax requires ≥low). Ignored for non-reasoning models.
42
- # Valid values: None | "minimal" | "low" | "medium" | "high"
43
- reasoning_effort: str | None = "high"
 
44
 
45
 
46
  def substitute_env_vars(obj: Any) -> Any:
 
33
  confirm_cpu_jobs: bool = True
34
  auto_file_upload: bool = False
35
 
36
+ # Reasoning effort *preference*: the ceiling the user wants. The probe
37
+ # on `/model` walks a cascade down from here (``max`` → ``xhigh`` → ``high``
38
+ # …) and caches per-model what the provider actually accepted in
39
+ # ``Session.model_effective_effort``. Default ``max`` because we'd rather
40
+ # burn tokens thinking than ship a wrong ML recipe; the cascade lands on
41
+ # whichever level the model supports (``high`` for GPT-5 / HF router,
42
+ # ``xhigh`` or ``max`` for Anthropic 4.6 / 4.7). ``None`` = thinking off.
43
+ # Valid values: None | "minimal" | "low" | "medium" | "high" | "xhigh" | "max"
44
+ reasoning_effort: str | None = "max"
45
 
46
 
47
  def substitute_env_vars(obj: Any) -> Any:
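The split between the configured preference and the per-model cache can be sketched as follows. session.py is not shown in this diff, so everything here is an assumption based on the commit message and the comments above: `SessionSketch` and `set_effort` are hypothetical names, while `model_effective_effort` and `effective_effort_for` are the names the diff refers to.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SessionSketch:
    """Hypothetical, trimmed-down session holding only effort state."""

    reasoning_effort: Optional[str] = "max"  # the user's ceiling
    model_effective_effort: Dict[str, Optional[str]] = field(default_factory=dict)

    def effective_effort_for(self, model: str) -> Optional[str]:
        # A cached None means "probed, thinking unsupported", which is
        # different from "never probed", so test membership, not truthiness.
        if model in self.model_effective_effort:
            return self.model_effective_effort[model]
        return self.reasoning_effort  # unprobed model: fall back to preference

    def set_effort(self, level: Optional[str]) -> None:
        # Changing the preference invalidates every probed result.
        self.reasoning_effort = level
        self.model_effective_effort.clear()
```

The membership test is the important detail: using `dict.get(model) or self.reasoning_effort` would resurrect the preference for models whose probe result was `None`.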
agent/core/agent_loop.py CHANGED
@@ -136,6 +136,58 @@ def _is_transient_error(error: Exception) -> bool:
136
  return any(pattern in err_str for pattern in transient_patterns)
137
 
138
 
139
  def _friendly_error_message(error: Exception) -> str | None:
140
  """Return a user-friendly message for known error types, or None to fall back to traceback."""
141
  err_str = str(error).lower()
@@ -243,6 +295,7 @@ class LLMResult:
243
  async def _call_llm_streaming(session: Session, messages, tools, llm_params) -> LLMResult:
244
  """Call the LLM with streaming, emitting assistant_chunk events."""
245
  response = None
 
246
  for _llm_attempt in range(_MAX_LLM_RETRIES):
247
  try:
248
  response = await acompletion(
@@ -258,6 +311,14 @@ async def _call_llm_streaming(session: Session, messages, tools, llm_params) ->
258
  except ContextWindowExceededError:
259
  raise
260
  except Exception as e:
 
 
 
 
 
 
 
 
261
  if _llm_attempt < _MAX_LLM_RETRIES - 1 and _is_transient_error(e):
262
  _delay = _LLM_RETRY_DELAYS[_llm_attempt]
263
  logger.warning(
@@ -328,6 +389,7 @@ async def _call_llm_streaming(session: Session, messages, tools, llm_params) ->
328
  async def _call_llm_non_streaming(session: Session, messages, tools, llm_params) -> LLMResult:
329
  """Call the LLM without streaming, emit assistant_message at the end."""
330
  response = None
 
331
  for _llm_attempt in range(_MAX_LLM_RETRIES):
332
  try:
333
  response = await acompletion(
@@ -342,6 +404,14 @@ async def _call_llm_non_streaming(session: Session, messages, tools, llm_params)
342
  except ContextWindowExceededError:
343
  raise
344
  except Exception as e:
 
 
 
 
 
 
 
 
345
  if _llm_attempt < _MAX_LLM_RETRIES - 1 and _is_transient_error(e):
346
  _delay = _LLM_RETRY_DELAYS[_llm_attempt]
347
  logger.warning(
@@ -490,10 +560,13 @@ class Handlers:
490
  tools = session.tool_router.get_tool_specs_for_llm()
491
  try:
492
  # ── Call the LLM (streaming or non-streaming) ──
 
 
 
493
  llm_params = _resolve_llm_params(
494
  session.config.model_name,
495
  session.hf_token,
496
- reasoning_effort=session.config.reasoning_effort,
497
  )
498
  if session.stream:
499
  llm_result = await _call_llm_streaming(session, messages, tools, llm_params)
 
136
  return any(pattern in err_str for pattern in transient_patterns)
137
 
138
 
139
+ def _is_effort_config_error(error: Exception) -> bool:
140
+ """Catch the two 400s the effort probe also handles — thinking
141
+ unsupported for this model, or the specific effort level invalid.
142
+
143
+ This is our safety net for the case where ``/effort`` was changed
144
+ mid-conversation (which clears the probe cache) and the new level
145
+ doesn't work for the current model. We heal the cache and retry once.
146
+ """
147
+ from agent.core.effort_probe import _is_invalid_effort, _is_thinking_unsupported
148
+ return _is_thinking_unsupported(error) or _is_invalid_effort(error)
149
+
150
+
151
+ async def _heal_effort_and_rebuild_params(
152
+ session: Session, error: Exception, llm_params: dict,
153
+ ) -> dict:
154
+ """Update the session's effort cache based on ``error`` and return new
155
+ llm_params. Called only when ``_is_effort_config_error(error)`` is True.
156
+
157
+ Two branches:
158
+ • thinking-unsupported → cache ``None`` for this model, next call
159
+ strips thinking entirely
160
+ • invalid-effort → re-run the full cascade probe; the result lands
161
+ in the cache
162
+ """
163
+ from agent.core.effort_probe import ProbeInconclusive, _is_thinking_unsupported, probe_effort
164
+
165
+ model = session.config.model_name
166
+ if _is_thinking_unsupported(error):
167
+ session.model_effective_effort[model] = None
168
+ logger.info("healed: %s doesn't support thinking — stripped", model)
169
+ else:
170
+ try:
171
+ outcome = await probe_effort(
172
+ model, session.config.reasoning_effort, session.hf_token,
173
+ )
174
+ session.model_effective_effort[model] = outcome.effective_effort
175
+ logger.info(
176
+ "healed: %s effort cascade → %s", model, outcome.effective_effort,
177
+ )
178
+ except ProbeInconclusive:
179
+ # Transient during healing — strip thinking for safety, next
180
+ # call will either succeed or surface the real error.
181
+ session.model_effective_effort[model] = None
182
+ logger.info("healed: %s probe inconclusive — stripped", model)
183
+
184
+ return _resolve_llm_params(
185
+ model,
186
+ session.hf_token,
187
+ reasoning_effort=session.effective_effort_for(model),
188
+ )
189
+
190
+
191
  def _friendly_error_message(error: Exception) -> str | None:
192
  """Return a user-friendly message for known error types, or None to fall back to traceback."""
193
  err_str = str(error).lower()
 
295
  async def _call_llm_streaming(session: Session, messages, tools, llm_params) -> LLMResult:
296
  """Call the LLM with streaming, emitting assistant_chunk events."""
297
  response = None
298
+ _healed_effort = False # one-shot safety net per call
299
  for _llm_attempt in range(_MAX_LLM_RETRIES):
300
  try:
301
  response = await acompletion(
 
311
  except ContextWindowExceededError:
312
  raise
313
  except Exception as e:
314
+ if not _healed_effort and _is_effort_config_error(e):
315
+ _healed_effort = True
316
+ llm_params = await _heal_effort_and_rebuild_params(session, e, llm_params)
317
+ await session.send_event(Event(
318
+ event_type="tool_log",
319
+ data={"tool": "system", "log": "Reasoning effort not supported for this model — adjusting and retrying."},
320
+ ))
321
+ continue
322
  if _llm_attempt < _MAX_LLM_RETRIES - 1 and _is_transient_error(e):
323
  _delay = _LLM_RETRY_DELAYS[_llm_attempt]
324
  logger.warning(
 
389
  async def _call_llm_non_streaming(session: Session, messages, tools, llm_params) -> LLMResult:
390
  """Call the LLM without streaming, emit assistant_message at the end."""
391
  response = None
392
+ _healed_effort = False
393
  for _llm_attempt in range(_MAX_LLM_RETRIES):
394
  try:
395
  response = await acompletion(
 
404
  except ContextWindowExceededError:
405
  raise
406
  except Exception as e:
407
+ if not _healed_effort and _is_effort_config_error(e):
408
+ _healed_effort = True
409
+ llm_params = await _heal_effort_and_rebuild_params(session, e, llm_params)
410
+ await session.send_event(Event(
411
+ event_type="tool_log",
412
+ data={"tool": "system", "log": "Reasoning effort not supported for this model — adjusting and retrying."},
413
+ ))
414
+ continue
415
  if _llm_attempt < _MAX_LLM_RETRIES - 1 and _is_transient_error(e):
416
  _delay = _LLM_RETRY_DELAYS[_llm_attempt]
417
  logger.warning(
 
560
  tools = session.tool_router.get_tool_specs_for_llm()
561
  try:
562
  # ── Call the LLM (streaming or non-streaming) ──
563
+ # Pull the per-model probed effort from the session cache when
564
+ # available; fall back to the raw preference for models we
565
+ # haven't probed yet (e.g. research sub-model).
566
  llm_params = _resolve_llm_params(
567
  session.config.model_name,
568
  session.hf_token,
569
+ reasoning_effort=session.effective_effort_for(session.config.model_name),
570
  )
571
  if session.stream:
572
  llm_result = await _call_llm_streaming(session, messages, tools, llm_params)
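The one-shot safety net added to both call paths can be illustrated in isolation. This is a minimal sketch under stated assumptions: `EffortRejected`, `call_with_effort_heal`, `fake_send`, and `fake_heal` are hypothetical names, and transient-error backoff is left out to keep the healing path visible.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

MAX_RETRIES = 3
SENT: List[Dict[str, Any]] = []  # records every payload the fake provider saw


class EffortRejected(Exception):
    """Hypothetical stand-in for a provider 400 about thinking/effort."""


async def call_with_effort_heal(
    send: Callable[[Dict[str, Any]], Awaitable[str]],
    params: Dict[str, Any],
    heal: Callable[[Exception, Dict[str, Any]], Awaitable[Dict[str, Any]]],
) -> str:
    healed = False  # one-shot flag: heal at most once per call
    for _attempt in range(MAX_RETRIES):
        try:
            return await send(params)
        except EffortRejected as err:
            if healed:
                raise  # healing already tried; surface the real error
            healed = True
            params = await heal(err, params)
    raise RuntimeError("retries exhausted")


async def fake_send(params: Dict[str, Any]) -> str:
    SENT.append(dict(params))
    if "effort" in params:
        raise EffortRejected("thinking is not supported for this model")
    return "ok"


async def fake_heal(err: Exception, params: Dict[str, Any]) -> Dict[str, Any]:
    # Strip the offending key, as the cache-healing step would.
    return {k: v for k, v in params.items() if k != "effort"}
```

One thing the sketch makes visible: the healing retry consumes a loop iteration, so if healing happens on the final attempt there is no iteration left to retry in. Whether that matters in the real loop depends on how earlier attempts were spent.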
agent/core/effort_probe.py ADDED
@@ -0,0 +1,229 @@
1
+ """Probe-and-cascade for reasoning effort on /model switch.
2
+
3
+ We don't maintain a per-model capability table. Instead, the first time a
4
+ user picks a model we fire a 1-token ping with the same params we'd use
5
+ for real and walk down a cascade (``max`` → ``xhigh`` → ``high`` → …)
6
+ until the provider stops rejecting us. The result is cached per-model on
7
+ the session, so real messages don't pay the probe cost again.
8
+
9
+ Three outcomes, classified from the 400 error text:
10
+
11
+ * success → cache the effort that worked
12
+ * ``"thinking ... not supported"`` → model doesn't do thinking at all;
13
+ cache ``None`` so we stop sending thinking params
14
+ * ``"effort ... invalid"`` / synonyms → cascade walks down and retries
15
+
16
+ Transient errors (5xx, timeout, connection reset) bubble out as
17
+ ``ProbeInconclusive`` so the caller can complete the switch with a
18
+ warning instead of blocking on a flaky provider.
19
+ """
20
+
21
+ from __future__ import annotations
22
+
23
+ import asyncio
24
+ import logging
25
+ from dataclasses import dataclass
26
+
27
+ from litellm import acompletion
28
+
29
+ from agent.core.llm_params import UnsupportedEffortError, _resolve_llm_params
30
+
31
+ logger = logging.getLogger(__name__)
32
+
33
+
34
+ # Cascade: for each user-stated preference, the ordered list of levels to
35
+ # try. First success wins. ``max`` / ``xhigh`` are Anthropic-only; providers
36
+ # that don't accept them raise ``UnsupportedEffortError`` synchronously (no
37
+ # wasted network round-trip) and we advance to the next level.
38
+ _EFFORT_CASCADE: dict[str, list[str]] = {
39
+ "max": ["max", "xhigh", "high", "medium", "low"],
40
+ "xhigh": ["xhigh", "high", "medium", "low"],
41
+ "high": ["high", "medium", "low"],
42
+ "medium": ["medium", "low"],
43
+ "minimal": ["minimal", "low"],
44
+ "low": ["low"],
45
+ }
46
+
47
+ _PROBE_TIMEOUT = 15.0
48
+ _PROBE_MAX_TOKENS = 16
49
+
50
+
51
+ class ProbeInconclusive(Exception):
52
+ """The probe couldn't reach a verdict (transient network / provider error).
53
+
54
+ Caller should complete the switch with a warning — the next real call
55
+ will re-surface the error if it's persistent.
56
+ """
57
+
58
+
59
+ @dataclass
60
+ class ProbeOutcome:
61
+ """What the probe learned. ``effective_effort`` semantics match the cache:
62
+
63
+ * str → send this level
64
+ * None → model doesn't support thinking; strip it
65
+ """
66
+ effective_effort: str | None
67
+ attempts: int
68
+ elapsed_ms: int
69
+ note: str | None = None # e.g. "max not supported, falling back"
70
+
71
+
72
+ def _is_thinking_unsupported(e: Exception) -> bool:
73
+ """Model rejected any thinking config.
74
+
75
+ Matches Anthropic's 'thinking.type.enabled is not supported for this
76
+ model' as well as the adaptive variant. Substring-match because the
77
+ exact wording shifts across API versions.
78
+ """
79
+ s = str(e).lower()
80
+ return "thinking" in s and "not supported" in s
81
+
82
+
83
+ def _is_invalid_effort(e: Exception) -> bool:
84
+ """The requested effort level isn't accepted for this model.
85
+
86
+ Covers both API responses (Anthropic/OpenAI 400 with "invalid", "must
87
+ be one of", etc.) and LiteLLM's local validation that fires *before*
88
+ the request (e.g. "effort='max' is only supported by Claude Opus 4.6"
89
+ — LiteLLM knows max is Opus-4.6-only and raises synchronously). The
90
+ cascade walks down on either.
91
+
92
+ Explicitly returns False when the message is really about thinking
93
+ itself (e.g. Anthropic's 4.7 error mentions ``output_config.effort``
94
+ in its fix hint, but the actual failure is ``thinking.type.enabled``
95
+ being unsupported). That case is caught by ``_is_thinking_unsupported``.
96
+ """
97
+ if _is_thinking_unsupported(e):
98
+ return False
99
+ s = str(e).lower()
100
+ if "effort" not in s and "output_config" not in s:
101
+ return False
102
+ return any(
103
+ phrase in s
104
+ for phrase in (
105
+ "invalid", "not supported", "must be one of", "not a valid",
106
+ "unrecognized", "unknown",
107
+ # LiteLLM's own pre-flight validation phrasing.
108
+ "only supported by", "is only supported",
109
+ )
110
+ )
111
+
112
+
113
+ def _is_transient(e: Exception) -> bool:
114
+ """Network / provider-side flake. Keep in sync with agent_loop's list.
115
+
116
+ Also matches by type for ``asyncio.TimeoutError`` — its ``str(e)`` is
117
+ empty, so substring matching alone misses it.
118
+ """
119
+ if isinstance(e, (asyncio.TimeoutError, TimeoutError)):
120
+ return True
121
+ s = str(e).lower()
122
+ return any(
123
+ p in s
124
+ for p in (
125
+ "timeout", "timed out", "429", "rate limit",
126
+ "503", "service unavailable", "502", "bad gateway",
127
+ "500", "internal server error", "overloaded", "capacity",
128
+ "connection reset", "connection refused", "connection error",
129
+ "eof", "broken pipe",
130
+ )
131
+ )
132
+
133
+
134
+ async def probe_effort(
135
+ model_name: str,
136
+ preference: str | None,
137
+ hf_token: str | None,
138
+ ) -> ProbeOutcome:
139
+ """Walk the cascade for ``preference`` on ``model_name``.
140
+
141
+ Returns the first effort the provider accepts, or ``None`` if it
142
+ rejects thinking altogether. Raises ``ProbeInconclusive`` only for
143
+ transient errors (5xx, timeout) — persistent 4xx that aren't thinking/
144
+ effort related bubble as the original exception so callers can surface
145
+ them (auth, model-not-found, quota, etc.).
146
+ """
147
+ loop = asyncio.get_event_loop()
148
+ start = loop.time()
149
+ attempts = 0
150
+
151
+ if not preference:
152
+ # User explicitly turned effort off — nothing to probe. A bare
153
+ # ping with no thinking params is pointless; just report "off".
154
+ return ProbeOutcome(effective_effort=None, attempts=0, elapsed_ms=0)
155
+
156
+ cascade = _EFFORT_CASCADE.get(preference, [preference])
157
+ skipped: list[str] = [] # levels the provider rejected synchronously
158
+
159
+ last_error: Exception | None = None
160
+ for effort in cascade:
161
+ try:
162
+ params = _resolve_llm_params(
163
+ model_name, hf_token, reasoning_effort=effort, strict=True,
164
+ )
165
+ except UnsupportedEffortError:
166
+ # Provider can't even accept this effort name (e.g. "max" on
167
+ # HF router). Skip without a network call.
168
+ skipped.append(effort)
169
+ continue
170
+
171
+ attempts += 1
172
+ try:
173
+ await asyncio.wait_for(
174
+ acompletion(
175
+ messages=[{"role": "user", "content": "ping"}],
176
+ max_tokens=_PROBE_MAX_TOKENS,
177
+ stream=False,
178
+ **params,
179
+ ),
180
+ timeout=_PROBE_TIMEOUT,
181
+ )
182
+ except Exception as e:
183
+ last_error = e
184
+ if _is_thinking_unsupported(e):
185
+ elapsed = int((loop.time() - start) * 1000)
186
+ return ProbeOutcome(
187
+ effective_effort=None,
188
+ attempts=attempts,
189
+ elapsed_ms=elapsed,
190
+ note="model doesn't support reasoning, dropped",
191
+ )
192
+ if _is_invalid_effort(e):
193
+ logger.debug("probe: %s rejected effort=%s, trying next", model_name, effort)
194
+ continue
195
+ if _is_transient(e):
196
+ raise ProbeInconclusive(str(e)) from e
197
+ # Persistent non-thinking 4xx (auth, quota, model-not-found) —
198
+ # let the caller classify & surface.
199
+ raise
200
+ else:
201
+ elapsed = int((loop.time() - start) * 1000)
202
+ note = None
203
+ if effort != preference:
204
+ note = f"{preference} not supported, using {effort}"
205
+ return ProbeOutcome(
206
+ effective_effort=effort,
207
+ attempts=attempts,
208
+ elapsed_ms=elapsed,
209
+ note=note,
210
+ )
211
+
212
+ # Cascade exhausted without a success. This only happens when every
213
+ # level was either rejected synchronously (``UnsupportedEffortError``,
214
+ # e.g. preference=max on HF and we also somehow filtered all others)
215
+ # or the provider 400'd ``invalid effort`` on every level.
216
+ elapsed = int((loop.time() - start) * 1000)
217
+ if last_error is not None and not _is_invalid_effort(last_error):
218
+ raise last_error
219
+ note = (
220
+ "no effort level accepted — proceeding without thinking"
221
+ if not skipped
222
+ else f"provider rejected all efforts ({', '.join(skipped)})"
223
+ )
224
+ return ProbeOutcome(
225
+ effective_effort=None,
226
+ attempts=attempts,
227
+ elapsed_ms=elapsed,
228
+ note=note,
229
+ )
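The classification order used by the probe matters, and a condensed sketch may make it easier to see. The `classify` function below is a hypothetical helper that folds the three predicates from this file into one; the phrase lists are shortened, and the sample messages are the ones quoted in the docstrings above.

```python
import asyncio


def classify(err: Exception) -> str:
    """Map an exception to one of four buckets by type, then by substring."""
    # asyncio.TimeoutError has an empty str(), so match it by type first.
    if isinstance(err, (asyncio.TimeoutError, TimeoutError)):
        return "transient"
    text = str(err).lower()
    # Thinking-unsupported wins over invalid-effort: the provider's message
    # may mention output_config.effort in its hint while the real failure
    # is the thinking config itself.
    if "thinking" in text and "not supported" in text:
        return "thinking_unsupported"
    mentions_effort = "effort" in text or "output_config" in text
    rejects = any(
        p in text
        for p in ("invalid", "not supported", "must be one of", "only supported by")
    )
    if mentions_effort and rejects:
        return "invalid_effort"
    if any(p in text for p in ("timeout", "429", "503", "overloaded", "connection reset")):
        return "transient"
    return "other"  # auth, quota, model-not-found: surface to the user
```

Because this is substring matching on free text, it can misfire on messages that merely contain a matching number or word; treating it as a heuristic with a conservative "other" fallback seems to be the intent.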
agent/core/llm_params.py CHANGED
@@ -8,41 +8,122 @@ creating circular imports.
8
  import os
9
 
10
 
11
- # HF router reasoning models only accept "low" | "medium" | "high" (e.g.
12
- # MiniMax M2 actually *requires* reasoning to be enabled). OpenAI's GPT-5
13
- # also accepts "minimal" for near-zero thinking. We map "minimal" to "low"
14
- # for HF so the user doesn't get a 400.
15
- _HF_ALLOWED_EFFORTS = {"low", "medium", "high"}
16
 
17
 
18
  def _resolve_llm_params(
19
  model_name: str,
20
  session_hf_token: str | None = None,
21
  reasoning_effort: str | None = None,
 
22
  ) -> dict:
23
  """
24
  Build LiteLLM kwargs for a given model id.
25
 
26
- • ``anthropic/<model>`` / ``openai/<model>``: passed straight through; the
27
- user's own ``ANTHROPIC_API_KEY`` / ``OPENAI_API_KEY`` env vars are picked
28
- up by LiteLLM. ``reasoning_effort`` is forwarded as a top-level param
29
- (GPT-5 / o-series accept "minimal" | "low" | "medium" | "high"; Claude
30
- extended-thinking models accept "low" | "medium" | "high" and LiteLLM
31
- translates to the thinking config).
 
 
 
 
 
 
 
 
 
32
 
33
  • Anything else is treated as a HuggingFace router id. We hit the
34
  auto-routing OpenAI-compatible endpoint at
35
- ``https://router.huggingface.co/v1``, which bypasses LiteLLM's stale
36
- per-provider HF adapter entirely. The id can be bare or carry an HF
37
- routing suffix:
38
-
39
- MiniMaxAI/MiniMax-M2.7 # auto = fastest + failover
40
- MiniMaxAI/MiniMax-M2.7:cheapest
41
- moonshotai/Kimi-K2.6:novita # pin a specific provider
42
 
43
- A leading ``huggingface/`` is stripped for convenience. ``reasoning_effort``
44
- is forwarded via ``extra_body`` (LiteLLM's OpenAI adapter refuses it as a
45
- top-level kwarg for non-OpenAI models). "minimal" is normalized to "low".
 
 
 
46
 
47
  Token precedence (first non-empty wins):
48
  1. INFERENCE_TOKEN env — shared key on the hosted Space (inference is
@@ -50,10 +131,39 @@ def _resolve_llm_params(
50
  2. session.hf_token — the user's own token (CLI / OAuth / cache file).
51
  3. HF_TOKEN env — belt-and-suspenders fallback for CLI users.
52
  """
53
- if model_name.startswith(("anthropic/", "openai/")):
54
  params: dict = {"model": model_name}
55
  if reasoning_effort:
56
- params["reasoning_effort"] = reasoning_effort
57
  return params
58
 
59
  hf_model = model_name.removeprefix("huggingface/")
@@ -72,6 +182,11 @@ def _resolve_llm_params(
72
  params["extra_headers"] = {"X-HF-Bill-To": bill_to}
73
  if reasoning_effort:
74
  hf_level = "low" if reasoning_effort == "minimal" else reasoning_effort
75
- if hf_level in _HF_ALLOWED_EFFORTS:
 
 
 
 
 
76
  params["extra_body"] = {"reasoning_effort": hf_level}
77
  return params
 
8
  import os
9
 
10
 
11
+ def _patch_litellm_effort_validation() -> None:
12
+ """Neuter LiteLLM 1.83's hardcoded effort-level validation.
13
+
14
+ Context: at ``litellm/llms/anthropic/chat/transformation.py:~1443`` the
15
+ Anthropic adapter validates ``output_config.effort ∈ {high, medium,
16
+ low, max}`` and gates ``max`` behind an ``_is_opus_4_6_model`` check
17
+ that only matches the substring ``opus-4-6`` / ``opus_4_6``. Result:
18
+
19
+ * ``xhigh`` — valid on Anthropic's real API for Claude 4.7 — is
20
+ rejected pre-flight with "Invalid effort value: xhigh".
21
+ * ``max`` on Opus 4.7 is rejected with "effort='max' is only supported
22
+ by Claude Opus 4.6", even though Opus 4.7 accepts it in practice.
23
+
24
+ We don't want to maintain a parallel model table, so we let the
25
+ Anthropic API itself be the validator: widen ``_is_opus_4_6_model``
26
+ to also match ``opus-4-7``+ families, and drop the valid-effort-set
27
+ check entirely. If Anthropic rejects an effort level, we see a 400
28
+ and the cascade walks down — exactly the behavior we want for any
29
+ future model family.
30
+
31
+ Removable once litellm ships 1.83.8-stable (which merges PR #25867,
32
+ "Litellm day 0 opus 4.7 support") — see commit 0868a82 on their main
33
+ branch. Until then, this one-time patch is the escape hatch.
34
+ """
35
+ try:
36
+ from litellm.llms.anthropic.chat import transformation as _t
37
+ except Exception:
38
+ return
39
+
40
+ cfg = getattr(_t, "AnthropicConfig", None)
41
+ if cfg is None:
42
+ return
43
+
44
+ original = getattr(cfg, "_is_opus_4_6_model", None)
45
+ if original is None or getattr(original, "_hf_agent_patched", False):
46
+ return
47
+
48
+ def _widened(model: str) -> bool:
49
+ m = model.lower()
50
+ # Original 4.6 match plus any future Opus >= 4.6. We only need this
51
+ # to return True for families where "max" / "xhigh" are acceptable
52
+ # at the API; the cascade handles the case when they're not.
53
+ return any(
54
+ v in m for v in (
55
+ "opus-4-6", "opus_4_6", "opus-4.6", "opus_4.6",
56
+ "opus-4-7", "opus_4_7", "opus-4.7", "opus_4.7",
57
+ )
58
+ )
59
+
60
+ _widened._hf_agent_patched = True # type: ignore[attr-defined]
61
+ cfg._is_opus_4_6_model = staticmethod(_widened)
62
+
63
+
64
+ _patch_litellm_effort_validation()
65
+
66
+
67
+ # Effort levels accepted on the wire.
68
+ # Anthropic (4.6+): low | medium | high | xhigh | max (output_config.effort)
69
+ # OpenAI direct: minimal | low | medium | high (reasoning_effort top-level)
70
+ # HF router: low | medium | high (extra_body.reasoning_effort)
71
+ #
72
+ # We validate *shape* here and let the probe cascade walk down on rejection;
73
+ # we deliberately do NOT maintain a per-model capability table.
74
+ _ANTHROPIC_EFFORTS = {"low", "medium", "high", "xhigh", "max"}
75
+ _OPENAI_EFFORTS = {"minimal", "low", "medium", "high"}
76
+ _HF_EFFORTS = {"low", "medium", "high"}
77
+
78
+
79
+ class UnsupportedEffortError(ValueError):
80
+ """The requested effort isn't valid for this provider's API surface.
81
+
82
+ Raised synchronously before any network call so the probe cascade can
83
+ skip levels the provider can't accept (e.g. ``max`` on HF router).
84
+ """
85
 
86
 
87
  def _resolve_llm_params(
88
  model_name: str,
89
  session_hf_token: str | None = None,
90
  reasoning_effort: str | None = None,
91
+ strict: bool = False,
92
  ) -> dict:
93
  """
94
  Build LiteLLM kwargs for a given model id.
95
 
96
+ • ``anthropic/<model>``: native thinking config. We bypass LiteLLM's
97
+ ``reasoning_effort`` → ``thinking`` mapping (which lags new Claude
98
+ releases like 4.7 and sends the wrong API shape). Instead we pass
99
+ both ``thinking={"type": "adaptive"}`` and ``output_config=
100
+ {"effort": <level>}`` as top-level kwargs. LiteLLM's Anthropic
101
+ adapter forwards unknown top-level kwargs into the request body
102
+ verbatim (confirmed by live probe; ``extra_body`` does NOT work
103
+ here because Anthropic's API rejects it as "Extra inputs are not
104
+ permitted"). This is the stable API for 4.6 and 4.7. Older
105
+ extended-thinking models that only accept ``thinking.type.enabled``
106
+ will reject this; the probe's cascade catches that and falls back
107
+ to no thinking.
108
+
109
+ • ``openai/<model>`` — ``reasoning_effort`` forwarded as a top-level
110
+ kwarg (GPT-5 / o-series). LiteLLM uses the user's ``OPENAI_API_KEY``.
111
 
112
  • Anything else is treated as a HuggingFace router id. We hit the
113
  auto-routing OpenAI-compatible endpoint at
114
+ ``https://router.huggingface.co/v1``. The id can be bare or carry an
115
+ HF routing suffix (``:fastest`` / ``:cheapest`` / ``:<provider>``).
116
+ A leading ``huggingface/`` is stripped. ``reasoning_effort`` is
117
+ forwarded via ``extra_body`` (LiteLLM's OpenAI adapter refuses it as
118
+ a top-level kwarg for non-OpenAI models). "minimal" normalizes to
119
+ "low".
 
120
 
121
+ ``strict=True`` raises ``UnsupportedEffortError`` when the requested
122
+ effort isn't in the provider's accepted set, instead of silently
123
+ dropping it. The probe cascade uses strict mode so it can walk down
124
+ (``max`` → ``xhigh`` → ``high`` …) without making an API call. Regular
125
+ runtime callers leave ``strict=False``, so a stale cached effort
126
+ can't crash a turn — it just doesn't get sent.
127
 
128
  Token precedence (first non-empty wins):
129
  1. INFERENCE_TOKEN env — shared key on the hosted Space (inference is
 
131
  2. session.hf_token — the user's own token (CLI / OAuth / cache file).
132
  3. HF_TOKEN env — belt-and-suspenders fallback for CLI users.
133
  """
134
+ if model_name.startswith("anthropic/"):
135
  params: dict = {"model": model_name}
136
  if reasoning_effort:
137
+ level = reasoning_effort
138
+ if level == "minimal":
139
+ level = "low"
140
+ if level not in _ANTHROPIC_EFFORTS:
141
+ if strict:
142
+ raise UnsupportedEffortError(
143
+ f"Anthropic doesn't accept effort={level!r}"
144
+ )
145
+ else:
146
+ # Adaptive thinking + output_config.effort is the stable
147
+ # Anthropic API for Claude 4.6 / 4.7. Both kwargs are
148
+ # passed top-level: LiteLLM forwards unknown params into
149
+ # the request body for Anthropic, so ``output_config``
150
+ # reaches the API. ``extra_body`` does NOT work here —
151
+ # Anthropic rejects it as "Extra inputs are not
152
+ # permitted".
153
+ params["thinking"] = {"type": "adaptive"}
154
+ params["output_config"] = {"effort": level}
155
+ return params
156
+
157
+ if model_name.startswith("openai/"):
158
+ params = {"model": model_name}
159
+ if reasoning_effort:
160
+ if reasoning_effort not in _OPENAI_EFFORTS:
161
+ if strict:
162
+ raise UnsupportedEffortError(
163
+ f"OpenAI doesn't accept effort={reasoning_effort!r}"
164
+ )
165
+ else:
166
+ params["reasoning_effort"] = reasoning_effort
167
  return params
168
 
169
  hf_model = model_name.removeprefix("huggingface/")
 
182
  params["extra_headers"] = {"X-HF-Bill-To": bill_to}
183
  if reasoning_effort:
184
  hf_level = "low" if reasoning_effort == "minimal" else reasoning_effort
185
+ if hf_level not in _HF_EFFORTS:
186
+ if strict:
187
+ raise UnsupportedEffortError(
188
+ f"HF router doesn't accept effort={hf_level!r}"
189
+ )
190
+ else:
191
  params["extra_body"] = {"reasoning_effort": hf_level}
192
  return params
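The difference between strict and non-strict resolution can be shown with a compact sketch. It is a simplification, not the function above: `build_params`, `ACCEPTED`, and `UnsupportedEffort` are hypothetical names, and the HF branch omits the router endpoint, token, and billing headers that the real function sets.

```python
from typing import Any, Dict, Optional

ACCEPTED = {
    "anthropic": {"low", "medium", "high", "xhigh", "max"},
    "openai": {"minimal", "low", "medium", "high"},
    "hf": {"low", "medium", "high"},
}


class UnsupportedEffort(ValueError):
    """Hypothetical: effort name is not valid for this provider's API shape."""


def build_params(model: str, effort: Optional[str] = None, strict: bool = False) -> Dict[str, Any]:
    if model.startswith("anthropic/"):
        provider = "anthropic"
    elif model.startswith("openai/"):
        provider = "openai"
    else:
        provider = "hf"
    params: Dict[str, Any] = {"model": model}
    if not effort:
        return params
    # "minimal" is an OpenAI-only level; elsewhere it is normalized to "low".
    level = "low" if (effort == "minimal" and provider != "openai") else effort
    if level not in ACCEPTED[provider]:
        if strict:
            raise UnsupportedEffort(f"{provider} doesn't accept effort={level!r}")
        return params  # non-strict: silently send no effort at all
    if provider == "anthropic":
        params["thinking"] = {"type": "adaptive"}
        params["output_config"] = {"effort": level}
    elif provider == "openai":
        params["reasoning_effort"] = level
    else:
        params["extra_body"] = {"reasoning_effort": level}
    return params
```

Strict mode suits the probe, which wants a cheap local signal to skip a level; non-strict mode suits live turns, where a stale cached level should degrade to "no effort sent" rather than raise.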
agent/core/model_switcher.py ADDED
@@ -0,0 +1,228 @@
+ """Model-switching logic for the interactive CLI's ``/model`` command.
+
+ Split out of ``agent.main`` so the REPL dispatcher stays focused on input
+ parsing. Exposes:
+
+ * ``SUGGESTED_MODELS`` — the short list shown by ``/model`` with no arg.
+ * ``is_valid_model_id`` — loose format check on user input.
+ * ``probe_and_switch_model`` — async: checks routing, fires a 1-token
+   probe to resolve the effort cascade, then commits the switch (or
+   rejects it on hard error).
+
+ The probe's cascade lives in ``agent.core.effort_probe``; this module
+ glues it to CLI output + session state.
+ """
+
+ from __future__ import annotations
+
+ from agent.core.effort_probe import ProbeInconclusive, probe_effort
+
+
+ # Suggested models shown by `/model` (not a gate). Users can paste any HF
+ # model id (e.g. "MiniMaxAI/MiniMax-M2.7") or an `anthropic/` / `openai/`
+ # prefix for direct API access. For HF ids, append ":fastest" /
+ # ":cheapest" / ":preferred" / ":<provider>" to override the default
+ # routing policy (auto = fastest with failover).
+ SUGGESTED_MODELS = [
+     {"id": "anthropic/claude-opus-4-7", "label": "Claude Opus 4.7"},
+     {"id": "anthropic/claude-opus-4-6", "label": "Claude Opus 4.6"},
+     {"id": "MiniMaxAI/MiniMax-M2.7", "label": "MiniMax M2.7"},
+     {"id": "moonshotai/Kimi-K2.6", "label": "Kimi K2.6"},
+     {"id": "zai-org/GLM-5.1", "label": "GLM 5.1"},
+ ]
+
+
+ _ROUTING_POLICIES = {"fastest", "cheapest", "preferred"}
+
+
+ def is_valid_model_id(model_id: str) -> bool:
+     """Loose format check — lets users pick any model id.
+
+     Accepts:
+       • anthropic/<model>
+       • openai/<model>
+       • <org>/<model>[:<tag>] (HF router; tag = provider or policy)
+       • huggingface/<org>/<model>[:<tag>] (same, accepts legacy prefix)
+
+     Actual availability is verified against the HF router catalog on
+     switch, and by the provider on the probe's ping call.
+     """
+     if not model_id or "/" not in model_id:
+         return False
+     head = model_id.split(":", 1)[0]
+     parts = head.split("/")
+     return len(parts) >= 2 and all(parts)
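
To make the "loose format check" concrete, the snippet below repeats the function body from the diff and runs it on a few ids. The sample ids are only illustrations; passing the check says nothing about whether a model actually exists, which is left to the catalog lookup and the probe.

```python
def is_valid_model_id(model_id: str) -> bool:
    """Same logic as in the diff: at least two non-empty '/'-separated parts."""
    if not model_id or "/" not in model_id:
        return False
    head = model_id.split(":", 1)[0]  # strip an optional ':tag' suffix
    parts = head.split("/")
    return len(parts) >= 2 and all(parts)


samples = [
    "anthropic/claude-opus-4-7",       # direct API prefix
    "MiniMaxAI/MiniMax-M2.7:fastest",  # HF id with a routing policy tag
    "huggingface/zai-org/GLM-5.1",     # legacy prefix, three parts
    "gpt-5",                           # no slash
    "org/",                            # empty model part
]
results = {s: is_valid_model_id(s) for s in samples}
```

The first three ids pass and the last two fail, since the check only looks at the shape of the string.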
+
+
+ def _print_hf_routing_info(model_id: str, console) -> bool:
+     """Show HF router catalog info (providers, price, context, tool support)
+     for an HF-router model id. Returns ``True`` to signal the caller can
+     proceed with the switch, ``False`` to indicate a hard problem the user
+     should notice before we fire the effort probe.
+
+     Anthropic / OpenAI ids return ``True`` without printing anything —
+     the probe below covers "does this model exist".
+     """
+     if model_id.startswith(("anthropic/", "openai/")):
+         return True
+
+     from agent.core import hf_router_catalog as cat
+
+     bare, _, tag = model_id.partition(":")
+     info = cat.lookup(bare)
+     if info is None:
+         console.print(
+             f"[bold red]Warning:[/bold red] '{bare}' isn't in the HF router "
+             "catalog. Checking anyway — first call may fail."
+         )
+         suggestions = cat.fuzzy_suggest(bare)
+         if suggestions:
+             console.print(f"[dim]Did you mean: {', '.join(suggestions)}[/dim]")
+         return True
+
+     live = info.live_providers
+     if not live:
+         console.print(
+             f"[bold red]Warning:[/bold red] '{bare}' has no live providers "
+             "right now. First call will likely fail."
+         )
+         return True
+
+     if tag and tag not in _ROUTING_POLICIES:
+         matched = [p for p in live if p.provider == tag]
+         if not matched:
+             names = ", ".join(p.provider for p in live)
+             console.print(
+                 f"[bold red]Warning:[/bold red] provider '{tag}' doesn't serve "
+                 f"'{bare}'. Live providers: {names}. Checking anyway."
+             )
+
+     if not info.any_supports_tools:
+         console.print(
+             f"[bold red]Warning:[/bold red] no provider for '{bare}' advertises "
+             "tool-call support. This agent relies on tool calls — expect errors."
+         )
+
+     if tag in _ROUTING_POLICIES:
+         policy = tag
+     elif tag:
+         policy = f"pinned to {tag}"
+     else:
+         policy = "auto (fastest)"
+     console.print(f" [dim]routing: {policy}[/dim]")
+     for p in live:
+         price = (
+             f"${p.input_price:g}/${p.output_price:g} per M tok"
+             if p.input_price is not None and p.output_price is not None
+             else "price n/a"
+         )
+         ctx = f"{p.context_length:,} ctx" if p.context_length else "ctx n/a"
+         tools = "tools" if p.supports_tools else "no tools"
+         console.print(
+             f" [dim]{p.provider}: {price}, {ctx}, {tools}[/dim]"
+         )
+     return True
+
+
+ def print_model_listing(config, console) -> None:
+     """Render the default ``/model`` (no-arg) view: current + suggested."""
+     current = config.model_name if config else ""
+     console.print("[bold]Current model:[/bold]")
+     console.print(f" {current}")
+     console.print("\n[bold]Suggested:[/bold]")
+     for m in SUGGESTED_MODELS:
+         marker = " [dim]<-- current[/dim]" if m["id"] == current else ""
+         console.print(f" {m['id']} [dim]({m['label']})[/dim]{marker}")
+     console.print(
+         "\n[dim]Paste any HF model id (e.g. 'MiniMaxAI/MiniMax-M2.7').\n"
+         "Add ':fastest', ':cheapest', ':preferred', or ':<provider>' to override routing.\n"
+         "Use 'anthropic/<model>' or 'openai/<model>' for direct API access.[/dim]"
+     )
+
+
+ def print_invalid_id(arg: str, console) -> None:
+     console.print(f"[bold red]Invalid model id format:[/bold red] {arg}")
+     console.print(
+         "[dim]Expected:\n"
+         " • <org>/<model>[:tag] (HF router — paste from huggingface.co)\n"
+         " • anthropic/<model>\n"
+         " • openai/<model>[/dim]"
+     )
+
+
+ async def probe_and_switch_model(
+     model_id: str,
+     config,
+     session,
+     console,
+     hf_token: str | None,
+ ) -> None:
+     """Validate model+effort with a 1-token ping, cache the effective effort,
+     then commit the switch.
+
+     Three visible outcomes:
+
+     * ✓ ``effort: <level>`` — model accepted the preferred effort (or a
+       fallback from the cascade; the note explains if so)
+     * ✓ ``effort: off`` — model doesn't support thinking; we'll strip it
+     * ✗ hard error (auth, model-not-found, quota) — we reject the switch
+       and keep the current model so the user isn't stranded
+
+     Transient errors (5xx, timeout) complete the switch with a yellow
+     warning; the next real call re-surfaces the error if it's persistent.
+     """
+     preference = config.reasoning_effort
+     if not _print_hf_routing_info(model_id, console):
+         return
+
+     if not preference:
+         # Nothing to validate with a ping that we couldn't validate on the
+         # first real call just as cheaply. Skip the probe entirely.
+         _commit_switch(model_id, config, session, effective=None, cache=False)
+         console.print(f"[green]Model switched to {model_id}[/green] [dim](effort: off)[/dim]")
+         return
+
+     console.print(f"[dim]checking {model_id} (effort: {preference})...[/dim]")
+     try:
+         outcome = await probe_effort(model_id, preference, hf_token)
+     except ProbeInconclusive as e:
+         _commit_switch(model_id, config, session, effective=None, cache=False)
+         console.print(
+             f"[yellow]Model switched to {model_id}[/yellow] "
+             f"[dim](couldn't validate: {e}; will verify on first message)[/dim]"
+         )
+         return
+     except Exception as e:
+         # Hard persistent error — auth, unknown model, quota. Don't switch.
+         console.print(f"[bold red]Switch failed:[/bold red] {e}")
+         console.print(f"[dim]Keeping current model: {config.model_name}[/dim]")
+         return
+
+     _commit_switch(
+         model_id, config, session,
+         effective=outcome.effective_effort, cache=True,
+     )
+     effort_label = outcome.effective_effort or "off"
+     suffix = f" — {outcome.note}" if outcome.note else ""
+     console.print(
+         f"[green]Model switched to {model_id}[/green] "
+         f"[dim](effort: {effort_label}{suffix}, {outcome.elapsed_ms}ms)[/dim]"
+     )
+
+
+ def _commit_switch(model_id, config, session, effective, cache: bool) -> None:
+     """Apply the switch to the session (or bare config if no session yet).
+
+     ``effective`` is the probe's resolved effort; ``cache=True`` stores it
+     in the session's per-model cache so real calls use the resolved level
+     instead of re-probing. ``cache=False`` (inconclusive probe / effort
+     off) drops any cached entry for this model, so the next call falls
+     back to the preference.
+     """
+     if session is not None:
+         session.update_model(model_id)
+         if cache:
+             session.model_effective_effort[model_id] = effective
+         else:
+             session.model_effective_effort.pop(model_id, None)
+     else:
+         config.model_name = model_id
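
The effort cascade that `probe_effort` resolves is implemented in `agent/core/effort_probe.py`, which is not part of this section. The sketch below only illustrates the idea described in the commit message (walk `max -> xhigh -> high -> medium -> low` until the provider stops rejecting). `fake_ping`, `EffortRejected`, and `walk_cascade` are hypothetical names, and the "provider" is a plain set of accepted levels rather than a real API call.

```python
import asyncio

CASCADE = ["max", "xhigh", "high", "medium", "low"]


class EffortRejected(Exception):
    """Hypothetical signal that the provider refused this effort level."""


async def fake_ping(model_id, effort, accepted):
    """Stand-in for a 1-token request; rejects levels outside `accepted`."""
    if effort not in accepted:
        raise EffortRejected(effort)


async def walk_cascade(model_id, preference, accepted):
    """Return the first accepted level at or below the preference, else None."""
    if preference in CASCADE:
        levels = CASCADE[CASCADE.index(preference):]
    else:
        levels = [preference]  # e.g. "minimal": try it as-is, no fallback
    for level in levels:
        try:
            await fake_ping(model_id, level, accepted)
        except EffortRejected:
            continue  # step down to the next level
        return level
    return None  # nothing accepted: treat the model as non-thinking


resolved = asyncio.run(walk_cascade("example/model", "max", {"high", "medium", "low"}))
```

In this sketch a `None` result corresponds to the "effort: off" outcome above; hard errors and transient failures, which the real probe distinguishes, are left out.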
agent/core/session.py CHANGED
@@ -109,6 +109,16 @@ class Session:
          self.turn_count: int = 0
          self.last_auto_save_turn: int = 0

+         # Per-model probed reasoning-effort cache. Populated by the probe
+         # on /model switch, read by ``effective_effort_for`` below. Keys are
+         # raw model ids (including any ``:tag``). Values:
+         #   str → the effort level to send (may be a downgrade from the
+         #         preference, e.g. "high" when user asked for "max")
+         #   None → model rejected all efforts in the cascade; send no
+         #          thinking params at all
+         #   Key absent → not probed yet; fall back to the raw preference.
+         self.model_effective_effort: dict[str, str | None] = {}
+
      async def send_event(self, event: Event) -> None:
          """Send event back to client and log to trajectory"""
          await self.event_queue.put(event)
@@ -139,6 +149,19 @@ class Session:
          self.config.model_name = model_name
          self.context_manager.model_max_tokens = _get_max_tokens_safe(model_name)

+     def effective_effort_for(self, model_name: str) -> str | None:
+         """Resolve the effort level to actually send for ``model_name``.
+
+         Returns the probed result when we have one (may be ``None`` meaning
+         "model doesn't do thinking, strip it"), else the raw preference.
+         Unknown-model case falls back to the preference so a stale cache
+         from a prior ``/model`` can't poison research sub-calls that use a
+         different model id.
+         """
+         if model_name in self.model_effective_effort:
+             return self.model_effective_effort[model_name]
+         return self.config.reasoning_effort
+
      def increment_turn(self) -> None:
          """Increment turn counter (called after each user interaction)"""
          self.turn_count += 1
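
The three cache states (string, `None`, key absent) are easy to conflate, so here is a small standalone sketch of the lookup. `EffortCache` and `set_preference` are hypothetical stand-ins for the relevant parts of `Session` and the `/effort` handler, reduced to what the example needs.

```python
from __future__ import annotations


class EffortCache:
    """Toy model of the per-model effort cache described above."""

    def __init__(self, preference: str | None) -> None:
        self.preference = preference
        self.model_effective_effort: dict[str, str | None] = {}

    def effective_effort_for(self, model_name: str) -> str | None:
        # A membership test (not .get) keeps "cached None" distinct from "absent".
        if model_name in self.model_effective_effort:
            return self.model_effective_effort[model_name]
        return self.preference

    def set_preference(self, level: str | None) -> None:
        # Mirrors the /effort handler: a new preference invalidates probed results.
        self.preference = level
        self.model_effective_effort.clear()


cache = EffortCache("max")
cache.model_effective_effort["model-a"] = "high"  # probe downgraded max to high
cache.model_effective_effort["model-b"] = None    # probe found no thinking support
```

With this setup, `model-a` resolves to `"high"`, `model-b` to `None`, and an unprobed id falls back to `"max"`.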
agent/main.py CHANGED
@@ -22,6 +22,7 @@ from prompt_toolkit import PromptSession
22
 
23
  from agent.config import load_config
24
  from agent.core.agent_loop import submission_loop
 
25
  from agent.core.session import OpType
26
  from agent.core.tools import ToolRouter
27
  from agent.utils.reliability_checks import check_training_script_save_pattern
@@ -49,39 +50,6 @@ litellm.drop_params = True
49
  # on every error — users don't need it, and our friendly errors cover the case.
50
  litellm.suppress_debug_info = True
51
 
52
- # ── Suggested models shown by `/model` (not a gate) ──────────────────────
53
- # Users can paste any HF model id (e.g. "MiniMaxAI/MiniMax-M2.7") or use one
54
- # of the `anthropic/` / `openai/` prefixes for direct API access. For HF ids,
55
- # append ":fastest" / ":cheapest" / ":preferred" / ":<provider>" to override
56
- # the default routing policy (auto = fastest with failover).
57
- SUGGESTED_MODELS = [
58
- {"id": "anthropic/claude-opus-4-6", "label": "Claude Opus 4.6"},
59
- {"id": "MiniMaxAI/MiniMax-M2.7", "label": "MiniMax M2.7"},
60
- {"id": "moonshotai/Kimi-K2.6", "label": "Kimi K2.6"},
61
- {"id": "zai-org/GLM-5.1", "label": "GLM 5.1"},
62
- ]
63
-
64
-
65
- def _is_valid_model_id(model_id: str) -> bool:
66
- """Loose format check — lets users pick any model id.
67
-
68
- Accepts:
69
- • anthropic/<model>
70
- • openai/<model>
71
- • <org>/<model>[:<tag>] (HF router; tag = provider or policy)
72
- • huggingface/<org>/<model>[:<tag>] (same, accepts legacy prefix)
73
-
74
- Actual availability is verified against the HF router catalog on switch,
75
- or by the provider on first call.
76
- """
77
- if not model_id or "/" not in model_id:
78
- return False
79
- # Strip :tag suffix before structural check
80
- head = model_id.split(":", 1)[0]
81
- parts = head.split("/")
82
- return len(parts) >= 2 and all(parts)
83
-
84
-
85
  def _safe_get_args(arguments: dict) -> dict:
86
  """Safely extract args dict from arguments, handling cases where LLM passes string."""
87
  args = arguments.get("args", {})
@@ -91,80 +59,6 @@ def _safe_get_args(arguments: dict) -> dict:
91
  return args if isinstance(args, dict) else {}
92
 
93
 
94
- _ROUTING_POLICIES = {"fastest", "cheapest", "preferred"}
95
-
96
-
97
- def _print_model_preflight(model_id: str, console) -> None:
98
- """Validate a model switch against the HF router catalog and show the
99
- user what they're about to use (providers, price, context, tool support).
100
-
101
- Anthropic/OpenAI ids skip the catalog — those are direct API calls.
102
- For unknown HF ids we print a red warning with fuzzy suggestions but
103
- still allow the switch (the catalog might be lagging).
104
- """
105
- if model_id.startswith(("anthropic/", "openai/")):
106
- console.print(f"[green]Model switched to {model_id}[/green]")
107
- return
108
-
109
- from agent.core import hf_router_catalog as cat
110
-
111
- bare, _, tag = model_id.partition(":")
112
- info = cat.lookup(bare)
113
- if info is None:
114
- console.print(
115
- f"[bold red]Warning:[/bold red] '{bare}' isn't in the HF router "
116
- "catalog. Switching anyway — first call may fail."
117
- )
118
- suggestions = cat.fuzzy_suggest(bare)
119
- if suggestions:
120
- console.print(f"[dim]Did you mean: {', '.join(suggestions)}[/dim]")
121
- return
122
-
123
- live = info.live_providers
124
- if not live:
125
- console.print(
126
- f"[bold red]Warning:[/bold red] '{bare}' has no live providers "
127
- "right now. First call will likely fail."
128
- )
129
- return
130
-
131
- if tag and tag not in _ROUTING_POLICIES:
132
- matched = [p for p in live if p.provider == tag]
133
- if not matched:
134
- names = ", ".join(p.provider for p in live)
135
- console.print(
136
- f"[bold red]Warning:[/bold red] provider '{tag}' doesn't serve "
137
- f"'{bare}'. Live providers: {names}. Switching anyway."
138
- )
139
- return
140
-
141
- if not info.any_supports_tools:
142
- console.print(
143
- f"[bold red]Warning:[/bold red] no provider for '{bare}' advertises "
144
- "tool-call support. This agent relies on tool calls — expect errors."
145
- )
146
-
147
- console.print(f"[green]Model switched to {model_id}[/green]")
148
- if tag in _ROUTING_POLICIES:
149
- policy = tag
150
- elif tag:
151
- policy = f"pinned to {tag}"
152
- else:
153
- policy = "auto (fastest)"
154
- console.print(f" [dim]routing: {policy}[/dim]")
155
- for p in live:
156
- price = (
157
- f"${p.input_price:g}/${p.output_price:g} per M tok"
158
- if p.input_price is not None and p.output_price is not None
159
- else "price n/a"
160
- )
161
- ctx = f"{p.context_length:,} ctx" if p.context_length else "ctx n/a"
162
- tools = "tools" if p.supports_tools else "no tools"
163
- console.print(
164
- f" [dim]{p.provider}: {price}, {ctx}, {tools}[/dim]"
165
- )
166
-
167
-
168
  def _get_hf_token() -> str | None:
169
  """Get HF token from environment, huggingface_hub API, or cached token file."""
170
  token = os.environ.get("HF_TOKEN")
@@ -807,7 +701,7 @@ async def get_user_input(prompt_session: PromptSession) -> str:
807
  # Slash commands are defined in terminal_display
808
 
809
 
810
- def _handle_slash_command(
811
  cmd: str,
812
  config,
813
  session_holder: list,
@@ -817,6 +711,9 @@ def _handle_slash_command(
817
  """
818
  Handle a slash command. Returns a Submission to enqueue, or None if
819
  the command was handled locally (caller should set turn_complete_event).
 
 
 
820
  """
821
  parts = cmd.strip().split(None, 1)
822
  command = parts[0].lower()
@@ -843,35 +740,16 @@ def _handle_slash_command(
843
  if command == "/model":
844
  console = get_console()
845
  if not arg:
846
- current = config.model_name if config else ""
847
- console.print("[bold]Current model:[/bold]")
848
- console.print(f" {current}")
849
- console.print("\n[bold]Suggested:[/bold]")
850
- for m in SUGGESTED_MODELS:
851
- marker = " [dim]<-- current[/dim]" if m["id"] == current else ""
852
- console.print(f" {m['id']} [dim]({m['label']})[/dim]{marker}")
853
- console.print(
854
- "\n[dim]Paste any HF model id (e.g. 'MiniMaxAI/MiniMax-M2.7').\n"
855
- "Add ':fastest', ':cheapest', ':preferred', or ':<provider>' to override routing.\n"
856
- "Use 'anthropic/<model>' or 'openai/<model>' for direct API access.[/dim]"
857
- )
858
  return None
859
- if not _is_valid_model_id(arg):
860
- console.print(f"[bold red]Invalid model id format:[/bold red] {arg}")
861
- console.print(
862
- "[dim]Expected:\n"
863
- " • <org>/<model>[:tag] (HF router — paste from huggingface.co)\n"
864
- " • anthropic/<model>\n"
865
- " • openai/<model>[/dim]"
866
- )
867
  return None
868
  normalized = arg.removeprefix("huggingface/")
869
- _print_model_preflight(normalized, console)
870
  session = session_holder[0] if session_holder else None
871
- if session:
872
- session.update_model(normalized)
873
- else:
874
- config.model_name = normalized
875
  return None
876
 
877
  if command == "/yolo":
@@ -882,14 +760,19 @@ def _handle_slash_command(
882
 
883
  if command == "/effort":
884
  console = get_console()
885
- valid = {"minimal", "low", "medium", "high", "off"}
 
886
  if not arg:
887
  current = config.reasoning_effort or "off"
888
- console.print(f"[bold]Reasoning effort:[/bold] {current}")
 
 
 
 
889
  console.print(
890
- "[dim]Set with '/effort minimal|low|medium|high|off'. "
891
- "Applies to models that support it (GPT-5 / o-series, Claude "
892
- "extended thinking, HF reasoning models); dropped otherwise.[/dim]"
893
  )
894
  return None
895
  level = arg.lower()
@@ -898,7 +781,16 @@ def _handle_slash_command(
898
  console.print(f"[dim]Expected one of: {', '.join(sorted(valid))}[/dim]")
899
  return None
900
  config.reasoning_effort = None if level == "off" else level
 
 
 
 
901
  console.print(f"[green]Reasoning effort: {level}[/green]")
 
 
 
 
 
902
  return None
903
 
904
  if command == "/status":
@@ -1083,7 +975,7 @@ async def main():
1083
 
1084
  # Handle slash commands
1085
  if user_input.strip().startswith("/"):
1086
- sub = _handle_slash_command(
1087
  user_input.strip(), config, session_holder, submission_queue, submission_id
1088
  )
1089
  if sub is None:
 
22
 
23
  from agent.config import load_config
24
  from agent.core.agent_loop import submission_loop
25
+ from agent.core import model_switcher
26
  from agent.core.session import OpType
27
  from agent.core.tools import ToolRouter
28
  from agent.utils.reliability_checks import check_training_script_save_pattern
 
50
  # on every error — users don't need it, and our friendly errors cover the case.
51
  litellm.suppress_debug_info = True
52

53
  def _safe_get_args(arguments: dict) -> dict:
54
  """Safely extract args dict from arguments, handling cases where LLM passes string."""
55
  args = arguments.get("args", {})
 
59
  return args if isinstance(args, dict) else {}
60
 
61

62
  def _get_hf_token() -> str | None:
63
  """Get HF token from environment, huggingface_hub API, or cached token file."""
64
  token = os.environ.get("HF_TOKEN")
 
701
  # Slash commands are defined in terminal_display
702
 
703
 
704
+ async def _handle_slash_command(
705
  cmd: str,
706
  config,
707
  session_holder: list,
 
711
  """
712
  Handle a slash command. Returns a Submission to enqueue, or None if
713
  the command was handled locally (caller should set turn_complete_event).
714
+
715
+ Async because ``/model`` fires a probe ping to validate the model+effort
716
+ combo before committing the switch.
717
  """
718
  parts = cmd.strip().split(None, 1)
719
  command = parts[0].lower()
 
740
  if command == "/model":
741
  console = get_console()
742
  if not arg:
743
+ model_switcher.print_model_listing(config, console)
 
 
 
 
 
 
 
 
 
 
 
744
  return None
745
+ if not model_switcher.is_valid_model_id(arg):
746
+ model_switcher.print_invalid_id(arg, console)
 
 
 
 
 
 
747
  return None
748
  normalized = arg.removeprefix("huggingface/")
 
749
  session = session_holder[0] if session_holder else None
750
+ await model_switcher.probe_and_switch_model(
751
+ normalized, config, session, console, _get_hf_token(),
752
+ )
 
753
  return None
754
 
755
  if command == "/yolo":
 
760
 
761
  if command == "/effort":
762
  console = get_console()
763
+ valid = {"minimal", "low", "medium", "high", "xhigh", "max", "off"}
764
+ session = session_holder[0] if session_holder else None
765
  if not arg:
766
  current = config.reasoning_effort or "off"
767
+ console.print(f"[bold]Reasoning effort preference:[/bold] {current}")
768
+ if session and session.model_effective_effort:
769
+ console.print("[dim]Probed per model:[/dim]")
770
+ for m, eff in session.model_effective_effort.items():
771
+ console.print(f" [dim]{m}: {eff or 'off'}[/dim]")
772
  console.print(
773
+ "[dim]Set with '/effort minimal|low|medium|high|xhigh|max|off'. "
774
+ "'max' and 'xhigh' are Anthropic-only; the cascade falls back "
775
+ "to whatever the model actually accepts.[/dim]"
776
  )
777
  return None
778
  level = arg.lower()
 
781
  console.print(f"[dim]Expected one of: {', '.join(sorted(valid))}[/dim]")
782
  return None
783
  config.reasoning_effort = None if level == "off" else level
784
+ # Drop the per-model probe cache — the new preference may resolve
785
+ # differently. Next ``/model`` (or the retry safety net) reprobes.
786
+ if session is not None:
787
+ session.model_effective_effort.clear()
788
  console.print(f"[green]Reasoning effort: {level}[/green]")
789
+ if session is not None:
790
+ console.print(
791
+ "[dim]run /model <current> to re-probe, or send a message — "
792
+ "the agent adjusts automatically if the new level isn't supported.[/dim]"
793
+ )
794
  return None
795
 
796
  if command == "/status":
 
975
 
976
  # Handle slash commands
977
  if user_input.strip().startswith("/"):
978
+ sub = await _handle_slash_command(
979
  user_input.strip(), config, session_holder, submission_queue, submission_id
980
  )
981
  if sub is None:
agent/tools/research_tool.py CHANGED
@@ -246,10 +246,16 @@ async def research_handler(
      # Use a cheaper/faster model for research
      main_model = session.config.model_name
      research_model = _get_research_model(main_model)
+     # Research is a cheap sub-call — cap the main session's effort at "high"
+     # so a user preference of ``max``/``xhigh`` (valid for Opus 4.6/4.7) doesn't
+     # propagate to a Sonnet research model that may not accept those levels.
+     # We also haven't probed this sub-model so we don't know its ceiling.
+     _pref = getattr(session.config, "reasoning_effort", None)
+     _capped = "high" if _pref in ("max", "xhigh") else _pref
      llm_params = _resolve_llm_params(
          research_model,
          getattr(session, "hf_token", None),
-         reasoning_effort=getattr(session.config, "reasoning_effort", None),
+         reasoning_effort=_capped,
      )

      # Get read-only tool specs from the session's tool router
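
The capping rule in this hunk is small enough to check by hand. The helper below is a hypothetical extraction of the two added lines, written as a function only so its behavior can be exercised on its own.

```python
def cap_effort_for_subcall(preference, ceiling="high", above=("max", "xhigh")):
    """Clamp levels listed in `above` down to `ceiling`; pass others through."""
    return ceiling if preference in above else preference


capped = [cap_effort_for_subcall(p) for p in ("max", "xhigh", "high", "low", None)]
```

Levels at or below the ceiling, and an unset preference, pass through unchanged.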
agent/utils/terminal_display.py CHANGED
@@ -440,7 +440,7 @@ HELP_TEXT = f"""\
  {_I} [cyan]/undo[/cyan] Undo last turn
  {_I} [cyan]/compact[/cyan] Compact context window
  {_I} [cyan]/model[/cyan] [id] Show available models or switch
- {_I} [cyan]/effort[/cyan] [level] Reasoning effort (minimal|low|medium|high|off)
+ {_I} [cyan]/effort[/cyan] [level] Reasoning effort (minimal|low|medium|high|xhigh|max|off)
  {_I} [cyan]/yolo[/cyan] Toggle auto-approve mode
  {_I} [cyan]/status[/cyan] Current model & turn count
  {_I} [cyan]/quit[/cyan] Exit"""