TomLii committed on Apr 20

Commit

8e8119b

1 Parent(s): 3fd8fc1

Speed up Quest-4B research: add Serper backend and stream live progress

Two user-perceivable wins on the Quest endpoint, which was taking 60+ s per
question on the Space and left the UI blank the whole time:

1. Wire up Google Serper as the primary search backend. When
SERPER_API_KEY (or SERPER_KEY_ID, matching the research repo's env
name) is set in Space secrets, `_run_search_single` now hits Serper
first and falls back to DuckDuckGo only if Serper errors. Serper
responds in <1 s and is not subject to the 202 Ratelimit that shared
HF Space IPs routinely trip on html.duckduckgo.com, which both cuts
latency and eliminates the "Error: 202 Ratelimit" failures users
were hitting on comparison-table queries.

2. Convert build_research_agent and run_ui into Gradio generators that
emit a live progress panel between turns: "turn N: thinking…",
"turn N: searching `...`", "got 5 hit(s) via serper", "writing final
answer". The total wall-clock time of a Quest run is unchanged but
the user now sees what the agent is doing instead of staring at an
empty Result pane for a minute.
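
A minimal sketch of the streaming pattern in item 2, assuming the agent yields `(result, trace)` pairs and that only the last yield carries the real answer. `research_with_progress` and its canned steps are hypothetical stand-ins, not the code in app.py. Gradio re-renders the bound outputs each time a generator handler yields, which is what makes the panel appear live.

```python
from typing import Iterator, List, Tuple


def research_with_progress(
    question: str, max_turns: int = 3
) -> Iterator[Tuple[str, str]]:
    """Yield (result_markdown, trace_text) pairs; the last pair is the answer.

    Hypothetical stand-in for the agent loop: the model call and the search
    call are replaced by log lines so the control flow is visible.
    """
    log: List[str] = []

    def panel() -> str:
        # Intermediate yields show a status header plus the steps so far.
        return "⏳ Researching\n\n" + "\n".join(f"- {line}" for line in log)

    for turn in range(1, max_turns + 1):
        log.append(f"turn {turn}: thinking…")
        yield panel(), ""
        if turn == max_turns:
            break  # pretend the model produced an answer on the last turn
        log.append(f"turn {turn}: searching `{question}`")
        yield panel(), ""

    log.append("writing final answer")
    yield panel(), ""
    # Final yield replaces the progress panel with the real result.
    yield f"Final answer for: {question}", "\n".join(log)


if __name__ == "__main__":
    for result, _trace in research_with_progress("Mercury distance AU", 2):
        print(result)
        print("---")
```

A caller that only wants the final answer can take the last item of `list(research_with_progress(...))`; a Gradio click handler can instead `yield from` the generator so each intermediate pair reaches the UI.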

Also: lower the default Max Turns slider from 8 to 6 (most research
queries finish in 2-4 turns; going to 8 mostly just burns budget on
dead-end branches) and update .env.example to document SERPER_API_KEY,
QUEST_MAX_NEW_TOKENS, and which of the other research-repo env vars
(JINA_API_KEYS, OpenAI keys, SUMMARY_MODEL_NAME, etc.) are NOT currently
wired into the Space starter so future deploys are not surprised that
setting them has no effect.

Regression coverage in _test_markdown_fix.py now includes: Serper being
preferred when the key is set, graceful DDG fallback when Serper errors,
graceful error when both fail, and an end-to-end mock run of the
generator verifying multiple progress yields before a final real answer.
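
The three search cases covered above (Serper preferred, DDG fallback, both failing) can be sketched as a small dispatcher. This is an illustration under assumptions, not the body of `_run_search_single`: `search_with_fallback`, its parameters, and the hint text are hypothetical, and the backends are passed in as plain callables so the sketch runs without network access.

```python
from typing import Any, Callable, Dict, List, Tuple

SearchFn = Callable[[str, int], Dict[str, Any]]

# Hypothetical hint text; the real constant lives in app.py.
HINT = "Search is unavailable; answer from existing knowledge or retry later."


def search_with_fallback(
    query: str,
    max_results: int,
    primary: SearchFn,
    fallback: SearchFn,
    primary_enabled: bool,
) -> Dict[str, Any]:
    """Try the primary backend first, then the fallback. Never raises."""
    backends: List[Tuple[str, SearchFn]] = []
    if primary_enabled:  # e.g. an API key is configured
        backends.append(("serper", primary))
    backends.append(("duckduckgo", fallback))

    errors: List[str] = []
    for name, backend in backends:
        try:
            result = backend(query, max_results)
        except Exception as exc:  # a backend failure must not abort the run
            result = {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
        if result.get("ok"):
            return {**result, "backend": name}
        errors.append(f"{name}: {result.get('error', 'unknown error')}")

    # Every backend failed: report all errors in one structured payload.
    return {
        "ok": False,
        "query": query,
        "error": "; ".join(errors),
        "results": [],
        "hint": HINT,
    }
```

Returning a structured `ok: False` payload instead of raising lets the agent loop pass the failure back to the model as a tool response and keep going.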

Made-with: Cursor

Files changed (3)

.env.example +36 -1
_test_markdown_fix.py +466 -0
app.py +242 -40

.env.example CHANGED

@@ -1,4 +1,8 @@
-# Required: personal HF token with read access to osunlp/Quest-4B.
+# =============================================================================
+# Required
+# =============================================================================
+# Personal HF token with read access to osunlp/Quest-4B.
 HF_TOKEN=hf_xxx
 # Dedicated HF Inference Endpoint URL that serves osunlp/Quest-4B.
@@ -11,3 +15,34 @@ QUEST_ENDPOINT_MODEL=tgi
 # Default model preselected in the dropdown.
 DEFAULT_MODEL=osunlp/Quest-4B
+# =============================================================================
+# Recommended: strongly improves latency and reliability
+# =============================================================================
+# Google Serper API key. When set, the `search` tool uses Serper first and only
+# falls back to the DuckDuckGo HTML backend if Serper fails. Serper is ~10x
+# faster than scraping DDG and is not subject to the 202 Ratelimit that hits
+# shared HF Space IPs. Get one at https://serper.dev/api-key
+# Either name is accepted to match the research repo's convention:
+SERPER_API_KEY=
+# SERPER_KEY_ID=
+# Max tokens the Quest endpoint is allowed to emit per turn. 4096 gives the
+# <think> block enough room; raise to 6144 for very long research reports.
+QUEST_MAX_NEW_TOKENS=4096
+# =============================================================================
+# Optional: not currently wired into app.py (listed for reference)
+# =============================================================================
+# The research repo (QUEST-main/inference) uses these to plug in Jina Reader
+# for HTML-to-markdown extraction and GPT for condenser/summarization, but the
+# Space starter does not call either of them. Setting them here has no effect
+# today; they are listed only so you know what you'd plug in for the full
+# research pipeline.
+# JINA_API_KEYS=
+# API_KEY=                   # OpenAI API key
+# SUMMARY_MODEL_NAME=gpt-5-mini
+# MEMORY_MODEL_NAME=gpt-5-mini
+# MEMORY_OPENAI_API_KEY=

_test_markdown_fix.py ADDED

@@ -0,0 +1,466 @@

+"""
+Regression tests for the '<answer>...</answer>' placeholder bug that caused the
+Space to render only a literal `...` instead of the real (often table-shaped)
+final answer.
+These tests are plain asserts, runnable with `python _test_markdown_fix.py`.
+They import the fixed helpers directly from `app.py` without booting Gradio.
+"""
+import os
+import sys
+from pathlib import Path
+# Do not start the Gradio UI when importing app.py.
+os.environ.setdefault("GRADIO_SERVER_PORT", "0")
+HERE = Path(__file__).resolve().parent
+sys.path.insert(0, str(HERE))
+from unittest import mock
+from app import (
+    extract_answer,
+    strip_think_blocks,
+    ensure_markdown_table_blank_lines,
+    decode_escaped_whitespace,
+    _is_placeholder_answer,
+    parse_tool_call,
+)
+def _check(name: str, actual, expected) -> None:
+    ok = actual == expected
+    status = "PASS" if ok else "FAIL"
+    print(f"[{status}] {name}")
+    if not ok:
+        print(f"  expected: {expected!r}")
+        print(f"  actual  : {actual!r}")
+    assert ok, name
+# -------------------------------------------------------------------------
+# 1. The original bug: Quest-4B echoes the template literally.
+# -------------------------------------------------------------------------
+_check(
+    "echoed placeholder `<answer>...</answer>` is rejected",
+    extract_answer("<answer>...</answer>"),
+    None,
+)
+_check(
+    "echoed unicode ellipsis `<answer>…</answer>` is rejected",
+    extract_answer("<answer>…</answer>"),
+    None,
+)
+_check(
+    "whitespace-only `<answer>   </answer>` is rejected",
+    extract_answer("<answer>   </answer>"),
+    None,
+)
+_check(
+    "placeholder detector recognises ASCII dots",
+    _is_placeholder_answer("..."),
+    True,
+)
+_check(
+    "placeholder detector recognises unicode ellipsis",
+    _is_placeholder_answer("…"),
+    True,
+)
+_check(
+    "placeholder detector recognises interpunct",
+    _is_placeholder_answer("·"),
+    True,
+)
+_check(
+    "placeholder detector accepts real text",
+    _is_placeholder_answer("The answer is 3..."),
+    False,
+)
+# -------------------------------------------------------------------------
+# 2. A real Markdown table inside <answer> survives round-trip.
+# -------------------------------------------------------------------------
+table_body = "| Color | Hex |\n|---|---|\n| Red | #ff0000 |\n| Green | #00ff00 |"
+_check(
+    "Markdown table inside <answer> is returned intact",
+    extract_answer(f"<answer>\n{table_body}\n</answer>"),
+    table_body,
+)
+# -------------------------------------------------------------------------
+# 3. <think> block is stripped before extracting the answer.
+# -------------------------------------------------------------------------
+_check(
+    "<think>...</think> is removed from answer content",
+    extract_answer("<think>reasoning goes here</think><answer>real answer</answer>"),
+    "real answer",
+)
+_check(
+    "multi-line <think> is removed",
+    extract_answer(
+        "<think>line 1\nline 2\nline 3</think>\n<answer>the truth</answer>"
+    ),
+    "the truth",
+)
+_check(
+    "strip_think_blocks leaves non-think content alone",
+    strip_think_blocks("plain text"),
+    "plain text",
+)
+# -------------------------------------------------------------------------
+# 4. Truncated output: <answer> opened, never closed.
+# -------------------------------------------------------------------------
+_check(
+    "truncated `<answer>` with real text is still extracted",
+    extract_answer("<answer>Here is the partial answer"),
+    "Here is the partial answer",
+)
+_check(
+    "truncated `<answer>` that is just dots is still rejected",
+    extract_answer("<answer>..."),
+    None,
+)
+# -------------------------------------------------------------------------
+# 5. ensure_markdown_table_blank_lines inserts the required break.
+# -------------------------------------------------------------------------
+glued = "Here is the comparison:\n| Col | Val |\n|---|---|\n| a | b |"
+fixed = ensure_markdown_table_blank_lines(glued)
+assert "\n\n| Col | Val |" in fixed, f"blank line was not inserted: {fixed!r}"
+print("[PASS] ensure_markdown_table_blank_lines inserts break before table")
+already_ok = "Here is the comparison:\n\n| Col | Val |\n|---|---|\n| a | b |"
+_check(
+    "ensure_markdown_table_blank_lines is a no-op when blank line already exists",
+    ensure_markdown_table_blank_lines(already_ok),
+    already_ok,
+)
+table_at_start = "| Col | Val |\n|---|---|\n| a | b |"
+_check(
+    "ensure_markdown_table_blank_lines leaves a table at the very start alone",
+    ensure_markdown_table_blank_lines(table_at_start),
+    table_at_start,
+)
+# -------------------------------------------------------------------------
+# 6. parse_tool_call still works after the <think>-stripping refactor.
+# -------------------------------------------------------------------------
+tool_out = (
+    "<think>I should search for this</think>\n"
+    '<tool_call>{"name": "search", "arguments": {"query": ["hello"]}}</tool_call>'
+)
+name, args, err = parse_tool_call(tool_out)
+assert err is None, f"unexpected parse error: {err}"
+_check("parse_tool_call extracts name", name, "search")
+_check("parse_tool_call extracts arguments", args, {"query": ["hello"]})
+# -------------------------------------------------------------------------
+# 7. Escaped-whitespace decoding (the 2nd reported bug):
+#    the endpoint returned `\n` as literal 2-char sequences, so the
+#    pipe table rendered as a one-line sentence of `| a | b |\n...`.
+# -------------------------------------------------------------------------
+user_reported_payload = (
+    "\\n| Color | Hex |\\n|---|---|\\n| Red | #FF0000 |"
+    "\\n| Green | #00FF00 |\\n| Blue | #0000FF |\\n"
+)
+decoded_user_payload = decode_escaped_whitespace(user_reported_payload)
+assert "\n| Color | Hex |" in decoded_user_payload, decoded_user_payload
+assert "\\n" not in decoded_user_payload, decoded_user_payload
+print("[PASS] decode_escaped_whitespace converts the user-reported payload")
+# Extract from a full <answer> block whose content is escape-encoded.
+escape_encoded_answer = f"<answer>{user_reported_payload}</answer>"
+extracted_escape = extract_answer(escape_encoded_answer)
+assert extracted_escape is not None
+assert "| Red | #FF0000 |" in extracted_escape
+assert "\\n" not in extracted_escape
+# And the separator must be on its own line so GFM recognises the table.
+assert "|---|---|" in extracted_escape
+print("[PASS] extract_answer decodes escape-encoded <answer> into real newlines")
+# Heuristic: do NOT decode when escapes are rare (a real code example).
+code_example = 'Some prose with a single \\n in a code example.'
+_check(
+    "decode_escaped_whitespace leaves lightly-escaped prose alone",
+    decode_escaped_whitespace(code_example),
+    code_example,
+)
+# Heuristic: do NOT decode when real newlines already dominate.
+mostly_real = "real\nnewlines\nhere\nwith\\none escape"
+_check(
+    "decode_escaped_whitespace leaves mostly-real-newline text alone",
+    decode_escaped_whitespace(mostly_real),
+    mostly_real,
+)
+# Heuristic: DO decode when escapes clearly dominate.
+mostly_escaped = "one real\n then \\na \\nb \\nc \\nd"
+decoded_ok = decode_escaped_whitespace(mostly_escaped)
+assert decoded_ok.count("\n") > mostly_escaped.count("\n"), decoded_ok
+assert decoded_ok.count("\\n") == 0, decoded_ok
+print("[PASS] decode_escaped_whitespace decodes when escapes dominate")
+# -------------------------------------------------------------------------
+# 8. End-to-end: the originally-reported scenario now renders a real table.
+# -------------------------------------------------------------------------
+buggy_output = "<answer>...</answer>"
+good_output = (
+    "<think>let me build the table</think>\n"
+    "<answer>\n"
+    "Here is the table:\n"
+    "| Planet | Distance (AU) |\n"
+    "|---|---|\n"
+    "| Mercury | 0.39 |\n"
+    "| Venus | 0.72 |\n"
+    "| Earth | 1.00 |\n"
+    "</answer>"
+)
+# The buggy case must no longer be accepted as an answer.
+assert extract_answer(buggy_output) is None
+# The good case must round-trip AND come out table-ready.
+extracted = extract_answer(good_output)
+assert extracted is not None
+rendered_ready = ensure_markdown_table_blank_lines(extracted)
+assert "\n\n| Planet | Distance (AU) |" in rendered_ready, rendered_ready
+print("[PASS] end-to-end: placeholder rejected, real table rendered with blank line")
+# -------------------------------------------------------------------------
+# 9. Search backend rate-limit no longer crashes the whole agent.
+#    Simulates the DuckDuckGo 202 Ratelimit error the user reported.
+# -------------------------------------------------------------------------
+import app as _app
+class _FakeRatelimit(Exception):
+    pass
+class _RatelimitedDDGS:
+    """Stand-in for DDGS that always raises the way ddgs does on 202."""
+    def __enter__(self):
+        return self
+    def __exit__(self, exc_type, exc, tb):
+        return False
+    def text(self, *args, **kwargs):
+        raise _FakeRatelimit("https://html.duckduckgo.com/html 202 Ratelimit")
+# Clear in-memory cache so the mock is actually exercised.
+_app.SEARCH_CACHE.clear()
+with mock.patch.object(_app, "DDGS", _RatelimitedDDGS), \
+     mock.patch.object(_app.time, "sleep", lambda *_a, **_k: None):
+    out = _app._run_search_single("iPhone 15 vs iPhone 16 features", max_results=3)
+assert out["ok"] is False, out
+assert "Ratelimit" in out["error"], out
+assert out["results"] == []
+assert "hint" in out and "training knowledge" in out["hint"], out
+print("[PASS] _run_search_single converts DDG rate-limit into a graceful tool error")
+# The caller that invokes build_research_agent wraps tool responses into a
+# user message; the important thing is that _run_search_single NEVER raises,
+# so the agent loop can continue and let the model produce an <answer>.
+_app.SEARCH_CACHE.clear()
+with mock.patch.object(_app, "DDGS", _RatelimitedDDGS), \
+     mock.patch.object(_app.time, "sleep", lambda *_a, **_k: None):
+    try:
+        _ = _app.run_search(["q1", "q2"], max_results=3)
+        raised = False
+    except Exception:
+        raised = True
+assert not raised, "run_search should not raise when DDG rate-limits"
+print("[PASS] run_search swallows backend errors across multi-query calls")
+# -------------------------------------------------------------------------
+# 10. Serper backend is preferred when SERPER_API_KEY is set, and DDG is
+#     used as a fallback. Verifies the latency fix for the iPhone query.
+# -------------------------------------------------------------------------
+class _FakeResponse:
+    def __init__(self, payload):
+        self._payload = payload
+    def raise_for_status(self):
+        return None
+    def json(self):
+        return self._payload
+def _fake_serper_ok(url, headers, json, timeout):  # noqa: A002 - gradio-style arg
+    assert headers.get("X-API-KEY") == "test-serper-key"
+    return _FakeResponse(
+        {
+            "answerBox": {
+                "title": "iPhone 16 vs 15",
+                "link": "https://example.com/answer",
+                "snippet": "Apple replaced the mute switch with an action button.",
+            },
+            "organic": [
+                {
+                    "title": "iPhone 16 Specs",
+                    "link": "https://example.com/iphone-16",
+                    "snippet": "A18 chip, 48 MP camera, ...",
+                },
+                {
+                    "title": "iPhone 15 Specs",
+                    "link": "https://example.com/iphone-15",
+                    "snippet": "A16 Bionic, Dynamic Island...",
+                },
+            ],
+        }
+    )
+_app.SEARCH_CACHE.clear()
+with mock.patch.object(_app, "SERPER_API_KEY", "test-serper-key"), \
+     mock.patch.object(_app.requests, "post", side_effect=_fake_serper_ok):
+    serper_out = _app._run_search_single("iPhone 16 vs iPhone 15", max_results=5)
+assert serper_out["ok"] is True, serper_out
+assert serper_out.get("backend") == "serper", serper_out
+assert serper_out["results"][0]["title"] == "iPhone 16 vs 15", serper_out  # answer box first
+assert len(serper_out["results"]) == 3, serper_out
+print("[PASS] Serper backend is preferred when SERPER_API_KEY is set")
+def _fake_serper_fail(url, headers, json, timeout):  # noqa: A002
+    raise RuntimeError("serper: 429 quota exceeded")
+class _WorkingDDGS:
+    def __enter__(self):
+        return self
+    def __exit__(self, exc_type, exc, tb):
+        return False
+    def text(self, *args, **kwargs):
+        yield {
+            "title": "DDG result",
+            "href": "https://example.org/ddg",
+            "body": "ddg fallback body",
+        }
+_app.SEARCH_CACHE.clear()
+with mock.patch.object(_app, "SERPER_API_KEY", "test-serper-key"), \
+     mock.patch.object(_app.requests, "post", side_effect=_fake_serper_fail), \
+     mock.patch.object(_app, "DDGS", _WorkingDDGS):
+    fallback_out = _app._run_search_single("anything", max_results=2)
+assert fallback_out["ok"] is True, fallback_out
+assert fallback_out.get("backend") == "duckduckgo", fallback_out
+assert fallback_out["results"][0]["href"] == "https://example.org/ddg"
+print("[PASS] Falls back to DuckDuckGo when Serper errors out")
+_app.SEARCH_CACHE.clear()
+with mock.patch.object(_app, "SERPER_API_KEY", "test-serper-key"), \
+     mock.patch.object(_app.requests, "post", side_effect=_fake_serper_fail), \
+     mock.patch.object(_app, "DDGS", _RatelimitedDDGS), \
+     mock.patch.object(_app.time, "sleep", lambda *_a, **_k: None):
+    both_fail = _app._run_search_single("anything", max_results=2)
+assert both_fail["ok"] is False, both_fail
+assert "serper" in both_fail["error"].lower(), both_fail
+assert "duckduckgo" in both_fail["error"].lower(), both_fail
+assert "hint" in both_fail
+print("[PASS] Returns graceful error when both Serper and DDG fail")
+# -------------------------------------------------------------------------
+# 11. build_research_agent streams progress (is a generator).
+# -------------------------------------------------------------------------
+import inspect as _inspect
+assert _inspect.isgeneratorfunction(_app.build_research_agent), (
+    "build_research_agent should be a generator so run_ui can stream progress"
+)
+assert _inspect.isgeneratorfunction(_app.run_ui), (
+    "run_ui should be a generator so Gradio streams per-turn status to the UI"
+)
+print("[PASS] build_research_agent and run_ui are streaming generators")
+# -------------------------------------------------------------------------
+# 12. End-to-end dry run of the generator: verify at least one progress
+#     tuple is yielded BEFORE the final answer, and that the final yield
+#     is a real answer (not a placeholder).
+# -------------------------------------------------------------------------
+_fake_model_script = [
+    (
+        "<think>I should search the web for Mercury distance.</think>"
+        '<tool_call>{"name": "search", "arguments": {"query": ["Mercury distance AU"]}}</tool_call>',
+        "fake-model",
+    ),
+    (
+        "<answer>\n"
+        "Here is the table:\n"
+        "| Planet | Distance (AU) |\n"
+        "|---|---|\n"
+        "| Mercury | 0.39 |\n"
+        "</answer>",
+        "fake-model",
+    ),
+]
+def _fake_call_model(*args, **kwargs):
+    return _fake_model_script.pop(0)
+class _FakeInferenceClient:
+    def __init__(self, *a, **k):
+        pass
+_app.SEARCH_CACHE.clear()
+with mock.patch.object(_app, "call_model", side_effect=_fake_call_model), \
+     mock.patch.object(_app, "_build_client_for_model",
+                       return_value=(_FakeInferenceClient(), "fake-model", [])), \
+     mock.patch.object(_app, "SERPER_API_KEY", "test-serper-key"), \
+     mock.patch.object(_app.requests, "post", side_effect=_fake_serper_ok):
+    gen = _app.build_research_agent(
+        question="How far is Mercury from the sun?",
+        model="fake-model",
+        max_turns=4,
+        max_search_results=3,
+        temperature=0.0,
+    )
+    emitted = list(gen)
+assert len(emitted) >= 3, f"expected multiple progress yields, got {len(emitted)}"
+final_answer, final_trace = emitted[-1]
+assert "Mercury" in final_answer, final_answer
+assert "| Planet |" in final_answer, final_answer
+assert "...</answer>" not in final_answer
+# Intermediate yields should have progress scaffolding.
+assert any("⏳ Researching" in ans for ans, _ in emitted[:-1]), (
+    "no intermediate progress yield detected"
+)
+print("[PASS] build_research_agent streams progress then a real final answer")
+print()
+print("All markdown-fix regression tests passed.")
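
For readers without app.py at hand, the behaviour that section 5 of the tests pins down can be sketched as follows. This is an assumption-laden illustration, not the committed `ensure_markdown_table_blank_lines`: the name `insert_blank_line_before_tables` is hypothetical, and the sketch treats any line starting with `|` as a table row, which is cruder than a real GFM table check.

```python
from typing import List


def insert_blank_line_before_tables(text: str) -> str:
    """Insert a blank line between prose and a pipe-table row glued to it.

    GFM renderers generally need a table to start its own block, so a
    header row that directly follows a prose line is pushed down by one
    empty line. Text that already has the break is returned unchanged.
    """
    out: List[str] = []
    for line in text.split("\n"):
        is_row = line.lstrip().startswith("|")
        if is_row and out:
            prev = out[-1]
            prev_is_prose = bool(prev.strip()) and not prev.lstrip().startswith("|")
            if prev_is_prose:
                out.append("")  # the missing blank line
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    glued = "Here is the comparison:\n| Col | Val |\n|---|---|\n| a | b |"
    print(insert_blank_line_before_tables(glued))
```

Because rows that follow another row or a blank line are left alone, running the function twice gives the same result as running it once.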

app.py CHANGED

@@ -960,22 +960,85 @@ _SEARCH_UNAVAILABLE_HINT = (
     "retry later if the question truly requires a fresh web lookup."
 )
-def _run_search_single(query: str, max_results: int) -> Dict[str, Any]:
-    """Run one DuckDuckGo query.
-    Returns a structured dict on both success and failure, never raises. If
-    the search backend rate-limits us (Space IPs share outbound NAT and
-    often trip DuckDuckGo's anti-scraping throttle), we return an
-    `ok: False` payload with a hint that lets the agent fall back to its
-    own knowledge instead of aborting the whole research run.
     """
-    if not query.strip():
-        return {"ok": False, "error": "Search query cannot be empty."}
-    cache_key = f"{query.strip().lower()}::{max_results}"
-    if cache_key in SEARCH_CACHE:
-        return {**SEARCH_CACHE[cache_key], "cached": True}
     last_exc: Optional[BaseException] = None
     for attempt in range(2):
         try:
@@ -989,14 +1052,15 @@ def _run_search_single(query: str, max_results: int) -> Dict[str, Any]:
                             "body": item.get("body", ""),
                         }
                     )
-            payload = {"ok": True, "query": query, "results": rows, "cached": False}
-            SEARCH_CACHE[cache_key] = payload
-            return payload
         except Exception as exc:
             last_exc = exc
-            # One retry with a small backoff covers most transient 202
-            # Ratelimit / transient network hiccups; on the second failure
-            # we give up and return a graceful error to the agent.
             if attempt == 0:
                 time.sleep(1.5)
                 continue
@@ -1005,7 +1069,53 @@ def _run_search_single(query: str, max_results: int) -> Dict[str, Any]:
     return {
         "ok": False,
         "query": query,
-        "error": f"Search backend unavailable ({err}).",
         "results": [],
         "hint": _SEARCH_UNAVAILABLE_HINT,
     }
@@ -1126,18 +1236,67 @@ def call_model(
     raise RuntimeError(f"All model candidates failed. Last error: {last_error}")
 def build_research_agent(
     question: str,
     model: str,
     max_turns: int,
     max_search_results: int,
     temperature: float,
-) -> Tuple[str, str]:
     client, primary_model, fallback_models = _build_client_for_model(model)
     # Display label: the real HF repo id is nicer than the TGI shim name.
     display_primary = model if (model == QUEST_MODEL_ID) else primary_model
     state = AgentState()
     used_model = display_primary
     messages: List[Dict[str, str]] = [
         {"role": "system", "content": build_system_prompt()},
@@ -1146,6 +1305,9 @@ def build_research_agent(
     final_answer: Optional[str] = None
     for turn in range(1, max_turns + 1):
         if state.trusted_notes and turn > 1 and turn % 3 == 0:
             summary_lines = "\n".join(f"- {n}" for n in state.trusted_notes[-6:])
@@ -1156,6 +1318,10 @@ def build_research_agent(
                 }
             )
         raw_output, endpoint_model = call_model(
             client=client,
             messages=messages,
@@ -1164,21 +1330,28 @@ def build_research_agent(
             temperature=temperature,
             max_new_tokens=int(os.getenv("QUEST_MAX_NEW_TOKENS", "4096")),
         )
         model_output = raw_output
         # Preserve the human-friendly model id for the trace even if the
         # endpoint ignores the "model" param and returns the TGI shim name.
         used_model = display_primary if endpoint_model == primary_model == QUEST_ENDPOINT_MODEL else endpoint_model
         messages.append({"role": "assistant", "content": model_output})
-        state.trace.append({"turn": turn, "assistant": model_output})
         extracted_answer = extract_answer(model_output)
         if extracted_answer:
             final_answer = extracted_answer
             break
         tool_name, tool_args, tool_err = parse_tool_call(model_output)
         if tool_err:
             tool_response = {"ok": False, "error": tool_err}
         elif not tool_name:
             # No explicit tool call and no final answer: force finalization.
             # IMPORTANT: do not write the literal characters `<answer>...</answer>`
@@ -1202,6 +1375,8 @@ def build_research_agent(
                     ),
                 }
             )
             continue
         else:
             if tool_name == "search":
@@ -1214,7 +1389,13 @@ def build_research_agent(
                 max_results = int(tool_args.get("max_results", max_search_results))
                 max_results = max(1, min(max_results, 10))
                 per_query: List[Dict[str, Any]] = []
                 for q in queries:
                     if q in state.searched_query_set:
                         per_query.append({
@@ -1224,22 +1405,36 @@ def build_research_agent(
                             "note": "Already searched; reusing cached result.",
                             "results": [],
                         })
                         continue
                     state.searched_queries.append(q)
                     state.searched_query_set.add(q)
                     single = _run_search_single(q, max_results)
                     per_query.append(single)
                     if single.get("ok"):
                         first_titles = [r.get("title", "") for r in single.get("results", [])[:2]]
                         if first_titles:
                             state.trusted_notes.append(
                                 f"Searched '{q}' and found leads: {', '.join(t for t in first_titles if t)}"
                             )
                 tool_response = (
                     per_query[0]
                     if len(per_query) == 1
                     else {"ok": True, "queries": queries, "results": per_query}
                 )
             elif tool_name == "visit":
                 raw_url = tool_args.get("url", "")
                 urls: List[str]
@@ -1251,7 +1446,12 @@ def build_research_agent(
                 max_chars = int(tool_args.get("max_chars", 6000))
                 max_chars = max(500, min(max_chars, 20000))
                 per_url: List[Dict[str, Any]] = []
                 for u in urls:
                     if u in state.visited_url_set:
                         per_url.append({
@@ -1260,12 +1460,14 @@ def build_research_agent(
                             "cached": True,
                             "note": "Already visited; reusing cached result.",
                         })
                         continue
                     state.visited_urls.append(u)
                     state.visited_url_set.add(u)
                     single = _run_visit_single(u, max_chars, goal)
                     per_url.append(single)
                     if single.get("ok"):
                         snippet = str(single.get("content", ""))[:180]
                         if snippet:
                             state.trusted_notes.append(
@@ -1276,8 +1478,14 @@ def build_research_agent(
                     if len(per_url) == 1
                     else {"ok": True, "goal": goal, "results": per_url}
                 )
             else:
                 tool_response = {"ok": False, "error": f"Unknown tool: {tool_name}"}
         state.trace.append({"turn": turn, "tool": tool_name, "tool_response": tool_response})
         messages.append(
@@ -1302,18 +1510,8 @@ def build_research_agent(
     if citations:
         final_answer = f"{final_answer}\n\n### Visited Sources\n{citations}"
-    trace_text = json.dumps(
-        {
-            "used_model": used_model,
-            "searched_queries": state.searched_queries,
-            "visited_urls": state.visited_urls,
-            "trusted_notes": state.trusted_notes[-10:],
-            "trace": state.trace,
-        },
-        ensure_ascii=False,
-        indent=2,
-    )
-    return final_answer, trace_text
 def run_ui(
@@ -1324,13 +1522,15 @@ def run_ui(
     temperature: float,
 ):
     if not question.strip():
-        return "Please input a question.", "{}"
     if not os.getenv("HF_TOKEN"):
         warning = (
             "HF_TOKEN is not configured in Space Secrets. "
             "Go to Settings -> Secrets -> add `HF_TOKEN`, then retry."
         )
-        return warning, json.dumps({"error": warning}, ensure_ascii=False, indent=2)
     if model == QUEST_MODEL_ID and not QUEST_BASE_URL:
         warning = (
             f"`{QUEST_MODEL_ID}` is private and not available via the free HF Inference API. "
@@ -1338,17 +1538,19 @@ def run_ui(
             "then set `QUEST_BASE_URL` in Space Secrets to the endpoint's `/v1/` URL. "
             "In the meantime you can pick one of the open-weights models in the dropdown."
         )
-        return warning, json.dumps({"error": warning}, ensure_ascii=False, indent=2)
     try:
-        return build_research_agent(
             question=question,
             model=model,
             max_turns=max_turns,
             max_search_results=max_search_results,
             temperature=temperature,
-        )
     except Exception as exc:
-        return f"Error: {exc}", json.dumps({"error": str(exc)}, ensure_ascii=False, indent=2)
 EXAMPLES = [
@@ -1470,7 +1672,7 @@ with gr.Blocks(
                     label="Max Turns",
                     minimum=2,
                     maximum=20,
-                    value=8,
                     step=1,
                 )
                 max_search_results = gr.Slider(

     "retry later if the question truly requires a fresh web lookup."
 )
+# Google Serper API key. Either SERPER_API_KEY or SERPER_KEY_ID is accepted
+# so that the Space matches the env-var name used by the research repo.
+SERPER_API_KEY = (
+    os.getenv("SERPER_API_KEY") or os.getenv("SERPER_KEY_ID") or ""
+).strip()
+SERPER_ENDPOINT = os.getenv("SERPER_ENDPOINT", "https://google.serper.dev/search")
+def _serper_search(query: str, max_results: int) -> Dict[str, Any]:
+    """Hit the Google Serper API. Returns the same shape as `_ddg_search`.
+    Serper responds in well under a second and is not subject to the 202
+    Ratelimit we get from html.duckduckgo.com, so preferring it when the
+    key is set cuts latency dramatically and eliminates most search
+    failures on shared Space IPs.
     """
+    try:
+        resp = requests.post(
+            SERPER_ENDPOINT,
+            headers={
+                "X-API-KEY": SERPER_API_KEY,
+                "Content-Type": "application/json",
+            },
+            json={"q": query, "num": max_results},
+            timeout=15,
+        )
+        resp.raise_for_status()
+        data = resp.json()
+    except Exception as exc:
+        return {
+            "ok": False,
+            "query": query,
+            "error": f"Serper error: {type(exc).__name__}: {exc}",
+            "results": [],
+            "backend": "serper",
+        }
+    rows: List[Dict[str, str]] = []
+    for item in (data.get("organic") or [])[:max_results]:
+        rows.append(
+            {
+                "title": item.get("title", ""),
+                "href": item.get("link", ""),
+                "body": item.get("snippet", ""),
+            }
+        )
+    # Fold in the answer box when present; it often carries the exact fact
+    # the model is looking for in a compact form.
+    answer_box = data.get("answerBox") or {}
+    if answer_box:
+        rows.insert(
+            0,
+            {
+                "title": answer_box.get("title", "Answer box"),
+                "href": answer_box.get("link", ""),
+                "body": answer_box.get("snippet")
+                or answer_box.get("answer")
+                or "",
+            },
+        )
+    if not rows:
+        return {
+            "ok": False,
+            "query": query,
+            "error": "Serper returned no organic results",
+            "results": [],
+            "backend": "serper",
+        }
+    return {
+        "ok": True,
+        "query": query,
+        "results": rows,
+        "cached": False,
+        "backend": "serper",
+    }
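The row-building step in `_serper_search` can be exercised without network access. The sketch below is illustrative only: `normalize_serper_payload` is a hypothetical helper name, and `sample` is a hand-written payload that reuses the keys read in the diff (`organic`, `answerBox`, `link`, `snippet`, `answer`); it is not captured API output.

```python
from typing import Any, Dict, List


def normalize_serper_payload(data: Dict[str, Any], max_results: int) -> List[Dict[str, str]]:
    """Map a Serper-style payload to the title/href/body rows used by the agent."""
    rows: List[Dict[str, str]] = [
        {
            "title": item.get("title", ""),
            "href": item.get("link", ""),
            "body": item.get("snippet", ""),
        }
        for item in (data.get("organic") or [])[:max_results]
    ]
    # The answer box, when present, goes first so the model sees it before
    # the organic hits.
    answer_box = data.get("answerBox") or {}
    if answer_box:
        rows.insert(
            0,
            {
                "title": answer_box.get("title", "Answer box"),
                "href": answer_box.get("link", ""),
                "body": answer_box.get("snippet") or answer_box.get("answer") or "",
            },
        )
    return rows


# Hand-written sample payload (an assumption, not real API output).
sample = {
    "organic": [
        {"title": "A", "link": "https://example.com/a", "snippet": "first"},
        {"title": "B", "link": "https://example.com/b", "snippet": "second"},
    ],
    "answerBox": {"answer": "42"},
}
rows = normalize_serper_payload(sample, max_results=1)
```

Note that the answer box is added on top of the `max_results` organic rows, so the returned list can be one entry longer than requested.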
+def _ddg_search(query: str, max_results: int) -> Dict[str, Any]:
+    """Fallback path: scrape DuckDuckGo. Rate-limits on shared IPs."""
     last_exc: Optional[BaseException] = None
     for attempt in range(2):
         try:
                             "body": item.get("body", ""),
                         }
                     )
+            return {
+                "ok": True,
+                "query": query,
+                "results": rows,
+                "cached": False,
+                "backend": "duckduckgo",
+            }
         except Exception as exc:
             last_exc = exc
             if attempt == 0:
                 time.sleep(1.5)
                 continue
     return {
         "ok": False,
         "query": query,
+        "error": f"DuckDuckGo unavailable ({type(last_exc).__name__}: {last_exc}).",
+        "results": [],
+        "backend": "duckduckgo",
+    }
+def _run_search_single(query: str, max_results: int) -> Dict[str, Any]:
+    """Run one search query, preferring Serper when the key is set.
+    Returns a structured dict on both success and failure; never raises.
+    Order of preference:
+    1. Google Serper (fast, no scraping, requires `SERPER_API_KEY` /
+       `SERPER_KEY_ID`).
+    2. DuckDuckGo HTML backend (free, but rate-limits on shared Space IPs).
+    3. Graceful `ok: False` payload with a hint that tells the agent to
+       answer from its own knowledge if it reasonably can.
+    """
+    if not query.strip():
+        return {"ok": False, "error": "Search query cannot be empty."}
+    cache_key = f"{query.strip().lower()}::{max_results}"
+    if cache_key in SEARCH_CACHE:
+        return {**SEARCH_CACHE[cache_key], "cached": True}
+    tried: List[Dict[str, Any]] = []
+    if SERPER_API_KEY:
+        serper_result = _serper_search(query, max_results)
+        if serper_result.get("ok"):
+            SEARCH_CACHE[cache_key] = serper_result
+            return serper_result
+        tried.append(serper_result)
+    ddg_result = _ddg_search(query, max_results)
+    if ddg_result.get("ok"):
+        SEARCH_CACHE[cache_key] = ddg_result
+        return ddg_result
+    tried.append(ddg_result)
+    # Both backends failed (or no Serper key and DDG rate-limited).
+    errors = "; ".join(
+        f"{r.get('backend', 'unknown')}: {r.get('error', 'no results')}"
+        for r in tried
+    )
+    return {
+        "ok": False,
+        "query": query,
+        "error": f"All search backends failed ({errors}).",
         "results": [],
         "hint": _SEARCH_UNAVAILABLE_HINT,
     }
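The preference order described in the `_run_search_single` docstring amounts to a cached fallback chain. A minimal sketch of that pattern follows, assuming stub backends: `search_with_fallback`, `failing_primary`, and `working_fallback` are hypothetical names used for illustration, and the hint field from the real function is left out.

```python
from typing import Any, Callable, Dict, List

Backend = Callable[[str, int], Dict[str, Any]]


def search_with_fallback(
    query: str,
    max_results: int,
    backends: List[Backend],
    cache: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Try each backend in order; cache and return the first successful result."""
    cache_key = f"{query.strip().lower()}::{max_results}"
    if cache_key in cache:
        return {**cache[cache_key], "cached": True}
    tried: List[Dict[str, Any]] = []
    for backend in backends:
        result = backend(query, max_results)
        if result.get("ok"):
            cache[cache_key] = result
            return result
        tried.append(result)
    # Every backend failed: report all errors in one structured payload.
    errors = "; ".join(
        f"{r.get('backend', 'unknown')}: {r.get('error', 'no results')}" for r in tried
    )
    return {
        "ok": False,
        "query": query,
        "error": f"All search backends failed ({errors}).",
        "results": [],
    }


# Stub backends standing in for Serper and DuckDuckGo (assumptions).
def failing_primary(query: str, max_results: int) -> Dict[str, Any]:
    return {"ok": False, "error": "HTTP 500", "results": [], "backend": "primary"}


def working_fallback(query: str, max_results: int) -> Dict[str, Any]:
    return {"ok": True, "query": query, "results": [{"title": "hit"}],
            "cached": False, "backend": "fallback"}


cache: Dict[str, Dict[str, Any]] = {}
first = search_with_fallback("Quest 4B", 5, [failing_primary, working_fallback], cache)
second = search_with_fallback("quest 4b ", 5, [failing_primary, working_fallback], cache)
failed = search_with_fallback("other", 5, [failing_primary], cache)
```

Because the cache key is lowercased and stripped, the second call is served from the cache even though its casing and trailing space differ. Failed lookups are not cached, so a later retry can still succeed.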
     raise RuntimeError(f"All model candidates failed. Last error: {last_error}")
+def _render_progress(
+    lines: List[str],
+    used_model: str,
+    question: str,
+) -> str:
+    """Render the in-progress status view that replaces the Markdown panel
+    while the agent is still running, so the user is not staring at a blank
+    box for the 20-60 seconds a full Quest-4B research run can take."""
+    header = (
+        f"### ⏳ Researching…\n\n"
+        f"**Model:** `{used_model}`  \n"
+        f"**Question:** {question.strip()[:200]}"
+    )
+    if not lines:
+        body = "_Starting agent…_"
+    else:
+        body = "\n".join(f"- {line}" for line in lines)
+    return f"{header}\n\n{body}"
+def _trace_to_json(state: "AgentState", used_model: str) -> str:
+    return json.dumps(
+        {
+            "used_model": used_model,
+            "searched_queries": state.searched_queries,
+            "visited_urls": state.visited_urls,
+            "trusted_notes": state.trusted_notes[-10:],
+            "trace": state.trace,
+        },
+        ensure_ascii=False,
+        indent=2,
+    )
 def build_research_agent(
     question: str,
     model: str,
     max_turns: int,
     max_search_results: int,
     temperature: float,
+):
+    """Run the ReAct research loop as a generator.
+    Each `yield` emits a `(markdown_for_answer_panel, json_for_record_panel)`
+    tuple. Intermediate yields show progress so that Gradio streams the
+    status lines into the UI as work happens. The last yield contains the
+    final answer and the final trace.
+    """
     client, primary_model, fallback_models = _build_client_for_model(model)
     # Display label: the real HF repo id is nicer than the TGI shim name.
     display_primary = model if (model == QUEST_MODEL_ID) else primary_model
     state = AgentState()
     used_model = display_primary
+    status_lines: List[str] = []
+    def _emit():
+        """Build the current progress snapshot; the caller yields it to Gradio."""
+        return (
+            _render_progress(status_lines, used_model, question),
+            _trace_to_json(state, used_model),
+        )
     messages: List[Dict[str, str]] = [
         {"role": "system", "content": build_system_prompt()},
     final_answer: Optional[str] = None
+    status_lines.append("🚀 Starting research agent")
+    yield _emit()
     for turn in range(1, max_turns + 1):
         if state.trusted_notes and turn > 1 and turn % 3 == 0:
             summary_lines = "\n".join(f"- {n}" for n in state.trusted_notes[-6:])
                 }
             )
+        status_lines.append(f"🧠 turn {turn}: thinking…")
+        yield _emit()
+        t0 = time.time()
         raw_output, endpoint_model = call_model(
             client=client,
             messages=messages,
             temperature=temperature,
             max_new_tokens=int(os.getenv("QUEST_MAX_NEW_TOKENS", "4096")),
         )
+        dt = time.time() - t0
         model_output = raw_output
         # Preserve the human-friendly model id for the trace even if the
         # endpoint ignores the "model" param and returns the TGI shim name.
         used_model = display_primary if endpoint_model == primary_model == QUEST_ENDPOINT_MODEL else endpoint_model
         messages.append({"role": "assistant", "content": model_output})
+        state.trace.append({"turn": turn, "assistant": model_output, "elapsed_s": round(dt, 2)})
+        status_lines[-1] = f"🧠 turn {turn}: model reply in {dt:.1f}s"
+        yield _emit()
         extracted_answer = extract_answer(model_output)
         if extracted_answer:
             final_answer = extracted_answer
+            status_lines.append("✍️ writing final answer")
+            yield _emit()
             break
         tool_name, tool_args, tool_err = parse_tool_call(model_output)
         if tool_err:
             tool_response = {"ok": False, "error": tool_err}
+            status_lines.append(f"⚠️ turn {turn}: malformed tool call — {tool_err}")
+            yield _emit()
         elif not tool_name:
             # No explicit tool call and no final answer: force finalization.
             # IMPORTANT: do not write the literal characters `<answer>...</answer>`
                     ),
                 }
             )
+            status_lines.append(f"🙃 turn {turn}: model stalled; asking for an answer")
+            yield _emit()
             continue
         else:
             if tool_name == "search":
                 max_results = int(tool_args.get("max_results", max_search_results))
                 max_results = max(1, min(max_results, 10))
+                queries_preview = ", ".join(f"`{q}`" for q in queries) or "_(empty)_"
+                status_lines.append(f"🔍 turn {turn}: searching {queries_preview}")
+                yield _emit()
                 per_query: List[Dict[str, Any]] = []
+                backend_labels: List[str] = []
+                hits_total = 0
                 for q in queries:
                     if q in state.searched_query_set:
                         per_query.append({
                             "note": "Already searched; reusing cached result.",
                             "results": [],
                         })
+                        backend_labels.append("cache")
                         continue
                     state.searched_queries.append(q)
                     state.searched_query_set.add(q)
                     single = _run_search_single(q, max_results)
                     per_query.append(single)
+                    backend_labels.append(single.get("backend", "unknown"))
                     if single.get("ok"):
+                        hits_total += len(single.get("results", []))
                         first_titles = [r.get("title", "") for r in single.get("results", [])[:2]]
                         if first_titles:
                             state.trusted_notes.append(
                                 f"Searched '{q}' and found leads: {', '.join(t for t in first_titles if t)}"
                             )
+                    else:
+                        status_lines.append(
+                            f"⚠️ search failed on `{q}` via {single.get('backend', 'unknown')}: "
+                            f"{single.get('error', 'no results')}"
+                        )
                 tool_response = (
                     per_query[0]
                     if len(per_query) == 1
                     else {"ok": True, "queries": queries, "results": per_query}
                 )
+                unique_backends = sorted(set(backend_labels))
+                backend_str = "/".join(unique_backends) if unique_backends else "?"
+                status_lines.append(
+                    f"✅ turn {turn}: got {hits_total} hit(s) via {backend_str}"
+                )
+                yield _emit()
             elif tool_name == "visit":
                 raw_url = tool_args.get("url", "")
                 urls: List[str]
                 max_chars = int(tool_args.get("max_chars", 6000))
                 max_chars = max(500, min(max_chars, 20000))
+                urls_preview = ", ".join(f"`{u[:60]}`" for u in urls) or "_(empty)_"
+                status_lines.append(f"🌐 turn {turn}: visiting {urls_preview}")
+                yield _emit()
                 per_url: List[Dict[str, Any]] = []
+                visit_ok = 0
                 for u in urls:
                     if u in state.visited_url_set:
                         per_url.append({
                             "cached": True,
                             "note": "Already visited; reusing cached result.",
                         })
+                        visit_ok += 1
                         continue
                     state.visited_urls.append(u)
                     state.visited_url_set.add(u)
                     single = _run_visit_single(u, max_chars, goal)
                     per_url.append(single)
                     if single.get("ok"):
+                        visit_ok += 1
                         snippet = str(single.get("content", ""))[:180]
                         if snippet:
                             state.trusted_notes.append(
                     if len(per_url) == 1
                     else {"ok": True, "goal": goal, "results": per_url}
                 )
+                status_lines.append(
+                    f"✅ turn {turn}: read {visit_ok}/{len(urls)} page(s)"
+                )
+                yield _emit()
             else:
                 tool_response = {"ok": False, "error": f"Unknown tool: {tool_name}"}
+                status_lines.append(f"⚠️ turn {turn}: unknown tool `{tool_name}`")
+                yield _emit()
         state.trace.append({"turn": turn, "tool": tool_name, "tool_response": tool_response})
         messages.append(
     if citations:
         final_answer = f"{final_answer}\n\n### Visited Sources\n{citations}"
+    trace_text = _trace_to_json(state, used_model)
+    yield (final_answer, trace_text)
 def run_ui(
     temperature: float,
 ):
     if not question.strip():
+        yield "Please input a question.", "{}"
+        return
     if not os.getenv("HF_TOKEN"):
         warning = (
             "HF_TOKEN is not configured in Space Secrets. "
             "Go to Settings -> Secrets -> add `HF_TOKEN`, then retry."
         )
+        yield warning, json.dumps({"error": warning}, ensure_ascii=False, indent=2)
+        return
     if model == QUEST_MODEL_ID and not QUEST_BASE_URL:
         warning = (
             f"`{QUEST_MODEL_ID}` is private and not available via the free HF Inference API. "
             "then set `QUEST_BASE_URL` in Space Secrets to the endpoint's `/v1/` URL. "
             "In the meantime you can pick one of the open-weights models in the dropdown."
         )
+        yield warning, json.dumps({"error": warning}, ensure_ascii=False, indent=2)
+        return
     try:
+        for partial_answer, partial_trace in build_research_agent(
             question=question,
             model=model,
             max_turns=max_turns,
             max_search_results=max_search_results,
             temperature=temperature,
+        ):
+            yield partial_answer, partial_trace
     except Exception as exc:
+        yield f"Error: {exc}", json.dumps({"error": str(exc)}, ensure_ascii=False, indent=2)
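The generator conversion of `run_ui` can be illustrated without Gradio or a model endpoint. In the sketch below, `fake_agent` and `run` are hypothetical stand-ins for `build_research_agent` and `run_ui`: each yield is an `(answer_markdown, trace_json)` tuple, intermediate yields carry progress, and the last yield carries the final answer.

```python
from typing import Iterator, List, Tuple


def fake_agent(question: str, max_turns: int) -> Iterator[Tuple[str, str]]:
    """Stand-in agent: one progress update per turn, then a final answer."""
    status: List[str] = []
    for turn in range(1, max_turns + 1):
        status.append(f"turn {turn}: thinking")
        yield "\n".join(f"- {s}" for s in status), "{}"
    yield f"Answer to: {question}", '{"turns": %d}' % max_turns


def run(question: str, max_turns: int = 2) -> Iterator[Tuple[str, str]]:
    """Validate input, then forward every partial result from the agent."""
    if not question.strip():
        yield "Please input a question.", "{}"
        return
    try:
        yield from fake_agent(question, max_turns)
    except Exception as exc:
        # Surface failures as a final update instead of raising into the UI.
        yield f"Error: {exc}", "{}"


updates = list(run("what is QUEST?"))
empty = list(run("   "))
```

Early exits need `yield` followed by a bare `return`: inside a generator, a `return value` statement would not send anything to the consumer.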
 EXAMPLES = [
                     label="Max Turns",
                     minimum=2,
                     maximum=20,
+                    value=6,
                     step=1,
                 )
                 max_search_results = gr.Slider(