Space: osunlp/QUEST

TomLii committed on Apr 18

Commit

04b8201

1 Parent(s): 080fac5

Route Quest-4B through dedicated HF Inference Endpoint; ship QUEST prompt/schema

Files changed (3)

.env.example +11 -3
README.md +96 -40
app.py +192 -69

.env.example CHANGED

@@ -1,5 +1,13 @@
-# Optional. For Hugging Face Inference API (free quota with your HF account).
 HF_TOKEN=hf_xxx
-# Default model shown in UI.
-DEFAULT_MODEL=Qwen/Qwen2.5-7B-Instruct

+# Required: personal HF token with read access to osunlp/Quest-4B.
 HF_TOKEN=hf_xxx
+# Dedicated HF Inference Endpoint URL that serves osunlp/Quest-4B.
+# Must end with /v1/.
+QUEST_BASE_URL=https://your-endpoint-id.aws.endpoints.huggingface.cloud/v1/
+# Model name the endpoint responds to. TGI containers usually use "tgi";
+# vLLM containers usually use the original repo id ("osunlp/Quest-4B").
+QUEST_ENDPOINT_MODEL=tgi
+# Default model preselected in the dropdown.
+DEFAULT_MODEL=osunlp/Quest-4B
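
The variables above can be sanity-checked before the app starts. The following is a minimal sketch and not code from this commit: `load_quest_config` and the keys it returns are hypothetical names, the `/v1/` suffix check is an added assumption (the committed `app.py` only strips the value), and the `Qwen/Qwen3-8B` fallback assumes the first entry of the free-model list.

```python
import os
from typing import Dict, Mapping, Optional


def load_quest_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Read the Quest-related settings and reject a malformed endpoint URL.

    Hypothetical helper: mirrors the defaults described in .env.example.
    """
    env = os.environ if env is None else env

    base_url = env.get("QUEST_BASE_URL", "").strip()
    # Assumption: fail fast when the URL is set but lacks the /v1/ suffix.
    if base_url and not base_url.endswith("/v1/"):
        raise ValueError("QUEST_BASE_URL must end with /v1/")

    # An empty or whitespace-only value falls back to the TGI shim name.
    endpoint_model = env.get("QUEST_ENDPOINT_MODEL", "tgi").strip() or "tgi"

    # Prefer Quest-4B only when an endpoint is actually configured.
    fallback_default = "osunlp/Quest-4B" if base_url else "Qwen/Qwen3-8B"
    default_model = env.get("DEFAULT_MODEL", fallback_default)

    return {
        "base_url": base_url,
        "endpoint_model": endpoint_model,
        "default_model": default_model,
    }
```

Passing a plain dict instead of `os.environ` keeps the check easy to exercise without touching the real environment.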

README.md CHANGED

@@ -9,58 +9,114 @@ app_file: app.py
 pinned: false
 ---
-# DeepResearch Space Starter
-A standalone Hugging Face Space starter for a DeepResearch-style agent.
-It supports:
-- multi-turn reasoning loop
-- `search` tool (DuckDuckGo)
-- `visit` tool (webpage fetch + text extraction)
-- final answer in `<answer>...</answer>`
-- easy model replacement later
-## 1) Quick Start (Local)
-```bash
-python -m venv .venv
-source .venv/bin/activate
-pip install -r requirements.txt
-python app.py
-```
-## 2) Deploy to Hugging Face Space
-1. Create a new Space (SDK = **Gradio**).
-2. Push this repository to the Space repository.
-3. In Space **Settings -> Secrets**, add:
-   - `HF_TOKEN` (recommended for stable free inference access)
-4. Optional Variables:
-   - `DEFAULT_MODEL` (default: `Qwen/Qwen2.5-7B-Instruct`)
-### Public Space Configuration Checklist
-If the Space is public and you want users to run it directly:
-1. Keep the Space **Visibility** as Public.
-2. Add `HF_TOKEN` in **Secrets** (required by this starter).
-3. Click **Restart this Space** after saving the secret.
-4. Open the app and run a simple query like: `What is retrieval-augmented generation?`
-## 3) Free Model First, Your Model Later
-You can start with a free inference model, then switch by changing only env/config:
-- Current: `DEFAULT_MODEL=Qwen/Qwen2.5-7B-Instruct`
-- Later: set your own model name or API-compatible endpoint logic in `app.py` (`call_model` function).
-Recommended migration strategy:
-1. keep tool protocol unchanged (`<tool_call>`, `<tool_response>`, `<answer>`)
-2. replace only model adapter (`call_model`)
-3. keep UI and tool chain unchanged
-## 4) Notes
-- This is a lightweight starter, not a full production benchmark runner.
-- Web fetching quality depends on target website anti-bot rules and page structure.
-- For stronger reliability, add retry/backoff and persistent tool cache.

 pinned: false
 ---
+# DeepResearch Space
+An interactive Hugging Face Space for a **Quest DeepResearch** agent. The app
+can either talk to **`osunlp/Quest-4B`** (our own fine-tuned research model,
+routed through a private HF Inference Endpoint) or fall back to open-weights
+models through the shared HF Inference API.
+Supported tools:
+- `search` (DuckDuckGo, multi-query)
+- `visit` (HTTP fetch + text extraction, multi-URL)
+- lightweight research-state summary to cut repeated work
+- `<answer>` extraction for the final response
+---
+## 1) Use our own `osunlp/Quest-4B` model (recommended)
+Because the model is **private** during the beta, it is not on the free
+Inference API. You host it yourself on a dedicated HF Inference Endpoint
+(pay-as-you-go, scale-to-zero), and point this Space at it.
+### 1a) Create the endpoint once
+1. Open <https://ui.endpoints.huggingface.co/> and click **"New endpoint"**.
+2. **Model repository**: `osunlp/Quest-4B` (use a token with access).
+3. **Hardware**: `1x Nvidia L4 (24GB)` is usually the sweet spot for a 4B
+   model. `Nvidia T4 small (16GB)` works too and is cheaper.
+4. **Advanced → Container Type**: keep `Text Generation Inference` (TGI) or
+   pick `vLLM`. Both expose an OpenAI-compatible `/v1/` route.
+5. **Autoscaling → Scale-to-Zero**: enable it so you only pay when the
+   endpoint is serving traffic.
+6. Hit **Create endpoint**. After ~1–2 minutes it turns `Running` and shows a
+   base URL like `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud`.
+### 1b) Tell the Space how to reach it
+In this Space's **Settings → Secrets / Variables**:
+| Name | Value | Why |
+|---|---|---|
+| `HF_TOKEN` | your personal HF token with read access to `osunlp/Quest-4B` | pulls private weights & authenticates the endpoint call |
+| `QUEST_BASE_URL` | the endpoint URL **ending with `/v1/`** (e.g. `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud/v1/`) | tells the app to route chat completions to your endpoint |
+| `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/Quest-4B` if you deployed with vLLM) | some containers need the exact model name |
+| `DEFAULT_MODEL` | `osunlp/Quest-4B` | preselects the right option in the UI |
+Click **Restart this Space**. With `DEFAULT_MODEL` set as above, the `Model`
+dropdown preselects `osunlp/Quest-4B`; selecting it routes requests through
+your endpoint.
+> Cost reality-check: on a 1× L4 at `$0.80/hr` with Scale-to-Zero, a small
+> internal beta (a handful of testers, dozens of queries per day) typically
+> stays under **\$100/month**. You can stop the endpoint manually from the UI
+> any time to freeze costs.
+---
+## 2) Fallback: free open-weights models
+If you just want to try the UI without spinning up an endpoint, pick any of
+these in the dropdown. They run through the shared HF Inference API.
+- `Qwen/Qwen3-8B`
+- `google/gemma-3-12b-it`
+- `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B`
+- `Qwen/Qwen2.5-7B-Instruct`
+- `meta-llama/Llama-3.1-8B-Instruct`
+Only `HF_TOKEN` is required for this path.
+---
+## 3) Share the beta with org members (without paying for Team)
+Option A (simplest, **\$0** for access, Space Hardware stays on free CPU):
+1. Keep the Space under your personal account.
+2. **Settings → Visibility → Private**.
+3. **Settings → Collaborators** → add each tester by HF username.
+4. Endpoint lives under your personal namespace too, so the bill goes to
+   your personal payment method (you can expense invoices from
+   <https://huggingface.co/settings/billing>).
+Option B (org-level billing): upgrade the organization to a Team plan and
+recreate both the Space and the endpoint under the org namespace.
+---
+## 4) Local development
+```bash
+python -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+export HF_TOKEN=...                      # required
+export QUEST_BASE_URL=https://.../v1/    # optional; only if testing against the endpoint
+python app.py
+```
+---
+## 5) Architecture notes
+- `app.py` uses `huggingface_hub.InferenceClient(base_url=QUEST_BASE_URL, ...)`
+  for the private-endpoint path and the same client without `base_url` for the
+  shared API path.
+- The system prompt matches the schema Quest-4B was trained on (array-based
+  `search` / `visit` with an explicit `goal`), so the private model stays
+  in-distribution. The open-weights fallbacks also follow the same schema.
+- Visited URLs and search queries are cached in-process so repeated tool
+  calls don't re-hit the network.
+- `<answer>...</answer>` terminates the ReAct loop.
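
The routing rule in the architecture notes can be sketched in isolation. This is an illustrative sketch under stated assumptions, not the committed implementation: `resolve_route`, `ask`, and `EchoClient` are hypothetical names, and `EchoClient` is an offline stand-in that only imitates the `chat_completion` response shape so the example runs without a token or network access.

```python
from types import SimpleNamespace
from typing import Any, Dict

QUEST_MODEL_ID = "osunlp/Quest-4B"


def resolve_route(selected_model: str, quest_base_url: str,
                  endpoint_model: str = "tgi") -> Dict[str, Any]:
    """Pick client keyword arguments and the model name to send.

    Quest-4B goes to the dedicated endpoint only when a base URL is set;
    every other choice uses the shared API with the repo id unchanged.
    """
    if selected_model == QUEST_MODEL_ID and quest_base_url:
        return {
            "client_kwargs": {"base_url": quest_base_url, "timeout": 120},
            "model": endpoint_model,
        }
    return {"client_kwargs": {"timeout": 60}, "model": selected_model}


def ask(client: Any, model: str, question: str, max_tokens: int = 64) -> str:
    """Send one user message and return the assistant text."""
    response = client.chat_completion(
        messages=[{"role": "user", "content": question}],
        model=model,
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content


class EchoClient:
    """Offline stand-in (hypothetical) with the same call shape."""

    def chat_completion(self, messages, model, max_tokens):
        text = f"[{model}] {messages[-1]['content']}"
        message = SimpleNamespace(content=text)
        return SimpleNamespace(choices=[SimpleNamespace(message=message)])


route = resolve_route(QUEST_MODEL_ID, "https://example.invalid/v1/")
reply = ask(EchoClient(), route["model"], "ping")
```

Against a real endpoint, the stand-in would presumably be replaced by `InferenceClient(token=..., **route["client_kwargs"])` from `huggingface_hub`, which is how the commit constructs its client.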

app.py CHANGED

@@ -2,8 +2,9 @@ import json
 import os
 import re
 from dataclasses import dataclass, field
 from pathlib import Path
-from typing import Any, Dict, List, Optional, Set, Tuple
 import gradio as gr
 import requests
@@ -12,44 +13,73 @@ from duckduckgo_search import DDGS
 from huggingface_hub import InferenceClient
-DEFAULT_FREE_MODELS = [
-    # Newer free-friendly candidates (availability depends on HF Inference quota/region)
     "Qwen/Qwen3-8B",
     "google/gemma-3-12b-it",
     "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
-    # Fallback older but usually reliable
     "Qwen/Qwen2.5-7B-Instruct",
     "meta-llama/Llama-3.1-8B-Instruct",
 ]
-DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", DEFAULT_FREE_MODELS[0])
 PAPER_URL = os.getenv("PAPER_URL", "#")
 CODE_URL = os.getenv("CODE_URL", "#")
 DATASET_URL = os.getenv("DATASET_URL", "#")
 MODEL_URL = os.getenv("MODEL_URL", "#")
-SYSTEM_PROMPT = """You are a Deep Research assistant.
-You can think step by step, use tools, and then return a final answer.
-Tool protocol:
-- To call a tool, output exactly one block:
-<tool_call>
-{"name":"search","arguments":{"query":"...","max_results":5}}
-</tool_call>
-or
 <tool_call>
-{"name":"visit","arguments":{"url":"...","max_chars":6000}}
 </tool_call>
-- When you are done, output:
-<answer>
-...final answer...
-</answer>
-Rules:
-- Use tools when needed, but avoid repeated calls to the same URL/query.
-- Cite useful URLs in your final answer.
-- If a tool fails, recover and continue.
-"""
 TOOL_RESPONSE_TEMPLATE = """<tool_response>
@@ -682,7 +712,7 @@ def parse_tool_call(text: str) -> Tuple[Optional[str], Optional[Dict[str, Any]],
     return name, arguments, None
-def run_search(query: str, max_results: int = 5) -> Dict[str, Any]:
     if not query.strip():
         return {"ok": False, "error": "Search query cannot be empty."}
     cache_key = f"{query.strip().lower()}::{max_results}"
@@ -704,6 +734,22 @@ def run_search(query: str, max_results: int = 5) -> Dict[str, Any]:
     return payload
 def _clean_html_to_text(html: str, max_chars: int) -> str:
     soup = BeautifulSoup(html, "html.parser")
     for tag in soup(["script", "style", "noscript"]):
@@ -713,12 +759,12 @@ def _clean_html_to_text(html: str, max_chars: int) -> str:
     return text[:max_chars]
-def run_visit(url: str, max_chars: int = 6000) -> Dict[str, Any]:
     if not url.strip():
         return {"ok": False, "error": "URL cannot be empty."}
     cache_key = f"{url.strip()}::{max_chars}"
     if cache_key in VISIT_CACHE:
-        return {**VISIT_CACHE[cache_key], "cached": True}
     try:
         resp = requests.get(
             url,
@@ -731,11 +777,47 @@ def run_visit(url: str, max_chars: int = 6000) -> Dict[str, Any]:
             text = _clean_html_to_text(resp.text, max_chars=max_chars)
         else:
             text = resp.text[:max_chars]
-        payload = {"ok": True, "url": url, "content": text, "cached": False}
         VISIT_CACHE[cache_key] = payload
         return payload
     except Exception as exc:
-        return {"ok": False, "url": url, "error": str(exc)}
 def call_model(
@@ -774,14 +856,14 @@ def build_research_agent(
     max_search_results: int,
     temperature: float,
 ) -> Tuple[str, str]:
-    token = os.getenv("HF_TOKEN")
-    client = InferenceClient(token=token)
     state = AgentState()
-    used_model = model
-    recent_model_candidates = [m for m in DEFAULT_FREE_MODELS if m != model]
     messages: List[Dict[str, str]] = [
-        {"role": "system", "content": SYSTEM_PROMPT},
         {"role": "user", "content": question},
     ]
@@ -797,14 +879,18 @@ def build_research_agent(
                 }
             )
-        model_output, used_model = call_model(
             client=client,
             messages=messages,
-            preferred_model=model,
-            candidate_models=recent_model_candidates,
             temperature=temperature,
             max_new_tokens=1400,
         )
         messages.append({"role": "assistant", "content": model_output})
         state.trace.append({"turn": turn, "assistant": model_output})
@@ -827,48 +913,77 @@ def build_research_agent(
             continue
         else:
             if tool_name == "search":
-                query = str(tool_args.get("query", "")).strip()
                 max_results = int(tool_args.get("max_results", max_search_results))
                 max_results = max(1, min(max_results, 10))
-                if query in state.searched_query_set:
-                    tool_response = {
-                        "ok": True,
-                        "query": query,
-                        "cached": True,
-                        "note": "This query was already searched. Reusing cached result to avoid duplicate work.",
-                        "results": [],
-                    }
-                else:
-                    state.searched_queries.append(query)
-                    state.searched_query_set.add(query)
-                    tool_response = run_search(query=query, max_results=max_results)
-                    if tool_response.get("ok"):
-                        first_titles = [r.get("title", "") for r in tool_response.get("results", [])[:2]]
                         if first_titles:
                             state.trusted_notes.append(
-                                f"Searched '{query}' and found leads: {', '.join(t for t in first_titles if t)}"
                             )
             elif tool_name == "visit":
-                url = str(tool_args.get("url", "")).strip()
                 max_chars = int(tool_args.get("max_chars", 6000))
                 max_chars = max(500, min(max_chars, 20000))
-                if url in state.visited_url_set:
-                    tool_response = {
-                        "ok": True,
-                        "url": url,
-                        "cached": True,
-                        "note": "This URL was already visited. Reusing cached result to avoid duplicate work.",
-                    }
-                else:
-                    state.visited_urls.append(url)
-                    state.visited_url_set.add(url)
-                    tool_response = run_visit(url=url, max_chars=max_chars)
-                    if tool_response.get("ok"):
-                        snippet = str(tool_response.get("content", ""))[:180]
                         if snippet:
                             state.trusted_notes.append(
-                                f"Visited {url} and extracted key context: {snippet}"
                             )
             else:
                 tool_response = {"ok": False, "error": f"Unknown tool: {tool_name}"}
@@ -922,6 +1037,14 @@ def run_ui(
             "Go to Settings -> Secrets -> add `HF_TOKEN`, then retry."
         )
         return warning, json.dumps({"error": warning}, ensure_ascii=False, indent=2)
     try:
         return build_research_agent(
             question=question,
@@ -997,8 +1120,8 @@ with gr.Blocks(
                 gr.HTML('<div class="section-heading">Settings</div>')
                 model = gr.Dropdown(
                     label="Model",
-                    choices=DEFAULT_FREE_MODELS,
-                    value=DEFAULT_MODEL if DEFAULT_MODEL in DEFAULT_FREE_MODELS else DEFAULT_FREE_MODELS[0],
                     allow_custom_value=True,
                 )
                 max_turns = gr.Slider(

 import os
 import re
 from dataclasses import dataclass, field
+from datetime import date
 from pathlib import Path
+from typing import Any, Dict, List, Optional, Set, Tuple, Union
 import gradio as gr
 import requests
 from huggingface_hub import InferenceClient
+# --- Model configuration ---------------------------------------------------
+# Our own DeepResearch model. When QUEST_BASE_URL is configured in Space
+# Secrets, the app will route requests to that dedicated HF Inference Endpoint
+# instead of the shared HF Inference API.
+QUEST_MODEL_ID = "osunlp/Quest-4B"
+QUEST_BASE_URL = os.getenv("QUEST_BASE_URL", "").strip()
+# Endpoints built from the TGI image expose a single-model OpenAI route; the
+# model name passed to chat_completion is usually "tgi". vLLM endpoints usually
+# want the original repo id. QUEST_ENDPOINT_MODEL overrides this if needed.
+QUEST_ENDPOINT_MODEL = os.getenv("QUEST_ENDPOINT_MODEL", "tgi").strip() or "tgi"
+# Shared HF Inference API fallbacks (free, rate-limited). These are used when
+# the user picks one of these from the Model dropdown; they do NOT go through
+# the private endpoint.
+FREE_FALLBACK_MODELS = [
     "Qwen/Qwen3-8B",
     "google/gemma-3-12b-it",
     "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
     "Qwen/Qwen2.5-7B-Instruct",
     "meta-llama/Llama-3.1-8B-Instruct",
 ]
+# Quest-4B is always listed first so the target model is visible. It is
+# preselected by default only when QUEST_BASE_URL is set, and requests to it
+# only work once that secret is configured.
+DEFAULT_MODEL_CHOICES = [QUEST_MODEL_ID] + FREE_FALLBACK_MODELS
+DEFAULT_MODEL = os.getenv(
+    "DEFAULT_MODEL",
+    QUEST_MODEL_ID if QUEST_BASE_URL else FREE_FALLBACK_MODELS[0],
+)
 PAPER_URL = os.getenv("PAPER_URL", "#")
 CODE_URL = os.getenv("CODE_URL", "#")
 DATASET_URL = os.getenv("DATASET_URL", "#")
 MODEL_URL = os.getenv("MODEL_URL", "#")
+# --- System prompt ---------------------------------------------------------
+# Full QUEST SYSTEM_PROMPT (mirrors inference/prompt.py in the research repo)
+# so that Quest-4B sees the exact tool schema it was trained with. Other
+# models still follow this schema just fine in practice.
+QUEST_SYSTEM_PROMPT = """You are a deep research assistant. Your core function is to conduct thorough, multi-source investigations into any topic. You must handle both broad, open-domain inquiries and queries within specialized academic fields. For every request, synthesize information from credible, diverse sources to deliver a comprehensive, accurate, and objective response. When you have gathered sufficient information and are ready to provide the definitive response, you must enclose the entire final answer within <answer></answer> tags.
+# Tools
+You may call one or more functions to assist with the user query.
+You are provided with function signatures within <tools></tools> XML tags:
+<tools>
+{"type": "function", "function": {"name": "search", "description": "Perform Google web searches then returns a string of the top search results. Accepts multiple queries.", "parameters": {"type": "object", "properties": {"query": {"type": "array", "items": {"type": "string", "description": "The search query."}, "minItems": 1, "description": "The list of search queries."}}, "required": ["query"]}}}
+{"type": "function", "function": {"name": "visit", "description": "Visit webpage(s) and return the summary of the content.", "parameters": {"type": "object", "properties": {"url": {"type": "array", "items": {"type": "string"}, "description": "The URL(s) of the webpage(s) to visit. Can be a single URL or an array of URLs."}, "goal": {"type": "string", "description": "The specific information goal for visiting webpage(s)."}}, "required": ["url", "goal"]}}}
+</tools>
+# Using prev_state (Research State Summary)
+If you see a "RESEARCH STATE SUMMARY (prev_state)" section in the user message, it contains a compressed summary of previous research progress. Use it to avoid repeating searches/visits that have already been executed, use verified information directly in your answer, and follow up on uncertain claims only when needed.
+For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
 <tool_call>
+{"name": <function-name>, "arguments": <args-json-object>}
 </tool_call>
+Current date: """
+def build_system_prompt() -> str:
+    return QUEST_SYSTEM_PROMPT + date.today().isoformat()
 TOOL_RESPONSE_TEMPLATE = """<tool_response>
     return name, arguments, None
+def _run_search_single(query: str, max_results: int) -> Dict[str, Any]:
     if not query.strip():
         return {"ok": False, "error": "Search query cannot be empty."}
     cache_key = f"{query.strip().lower()}::{max_results}"
     return payload
+def run_search(query: Union[str, List[str]], max_results: int = 5) -> Dict[str, Any]:
+    """Runs one or more queries through DuckDuckGo.
+    QUEST's schema passes `query` as an array of strings, while the simpler
+    starter schema used a single string. We accept both shapes.
+    """
+    if isinstance(query, list):
+        sub_results: List[Dict[str, Any]] = []
+        for q in query:
+            if not isinstance(q, str) or not q.strip():
+                continue
+            sub_results.append(_run_search_single(q, max_results))
+        return {"ok": True, "queries": query, "results": sub_results}
+    return _run_search_single(str(query or "").strip(), max_results)
 def _clean_html_to_text(html: str, max_chars: int) -> str:
     soup = BeautifulSoup(html, "html.parser")
     for tag in soup(["script", "style", "noscript"]):
     return text[:max_chars]
+def _run_visit_single(url: str, max_chars: int, goal: str = "") -> Dict[str, Any]:
     if not url.strip():
         return {"ok": False, "error": "URL cannot be empty."}
     cache_key = f"{url.strip()}::{max_chars}"
     if cache_key in VISIT_CACHE:
+        return {**VISIT_CACHE[cache_key], "cached": True, "goal": goal}
     try:
         resp = requests.get(
             url,
             text = _clean_html_to_text(resp.text, max_chars=max_chars)
         else:
             text = resp.text[:max_chars]
+        payload = {"ok": True, "url": url, "content": text, "cached": False, "goal": goal}
         VISIT_CACHE[cache_key] = payload
         return payload
     except Exception as exc:
+        return {"ok": False, "url": url, "error": str(exc), "goal": goal}
+def run_visit(
+    url: Union[str, List[str]],
+    max_chars: int = 6000,
+    goal: str = "",
+) -> Dict[str, Any]:
+    """Fetches one or more URLs. Accepts string or list (QUEST schema)."""
+    if isinstance(url, list):
+        sub_results: List[Dict[str, Any]] = []
+        for u in url:
+            if not isinstance(u, str) or not u.strip():
+                continue
+            sub_results.append(_run_visit_single(u, max_chars, goal))
+        return {"ok": True, "goal": goal, "results": sub_results}
+    return _run_visit_single(str(url or "").strip(), max_chars, goal)
+def _build_client_for_model(model: str) -> Tuple[InferenceClient, str, List[str]]:
+    """Returns (client, primary_model_id, fallback_model_ids).
+    When the user picks the Quest model and QUEST_BASE_URL is configured, the
+    InferenceClient is pointed at the dedicated endpoint; otherwise we hit the
+    shared HF Inference API and let the starter fall back across free models.
+    """
+    token = os.getenv("HF_TOKEN")
+    if model == QUEST_MODEL_ID and QUEST_BASE_URL:
+        client = InferenceClient(
+            base_url=QUEST_BASE_URL,
+            token=token,
+            timeout=120,
+        )
+        return client, QUEST_ENDPOINT_MODEL, []
+    client = InferenceClient(token=token, timeout=60)
+    fallbacks = [m for m in FREE_FALLBACK_MODELS if m != model]
+    return client, model, fallbacks
 def call_model(
     max_search_results: int,
     temperature: float,
 ) -> Tuple[str, str]:
+    client, primary_model, fallback_models = _build_client_for_model(model)
+    # Display label: the real HF repo id is nicer than the TGI shim name.
+    display_primary = model if (model == QUEST_MODEL_ID) else primary_model
     state = AgentState()
+    used_model = display_primary
     messages: List[Dict[str, str]] = [
+        {"role": "system", "content": build_system_prompt()},
         {"role": "user", "content": question},
     ]
                 }
             )
+        raw_output, endpoint_model = call_model(
             client=client,
             messages=messages,
+            preferred_model=primary_model,
+            candidate_models=fallback_models,
             temperature=temperature,
             max_new_tokens=1400,
         )
+        model_output = raw_output
+        # Preserve the human-friendly model id for the trace even if the
+        # endpoint ignores the "model" param and returns the TGI shim name.
+        used_model = display_primary if endpoint_model == primary_model == QUEST_ENDPOINT_MODEL else endpoint_model
         messages.append({"role": "assistant", "content": model_output})
         state.trace.append({"turn": turn, "assistant": model_output})
             continue
         else:
             if tool_name == "search":
+                raw_query = tool_args.get("query", "")
+                queries: List[str]
+                if isinstance(raw_query, list):
+                    queries = [str(q).strip() for q in raw_query if str(q).strip()]
+                else:
+                    queries = [str(raw_query).strip()] if str(raw_query).strip() else []
                 max_results = int(tool_args.get("max_results", max_search_results))
                 max_results = max(1, min(max_results, 10))
+                per_query: List[Dict[str, Any]] = []
+                for q in queries:
+                    if q in state.searched_query_set:
+                        per_query.append({
+                            "ok": True,
+                            "query": q,
+                            "cached": True,
+                            "note": "Already searched; reusing cached result.",
+                            "results": [],
+                        })
+                        continue
+                    state.searched_queries.append(q)
+                    state.searched_query_set.add(q)
+                    single = _run_search_single(q, max_results)
+                    per_query.append(single)
+                    if single.get("ok"):
+                        first_titles = [r.get("title", "") for r in single.get("results", [])[:2]]
                         if first_titles:
                             state.trusted_notes.append(
+                                f"Searched '{q}' and found leads: {', '.join(t for t in first_titles if t)}"
                             )
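+                if not per_query:
+                    # No usable query: report an error instead of an empty success.
+                    tool_response = {"ok": False, "error": "Search query cannot be empty."}
+                else:
+                    tool_response = (
+                        per_query[0]
+                        if len(per_query) == 1
+                        else {"ok": True, "queries": queries, "results": per_query}
+                    )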
             elif tool_name == "visit":
+                raw_url = tool_args.get("url", "")
+                urls: List[str]
+                if isinstance(raw_url, list):
+                    urls = [str(u).strip() for u in raw_url if str(u).strip()]
+                else:
+                    urls = [str(raw_url).strip()] if str(raw_url).strip() else []
+                goal = str(tool_args.get("goal", "")).strip()
                 max_chars = int(tool_args.get("max_chars", 6000))
                 max_chars = max(500, min(max_chars, 20000))
+                per_url: List[Dict[str, Any]] = []
+                for u in urls:
+                    if u in state.visited_url_set:
+                        per_url.append({
+                            "ok": True,
+                            "url": u,
+                            "cached": True,
+                            "note": "Already visited; reusing cached result.",
+                        })
+                        continue
+                    state.visited_urls.append(u)
+                    state.visited_url_set.add(u)
+                    single = _run_visit_single(u, max_chars, goal)
+                    per_url.append(single)
+                    if single.get("ok"):
+                        snippet = str(single.get("content", ""))[:180]
                         if snippet:
                             state.trusted_notes.append(
+                                f"Visited {u} and extracted key context: {snippet}"
                             )
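+                if not per_url:
+                    # No usable URL: report an error instead of an empty success.
+                    tool_response = {"ok": False, "error": "URL cannot be empty.", "goal": goal}
+                else:
+                    tool_response = (
+                        per_url[0]
+                        if len(per_url) == 1
+                        else {"ok": True, "goal": goal, "results": per_url}
+                    )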
             else:
                 tool_response = {"ok": False, "error": f"Unknown tool: {tool_name}"}
             "Go to Settings -> Secrets -> add `HF_TOKEN`, then retry."
         )
         return warning, json.dumps({"error": warning}, ensure_ascii=False, indent=2)
+    if model == QUEST_MODEL_ID and not QUEST_BASE_URL:
+        warning = (
+            f"`{QUEST_MODEL_ID}` is private and not available via the free HF Inference API. "
+            "Create a dedicated HF Inference Endpoint for it (https://ui.endpoints.huggingface.co/), "
+            "then set `QUEST_BASE_URL` in Space Secrets to the endpoint's `/v1/` URL. "
+            "In the meantime you can pick one of the open-weights models in the dropdown."
+        )
+        return warning, json.dumps({"error": warning}, ensure_ascii=False, indent=2)
     try:
         return build_research_agent(
             question=question,
                 gr.HTML('<div class="section-heading">Settings</div>')
                 model = gr.Dropdown(
                     label="Model",
+                    choices=DEFAULT_MODEL_CHOICES,
+                    value=DEFAULT_MODEL if DEFAULT_MODEL in DEFAULT_MODEL_CHOICES else DEFAULT_MODEL_CHOICES[0],
                     allow_custom_value=True,
                 )
                 max_turns = gr.Slider(
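
The dispatch code in this diff accepts `query` and `url` as either a string or a list, and the loop stops once the model emits `<answer>`. A minimal standalone sketch of that parsing follows. `extract_tool_call`, `as_string_list`, and `extract_answer` are hypothetical helpers written for illustration; they are not the `parse_tool_call` from `app.py`, whose body is not shown here.

```python
import json
import re
from typing import Any, Dict, List, Optional, Tuple

# Non-greedy match that still spans nested braces, because the closing
# brace must be followed by the </tool_call> tag.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def as_string_list(value: Any) -> List[str]:
    """Normalize a string, list, or None into a list of non-empty strings."""
    if isinstance(value, list):
        items = value
    elif value is None:
        items = []
    else:
        items = [value]
    return [str(item).strip() for item in items if str(item).strip()]


def extract_tool_call(text: str) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Return (name, arguments) from the first tool call, or None."""
    match = TOOL_CALL_RE.search(text)
    if not match:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    name = payload.get("name")
    arguments = payload.get("arguments", {})
    if not isinstance(name, str) or not isinstance(arguments, dict):
        return None
    return name, arguments


def extract_answer(text: str) -> Optional[str]:
    """Return the final answer text, or None if the loop should continue."""
    match = ANSWER_RE.search(text)
    return match.group(1).strip() if match else None


sample = '<tool_call>\n{"name": "search", "arguments": {"query": ["a", " b "]}}\n</tool_call>'
name, arguments = extract_tool_call(sample)
queries = as_string_list(arguments["query"])
```

Treating `None` as an empty list avoids turning a JSON `null` argument into the literal search string `"None"`.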