Spaces: osunlp/QUEST

Running

App Files Files Community

Lzy01241010 and Claude Opus 4.7 committed 13 days ago

Commit 0c32859 (1 parent: cf22067)

rename: Quest-35B -> QUEST-35B (uniform uppercase branding)


Files changed (3)

.env.example +4 -4
README.md +8 -8
app.py +6 -6

.env.example CHANGED

@@ -2,15 +2,15 @@
 # Required
 # =============================================================================
-# Personal HF token with read access to osunlp/Quest-35B.
+# Personal HF token with read access to osunlp/QUEST-35B.
 HF_TOKEN=hf_xxx
-# Dedicated HF Inference Endpoint URL that serves osunlp/Quest-35B.
+# Dedicated HF Inference Endpoint URL that serves osunlp/QUEST-35B.
 # Must end with /v1/.
 QUEST_BASE_URL=https://your-endpoint-id.aws.endpoints.huggingface.cloud/v1/
 # Model name the endpoint responds to. TGI containers usually use "tgi";
-# vLLM containers usually use the original repo id ("osunlp/Quest-35B").
+# vLLM containers usually use the original repo id ("osunlp/QUEST-35B").
 QUEST_ENDPOINT_MODEL=tgi
 # Bearer token sent to QUEST_BASE_URL. Optional. When unset, HF_TOKEN is used
@@ -21,7 +21,7 @@ QUEST_ENDPOINT_MODEL=tgi
 QUEST_API_KEY=
 # Default model preselected in the dropdown.
-DEFAULT_MODEL=osunlp/Quest-35B
+DEFAULT_MODEL=osunlp/QUEST-35B
 # =============================================================================
 # Recommended: strongly improves latency and reliability
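
The `.env.example` comments state three rules. `QUEST_BASE_URL` must end with `/v1/`. `QUEST_ENDPOINT_MODEL` defaults to `tgi`. `QUEST_API_KEY` is optional, and `HF_TOKEN` is used when it is unset. The block below is a minimal sketch of how those rules could be checked at startup. `QuestEndpointConfig` and `load_quest_config` are hypothetical names that are not part of this commit. The `QUEST_ENDPOINT_MODEL` fallback reuses the expression shown in the `app.py` diff. The sketch also assumes that an empty `QUEST_API_KEY=` line (as in the example file) counts as unset.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class QuestEndpointConfig:
    """Hypothetical container for the values documented in .env.example."""

    base_url: str
    endpoint_model: str
    api_key: str


def load_quest_config(env: Mapping[str, str] = os.environ) -> QuestEndpointConfig:
    """Validate the QUEST_* variables and apply the documented fallbacks."""
    base_url = env.get("QUEST_BASE_URL", "").strip()
    # .env.example: the endpoint URL "Must end with /v1/."
    if not base_url.endswith("/v1/"):
        raise ValueError(f"QUEST_BASE_URL must end with /v1/, got {base_url!r}")

    # Same fallback expression as app.py: unset or blank means "tgi".
    endpoint_model = env.get("QUEST_ENDPOINT_MODEL", "tgi").strip() or "tgi"

    # QUEST_API_KEY is optional; when unset (assumption: or left empty, as in
    # the example file), HF_TOKEN is used as the bearer token instead.
    api_key = env.get("QUEST_API_KEY", "").strip() or env.get("HF_TOKEN", "").strip()
    if not api_key:
        raise ValueError("set QUEST_API_KEY or HF_TOKEN")

    return QuestEndpointConfig(base_url, endpoint_model, api_key)


if __name__ == "__main__":
    # Placeholder values copied from .env.example, not real credentials.
    cfg = load_quest_config({
        "HF_TOKEN": "hf_xxx",
        "QUEST_BASE_URL": "https://your-endpoint-id.aws.endpoints.huggingface.cloud/v1/",
        "QUEST_ENDPOINT_MODEL": "tgi",
        "QUEST_API_KEY": "",
    })
    print(cfg.endpoint_model, cfg.api_key)
```

The function fails fast on a URL without the trailing `/v1/`. That way a misconfigured Space reports the problem at startup rather than on the first chat request.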

README.md CHANGED

@@ -12,7 +12,7 @@ pinned: false
 # DeepResearch Space
 An interactive Hugging Face Space for a **Quest DeepResearch** agent. The app
-can either talk to **`osunlp/Quest-35B`** (our own fine-tuned research model,
+can either talk to **`osunlp/QUEST-35B`** (our own fine-tuned research model,
 routed through a private HF Inference Endpoint) or fall back to open-weights
 models through the shared HF Inference API.
@@ -24,7 +24,7 @@ Supported tools:
 ---
-## 1) Use our own `osunlp/Quest-35B` model (recommended)
+## 1) Use our own `osunlp/QUEST-35B` model (recommended)
 Because the model is **private** during the beta, it is not on the free
 Inference API. You host it yourself on a dedicated HF Inference Endpoint
@@ -33,7 +33,7 @@ Inference API. You host it yourself on a dedicated HF Inference Endpoint
 ### 1a) Create the endpoint once
 1. Open <https://ui.endpoints.huggingface.co/> and click **"New endpoint"**.
-2. **Model repository**: `osunlp/Quest-35B` (use a token with access).
+2. **Model repository**: `osunlp/QUEST-35B` (use a token with access).
 3. **Hardware**: `1x Nvidia L4 (24GB)` is usually the sweet spot for a 35B
    model. `Nvidia T4 small (16GB)` works too and is cheaper.
 4. **Advanced → Container Type**: keep `Text Generation Inference` (TGI) or
@@ -49,13 +49,13 @@ In this Space's **Settings → Secrets / Variables**:
 | Name | Value | Why |
 |---|---|---|
-| `HF_TOKEN` | your personal HF token with read access to `osunlp/Quest-35B` | pulls private weights & authenticates the endpoint call |
+| `HF_TOKEN` | your personal HF token with read access to `osunlp/QUEST-35B` | pulls private weights & authenticates the endpoint call |
 | `QUEST_BASE_URL` | the endpoint URL **ending with `/v1/`** (e.g. `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud/v1/`) | tells the app to route chat completions to your endpoint |
-| `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/Quest-35B` if you deployed with vLLM) | some containers need the exact model name |
-| `DEFAULT_MODEL` | `osunlp/Quest-35B` | preselects the right option in the UI |
+| `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/QUEST-35B` if you deployed with vLLM) | some containers need the exact model name |
+| `DEFAULT_MODEL` | `osunlp/QUEST-35B` | preselects the right option in the UI |
 Click **Restart this Space**. The `Model` dropdown now shows
-`osunlp/Quest-35B` at the top; selecting it routes requests through your
+`osunlp/QUEST-35B` at the top; selecting it routes requests through your
 endpoint.
 > Cost reality-check: on a 1× L4 at `$0.80/hr` with Scale-to-Zero, a small
@@ -114,7 +114,7 @@ python app.py
 - `app.py` uses `huggingface_hub.InferenceClient(base_url=QUEST_BASE_URL, ...)`
   for the private-endpoint path and the same client without `base_url` for the
   shared API path.
-- The system prompt matches the schema Quest-35B was trained on (array-based
+- The system prompt matches the schema QUEST-35B was trained on (array-based
   `search` / `visit` with an explicit `goal`), so the private model stays
   in-distribution. The open-weights fallbacks also follow the same schema.
 - Visited URLs and search queries are cached in-process so repeated tool
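
The README's implementation notes describe two request paths. `app.py` builds `huggingface_hub.InferenceClient(base_url=QUEST_BASE_URL, ...)` for the private endpoint, and it uses the same client without `base_url` for the shared API. The sketch below expresses that split as plain keyword arguments, so the routing decision can be tested without network access. `route_request` is a hypothetical helper and is not code from this commit. It assumes the endpoint expects the `QUEST_ENDPOINT_MODEL` name (usually `tgi`) in `chat_completion`, as the `app.py` comments describe.

```python
from typing import Any, Dict, Tuple

QUEST_MODEL_ID = "osunlp/QUEST-35B"


def route_request(
    selected_model: str,
    base_url: str,
    endpoint_model: str,
    token: str,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (client_kwargs, chat_kwargs) for huggingface_hub.InferenceClient.

    Hypothetical helper: QUEST-35B plus a configured endpoint URL goes to the
    dedicated endpoint; anything else uses the shared Inference API.
    """
    if selected_model == QUEST_MODEL_ID and base_url:
        # Private endpoint: the OpenAI-compatible route serves one model, so
        # chat_completion receives the container's name (usually "tgi").
        return {"base_url": base_url, "token": token}, {"model": endpoint_model}
    # Shared API: same client without base_url; the repo id picks the model.
    return {"token": token}, {"model": selected_model}


# Wiring it up needs huggingface_hub and network access, e.g.:
#   from huggingface_hub import InferenceClient
#   client_kwargs, chat_kwargs = route_request(QUEST_MODEL_ID, QUEST_BASE_URL,
#                                              QUEST_ENDPOINT_MODEL, HF_TOKEN)
#   client = InferenceClient(**client_kwargs)
#   reply = client.chat_completion(messages, max_tokens=1024, **chat_kwargs)
```

The sketch passes the model name to `chat_completion` rather than to the constructor, which keeps one client shape for both paths. Recent `huggingface_hub` releases also refuse a constructor call that sets both `model` and `base_url`.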

app.py CHANGED

@@ -17,14 +17,14 @@ from huggingface_hub import InferenceClient
 # Our own DeepResearch model. When QUEST_BASE_URL is configured in Space
 # Secrets, the app will route requests to that dedicated HF Inference Endpoint
 # instead of the shared HF Inference API.
-QUEST_MODEL_ID = "osunlp/Quest-35B"
+QUEST_MODEL_ID = "osunlp/QUEST-35B"
 QUEST_BASE_URL = os.getenv("QUEST_BASE_URL", "").strip()
 # Endpoints built from the TGI image expose a single-model OpenAI route; the
 # model name passed to chat_completion is usually "tgi". vLLM endpoints usually
 # want the original repo id. QUEST_ENDPOINT_MODEL overrides this if needed.
 QUEST_ENDPOINT_MODEL = os.getenv("QUEST_ENDPOINT_MODEL", "tgi").strip() or "tgi"
-# This Space runs exclusively on Quest-35B served via the private HF Inference
+# This Space runs exclusively on QUEST-35B served via the private HF Inference
 # Endpoint pointed to by QUEST_BASE_URL. No public fallback list — the model
 # field in the UI is display-only.
 DEFAULT_MODEL = QUEST_MODEL_ID
@@ -40,7 +40,7 @@ MODEL_URL = os.getenv("MODEL_URL", "#")
 # --- System prompt ---------------------------------------------------------
 # Full QUEST SYSTEM_PROMPT (mirrors inference/prompt.py in the research repo)
-# so that Quest-35B sees the exact tool schema it was trained with. Other
+# so that QUEST-35B sees the exact tool schema it was trained with. Other
 # models still follow this schema just fine in practice.
 QUEST_SYSTEM_PROMPT = """You are a deep research assistant. Your core function is to conduct thorough, multi-source investigations into any topic. You must handle both broad, open-domain inquiries and queries within specialized academic fields. For every request, synthesize information from credible, diverse sources to deliver a comprehensive, accurate, and objective response. When you have gathered sufficient information and are ready to provide the definitive response, you must enclose the entire final answer within <answer></answer> tags.
@@ -1014,7 +1014,7 @@ _TABLE_SEPARATOR_RE = re.compile(
 def strip_think_blocks(text: str) -> str:
     """Remove any <think>...</think> reasoning blocks.
-    Quest-35B (Qwen3 family) emits `<think>` reasoning before the final
+    QUEST-35B (Qwen3 family) emits `<think>` reasoning before the final
     answer. When the endpoint is deployed without a reasoning parser, the raw
     tags leak into chat completion `content`; stripping them here keeps the
     extracted answer clean for Markdown rendering.
@@ -1438,7 +1438,7 @@ def _render_progress(
 ) -> str:
     """Render the in-progress status view that replaces the Markdown panel
     while the agent is still running, so the user is not staring at a blank
-    box for the 20-60 seconds a full Quest-35B research run can take."""
+    box for the 20-60 seconds a full QUEST-35B research run can take."""
     header = (
         f"### ⏳ Researching…\n\n"
         f"**Model:** `{used_model}`  \n"
@@ -1610,7 +1610,7 @@ def build_research_agent(
         elif not tool_name:
             # No explicit tool call and no final answer: force finalization.
             # IMPORTANT: do not write the literal characters `<answer>...</answer>`
-            # here. Some models (notably the Qwen3 family that Quest-35B is
+            # here. Some models (notably the Qwen3 family that QUEST-35B is
             # built on) will echo the template verbatim, which means the
             # extracted answer ends up being the three-dot placeholder `...`
             # and the user sees an empty-looking result.
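
The `strip_think_blocks` docstring and the forced-finalization comment in the `app.py` diff describe two failure modes:

- Raw `<think>` tags leak into `content` when the endpoint runs without a reasoning parser.
- An echoed `<answer>...</answer>` template produces a bare `...` answer.

The function bodies are not part of this diff. The block below is only a sketch of how such post-processing could work, not the Space's implementation. `extract_answer` and the module-level patterns are hypothetical. The stray-`</think>` branch assumes a chat template that places the opening tag in the prompt.

```python
import re
from typing import Optional

# A complete reasoning block; DOTALL lets ".*?" span newlines.
_THINK_BLOCK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
# The final-answer span the system prompt asks the model to emit.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)
# Placeholders an echoed template can leave behind.
_PLACEHOLDERS = {"", "...", "…"}


def strip_think_blocks(text: str) -> str:
    """Remove <think>...</think> blocks from raw chat completion content."""
    text = _THINK_BLOCK_RE.sub("", text)
    # Assumption: if the chat template put "<think>" into the prompt, only the
    # closing tag reaches `content`; keep what follows the last one.
    if "</think>" in text:
        text = text.rpartition("</think>")[2]
    return text.strip()


def extract_answer(text: str) -> Optional[str]:
    """Return the last <answer>...</answer> body, or None if there is none."""
    matches = _ANSWER_RE.findall(strip_think_blocks(text))
    if not matches:
        return None
    answer = matches[-1].strip()
    # An echoed "<answer>...</answer>" template is not a real answer.
    return None if answer in _PLACEHOLDERS else answer
```

Returning `None` for a placeholder lets the caller retry or show an explicit "no answer" message instead of rendering `...` to the user.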