Commit 0c32859
Parent(s): cf22067
rename: Quest-35B -> QUEST-35B (uniform uppercase branding)

Files changed:
- .env.example +4 -4
- README.md +8 -8
- app.py +6 -6
.env.example
CHANGED
@@ -2,15 +2,15 @@
 # Required
 # =============================================================================
 
-# Personal HF token with read access to osunlp/
+# Personal HF token with read access to osunlp/QUEST-35B.
 HF_TOKEN=hf_xxx
 
-# Dedicated HF Inference Endpoint URL that serves osunlp/
+# Dedicated HF Inference Endpoint URL that serves osunlp/QUEST-35B.
 # Must end with /v1/.
 QUEST_BASE_URL=https://your-endpoint-id.aws.endpoints.huggingface.cloud/v1/
 
 # Model name the endpoint responds to. TGI containers usually use "tgi";
-# vLLM containers usually use the original repo id ("osunlp/
+# vLLM containers usually use the original repo id ("osunlp/QUEST-35B").
 QUEST_ENDPOINT_MODEL=tgi
 
 # Bearer token sent to QUEST_BASE_URL. Optional. When unset, HF_TOKEN is used
@@ -21,7 +21,7 @@ QUEST_ENDPOINT_MODEL=tgi
 QUEST_API_KEY=
 
 # Default model preselected in the dropdown.
-DEFAULT_MODEL=osunlp/
+DEFAULT_MODEL=osunlp/QUEST-35B
 
 # =============================================================================
 # Recommended: strongly improves latency and reliability
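The comments in `.env.example` describe a small set of rules: the base URL must end with `/v1/`, the endpoint model name defaults to `tgi`, and the bearer token falls back to `HF_TOKEN` when `QUEST_API_KEY` is unset. A minimal sketch of how those rules could be resolved in Python follows. The `EndpointSettings` type and the `load_settings` helper are hypothetical names used for illustration only and do not appear in this commit; the fallback of `DEFAULT_MODEL` to `osunlp/QUEST-35B` is likewise an assumption.

```python
import os
from typing import Mapping, NamedTuple


class EndpointSettings(NamedTuple):
    """Resolved configuration (hypothetical container, not part of app.py)."""
    base_url: str
    endpoint_model: str
    api_key: str
    default_model: str


def load_settings(env: Mapping[str, str] = os.environ) -> EndpointSettings:
    """Resolve the variables documented in .env.example (illustrative helper)."""
    base_url = env.get("QUEST_BASE_URL", "").strip()
    # The template states that the endpoint URL must end with /v1/.
    if base_url and not base_url.endswith("/v1/"):
        raise ValueError("QUEST_BASE_URL must end with /v1/")
    # An unset or empty QUEST_ENDPOINT_MODEL falls back to "tgi".
    endpoint_model = env.get("QUEST_ENDPOINT_MODEL", "tgi").strip() or "tgi"
    # QUEST_API_KEY is optional; when it is unset, HF_TOKEN is used instead.
    api_key = env.get("QUEST_API_KEY", "").strip() or env.get("HF_TOKEN", "").strip()
    # Assumption: fall back to the renamed repo id when DEFAULT_MODEL is unset.
    default_model = env.get("DEFAULT_MODEL", "").strip() or "osunlp/QUEST-35B"
    return EndpointSettings(base_url, endpoint_model, api_key, default_model)
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise without touching the real environment.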
README.md
CHANGED
@@ -12,7 +12,7 @@ pinned: false
 # DeepResearch Space
 
 An interactive Hugging Face Space for a **Quest DeepResearch** agent. The app
-can either talk to **`osunlp/
+can either talk to **`osunlp/QUEST-35B`** (our own fine-tuned research model,
 routed through a private HF Inference Endpoint) or fall back to open-weights
 models through the shared HF Inference API.
 
@@ -24,7 +24,7 @@ Supported tools:
 
 ---
 
-## 1) Use our own `osunlp/
+## 1) Use our own `osunlp/QUEST-35B` model (recommended)
 
 Because the model is **private** during the beta, it is not on the free
 Inference API. You host it yourself on a dedicated HF Inference Endpoint
@@ -33,7 +33,7 @@ Inference API. You host it yourself on a dedicated HF Inference Endpoint
 ### 1a) Create the endpoint once
 
 1. Open <https://ui.endpoints.huggingface.co/> and click **"New endpoint"**.
-2. **Model repository**: `osunlp/
+2. **Model repository**: `osunlp/QUEST-35B` (use a token with access).
 3. **Hardware**: `1x Nvidia L4 (24GB)` is usually the sweet spot for a 35B
 model. `Nvidia T4 small (16GB)` works too and is cheaper.
 4. **Advanced → Container Type**: keep `Text Generation Inference` (TGI) or
@@ -49,13 +49,13 @@ In this Space's **Settings → Secrets / Variables**:
 
 | Name | Value | Why |
 |---|---|---|
-| `HF_TOKEN` | your personal HF token with read access to `osunlp/
+| `HF_TOKEN` | your personal HF token with read access to `osunlp/QUEST-35B` | pulls private weights & authenticates the endpoint call |
 | `QUEST_BASE_URL` | the endpoint URL **ending with `/v1/`** (e.g. `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud/v1/`) | tells the app to route chat completions to your endpoint |
-| `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/
-| `DEFAULT_MODEL` | `osunlp/
+| `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/QUEST-35B` if you deployed with vLLM) | some containers need the exact model name |
+| `DEFAULT_MODEL` | `osunlp/QUEST-35B` | preselects the right option in the UI |
 
 Click **Restart this Space**. The `Model` dropdown now shows
-`osunlp/
+`osunlp/QUEST-35B` at the top; selecting it routes requests through your
 endpoint.
 
 > Cost reality-check: on a 1× L4 at `$0.80/hr` with Scale-to-Zero, a small
@@ -114,7 +114,7 @@ python app.py
 - `app.py` uses `huggingface_hub.InferenceClient(base_url=QUEST_BASE_URL, ...)`
 for the private-endpoint path and the same client without `base_url` for the
 shared API path.
-- The system prompt matches the schema
+- The system prompt matches the schema QUEST-35B was trained on (array-based
 `search` / `visit` with an explicit `goal`), so the private model stays
 in-distribution. The open-weights fallbacks also follow the same schema.
 - Visited URLs and search queries are cached in-process so repeated tool
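The last README bullet in this hunk mentions that visited URLs and search queries are cached in-process. The caching code itself is not part of this commit, so the following is only a sketch of one way such a cache could look, assuming a plain dictionary keyed by the URL or query string. The `cached` wrapper and the `fake_visit` stand-in are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, List


def cached(fetch: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a fetch function with a per-process dictionary cache."""
    store: Dict[str, str] = {}

    def wrapper(key: str) -> str:
        # Only call the underlying tool the first time a key is seen.
        if key not in store:
            store[key] = fetch(key)
        return store[key]

    return wrapper


# Usage with a stand-in for a real `visit` tool (no network access involved).
calls: List[str] = []


def fake_visit(url: str) -> str:
    calls.append(url)
    return f"<page for {url}>"


visit = cached(fake_visit)
first = visit("https://example.com")
second = visit("https://example.com")  # served from the cache
```

Because the dictionary lives inside the process, a cache like this would be emptied whenever the Space restarts; `functools.lru_cache` would be a reasonable alternative if a size bound is wanted.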
app.py
CHANGED
@@ -17,14 +17,14 @@ from huggingface_hub import InferenceClient
 # Our own DeepResearch model. When QUEST_BASE_URL is configured in Space
 # Secrets, the app will route requests to that dedicated HF Inference Endpoint
 # instead of the shared HF Inference API.
-QUEST_MODEL_ID = "osunlp/
+QUEST_MODEL_ID = "osunlp/QUEST-35B"
 QUEST_BASE_URL = os.getenv("QUEST_BASE_URL", "").strip()
 # Endpoints built from the TGI image expose a single-model OpenAI route; the
 # model name passed to chat_completion is usually "tgi". vLLM endpoints usually
 # want the original repo id. QUEST_ENDPOINT_MODEL overrides this if needed.
 QUEST_ENDPOINT_MODEL = os.getenv("QUEST_ENDPOINT_MODEL", "tgi").strip() or "tgi"
 
-# This Space runs exclusively on
+# This Space runs exclusively on QUEST-35B served via the private HF Inference
 # Endpoint pointed to by QUEST_BASE_URL. No public fallback list — the model
 # field in the UI is display-only.
 DEFAULT_MODEL = QUEST_MODEL_ID
@@ -40,7 +40,7 @@ MODEL_URL = os.getenv("MODEL_URL", "#")
 
 # --- System prompt ---------------------------------------------------------
 # Full QUEST SYSTEM_PROMPT (mirrors inference/prompt.py in the research repo)
-# so that
+# so that QUEST-35B sees the exact tool schema it was trained with. Other
 # models still follow this schema just fine in practice.
 QUEST_SYSTEM_PROMPT = """You are a deep research assistant. Your core function is to conduct thorough, multi-source investigations into any topic. You must handle both broad, open-domain inquiries and queries within specialized academic fields. For every request, synthesize information from credible, diverse sources to deliver a comprehensive, accurate, and objective response. When you have gathered sufficient information and are ready to provide the definitive response, you must enclose the entire final answer within <answer></answer> tags.
 
@@ -1014,7 +1014,7 @@ _TABLE_SEPARATOR_RE = re.compile(
 def strip_think_blocks(text: str) -> str:
 """Remove any <think>...</think> reasoning blocks.
 
-
+QUEST-35B (Qwen3 family) emits `<think>` reasoning before the final
 answer. When the endpoint is deployed without a reasoning parser, the raw
 tags leak into chat completion `content`; stripping them here keeps the
 extracted answer clean for Markdown rendering.
@@ -1438,7 +1438,7 @@ def _render_progress(
 ) -> str:
 """Render the in-progress status view that replaces the Markdown panel
 while the agent is still running, so the user is not staring at a blank
-box for the 20-60 seconds a full
+box for the 20-60 seconds a full QUEST-35B research run can take."""
 header = (
 f"### ⏳ Researching…\n\n"
 f"**Model:** `{used_model}` \n"
@@ -1610,7 +1610,7 @@ def build_research_agent(
 elif not tool_name:
 # No explicit tool call and no final answer: force finalization.
 # IMPORTANT: do not write the literal characters `<answer>...</answer>`
-# here. Some models (notably the Qwen3 family that
+# here. Some models (notably the Qwen3 family that QUEST-35B is
 # built on) will echo the template verbatim, which means the
 # extracted answer ends up being the three-dot placeholder `...`
 # and the user sees an empty-looking result.
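The `strip_think_blocks` docstring and the forced-finalization comment both concern how the final answer is recovered from raw model output. The function bodies are not shown in this diff, so the following is a minimal sketch under stated assumptions rather than the code in `app.py`: it assumes non-nested `<think>` blocks and a single `<answer>` pair, and `extract_answer` is a hypothetical helper name.

```python
import re
from typing import Optional

# Non-greedy matches; DOTALL lets the blocks span multiple lines.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def strip_think_blocks(text: str) -> str:
    """Remove any <think>...</think> reasoning blocks (illustrative version)."""
    return _THINK_RE.sub("", text).strip()


def extract_answer(text: str) -> Optional[str]:
    """Return the content of <answer>...</answer>, or None if it is unusable."""
    match = _ANSWER_RE.search(strip_think_blocks(text))
    if match is None:
        return None
    answer = match.group(1).strip()
    # A bare "..." suggests the model echoed a template instead of answering.
    if answer in ("", "..."):
        return None
    return answer


raw = "<think>plan the search\nthen summarize</think>\n<answer>Paris</answer>"
clean = extract_answer(raw)
echoed = extract_answer("<answer>...</answer>")
```

Treating the three-dot placeholder as a missing answer is one way to surface the echo problem described in the comment above, so that the caller can retry instead of rendering an empty-looking result.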