Lzy01241010 Claude Opus 4.7 committed
Commit 0c32859 · 1 Parent(s): cf22067

rename: Quest-35B -> QUEST-35B (uniform uppercase branding)

Files changed (3)
  1. .env.example +4 -4
  2. README.md +8 -8
  3. app.py +6 -6
.env.example CHANGED
@@ -2,15 +2,15 @@
  # Required
  # =============================================================================
 
- # Personal HF token with read access to osunlp/Quest-35B.
+ # Personal HF token with read access to osunlp/QUEST-35B.
  HF_TOKEN=hf_xxx
 
- # Dedicated HF Inference Endpoint URL that serves osunlp/Quest-35B.
+ # Dedicated HF Inference Endpoint URL that serves osunlp/QUEST-35B.
  # Must end with /v1/.
  QUEST_BASE_URL=https://your-endpoint-id.aws.endpoints.huggingface.cloud/v1/
 
  # Model name the endpoint responds to. TGI containers usually use "tgi";
- # vLLM containers usually use the original repo id ("osunlp/Quest-35B").
+ # vLLM containers usually use the original repo id ("osunlp/QUEST-35B").
  QUEST_ENDPOINT_MODEL=tgi
 
  # Bearer token sent to QUEST_BASE_URL. Optional. When unset, HF_TOKEN is used
@@ -21,7 +21,7 @@ QUEST_ENDPOINT_MODEL=tgi
  QUEST_API_KEY=
 
  # Default model preselected in the dropdown.
- DEFAULT_MODEL=osunlp/Quest-35B
+ DEFAULT_MODEL=osunlp/QUEST-35B
 
  # =============================================================================
  # Recommended: strongly improves latency and reliability
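
The variables in `.env.example` carry a few implicit rules: `QUEST_BASE_URL` must end with `/v1/`, `QUEST_ENDPOINT_MODEL` defaults to `tgi`, and `QUEST_API_KEY` falls back to `HF_TOKEN` when unset. A minimal sketch of how a loader might enforce them is below. `QuestSettings` and `load_settings` are hypothetical names used only for illustration, and the `DEFAULT_MODEL` fallback is an assumption; the actual `app.py` may read these values differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class QuestSettings:
    """Hypothetical container for the values described in .env.example."""
    base_url: str
    endpoint_model: str
    api_key: str
    default_model: str


def load_settings(env=None) -> QuestSettings:
    """Read the documented variables from a mapping (os.environ by default)."""
    env = os.environ if env is None else env

    base_url = env.get("QUEST_BASE_URL", "").strip()
    # .env.example states the URL "Must end with /v1/".
    if base_url and not base_url.endswith("/v1/"):
        raise ValueError("QUEST_BASE_URL must end with /v1/")

    # Empty or missing values fall back to "tgi", as in app.py.
    endpoint_model = env.get("QUEST_ENDPOINT_MODEL", "tgi").strip() or "tgi"

    # QUEST_API_KEY is optional; HF_TOKEN is used when it is unset.
    api_key = env.get("QUEST_API_KEY", "").strip() or env.get("HF_TOKEN", "").strip()

    # Assumption: fall back to the renamed repo id when DEFAULT_MODEL is empty.
    default_model = env.get("DEFAULT_MODEL", "").strip() or "osunlp/QUEST-35B"

    return QuestSettings(base_url, endpoint_model, api_key, default_model)
```

Passing a plain dict instead of `os.environ` keeps the function easy to exercise without touching the process environment.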
README.md CHANGED
@@ -12,7 +12,7 @@ pinned: false
  # DeepResearch Space
 
  An interactive Hugging Face Space for a **Quest DeepResearch** agent. The app
- can either talk to **`osunlp/Quest-35B`** (our own fine-tuned research model,
+ can either talk to **`osunlp/QUEST-35B`** (our own fine-tuned research model,
  routed through a private HF Inference Endpoint) or fall back to open-weights
  models through the shared HF Inference API.
 
@@ -24,7 +24,7 @@ Supported tools:
 
  ---
 
- ## 1) Use our own `osunlp/Quest-35B` model (recommended)
+ ## 1) Use our own `osunlp/QUEST-35B` model (recommended)
 
  Because the model is **private** during the beta, it is not on the free
  Inference API. You host it yourself on a dedicated HF Inference Endpoint
@@ -33,7 +33,7 @@ Inference API. You host it yourself on a dedicated HF Inference Endpoint
  ### 1a) Create the endpoint once
 
  1. Open <https://ui.endpoints.huggingface.co/> and click **"New endpoint"**.
- 2. **Model repository**: `osunlp/Quest-35B` (use a token with access).
+ 2. **Model repository**: `osunlp/QUEST-35B` (use a token with access).
  3. **Hardware**: `1x Nvidia L4 (24GB)` is usually the sweet spot for a 35B
  model. `Nvidia T4 small (16GB)` works too and is cheaper.
  4. **Advanced → Container Type**: keep `Text Generation Inference` (TGI) or
@@ -49,13 +49,13 @@ In this Space's **Settings → Secrets / Variables**:
 
  | Name | Value | Why |
  |---|---|---|
- | `HF_TOKEN` | your personal HF token with read access to `osunlp/Quest-35B` | pulls private weights & authenticates the endpoint call |
+ | `HF_TOKEN` | your personal HF token with read access to `osunlp/QUEST-35B` | pulls private weights & authenticates the endpoint call |
  | `QUEST_BASE_URL` | the endpoint URL **ending with `/v1/`** (e.g. `https://abcdef.us-east-1.aws.endpoints.huggingface.cloud/v1/`) | tells the app to route chat completions to your endpoint |
- | `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/Quest-35B` if you deployed with vLLM) | some containers need the exact model name |
- | `DEFAULT_MODEL` | `osunlp/Quest-35B` | preselects the right option in the UI |
+ | `QUEST_ENDPOINT_MODEL` | `tgi` (default; set to the original repo id `osunlp/QUEST-35B` if you deployed with vLLM) | some containers need the exact model name |
+ | `DEFAULT_MODEL` | `osunlp/QUEST-35B` | preselects the right option in the UI |
 
  Click **Restart this Space**. The `Model` dropdown now shows
- `osunlp/Quest-35B` at the top; selecting it routes requests through your
+ `osunlp/QUEST-35B` at the top; selecting it routes requests through your
  endpoint.
 
  > Cost reality-check: on a 1× L4 at `$0.80/hr` with Scale-to-Zero, a small
@@ -114,7 +114,7 @@ python app.py
  - `app.py` uses `huggingface_hub.InferenceClient(base_url=QUEST_BASE_URL, ...)`
  for the private-endpoint path and the same client without `base_url` for the
  shared API path.
- - The system prompt matches the schema Quest-35B was trained on (array-based
+ - The system prompt matches the schema QUEST-35B was trained on (array-based
  `search` / `visit` with an explicit `goal`), so the private model stays
  in-distribution. The open-weights fallbacks also follow the same schema.
  - Visited URLs and search queries are cached in-process so repeated tool
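
The README describes two request paths: the private endpoint, reached with `InferenceClient(base_url=QUEST_BASE_URL, ...)`, and the shared API, reached with the same client without `base_url`. The sketch below shows one way that choice could be made. `pick_route` is a hypothetical helper, not a function from `app.py`. It only builds the constructor arguments and the model name, so it runs without `huggingface_hub` installed.

```python
def pick_route(selected_model, quest_model_id, base_url, endpoint_model="tgi"):
    """Return (client_kwargs, request_model) for the model chosen in the UI.

    Hypothetical helper: app.py may structure this decision differently.
    """
    if selected_model == quest_model_id and base_url:
        # Private-endpoint path: the client targets the dedicated endpoint, and
        # the request uses the name the container answers to ("tgi" or repo id).
        return {"base_url": base_url}, endpoint_model
    # Shared-API path: no base_url, and the repo id itself is the model name.
    return {}, selected_model


# Intended use, assuming huggingface_hub is installed (not executed here):
#   from huggingface_hub import InferenceClient
#   client_kwargs, request_model = pick_route(...)
#   client = InferenceClient(token=token, **client_kwargs)
#   client.chat_completion(messages=messages, model=request_model)
```

Because the repo id is only used for the comparison and for the shared path, the rename in this commit would not change which name is sent to a TGI endpoint, which still answers to `tgi`.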
app.py CHANGED
@@ -17,14 +17,14 @@ from huggingface_hub import InferenceClient
  # Our own DeepResearch model. When QUEST_BASE_URL is configured in Space
  # Secrets, the app will route requests to that dedicated HF Inference Endpoint
  # instead of the shared HF Inference API.
- QUEST_MODEL_ID = "osunlp/Quest-35B"
+ QUEST_MODEL_ID = "osunlp/QUEST-35B"
  QUEST_BASE_URL = os.getenv("QUEST_BASE_URL", "").strip()
  # Endpoints built from the TGI image expose a single-model OpenAI route; the
  # model name passed to chat_completion is usually "tgi". vLLM endpoints usually
  # want the original repo id. QUEST_ENDPOINT_MODEL overrides this if needed.
  QUEST_ENDPOINT_MODEL = os.getenv("QUEST_ENDPOINT_MODEL", "tgi").strip() or "tgi"
 
- # This Space runs exclusively on Quest-35B served via the private HF Inference
+ # This Space runs exclusively on QUEST-35B served via the private HF Inference
  # Endpoint pointed to by QUEST_BASE_URL. No public fallback list — the model
  # field in the UI is display-only.
  DEFAULT_MODEL = QUEST_MODEL_ID
@@ -40,7 +40,7 @@ MODEL_URL = os.getenv("MODEL_URL", "#")
 
  # --- System prompt ---------------------------------------------------------
  # Full QUEST SYSTEM_PROMPT (mirrors inference/prompt.py in the research repo)
- # so that Quest-35B sees the exact tool schema it was trained with. Other
+ # so that QUEST-35B sees the exact tool schema it was trained with. Other
  # models still follow this schema just fine in practice.
  QUEST_SYSTEM_PROMPT = """You are a deep research assistant. Your core function is to conduct thorough, multi-source investigations into any topic. You must handle both broad, open-domain inquiries and queries within specialized academic fields. For every request, synthesize information from credible, diverse sources to deliver a comprehensive, accurate, and objective response. When you have gathered sufficient information and are ready to provide the definitive response, you must enclose the entire final answer within <answer></answer> tags.
 
@@ -1014,7 +1014,7 @@ _TABLE_SEPARATOR_RE = re.compile(
  def strip_think_blocks(text: str) -> str:
      """Remove any <think>...</think> reasoning blocks.
 
-     Quest-35B (Qwen3 family) emits `<think>` reasoning before the final
+     QUEST-35B (Qwen3 family) emits `<think>` reasoning before the final
      answer. When the endpoint is deployed without a reasoning parser, the raw
      tags leak into chat completion `content`; stripping them here keeps the
      extracted answer clean for Markdown rendering.
@@ -1438,7 +1438,7 @@ def _render_progress(
  ) -> str:
      """Render the in-progress status view that replaces the Markdown panel
      while the agent is still running, so the user is not staring at a blank
-     box for the 20-60 seconds a full Quest-35B research run can take."""
+     box for the 20-60 seconds a full QUEST-35B research run can take."""
      header = (
          f"### ⏳ Researching…\n\n"
          f"**Model:** `{used_model}` \n"
@@ -1610,7 +1610,7 @@ def build_research_agent(
  elif not tool_name:
      # No explicit tool call and no final answer: force finalization.
      # IMPORTANT: do not write the literal characters `<answer>...</answer>`
-     # here. Some models (notably the Qwen3 family that Quest-35B is
+     # here. Some models (notably the Qwen3 family that QUEST-35B is
      # built on) will echo the template verbatim, which means the
      # extracted answer ends up being the three-dot placeholder `...`
      # and the user sees an empty-looking result.
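
The `strip_think_blocks` docstring and the finalization comment describe two related cleanup steps: dropping leaked `<think>` blocks and guarding against an echoed `<answer>...</answer>` template. The diff shows neither function body, so what follows is only a sketch of how they might look. The regular expressions and the `extract_answer` helper are assumptions for illustration and may differ from the actual `app.py`.

```python
import re

# Non-greedy match so several separate <think> blocks are removed one by one;
# DOTALL lets "." span the newlines inside a reasoning block.
_THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def strip_think_blocks(text: str) -> str:
    """Remove any <think>...</think> reasoning blocks (sketch)."""
    return _THINK_RE.sub("", text).strip()


def extract_answer(text: str):
    """Return the text inside <answer></answer>, or None if nothing usable.

    Hypothetical helper: an echoed three-dot placeholder is treated as
    "no answer" so the caller can retry instead of showing `...`.
    """
    match = _ANSWER_RE.search(strip_think_blocks(text))
    if match is None:
        return None
    answer = match.group(1).strip()
    if answer in ("", "..."):
        return None
    return answer
```

This sketch does not handle an unterminated `<think>` block or a stray closing tag; those cases would need extra rules if the endpoint can produce them.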