Spaces: osunlp/QUEST

TomLii Claude Sonnet 4.6 committed on Apr 21

Commit 98cabfd, 1 parent: f6d633d

Add blog_demo/: proxy + single-page chat UI for embedding Quest-4B on a blog

- server.py: FastAPI proxy that holds HF_TOKEN server-side and streams
/chat/completions from QUEST_BASE_URL back to the browser as SSE. Also
serves the static index.html and a /health probe.
- index.html: vanilla-JS chat widget that POSTs /api/chat and renders
the streaming delta.
- README.md: why a proxy is required (token safety + HF endpoints not
sending CORS headers), local run instructions, and a checklist for
hardening before exposing on a public blog (origin allowlist, rate limit,
input caps, fine-grained token, logging).

Smoke-tested locally: /health reports config status, /api/chat returns
a clean 500 when HF_TOKEN/QUEST_BASE_URL are missing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (4)

blog_demo/README.md +80 -0
blog_demo/index.html +238 -0
blog_demo/requirements.txt +3 -0
blog_demo/server.py +169 -0

blog_demo/README.md ADDED

	@@ -0,0 +1,80 @@

+# Blog chat widget — Quest-4B
+Tiny proxy + one-page chat UI you can lift into a blog to let readers
+talk to the Quest-4B HF Inference Endpoint.
+## Why a proxy
+- **Token safety.** `HF_TOKEN` must never ship in client-side JS. Anyone
+  loading the page could read it and run jobs on your HF account.
+- **CORS.** HF Inference Endpoints don't emit permissive CORS headers, so
+  a browser `fetch` straight to the endpoint is blocked even if the
+  token problem were solved.
+The proxy (server.py) holds the token server-side, validates incoming
+requests, and streams the model's reply back.
+## Run locally
+```bash
+cd blog_demo
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+export HF_TOKEN=hf_xxx
+export QUEST_BASE_URL=https://<your-endpoint>.endpoints.huggingface.cloud/v1/
+# optional:
+export QUEST_ENDPOINT_MODEL=tgi           # "osunlp/Quest-4B" if the container is vLLM
+export ALLOWED_ORIGINS=http://127.0.0.1:8000
+python server.py
+```
+Open <http://127.0.0.1:8000/>, ask a question, watch the reply stream in.
+Health check: <http://127.0.0.1:8000/health> returns whether the token
+and base URL are wired up without leaking them.
+## What gets sent upstream
+```
+POST  {QUEST_BASE_URL}/chat/completions
+Headers: Authorization: Bearer $HF_TOKEN
+Body: {
+  "model": "tgi",
+  "messages": [...],
+  "temperature": 0.4,
+  "max_tokens": 1024,
+  "stream": true
+}
+```
+Any OpenAI-compatible endpoint (vLLM, TGI, SGLang, …) responds to this
+shape. The proxy pipes the upstream SSE frames straight to the browser;
+the page parses `choices[].delta.content` to render the streaming answer.
+## Deploying on the blog
+Pick whichever backend is closest to where the blog is hosted:
+| Host | How |
+|---|---|
+| Next.js / Vercel | paste the `POST /api/chat` handler logic into `app/api/chat/route.ts` (use Node's `fetch` + `ReadableStream`), set `HF_TOKEN` and `QUEST_BASE_URL` in Vercel env vars |
+| Cloudflare Workers | port the proxy to a Worker, put `HF_TOKEN` and `QUEST_BASE_URL` in Worker Secrets, set `ALLOWED_ORIGINS` to your blog's origin |
+| FastAPI behind nginx | run `server.py` under `systemd` or `supervisor`, proxy `/api/chat` from the blog hostname |
+| Hugging Face Space (Docker) | drop the whole folder in a Docker Space, set `HF_TOKEN` and `QUEST_BASE_URL` as Space Secrets |
+### Lock it down before going public
+1. **Origin allowlist** — set `ALLOWED_ORIGINS=https://your-blog.com`
+   so other sites can't call your proxy from a browser. CORS is enforced
+   only by browsers, so this does not stop scripts or `curl`.
+2. **Rate limit** — add an IP-based limit (e.g. `slowapi` for FastAPI,
+   Cloudflare Rate Limiting for Workers). A single abusive visitor can
+   drain your endpoint budget fast.
+3. **Input caps** — the proxy already trims each message to 8000 chars
+   and caps history at 40 turns; tune these for your use case.
+4. **Fine-grained token** — create a new HF token with access only to
+   the Quest endpoint so a leak can't touch anything else.
+5. **Observability** — log request counts, latency, and 4xx/5xx rates
+   so you notice abuse early.
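The rate-limit item in the checklist above can be sketched without extra dependencies. This is a minimal illustration, not part of the commit: `SlidingWindowLimiter` and its parameters are hypothetical names, and a maintained library such as `slowapi` (as the README suggests) is likely a better fit for production.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds for each key.

    Hypothetical helper for illustration; state lives in process memory.
    """

    def __init__(
        self,
        limit: int,
        window: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the behaviour can be tested
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

One way to use it in `server.py` would be to create a module-level limiter and raise `HTTPException(429, ...)` at the top of `chat` when `allow(...)` returns `False`, keyed on the client address. The state is per process, so it resets on restart and is not shared across workers. Behind nginx or another reverse proxy, the key would have to come from a forwarded header that the proxy sets.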

blog_demo/index.html ADDED

	@@ -0,0 +1,238 @@

+<!doctype html>
+<html lang="en">
+  <head>
+    <meta charset="utf-8" />
+    <meta name="viewport" content="width=device-width,initial-scale=1" />
+    <title>Quest-4B chat demo</title>
+    <style>
+      :root {
+        --bg: #f2f4f8;
+        --paper: #ffffff;
+        --text: #0d1117;
+        --muted: #64748b;
+        --accent: #be5b2b;
+        --line: rgba(10, 15, 40, 0.1);
+      }
+      * { box-sizing: border-box; }
+      body {
+        margin: 0;
+        background: var(--bg);
+        color: var(--text);
+        font: 15px/1.55 -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto,
+          sans-serif;
+      }
+      .shell {
+        max-width: 760px;
+        margin: 32px auto;
+        padding: 0 20px;
+      }
+      h1 {
+        font-size: 1.6rem;
+        margin: 0 0 4px;
+      }
+      .sub {
+        color: var(--muted);
+        margin: 0 0 20px;
+        font-size: 0.92rem;
+      }
+      .sub code { background: #fff; padding: 1px 6px; border-radius: 6px; border: 1px solid var(--line); }
+      .card {
+        background: var(--paper);
+        border: 1px solid var(--line);
+        border-radius: 14px;
+        box-shadow: 0 1px 2px rgba(10, 15, 40, 0.05),
+          0 2px 10px rgba(10, 15, 40, 0.06);
+        padding: 20px;
+      }
+      #log {
+        min-height: 220px;
+        max-height: 60vh;
+        overflow-y: auto;
+        padding: 4px 6px;
+        margin-bottom: 14px;
+      }
+      .msg { margin: 0 0 14px; }
+      .msg .who {
+        font-size: 0.75rem;
+        font-weight: 700;
+        letter-spacing: 0.06em;
+        text-transform: uppercase;
+        color: var(--muted);
+        margin-bottom: 4px;
+      }
+      .msg.assistant .who { color: var(--accent); }
+      .msg .body { white-space: pre-wrap; word-wrap: break-word; }
+      form { display: flex; gap: 10px; align-items: stretch; }
+      textarea {
+        flex: 1;
+        min-height: 46px;
+        max-height: 160px;
+        resize: vertical;
+        padding: 12px 14px;
+        border: 1px solid var(--line);
+        border-radius: 12px;
+        font: inherit;
+        outline: none;
+      }
+      textarea:focus { border-color: var(--accent); box-shadow: 0 0 0 3px rgba(190,91,43,0.15); }
+      button {
+        background: var(--text);
+        color: #fff;
+        border: 0;
+        border-radius: 999px;
+        padding: 0 22px;
+        font-weight: 600;
+        cursor: pointer;
+      }
+      button:disabled { opacity: 0.55; cursor: default; }
+      .status { color: var(--muted); font-size: 0.85rem; margin-top: 10px; }
+      .status.err { color: #dc2626; }
+    </style>
+  </head>
+  <body>
+    <div class="shell">
+      <h1>Quest-4B chat demo</h1>
+      <p class="sub">
+        Front-end calls <code>/api/chat</code> on this host; the Python proxy
+        adds <code>Authorization: Bearer $HF_TOKEN</code> and forwards the
+        request to <code>$QUEST_BASE_URL</code>.
+      </p>
+      <div class="card">
+        <div id="log"></div>
+        <form id="f">
+          <textarea
+            id="q"
+            placeholder="Ask Quest-4B something... (Enter to send, Shift+Enter for newline)"
+            autofocus
+          ></textarea>
+          <button id="send" type="submit">Send</button>
+        </form>
+        <div class="status" id="status"></div>
+      </div>
+    </div>
+    <script>
+      const log = document.getElementById("log");
+      const form = document.getElementById("f");
+      const input = document.getElementById("q");
+      const send = document.getElementById("send");
+      const status = document.getElementById("status");
+      const history = [];
+      function addMessage(role, text) {
+        const el = document.createElement("div");
+        el.className = "msg " + role;
+        const who = document.createElement("div");
+        who.className = "who";
+        who.textContent = role === "user" ? "You" : "Quest-4B";
+        const body = document.createElement("div");
+        body.className = "body";
+        body.textContent = text;
+        el.appendChild(who);
+        el.appendChild(body);
+        log.appendChild(el);
+        log.scrollTop = log.scrollHeight;
+        return body;
+      }
+      function setStatus(text, isError = false) {
+        status.textContent = text || "";
+        status.classList.toggle("err", Boolean(isError));
+      }
+      async function send_message(content) {
+        history.push({ role: "user", content });
+        addMessage("user", content);
+        const assistantBody = addMessage("assistant", "…");
+        setStatus("Waiting for the endpoint…");
+        send.disabled = true;
+        try {
+          const res = await fetch("/api/chat", {
+            method: "POST",
+            headers: { "Content-Type": "application/json" },
+            body: JSON.stringify({ messages: history, temperature: 0.4 }),
+          });
+          if (!res.ok || !res.body) {
+            const text = await res.text();
+            assistantBody.textContent = "";
+            throw new Error(text || res.statusText);
+          }
+          const reader = res.body.getReader();
+          const decoder = new TextDecoder();
+          let buffer = "";
+          let acc = "";
+          assistantBody.textContent = "";
+          while (true) {
+            const { value, done } = await reader.read();
+            if (done) break;
+            buffer += decoder.decode(value, { stream: true });
+            const lines = buffer.split("\n");
+            buffer = lines.pop() || "";
+            for (const raw of lines) {
+              const line = raw.trim();
+              if (!line.startsWith("data:")) continue;
+              const payload = line.slice(5).trim();
+              if (!payload || payload === "[DONE]") continue;
+              try {
+                const obj = JSON.parse(payload);
+                if (obj.error) {
+                  throw new Error(
+                    "endpoint " +
+                      (obj.error.status || "?") +
+                      ": " +
+                      (obj.error.body || "unknown")
+                  );
+                }
+                const delta = obj.choices?.[0]?.delta?.content || "";
+                if (delta) {
+                  acc += delta;
+                  assistantBody.textContent = acc;
+                  log.scrollTop = log.scrollHeight;
+                }
+              } catch (parseErr) {
+                if (parseErr.message?.startsWith("endpoint ")) throw parseErr;
+              }
+            }
+          }
+          history.push({ role: "assistant", content: acc });
+          setStatus("");
+        } catch (err) {
+          assistantBody.textContent =
+            assistantBody.textContent || "[no response]";
+          setStatus(String(err.message || err), true);
+        } finally {
+          send.disabled = false;
+          input.focus();
+        }
+      }
+      form.addEventListener("submit", (e) => {
+        e.preventDefault();
+        const text = input.value.trim();
+        if (!text) return;
+        input.value = "";
+        send_message(text);
+      });
+      input.addEventListener("keydown", (e) => {
+        if (e.key === "Enter" && !e.shiftKey) {
+          e.preventDefault();
+          form.requestSubmit();
+        }
+      });
+      fetch("/health")
+        .then((r) => r.json())
+        .then((j) => {
+          if (!j.has_token || !j.has_base_url) {
+            setStatus(
+              "Server is running but HF_TOKEN / QUEST_BASE_URL are not set — chat will 500 until you export them.",
+              true
+            );
+          }
+        })
+        .catch(() => setStatus("Cannot reach /health", true));
+    </script>
+  </body>
+</html>
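For checking the proxy from a script rather than a browser, the page's stream parsing can be mirrored in Python. The sketch below follows the same steps as the `<script>` above (buffer bytes, split on newlines, keep `data:` lines, skip `[DONE]`, read `choices[0].delta.content`). `iter_deltas` is a hypothetical name, and it assumes the byte chunks come from whatever HTTP client reads `/api/chat`.

```python
import codecs
import json
from typing import Iterable, Iterator


def iter_deltas(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield text deltas from an OpenAI-style SSE byte stream."""
    # Incremental decoding keeps multi-byte characters intact when a
    # chunk boundary falls in the middle of one.
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        # The last element is an incomplete line; keep it for the next chunk.
        *lines, buffer = buffer.split("\n")
        for raw in lines:
            line = raw.strip()
            if not line.startswith("data:"):
                continue
            payload = line[5:].strip()
            if not payload or payload == "[DONE]":
                continue
            try:
                obj = json.loads(payload)
            except json.JSONDecodeError:
                continue  # ignore frames that are not valid JSON
            if not isinstance(obj, dict):
                continue
            error = obj.get("error")
            if isinstance(error, dict):
                # Same error frame shape that server.py emits.
                raise RuntimeError(
                    f"endpoint {error.get('status', '?')}: "
                    f"{error.get('body', 'unknown')}"
                )
            choices = obj.get("choices") or [{}]
            delta = (choices[0].get("delta") or {}).get("content") or ""
            if delta:
                yield delta
```

With `httpx`, for example, the chunks could come from `response.iter_bytes()` on a streamed POST to `/api/chat`, and `"".join(iter_deltas(...))` would give the full reply.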

blog_demo/requirements.txt ADDED

	@@ -0,0 +1,3 @@

+fastapi>=0.110
+uvicorn[standard]>=0.27
+httpx>=0.27

blog_demo/server.py ADDED

	@@ -0,0 +1,169 @@

+"""
+Minimal proxy server so a static blog page can safely chat with the
+Quest-4B HF Inference Endpoint.
+Why a proxy at all?
+1. HF_TOKEN must not be embedded in client-side JS -- once it is shipped
+   to visitors, the token is exposed and anyone can rack up a bill on
+   your HF account.
+2. HF Inference Endpoints do not emit permissive CORS headers, so even
+   without the token concern, a browser `fetch` straight to the endpoint
+   would be blocked.
+This tiny FastAPI app holds the token server-side, forwards chat turns
+to QUEST_BASE_URL, and streams the response back to the browser.
+Run locally:
+    cd blog_demo
+    pip install fastapi uvicorn httpx
+    export HF_TOKEN=hf_xxx
+    export QUEST_BASE_URL=https://<your-endpoint>.endpoints.huggingface.cloud/v1/
+    python server.py
+Then open http://127.0.0.1:8000/ in a browser.
+"""
+from __future__ import annotations
+import json
+import os
+from typing import Any, Dict, List
+import httpx
+from fastapi import FastAPI, HTTPException, Request
+from fastapi.middleware.cors import CORSMiddleware
+from fastapi.responses import FileResponse, StreamingResponse
+HF_TOKEN = os.environ.get("HF_TOKEN", "").strip()
+QUEST_BASE_URL = os.environ.get("QUEST_BASE_URL", "").strip().rstrip("/")
+QUEST_MODEL = os.environ.get("QUEST_ENDPOINT_MODEL", "tgi").strip() or "tgi"
+ALLOWED_ORIGINS = [
+    o.strip()
+    for o in os.environ.get(
+        "ALLOWED_ORIGINS",
+        "http://127.0.0.1:8000,http://localhost:8000",
+    ).split(",")
+    if o.strip()
+]
+REQUEST_TIMEOUT = float(os.environ.get("REQUEST_TIMEOUT", "600"))
+app = FastAPI(title="Quest-4B blog proxy")
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=ALLOWED_ORIGINS,
+    allow_methods=["GET", "POST", "OPTIONS"],
+    allow_headers=["Content-Type"],
+    allow_credentials=False,
+)
+STATIC_DIR = os.path.dirname(os.path.abspath(__file__))
+@app.get("/")
+def index() -> FileResponse:
+    return FileResponse(os.path.join(STATIC_DIR, "index.html"))
+@app.get("/health")
+def health() -> Dict[str, Any]:
+    return {
+        "ok": True,
+        "has_token": bool(HF_TOKEN),
+        "has_base_url": bool(QUEST_BASE_URL),
+        "model": QUEST_MODEL,
+    }
+def _validate_config() -> None:
+    if not HF_TOKEN:
+        raise HTTPException(500, "HF_TOKEN is not set on the server")
+    if not QUEST_BASE_URL:
+        raise HTTPException(500, "QUEST_BASE_URL is not set on the server")
+def _sanitise_messages(raw: Any) -> List[Dict[str, str]]:
+    if not isinstance(raw, list) or not raw:
+        raise HTTPException(400, "`messages` must be a non-empty array")
+    cleaned: List[Dict[str, str]] = []
+    for m in raw:
+        if not isinstance(m, dict):
+            raise HTTPException(400, "each message must be an object")
+        role = str(m.get("role", "")).strip()
+        content = m.get("content", "")
+        if role not in {"system", "user", "assistant"}:
+            raise HTTPException(400, f"invalid role: {role!r}")
+        if not isinstance(content, str):
+            raise HTTPException(400, "message.content must be a string")
+        cleaned.append({"role": role, "content": content[:8000]})
+    if len(cleaned) > 40:
+        cleaned = cleaned[-40:]
+    return cleaned
+@app.post("/api/chat")
+async def chat(request: Request) -> StreamingResponse:
+    _validate_config()
+    try:
+        body = await request.json()
+    except Exception as exc:
+        raise HTTPException(400, f"invalid json: {exc}") from exc
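+    # A JSON array or scalar has no .get(); reject it as a client error.
+    if not isinstance(body, dict):
+        raise HTTPException(400, "request body must be a JSON object")
+    messages = _sanitise_messages(body.get("messages"))
+    try:
+        temperature = float(body.get("temperature", 0.4))
+        max_tokens = int(body.get("max_tokens", 1024))
+    except (TypeError, ValueError) as exc:
+        raise HTTPException(400, "temperature and max_tokens must be numbers") from exc
+    max_tokens = max(32, min(max_tokens, 4096))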
+    payload = {
+        "model": QUEST_MODEL,
+        "messages": messages,
+        "temperature": max(0.0, min(temperature, 1.5)),
+        "max_tokens": max_tokens,
+        "stream": True,
+    }
+    upstream_url = f"{QUEST_BASE_URL}/chat/completions"
+    headers = {
+        "Authorization": f"Bearer {HF_TOKEN}",
+        "Content-Type": "application/json",
+        "Accept": "text/event-stream",
+    }
+    async def relay() -> Any:
+        timeout = httpx.Timeout(REQUEST_TIMEOUT, connect=15.0)
+        async with httpx.AsyncClient(timeout=timeout) as client:
+            try:
+                async with client.stream(
+                    "POST", upstream_url, json=payload, headers=headers
+                ) as upstream:
+                    if upstream.status_code >= 400:
+                        text = (await upstream.aread()).decode("utf-8", errors="replace")
+                        err = json.dumps(
+                            {"error": {"status": upstream.status_code, "body": text[:800]}}
+                        )
+                        yield f"data: {err}\n\n".encode()
+                        yield b"data: [DONE]\n\n"
+                        return
+                    async for chunk in upstream.aiter_raw():
+                        if chunk:
+                            yield chunk
+            except httpx.HTTPError as exc:
+                err = json.dumps({"error": {"status": 502, "body": str(exc)}})
+                yield f"data: {err}\n\n".encode()
+                yield b"data: [DONE]\n\n"
+    return StreamingResponse(relay(), media_type="text/event-stream")
+if __name__ == "__main__":
+    import uvicorn
+    uvicorn.run(
+        "server:app",
+        host=os.environ.get("HOST", "127.0.0.1"),
+        port=int(os.environ.get("PORT", "8000")),
+        reload=False,
+    )
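The README suggests tuning the input caps, but `_sanitise_messages` hard-codes them (8000 characters per message, 40 messages of history). A possible way to make them configurable is sketched below. The environment variable names `MAX_MESSAGE_CHARS` and `MAX_HISTORY_TURNS` and the helper `cap_messages` are assumptions for illustration, not part of this commit; the defaults mirror the values in `server.py`.

```python
import os
from typing import Dict, List

# Hypothetical env var names; defaults match the constants in server.py.
MAX_MESSAGE_CHARS = int(os.environ.get("MAX_MESSAGE_CHARS", "8000"))
MAX_HISTORY_TURNS = int(os.environ.get("MAX_HISTORY_TURNS", "40"))


def cap_messages(
    messages: List[Dict[str, str]],
    max_chars: int = MAX_MESSAGE_CHARS,
    max_turns: int = MAX_HISTORY_TURNS,
) -> List[Dict[str, str]]:
    """Truncate each message and keep only the most recent ones.

    Assumes the messages were already validated (role and string content).
    """
    if max_chars < 1 or max_turns < 1:
        raise ValueError("caps must be positive")
    trimmed = [
        {"role": m["role"], "content": m["content"][:max_chars]}
        for m in messages
    ]
    # Keep the tail of the conversation, as server.py does.
    return trimmed[-max_turns:]
```

As in the original code, keeping only the tail means a leading `system` message is dropped once the history exceeds the cap, which may or may not be what a given deployment wants.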