TomLii Claude Sonnet 4.6 commited on
Commit
98cabfd
·
1 Parent(s): f6d633d

Add blog_demo/: proxy + single-page chat UI for embedding Quest-4B on a blog

Browse files

- server.py: FastAPI proxy that holds HF_TOKEN server-side and streams
/chat/completions from QUEST_BASE_URL back to the browser as SSE. Also
serves the static index.html and a /health probe.
- index.html: vanilla-JS chat widget that POSTs /api/chat and renders
the streaming delta.
- README.md: why a proxy is required (token safety + HF endpoints not
sending CORS), local run instructions, and a checklist for hardening
before exposing on a public blog (origin allowlist, rate limit, input
caps, fine-grained token, logging).

Smoke-tested locally: /health reports config status, /api/chat returns
a clean 500 when HF_TOKEN/QUEST_BASE_URL are missing.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

blog_demo/README.md ADDED
@@ -0,0 +1,80 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Blog chat widget — Quest-4B
2
+
3
+ Tiny proxy + one-page chat UI you can lift into a blog to let readers
4
+ talk to the Quest-4B HF Inference Endpoint.
5
+
6
+ ## Why a proxy
7
+
8
+ - **Token safety.** `HF_TOKEN` can never ship in client-side JS. Anyone
9
+ loading the page would grab it and run jobs on your HF account.
10
+ - **CORS.** HF Inference Endpoints don't emit permissive CORS headers, so
11
+ a browser `fetch` straight to the endpoint is blocked even if the
12
+ token problem were solved.
13
+
14
+ The proxy (server.py) holds the token server-side, validates incoming
15
+ requests, and streams the model's reply back.
16
+
17
+ ## Run locally
18
+
19
+ ```bash
20
+ cd blog_demo
21
+ python3 -m venv .venv
22
+ source .venv/bin/activate
23
+ pip install -r requirements.txt
24
+
25
+ export HF_TOKEN=hf_xxx
26
+ export QUEST_BASE_URL=https://<your-endpoint>.endpoints.huggingface.cloud/v1/
27
+ # optional:
28
+ export QUEST_ENDPOINT_MODEL=tgi # "osunlp/Quest-4B" if the container is vLLM
29
+ export ALLOWED_ORIGINS=http://127.0.0.1:8000
30
+
31
+ python server.py
32
+ ```
33
+
34
+ Open <http://127.0.0.1:8000/>, ask a question, watch the reply stream in.
35
+
36
+ Health check: <http://127.0.0.1:8000/health> returns whether the token
37
+ and base URL are wired up without leaking them.
38
+
39
+ ## What gets sent upstream
40
+
41
+ ```
42
+ POST {QUEST_BASE_URL}/chat/completions
43
+ Headers: Authorization: Bearer $HF_TOKEN
44
+ Body: {
45
+ "model": "tgi",
46
+ "messages": [...],
47
+ "temperature": 0.4,
48
+ "max_tokens": 1024,
49
+ "stream": true
50
+ }
51
+ ```
52
+
53
+ Any OpenAI-compatible endpoint (vLLM, TGI, SGLang, …) responds to this
54
+ shape. The proxy pipes the upstream SSE frames straight to the browser;
55
+ the page parses `choices[].delta.content` to render the streaming answer.
56
+
57
+ ## Deploying on the blog
58
+
59
+ Pick whichever backend is closest to where the blog is hosted:
60
+
61
+ | Host | How |
62
+ |---|---|
63
+ | Next.js / Vercel | paste the `POST /api/chat` handler logic into `app/api/chat/route.ts` (use Node's `fetch` + `ReadableStream`), set `HF_TOKEN` and `QUEST_BASE_URL` in Vercel env vars |
64
+ | Cloudflare Workers | port the proxy to a Worker, put `HF_API_TOKEN` in Worker Secrets, bind your blog domain as `ALLOWED_ORIGINS` |
65
+ | FastAPI behind nginx | run `server.py` under `systemd` or `supervisor`, proxy `/api/chat` from the blog hostname |
66
+ | Hugging Face Space (Docker) | drop the whole folder in a Docker Space, set `HF_TOKEN` and `QUEST_BASE_URL` as Space Secrets |
67
+
68
+ ### Lock it down before going public
69
+
70
+ 1. **Origin allowlist** — set `ALLOWED_ORIGINS=https://your-blog.com`
71
+ so other sites can't call your proxy from a browser.
72
+ 2. **Rate limit** — add an IP-based limit (e.g. `slowapi` for FastAPI,
73
+ Cloudflare Rate Limiting for Workers). A single abusive visitor can
74
+ drain your endpoint budget fast.
75
+ 3. **Input caps** — the proxy already trims each message to 8000 chars
76
+ and caps history at 40 turns; tune these for your use case.
77
+ 4. **Fine-grained token** — create a new HF token with access only to
78
+ the Quest endpoint so a leak can't touch anything else.
79
+ 5. **Observability** — log request counts, latency, and 4xx/5xx rates
80
+ so you notice abuse early.
blog_demo/index.html ADDED
@@ -0,0 +1,238 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!doctype html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="utf-8" />
5
+ <meta name="viewport" content="width=device-width,initial-scale=1" />
6
+ <title>Quest-4B chat demo</title>
7
+ <style>
8
+ :root {
9
+ --bg: #f2f4f8;
10
+ --paper: #ffffff;
11
+ --text: #0d1117;
12
+ --muted: #64748b;
13
+ --accent: #be5b2b;
14
+ --line: rgba(10, 15, 40, 0.1);
15
+ }
16
+ * { box-sizing: border-box; }
17
+ body {
18
+ margin: 0;
19
+ background: var(--bg);
20
+ color: var(--text);
21
+ font: 15px/1.55 -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto,
22
+ sans-serif;
23
+ }
24
+ .shell {
25
+ max-width: 760px;
26
+ margin: 32px auto;
27
+ padding: 0 20px;
28
+ }
29
+ h1 {
30
+ font-size: 1.6rem;
31
+ margin: 0 0 4px;
32
+ }
33
+ .sub {
34
+ color: var(--muted);
35
+ margin: 0 0 20px;
36
+ font-size: 0.92rem;
37
+ }
38
+ .sub code { background: #fff; padding: 1px 6px; border-radius: 6px; border: 1px solid var(--line); }
39
+ .card {
40
+ background: var(--paper);
41
+ border: 1px solid var(--line);
42
+ border-radius: 14px;
43
+ box-shadow: 0 1px 2px rgba(10, 15, 40, 0.05),
44
+ 0 2px 10px rgba(10, 15, 40, 0.06);
45
+ padding: 20px;
46
+ }
47
+ #log {
48
+ min-height: 220px;
49
+ max-height: 60vh;
50
+ overflow-y: auto;
51
+ padding: 4px 6px;
52
+ margin-bottom: 14px;
53
+ }
54
+ .msg { margin: 0 0 14px; }
55
+ .msg .who {
56
+ font-size: 0.75rem;
57
+ font-weight: 700;
58
+ letter-spacing: 0.06em;
59
+ text-transform: uppercase;
60
+ color: var(--muted);
61
+ margin-bottom: 4px;
62
+ }
63
+ .msg.assistant .who { color: var(--accent); }
64
+ .msg .body { white-space: pre-wrap; word-wrap: break-word; }
65
+ form { display: flex; gap: 10px; align-items: stretch; }
66
+ textarea {
67
+ flex: 1;
68
+ min-height: 46px;
69
+ max-height: 160px;
70
+ resize: vertical;
71
+ padding: 12px 14px;
72
+ border: 1px solid var(--line);
73
+ border-radius: 12px;
74
+ font: inherit;
75
+ outline: none;
76
+ }
77
+ textarea:focus { border-color: var(--accent); box-shadow: 0 0 0 3px rgba(190,91,43,0.15); }
78
+ button {
79
+ background: var(--text);
80
+ color: #fff;
81
+ border: 0;
82
+ border-radius: 999px;
83
+ padding: 0 22px;
84
+ font-weight: 600;
85
+ cursor: pointer;
86
+ }
87
+ button:disabled { opacity: 0.55; cursor: default; }
88
+ .status { color: var(--muted); font-size: 0.85rem; margin-top: 10px; }
89
+ .status.err { color: #dc2626; }
90
+ </style>
91
+ </head>
92
+ <body>
93
+ <div class="shell">
94
+ <h1>Quest-4B chat demo</h1>
95
+ <p class="sub">
96
+ Front-end calls <code>/api/chat</code> on this host; the Python proxy
97
+ adds <code>Authorization: Bearer $HF_TOKEN</code> and forwards the
98
+ request to <code>$QUEST_BASE_URL</code>.
99
+ </p>
100
+ <div class="card">
101
+ <div id="log"></div>
102
+ <form id="f">
103
+ <textarea
104
+ id="q"
105
+ placeholder="Ask Quest-4B something... (Enter to send, Shift+Enter for newline)"
106
+ autofocus
107
+ ></textarea>
108
+ <button id="send" type="submit">Send</button>
109
+ </form>
110
+ <div class="status" id="status"></div>
111
+ </div>
112
+ </div>
113
+ <script>
114
+ const log = document.getElementById("log");
115
+ const form = document.getElementById("f");
116
+ const input = document.getElementById("q");
117
+ const send = document.getElementById("send");
118
+ const status = document.getElementById("status");
119
+ const history = [];
120
+
121
+ function addMessage(role, text) {
122
+ const el = document.createElement("div");
123
+ el.className = "msg " + role;
124
+ const who = document.createElement("div");
125
+ who.className = "who";
126
+ who.textContent = role === "user" ? "You" : "Quest-4B";
127
+ const body = document.createElement("div");
128
+ body.className = "body";
129
+ body.textContent = text;
130
+ el.appendChild(who);
131
+ el.appendChild(body);
132
+ log.appendChild(el);
133
+ log.scrollTop = log.scrollHeight;
134
+ return body;
135
+ }
136
+
137
+ function setStatus(text, isError = false) {
138
+ status.textContent = text || "";
139
+ status.classList.toggle("err", Boolean(isError));
140
+ }
141
+
142
+ async function send_message(content) {
143
+ history.push({ role: "user", content });
144
+ addMessage("user", content);
145
+ const assistantBody = addMessage("assistant", "…");
146
+ setStatus("Waiting for the endpoint…");
147
+ send.disabled = true;
148
+
149
+ try {
150
+ const res = await fetch("/api/chat", {
151
+ method: "POST",
152
+ headers: { "Content-Type": "application/json" },
153
+ body: JSON.stringify({ messages: history, temperature: 0.4 }),
154
+ });
155
+ if (!res.ok || !res.body) {
156
+ const text = await res.text();
157
+ assistantBody.textContent = "";
158
+ throw new Error(text || res.statusText);
159
+ }
160
+
161
+ const reader = res.body.getReader();
162
+ const decoder = new TextDecoder();
163
+ let buffer = "";
164
+ let acc = "";
165
+ assistantBody.textContent = "";
166
+
167
+ while (true) {
168
+ const { value, done } = await reader.read();
169
+ if (done) break;
170
+ buffer += decoder.decode(value, { stream: true });
171
+ const lines = buffer.split("\n");
172
+ buffer = lines.pop() || "";
173
+ for (const raw of lines) {
174
+ const line = raw.trim();
175
+ if (!line.startsWith("data:")) continue;
176
+ const payload = line.slice(5).trim();
177
+ if (!payload || payload === "[DONE]") continue;
178
+ try {
179
+ const obj = JSON.parse(payload);
180
+ if (obj.error) {
181
+ throw new Error(
182
+ "endpoint " +
183
+ (obj.error.status || "?") +
184
+ ": " +
185
+ (obj.error.body || "unknown")
186
+ );
187
+ }
188
+ const delta = obj.choices?.[0]?.delta?.content || "";
189
+ if (delta) {
190
+ acc += delta;
191
+ assistantBody.textContent = acc;
192
+ log.scrollTop = log.scrollHeight;
193
+ }
194
+ } catch (parseErr) {
195
+ if (parseErr.message?.startsWith("endpoint ")) throw parseErr;
196
+ }
197
+ }
198
+ }
199
+ history.push({ role: "assistant", content: acc });
200
+ setStatus("");
201
+ } catch (err) {
202
+ assistantBody.textContent =
203
+ assistantBody.textContent || "[no response]";
204
+ setStatus(String(err.message || err), true);
205
+ } finally {
206
+ send.disabled = false;
207
+ input.focus();
208
+ }
209
+ }
210
+
211
+ form.addEventListener("submit", (e) => {
212
+ e.preventDefault();
213
+ const text = input.value.trim();
214
+ if (!text) return;
215
+ input.value = "";
216
+ send_message(text);
217
+ });
218
+ input.addEventListener("keydown", (e) => {
219
+ if (e.key === "Enter" && !e.shiftKey) {
220
+ e.preventDefault();
221
+ form.requestSubmit();
222
+ }
223
+ });
224
+
225
+ fetch("/health")
226
+ .then((r) => r.json())
227
+ .then((j) => {
228
+ if (!j.has_token || !j.has_base_url) {
229
+ setStatus(
230
+ "Server is running but HF_TOKEN / QUEST_BASE_URL are not set — chat will 500 until you export them.",
231
+ true
232
+ );
233
+ }
234
+ })
235
+ .catch(() => setStatus("Cannot reach /health", true));
236
+ </script>
237
+ </body>
238
+ </html>
blog_demo/requirements.txt ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ fastapi>=0.110
2
+ uvicorn[standard]>=0.27
3
+ httpx>=0.27
blog_demo/server.py ADDED
@@ -0,0 +1,169 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Minimal proxy server so a static blog page can safely chat with the
3
+ Quest-4B HF Inference Endpoint.
4
+
5
+ Why a proxy at all?
6
+
7
+ 1. The browser cannot put HF_TOKEN into client-side JS -- the moment you
8
+ ship it to visitors, the token is stolen and anyone can rack up a bill
9
+ on your HF account.
10
+ 2. HF Inference Endpoints do not emit permissive CORS headers, so even
11
+ without the token concern, a browser `fetch` straight to the endpoint
12
+ would be blocked.
13
+
14
+ This tiny FastAPI app holds the token server-side, forwards chat turns
15
+ to QUEST_BASE_URL, and streams the response back to the browser.
16
+
17
+ Run locally:
18
+
19
+ cd blog_demo
20
+ pip install fastapi uvicorn httpx
21
+ export HF_TOKEN=hf_xxx
22
+ export QUEST_BASE_URL=https://<your-endpoint>.endpoints.huggingface.cloud/v1/
23
+ python server.py
24
+
25
+ Then open http://127.0.0.1:8000/ in a browser.
26
+ """
27
+ from __future__ import annotations
28
+
29
+ import json
30
+ import os
31
+ from typing import Any, Dict, List
32
+
33
+ import httpx
34
+ from fastapi import FastAPI, HTTPException, Request
35
+ from fastapi.middleware.cors import CORSMiddleware
36
+ from fastapi.responses import FileResponse, StreamingResponse
37
+
38
+ HF_TOKEN = os.environ.get("HF_TOKEN", "").strip()
39
+ QUEST_BASE_URL = os.environ.get("QUEST_BASE_URL", "").strip().rstrip("/")
40
+ QUEST_MODEL = os.environ.get("QUEST_ENDPOINT_MODEL", "tgi").strip() or "tgi"
41
+
42
+ ALLOWED_ORIGINS = [
43
+ o.strip()
44
+ for o in os.environ.get(
45
+ "ALLOWED_ORIGINS",
46
+ "http://127.0.0.1:8000,http://localhost:8000",
47
+ ).split(",")
48
+ if o.strip()
49
+ ]
50
+
51
+ REQUEST_TIMEOUT = float(os.environ.get("REQUEST_TIMEOUT", "600"))
52
+
53
+ app = FastAPI(title="Quest-4B blog proxy")
54
+
55
+ app.add_middleware(
56
+ CORSMiddleware,
57
+ allow_origins=ALLOWED_ORIGINS,
58
+ allow_methods=["GET", "POST", "OPTIONS"],
59
+ allow_headers=["Content-Type"],
60
+ allow_credentials=False,
61
+ )
62
+
63
+ STATIC_DIR = os.path.dirname(os.path.abspath(__file__))
64
+
65
+
66
+ @app.get("/")
67
+ def index() -> FileResponse:
68
+ return FileResponse(os.path.join(STATIC_DIR, "index.html"))
69
+
70
+
71
+ @app.get("/health")
72
+ def health() -> Dict[str, Any]:
73
+ return {
74
+ "ok": True,
75
+ "has_token": bool(HF_TOKEN),
76
+ "has_base_url": bool(QUEST_BASE_URL),
77
+ "model": QUEST_MODEL,
78
+ }
79
+
80
+
81
+ def _validate_config() -> None:
82
+ if not HF_TOKEN:
83
+ raise HTTPException(500, "HF_TOKEN is not set on the server")
84
+ if not QUEST_BASE_URL:
85
+ raise HTTPException(500, "QUEST_BASE_URL is not set on the server")
86
+
87
+
88
+ def _sanitise_messages(raw: Any) -> List[Dict[str, str]]:
89
+ if not isinstance(raw, list) or not raw:
90
+ raise HTTPException(400, "`messages` must be a non-empty array")
91
+ cleaned: List[Dict[str, str]] = []
92
+ for m in raw:
93
+ if not isinstance(m, dict):
94
+ raise HTTPException(400, "each message must be an object")
95
+ role = str(m.get("role", "")).strip()
96
+ content = m.get("content", "")
97
+ if role not in {"system", "user", "assistant"}:
98
+ raise HTTPException(400, f"invalid role: {role!r}")
99
+ if not isinstance(content, str):
100
+ raise HTTPException(400, "message.content must be a string")
101
+ cleaned.append({"role": role, "content": content[:8000]})
102
+ if len(cleaned) > 40:
103
+ cleaned = cleaned[-40:]
104
+ return cleaned
105
+
106
+
107
+ @app.post("/api/chat")
108
+ async def chat(request: Request) -> StreamingResponse:
109
+ _validate_config()
110
+ try:
111
+ body = await request.json()
112
+ except Exception as exc:
113
+ raise HTTPException(400, f"invalid json: {exc}") from exc
114
+
115
+ messages = _sanitise_messages(body.get("messages"))
116
+ temperature = float(body.get("temperature", 0.4))
117
+ max_tokens = int(body.get("max_tokens", 1024))
118
+ max_tokens = max(32, min(max_tokens, 4096))
119
+
120
+ payload = {
121
+ "model": QUEST_MODEL,
122
+ "messages": messages,
123
+ "temperature": max(0.0, min(temperature, 1.5)),
124
+ "max_tokens": max_tokens,
125
+ "stream": True,
126
+ }
127
+
128
+ upstream_url = f"{QUEST_BASE_URL}/chat/completions"
129
+ headers = {
130
+ "Authorization": f"Bearer {HF_TOKEN}",
131
+ "Content-Type": "application/json",
132
+ "Accept": "text/event-stream",
133
+ }
134
+
135
+ async def relay() -> Any:
136
+ timeout = httpx.Timeout(REQUEST_TIMEOUT, connect=15.0)
137
+ async with httpx.AsyncClient(timeout=timeout) as client:
138
+ try:
139
+ async with client.stream(
140
+ "POST", upstream_url, json=payload, headers=headers
141
+ ) as upstream:
142
+ if upstream.status_code >= 400:
143
+ text = (await upstream.aread()).decode("utf-8", errors="replace")
144
+ err = json.dumps(
145
+ {"error": {"status": upstream.status_code, "body": text[:800]}}
146
+ )
147
+ yield f"data: {err}\n\n".encode()
148
+ yield b"data: [DONE]\n\n"
149
+ return
150
+ async for chunk in upstream.aiter_raw():
151
+ if chunk:
152
+ yield chunk
153
+ except httpx.HTTPError as exc:
154
+ err = json.dumps({"error": {"status": 502, "body": str(exc)}})
155
+ yield f"data: {err}\n\n".encode()
156
+ yield b"data: [DONE]\n\n"
157
+
158
+ return StreamingResponse(relay(), media_type="text/event-stream")
159
+
160
+
161
+ if __name__ == "__main__":
162
+ import uvicorn
163
+
164
+ uvicorn.run(
165
+ "server:app",
166
+ host=os.environ.get("HOST", "127.0.0.1"),
167
+ port=int(os.environ.get("PORT", "8000")),
168
+ reload=False,
169
+ )