Space: osunlp/QUEST
Lzy01241010 committed 6 days ago

Commit 48b03a2 (1 parent: 9d849dd)

perf: enable Gradio queue concurrency (fix "long queue" serialization)

The Space never called demo.queue(), so Gradio's default_concurrency_limit
of 1 serialised every run_ui research session: each additional concurrent
user queued behind a single worker, producing the "long queue of requests
pending" warning.

run_ui is I/O-bound (it waits on the OSC vLLM endpoint, Serper, Jina and
Azure over HTTP), so it parallelises well. Add demo.queue() with
default_concurrency_limit=12 (tunable via QUEST_CONCURRENCY) and max_size=80
(tunable via QUEST_QUEUE_MAX), and raise launch's max_threads to
max(40, 3 * QUEST_CONCURRENCY) to match. The real ceiling is the 8 OSC vLLM
instances behind the nginx load balancer.
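
To illustrate why an I/O-bound handler benefits from a higher concurrency limit, here is a minimal stdlib-only sketch. It does not use Gradio: asyncio.Semaphore stands in for the queue's concurrency limit and asyncio.sleep for the HTTP waits. The names fake_research_session and simulate are hypothetical and are not part of app.py.

```python
import asyncio
import time


async def fake_research_session(limiter: asyncio.Semaphore, wait_s: float) -> None:
    """Stand-in for one run_ui call: hold a worker slot while waiting on I/O."""
    async with limiter:
        # Simulated HTTP wait; the CPU is idle while this is pending.
        await asyncio.sleep(wait_s)


async def _run_all(sessions: int, concurrency_limit: int, wait_s: float) -> float:
    # The semaphore plays the role of the concurrency limit (assumption:
    # this is only an analogy, not how Gradio implements its queue).
    limiter = asyncio.Semaphore(concurrency_limit)
    start = time.perf_counter()
    await asyncio.gather(
        *(fake_research_session(limiter, wait_s) for _ in range(sessions))
    )
    return time.perf_counter() - start


def simulate(sessions: int, concurrency_limit: int, wait_s: float = 0.05) -> float:
    """Return wall-clock seconds needed to finish all simulated sessions."""
    return asyncio.run(_run_all(sessions, concurrency_limit, wait_s))


if __name__ == "__main__":
    print(f"limit=1:  {simulate(12, 1):.2f}s")
    print(f"limit=12: {simulate(12, 12):.2f}s")
```

With a limit of 1 the sessions run back to back, so total time grows with the number of users; with a limit equal to the number of sessions they overlap and total time stays close to a single wait.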

Files changed (1)

app.py +16 -1

@@ -2532,5 +2532,20 @@ with gr.Blocks(
     )
+# The research agent is almost entirely I/O-bound (it waits on the OSC vLLM
+# endpoint, Serper, Jina and Azure over HTTP), so many runs can proceed in
+# parallel on even a small CPU box. Gradio's default_concurrency_limit is 1,
+# which serialises every run_ui call and is what produces the "long queue of
+# requests pending" warning. Lift it; the real ceiling is the 8 OSC vLLM
+# instances behind the nginx load balancer. Tunable via the QUEST_CONCURRENCY
+# Space variable without a code change.
+QUEST_CONCURRENCY = int(os.getenv("QUEST_CONCURRENCY", "12"))
+QUEST_QUEUE_MAX = int(os.getenv("QUEST_QUEUE_MAX", "80"))
+demo.queue(
+    default_concurrency_limit=QUEST_CONCURRENCY,
+    max_size=QUEST_QUEUE_MAX,
+)
 if __name__ == "__main__":
-    demo.launch()
+    demo.launch(max_threads=max(40, QUEST_CONCURRENCY * 3))
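
The diff parses QUEST_CONCURRENCY and QUEST_QUEUE_MAX with a bare int(), which raises ValueError if a Space variable is set to something non-numeric. A possible hardening, sketched under the assumption that falling back to the default is acceptable, is shown below. The helpers read_positive_int and launch_threads are hypothetical and are not in the commit; launch_threads only restates the commit's max(40, QUEST_CONCURRENCY * 3) arithmetic.

```python
import os


def read_positive_int(name: str, default: int) -> int:
    """Read a positive integer from the environment, else return default."""
    raw = os.getenv(name, "").strip()
    try:
        value = int(raw)
    except ValueError:
        # Unset, empty, or non-numeric: keep the default rather than crash.
        return default
    return value if value > 0 else default


def launch_threads(concurrency: int, floor: int = 40, per_slot: int = 3) -> int:
    """Same arithmetic as the commit: max(40, QUEST_CONCURRENCY * 3)."""
    return max(floor, concurrency * per_slot)


if __name__ == "__main__":
    concurrency = read_positive_int("QUEST_CONCURRENCY", 12)
    queue_max = read_positive_int("QUEST_QUEUE_MAX", 80)
    print(concurrency, queue_max, launch_threads(concurrency))
```

At the default of 12 the floor of 40 threads applies; the multiplier only takes over once QUEST_CONCURRENCY is 14 or higher.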