Spaces: Ghostgim/ghostlm

Ghostgim committed 15 days ago

Commit 551cb99 (verified) · 1 parent: 088475f

feat(chat): stream tokens as they're generated

Convert chat_fn to a generator so Gradio's ChatInterface shows tokens
appearing incrementally rather than waiting 15-25 s for the full reply
to materialize. New helper generate_until_end_stream is the same loop as
generate_until_end but yields the growing token list after every sampled
token; chat_fn decodes and yields the running text snapshot per Gradio's
API contract.
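
The snapshot contract described above can be illustrated without the model. This is a minimal sketch, not the code from app.py: `VOCAB`, `decode`, `fake_token_stream`, and `chat_fn_sketch` (including its `scripted_ids` parameter) are hypothetical stand-ins for the tokenizer, generate_until_end_stream, and chat_fn. It shows each yield carrying the full text so far rather than a delta.

```python
from typing import Iterator, List, Sequence

# Hypothetical stand-in for the tokenizer: maps token ids to text pieces.
VOCAB = {0: "Hello", 1: ",", 2: " world", 3: "!"}


def decode(ids: Sequence[int]) -> str:
    """Join the text pieces for the given token ids."""
    return "".join(VOCAB[i] for i in ids)


def fake_token_stream(scripted_ids: Sequence[int]) -> Iterator[List[int]]:
    """Yield the growing id list after each 'sampled' token (no model involved)."""
    new_ids: List[int] = []
    for tok in scripted_ids:
        new_ids.append(tok)
        yield new_ids


def chat_fn_sketch(
    message: str, history: list, scripted_ids: Sequence[int] = (0, 1, 2, 3)
) -> Iterator[str]:
    """Yield the running text snapshot, skipping empty or unchanged snapshots."""
    last_text = ""
    for new_ids in fake_token_stream(scripted_ids):
        text = decode(new_ids).strip()
        if text and text != last_text:
            last_text = text
            yield text
    if not last_text:
        # Nothing was generated: fall back to a placeholder reply.
        yield "(no response)"


snapshots = list(chat_fn_sketch("hi", []))
empty = list(chat_fn_sketch("hi", [], scripted_ids=()))
print(snapshots)
print(empty)
```

Passing a generator function of this shape as `fn` to `gr.ChatInterface` is what makes Gradio replace the displayed reply with each yielded snapshot.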

No extra forward-pass cost; the only added work is re-decoding the
growing token list after each sampled token. The user sees motion within
the first ~1-2 s instead of staring at a static loading state for the
full duration. Peak memory should be essentially unchanged: each yield
is a full snapshot of the reply so far, and the generation loop is the
same as in generate_until_end. The cache-clearing block at the end of
chat_fn is kept and now runs after the final yield.
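
The "no extra forward-pass cost" point can be checked with a toy loop. This is a sketch under stated assumptions: `CountingModel`, `generate`, `generate_stream`, and the `END` sentinel are hypothetical and only mimic the shape of generate_until_end and generate_until_end_stream; no real model or sampling is involved.

```python
from typing import Iterator, List

END = -1  # hypothetical end-of-sequence id


class CountingModel:
    """Hypothetical model stand-in that replays a script and counts forward passes."""

    def __init__(self, script: List[int]) -> None:
        self.script = script
        self.calls = 0

    def next_token(self, ids: List[int]) -> int:
        tok = self.script[self.calls]
        self.calls += 1
        return tok


def generate(model: CountingModel, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Blocking variant: return all new ids at the end."""
    ids, new_ids = list(prompt), []
    for _ in range(max_new_tokens):
        tok = model.next_token(ids)
        if tok == END:
            break
        new_ids.append(tok)
        ids.append(tok)
    return new_ids


def generate_stream(
    model: CountingModel, prompt: List[int], max_new_tokens: int
) -> Iterator[List[int]]:
    """Streaming variant: same loop, but yield a snapshot after every token."""
    ids, new_ids = list(prompt), []
    for _ in range(max_new_tokens):
        tok = model.next_token(ids)
        if tok == END:
            break
        new_ids.append(tok)
        ids.append(tok)
        yield list(new_ids)  # copy, so earlier snapshots stay intact


script = [5, 6, 7, END]
blocking_model = CountingModel(script)
result = generate(blocking_model, [1, 2], max_new_tokens=10)

streaming_model = CountingModel(script)
stream_snapshots = list(generate_stream(streaming_model, [1, 2], max_new_tokens=10))

print(blocking_model.calls, streaming_model.calls)
```

Both variants call the model once per loop iteration, including the iteration that samples the end token. The sketch yields a copy of the list, whereas the helper in this commit yields the same list object each time; that is fine as long as the consumer decodes each snapshot immediately, as chat_fn does.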

Files changed (1)

app.py +54 -7

@@ -197,6 +197,44 @@ def generate_until_end(
     return new_ids
+def generate_until_end_stream(
+    model,
+    prompt_ids: List[int],
+    *,
+    end_id: int,
+    max_new_tokens: int,
+    temperature: float,
+    top_k: int,
+    top_p: float,
+    repetition_penalty: float,
+):
+    """Streaming variant: same as ``generate_until_end`` but yields the
+    growing list of new token ids after every sampled token.
+    Used by Gradio's chat interface so the user sees text appear
+    incrementally rather than waiting 15-25 s for the full response.
+    The yields happen with no extra forward-pass cost; the generator
+    just surfaces what each iteration of the loop produces."""
+    ids = torch.tensor(prompt_ids, dtype=torch.long).unsqueeze(0)
+    new_ids: List[int] = []
+    ctx = model.config.context_length
+    with torch.no_grad():
+        for _ in range(max_new_tokens):
+            cond = ids[:, -ctx:]
+            logits, _ = model(cond)
+            next_logits = logits[:, -1, :].squeeze(0).clone()
+            tok = sample_next(
+                next_logits,
+                temperature=temperature, top_k=top_k, top_p=top_p,
+                prev_ids=new_ids[-128:], repetition_penalty=repetition_penalty,
+            )
+            if tok == end_id:
+                break
+            new_ids.append(tok)
+            ids = torch.cat([ids, torch.tensor([[tok]])], dim=1)
+            yield new_ids
 # ---------------------------------------------------------------------------
 # Module-level state
 # ---------------------------------------------------------------------------
@@ -248,7 +286,13 @@ def chat_fn(message: str, history: list, temperature: float, top_k: int,
         else:
             break
-    new_ids = generate_until_end(
+    # Streaming: yield the growing decoded text after each sampled token
+    # so Gradio shows incremental output. Same total wall-clock as the
+    # non-streaming path, but the user sees motion immediately and the
+    # demo feels alive instead of frozen for 15-25 s. Each yield is a
+    # full snapshot of the response so far (Gradio's ChatInterface API).
+    last_text = ""
+    for new_ids in generate_until_end_stream(
         MODEL, prompt_ids,
         end_id=END_ID,
         max_new_tokens=int(max_tokens),
@@ -256,22 +300,25 @@ def chat_fn(message: str, history: list, temperature: float, top_k: int,
         top_k=int(top_k),
         top_p=float(top_p),
         repetition_penalty=float(repetition_penalty),
-    )
-    result = TOKENIZER.decode(new_ids).strip() or "(no response)"
+    ):
+        text = TOKENIZER.decode(new_ids).strip()
+        if text and text != last_text:
+            last_text = text
+            yield text
+    if not last_text:
+        yield "(no response)"
     # Free intermediate tensors before returning. Without this, on
     # HF Spaces (CPU runtime, ~16GB RAM) the activation memory from
     # consecutive generations accumulates and the worker errors out
-    # after 2-3 turns. The user-visible bug is "model errors after 2
-    # generations and needs page reload"; this block fixes it.
+    # after 2-3 turns.
     if torch.backends.mps.is_available():
         torch.mps.empty_cache()
     elif torch.cuda.is_available():
         torch.cuda.empty_cache()
     gc.collect()
-    return result
 # ---------------------------------------------------------------------------
 # UI
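
One consequence of turning chat_fn into a generator: the cache-clearing block now sits after the last yield, so it runs only if the consumer drives the generator to completion. The sketch below shows the difference a `try`/`finally` makes when a generator is closed early. It is an illustration, not part of this commit: `free_memory`, `stream_plain`, `stream_guarded`, and `cleanup_log` are hypothetical names, and whether Gradio closes the generator early on a cancelled or disconnected request is an assumption here.

```python
import gc
from typing import Iterator, List, Sequence

cleanup_log: List[str] = []


def free_memory() -> None:
    """Hypothetical stand-in for the empty_cache() / gc.collect() block."""
    gc.collect()
    cleanup_log.append("freed")


def stream_plain(pieces: Sequence[str]) -> Iterator[str]:
    """Cleanup after the loop: skipped if the generator is closed early."""
    text = ""
    for piece in pieces:
        text += piece
        yield text
    free_memory()


def stream_guarded(pieces: Sequence[str]) -> Iterator[str]:
    """Cleanup in finally: runs on normal completion and on early close."""
    text = ""
    try:
        for piece in pieces:
            text += piece
            yield text
    finally:
        free_memory()


# Consume one snapshot, then close the generator, as a consumer that stops early would.
plain = stream_plain(["a", "b", "c"])
next(plain)
plain.close()
plain_cleanups = len(cleanup_log)

guarded = stream_guarded(["a", "b", "c"])
next(guarded)
guarded.close()
guarded_cleanups = len(cleanup_log) - plain_cleanups

print(plain_cleanups, guarded_cleanups)
```

Calling `close()` raises `GeneratorExit` at the suspended `yield`, so code placed after the loop never executes, while a `finally` block does. If early termination can happen in the deployed Space, wrapping the streaming loop in chat_fn this way would keep the memory cleanup unconditional.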