Hi @lkeab , this PR adds ZeroGPU support to the Space.
You can try the updated UI here: https://huggingface.co/spaces/hysts-duplicates/Penguin-VL
For more details on ZeroGPU, see: https://huggingface.co/docs/hub/spaces-zerogpu
Summary
- Add ZeroGPU support: add `@spaces.GPU(duration=120)` to the inference function and `import spaces` before torch for monkey-patching
- Rewrite the Gradio UI from a manual Chatbot/Textbox/Button layout (~230 lines) to `gr.ChatInterface(multimodal=True)`, cutting the interface module roughly in half
- Replace the runtime `pip install flash-attn --no-build-isolation` hack with a pre-built wheel URL in `requirements.txt` (cp312, CUDA 12, torch 2.5); remove `ensure_flash_attn_installed()` and delete `pre-requirements.txt`
- Switch the default attention implementation from `sdpa` to `flash_attention_2`
- Bump Python 3.11 → 3.12 and Gradio SDK 6.5.1 → 6.11.0; relax the pins for numpy and opencv-python-headless
- Remove unused README frontmatter (`hf_oauth`, `preload_from_hub`, `startup_duration_timeout`)
Details
ZeroGPU
Re-enable ZeroGPU by adding the @spaces.GPU decorator and importing spaces before torch. The model is loaded into CPU memory at startup and migrated to GPU per-request.
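The pattern can be sketched as follows; the no-op stub for machines without the `spaces` package is an assumption for local testing, not part of this PR:

```python
# ZeroGPU pattern: `spaces` must be imported before torch so it can
# monkey-patch CUDA initialization. The stub below only exists so the
# sketch also runs locally without the `spaces` package (assumption).
try:
    import spaces
except ImportError:
    class _SpacesStub:
        @staticmethod
        def GPU(duration=60):
            def wrap(fn):
                return fn
            return wrap
    spaces = _SpacesStub()


@spaces.GPU(duration=120)  # a GPU is attached only for the duration of each call
def generate(prompt: str) -> str:
    # real code would move the model to GPU and run inference here;
    # an echo keeps the sketch self-contained
    return f"echo: {prompt}"
```

On ZeroGPU, everything outside the decorated function runs on CPU, so model loading at import time stays CPU-only and the migration happens per-request.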
ChatInterface migration
The hand-rolled UI assembled individual gr.Chatbot, gr.Image, gr.Video, gr.Textbox, and gr.Button components with custom upload/submit callbacks and content-normalization helpers. gr.ChatInterface(multimodal=True) handles file uploads, chat history, and streaming out of the box, so most of that code is no longer needed. Generation settings are passed as additional_inputs. Examples are now (prompt, files) pairs fed directly to ChatInterface.examples.
FlashAttention installation
Building flash-attn from source at runtime required pre-requirements.txt (ninja, packaging, etc.) and a subprocess pip install --no-build-isolation call that was fragile and slow. The pre-built wheel from the flash-attention GitHub releases page eliminates the build step entirely.
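In `requirements.txt` this looks roughly like the line below; the version and wheel filename are illustrative, and the actual asset should be picked from the flash-attention releases page to match cp312 / CUDA 12 / torch 2.5:

```
# illustrative pre-built wheel pin (verify the exact filename on the releases page)
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
```

Since pip just downloads and installs the wheel, ninja and packaging are no longer needed at build time, which is why `pre-requirements.txt` can be deleted.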
README cleanup
Removed hf_oauth / hf_oauth_scopes (not used by the app), preload_from_hub, and startup_duration_timeout. Updated SDK and Python version pins to match the actual runtime.
Note on preload_from_hub
`preload_from_hub` is not recommended even on dedicated hardware. It bakes model weights into the Docker image, which makes the image significantly larger and slows down every Space restart. Since Spaces use ephemeral storage, there is no persistent disk cache, but downloading at runtime is fast (especially for a model of this size), and in practice it is quicker than pulling an oversized Docker image on every restart.
If you want to avoid re-downloading on every restart, you can mount a storage bucket and set the `HF_HOME` environment variable to point to it. This way the download only happens once, without bloating the Docker image.
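For example, with persistent storage mounted (Spaces mount it at /data), a single variable in the Space settings is enough; the exact path is illustrative:

```
# Space settings → Variables: point the Hugging Face cache at the mounted storage
HF_HOME=/data/.huggingface
```

All `from_pretrained` downloads then land in the mounted cache and survive restarts.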
Thanks for the help!