ZeroGPU

#2
by hysts HF Staff - opened

Hi @lkeab , this PR adds ZeroGPU support to the Space.
You can try the updated UI here: https://huggingface.co/spaces/hysts-duplicates/Penguin-VL

For more details on ZeroGPU, see: https://huggingface.co/docs/hub/spaces-zerogpu

Summary

  • Add ZeroGPU support: add @spaces.GPU(duration=120) to the inference function and import spaces before torch for monkey-patching
  • Rewrite the Gradio UI from a manual Chatbot/Textbox/Button layout (~230 lines) to gr.ChatInterface(multimodal=True), cutting the interface module roughly in half
  • Replace the runtime pip install flash-attn --no-build-isolation hack with a pre-built wheel URL in requirements.txt (cp312, CUDA 12, torch 2.5); remove ensure_flash_attn_installed() and delete pre-requirements.txt
  • Switch default attention implementation from sdpa to flash_attention_2
  • Bump Python 3.11 → 3.12, Gradio SDK 6.5.1 → 6.11.0; relax pins for numpy and opencv-python-headless
  • Remove unused README frontmatter (hf_oauth, preload_from_hub, startup_duration_timeout)

Details

ZeroGPU

Re-enable ZeroGPU by adding the @spaces.GPU decorator and importing spaces before torch. The model is loaded into CPU memory at startup and migrated to the GPU per request.

ChatInterface migration

The hand-rolled UI assembled individual gr.Chatbot, gr.Image, gr.Video, gr.Textbox, and gr.Button components with custom upload/submit callbacks and content-normalization helpers. gr.ChatInterface(multimodal=True) handles file uploads, chat history, and streaming out of the box, so most of that code is no longer needed. Generation settings are passed as additional_inputs. Examples are now (prompt, files) pairs fed directly to ChatInterface.examples.

FlashAttention installation

Building flash-attn from source at runtime required pre-requirements.txt (ninja, packaging, etc.) and a subprocess pip install --no-build-isolation call that was fragile and slow. The pre-built wheel from the flash-attention GitHub releases page eliminates the build step entirely.
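The requirements.txt entry looks roughly like this (illustrative, not the exact URL used in this PR; the wheel filename must match the Space's Python, CUDA, and torch versions, e.g. cp312 / cu12 / torch 2.5):

```
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
```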

README cleanup

Removed hf_oauth / hf_oauth_scopes (not used by the app), preload_from_hub, and startup_duration_timeout. Updated SDK and Python version pins to match the actual runtime.

Note on preload_from_hub

preload_from_hub is not recommended even on dedicated hardware. It bakes model weights into the Docker image, which makes the image significantly larger and slows down every Space restart. Since Spaces use ephemeral storage, there is no persistent disk cache; however, downloading at runtime is fast (especially for a model of this size), and in practice it is quicker than pulling an oversized Docker image on every restart.

If you want to avoid re-downloading on every restart, you can mount a storage bucket and set the HF_HOME environment variable to point to it. This way the download only happens once, without bloating the Docker image.
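For example, assuming storage is mounted at /data (the default mount point for Spaces persistent storage; adjust to the actual mount path), the variable can be set in the Space settings:

```
HF_HOME=/data/.huggingface
```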

hysts changed pull request status to open
Tencent org

Thanks for the help!

lkeab changed pull request status to merged
