Hi @lkeab , this PR adds ZeroGPU support to the Space.
You can try the updated UI here: https://huggingface.co/spaces/hysts-duplicates/Penguin-VL
For more details on ZeroGPU, see: https://huggingface.co/docs/hub/spaces-zerogpu
Summary
- Add ZeroGPU support: add `@spaces.GPU(duration=120)` to the inference function and `import spaces` before torch for monkey-patching
- Rewrite the Gradio UI from a manual Chatbot/Textbox/Button layout (~230 lines) to `gr.ChatInterface(multimodal=True)`, cutting the interface module roughly in half
- Replace the runtime `pip install flash-attn --no-build-isolation` hack with a pre-built wheel URL in `requirements.txt` (cp312, CUDA 12, torch 2.5); remove `ensure_flash_attn_installed()` and delete `pre-requirements.txt`
- Switch the default attention implementation from `sdpa` to `flash_attention_2`
- Bump Python 3.11 → 3.12 and Gradio SDK 6.5.1 → 6.11.0; relax the pins for numpy and opencv-python-headless
- Remove unused README frontmatter (`hf_oauth`, `preload_from_hub`, `startup_duration_timeout`)
Details
ZeroGPU
Re-enable ZeroGPU by adding the @spaces.GPU decorator and importing spaces before torch. The model is loaded into CPU memory at startup and migrated to GPU per-request.
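The pattern can be sketched as follows; the no-op stub for machines without the `spaces` package is an assumption for local testing, not part of this PR:

```python
# ZeroGPU pattern: `spaces` must be imported before torch so it can
# monkey-patch CUDA initialization. The stub below only exists so the
# sketch also runs locally without the `spaces` package (assumption).
try:
    import spaces
except ImportError:
    class _SpacesStub:
        @staticmethod
        def GPU(duration=60):
            def wrap(fn):
                return fn
            return wrap
    spaces = _SpacesStub()


@spaces.GPU(duration=120)  # a GPU is attached only for the duration of each call
def generate(prompt: str) -> str:
    # real code would move the model to GPU and run inference here;
    # an echo keeps the sketch self-contained
    return f"echo: {prompt}"
```

On ZeroGPU, everything outside the decorated function runs on CPU, so model loading at import time stays CPU-only and the migration happens per-request.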
ChatInterface migration
The hand-rolled UI assembled individual gr.Chatbot, gr.Image, gr.Video, gr.Textbox, and gr.Button components with custom upload/submit callbacks and content-normalization helpers. gr.ChatInterface(multimodal=True) handles file uploads, chat history, and streaming out of the box, so most of that code is no longer needed. Generation settings are passed as additional_inputs. Examples are now (prompt, files) pairs fed directly to ChatInterface.examples.
FlashAttention installation
Building flash-attn from source at runtime required pre-requirements.txt (ninja, packaging, etc.) and a subprocess pip install --no-build-isolation call that was fragile and slow. The pre-built wheel from the flash-attention GitHub releases page eliminates the build step entirely.
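In `requirements.txt` this looks roughly like the line below; the version and wheel filename are illustrative, and the actual asset should be picked from the flash-attention releases page to match cp312 / CUDA 12 / torch 2.5:

```
# illustrative pre-built wheel pin (verify the exact filename on the releases page)
flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.5cxx11abiFALSE-cp312-cp312-linux_x86_64.whl
```

Since pip just downloads and installs the wheel, ninja and packaging are no longer needed at build time, which is why `pre-requirements.txt` can be deleted.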
README cleanup
Removed hf_oauth / hf_oauth_scopes (not used by the app), preload_from_hub, and startup_duration_timeout. Updated SDK and Python version pins to match the actual runtime.
Note on preload_from_hub
`preload_from_hub` is not recommended even on dedicated hardware. It bakes model weights into the Docker image, which makes the image significantly larger and slows down every Space restart. Since Spaces use ephemeral storage, there is no persistent disk cache, but downloading at runtime is fast (especially for a model of this size), and in practice it is quicker than pulling an oversized Docker image on every restart.
If you want to avoid re-downloading on every restart, you can mount a storage bucket and set the `HF_HOME` environment variable to point to it. This way the download only happens once, without bloating the Docker image.
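For example, with persistent storage mounted (Spaces mount it at /data), a single variable in the Space settings is enough; the exact path is illustrative:

```
# Space settings → Variables: point the Hugging Face cache at the mounted storage
HF_HOME=/data/.huggingface
```

All `from_pretrained` downloads then land in the mounted cache and survive restarts.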
Thanks for the help!