# PhysiX-Infer — dual-model OpenAI-compatible inference Space.
#
# Hosts BOTH:
# * Qwen/Qwen2.5-3B-Instruct (untrained baseline)
# * Pratyush-01/physix-3b-rl (GRPO-trained variant)
#
# Why this Space exists:
# The HF Inference Router does not serve Qwen/Qwen2.5-3B-Instruct (no
# provider has it loaded), and won't serve a private/fine-tune unless
# the owner pays for an Inference Endpoint. Both checkpoints we want
# to compare are 3B Qwen2 fp16 models, and on a single 24 GB L4 we can
# keep two vLLM processes resident at gpu_memory_utilization 0.40 each and never
# pay router/endpoint fees.
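#
# Rough memory budget (a back-of-envelope sketch, assuming ~2 bytes/param
# for fp16 weights): a 3B model is ~6 GB of weights, and 0.40 of a 24 GB L4
# gives each vLLM process a ~9.6 GB budget, so roughly 3 GB per process is
# left for KV-cache blocks after weights and profiling overhead, while the
# unallocated ~4.8 GB absorbs CUDA contexts and allocator fragmentation.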
#
# Architecture (one container, three processes):
# :8001 vllm serve Qwen/Qwen2.5-3B-Instruct --gpu-memory-utilization 0.40
# :8002 vllm serve Pratyush-01/physix-3b-rl --gpu-memory-utilization 0.40
# :7860 uvicorn proxy:app (routes by the JSON `model` field; example below)
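#
# The example below is illustrative; the exact routing logic lives in
# proxy.py, and the /v1/chat/completions path assumes the proxy mirrors
# vLLM's OpenAI-compatible routes. A request naming the trained model is
# forwarded to :8002:
#   curl -s http://localhost:7860/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d '{"model": "Pratyush-01/physix-3b-rl",
#          "messages": [{"role": "user", "content": "What is inertia?"}]}'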
#
# Boot order matters: the vLLMs come up SEQUENTIALLY, not in parallel. Both
# size their KV cache from free GPU memory at startup; if they race, the second
# crashes with "No available memory for the cache blocks." See
# entrypoint.sh for the full reasoning.
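#
# In sketch form (illustrative only; entrypoint.sh is the source of truth,
# and the log paths and 5 s poll interval are assumptions):
#   vllm serve Qwen/Qwen2.5-3B-Instruct --port 8001 \
#     --gpu-memory-utilization 0.40 > /tmp/logs/base.log 2>&1 &
#   until curl -fsS http://127.0.0.1:8001/health >/dev/null; do sleep 5; done
#   vllm serve Pratyush-01/physix-3b-rl --port 8002 \
#     --gpu-memory-utilization 0.40 > /tmp/logs/rl.log 2>&1 &
#   until curl -fsS http://127.0.0.1:8002/health >/dev/null; do sleep 5; done
#   exec uvicorn proxy:app --host 0.0.0.0 --port "$PORT"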
#
# Why the official vllm/vllm-openai image:
# vLLM ships pre-compiled CUDA kernels that target the CUDA toolkit
# and PyTorch versions it was built against. Building from a generic
# nvidia/cuda image means recompiling vLLM's C++ kernels (~20 min,
# often fragile across CUDA minor versions). Starting from
# vllm/vllm-openai:<tag> guarantees torch / CUDA / NCCL / vLLM are all
# ABI-compatible. We just layer fastapi/uvicorn/httpx for the proxy on top.
#
# Cold start on a fresh HF Spaces L4 (no persistent /data):
# * Image pull: ~30 s
# * vLLM startup: ~30 s after weights are local
# * Weight download: ~45 s for both models from Hub CDN
# ── total ~90-120 s before /health flips green ──
FROM vllm/vllm-openai:v0.7.3
# vllm/vllm-openai sets ENTRYPOINT to `python -m vllm.entrypoints.openai.api_server`.
# We need to override that to launch our own multi-process entrypoint, so reset.
ENTRYPOINT []
ENV PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1 \
PIP_DISABLE_PIP_VERSION_CHECK=1 \
HOME=/tmp/home \
HF_HOME=/tmp/hf_cache \
XDG_CACHE_HOME=/tmp/xdg-cache \
# vLLM's torch.compile cache must land somewhere writable. The image's
# default ($HOME/.cache/vllm) breaks on HF Spaces because the runtime
# user has no writable home.
VLLM_CACHE_ROOT=/tmp/vllm_cache \
TORCH_HOME=/tmp/torch_cache \
TRITON_CACHE_DIR=/tmp/triton_cache \
PORT=7860
# fastapi/uvicorn/httpx for the routing proxy. The image already has them
# transitively (vllm depends on fastapi), but pin minimums to be safe.
# `pip install --no-deps` would be tighter but trades safety for ~5 MB.
RUN pip install \
"fastapi>=0.110" \
"uvicorn[standard]>=0.29" \
"httpx>=0.27"
WORKDIR /app
COPY proxy.py entrypoint.sh ./
RUN chmod +x /app/entrypoint.sh
# HF Spaces runs containers as a non-root UID with no /etc/passwd entry,
# so any cache path under $HOME must exist and be world-writable BEFORE
# the runtime user shows up. Pre-creating /tmp subdirs (which Spaces
# always lets us write to) is the standard workaround.
RUN mkdir -p \
"$HOME" "$HF_HOME" "$XDG_CACHE_HOME" \
"$VLLM_CACHE_ROOT" "$TORCH_HOME" "$TRITON_CACHE_DIR" \
/tmp/logs \
&& chmod -R 0777 /tmp
EXPOSE 7860
# /health is served by proxy.py and returns 200 only when BOTH vLLMs are up.
# Generous start-period covers the ~120 s cold boot.
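# A manual spot-check from inside the container (the backend /health routes
# are vLLM's own; how proxy.py aggregates them is up to that script):
#   curl -fsS http://127.0.0.1:8001/health && \
#   curl -fsS http://127.0.0.1:8002/health && \
#   curl -fsS "http://127.0.0.1:${PORT}/health"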
HEALTHCHECK --interval=30s --timeout=10s --start-period=180s --retries=3 \
CMD curl -fsS "http://127.0.0.1:${PORT}/health" || exit 1
CMD ["/app/entrypoint.sh"]