# PhysiX-Infer — dual-model OpenAI-compatible inference Space.
#
# Hosts BOTH:
#   * Qwen/Qwen2.5-3B-Instruct  (untrained baseline)
#   * Pratyush-01/physix-3b-rl  (GRPO-trained variant)
#
# Why this Space exists:
#   The HF Inference Router does not serve Qwen/Qwen2.5-3B-Instruct (no
#   provider has it loaded), and won't serve a private fine-tune unless
#   the owner pays for an Inference Endpoint. Both checkpoints we want to
#   compare are 3B Qwen2 fp16 models, and on a single 24 GB L4 we can
#   keep two vLLM processes resident at 40% of GPU memory each and never
#   pay router/endpoint fees.
#
# Architecture (one container, three processes):
#   :8001  vllm serve Qwen/Qwen2.5-3B-Instruct  --gpu-memory-utilization 0.40
#   :8002  vllm serve Pratyush-01/physix-3b-rl  --gpu-memory-utilization 0.40
#   :7860  uvicorn proxy:app  (routes by the JSON `model` field)
#
# Boot order matters: the vLLMs come up SEQUENTIALLY, not in parallel.
# Each profiles free GPU memory at startup; if the two race, the second
# crashes with "No available memory for the cache blocks." See
# entrypoint.sh for the full reasoning; reference sketches of the boot
# sequence and the proxy routing sit below the CMD at the bottom of
# this file.
#
# Why the official vllm/vllm-openai image:
#   vLLM ships pre-compiled CUDA kernels that target the CUDA toolkit and
#   PyTorch versions it was built against. Building from a generic
#   nvidia/cuda image means recompiling vLLM's C++ kernels (~20 min, and
#   often fragile across CUDA minor versions). Starting from
#   vllm/vllm-openai guarantees torch / CUDA / NCCL / vLLM are all
#   ABI-compatible. We just layer fastapi + httpx for the proxy on top.
#
# Cold start on a fresh HF Spaces L4 (no persistent /data):
#   * Image pull:       ~30 s
#   * Weight download:  ~45 s for both models from the Hub CDN
#   * vLLM startup:     ~30 s once weights are local
#   ── total ~90-120 s before /health flips green ──

FROM vllm/vllm-openai:v0.7.3

# vllm/vllm-openai sets ENTRYPOINT to `python -m vllm.entrypoints.openai.api_server`.
# We need to launch our own multi-process entrypoint instead, so reset it.
ENTRYPOINT []

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1 \
    HOME=/tmp/home \
    HF_HOME=/tmp/hf_cache \
    XDG_CACHE_HOME=/tmp/xdg-cache \
    # vLLM's torch.compile cache must land somewhere writable. The image's
    # default ($HOME/.cache/vllm) breaks on HF Spaces because the runtime
    # user has no writable home.
    VLLM_CACHE_ROOT=/tmp/vllm_cache \
    TORCH_HOME=/tmp/torch_cache \
    TRITON_CACHE_DIR=/tmp/triton_cache \
    PORT=7860

# fastapi/uvicorn/httpx for the routing proxy. The image already carries
# them transitively (vllm depends on fastapi), but pin minimums to be safe.
# `pip install --no-deps` would be tighter but trades safety for ~5 MB.
RUN pip install \
        "fastapi>=0.110" \
        "uvicorn[standard]>=0.29" \
        "httpx>=0.27"

WORKDIR /app
COPY proxy.py entrypoint.sh ./
RUN chmod +x /app/entrypoint.sh

# HF Spaces runs containers as a non-root UID with no /etc/passwd entry,
# so every cache path under $HOME must exist and be world-writable BEFORE
# the runtime user shows up. Pre-creating /tmp subdirectories (which Spaces
# always lets us write to) is the standard workaround.
RUN mkdir -p \
        "$HOME" "$HF_HOME" "$XDG_CACHE_HOME" \
        "$VLLM_CACHE_ROOT" "$TORCH_HOME" "$TRITON_CACHE_DIR" \
        /tmp/logs \
    && chmod -R 0777 /tmp

EXPOSE 7860

# /health is served by proxy.py and returns 200 only when BOTH vLLMs are up.
# The generous start-period covers the ~120 s cold boot.
HEALTHCHECK --interval=30s --timeout=10s --start-period=180s --retries=3 \
    CMD curl -fsS "http://127.0.0.1:${PORT}/health" || exit 1

CMD ["/app/entrypoint.sh"]
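
# ── Reference: the sequential boot entrypoint.sh implements ─────────────
# A minimal sketch only, assuming vLLM's stock /health endpoint; the real
# entrypoint.sh carries the full reasoning. Launch one server, block until
# its memory profiling has finished, then launch the next, then the proxy:
#
#   vllm serve Qwen/Qwen2.5-3B-Instruct --port 8001 \
#       --gpu-memory-utilization 0.40 > /tmp/logs/base.log 2>&1 &
#   until curl -fsS http://127.0.0.1:8001/health > /dev/null; do sleep 2; done
#   vllm serve Pratyush-01/physix-3b-rl --port 8002 \
#       --gpu-memory-utilization 0.40 > /tmp/logs/rl.log 2>&1 &
#   until curl -fsS http://127.0.0.1:8002/health > /dev/null; do sleep 2; done
#   exec uvicorn proxy:app --host 0.0.0.0 --port "$PORT"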
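
# ── Reference: the routing rule proxy.py applies ────────────────────────
# A minimal sketch, not the real file (error handling and streaming are
# omitted): pick the upstream from the request's `model` field and forward
# the body unchanged.
#
#   import fastapi, httpx
#   app = fastapi.FastAPI()
#   UPSTREAMS = {
#       "Qwen/Qwen2.5-3B-Instruct": "http://127.0.0.1:8001",
#       "Pratyush-01/physix-3b-rl": "http://127.0.0.1:8002",
#   }
#   @app.post("/v1/chat/completions")
#   async def route(request: fastapi.Request):
#       body = await request.json()
#       base = UPSTREAMS[body["model"]]
#       async with httpx.AsyncClient(timeout=None) as client:
#           r = await client.post(f"{base}/v1/chat/completions", json=body)
#       return fastapi.Response(content=r.content, status_code=r.status_code,
#                               media_type="application/json")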
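
# ── Smoke test once /health flips green ─────────────────────────────────
# Routing is driven entirely by the `model` field; swap it to
# "Pratyush-01/physix-3b-rl" to hit the trained variant:
#
#   curl -s http://localhost:7860/v1/chat/completions \
#     -H 'Content-Type: application/json' \
#     -d '{"model": "Qwen/Qwen2.5-3B-Instruct",
#          "messages": [{"role": "user", "content": "What is inertia?"}]}'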