# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## What This Is
A Gradio-based chat interface for ServiceNow-AI's Apriel reasoning models, deployed as a HuggingFace Space. Users chat with vLLM-hosted models via an OpenAI-compatible API, with streaming responses and multimodal (text + image) support.
## Running Locally

```bash
# Install dependencies
pip install -r requirements.txt

# Run with hot reload (needs env vars; see below)
python gradio_runner.py app.py

# Or run directly
python app.py
```
The Makefile target `make runAppReloading` bundles env vars and launches with hot reload, but it contains hardcoded tokens; use it only as a reference for which env vars are needed.
## Required Environment Variables

- `AUTH_TOKEN`: vLLM API auth token
- `HF_TOKEN`: HuggingFace token (for the chat logging dataset)
- `VLLM_API_URL_APRIEL_1_6_15B`: single vLLM endpoint
- `VLLM_API_URL_LIST_APRIEL_1_6_15B`: comma-separated endpoints for load balancing
- `MODEL_NAME_APRIEL_1_6_15B`: model name on the vLLM server
- `DEBUG_MODE`: `"True"`/`"False"` for verbose logging
- `APRIEL_PROMPT_DATASET`: HF dataset repo for chat logging
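For local runs, the variables above can be exported in a shell snippet like this. All values here are placeholders, not real tokens, endpoints, or dataset names:

```shell
# Placeholder values; substitute your own tokens, endpoints, and repos.
export AUTH_TOKEN="example-vllm-token"
export HF_TOKEN="hf_example_token"
export VLLM_API_URL_APRIEL_1_6_15B="http://localhost:8000/v1"
export VLLM_API_URL_LIST_APRIEL_1_6_15B="http://host-a:8000/v1,http://host-b:8000/v1"
export MODEL_NAME_APRIEL_1_6_15B="example-model-name"
export DEBUG_MODE="False"
export APRIEL_PROMPT_DATASET="your-org/your-chat-log-dataset"
```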
## Architecture
`app.py`: Main Gradio app (UI layout, streaming inference, session state). `run_chat_inference()` is the core generator that streams chat completions, splits responses on the reasoning tag (`[BEGIN FINAL RESPONSE]`), and supports multimodal input (up to 5 images, converted to base64).
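The reasoning-tag split can be sketched as follows. This is a simplified illustration, not the actual `run_chat_inference()` implementation, which streams token deltas and updates the UI incrementally:

```python
# Sketch of splitting a completion into "thought" and visible response.
# The tag value comes from the description above; the helper name is ours.
FINAL_TAG = "[BEGIN FINAL RESPONSE]"

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (thought, final_response) for a model completion so far."""
    if FINAL_TAG in text:
        thought, final = text.split(FINAL_TAG, 1)
        return thought.strip(), final.strip()
    # Tag not seen yet: everything streamed so far is still "thought".
    return text.strip(), ""
```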
`utils.py`: Model configuration registry (the `models_config` dict) and logging helpers. Each model entry defines: HF URL, API name, vLLM endpoints, auth token, reasoning/multimodal flags, temperature, and output tags. Add new models here.
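An entry in `models_config` might look like the sketch below. The field names and values here are illustrative assumptions based on the description above; the exact keys in `utils.py` may differ:

```python
import os

# Illustrative models_config entry; key names are assumptions, and the
# temperature value is an example, not the repo's actual setting.
models_config = {
    "Apriel-1.6-15B": {
        "hf_url": "https://huggingface.co/ServiceNow-AI",
        "api_name": os.environ.get("MODEL_NAME_APRIEL_1_6_15B", ""),
        "endpoints": os.environ.get("VLLM_API_URL_LIST_APRIEL_1_6_15B", "").split(","),
        "auth_token": os.environ.get("AUTH_TOKEN", ""),
        "is_reasoning": True,
        "is_multimodal": True,
        "temperature": 0.6,
        "final_response_tag": "[BEGIN FINAL RESPONSE]",
    }
}
```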
`log_chat.py`: Async queue-based chat logger. Writes to a local train.csv and syncs it to a HuggingFace Hub dataset. Uses a daemon thread to avoid blocking the UI. Includes a `test_log_chat()` function for manual testing.
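The queue-plus-daemon-thread pattern can be sketched like this. It is a minimal sketch in the spirit of `log_chat.py`, not the module itself (the real logger also syncs train.csv to a HuggingFace Hub dataset):

```python
import csv
import queue
import threading

# Records are enqueued by the UI thread and written by a background worker,
# so the Gradio handlers never block on disk or network I/O.
log_queue: "queue.Queue[dict]" = queue.Queue()

def _worker() -> None:
    while True:
        record = log_queue.get()  # blocks until a record arrives
        with open("train.csv", "a", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=sorted(record))
            writer.writerow(record)
        log_queue.task_done()

# Daemon thread: exits automatically with the main process.
threading.Thread(target=_worker, daemon=True).start()

def log_chat(record: dict) -> None:
    """Enqueue a record; returns immediately so the UI never blocks."""
    log_queue.put(record)
```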
`theme.py`: Custom Gradio theme (Apriel) extending the Soft theme with custom colors and fonts.
`styles.css`: Responsive CSS with dark-mode support. Chat height uses CSS `calc()` with breakpoints at 1280px, 1024px, and 400px.
`timer.py`: Simple step-based timing utility for performance profiling.
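A step-based timer in this spirit can be sketched as below. The class and method names are hypothetical; the actual API in `timer.py` may differ:

```python
import time

# Hypothetical step-based timer: each .step() call records the time
# elapsed since the previous step (or since construction).
class StepTimer:
    def __init__(self) -> None:
        self._last = time.perf_counter()
        self.steps: list[tuple[str, float]] = []

    def step(self, label: str) -> float:
        now = time.perf_counter()
        elapsed = now - self._last
        self.steps.append((label, elapsed))
        self._last = now
        return elapsed
```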
## HuggingFace Space Deployment
The Space is configured via YAML frontmatter in README.md (`sdk`, `sdk_version`, `app_file`). The `sdk_version` must match the gradio version in requirements.txt; mismatches cause build failures.
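The frontmatter has roughly this shape. The title and version shown are placeholders, not the Space's actual values:

```yaml
---
title: Apriel Chat          # placeholder; the Space's real title may differ
sdk: gradio
sdk_version: 5.0.0          # placeholder; must equal the gradio pin in requirements.txt
app_file: app.py
---
```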
## Key Patterns
- **Endpoint rotation**: `setup_model()` round-robins across vLLM endpoints from the comma-separated env var list
- **Session state**: a global `session_state` dict tracks streaming status, stop flags, chat/session IDs, and opt-out preference
- **Reasoning models**: responses are split on the `[BEGIN FINAL RESPONSE]` tag; content before it is "thought", content after is the visible response
- **Concurrency**: Gradio queue with `default_concurrency_limit=4`