CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

What This Is

A Gradio-based chat interface for ServiceNow-AI's Apriel reasoning models, deployed as a HuggingFace Space. Users chat with vLLM-hosted models via an OpenAI-compatible API, with streaming responses and multimodal (text + image) support.

Running Locally

# Install dependencies
pip install -r requirements.txt

# Run with hot reload (needs env vars; see below)
python gradio_runner.py app.py

# Or run directly
python app.py

The Makefile target make runAppReloading bundles the env vars and launches with hot reload, but it contains hardcoded tokens; use it only as a reference for which env vars are needed.

Required Environment Variables

  • AUTH_TOKEN: vLLM API auth token
  • HF_TOKEN: HuggingFace token (used for the chat-logging dataset)
  • VLLM_API_URL_APRIEL_1_6_15B: single vLLM endpoint
  • VLLM_API_URL_LIST_APRIEL_1_6_15B: comma-separated list of endpoints for load balancing
  • MODEL_NAME_APRIEL_1_6_15B: model name as registered on the vLLM server
  • DEBUG_MODE: "True"/"False" to toggle verbose logging
  • APRIEL_PROMPT_DATASET: HuggingFace dataset repo for chat logging
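
A minimal local setup might look like the following; all values are placeholders, and the endpoint URLs and model name are illustrative, not the production values:

```shell
# Example env setup (placeholder values); run before `python app.py`.
export AUTH_TOKEN="<your-vllm-token>"
export HF_TOKEN="<your-hf-token>"
export VLLM_API_URL_APRIEL_1_6_15B="http://localhost:8000/v1"
export VLLM_API_URL_LIST_APRIEL_1_6_15B="http://host-a:8000/v1,http://host-b:8000/v1"
export MODEL_NAME_APRIEL_1_6_15B="<model-name-on-vllm-server>"
export DEBUG_MODE="False"
export APRIEL_PROMPT_DATASET="<org>/<dataset-repo>"
```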

Architecture

app.py: Main Gradio app (UI layout, streaming inference, session state). run_chat_inference() is the core generator: it streams chat completions, splits responses on the reasoning tag ([BEGIN FINAL RESPONSE]), and supports multimodal input (up to 5 images, converted to base64).
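
The tag-splitting step can be sketched as below; this is an illustrative helper, not the actual code in run_chat_inference():

```python
# Sketch of splitting accumulated streamed text on the reasoning tag.
FINAL_TAG = "[BEGIN FINAL RESPONSE]"

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (thought, final) for the text streamed so far."""
    if FINAL_TAG in text:
        thought, _, final = text.partition(FINAL_TAG)
        return thought.strip(), final.strip()
    # Tag not seen yet: everything streamed so far is still "thought".
    return text.strip(), ""
```

Because the function is pure, it can be re-run on every streamed chunk with the accumulated text.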

utils.py: Model configuration registry (the models_config dict) and logging helpers. Each model entry defines the HF URL, API name, vLLM endpoints, auth token, reasoning/multimodal flags, temperature, and output tags. Add new models here.
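
An entry might be shaped roughly like this; the key names, model name, and default values here are illustrative guesses, not the exact schema in utils.py:

```python
import os

# Hypothetical shape of one models_config entry, wired to the env vars
# documented above. Key names and values are illustrative only.
models_config = {
    "Apriel-1.6-15B-Thinker": {                      # display name (hypothetical)
        "hf_url": "https://huggingface.co/ServiceNow-AI",  # placeholder URL
        "api_model_name": os.environ.get("MODEL_NAME_APRIEL_1_6_15B", ""),
        "vllm_urls": os.environ.get("VLLM_API_URL_LIST_APRIEL_1_6_15B", "").split(","),
        "auth_token": os.environ.get("AUTH_TOKEN", ""),
        "is_reasoning": True,
        "is_multimodal": True,
        "temperature": 0.6,                          # illustrative default
        "final_response_tag": "[BEGIN FINAL RESPONSE]",
    }
}
```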

log_chat.py: Async queue-based chat logger. Writes to a local train.csv and syncs it to a HuggingFace Hub dataset. Uses a daemon thread to avoid blocking the UI. Includes a test_log_chat() function for manual testing.
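
The daemon-thread queue pattern looks roughly like this; function and field names are illustrative, not the actual log_chat.py API, and the Hub sync step is omitted:

```python
import csv
import queue
import threading

# Minimal sketch of the non-blocking logging pattern (names are illustrative).
log_queue: "queue.Queue[dict]" = queue.Queue()

def _writer_loop(path: str = "train.csv") -> None:
    while True:
        record = log_queue.get()  # blocks until a chat record arrives
        with open(path, "a", newline="") as f:
            csv.DictWriter(f, fieldnames=record.keys()).writerow(record)
        log_queue.task_done()
        # A real implementation would also sync the file to the HF Hub dataset.

threading.Thread(target=_writer_loop, daemon=True).start()

def log_chat(prompt: str, response: str) -> None:
    """Enqueue a record without blocking the Gradio UI thread."""
    log_queue.put({"prompt": prompt, "response": response})
```

Because the writer thread is a daemon, it never keeps the process alive after the UI exits.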

theme.py: Custom Gradio theme (Apriel) extending the Soft theme with custom colors and fonts.

styles.css: Responsive CSS with dark mode support. Chat height uses CSS calc() with breakpoints at 1280px, 1024px, and 400px.

timer.py: Simple step-based timing utility for performance profiling.
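
In spirit, step-based timing accumulates labeled durations between checkpoints; this sketch is hypothetical and the class/method names do not come from timer.py:

```python
import time

# Illustrative step timer: each step() records the time since the last one.
class StepTimer:
    def __init__(self) -> None:
        self.start = time.perf_counter()
        self.steps: list[tuple[str, float]] = []

    def step(self, label: str) -> None:
        now = time.perf_counter()
        self.steps.append((label, now - self.start))
        self.start = now
```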

HuggingFace Space Deployment

The Space is configured via YAML frontmatter in README.md (sdk, sdk_version, app_file). The sdk_version must match the gradio version pinned in requirements.txt; a mismatch causes build failures.
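
A small pre-deploy check for that invariant could look like this; the helper name is hypothetical and not part of the repo, and it assumes the requirements file pins gradio with `==`:

```python
import re

# Hypothetical helper: confirm README.md frontmatter's sdk_version matches
# the gradio pin in requirements.txt before pushing to the Space.
def sdk_versions_match(readme_text: str, requirements_text: str) -> bool:
    fm = re.search(r"^sdk_version:\s*([\d.]+)", readme_text, re.MULTILINE)
    req = re.search(r"^gradio==([\d.]+)", requirements_text, re.MULTILINE)
    return bool(fm and req and fm.group(1) == req.group(1))
```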

Key Patterns

  • Endpoint rotation: setup_model() round-robins across the vLLM endpoints in the comma-separated env var list
  • Session state: a global session_state dict tracks streaming status, stop flags, chat/session IDs, and the opt-out preference
  • Reasoning models: responses are split on the [BEGIN FINAL RESPONSE] tag; content before it is the "thought", content after is the visible response
  • Concurrency: Gradio queue with default_concurrency_limit=4
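
The endpoint-rotation pattern can be sketched as below; the function name and URLs are illustrative, and the real logic lives in setup_model():

```python
import itertools

# Illustrative round-robin over the comma-separated endpoint list
# (e.g. the value of VLLM_API_URL_LIST_APRIEL_1_6_15B).
def make_endpoint_picker(url_list: str):
    cycle = itertools.cycle(u.strip() for u in url_list.split(","))
    return lambda: next(cycle)

pick = make_endpoint_picker("http://a:8000/v1,http://b:8000/v1")
```

Each call to pick() returns the next endpoint, wrapping around at the end of the list.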