# VLAC Service Contract (v1)

This document specifies a _minimal_ HTTP + JSON API for exposing the Vision-Language-Action-Critic (VLAC) model to the **SimpleVLA-RL** training stack.

> Design mantra: **Keep it dead-simple**
>
> 1. No gRPC, no fancy serializers – plain JSON over HTTP.
> 2. Images are RGB JPEG/PNG bytes encoded as base-64 strings.
> 3. One process, one GPU; batching is handled transparently inside the service.

---

## 0 Fixed runtime assumptions

* **Checkpoint path** – VLAC weights live at `/home/zechen/SimpleVLA-RL/CKPT/VLAC`. The service starts with `--ckpt-path` defaulting to that location; override via CLI or the `VLAC_CKPT` env-var if needed.
* **GPU selection** – On startup the service scans a *pre-configured list* of GPU IDs (e.g. `0,1,2,3`) and picks the card reporting the lowest memory utilisation (`nvidia-smi --query-gpu=memory.used`). If *multiple* service instances run on the same node they will naturally spread across cards. **Within one process**, concurrent requests are simply batched on the single chosen GPU (see §4).
* **Dtype / quantisation** – We keep the model's own default (`bfloat16` on H100, `fp16` otherwise). No INT8 or QLoRA path for now.

---

## 1 Transport basics

| Item           | Value                                        |
|----------------|----------------------------------------------|
| Protocol       | HTTP/1.1, or HTTPS inside a VPC              |
| Host/Port      | configurable, default `0.0.0.0:8111`         |
| Content-Type   | `application/json; charset=utf-8`            |
| Authentication | none (add a Bearer-token header if needed)   |
| Timeout        | 60 s per request (server closes the connection after that) |

### Error shape

```json
{"code": 422, "message": "prev_frame is missing"}
```

Common codes: `400` (bad input), `422` (validation), `500/503` (model).

---

## 2 Endpoints

### 2.1 POST /healthcheck

Plain liveness probe. Response `200 OK` → `{"status":"ok","model_tag":"1.0.0"}`

### 2.2 POST /pairwise-critic

Compare two images and return a scalar critic score.
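Both image fields carry base-64 strings as described in §1. A minimal standard-library helper for producing them from files on disk might look like this (the function name is illustrative, not part of the contract):

```python
import base64

def encode_image(path: str) -> str:
    """Read raw JPEG/PNG bytes from disk and return the base-64
    string expected in the `image_a` / `image_b` fields."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```

Note that the raw file bytes are encoded directly; no decoding or resizing is needed on the client, since the service auto-resizes inputs (see §5).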
Request body

```jsonc
{
  "task": "Pick up the bowl and put it in the box.",
  "image_a": "",
  "image_b": "",
  "rich": false            // optional, default false
}
```

Response body

```jsonc
{"critic": 0.27, "raw": "0.27"}
```

### 2.3 POST /done

Trajectory termination test.

Request body

```jsonc
{
  "task": "Pick up …",
  "first_frame": "",
  "prev_frame": "",
  "curr_frame": "",
  "reference": ["", ""]    // 2–11 items, optional, min 2 if provided
}
```

Response

```jsonc
{"done": true, "prob": 0.94}
```

> Implementation notes
> • **CRITICAL**: If `reference` images are provided, at least 2 are required for in-context done detection. Use `reference: null` for simple done detection on the current frame only.

### 2.4 POST /trajectory-critic

Compute full critic + value curves at episode end.

Request body (most fields mirror `GAC_model`)

```jsonc
{
  "task": "Pick up …",
  "frames": ["", "…"],
  "reference": [],
  "skip": 5,
  "ref_num": 6,
  "batch_size": 10,
  "think": false,
  "return_video": false
}
```

Response

```jsonc
{
  "value_list": [0.0, 4.5, 9.8, …],
  "critic_list": [0.2, -0.3, …],
  "done_list": [0, 0, 0, 1],
  "video": null            // base64 MP4 if requested
}
```

> Implementation notes
> • The service **internally chunks** requests into micro-batches of ≤ 8 frames, so larger `batch_size` values are accepted and processed automatically; the caller always receives a single consolidated response.
> • **CRITICAL**: If `reference` images are provided, `ref_num` must be ≥ 2. Use `ref_num: 0` and `reference: null` when no reference images are needed.
> • Reference images help the model adapt to new tasks/environments but are optional.

---

## 3 Schemas (Pydantic-style)

```python
class ImageB64(str):
    """Base-64 encoded RGB image ≤ 448×448."""

class PairwiseCriticRequest(BaseModel):
    task: str
    image_a: ImageB64
    image_b: ImageB64
    rich: bool | None = False

class PairwiseCriticResponse(BaseModel):
    critic: float
    raw: str

# … (DoneRequest, TrajectoryCriticRequest, etc.)
```

---

## 4 Typical call-flow inside training

1.
   Actor pushes env frames to a list.
2. After every step, call `/done`. If `done == true` → reward 1.0 & terminate.
3. If the step limit is reached, call `/trajectory-critic`; use `value_list[-1]` as the terminal reward.
4. Evaluation keeps the simulator `done` flag.

---

## 5 Image & video handling

* **Auto-resize** – Incoming images of any resolution are resized to the VLAC target (currently 448 × 448) using Lanczos. Larger images are *not* rejected.
* **Debug save** – Set the env-var `VLAC_SAVE_INPUTS=1` to dump decoded PNGs to `/tmp/vlac_debug/` for repro; default is off.
* **Video codec** – When `/trajectory-critic` is called with `return_video: true`, the helper writes an H.264 MP4 via `ffmpeg` defaults. Most workflows prefer the raw lists returned in JSON.

---

## 6 Minimal reference implementation (FastAPI, single-file)

```python
import base64
import os
from io import BytesIO

from fastapi import FastAPI
from PIL import Image
from pydantic import BaseModel

from evo_vlac import GAC_model

# Checkpoint path: default from §0, overridable via the VLAC_CKPT env-var.
CKPT = os.environ.get("VLAC_CKPT", "/home/zechen/SimpleVLA-RL/CKPT/VLAC")

app = FastAPI()
model = GAC_model(tag="critic")
model.init_model(model_path=CKPT, model_type="internvl2", device_map="cuda:0")
model.set_system_prompt()
model.set_config()


def _b64_to_pil(data: str) -> Image.Image:
    return Image.open(BytesIO(base64.b64decode(data))).convert("RGB")


class PairwiseCriticReq(BaseModel):
    task: str
    image_a: str
    image_b: str
    rich: bool | None = False


class PairwiseCriticResp(BaseModel):
    critic: float
    raw: str


@app.post("/pairwise-critic", response_model=PairwiseCriticResp)
def pairwise(req: PairwiseCriticReq):
    critic_list, _ = model.get_trajectory_critic(
        task=req.task,
        image_list=[_b64_to_pil(req.image_a), _b64_to_pil(req.image_b)],
        ref_image_list=None,
        batch_num=1,
        ref_num=0,
        rich=req.rich,
    )
    val = float(critic_list[-1])
    return PairwiseCriticResp(critic=val, raw=str(val))
```

> The **service binary** is thus just `python vlac_service.py` — no Docker, no Ray Serve. Clients can `requests.post()`.
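A matching client needs nothing beyond the standard library. The sketch below builds and posts one `/pairwise-critic` request against the default host/port from §1; the `SERVICE_URL` constant and function names are illustrative, not part of the contract:

```python
import json
import urllib.request

SERVICE_URL = "http://localhost:8111"  # default host/port from §1


def build_pairwise_payload(task: str, image_a: str, image_b: str,
                           rich: bool = False) -> bytes:
    """Serialise a /pairwise-critic request body (images already base-64)."""
    return json.dumps({
        "task": task,
        "image_a": image_a,
        "image_b": image_b,
        "rich": rich,
    }).encode("utf-8")


def pairwise_critic(task: str, image_a: str, image_b: str) -> float:
    """POST to /pairwise-critic and return the scalar critic score."""
    req = urllib.request.Request(
        f"{SERVICE_URL}/pairwise-critic",
        data=build_pairwise_payload(task, image_a, image_b),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["critic"]
```

Production clients may want retries around the `500/503` model errors from §1; this sketch keeps to the single-shot happy path.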
---

## 7 Versioning headers

Every non-health response SHOULD include:

```
X-VLAC-Model-Tag: 1.0.0
X-VLAC-Checkpoint-SHA:
```

so runs are fully reproducible.

---

_End of contract_