# VLAC Service Contract (v1)

This document specifies a _minimal_ HTTP + JSON API for exposing the Vision-Language-Action-Critic (VLAC) model to the **SimpleVLA-RL** training stack.

> Design mantra: **Keep it dead-simple**
>
> 1. No gRPC, no fancy serializers – plain JSON over HTTP.
> 2. Images are RGB JPEG/PNG bytes encoded as base-64 strings.
> 3. One process, one GPU; batching is handled transparently inside the service.

---

## 0 Fixed runtime assumptions

* **Checkpoint path** – VLAC weights live at `/home/zechen/SimpleVLA-RL/CKPT/VLAC`. The service starts with `--ckpt-path` defaulting to that location; override via CLI or the `VLAC_CKPT` env-var if needed.
* **GPU selection** – On startup the service scans a *pre-configured list* of GPU IDs (e.g. `0,1,2,3`) and picks the card reporting the lowest memory utilisation (`nvidia-smi --query-gpu=memory.used`). If *multiple* service instances run on the same node they will naturally spread across cards. **Within one process**, concurrent requests are simply batched on the single chosen GPU (see §4).
* **Dtype / quantisation** – We keep the model's own default (`bfloat16` on H100, `fp16` otherwise). No INT8 or QLoRA path for now.

---

## 1 Transport basics

| Item           | Value                                        |
|----------------|----------------------------------------------|
| Protocol       | HTTP/1.1, or HTTPS inside a VPC              |
| Host/Port      | configurable, default `0.0.0.0:8111`         |
| Content-Type   | `application/json; charset=utf-8`            |
| Authentication | none (add a Bearer-token header if needed)   |
| Timeout        | 60 s per request (server closes the connection after that) |

### Error shape

```json
{"code": 422, "message": "prev_frame is missing"}
```

Common codes: `400` (bad input), `422` (validation), `500/503` (model).

---

## 2 Endpoints

### 2.1 POST /healthcheck

Plain liveness probe. Response `200 OK` → `{"status":"ok","model_tag":"1.0.0"}`

### 2.2 POST /pairwise-critic

Compare two images and return a scalar critic score.
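Both image fields carry base-64 strings as described in §1. A minimal standard-library helper for producing them from files on disk might look like this (the function name is illustrative, not part of the contract):

```python
import base64

def encode_image(path: str) -> str:
    """Read raw JPEG/PNG bytes from disk and return the base-64
    string expected in the `image_a` / `image_b` fields."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")
```

Note that the raw file bytes are encoded directly; no decoding or resizing is needed on the client, since the service auto-resizes inputs (see §5).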
Request body

```jsonc
{
  "task": "Pick up the bowl and put it in the box.",
  "image_a": "",
  "image_b": "",
  "rich": false            // optional, default false
}
```

Response body

```jsonc
{"critic": 0.27, "raw": "0.27"}
```

### 2.3 POST /done

Trajectory termination test.

Request body

```jsonc
{
  "task": "Pick up …",
  "first_frame": "",
  "prev_frame": "",
  "curr_frame": "",
  "reference": ["", ""]    // 2–11 items, optional, min 2 if provided
}
```

Response

```jsonc
{"done": true, "prob": 0.94}
```

> Implementation notes
> • **CRITICAL**: If `reference` images are provided, at least 2 are required for in-context done detection. Use `reference: null` for simple done detection on the current frame only.

### 2.4 POST /trajectory-critic

Compute full critic + value curves at episode end.

Request body (most fields mirror `GAC_model`)

```jsonc
{
  "task": "Pick up …",
  "frames": ["", "…"],
  "reference": [],
  "skip": 5,
  "ref_num": 6,
  "batch_size": 10,
  "think": false,
  "return_video": false
}
```

Response

```jsonc
{
  "value_list": [0.0, 4.5, 9.8, …],
  "critic_list": [0.2, -0.3, …],
  "done_list": [0, 0, 0, 1],
  "video": null            // base64 MP4 if requested
}
```

> Implementation notes
> • The service **internally chunks** requests into micro-batches of ≤ 8 frames, so larger `batch_size` values are accepted and processed automatically; the caller always receives a single consolidated response.
> • **CRITICAL**: If `reference` images are provided, `ref_num` must be ≥ 2. Use `ref_num: 0` and `reference: null` when no reference images are needed.
> • Reference images help the model adapt to new tasks/environments but are optional.

---

## 3 Schemas (Pydantic-style)

```python
class ImageB64(str):
    """Base-64 encoded RGB image ≤ 448×448."""

class PairwiseCriticRequest(BaseModel):
    task: str
    image_a: ImageB64
    image_b: ImageB64
    rich: bool | None = False

class PairwiseCriticResponse(BaseModel):
    critic: float
    raw: str

# … (DoneRequest, TrajectoryCriticRequest, etc.)
```

---

## 4 Typical call-flow inside training

1.
   Actor pushes env frames to a list.
2. After every step, call `/done`. If `done == true` → reward 1.0 & terminate.
3. If the step limit is reached, call `/trajectory-critic`; use `value_list[-1]` as the terminal reward.
4. Evaluation keeps the simulator `done` flag.

---

## 5 Image & video handling

* **Auto-resize** – Incoming images of any resolution are resized to the VLAC target (currently 448 × 448) using Lanczos. Larger images are *not* rejected.
* **Debug save** – Set the env-var `VLAC_SAVE_INPUTS=1` to dump decoded PNGs to `/tmp/vlac_debug/` for repro; default is off.
* **Video codec** – When `/trajectory-critic` is called with `return_video: true`, the helper writes an H.264 MP4 via `ffmpeg` defaults. Most workflows prefer the raw lists returned in JSON.

---

## 6 Minimal reference implementation (FastAPI, single-file)

```python
import base64
import os
from io import BytesIO

from fastapi import FastAPI
from PIL import Image
from pydantic import BaseModel

from evo_vlac import GAC_model

# Checkpoint path: default from §0, overridable via the VLAC_CKPT env-var.
CKPT = os.environ.get("VLAC_CKPT", "/home/zechen/SimpleVLA-RL/CKPT/VLAC")

app = FastAPI()
model = GAC_model(tag="critic")
model.init_model(model_path=CKPT, model_type="internvl2", device_map="cuda:0")
model.set_system_prompt()
model.set_config()


def _b64_to_pil(data: str) -> Image.Image:
    return Image.open(BytesIO(base64.b64decode(data))).convert("RGB")


class PairwiseCriticReq(BaseModel):
    task: str
    image_a: str
    image_b: str
    rich: bool | None = False


class PairwiseCriticResp(BaseModel):
    critic: float
    raw: str


@app.post("/pairwise-critic", response_model=PairwiseCriticResp)
def pairwise(req: PairwiseCriticReq):
    critic_list, _ = model.get_trajectory_critic(
        task=req.task,
        image_list=[_b64_to_pil(req.image_a), _b64_to_pil(req.image_b)],
        ref_image_list=None,
        batch_num=1,
        ref_num=0,
        rich=req.rich,
    )
    val = float(critic_list[-1])
    return PairwiseCriticResp(critic=val, raw=str(val))
```

> The **service binary** is thus just `python vlac_service.py` — no Docker, no Ray Serve. Clients can `requests.post()`.
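A matching client needs nothing beyond the standard library. The sketch below builds and posts one `/pairwise-critic` request against the default host/port from §1; the `SERVICE_URL` constant and function names are illustrative, not part of the contract:

```python
import json
import urllib.request

SERVICE_URL = "http://localhost:8111"  # default host/port from §1


def build_pairwise_payload(task: str, image_a: str, image_b: str,
                           rich: bool = False) -> bytes:
    """Serialise a /pairwise-critic request body (images already base-64)."""
    return json.dumps({
        "task": task,
        "image_a": image_a,
        "image_b": image_b,
        "rich": rich,
    }).encode("utf-8")


def pairwise_critic(task: str, image_a: str, image_b: str) -> float:
    """POST to /pairwise-critic and return the scalar critic score."""
    req = urllib.request.Request(
        f"{SERVICE_URL}/pairwise-critic",
        data=build_pairwise_payload(task, image_a, image_b),
        headers={"Content-Type": "application/json; charset=utf-8"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["critic"]
```

Production clients may want retries around the `500/503` model errors from §1; this sketch keeps to the single-shot happy path.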
---

## 7 Versioning headers

Every non-health response SHOULD include:

```
X-VLAC-Model-Tag: 1.0.0
X-VLAC-Checkpoint-SHA:
```

so runs are fully reproducible.

---

_End of contract_