ProcVLM-2B
ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage.
Homepage | arXiv | Model | Code
Model Details
- Model name: ce-amtic/ProcVLM-2B
- Model type: Vision-language model for robot progress reward inference
- Architecture: Qwen3-VL-style multimodal causal language model
- Input: One or more RGB images sampled from a robot trajectory, plus a natural-language task description
- Output: Textual reasoning and a completion estimate formatted as <progress>XX%</progress>
- Primary use case: Frame-wise progress reward prediction for robotic manipulation videos
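Because the completion estimate is always wrapped in <progress> tags, it can be recovered from raw model text with a small regular expression. The snippet below is a minimal sketch; the parse_progress helper is illustrative and not part of the ProcVLM package, whose own inference utilities already return a parsed progress field.
import re

def parse_progress(model_output: str) -> float | None:
    """Extract the completion percentage from a <progress>XX%</progress> tag."""
    match = re.search(r"<progress>\s*([0-9]*\.?[0-9]+)\s*%?\s*</progress>", model_output)
    return float(match.group(1)) if match else None

print(parse_progress("... the estimated progress percentage is <progress>84.13%</progress>."))  # 84.13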
Intended Use
ProcVLM-2B is designed for research on robot learning, progress reward modeling, embodied evaluation, and procedure-aware video understanding. Typical use cases include:
- estimating task completion progress from robot videos;
- producing dense progress rewards from sparse demonstrations;
- adapting progress prediction to a new environment with one-shot LoRA fine-tuning.
This model is not intended to be used as a safety-critical controller without downstream validation.
Quick Start
Clone the ProcVLM repository and install the environment:
git clone https://github.com/ProcVLM/ProcVLM.git
cd ProcVLM
uv sync --python 3.10
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation
Run progress reward inference on a video:
python evqa/inference.py \
--model_path ce-amtic/ProcVLM-2B \
--video_path path/to/your/video.mp4 \
--output_path path/to/progress_predictions.jsonl \
--task "fold the red T-shirt" \
--window_size 8
Each JSONL row contains a sampled frame_index and its corresponding progress prediction.
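The file can then be consumed with the standard json module; a minimal sketch that only assumes the frame_index and progress fields mentioned above:
import json

# Load the per-frame progress predictions written by evqa/inference.py.
with open("path/to/progress_predictions.jsonl") as f:
    records = [json.loads(line) for line in f]

for rec in records:
    print(rec["frame_index"], rec["progress"])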
You can also visualize predictions as a video:
python evqa/eval/visualize_progress_video.py \
--model_path ce-amtic/ProcVLM-2B \
--video_path path/to/your/video.mp4 \
--output_path path/to/progress_visualization.mp4 \
--task "fold the red T-shirt" \
--window_size 8
Python API
The same inference workflow is available through infer_progress_from_video():
from evqa.inference import infer_progress_from_video
records = infer_progress_from_video(
    model_path="ce-amtic/ProcVLM-2B",
    video_path="path/to/your/video.mp4",
    task="fold the red T-shirt",
    window_size=8,
)
for item in records:
    print(item["frame_index"], item["progress"])
The returned records include:
- frame_index: source video frame index;
- timestamp_sec: source video timestamp;
- window_frame_indices: frame indices used as the model input window;
- progress: parsed progress value in [0, 100];
- reasoning: model reasoning with the progress tag removed;
- model_output: raw model output.
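A common way to turn these records into a dense reward signal (not something the package does for you, but a direct consequence of the fields above) is to difference consecutive progress predictions; a minimal sketch, assuming records was obtained from infer_progress_from_video() as shown above:
# records: output of infer_progress_from_video(), one entry per sampled frame.
records = sorted(records, key=lambda r: r["frame_index"])
progress = [r["progress"] / 100.0 for r in records]  # normalize to [0, 1]

# Per-step reward = increase in predicted completion between consecutive sampled frames.
rewards = [later - earlier for earlier, later in zip(progress, progress[1:])]
for rec, reward in zip(records[1:], rewards):
    print(rec["frame_index"], round(reward, 4))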
Prompt Format
ProcVLM uses a procedural progress prompt. The default template is:
Given the recent observation and the task "{task}", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.
The model should answer with reasoning and a final progress tag, for example:
To complete the task: Tower the blocks, the following steps are required:
1. Grasp the green block.
2. Place the green block onto the red block.
Therefore, the estimated progress percentage is <progress>84.13%</progress>.
Or if the task is finished:
The task requires: Tower the blocks. Images show no block outside the tower, no further steps required.
Therefore, the estimated progress percentage is <progress>100.00%</progress>.
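When calling the model directly (for example through the vLLM helper in the next section), the default template above can be filled in with a plain format string; a minimal sketch in which PROGRESS_TEMPLATE is just a Python constant mirroring that template, not an object exported by the repository:
PROGRESS_TEMPLATE = (
    'Given the recent observation and the task "{task}", first infer the remaining '
    "atomic actions required to complete the task. Then estimate the current "
    "completion percentage and output it as a float wrapped by <progress> tags."
)

prompt = PROGRESS_TEMPLATE.format(task="fold the red T-shirt")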
vLLM Batch Inference
For high-throughput multi-image inference, the ProcVLM repository provides evqa.model.batch_chat_with_vllm():
from evqa.model import batch_chat_with_vllm
outputs = batch_chat_with_vllm(
    batch_items=[
        {
            "image": [
                "frames/frame_000000.jpg",
                "frames/frame_000010.jpg",
                "frames/frame_000020.jpg",
            ],
            "conversations": [
                {
                    "from": "human",
                    "value": 'Given the recent observation and the task "fold the red T-shirt", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.',
                }
            ],
        }
    ],
    model_path="ce-amtic/ProcVLM-2B",
    max_new_tokens=1024,
    temperature=0.0,
    tp=1,
)
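To score a whole trajectory in one call, the batch can be built by sliding a fixed-size window over pre-extracted frame files; a minimal sketch, assuming frames have already been dumped to a frames/ directory (this layout is an assumption, not part of the repository) and reusing PROGRESS_TEMPLATE from the Prompt Format sketch above:
from pathlib import Path

from evqa.model import batch_chat_with_vllm

task = "fold the red T-shirt"
prompt = PROGRESS_TEMPLATE.format(task=task)  # template string from the Prompt Format section

window_size = 8
frame_paths = sorted(str(p) for p in Path("frames").glob("frame_*.jpg"))

# One batch item per sliding window of window_size consecutive sampled frames.
batch_items = [
    {
        "image": frame_paths[end - window_size : end],
        "conversations": [{"from": "human", "value": prompt}],
    }
    for end in range(window_size, len(frame_paths) + 1)
]

outputs = batch_chat_with_vllm(
    batch_items=batch_items,
    model_path="ce-amtic/ProcVLM-2B",
    max_new_tokens=1024,
    temperature=0.0,
    tp=1,
)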
One-Shot LoRA Adaptation
ProcVLM can be adapted to a new environment from a single successful task demonstration, optionally supplemented with additional successful or unsuccessful demonstrations. See the one-shot adaptation guide for:
- annotating coarse sub-task stages with the visual UI;
- generating a LoRA fine-tuning dataset;
- running evqa/one-shot/lora_oneshot.sh;
- using the saved LoRA checkpoint with evqa/inference.py --use_lora.
Limitations
- The model estimates progress from visual observations and task text; it may be unreliable under strong domain shift, severe occlusion, unusual camera viewpoints, or ambiguous task descriptions.
- The progress output is a learned estimate, not a calibrated physical measurement.
- For long-horizon videos, inference quality depends on the sampled frame window and the task description.
- The model should be validated in the target robot environment before being used as a reward signal for training or deployment.
Citation
If you use ProcVLM, please cite the paper:
@misc{feng2026procvlmlearningproceduregroundedprogress,
title={ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation},
author={Youhe Feng and Hansen Shi and Haoyang Li and Xinlei Guo and Yang Wang and Chengyang Zhang and Jinkai Zhang and Xiaohan Zhang and Jie Tang and Jing Zhang},
year={2026},
eprint={2605.08774},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2605.08774},
}
License
Please refer to the license information on this model repository and the upstream base model license before using the weights.