---
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- robotics
- vision-language-model
- progress-reward
- robot-manipulation
- qwen3-vl
- procvlm
license: apache-2.0
datasets:
- ce-amtic/ProcVQA-20M-annotations
base_model:
- Qwen/Qwen3-VL-2B-Instruct
---

# ProcVLM-2B

ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage.

Homepage | arXiv | Model | Code

## Model Details

- **Model name:** [`ce-amtic/ProcVLM-2B`](https://huggingface.co/ce-amtic/ProcVLM-2B)
- **Model type:** Vision-language model for robot progress reward inference
- **Architecture:** Qwen3-VL-style multimodal causal language model
- **Input:** One or more RGB images sampled from a robot trajectory, plus a natural-language task description
- **Output:** Textual reasoning and a completion estimate formatted as `XX%`
- **Primary use case:** Frame-wise progress reward prediction for robotic manipulation videos

## Intended Use

ProcVLM-2B is designed for research on robot learning, progress reward modeling, embodied evaluation, and procedure-aware video understanding. Typical use cases include:

- estimating task completion progress from robot videos;
- producing dense progress rewards from sparse demonstrations;
- adapting progress prediction to a new environment with one-shot LoRA fine-tuning.

This model is not intended to be used as a safety-critical controller without downstream validation.

## Quick Start

Clone the ProcVLM repository and install the environment:

```bash
git clone https://github.com/ProcVLM/ProcVLM.git
cd ProcVLM
uv sync --python 3.10
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation
```

Run progress reward inference on a video:

```bash
python evqa/inference.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_predictions.jsonl \
    --task "fold the red T-shirt" \
    --window_size 8
```

Each JSONL row contains a sampled `frame_index` and its corresponding `progress` prediction.

You can also visualize predictions as a video:

```bash
python evqa/eval/visualize_progress_video.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_visualization.mp4 \
    --task "fold the red T-shirt" \
    --window_size 8
```

## Python API

The same inference workflow is available through `infer_progress_from_video()`:

```python
from evqa.inference import infer_progress_from_video

records = infer_progress_from_video(
    model_path="ce-amtic/ProcVLM-2B",
    video_path="path/to/your/video.mp4",
    task="fold the red T-shirt",
    window_size=8,
)

for item in records:
    print(item["frame_index"], item["progress"])
```

The returned records include:

- `frame_index`: source video frame index;
- `timestamp_sec`: source video timestamp;
- `window_frame_indices`: frame indices used as the model input window;
- `progress`: parsed progress value in `[0, 100]`;
- `reasoning`: model reasoning with the progress tag removed;
- `model_output`: raw model output.

## Prompt Format

ProcVLM uses a procedural progress prompt. The default template is:

```text
Given the recent observation and the task "{task}", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by tags.
```

The model should answer with reasoning and a final progress tag, for example:

```text
To complete the task: Tower the blocks, the following steps are required:
1. Grasp the green block.
2. Place the green block onto the red block.
Therefore, the estimated progress percentage is 84.13%.
```

Or, if the task is finished:

```text
The task requires: Tower the blocks. Images show no block outside the tower, no further steps required.
Therefore, the estimated progress percentage is 100.00%.
```
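If you want to drive the model directly with `transformers` rather than the repository's `evqa` tooling, the prompt template above can be paired with a small parser that extracts the final percentage from the generated text. The sketch below is illustrative only: the `predict_progress` helper is hypothetical, it assumes the Qwen3-VL-style processor's `apply_chat_template` accepts interleaved image entries (as in recent `transformers` releases), and the regular-expression fallback is an assumption about how a progress value could be recovered, not the repository's actual tag parser.

```python
import re

from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "ce-amtic/ProcVLM-2B"

# Default procedural progress prompt from this card.
PROMPT = (
    'Given the recent observation and the task "{task}", first infer the remaining '
    "atomic actions required to complete the task. Then estimate the current "
    "completion percentage and output it as a float wrapped by tags."
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)


def predict_progress(frame_paths: list[str], task: str) -> float | None:
    """Run one frame window through the model and parse the final percentage.

    Hypothetical helper for illustration; the supported inference path is
    evqa/inference.py from the ProcVLM repository.
    """
    # Build a single user turn: the frame window followed by the task prompt.
    content = [{"type": "image", "image": Image.open(p)} for p in frame_paths]
    content.append({"type": "text", "text": PROMPT.format(task=task)})
    messages = [{"role": "user", "content": content}]

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Drop the prompt tokens and decode only the newly generated answer.
    answer_ids = generated[:, inputs["input_ids"].shape[1]:]
    text = processor.batch_decode(answer_ids, skip_special_tokens=True)[0]

    # Assumed fallback parser: take the last "XX.XX%"-style number in the answer.
    matches = re.findall(r"(\d+(?:\.\d+)?)\s*%", text)
    return float(matches[-1]) if matches else None


progress = predict_progress(
    ["frames/frame_000000.jpg", "frames/frame_000010.jpg", "frames/frame_000020.jpg"],
    task="tower the blocks",
)
print(progress)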
```

## vLLM Batch Inference

For high-throughput multi-image inference, the ProcVLM repository provides `evqa.model.batch_chat_with_vllm()`:

```python
from evqa.model import batch_chat_with_vllm

outputs = batch_chat_with_vllm(
    batch_items=[
        {
            "image": [
                "frames/frame_000000.jpg",
                "frames/frame_000010.jpg",
                "frames/frame_000020.jpg",
            ],
            "conversations": [
                {
                    "from": "human",
                    "value": 'Given the recent observation and the task "fold the red T-shirt", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by tags.',
                }
            ],
        }
    ],
    model_path="ce-amtic/ProcVLM-2B",
    max_new_tokens=1024,
    temperature=0.0,
    tp=1,
)
```

## One-Shot LoRA Adaptation

ProcVLM can be adapted to a new environment with one successful task demonstration, plus optional additional successful or unsuccessful demonstrations. See the [one-shot adaptation guide](https://github.com/ProcVLM/ProcVLM/blob/main/evqa/docs/oneshot_adaptation.md) for:

- annotating coarse sub-task stages with the visual UI;
- generating a LoRA fine-tuning dataset;
- running `evqa/one-shot/lora_oneshot.sh`;
- using the saved LoRA checkpoint with `evqa/inference.py --use_lora`.

## Limitations

- The model estimates progress from visual observations and task text; it may be unreliable under strong domain shift, severe occlusion, unusual camera viewpoints, or ambiguous task descriptions.
- The progress output is a learned estimate, not a calibrated physical measurement.
- For long-horizon videos, inference quality depends on the sampled frame window and the task description.
- The model should be validated in the target robot environment before being used as a reward signal for training or deployment.

## Citation

If you use ProcVLM, please cite the paper:

```bibtex
@misc{feng2026procvlmlearningproceduregroundedprogress,
      title={ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation},
      author={Youhe Feng and Hansen Shi and Haoyang Li and Xinlei Guo and Yang Wang and Chengyang Zhang and Jinkai Zhang and Xiaohan Zhang and Jie Tang and Jing Zhang},
      year={2026},
      eprint={2605.08774},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.08774},
}
```

## License

Please refer to the license information in this model repository and the upstream base model license before using the weights.