---
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- robotics
- vision-language-model
- progress-reward
- robot-manipulation
- qwen3-vl
- procvlm
license: apache-2.0
datasets:
- ce-amtic/ProcVQA-20M-annotations
base_model:
- Qwen/Qwen3-VL-2B-Instruct
---
# ProcVLM-2B
ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage.
<p align="center">
<a href="https://procvlm.github.io/">Homepage</a> |
<a href="https://arxiv.org/abs/2605.08774">arXiv</a> |
<a href="https://github.com/ProcVLM/ProcVLM">Code</a>
</p>
## Model Details
- **Model name:** `ce-amtic/ProcVLM-2B`
- **Model type:** Vision-language model for robot progress reward inference
- **Architecture:** Qwen3-VL-style multimodal causal language model
- **Input:** One or more RGB images sampled from a robot trajectory, plus a natural-language task description
- **Output:** Textual reasoning and a completion estimate formatted as `<progress>XX%</progress>`
- **Primary use case:** Frame-wise progress reward prediction for robotic manipulation videos
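Since the checkpoint follows the standard Qwen3-VL layout, it can also be queried directly through `transformers`. The snippet below is a minimal sketch rather than the reference pipeline (use `evqa/inference.py` from the Quick Start for that); the frame paths and generation settings are illustrative, and the prompt string follows the template documented in the Prompt Format section.

```python
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "ce-amtic/ProcVLM-2B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Recent observation frames sampled from a trajectory (paths are placeholders).
frames = [Image.open(p) for p in ["frame_000000.jpg", "frame_000010.jpg"]]
task = "fold the red T-shirt"
prompt = (
    f'Given the recent observation and the task "{task}", first infer the remaining '
    "atomic actions required to complete the task. Then estimate the current completion "
    "percentage and output it as a float wrapped by <progress> tags."
)

# One image slot per frame, followed by the progress prompt.
messages = [{
    "role": "user",
    "content": [{"type": "image"} for _ in frames] + [{"type": "text", "text": prompt}],
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=frames, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)  # reasoning ending with <progress>XX%</progress>
```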
## Intended Use
ProcVLM-2B is designed for research on robot learning, progress reward modeling, embodied evaluation, and procedure-aware video understanding. Typical use cases include:
- estimating task completion progress from robot videos;
- producing dense progress rewards from sparse demonstrations;
- adapting progress prediction to a new environment with one-shot LoRA fine-tuning.
This model is not intended to be used as a safety-critical controller without downstream validation.
## Quick Start
Clone the ProcVLM repository and install the environment:
```bash
git clone https://github.com/ProcVLM/ProcVLM.git
cd ProcVLM
uv sync --python 3.10
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation
```
Run progress reward inference on a video:
```bash
python evqa/inference.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_predictions.jsonl \
    --task "fold the red T-shirt" \
    --window_size 8
```
Each JSONL row contains a sampled `frame_index` and its corresponding `progress` prediction.
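The output file can be consumed with a few lines of Python; this sketch assumes the JSONL rows mirror the record fields listed in the Python API section below:

```python
import json

# Read the predictions written by evqa/inference.py (path is a placeholder).
with open("path/to/progress_predictions.jsonl") as f:
    rows = [json.loads(line) for line in f]

for row in rows:
    print(row["frame_index"], row["progress"])
```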
You can also visualize predictions as a video:
```bash
python evqa/eval/visualize_progress_video.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_visualization.mp4 \
    --task "fold the red T-shirt" \
    --window_size 8
```
## Python API
The same inference workflow is available through `infer_progress_from_video()`:
```python
from evqa.inference import infer_progress_from_video
records = infer_progress_from_video(
    model_path="ce-amtic/ProcVLM-2B",
    video_path="path/to/your/video.mp4",
    task="fold the red T-shirt",
    window_size=8,
)

for item in records:
    print(item["frame_index"], item["progress"])
```
The returned records include:
- `frame_index`: source video frame index;
- `timestamp_sec`: source video timestamp;
- `window_frame_indices`: frame indices used as the model input window;
- `progress`: parsed progress value in `[0, 100]`;
- `reasoning`: model reasoning with the progress tag removed;
- `model_output`: raw model output.
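If you need a dense per-frame reward signal (one of the intended uses above), the sparse window-level estimates can be interpolated over the full frame range. The helper below is an illustrative sketch rather than part of the package; it reuses the `records` returned above, and `num_frames` stands in for the total frame count of the source video:

```python
import numpy as np

def dense_progress(records, num_frames):
    """Linearly interpolate sparse window-level progress onto every frame index.

    Assumes the records are ordered by increasing frame_index, as produced by
    the frame sampler in evqa/inference.py.
    """
    xs = np.array([r["frame_index"] for r in records], dtype=float)
    ys = np.array([r["progress"] for r in records], dtype=float) / 100.0
    return np.interp(np.arange(num_frames), xs, ys)

# A simple per-step reward is the progress increment between consecutive frames.
progress = dense_progress(records, num_frames=500)
reward = np.diff(progress, prepend=progress[0])
```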
## Prompt Format
ProcVLM uses a procedural progress prompt. The default template is:
```text
Given the recent observation and the task "{task}", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.
```
The model should answer with reasoning and a final progress tag, for example:
```text
To complete the task: Tower the blocks, the following steps are required:
1. Grasp the green block.
2. Place the green block onto the red block.
Therefore, the estimated progress percentage is <progress>84.13%</progress>.
```
Or if the task is finished:
```text
The task requires: Tower the blocks. Images show no block outside the tower, no further steps required.
Therefore, the estimated progress percentage is <progress>100.00%</progress>.
```
## vLLM Batch Inference
For high-throughput multi-image inference, the ProcVLM repository provides `evqa.model.batch_chat_with_vllm()`:
```python
from evqa.model import batch_chat_with_vllm
outputs = batch_chat_with_vllm(
    batch_items=[
        {
            "image": [
                "frames/frame_000000.jpg",
                "frames/frame_000010.jpg",
                "frames/frame_000020.jpg",
            ],
            "conversations": [
                {
                    "from": "human",
                    "value": 'Given the recent observation and the task "fold the red T-shirt", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.',
                }
            ],
        }
    ],
    model_path="ce-amtic/ProcVLM-2B",
    max_new_tokens=1024,
    temperature=0.0,
    tp=1,
)
```
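To score a whole trajectory this way, one batch item can be built per sliding window of extracted frames. The window construction and prompt string below are an illustrative sketch, not part of the package:

```python
import glob

task = "fold the red T-shirt"
prompt = (
    f'Given the recent observation and the task "{task}", first infer the remaining '
    "atomic actions required to complete the task. Then estimate the current completion "
    "percentage and output it as a float wrapped by <progress> tags."
)

# One batch item per 8-frame window, stepping 8 frames at a time.
frame_paths = sorted(glob.glob("frames/*.jpg"))
window_size, stride = 8, 8
batch_items = [
    {
        "image": frame_paths[i : i + window_size],
        "conversations": [{"from": "human", "value": prompt}],
    }
    for i in range(0, len(frame_paths) - window_size + 1, stride)
]
```

The per-item generations can then be parsed with the same `<progress>` extraction sketched in the Prompt Format section.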
## One-Shot LoRA Adaptation
ProcVLM can be adapted to a new environment from a single successful task demonstration, optionally supplemented with additional successful or failed demonstrations. See the [one-shot adaptation guide](https://github.com/ProcVLM/ProcVLM/blob/main/evqa/docs/oneshot_adaptation.md) for:
- annotating coarse sub-task stages with the visual UI;
- generating a LoRA fine-tuning dataset;
- running `evqa/one-shot/lora_oneshot.sh`;
- using the saved LoRA checkpoint with `evqa/inference.py --use_lora`.
## Limitations
- The model estimates progress from visual observations and task text; it may be unreliable under strong domain shift, severe occlusion, unusual camera viewpoints, or ambiguous task descriptions.
- The progress output is a learned estimate, not a calibrated physical measurement.
- For long-horizon videos, inference quality depends on the sampled frame window and the task description.
- The model should be validated in the target robot environment before being used as a reward signal for training or deployment.
## Citation
If you use ProcVLM, please cite the paper:
```bibtex
@misc{feng2026procvlmlearningproceduregroundedprogress,
  title={ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation},
  author={Youhe Feng and Hansen Shi and Haoyang Li and Xinlei Guo and Yang Wang and Chengyang Zhang and Jinkai Zhang and Xiaohan Zhang and Jie Tang and Jing Zhang},
  year={2026},
  eprint={2605.08774},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2605.08774},
}
```
## License
Please review the license information in this repository and the license of the upstream base model, Qwen/Qwen3-VL-2B-Instruct, before using the weights.