---
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- robotics
- vision-language-model
- progress-reward
- robot-manipulation
- qwen3-vl
- procvlm
license: apache-2.0
datasets:
- ce-amtic/ProcVQA-20M-annotations
base_model:
- Qwen/Qwen3-VL-2B-Instruct
---

# ProcVLM-2B

ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage.

<p align="center">
  <a href="https://procvlm.github.io/">Homepage</a> |
  <a href="https://arxiv.org/abs/2605.08774">arXiv</a> |
  <a href="https://huggingface.co/ce-amtic/ProcVLM-2B">Model</a> |
  <a href="https://github.com/ProcVLM/ProcVLM">Code</a>
</p>

## Model Details

- **Model name:** [`ce-amtic/ProcVLM-2B`](https://huggingface.co/ce-amtic/ProcVLM-2B)
- **Model type:** Vision-language model for robot progress reward inference
- **Architecture:** Qwen3-VL-style multimodal causal language model
- **Input:** One or more RGB images sampled from a robot trajectory, plus a natural-language task description
- **Output:** Textual reasoning and a completion estimate formatted as `<progress>XX%</progress>`
- **Primary use case:** Frame-wise progress reward prediction for robotic manipulation videos

## Intended Use

ProcVLM-2B is designed for research on robot learning, progress reward modeling, embodied evaluation, and procedure-aware video understanding. Typical use cases include:

- estimating task completion progress from robot videos;
- producing dense progress rewards from sparse demonstrations;
- adapting progress prediction to a new environment with one-shot LoRA fine-tuning.

This model is not intended to be used as a safety-critical controller without downstream validation.

## Quick Start

Clone the ProcVLM repository and install the environment:

```bash
git clone https://github.com/ProcVLM/ProcVLM.git
cd ProcVLM

uv sync --python 3.10
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation
```

Run progress reward inference on a video:

```bash
python evqa/inference.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_predictions.jsonl \
    --task "fold the red T-shirt" \
    --window_size 8
```

Each JSONL row contains a sampled `frame_index` and its corresponding `progress` prediction.
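For downstream use, the file can be read line by line. Below is a minimal sketch (field names follow the records described in the Python API section below):

```python
import json

# Minimal sketch: read the JSONL predictions and print a simple progress curve.
with open("path/to/progress_predictions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(f"frame {record['frame_index']:>6}: {record['progress']:6.2f}%")
```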

You can also visualize predictions as a video:

```bash
python evqa/eval/visualize_progress_video.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_visualization.mp4 \
    --task "fold the red T-shirt" \
    --window_size 8
```

## Python API

The same inference workflow is available through `infer_progress_from_video()`:

```python
from evqa.inference import infer_progress_from_video

records = infer_progress_from_video(
    model_path="ce-amtic/ProcVLM-2B",
    video_path="path/to/your/video.mp4",
    task="fold the red T-shirt",
    window_size=8,
)

for item in records:
    print(item["frame_index"], item["progress"])
```

The returned records include:

- `frame_index`: source video frame index;
- `timestamp_sec`: source video timestamp;
- `window_frame_indices`: frame indices used as the model input window;
- `progress`: parsed progress value in `[0, 100]`;
- `reasoning`: model reasoning with the progress tag removed;
- `model_output`: raw model output.
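If you need a dense per-frame signal, for example a reward value for every video frame rather than only the sampled ones, one simple option is to interpolate between the sampled predictions. The helper below is an illustrative post-processing sketch, not part of the ProcVLM API:

```python
import numpy as np

def dense_progress(records, num_frames):
    """Interpolate sampled progress predictions into a per-frame signal in [0, 1].

    Illustrative helper, not part of the ProcVLM API: `records` are the dicts
    returned by infer_progress_from_video(); `num_frames` is the video length.
    """
    records = sorted(records, key=lambda r: r["frame_index"])
    indices = np.array([r["frame_index"] for r in records])
    progress = np.array([r["progress"] for r in records]) / 100.0
    return np.interp(np.arange(num_frames), indices, progress)
```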

## Prompt Format

ProcVLM uses a procedural progress prompt. The default template is:

```text
Given the recent observation and the task "{task}", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.
```

The model should answer with reasoning and a final progress tag, for example:

```text
To complete the task: Tower the blocks, the following steps are required:
1. Grasp the green block.
2. Place the green block onto the red block.
Therefore, the estimated progress percentage is <progress>84.13%</progress>.
```

Or if the task is finished:

```text
The task requires: Tower the blocks. The images show no block outside the tower; no further steps are required.
Therefore, the estimated progress percentage is <progress>100.00%</progress>.
```
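The inference scripts parse this tag for you, but if you call the model directly, a minimal parsing sketch (the regex is an assumption based on the tag format shown above) could look like:

```python
import re

def parse_progress(text):
    """Extract the last <progress>XX%</progress> value from model output, or None."""
    matches = re.findall(r"<progress>\s*([0-9]+(?:\.[0-9]+)?)\s*%?\s*</progress>", text)
    return float(matches[-1]) if matches else None

parse_progress("... <progress>84.13%</progress>.")  # -> 84.13
```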

## vLLM Batch Inference

For high-throughput multi-image inference, the ProcVLM repository provides `evqa.model.batch_chat_with_vllm()`:

```python
from evqa.model import batch_chat_with_vllm

outputs = batch_chat_with_vllm(
    batch_items=[
        {
            "image": [
                "frames/frame_000000.jpg",
                "frames/frame_000010.jpg",
                "frames/frame_000020.jpg",
            ],
            "conversations": [
                {
                    "from": "human",
                    "value": 'Given the recent observation and the task "fold the red T-shirt", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.',
                }
            ],
        }
    ],
    model_path="ce-amtic/ProcVLM-2B",
    max_new_tokens=1024,
    temperature=0.0,
    tp=1,
)
```
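To cover a whole trajectory, one might build one batch item per sliding window of extracted frames. The helper below is an illustrative sketch (the window/stride scheme and frame naming are assumptions, not the repository's exact pipeline); the prompt string mirrors the default template above:

```python
from pathlib import Path

def build_batch_items(frame_dir, task, window_size=8, stride=10):
    """Sketch: one batch item per sliding window of extracted frame images."""
    prompt = (
        f'Given the recent observation and the task "{task}", first infer the '
        "remaining atomic actions required to complete the task. Then estimate the "
        "current completion percentage and output it as a float wrapped by <progress> tags."
    )
    frames = sorted(Path(frame_dir).glob("*.jpg"))
    items = []
    for end in range(window_size - 1, len(frames), stride):
        window = frames[end - window_size + 1 : end + 1]
        items.append({
            "image": [str(p) for p in window],
            "conversations": [{"from": "human", "value": prompt}],
        })
    return items
```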

## One-Shot LoRA Adaptation

ProcVLM can be adapted to a new environment from a single successful task demonstration, optionally supplemented with additional successful or unsuccessful demonstrations. See the [one-shot adaptation guide](https://github.com/ProcVLM/ProcVLM/blob/main/evqa/docs/oneshot_adaptation.md) for:

- annotating coarse sub-task stages with the visual UI;
- generating a LoRA fine-tuning dataset;
- running `evqa/one-shot/lora_oneshot.sh`;
- using the saved LoRA checkpoint with `evqa/inference.py --use_lora`.

## Limitations

- The model estimates progress from visual observations and task text; it may be unreliable under strong domain shift, severe occlusion, unusual camera viewpoints, or ambiguous task descriptions.
- The progress output is a learned estimate, not a calibrated physical measurement.
- For long-horizon videos, inference quality depends on the sampled frame window and the task description.
- The model should be validated in the target robot environment before being used as a reward signal for training or deployment.

## Citation

If you use ProcVLM, please cite the paper:

```bibtex
@misc{feng2026procvlmlearningproceduregroundedprogress,
      title={ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation}, 
      author={Youhe Feng and Hansen Shi and Haoyang Li and Xinlei Guo and Yang Wang and Chengyang Zhang and Jinkai Zhang and Xiaohan Zhang and Jie Tang and Jing Zhang},
      year={2026},
      eprint={2605.08774},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.08774}, 
}
```

## License

Please refer to the license information in this model repository and the license of the upstream base model before using the weights.