ce-amtic committed · Commit a1b9029 · verified · 1 Parent(s): 99ee438

Update README.md

Files changed (1): README.md +202 -3
README.md CHANGED
@@ -1,3 +1,202 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - en
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ tags:
+ - robotics
+ - vision-language-model
+ - progress-reward
+ - robot-manipulation
+ - qwen3-vl
+ - procvlm
+ license: apache-2.0
+ datasets:
+ - ce-amtic/ProcVQA-20M-annotations
+ base_model:
+ - Qwen/Qwen3-VL-2B-Instruct
+ ---
+
+ # ProcVLM-2B
+
+ ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage.
+
+ <p align="center">
+   <a href="https://procvlm.github.io/">Homepage</a> |
+   <a href="https://arxiv.org/abs/2605.08774">arXiv</a> |
+   <a href="https://huggingface.co/ce-amtic/ProcVLM-2B">Model</a> |
+   <a href="https://github.com/ProcVLM/ProcVLM">Code</a>
+ </p>
+
+ ## Model Details
+
+ - **Model name:** [`ce-amtic/ProcVLM-2B`](https://huggingface.co/ce-amtic/ProcVLM-2B)
+ - **Model type:** Vision-language model for robot progress reward inference
+ - **Architecture:** Qwen3-VL-style multimodal causal language model
+ - **Input:** One or more RGB images sampled from a robot trajectory, plus a natural-language task description
+ - **Output:** Textual reasoning and a completion estimate formatted as `<progress>XX%</progress>`
+ - **Primary use case:** Frame-wise progress reward prediction for robotic manipulation videos
+
+ ## Intended Use
+
+ ProcVLM-2B is designed for research on robot learning, progress reward modeling, embodied evaluation, and procedure-aware video understanding. Typical use cases include:
+
+ - estimating task completion progress from robot videos;
+ - producing dense progress rewards from sparse demonstrations;
+ - adapting progress prediction to a new environment with one-shot LoRA fine-tuning.
+
+ This model is not intended to be used as a safety-critical controller without downstream validation.
+
+ ## Quick Start
+
+ Clone the ProcVLM repository and install the environment:
+
+ ```bash
+ git clone https://github.com/ProcVLM/ProcVLM.git
+ cd ProcVLM
+
+ uv sync --python 3.10
+ source .venv/bin/activate
+ uv pip install flash-attn --no-build-isolation
+ ```
+
+ Run progress reward inference on a video:
+
+ ```bash
+ python evqa/inference.py \
+   --model_path ce-amtic/ProcVLM-2B \
+   --video_path path/to/your/video.mp4 \
+   --output_path path/to/progress_predictions.jsonl \
+   --task "fold the red T-shirt" \
+   --window_size 8
+ ```
+
+ Each JSONL row contains a sampled `frame_index` and its corresponding `progress` prediction.
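
A minimal sketch of reading such a file back, assuming each row is a JSON object with at least the `frame_index` and `progress` keys described above. The helper name `load_progress` and the sample rows are hypothetical and not part of the ProcVLM repository.

```python
import json
import os
import tempfile


def load_progress(jsonl_path):
    """Return a list of (frame_index, progress) pairs sorted by frame index."""
    pairs = []
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # tolerate blank lines
                continue
            row = json.loads(line)
            pairs.append((int(row["frame_index"]), float(row["progress"])))
    return sorted(pairs)


# Illustrative rows only; real files are written by evqa/inference.py.
sample_rows = [
    {"frame_index": 10, "progress": 12.5},
    {"frame_index": 0, "progress": 0.0},
    {"frame_index": 20, "progress": 30.0},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "progress_predictions.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        for row in sample_rows:
            f.write(json.dumps(row) + "\n")
    loaded = load_progress(sample_path)
```

Sorting by frame index is a convenience for plotting; the file itself may already be ordered.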
+
+ You can also visualize predictions as a video:
+
+ ```bash
+ python evqa/eval/visualize_progress_video.py \
+   --model_path ce-amtic/ProcVLM-2B \
+   --video_path path/to/your/video.mp4 \
+   --output_path path/to/progress_visualization.mp4 \
+   --task "fold the red T-shirt" \
+   --window_size 8
+ ```
+
+ ## Python API
+
+ The same inference workflow is available through `infer_progress_from_video()`:
+
+ ```python
+ from evqa.inference import infer_progress_from_video
+
+ records = infer_progress_from_video(
+     model_path="ce-amtic/ProcVLM-2B",
+     video_path="path/to/your/video.mp4",
+     task="fold the red T-shirt",
+     window_size=8,
+ )
+
+ for item in records:
+     print(item["frame_index"], item["progress"])
+ ```
+
+ The returned records include:
+
+ - `frame_index`: index of the sampled frame in the source video;
+ - `timestamp_sec`: timestamp of that frame in the source video, in seconds;
+ - `window_frame_indices`: frame indices used as the model input window;
+ - `progress`: parsed progress value in `[0, 100]`;
+ - `reasoning`: model reasoning with the progress tag removed;
+ - `model_output`: raw model output.
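
One possible way to turn these records into a dense reward signal is to take the change in progress between consecutive sampled frames. The sketch below illustrates that idea under stated assumptions and is not the reward definition used in the paper. `progress_deltas` is a hypothetical helper and the records are made-up values.

```python
def progress_deltas(records):
    """Map each sampled frame to the progress gained since the previous sample.

    Progress values are in [0, 100]; deltas are rescaled to [-1, 1].
    """
    ordered = sorted(records, key=lambda r: r["frame_index"])
    rewards = []
    previous = None
    for rec in ordered:
        current = rec["progress"]
        # The first sample has no predecessor, so its reward is zero.
        delta = 0.0 if previous is None else (current - previous) / 100.0
        rewards.append({"frame_index": rec["frame_index"], "reward": delta})
        previous = current
    return rewards


# Made-up records that mimic the fields listed above.
example_records = [
    {"frame_index": 0, "progress": 0.0},
    {"frame_index": 10, "progress": 25.0},
    {"frame_index": 20, "progress": 20.0},  # apparent regression
    {"frame_index": 30, "progress": 70.0},
]
rewards = progress_deltas(example_records)
```

Negative deltas are kept on purpose so that apparent regressions stay visible; whether to clip or smooth them depends on the downstream learner.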
+
+ ## Prompt Format
+
+ ProcVLM uses a procedural progress prompt. The default template is:
+
+ ```text
+ Given the recent observation and the task "{task}", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.
+ ```
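
Filling in the template is plain string substitution. The sketch below copies the template text from above; `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names chosen for illustration and may not match the identifiers used in the repository.

```python
# Template text copied from the model card; the names are illustrative.
PROMPT_TEMPLATE = (
    'Given the recent observation and the task "{task}", first infer the '
    "remaining atomic actions required to complete the task. Then estimate "
    "the current completion percentage and output it as a float wrapped by "
    "<progress> tags."
)


def build_prompt(task):
    """Insert a natural-language task description into the template."""
    return PROMPT_TEMPLATE.format(task=task)


prompt = build_prompt("fold the red T-shirt")
```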
+
+ The model should answer with reasoning and a final progress tag, for example:
+
+ ```text
+ To complete the task: Tower the blocks, the following steps are required:
+ 1. Grasp the green block.
+ 2. Place the green block onto the red block.
+ Therefore, the estimated progress percentage is <progress>84.13%</progress>.
+ ```
+
+ Or if the task is finished:
+
+ ```text
+ The task requires: Tower the blocks. Images show no block outside the tower, no further steps required.
+ Therefore, the estimated progress percentage is <progress>100.00%</progress>.
+ ```
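
Outputs in this format can be split into a numeric value and the remaining reasoning with a regular expression. This is a minimal sketch; the repository's own parser may differ, and `parse_progress` and its clamping behavior are assumptions made here.

```python
import re

# Matches e.g. <progress>84.13%</progress>; the percent sign is optional.
PROGRESS_RE = re.compile(
    r"<progress>\s*([0-9]+(?:\.[0-9]+)?)\s*%?\s*</progress>"
)


def parse_progress(model_output):
    """Return (progress, reasoning); progress is None if no tag is found."""
    match = PROGRESS_RE.search(model_output)
    if match is None:
        return None, model_output.strip()
    value = float(match.group(1))
    value = max(0.0, min(100.0, value))  # clamp to [0, 100]
    reasoning = PROGRESS_RE.sub("", model_output).strip()
    return value, reasoning


output = (
    "To complete the task: Tower the blocks, the following steps are required:\n"
    "1. Grasp the green block.\n"
    "2. Place the green block onto the red block.\n"
    "Therefore, the estimated progress percentage is <progress>84.13%</progress>."
)
progress, reasoning = parse_progress(output)
```

Returning `None` for a missing tag, rather than a default such as `0.0`, keeps malformed generations distinguishable from genuine zero-progress predictions.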
+
+ ## vLLM Batch Inference
+
+ For high-throughput multi-image inference, the ProcVLM repository provides `evqa.model.batch_chat_with_vllm()`:
+
+ ```python
+ from evqa.model import batch_chat_with_vllm
+
+ outputs = batch_chat_with_vllm(
+     batch_items=[
+         {
+             "image": [
+                 "frames/frame_000000.jpg",
+                 "frames/frame_000010.jpg",
+                 "frames/frame_000020.jpg",
+             ],
+             "conversations": [
+                 {
+                     "from": "human",
+                     "value": 'Given the recent observation and the task "fold the red T-shirt", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.',
+                 }
+             ],
+         }
+     ],
+     model_path="ce-amtic/ProcVLM-2B",
+     max_new_tokens=1024,
+     temperature=0.0,
+     tp=1,
+ )
+ ```
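
To score a whole trajectory, one item per sampled frame can be built with a trailing window of recent frames. The sketch below only constructs dictionaries in the `batch_items` shape shown above; `build_batch_items`, its `stride` parameter, and the trailing-window policy are assumptions for illustration, not the repository's sampling logic.

```python
def build_batch_items(frame_paths, prompt, window_size=8, stride=1):
    """Create one item per selected frame, each holding up to
    `window_size` most recent frames ending at that frame."""
    items = []
    for end in range(0, len(frame_paths), stride):
        start = max(0, end - window_size + 1)
        items.append(
            {
                "image": list(frame_paths[start:end + 1]),
                "conversations": [{"from": "human", "value": prompt}],
            }
        )
    return items


# Hypothetical frame paths following the naming used in the example above.
frames = [f"frames/frame_{i * 10:06d}.jpg" for i in range(5)]
items = build_batch_items(frames, "example prompt", window_size=3, stride=2)
```

Early frames get shorter windows here instead of padding, which avoids showing the model duplicated images.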
+
+ ## One-Shot LoRA Adaptation
+
+ ProcVLM can be adapted to a new environment with one successful task demonstration, plus optional additional successful or unsuccessful demonstrations. See the [one-shot adaptation guide](https://github.com/ProcVLM/ProcVLM/blob/main/evqa/docs/oneshot_adaptation.md) for:
+
+ - annotating coarse sub-task stages with the visual UI;
+ - generating a LoRA fine-tuning dataset;
+ - running `evqa/one-shot/lora_oneshot.sh`;
+ - using the saved LoRA checkpoint with `evqa/inference.py --use_lora`.
+
+ ## Limitations
+
+ - The model estimates progress from visual observations and task text; it may be unreliable under strong domain shift, severe occlusion, unusual camera viewpoints, or ambiguous task descriptions.
+ - The progress output is a learned estimate, not a calibrated physical measurement.
+ - For long-horizon videos, inference quality depends on the sampled frame window and the task description.
+ - The model should be validated in the target robot environment before being used as a reward signal for training or deployment.
+
+ ## Citation
+
+ If you use ProcVLM, please cite the paper:
+
+ ```bibtex
+ @misc{feng2026procvlmlearningproceduregroundedprogress,
+       title={ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation},
+       author={Youhe Feng and Hansen Shi and Haoyang Li and Xinlei Guo and Yang Wang and Chengyang Zhang and Jinkai Zhang and Xiaohan Zhang and Jie Tang and Jing Zhang},
+       year={2026},
+       eprint={2605.08774},
+       archivePrefix={arXiv},
+       primaryClass={cs.RO},
+       url={https://arxiv.org/abs/2605.08774},
+ }
+ ```
+
+ ## License
+
+ The model card metadata declares this repository as `apache-2.0`. Before using the weights, also check the license of the upstream base model, `Qwen/Qwen3-VL-2B-Instruct`.