---
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- robotics
- vision-language-model
- progress-reward
- robot-manipulation
- qwen3-vl
- procvlm
license: apache-2.0
datasets:
- ce-amtic/ProcVQA-20M-annotations
base_model:
- Qwen/Qwen3-VL-2B-Instruct
---

# ProcVLM-2B

ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage inside a `<progress>...</progress>` tag.

<p align="center">
  <a href="https://procvlm.github.io/">Homepage</a> |
  <a href="https://arxiv.org/abs/2605.08774">arXiv</a> |
  <a href="https://github.com/ProcVLM/ProcVLM">Code</a>
</p>

## Model Details

- **Model name:** `ce-amtic/ProcVLM-2B`
- **Model type:** Vision-language model for robot progress reward inference
- **Architecture:** Qwen3-VL-style multimodal causal language model
- **Input:** One or more RGB images sampled from a robot trajectory, plus a natural-language task description
- **Output:** Textual reasoning and a completion estimate formatted as `<progress>XX%</progress>`
- **Primary use case:** Frame-wise progress reward prediction for robotic manipulation videos

## Intended Use

ProcVLM-2B is designed for research on robot learning, progress reward modeling, embodied evaluation, and procedure-aware video understanding. Typical use cases include:

- estimating task completion progress from robot videos;
- producing dense progress rewards from sparse demonstrations;
- visualizing progress over time for manipulation rollouts;
- adapting progress prediction to a new environment with one-shot LoRA fine-tuning.

This model is not intended to be used as a safety-critical controller without downstream validation.

## Quick Start

Clone the ProcVLM repository and install the environment:

```bash
git clone https://github.com/ProcVLM/ProcVLM.git
cd ProcVLM

uv sync --python 3.10
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation
```

Run progress reward inference on a video:

```bash
python evqa/inference.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_predictions.jsonl \
    --task "fold the red T-shirt" \
    --window_size 8
```

Each JSONL row contains a sampled `frame_index` and its corresponding `progress` prediction.
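
A minimal sketch for loading those predictions back in Python (standard library only; the two field names are the ones documented above):

```python
import json

# Read the JSONL written by evqa/inference.py: one JSON object per line.
with open("path/to/progress_predictions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        print(record["frame_index"], record["progress"])
```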

You can also visualize predictions as a video:

```bash
python evqa/eval/visualize_progress_video.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_visualization.mp4 \
    --task "fold the red T-shirt" \
    --window_size 8
```

## Python API

The same inference workflow is available through `infer_progress_from_video()`:

```python
from evqa.inference import infer_progress_from_video

records = infer_progress_from_video(
    model_path="ce-amtic/ProcVLM-2B",
    video_path="path/to/your/video.mp4",
    task="fold the red T-shirt",
    window_size=8,
)

for item in records:
    print(item["frame_index"], item["progress"])
```

The returned records include:

- `frame_index`: source video frame index;
- `timestamp_sec`: source video timestamp in seconds;
- `window_frame_indices`: frame indices used as the model input window;
- `progress`: parsed progress value in `[0, 100]`;
- `reasoning`: model reasoning with the progress tag removed;
- `model_output`: raw model output.
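
As a usage illustration, a common way to turn these per-frame estimates into a dense reward signal is to difference consecutive progress values. This is my own sketch under that assumption, not a recipe from the repository:

```python
from evqa.inference import infer_progress_from_video

records = infer_progress_from_video(
    model_path="ce-amtic/ProcVLM-2B",
    video_path="path/to/your/video.mp4",
    task="fold the red T-shirt",
    window_size=8,
)

# Reward at step t = change in estimated completion (in percent) since step t-1.
progress = [item["progress"] for item in records]
rewards = [b - a for a, b in zip(progress, progress[1:])]
```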

## Prompt Format

ProcVLM uses a procedural progress prompt. The default template is:

```text
Given the recent observation and the task "{task}", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.
```

The model should answer with reasoning and a final progress tag, for example:

```text
The drawer is already open and the bread is close to the drawer. The remaining action is to place the bread inside the drawer.
Therefore, the estimated progress is <progress>62.5%</progress>.
```
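
If you call the model directly rather than through `evqa/inference.py` (which already parses the tag into the `progress` field), a minimal extraction sketch could look like this; the exact regex is my assumption, based only on the format shown above:

```python
import re

def parse_progress(text: str) -> float | None:
    """Pull the numeric value out of a <progress>...</progress> tag, if present."""
    match = re.search(r"<progress>\s*([0-9]*\.?[0-9]+)\s*%?\s*</progress>", text)
    return float(match.group(1)) if match else None

# From the example answer above: prints 62.5
print(parse_progress("the estimated progress is <progress>62.5%</progress>."))
```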

## vLLM Batch Inference

For high-throughput multi-image inference, the ProcVLM repository provides `evqa.model.batch_chat_with_vllm()`:

```python
from evqa.model import batch_chat_with_vllm

outputs = batch_chat_with_vllm(
    batch_items=[
        {
            "image": [
                "frames/frame_000000.jpg",
                "frames/frame_000010.jpg",
                "frames/frame_000020.jpg",
            ],
            "conversations": [
                {
                    "from": "human",
                    "value": 'Given the recent observation and the task "fold the red T-shirt", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.',
                }
            ],
        }
    ],
    model_path="ce-amtic/ProcVLM-2B",
    max_new_tokens=1024,
    temperature=0.0,
    tp=1,
)
```
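
Note that `temperature=0.0` gives greedy decoding, so the final `<progress>` tag is deterministic for a given input window; this is generally what you want when the output is parsed programmatically rather than sampled.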

## One-Shot LoRA Adaptation

ProcVLM can be adapted to a new environment from a single successful task demonstration, optionally supplemented with additional successful or failed demonstrations. See the [one-shot adaptation guide](https://github.com/ProcVLM/ProcVLM/blob/main/evqa/docs/oneshot_adaptation.md) for:

- annotating coarse sub-task stages with the visual UI;
- generating a LoRA fine-tuning dataset;
- running `evqa/one-shot/lora_oneshot.sh`;
- using the saved LoRA checkpoint with `evqa/inference.py --use_lora`.
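
For the last step, a hedged sketch of the inference call is below; `--use_lora` is the documented flag, but how the LoRA checkpoint path itself is supplied is an assumption here, so check the adaptation guide for the exact interface:

```bash
# Sketch only: the semantics of --use_lora are assumed, not documented here.
python evqa/inference.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_predictions.jsonl \
    --task "fold the red T-shirt" \
    --window_size 8 \
    --use_lora
```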

## Limitations

- The model estimates progress from visual observations and task text; it may be unreliable under strong domain shift, severe occlusion, unusual camera viewpoints, or ambiguous task descriptions.
- The progress output is a learned estimate, not a calibrated physical measurement.
- For long-horizon videos, inference quality depends on the sampled frame window and the task description.
- The model should be validated in the target robot environment before being used as a reward signal for training or deployment.

## Citation

If you use ProcVLM, please cite the paper:

```bibtex
@misc{procvlm2026,
  title         = {ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation},
  author        = {ProcVLM Authors},
  year          = {2026},
  eprint        = {2605.08774},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO},
  url           = {https://arxiv.org/abs/2605.08774}
}
```

## License

This model is released under the Apache-2.0 license (see the front matter above). Please also review the license of the upstream Qwen3-VL-2B-Instruct base model before using the weights.