---
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- robotics
- vision-language-model
- progress-reward
- robot-manipulation
- qwen3-vl
- procvlm
license: apache-2.0
datasets:
- ce-amtic/ProcVQA-20M-annotations
base_model:
- Qwen/Qwen3-VL-2B-Instruct
---

# ProcVLM-2B

ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage.

Homepage | arXiv | Model | Code

## Model Details

- **Model name:** [`ce-amtic/ProcVLM-2B`](https://huggingface.co/ce-amtic/ProcVLM-2B)
- **Model type:** Vision-language model for robot progress reward inference
- **Architecture:** Qwen3-VL-style multimodal causal language model
- **Input:** One or more RGB images sampled from a robot trajectory, plus a natural-language task description
- **Output:** Textual reasoning and a completion estimate formatted as `XX%`
- **Primary use case:** Frame-wise progress reward prediction for robotic manipulation videos

## Intended Use

ProcVLM-2B is designed for research on robot learning, progress reward modeling, embodied evaluation, and procedure-aware video understanding. Typical use cases include:

- estimating task completion progress from robot videos;
- producing dense progress rewards from sparse demonstrations;
- adapting progress prediction to a new environment with one-shot LoRA fine-tuning.

This model is not intended to be used as a safety-critical controller without downstream validation.

## Quick Start

Clone the ProcVLM repository and install the environment:

```bash
git clone https://github.com/ProcVLM/ProcVLM.git
cd ProcVLM
uv sync --python 3.10
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation
```

Run progress reward inference on a video:

```bash
python evqa/inference.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_predictions.jsonl \
    --task "fold the red T-shirt" \
    --window_size 8
```

Each JSONL row contains a sampled `frame_index` and its corresponding `progress` prediction.

You can also visualize predictions as a video:

```bash
python evqa/eval/visualize_progress_video.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_visualization.mp4 \
    --task "fold the red T-shirt" \
    --window_size 8
```

## Python API

The same inference workflow is available through `infer_progress_from_video()`:

```python
from evqa.inference import infer_progress_from_video

records = infer_progress_from_video(
    model_path="ce-amtic/ProcVLM-2B",
    video_path="path/to/your/video.mp4",
    task="fold the red T-shirt",
    window_size=8,
)

for item in records:
    print(item["frame_index"], item["progress"])
```

The returned records include:

- `frame_index`: source video frame index;
- `timestamp_sec`: source video timestamp;
- `window_frame_indices`: frame indices used as the model input window;
- `progress`: parsed progress value in `[0, 100]`;
- `reasoning`: model reasoning with the progress tag removed;
- `model_output`: raw model output.

## Prompt Format

ProcVLM uses a procedural progress prompt. The default template is:

```text
Given the recent observation and the task "{task}", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by tags.
```

The model should answer with reasoning and a final progress tag, for example:

```text
To complete the task: Tower the blocks, the following steps are required:
1. Grasp the green block.
2. Place the green block onto the red block.
Therefore, the estimated progress percentage is 84.13%.
```

Or, if the task is finished:

```text
The task requires: Tower the blocks. Images show no block outside the tower, no further steps required.
Therefore, the estimated progress percentage is 100.00%.
```
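If you want to drive the model directly with `transformers` rather than the repository's `evqa` tooling, the prompt template above can be paired with a small parser that extracts the final percentage from the generated text. The sketch below is illustrative only: the `predict_progress` helper is hypothetical, it assumes the Qwen3-VL-style processor's `apply_chat_template` accepts interleaved image entries (as in recent `transformers` releases), and the regular-expression fallback is an assumption about how a progress value could be recovered, not the repository's actual tag parser.

```python
import re

from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL_ID = "ce-amtic/ProcVLM-2B"

# Default procedural progress prompt from this card.
PROMPT = (
    'Given the recent observation and the task "{task}", first infer the remaining '
    "atomic actions required to complete the task. Then estimate the current "
    "completion percentage and output it as a float wrapped by tags."
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)


def predict_progress(frame_paths: list[str], task: str) -> float | None:
    """Run one frame window through the model and parse the final percentage.

    Hypothetical helper for illustration; the supported inference path is
    evqa/inference.py from the ProcVLM repository.
    """
    # Build a single user turn: the frame window followed by the task prompt.
    content = [{"type": "image", "image": Image.open(p)} for p in frame_paths]
    content.append({"type": "text", "text": PROMPT.format(task=task)})
    messages = [{"role": "user", "content": content}]

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    generated = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Drop the prompt tokens and decode only the newly generated answer.
    answer_ids = generated[:, inputs["input_ids"].shape[1]:]
    text = processor.batch_decode(answer_ids, skip_special_tokens=True)[0]

    # Assumed fallback parser: take the last "XX.XX%"-style number in the answer.
    matches = re.findall(r"(\d+(?:\.\d+)?)\s*%", text)
    return float(matches[-1]) if matches else None


progress = predict_progress(
    ["frames/frame_000000.jpg", "frames/frame_000010.jpg", "frames/frame_000020.jpg"],
    task="tower the blocks",
)
print(progress)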
```

## vLLM Batch Inference

For high-throughput multi-image inference, the ProcVLM repository provides `evqa.model.batch_chat_with_vllm()`:

```python
from evqa.model import batch_chat_with_vllm

outputs = batch_chat_with_vllm(
    batch_items=[
        {
            "image": [
                "frames/frame_000000.jpg",
                "frames/frame_000010.jpg",
                "frames/frame_000020.jpg",
            ],
            "conversations": [
                {
                    "from": "human",
                    "value": 'Given the recent observation and the task "fold the red T-shirt", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by tags.',
                }
            ],
        }
    ],
    model_path="ce-amtic/ProcVLM-2B",
    max_new_tokens=1024,
    temperature=0.0,
    tp=1,
)
```

## One-Shot LoRA Adaptation

ProcVLM can be adapted to a new environment with one successful task demonstration, plus optional additional successful or unsuccessful demonstrations. See the [one-shot adaptation guide](https://github.com/ProcVLM/ProcVLM/blob/main/evqa/docs/oneshot_adaptation.md) for:

- annotating coarse sub-task stages with the visual UI;
- generating a LoRA fine-tuning dataset;
- running `evqa/one-shot/lora_oneshot.sh`;
- using the saved LoRA checkpoint with `evqa/inference.py --use_lora`.

## Limitations

- The model estimates progress from visual observations and task text; it may be unreliable under strong domain shift, severe occlusion, unusual camera viewpoints, or ambiguous task descriptions.
- The progress output is a learned estimate, not a calibrated physical measurement.
- For long-horizon videos, inference quality depends on the sampled frame window and the task description.
- The model should be validated in the target robot environment before being used as a reward signal for training or deployment.

## Citation

If you use ProcVLM, please cite the paper:

```bibtex
@misc{feng2026procvlmlearningproceduregroundedprogress,
      title={ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation},
      author={Youhe Feng and Hansen Shi and Haoyang Li and Xinlei Guo and Yang Wang and Chengyang Zhang and Jinkai Zhang and Xiaohan Zhang and Jie Tang and Jing Zhang},
      year={2026},
      eprint={2605.08774},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.08774},
}
```

## License

Please refer to the license information in this model repository and the upstream base model license before using the weights.