---
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- robotics
- vision-language-model
- progress-reward
- robot-manipulation
- qwen3-vl
- procvlm
license: apache-2.0
datasets:
- ce-amtic/ProcVQA-20M-annotations
base_model:
- Qwen/Qwen3-VL-2B-Instruct
---

# ProcVLM-2B

ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage.
## Model Details

- **Model name:** [`ce-amtic/ProcVLM-2B`](https://huggingface.co/ce-amtic/ProcVLM-2B)
- **Model type:** Vision-language model for robot progress reward inference
- **Architecture:** Qwen3-VL-style multimodal causal language model
- **Input:** One or more RGB images sampled from a robot trajectory, plus a natural-language task description
- **Output:** Textual reasoning and a completion estimate formatted as ``
- **Primary use case:** Frame-wise progress reward prediction for robotic manipulation videos

## Intended Use

ProcVLM-2B is designed for research on robot learning, progress reward modeling, embodied evaluation, and procedure-aware video understanding. Typical use cases include:

- estimating task completion progress from robot videos;
- producing dense progress rewards from sparse demonstrations;
- adapting progress prediction to a new environment with one-shot LoRA fine-tuning.

This model is not intended to be used as a safety-critical controller without downstream validation.

## Quick Start

Clone the ProcVLM repository and install the environment:

```bash
git clone https://github.com/ProcVLM/ProcVLM.git
cd ProcVLM
uv sync --python 3.10
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation
```

Run progress reward inference on a video:

```bash
python evqa/inference.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_predictions.jsonl \
    --task "fold the red T-shirt" \
    --window_size 8
```

Each JSONL row contains a sampled `frame_index` and its corresponding `progress` prediction.
You can also visualize predictions as a video:

```bash
python evqa/eval/visualize_progress_video.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_visualization.mp4 \
    --task "fold the red T-shirt" \
    --window_size 8
```

## Python API

The same inference workflow is available through `infer_progress_from_video()`:

```python
from evqa.inference import infer_progress_from_video

records = infer_progress_from_video(
    model_path="ce-amtic/ProcVLM-2B",
    video_path="path/to/your/video.mp4",
    task="fold the red T-shirt",
    window_size=8,
)
for item in records:
    print(item["frame_index"], item["progress"])
```

The returned records include:

- `frame_index`: source video frame index;
- `timestamp_sec`: source video timestamp;
- `window_frame_indices`: frame indices used as the model input window;
- `progress`: parsed progress value in `[0, 100]`;
- `reasoning`: model reasoning with the progress tag removed;
- `model_output`: raw model output.

## Prompt Format

ProcVLM uses a procedural progress prompt. The default template is:

```text
Given the recent observation and the task "{task}", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by