---
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- robotics
- vision-language-model
- progress-reward
- robot-manipulation
- qwen3-vl
- procvlm
license: apache-2.0
datasets:
- ce-amtic/ProcVQA-20M-annotations
base_model:
- Qwen/Qwen3-VL-2B-Instruct
---

# ProcVLM-2B

ProcVLM-2B is a procedure-grounded vision-language model for estimating progress rewards from robot manipulation observations. Given a task description and a recent window of video frames, the model reasons about the remaining atomic actions and predicts the current task completion percentage.

<p align="center">
  <a href="https://procvlm.github.io/">Homepage</a> |
  <a href="https://arxiv.org/abs/2605.08774">arXiv</a> |
  <a href="https://huggingface.co/ce-amtic/ProcVLM-2B">Model</a> |
  <a href="https://github.com/ProcVLM/ProcVLM">Code</a>
</p>

## Model Details

- **Model name:** [`ce-amtic/ProcVLM-2B`](https://huggingface.co/ce-amtic/ProcVLM-2B)
- **Model type:** Vision-language model for robot progress reward inference
- **Architecture:** Qwen3-VL-style multimodal causal language model
- **Input:** One or more RGB images sampled from a robot trajectory, plus a natural-language task description
- **Output:** Textual reasoning and a completion estimate formatted as `<progress>XX%</progress>`
- **Primary use case:** Frame-wise progress reward prediction for robotic manipulation videos

## Intended Use

ProcVLM-2B is designed for research on robot learning, progress reward modeling, embodied evaluation, and procedure-aware video understanding. Typical use cases include:

- estimating task completion progress from robot videos;
- producing dense progress rewards from sparse demonstrations;
- adapting progress prediction to a new environment with one-shot LoRA fine-tuning.

This model is not intended to be used as a safety-critical controller without downstream validation.

## Quick Start

Clone the ProcVLM repository and install the environment:

```bash
git clone https://github.com/ProcVLM/ProcVLM.git
cd ProcVLM

uv sync --python 3.10
source .venv/bin/activate
uv pip install flash-attn --no-build-isolation
```

Run progress reward inference on a video:

```bash
python evqa/inference.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_predictions.jsonl \
    --task "fold the red T-shirt" \
    --window_size 8
```

Each JSONL row contains a sampled `frame_index` and its corresponding `progress` prediction.
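
For quick inspection, the predictions file can be read back with a few lines of Python. The snippet below is a minimal illustrative sketch, not part of the repository; it assumes only the documented `frame_index` and `progress` fields.

```python
import json

# Read the JSONL predictions written by evqa/inference.py and print the
# per-frame progress curve (uses only the documented fields).
records = []
with open("path/to/progress_predictions.jsonl") as f:
    for line in f:
        line = line.strip()
        if line:
            records.append(json.loads(line))

for rec in sorted(records, key=lambda r: r["frame_index"]):
    print(f"frame {rec['frame_index']:6d}: progress {rec['progress']:.2f}%")
```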

You can also visualize predictions as a video:

```bash
python evqa/eval/visualize_progress_video.py \
    --model_path ce-amtic/ProcVLM-2B \
    --video_path path/to/your/video.mp4 \
    --output_path path/to/progress_visualization.mp4 \
    --task "fold the red T-shirt" \
    --window_size 8
```

## Python API

The same inference workflow is available through `infer_progress_from_video()`:

```python
from evqa.inference import infer_progress_from_video

records = infer_progress_from_video(
    model_path="ce-amtic/ProcVLM-2B",
    video_path="path/to/your/video.mp4",
    task="fold the red T-shirt",
    window_size=8,
)

for item in records:
    print(item["frame_index"], item["progress"])
```

The returned records include the following fields (a sketch that turns them into a dense reward signal follows the list):

- `frame_index`: source video frame index;
- `timestamp_sec`: source video timestamp;
- `window_frame_indices`: frame indices used as the model input window;
- `progress`: parsed progress value in `[0, 100]`;
- `reasoning`: model reasoning with the progress tag removed;
- `model_output`: raw model output.
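
One common use of these records is to convert per-frame progress estimates into a dense reward. The sketch below simply takes the difference of consecutive progress values; it is an illustrative recipe under that assumption, not the reward formulation used in the paper.

```python
def dense_rewards(records, scale=0.01):
    """Turn per-frame progress (0-100) into per-step rewards.

    Illustrative only: reward = scaled increase in predicted progress
    between consecutive sampled frames.
    """
    ordered = sorted(records, key=lambda r: r["frame_index"])
    rewards = []
    for prev, curr in zip(ordered, ordered[1:]):
        rewards.append(scale * (curr["progress"] - prev["progress"]))
    return rewards

rewards = dense_rewards(records)
print(sum(rewards))  # scaled progress gain between first and last sampled frames
```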

## Prompt Format

ProcVLM uses a procedural progress prompt. The default template is:

```text
Given the recent observation and the task "{task}", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.
```
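
For a custom task, the template is instantiated by substituting the task string. A trivial sketch (the constant name below is hypothetical, not a repository identifier):

```python
# Hypothetical constant holding the default template shown above.
PROGRESS_PROMPT = (
    'Given the recent observation and the task "{task}", first infer the remaining '
    "atomic actions required to complete the task. Then estimate the current "
    "completion percentage and output it as a float wrapped by <progress> tags."
)

prompt = PROGRESS_PROMPT.format(task="fold the red T-shirt")
```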

The model should answer with reasoning and a final progress tag, for example:

```text
To complete the task: Tower the blocks, the following steps are required:
1. Grasp the green block.
2. Place the green block onto the red block.
Therefore, the estimated progress percentage is <progress>84.13%</progress>.
```

Or, if the task is finished:

```text
The task requires: Tower the blocks. Images show no block outside the tower, no further steps required.
Therefore, the estimated progress percentage is <progress>100.00%</progress>.
```
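
The final value can be recovered from the raw output with a simple regular expression. The helper below is an illustrative sketch, not the parser shipped in the repository:

```python
import re

def parse_progress(model_output: str):
    """Extract the completion percentage from a <progress>...</progress> tag.

    Returns a float clamped to the documented [0, 100] range, or None if no
    tag is found. Illustrative sketch only.
    """
    match = re.search(
        r"<progress>\s*([0-9]+(?:\.[0-9]+)?)\s*%?\s*</progress>", model_output
    )
    if match is None:
        return None
    return max(0.0, min(100.0, float(match.group(1))))

print(parse_progress("... is <progress>84.13%</progress>."))  # 84.13
```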

## vLLM Batch Inference

For high-throughput multi-image inference, the ProcVLM repository provides `evqa.model.batch_chat_with_vllm()`:

```python
from evqa.model import batch_chat_with_vllm

outputs = batch_chat_with_vllm(
    batch_items=[
        {
            "image": [
                "frames/frame_000000.jpg",
                "frames/frame_000010.jpg",
                "frames/frame_000020.jpg",
            ],
            "conversations": [
                {
                    "from": "human",
                    "value": 'Given the recent observation and the task "fold the red T-shirt", first infer the remaining atomic actions required to complete the task. Then estimate the current completion percentage and output it as a float wrapped by <progress> tags.',
                }
            ],
        }
    ],
    model_path="ce-amtic/ProcVLM-2B",
    max_new_tokens=1024,
    temperature=0.0,
    tp=1,
)
```
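
Assuming each entry in `outputs` is the raw generated text for the corresponding batch item (an assumption, not a documented guarantee), the progress tag can then be parsed as shown earlier:

```python
# Assumption: `outputs` is a list of raw generated strings, one per batch item.
for text in outputs:
    print(parse_progress(text))  # parse_progress is the sketch from the Prompt Format section
```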

## One-Shot LoRA Adaptation

ProcVLM can be adapted to a new environment with a single successful task demonstration, optionally supplemented by additional successful or unsuccessful demonstrations. See the [one-shot adaptation guide](https://github.com/ProcVLM/ProcVLM/blob/main/evqa/docs/oneshot_adaptation.md) for:

- annotating coarse sub-task stages with the visual UI;
- generating a LoRA fine-tuning dataset;
- running `evqa/one-shot/lora_oneshot.sh`;
- using the saved LoRA checkpoint with `evqa/inference.py --use_lora`.

## Limitations

- The model estimates progress from visual observations and task text; it may be unreliable under strong domain shift, severe occlusion, unusual camera viewpoints, or ambiguous task descriptions.
- The progress output is a learned estimate, not a calibrated physical measurement.
- For long-horizon videos, inference quality depends on the sampled frame window and the task description.
- The model should be validated in the target robot environment before being used as a reward signal for training or deployment.

## Citation

If you use ProcVLM, please cite the paper:

```bibtex
@misc{feng2026procvlmlearningproceduregroundedprogress,
      title={ProcVLM: Learning Procedure-Grounded Progress Rewards for Robotic Manipulation},
      author={Youhe Feng and Hansen Shi and Haoyang Li and Xinlei Guo and Yang Wang and Chengyang Zhang and Jinkai Zhang and Xiaohan Zhang and Jie Tang and Jing Zhang},
      year={2026},
      eprint={2605.08774},
      archivePrefix={arXiv},
      primaryClass={cs.RO},
      url={https://arxiv.org/abs/2605.08774},
}
```

## License

The model weights are released under the Apache-2.0 license (see the metadata above). Please also review the license of the upstream base model, Qwen/Qwen3-VL-2B-Instruct, before using the weights.