What changed

  • Added version guidance: vllm>=0.14.0.
  • Added minimal serve command:
    • vllm serve nvidia/llama-nemotron-rerank-1b-v2 --trust-remote-code
  • Added optional operational flags:
    • --dtype, --data-parallel-size, --port
  • Added an online serving example using POST /rerank via requests.
  • Added an offline inference example using:
    • LLM(..., runner="pooling", trust_remote_code=True) and llm.score(...)
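A minimal sketch of the online flow listed above, assuming a server started with the serve command on the default port; the `/rerank` endpoint and payload fields follow vLLM's rerank API, while the query and documents are illustrative:

```python
# Online reranking sketch: POST a query plus candidate documents to a
# locally served model and read back per-document relevance scores.
# Assumes `vllm serve nvidia/llama-nemotron-rerank-1b-v2 --trust-remote-code`
# is running on the default port 8000.
import requests

RERANK_URL = "http://localhost:8000/rerank"

payload = {
    "model": "nvidia/llama-nemotron-rerank-1b-v2",
    "query": "What is the capital of France?",  # illustrative query
    "documents": [  # illustrative candidates to rank
        "Paris is the capital and largest city of France.",
        "Berlin is the capital of Germany.",
    ],
}

def rerank(url: str = RERANK_URL) -> list[dict]:
    """Send the rerank request and return one result per document."""
    resp = requests.post(url, json=payload, timeout=60)
    resp.raise_for_status()
    # Each result carries the document index and its relevance score.
    return resp.json()["results"]
```

The offline path noted in the change list covers the same scoring without a server: construct `LLM(model=..., runner="pooling", trust_remote_code=True)` and call `llm.score(query, documents)`.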

Why

The README previously showed only Transformers usage. This update documents the current vLLM reranking workflow and reduces setup ambiguity for users serving the model with vLLM.

Validation

  • Checked vLLM docs and route definitions for v0.14.0 and v0.16.0 to confirm /rerank support and aliases.
  • Ensured examples align with vLLM API behavior for the supported version range.
  • Docs-only change; no model/code behavior changes.