What changed

  • Added version guidance: vllm>=0.14.0.
  • Added minimal serve command:
    • vllm serve nvidia/llama-nemotron-rerank-1b-v2 --trust-remote-code
  • Added optional operational flags:
    • --dtype, --data-parallel-size, --port
  • Added an online serving example using POST /rerank via requests.
  • Added an offline inference example using:
    • LLM(..., runner="pooling", trust_remote_code=True) and llm.score(...)
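A minimal sketch of the online flow listed above, assuming a server started with the serve command on the default port; the `/rerank` endpoint and payload fields follow vLLM's rerank API, while the query and documents are illustrative:

```python
# Online reranking sketch: POST a query plus candidate documents to a
# locally served model and read back per-document relevance scores.
# Assumes `vllm serve nvidia/llama-nemotron-rerank-1b-v2 --trust-remote-code`
# is running on the default port 8000.
import requests

RERANK_URL = "http://localhost:8000/rerank"

payload = {
    "model": "nvidia/llama-nemotron-rerank-1b-v2",
    "query": "What is the capital of France?",  # illustrative query
    "documents": [  # illustrative candidates to rank
        "Paris is the capital and largest city of France.",
        "Berlin is the capital of Germany.",
    ],
}

def rerank(url: str = RERANK_URL) -> list[dict]:
    """Send the rerank request and return one result per document."""
    resp = requests.post(url, json=payload, timeout=60)
    resp.raise_for_status()
    # Each result carries the document index and its relevance score.
    return resp.json()["results"]
```

The offline path noted in the change list covers the same scoring without a server: construct `LLM(model=..., runner="pooling", trust_remote_code=True)` and call `llm.score(query, documents)`.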

Why

The README previously showed only Transformers usage. This update documents the current vLLM reranking workflow and reduces setup ambiguity for users serving the model with vLLM.

Validation

  • Checked vLLM docs and route definitions for v0.14.0 and v0.16.0 to confirm /rerank support and aliases.
  • Ensured examples align with vLLM API behavior for the supported version range.
  • Docs-only change; no model/code behavior changes.