llama-nemotron-rerank-1b-v2 — ONNX Export

ONNX export of nvidia/llama-nemotron-rerank-1b-v2 for CPU/GPU inference via ONNX Runtime and fastembed-rs.

Available variants

File                          Size    Description
model.onnx + model.onnx_data  4.6 GB  FP32, full precision
model_int8.onnx               1.2 GB  INT8 dynamic quantization (all ops)
model_int4_full.onnx          832 MB  INT4 MatMul + INT8 embedding — smallest

All variants produce the same logits [batch, 1] output and pass the relevance sanity check (the relevant panda document scores significantly higher than the irrelevant one).
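One way to spot-check agreement between two variants beyond a single example is to compare the pairwise document orderings their scores induce. A minimal sketch; the score lists below are illustrative placeholders, not measured model output:

```python
def ranking_agreement(scores_a, scores_b):
    """Fraction of document pairs ordered the same way by two score lists."""
    n = len(scores_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 1.0
    same = sum(
        (scores_a[i] > scores_a[j]) == (scores_b[i] > scores_b[j])
        for i, j in pairs
    )
    return same / len(pairs)

# Illustrative scores (not real output): FP32 vs INT4 on four documents.
fp32_scores = [1.50, -6.87, 0.42, -3.10]
int4_scores = [1.47, -6.91, 0.39, -3.05]
print(ranking_agreement(fp32_scores, int4_scores))  # 1.0 => identical ordering
```

A value of 1.0 means the quantized variant ranks every pair of documents the same way as the reference, even if the absolute logits drift slightly.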

Usage with fastembed-rs

[dependencies]
fastembed = "5"

use fastembed::{TextRerank, RerankInitOptions, RerankerModel};

let mut model = TextRerank::try_new(
    RerankInitOptions::new(RerankerModel::LlamaNemotronRerank1BV2Int4Full)
        .with_show_download_progress(true),
)?;

let results = model.rerank(
    "what is a panda?",
    vec![
        "A panda is a large black-and-white bear native to China.",
        "The sky is blue and the grass is green.",
    ],
    true,
    None,
)?;
// results[0].score ≈ 1.50, results[1].score ≈ -6.87

Available RerankerModel variants:

  • RerankerModel::LlamaNemotronRerank1BV2 β€” FP32
  • RerankerModel::LlamaNemotronRerank1BV2Int8 β€” INT8
  • RerankerModel::LlamaNemotronRerank1BV2Int4Full β€” INT4+INT8-emb (recommended)

Usage with ONNX Runtime (Python)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

session = ort.InferenceSession(
    "model_int4_full.onnx",
    providers=["CPUExecutionProvider"],
)
tokenizer = AutoTokenizer.from_pretrained(
    "nvidia/llama-nemotron-rerank-1b-v2", trust_remote_code=True
)

query = "what is a panda?"
documents = [
    "A panda is a large black-and-white bear native to China.",
    "The sky is blue and the grass is green.",
]

enc = tokenizer(
    [query] * len(documents),
    documents,
    padding=True,
    truncation=True,
    max_length=512,  # the model supports up to 8192 tokens; 512 keeps the example fast
    return_tensors="np",
)
logits = session.run(
    ["logits"],
    {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    },
)[0]
scores = logits[:, 0].tolist()
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:7.3f}  {doc[:80]}")

For the FP32 and INT4 variants with external data, pass the .onnx file path — ONNX Runtime resolves the .onnx_data sidecar automatically from the same directory.
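For large candidate sets it can help to score documents in fixed-size batches rather than one huge padded batch. A minimal sketch, where `score_batch` is a hypothetical callable (e.g. wrapping the session.run call above) assumed to return one float per document:

```python
def rerank_in_batches(query, documents, score_batch, batch_size=16):
    """Score `documents` against `query` in fixed-size batches.

    `score_batch(query, docs)` is assumed to return one float per document,
    e.g. by wrapping the ONNX Runtime call shown earlier. Returns
    (document, score) pairs sorted by descending relevance.
    """
    scores = []
    for start in range(0, len(documents), batch_size):
        chunk = documents[start:start + batch_size]
        scores.extend(score_batch(query, chunk))
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)

# Toy scorer (document length as "relevance") just to exercise the batching.
toy = lambda q, docs: [float(len(d)) for d in docs]
print(rerank_in_batches("q", ["aa", "a", "aaa"], toy, batch_size=2))
```

Smaller batches trade a little throughput for bounded memory, since padding only grows to the longest document within each chunk.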

Export details

  • Base model: nvidia/llama-nemotron-rerank-1b-v2 (LLaMA-3.2-1B with bidirectional attention, sequence-classification head)
  • Exported with: optimum 1.23.3 + transformers 4.48.3, opset 17, TorchScript path (dynamo=False)
  • Custom OnnxConfig required due to LlamaBidirectionalForSequenceClassification architecture (trust_remote_code=True)
  • INT8: onnxruntime.quantization.quantize_dynamic, weight_type=QInt8, MatMulConstBOnly=True
  • INT4Full: MatMulNBitsQuantizer(bits=4, block_size=32, is_symmetric=True) applied first, then quantize_dynamic for Gather (embedding) INT8

Model Overview

The Llama Nemotron Reranking 1B model outputs a logit score representing how relevant a document is to a given query. It is fine-tuned for multilingual and cross-lingual text question-answering retrieval, supports long documents (up to 8192 tokens), and was evaluated on 26 languages.

Architecture

Architecture Type: Transformer
Network Architecture: Fine-tuned ranker from meta-llama/Llama-3.2-1B with bidirectional attention.

The model is a transformer cross-encoder fine-tuned with contrastive learning. Bidirectional attention is applied during fine-tuning for higher accuracy. Mean pooling over the last decoder layer's outputs feeds a binary classification head that produces the ranking score.
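The pooling-and-head step can be sketched in NumPy; the hidden states, head weights, and dimensions below are toy stand-ins, not the real model internals:

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Mean over non-padding positions of the last hidden states.

    hidden: [batch, seq_len, dim], attention_mask: [batch, seq_len].
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)  # [b, s, 1]
    summed = (hidden * mask).sum(axis=1)                    # [b, d]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)          # [b, 1]
    return summed / counts

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))            # toy [batch, seq, dim]
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])  # padded positions masked out
pooled = masked_mean_pool(hidden, mask)        # [2, 8]
W, b = rng.normal(size=(8, 1)), np.zeros(1)    # toy classification head
logits = pooled @ W + b                        # [2, 1] relevance logits
print(logits.shape)  # (2, 1)
```

Masking before the mean matters: without it, padding tokens would dilute the pooled representation of shorter sequences.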

Inputs / Outputs

  • Input: input_ids [batch, seq_len], attention_mask [batch, seq_len]
  • Output: logits [batch, 1] β€” raw relevance score per (query, document) pair. Apply sigmoid for probabilities.

Evaluation Results

Pipeline                                                  Avg Recall@5 (NQ, HotpotQA, FiQA, TechQA)
llama-nemotron-embed-1b-v2 + llama-nemotron-rerank-1b-v2  73.64%
llama-nemotron-embed-1b-v2                                68.60%
nv-embedqa-e5-v5 + nv-rerankQA-mistral-4b-v3              75.45%

Pipeline                                                  Avg Recall@5 (MIRACL, 15 languages)
llama-nemotron-embed-1b-v2 + llama-nemotron-rerank-1b-v2  65.80%
nv-embedqa-mistral-7b-v2                                  50.42%
BM25                                                      26.51%

Pipeline                                                  Avg Recall@5 (MLQA cross-lingual, 42 language pairs)
llama-nemotron-embed-1b-v2 + llama-nemotron-rerank-1b-v2  86.83%
nv-embedqa-mistral-7b-v2                                  68.38%
BM25                                                      13.01%

License

Use of this model is governed by the NVIDIA Open Model License Agreement. Additional Information: Llama 3.2 Community Model License Agreement.
