llama-nemotron-rerank-1b-v2 — ONNX Export

ONNX export of nvidia/llama-nemotron-rerank-1b-v2 for CPU/GPU inference via ONNX Runtime and fastembed-rs.

Available variants

File                          Size    Description
model.onnx + model.onnx_data  4.6 GB  FP32, full precision
model_int8.onnx               1.2 GB  INT8 dynamic quantization (all ops)
model_int4_full.onnx          832 MB  INT4 MatMul + INT8 embedding — smallest

All variants produce the same logits [batch, 1] output and pass the relevance sanity check (the relevant panda document scores significantly higher than the irrelevant one).
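One way to spot-check agreement between two variants beyond a single example is to compare the pairwise document orderings their scores induce. A minimal sketch; the score lists below are illustrative placeholders, not measured model output:

```python
def ranking_agreement(scores_a, scores_b):
    """Fraction of document pairs ordered the same way by two score lists."""
    n = len(scores_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:
        return 1.0
    same = sum(
        (scores_a[i] > scores_a[j]) == (scores_b[i] > scores_b[j])
        for i, j in pairs
    )
    return same / len(pairs)

# Illustrative scores (not real output): FP32 vs INT4 on four documents.
fp32_scores = [1.50, -6.87, 0.42, -3.10]
int4_scores = [1.47, -6.91, 0.39, -3.05]
print(ranking_agreement(fp32_scores, int4_scores))  # 1.0 => identical ordering
```

A value of 1.0 means the quantized variant ranks every pair of documents the same way as the reference, even if the absolute logits drift slightly.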

Usage with fastembed-rs

[dependencies]
fastembed = "5"

use fastembed::{TextRerank, RerankInitOptions, RerankerModel};

let mut model = TextRerank::try_new(
    RerankInitOptions::new(RerankerModel::LlamaNemotronRerank1BV2Int4Full)
        .with_show_download_progress(true),
)?;

let results = model.rerank(
    "what is a panda?",
    vec![
        "A panda is a large black-and-white bear native to China.",
        "The sky is blue and the grass is green.",
    ],
    true,
    None,
)?;
// results[0].score ≈ 1.50, results[1].score ≈ -6.87

Available RerankerModel variants:

  • RerankerModel::LlamaNemotronRerank1BV2 β€” FP32
  • RerankerModel::LlamaNemotronRerank1BV2Int8 β€” INT8
  • RerankerModel::LlamaNemotronRerank1BV2Int4Full β€” INT4+INT8-emb (recommended)

Usage with ONNX Runtime (Python)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

session = ort.InferenceSession(
    "model_int4_full.onnx",
    providers=["CPUExecutionProvider"],
)
tokenizer = AutoTokenizer.from_pretrained(
    "nvidia/llama-nemotron-rerank-1b-v2", trust_remote_code=True
)

query = "what is a panda?"
documents = [
    "A panda is a large black-and-white bear native to China.",
    "The sky is blue and the grass is green.",
]

enc = tokenizer(
    [query] * len(documents),
    documents,
    padding=True,
    truncation=True,
    max_length=512,  # the model supports up to 8192 tokens; 512 keeps the example fast
    return_tensors="np",
)
logits = session.run(
    ["logits"],
    {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    },
)[0]
scores = logits[:, 0].tolist()
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:7.3f}  {doc[:80]}")

For the FP32 and INT4 variants with external data, pass the .onnx file path — ONNX Runtime resolves the .onnx_data sidecar automatically from the same directory.
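For large candidate sets it can help to score documents in fixed-size batches rather than one huge padded batch. A minimal sketch, where `score_batch` is a hypothetical callable (e.g. wrapping the session.run call above) assumed to return one float per document:

```python
def rerank_in_batches(query, documents, score_batch, batch_size=16):
    """Score `documents` against `query` in fixed-size batches.

    `score_batch(query, docs)` is assumed to return one float per document,
    e.g. by wrapping the ONNX Runtime call shown earlier. Returns
    (document, score) pairs sorted by descending relevance.
    """
    scores = []
    for start in range(0, len(documents), batch_size):
        chunk = documents[start:start + batch_size]
        scores.extend(score_batch(query, chunk))
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)

# Toy scorer (document length as "relevance") just to exercise the batching.
toy = lambda q, docs: [float(len(d)) for d in docs]
print(rerank_in_batches("q", ["aa", "a", "aaa"], toy, batch_size=2))
```

Smaller batches trade a little throughput for bounded memory, since padding only grows to the longest document within each chunk.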

Export details

  • Base model: nvidia/llama-nemotron-rerank-1b-v2 (LLaMA-3.2-1B with bidirectional attention, sequence-classification head)
  • Exported with: optimum 1.23.3 + transformers 4.48.3, opset 17, TorchScript path (dynamo=False)
  • Custom OnnxConfig required due to LlamaBidirectionalForSequenceClassification architecture (trust_remote_code=True)
  • INT8: onnxruntime.quantization.quantize_dynamic, weight_type=QInt8, MatMulConstBOnly=True
  • INT4Full: MatMulNBitsQuantizer(bits=4, block_size=32, is_symmetric=True) applied first, then quantize_dynamic for Gather (embedding) INT8

Model Overview

The Llama Nemotron Reranking 1B model outputs a logit score representing how relevant a document is to a given query. It is fine-tuned for multilingual and cross-lingual text question-answering retrieval, supports long documents (up to 8192 tokens), and was evaluated on 26 languages.

Architecture

Architecture Type: Transformer
Network Architecture: Fine-tuned ranker from meta-llama/Llama-3.2-1B with bidirectional attention.

The model is a transformer cross-encoder fine-tuned with contrastive learning. Bidirectional attention is applied during fine-tuning for higher accuracy. Mean pooling over the last decoder layer's outputs feeds a binary classification head that produces the ranking score.
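The pooling-and-head step can be sketched in NumPy; the hidden states, head weights, and dimensions below are toy stand-ins, not the real model internals:

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Mean over non-padding positions of the last hidden states.

    hidden: [batch, seq_len, dim], attention_mask: [batch, seq_len].
    """
    mask = attention_mask[:, :, None].astype(hidden.dtype)  # [b, s, 1]
    summed = (hidden * mask).sum(axis=1)                    # [b, d]
    counts = np.clip(mask.sum(axis=1), 1e-9, None)          # [b, 1]
    return summed / counts

rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 4, 8))            # toy [batch, seq, dim]
mask = np.array([[1, 1, 1, 0], [1, 1, 0, 0]])  # padded positions masked out
pooled = masked_mean_pool(hidden, mask)        # [2, 8]
W, b = rng.normal(size=(8, 1)), np.zeros(1)    # toy classification head
logits = pooled @ W + b                        # [2, 1] relevance logits
print(logits.shape)  # (2, 1)
```

Masking before the mean matters: without it, padding tokens would dilute the pooled representation of shorter sequences.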

Inputs / Outputs

  • Input: input_ids [batch, seq_len], attention_mask [batch, seq_len]
  • Output: logits [batch, 1] β€” raw relevance score per (query, document) pair. Apply sigmoid for probabilities.

Evaluation Results

Pipeline                                                  Avg Recall@5 (NQ, HotpotQA, FiQA, TechQA)
llama-nemotron-embed-1b-v2 + llama-nemotron-rerank-1b-v2  73.64%
llama-nemotron-embed-1b-v2                                68.60%
nv-embedqa-e5-v5 + nv-rerankQA-mistral-4b-v3              75.45%

Pipeline                                                  Avg Recall@5 (MIRACL, 15 languages)
llama-nemotron-embed-1b-v2 + llama-nemotron-rerank-1b-v2  65.80%
nv-embedqa-mistral-7b-v2                                  50.42%
BM25                                                      26.51%

Pipeline                                                  Avg Recall@5 (MLQA cross-lingual, 42 language pairs)
llama-nemotron-embed-1b-v2 + llama-nemotron-rerank-1b-v2  86.83%
nv-embedqa-mistral-7b-v2                                  68.38%
BM25                                                      13.01%

License

Use of this model is governed by the NVIDIA Open Model License Agreement. Additional Information: Llama 3.2 Community Model License Agreement.
