# llama-nemotron-rerank-1b-v2 ONNX Export
ONNX export of nvidia/llama-nemotron-rerank-1b-v2 for CPU/GPU inference via ONNX Runtime and fastembed-rs.
## Available variants

| File | Size | Description |
|---|---|---|
| `model.onnx` + `model.onnx_data` | 4.6 GB | FP32, full precision |
| `model_int8.onnx` | 1.2 GB | INT8 dynamic quantization (all ops) |
| `model_int4_full.onnx` | 832 MB | INT4 MatMul + INT8 embedding (smallest) |
All variants produce a `logits [batch, 1]` output and pass the relevance sanity check (the relevant panda document scores significantly higher than the irrelevant one).
## Usage with fastembed-rs

```toml
[dependencies]
fastembed = "5"
```

```rust
use fastembed::{TextRerank, RerankInitOptions, RerankerModel};

let mut model = TextRerank::try_new(
    RerankInitOptions::new(RerankerModel::LlamaNemotronRerank1BV2Int4Full)
        .with_show_download_progress(true),
)?;

let results = model.rerank(
    "what is a panda?",
    vec![
        "A panda is a large black-and-white bear native to China.",
        "The sky is blue and the grass is green.",
    ],
    true,
    None,
)?;
// results[0].score ≈ 1.50, results[1].score ≈ -6.87
```
Available `RerankerModel` variants:

- `RerankerModel::LlamaNemotronRerank1BV2` (FP32)
- `RerankerModel::LlamaNemotronRerank1BV2Int8` (INT8)
- `RerankerModel::LlamaNemotronRerank1BV2Int4Full` (INT4 + INT8 embedding, recommended)
## Usage with ONNX Runtime (Python)

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

session = ort.InferenceSession(
    "model_int4_full.onnx",
    providers=["CPUExecutionProvider"],
)
tokenizer = AutoTokenizer.from_pretrained(
    "nvidia/llama-nemotron-rerank-1b-v2", trust_remote_code=True
)

query = "what is a panda?"
documents = [
    "A panda is a large black-and-white bear native to China.",
    "The sky is blue and the grass is green.",
]

# Tokenize (query, document) pairs together, as a cross-encoder expects
enc = tokenizer(
    [query] * len(documents),
    documents,
    padding=True,
    truncation=True,
    max_length=512,
    return_tensors="np",
)

logits = session.run(
    ["logits"],
    {
        "input_ids": enc["input_ids"].astype(np.int64),
        "attention_mask": enc["attention_mask"].astype(np.int64),
    },
)[0]

scores = logits[:, 0].tolist()
ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:7.3f}  {doc[:80]}")
```
For the FP32 and INT4 variants with external data, pass the `.onnx` file path; ONNX Runtime resolves the `.onnx_data` sidecar automatically from the same directory.
## Export details

- Base model: `nvidia/llama-nemotron-rerank-1b-v2` (LLaMA-3.2-1B with bidirectional attention and a sequence-classification head)
- Exported with `optimum` 1.23.3 + `transformers` 4.48.3, opset 17, TorchScript path (`dynamo=False`)
- Custom `OnnxConfig` required for the `LlamaBidirectionalForSequenceClassification` architecture (`trust_remote_code=True`)
- INT8: `onnxruntime.quantization.quantize_dynamic` with `weight_type=QInt8`, `MatMulConstBOnly=True`
- INT4Full: `MatMulNBitsQuantizer(bits=4, block_size=32, is_symmetric=True)` applied first, then `quantize_dynamic` for INT8 Gather (embedding) ops
## Model Overview

The Llama Nemotron Reranking 1B model produces a logit score representing how relevant a document is to a given query. It is fine-tuned for multilingual and cross-lingual text question-answering retrieval, supports long documents (up to 8192 tokens), and has been evaluated on 26 languages.
## Architecture

**Architecture Type:** Transformer
**Network Architecture:** Fine-tuned ranker from meta-llama/Llama-3.2-1B with bidirectional attention.
The model is a transformer cross-encoder fine-tuned with contrastive learning. Bidirectional attention is applied during fine-tuning for higher accuracy. Mean pooling over the last decoder output is used with a binary classification head for ranking.
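The mean-pooling-plus-head step described above can be sketched in NumPy. This is a minimal illustration of the mechanism only, not the model's actual code; `hidden_states`, `W`, and `b` are random placeholders for the real decoder outputs and head parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 4, 8
hidden_states = rng.standard_normal((batch, seq_len, hidden))  # placeholder decoder output
attention_mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])        # 0 marks padding

# Mean pooling over non-padded positions only
mask = attention_mask[:, :, None].astype(hidden_states.dtype)
pooled = (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)  # [batch, hidden]

# Binary classification head: one linear projection to a single logit
W = rng.standard_normal((hidden, 1))  # placeholder head weights
b = np.zeros(1)
logits = pooled @ W + b  # [batch, 1], matching the ONNX model's output shape
```

The masked mean ensures padding tokens do not dilute the pooled representation, which is why `attention_mask` is a required model input.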
## Inputs / Outputs

- Input: `input_ids [batch, seq_len]`, `attention_mask [batch, seq_len]`
- Output: `logits [batch, 1]`, a raw relevance score per (query, document) pair. Apply a sigmoid for probabilities.
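The sigmoid conversion from raw logits to probabilities is a one-liner; a quick NumPy sketch using the example scores from the fastembed-rs snippet above:

```python
import numpy as np

def sigmoid(x):
    # Logistic function mapping raw logits to (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# Example logits for the relevant and irrelevant documents
logits = np.array([1.50, -6.87])
probs = sigmoid(logits)
# probs[0] ≈ 0.818 (relevant), probs[1] ≈ 0.001 (irrelevant)
```

Note that sigmoid is monotonic, so it changes the scale but never the ranking order of the documents.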
## Evaluation Results

| Pipeline | Avg Recall@5 (NQ, HotpotQA, FiQA, TechQA) |
|---|---|
| llama-nemotron-embed-1b-v2 + llama-nemotron-rerank-1b-v2 | 73.64% |
| llama-nemotron-embed-1b-v2 | 68.60% |
| nv-embedqa-e5-v5 + nv-rerankQA-mistral-4b-v3 | 75.45% |

| Pipeline | Avg Recall@5 (MIRACL, 15 languages) |
|---|---|
| llama-nemotron-embed-1b-v2 + llama-nemotron-rerank-1b-v2 | 65.80% |
| nv-embedqa-mistral-7b-v2 | 50.42% |
| BM25 | 26.51% |

| Pipeline | Avg Recall@5 (MLQA cross-lingual, 42 language pairs) |
|---|---|
| llama-nemotron-embed-1b-v2 + llama-nemotron-rerank-1b-v2 | 86.83% |
| nv-embedqa-mistral-7b-v2 | 68.38% |
| BM25 | 13.01% |
## License
Use of this model is governed by the NVIDIA Open Model License Agreement. Additional Information: Llama 3.2 Community Model License Agreement.