mxbai-rerank-base-v2 — ONNX FP16

ONNX FP16 export of mixedbread-ai/mxbai-rerank-base-v2 with the full CausalLM scoring head.

Why this export?

The original model is a Qwen2-0.5B CausalLM fine-tuned for reranking (NOT a standard cross-encoder). Existing community ONNX exports omit the LM head, so they output last_hidden_state instead of relevance scores.

This export wraps the full CausalLM to output scores [batch, 1] directly:

score = logits[last_token, "1"] - logits[last_token, "0"]
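
In other words, the score is the difference between two logits at the final sequence position (which, with LEFT padding, is always the last real token). A minimal numpy sketch of this reduction, using the `yes_loc`/`no_loc` token ids from the table below (`score_from_logits` is an illustrative name, not part of the export):

```python
import numpy as np

YES_LOC = 16  # token id of "1"
NO_LOC = 15   # token id of "0"

def score_from_logits(logits: np.ndarray) -> np.ndarray:
    """logits: [batch, seq, vocab] -> scores: [batch, 1].

    With LEFT padding the last position holds the final real token,
    so the relevance score is read from logits[:, -1, :].
    """
    last = logits[:, -1, :]  # [batch, vocab]
    return (last[:, YES_LOC] - last[:, NO_LOC])[:, None]
```

The exported model performs this reduction inside the graph, which is why it emits `scores [batch, 1]` instead of raw logits.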

Model Details

Property Value
Architecture Qwen2ForCausalLM (wrapped)
Parameters ~494M
Format ONNX FP16
Size ~947 MB
Inputs input_ids, attention_mask
Output scores [batch, 1]
Padding LEFT (pad_token_id=151643, `<|endoftext|>`)
yes_loc 16 (token "1")
no_loc 15 (token "0")

Usage

Input must be LEFT-PADDED and use the chat template prompt from reranker_config.json:

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
query: {query}
document: {document}
{task_prompt}<|im_end|>
<|im_start|>assistant

Quantization Notes

  • FP16: Nearly lossless (max divergence ~0.02 vs FP32). Recommended.
  • INT8: Unusable — divergence of 7+ due to CausalLM activation ranges.
  • INT4: Incompatible with ONNX Runtime quantization tools.
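
The divergence figures above can be reproduced by scoring the same batch with an FP32 session and the FP16 session and comparing the outputs. A tiny helper for that comparison (a hypothetical name, not part of this repo):

```python
import numpy as np

def max_divergence(a: np.ndarray, b: np.ndarray) -> float:
    """Max absolute element-wise difference between two score tensors."""
    return float(np.max(np.abs(a.astype(np.float64) - b.astype(np.float64))))
```

For FP16, this stays around 0.02 against FP32 scores, which does not change rankings in practice.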

Files

  • model_fp16.onnx — ONNX FP16 model (947 MB)
  • reranker_config.json — Prompt template + token IDs
  • tokenizer.json — Qwen2 tokenizer
  • tokenizer_config.json — Tokenizer configuration
  • special_tokens_map.json — Special tokens

Export Script

See convert-mxbai-rerank-to-onnx.py
