mxbai-rerank-base-v2 — ONNX FP16

ONNX FP16 export of mixedbread-ai/mxbai-rerank-base-v2 with the full CausalLM scoring head.

Why this export?

The original model is a Qwen2-0.5B CausalLM fine-tuned for reranking (NOT a standard cross-encoder). Existing community ONNX exports omit the LM head, so they output last_hidden_state instead of relevance scores.

This export wraps the full CausalLM to output scores [batch, 1] directly:

score = logits[last_token, "1"] - logits[last_token, "0"]
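
In other words, the score is the difference between two logits at the final sequence position (which, with LEFT padding, is always the last real token). A minimal numpy sketch of this reduction, using the `yes_loc`/`no_loc` token ids from the table below (`score_from_logits` is an illustrative name, not part of the export):

```python
import numpy as np

YES_LOC = 16  # token id of "1"
NO_LOC = 15   # token id of "0"

def score_from_logits(logits: np.ndarray) -> np.ndarray:
    """logits: [batch, seq, vocab] -> scores: [batch, 1].

    With LEFT padding the last position holds the final real token,
    so the relevance score is read from logits[:, -1, :].
    """
    last = logits[:, -1, :]  # [batch, vocab]
    return (last[:, YES_LOC] - last[:, NO_LOC])[:, None]
```

The exported model performs this reduction inside the graph, which is why it emits `scores [batch, 1]` instead of raw logits.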

Model Details

Property Value
Architecture Qwen2ForCausalLM (wrapped)
Parameters ~494M
Format ONNX FP16
Size ~947 MB
Inputs input_ids, attention_mask
Output scores [batch, 1]
Padding LEFT (pad_token_id=151643, `<|endoftext|>`)
yes_loc 16 (token "1")
no_loc 15 (token "0")

Usage

Input must be LEFT-PADDED and use the chat template prompt from reranker_config.json:

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
query: {query}
document: {document}
{task_prompt}<|im_end|>
<|im_start|>assistant

Quantization Notes

  • FP16: Nearly lossless (max divergence ~0.02 vs FP32). Recommended.
  • INT8: Unusable — divergence of 7+ due to CausalLM activation ranges.
  • INT4: Incompatible with ONNX Runtime quantization tools.
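
The divergence figures above can be reproduced by scoring the same batch with an FP32 session and the FP16 session and comparing the outputs. A tiny helper for that comparison (a hypothetical name, not part of this repo):

```python
import numpy as np

def max_divergence(a: np.ndarray, b: np.ndarray) -> float:
    """Max absolute element-wise difference between two score tensors."""
    return float(np.max(np.abs(a.astype(np.float64) - b.astype(np.float64))))
```

For FP16, this stays around 0.02 against FP32 scores, which does not change rankings in practice.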

Files

  • model_fp16.onnx — ONNX FP16 model (947 MB)
  • reranker_config.json — Prompt template + token IDs
  • tokenizer.json — Qwen2 tokenizer
  • tokenizer_config.json — Tokenizer configuration
  • special_tokens_map.json — Special tokens

Export Script

See convert-mxbai-rerank-to-onnx.py
