mxbai-rerank-base-v2 (ONNX FP16)
ONNX FP16 export of mixedbread-ai/mxbai-rerank-base-v2 with the full CausalLM scoring head.
Why this export?
The original model is a Qwen2-0.5B CausalLM fine-tuned for reranking (NOT a standard cross-encoder). Existing community ONNX exports omit the LM head, so they output last_hidden_state instead of relevance scores.
This export wraps the full CausalLM to output scores [batch, 1] directly:
score = logits[last_token, "1"] - logits[last_token, "0"]
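Concretely, with left padding the final real token of every sequence sits at position -1, so the score is just a difference of two vocabulary logits (yes_loc = 16, no_loc = 15, per the table below). A toy illustration with made-up logits:

```python
import numpy as np

yes_loc, no_loc = 16, 15  # vocabulary ids of tokens "1" and "0"

# Toy logits of shape [batch, seq, vocab]; with LEFT padding the last
# real token of every sequence is always at position -1.
logits = np.zeros((2, 4, 32), dtype=np.float32)
logits[0, -1, yes_loc], logits[0, -1, no_loc] = 3.0, 1.0   # looks relevant
logits[1, -1, yes_loc], logits[1, -1, no_loc] = -1.0, 2.0  # looks irrelevant

scores = logits[:, -1, yes_loc] - logits[:, -1, no_loc]    # shape [batch]
print(scores)  # [ 2. -3.]
```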
Model Details
| Property | Value |
|---|---|
| Architecture | Qwen2ForCausalLM (wrapped) |
| Parameters | ~494M |
| Format | ONNX FP16 |
| Size | ~947 MB |
| Inputs | input_ids, attention_mask |
| Output | scores [batch, 1] |
| Padding | LEFT (pad_token_id=151643, `<\|endoftext\|>`) |
| yes_loc | 16 (token "1") |
| no_loc | 15 (token "0") |
Usage
Input must be LEFT-PADDED and use the chat template prompt from reranker_config.json:
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
query: {query}
document: {document}
{task_prompt}<|im_end|>
<|im_start|>assistant
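A minimal inference sketch with onnxruntime, using the input/output names from the table above. The `rerank` helper and the empty `task_prompt` default are illustrative, not part of the model card; the tokenizer must be created with `padding_side="left"` so the last position is always the final real token:

```python
import numpy as np

# Prompt template from reranker_config.json (shown above).
PROMPT = (
    "<|im_start|>system\n"
    "You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "query: {query}\n"
    "document: {document}\n"
    "{task_prompt}<|im_end|>\n"
    "<|im_start|>assistant\n"
)

def rerank(session, tokenizer, query, documents, task_prompt=""):
    """Score documents against a query; a higher score means more relevant.

    `session` is an onnxruntime.InferenceSession over model_fp16.onnx and
    `tokenizer` a Qwen2 tokenizer created with padding_side="left".
    """
    prompts = [PROMPT.format(query=query, document=doc, task_prompt=task_prompt)
               for doc in documents]
    enc = tokenizer(prompts, padding=True, return_tensors="np")
    (scores,) = session.run(
        ["scores"],
        {
            "input_ids": enc["input_ids"].astype(np.int64),
            "attention_mask": enc["attention_mask"].astype(np.int64),
        },
    )
    return scores.reshape(-1).tolist()

# Typical setup (not run here; loads the model and downloads the tokenizer):
# import onnxruntime as ort
# from transformers import AutoTokenizer
# session = ort.InferenceSession("model_fp16.onnx")
# tokenizer = AutoTokenizer.from_pretrained(
#     "mixedbread-ai/mxbai-rerank-base-v2", padding_side="left")
# rerank(session, tokenizer, "what is ONNX?", ["ONNX is ...", "Bananas ..."])
```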
Quantization Notes
- FP16: Nearly lossless (max divergence ~0.02 vs FP32). Recommended.
- INT8: Unusable (divergence 7+ vs FP32) due to CausalLM activation ranges.
- INT4: Incompatible with ONNX Runtime quantization tools.
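The divergence figures above can be reproduced by scoring the same batch at both precisions and comparing elementwise; a small helper (name illustrative):

```python
import numpy as np

def max_score_divergence(scores_a, scores_b):
    """Largest absolute difference between two score arrays (e.g. FP32 vs FP16)."""
    a = np.asarray(scores_a, dtype=np.float64)
    b = np.asarray(scores_b, dtype=np.float64)
    return float(np.max(np.abs(a - b)))

# Example: FP16 rounding typically shifts scores by only hundredths.
print(round(max_score_divergence([1.500, -3.200], [1.512, -3.195]), 3))  # 0.012
```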
Files
- model_fp16.onnx: ONNX FP16 model (947 MB)
- reranker_config.json: Prompt template + token IDs
- tokenizer.json: Qwen2 tokenizer
- tokenizer_config.json: Tokenizer configuration
- special_tokens_map.json: Special tokens
Export Script
Model tree
- Repository: tss-deposium/mxbai-rerank-base-v2-onnx-fp16
- Base model: mixedbread-ai/mxbai-rerank-base-v2