	---
	license: apache-2.0
	base_model: Qwen/Qwen3-Reranker-4B
	base_model_relation: quantized
	library_name: openvino
	tags:
	- openvino
	- openvino-export
	- text-ranking
	- reranker
	- qwen3
	language:
	- multilingual
	pipeline_tag: text-ranking
	---

	# Qwen3-Reranker-4B — OpenVINO IR (INT8 weight-only, asymmetric)

	> This is a redistribution. For the model's intended use, instruction
	> format, full evaluation (MTEB-R / CMTEB-R / MMTEB-R / MLDR / MTEB-Code /
	> FollowIR), and citation, please see the upstream card:
	> [Qwen/Qwen3-Reranker-4B](https://huggingface.co/Qwen/Qwen3-Reranker-4B).

	OpenVINO IR conversion of
	[Qwen/Qwen3-Reranker-4B](https://huggingface.co/Qwen/Qwen3-Reranker-4B),
	weight-only quantized to INT8 asymmetric via NNCF. Intended for the
	[OpenArc](https://github.com/SearchSavior/OpenArc) reranker engine and
	`optimum-intel` pipelines targeting Intel CPUs / iGPUs / dGPUs / NPUs.
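
To make "INT8 asymmetric" concrete, the following pure-Python sketch quantizes one small group of weights to unsigned 8-bit codes with a scale and an integer offset, then reconstructs them. It only illustrates the general scheme and is not NNCF's implementation: the names `quantize_asymmetric` and `dequantize` are invented for this example, and real toolkits operate on tensors, typically per output channel.

```python
def quantize_asymmetric(weights):
    """Map floats to unsigned 8-bit codes using a scale and an integer offset."""
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant group, where the range would be zero.
    scale = (w_max - w_min) / 255 or 1.0
    # Integer offset chosen so that w_min maps to code 0.
    zero_point = round(-w_min / scale)
    codes = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Reconstruct approximate floats from the 8-bit codes."""
    return [(q - zero_point) * scale for q in codes]
```

Because the offset shifts the grid, the full 0..255 range covers `[w_min, w_max]` even when the weights are not centered on zero, which is the practical difference from a symmetric scheme.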

	## Files

	- `openvino_model.{xml,bin}` — Qwen3 (4B) decoder, INT8 weights (~4.0 GB)
	- `openvino_tokenizer.{xml,bin}` / `openvino_detokenizer.{xml,bin}` — OpenVINO Tokenizers IR
	- `chat_template.jinja`, `generation_config.json`
	- Standard HF tokenizer files: `tokenizer.json`, `tokenizer_config.json`,
	`special_tokens_map.json`, `vocab.json`, `merges.txt`
	- `LICENSE`, `NOTICE` — Apache-2.0 with attribution to the upstream Qwen Team.

	## Architecture

	| Property | Value |
	|---|---|
	| Base model | Qwen3 ForCausalLM (Qwen3-4B-Base) |
	| Hidden size | 2560 |
	| Layers | 36 |
	| Attention heads / KV heads | 32 / 8 |
	| Max position | 40 960 |
	| Vocabulary | 151 669 |
	| Source dtype | bfloat16 |
	| Quantization | NNCF INT8 weight-only, asymmetric |

	## Usage with OpenArc

	```bash
	openarc add qwen3-4b-reranker \
	  --model-path /path/to/Qwen3-Reranker-4B-int8-ov \
	  --model-type rerank \
	  --engine optimum \
	  --device GPU

	openarc serve
	# POST /v1/rerank {"model": "qwen3-4b-reranker", "query": "...", "documents": [...]}
	```
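
For callers that prefer Python over raw HTTP, the request in the comment above could be built roughly as follows. This is a sketch using only the standard library. The base URL is an assumption (the address `openarc serve` listens on is not stated here), the helper names are hypothetical, any authentication header the server may require is not shown, and the response is returned undecoded beyond JSON parsing because its schema is not documented in this card.

```python
import json
import urllib.request


def build_rerank_request(base_url, model, query, documents):
    """Build a POST request for the /v1/rerank endpoint shown above."""
    payload = {"model": model, "query": query, "documents": list(documents)}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/rerank",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def rerank(base_url, model, query, documents, timeout=60):
    """Send the request and return the decoded JSON body as-is."""
    request = build_rerank_request(base_url, model, query, documents)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example call (needs a running server; the address below is an assumption):
# rerank("http://localhost:8000", "qwen3-4b-reranker", "my query", ["doc a", "doc b"])
```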

	## Conversion notes

	The standard CLI route currently fails silently for this model on
	`optimum-intel @ HEAD`:

	```bash
	optimum-cli export openvino --weight-format int8 \
	  --model Qwen/Qwen3-Reranker-4B ./out
	# exits 0; openvino_model.xml is 0 bytes, openvino_model.bin is a ~13 MB stub
	```

	The Python API path produces a usable model:

	```python
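	from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

	# 8-bit weight-only quantization; sym=False selects the asymmetric scheme.
	quant = OVWeightQuantizationConfig(bits=8, sym=False)

	# export=True converts the Hugging Face checkpoint to OpenVINO IR on load.
	m = OVModelForCausalLM.from_pretrained(
	    "Qwen/Qwen3-Reranker-4B",
	    export=True,
	    quantization_config=quant,
	    trust_remote_code=True,
	)

	# Writes the exported IR into ./out.
	m.save_pretrained("./out")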
	```
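
Since the CLI failure described above exits with status 0, it may be worth checking the output directory before using it. The sketch below flags the two symptoms mentioned there: an empty `openvino_model.xml` and an undersized `openvino_model.bin`. The function name `check_export` and the 1 GB default threshold are assumptions for illustration; the threshold simply sits between the ~13 MB stub and the ~4.0 GB file listed under Files.

```python
from pathlib import Path


def check_export(out_dir, min_bin_bytes=1_000_000_000):
    """Return a list of problems found in an exported OpenVINO IR directory."""
    out = Path(out_dir)
    xml_path = out / "openvino_model.xml"
    bin_path = out / "openvino_model.bin"
    problems = []
    # A 0-byte XML is the symptom of the silent CLI failure.
    if not xml_path.is_file() or xml_path.stat().st_size == 0:
        problems.append("openvino_model.xml is missing or empty")
    if not bin_path.is_file():
        problems.append("openvino_model.bin is missing")
    elif bin_path.stat().st_size < min_bin_bytes:
        # Hypothetical threshold: a stub is far smaller than real weights.
        problems.append(
            f"openvino_model.bin is only {bin_path.stat().st_size} bytes"
        )
    return problems
```

An empty list from `check_export("./out")` means neither symptom was found; it does not prove the model loads or scores correctly.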

	The tokenizer and detokenizer IR were generated separately via
	`openvino_tokenizers.convert_tokenizer(..., with_detokenizer=True)`.
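
A tokenizer export along these lines might look like the sketch below. The arguments actually used for this repo are elided above, so loading the tokenizer with `AutoTokenizer` and passing it as the first argument is an assumption, as are the helper names `tokenizer_ir_paths` and `export_tokenizer_ir`. It presumes the `openvino`, `openvino-tokenizers`, and `transformers` packages are installed.

```python
from pathlib import Path


def tokenizer_ir_paths(out_dir):
    """Return the XML paths; file names match those listed under Files."""
    out = Path(out_dir)
    return out / "openvino_tokenizer.xml", out / "openvino_detokenizer.xml"


def export_tokenizer_ir(model_id, out_dir):
    """Convert a Hugging Face tokenizer to tokenizer and detokenizer IR."""
    # Heavy dependencies are imported lazily so this module loads without them.
    import openvino as ov
    from openvino_tokenizers import convert_tokenizer
    from transformers import AutoTokenizer

    hf_tokenizer = AutoTokenizer.from_pretrained(model_id)
    ov_tokenizer, ov_detokenizer = convert_tokenizer(
        hf_tokenizer, with_detokenizer=True
    )
    tok_path, detok_path = tokenizer_ir_paths(out_dir)
    # save_model writes the .xml and a matching .bin next to it.
    ov.save_model(ov_tokenizer, str(tok_path))
    ov.save_model(ov_detokenizer, str(detok_path))
    return tok_path, detok_path
```

For example, `export_tokenizer_ir("Qwen/Qwen3-Reranker-4B", "./out")` would place the tokenizer IR in the same directory as the exported model.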

	A more detailed walkthrough lives in
	[OpenArc/docs/openvino_qwen3.md](https://github.com/SearchSavior/OpenArc).

	## License

	Apache-2.0, inherited from
	[Qwen/Qwen3-Reranker-4B](https://huggingface.co/Qwen/Qwen3-Reranker-4B).
	See `LICENSE` and `NOTICE` in this repo.

	## Citation

	From the upstream model card:

	```bibtex
	@article{qwen3embedding,
	title={Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models},
	author={Zhang, Yanzhao and Li, Mingxin and Long, Dingkun and Zhang, Xin and Lin, Huan and Yang, Baosong and Xie, Pengjun and Yang, An and Liu, Dayiheng and Lin, Junyang and Huang, Fei and Zhou, Jingren},
	journal={arXiv preprint arXiv:2506.05176},
	year={2025}
	}
	```