GLiNER2 Large ONNX QInt8 for CPU

This repository contains a dynamically quantized QInt8 ONNX derivative of lmo3/gliner2-large-v1-onnx, intended for CPU inference with ONNX Runtime.

The repository keeps the same lightweight config and tokenizer layout as the upstream ONNX release and replaces the ONNX weights with dynamically quantized CPU-oriented variants.

Base model

  • lmo3/gliner2-large-v1-onnx

Quantization

  • Runtime: ONNX Runtime 1.22.1
  • Method: dynamic quantization
  • Weight type: QInt8
  • Intended provider: CPUExecutionProvider

Important compatibility note

This repository now uses onnx_files.qint8 as the canonical config key for the quantized CPU weights.

To avoid breaking older deployments immediately, onnx_files.fp32 is kept as a deprecated compatibility alias that points to the same QInt8 files. Despite its name, it does not indicate float32 weights in this repository.

The deprecated alias will remain for backward compatibility, but new consumers should select qint8 explicitly.

Files

  • onnx/encoder.onnx
  • onnx/classifier.onnx
  • canonical config key: onnx_files.qint8
  • deprecated compatibility alias: onnx_files.fp32
  • config.json
  • gliner2_config.json
  • tokenizer.json
  • tokenizer_config.json
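A consumer can resolve the canonical config key with a fallback to the deprecated alias and open the graphs on the intended provider. This is a hedged sketch: the exact shape of the "onnx_files" entry is an assumption here (a mapping from variant key to per-graph file paths), so check config.json before relying on it:

```python
import json


def pick_variant(onnx_files):
    """Prefer the canonical qint8 key; fall back to the deprecated alias.

    In this repository both keys point at the same QInt8 files.
    """
    return onnx_files.get("qint8") or onnx_files["fp32"]


def load_sessions(config_path="config.json"):
    # Assumed config shape: {"onnx_files": {"qint8": {"encoder": <path>,
    # "classifier": <path>}, "fp32": {...}}}.
    import onnxruntime as ort  # lazy: only needed when actually loading

    with open(config_path) as f:
        entry = pick_variant(json.load(f)["onnx_files"])
    return {
        part: ort.InferenceSession(path, providers=["CPUExecutionProvider"])
        for part, path in entry.items()
    }
```

Selecting qint8 explicitly keeps the code working if the fp32 alias is eventually dropped.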

Benchmark

Benchmarked locally on Intel Core i7-9750H (AVX2) with CPUExecutionProvider.

variant            size (MB)   avg latency (ms)   p95 latency (ms)   throughput (req/s)
upstream ONNX        1673.28             211.79             263.74                 4.72
this QInt8 repo       622.16             134.98             152.22                 7.41

Relative to the upstream ONNX release on this host:

  • throughput: +57.0%
  • average latency: -36.3%
  • artifact size: -62.8%
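The relative figures follow directly from the table rows; a quick check of the arithmetic:

```python
def pct(new, old):
    """Signed percentage change from old to new, one decimal place."""
    return round(100.0 * (new - old) / old, 1)


upstream = {"size_mb": 1673.28, "avg_ms": 211.79, "thr": 4.72}
qint8 = {"size_mb": 622.16, "avg_ms": 134.98, "thr": 7.41}

print(pct(qint8["thr"], upstream["thr"]))          # 57.0  (throughput)
print(pct(qint8["avg_ms"], upstream["avg_ms"]))    # -36.3 (average latency)
print(pct(qint8["size_mb"], upstream["size_mb"]))  # -62.8 (artifact size)
```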

Validation notes

  • Label agreement with the upstream ONNX release on a 30-sample local benchmark: 30/30
  • Benchmarks are hardware-dependent and should be treated as directional rather than universal.
  • This repository is aimed at CPU inference. For GPU inference, prefer the non-quantized upstream release.
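The label-agreement figure above was a straightforward per-sample comparison. A generic sketch of such a check (the prediction lists are assumed to come from running the upstream and quantized models on the same inputs; how those predictions are produced is model-specific and not shown):

```python
def label_agreement(reference, candidate):
    """Count per-sample label matches between two prediction runs."""
    if len(reference) != len(candidate):
        raise ValueError("prediction lists must be the same length")
    matches = sum(r == c for r, c in zip(reference, candidate))
    return matches, len(reference)
```

On the 30-sample local set described above, this kind of check yielded 30/30 agreement with the upstream release.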

Compatibility

This repository is intended as an ONNX-format CPU-oriented derivative of the upstream model. For task framing, labels, and broader model documentation, refer to the upstream model card.
