# GLiNER2 Multi v1 - ONNX

ONNX export of fastino/gliner2-multi-v1 for zero-shot NER and relation extraction. Multilingual (100+ languages via mDeBERTa-v3-base); tested on English, German, French, and Spanish.
## Variants
| Variant | Encoder size | Total size | Quality | Recommended |
|---|---|---|---|---|
| fp16 | 530 MB | ~645 MB | 99.9999% cosine vs FP32 | Yes (default) |
| fp32 | 1059 MB | ~1.2 GB | Baseline | When every bit matters |
FP16 uses a hybrid approach: weights are stored in Float16 on disk, and Cast nodes convert them to Float32 at runtime. Inference therefore runs entirely in FP32 arithmetic (the only loss is the FP16 rounding of the stored weights) at half the download and disk size.
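The effect can be mimicked with plain numpy to see why the quality loss is negligible: round-trip a weight matrix through float16 (the on-disk format), cast it back, and compare an FP32 matmul against the FP32 baseline. This is an illustration only, not code from the export pipeline.

```python
import numpy as np

# Toy illustration of the fp16-on-disk / fp32-at-runtime scheme (not the
# actual export code): weights round-trip through float16, compute stays fp32.
rng = np.random.default_rng(0)
w32 = rng.standard_normal((768, 768)).astype(np.float32)  # fp32 baseline weight
w16 = w32.astype(np.float16)                              # what lands on disk
w_runtime = w16.astype(np.float32)                        # what the Cast node restores

x = rng.standard_normal(768).astype(np.float32)
y32 = w32 @ x
y16 = w_runtime @ x
cos = float(y32 @ y16 / (np.linalg.norm(y32) * np.linalg.norm(y16)))
print(w16.nbytes / w32.nbytes)  # half the storage
print(cos)                      # effectively 1.0
```

The weight rounding error of float16 is about 5e-4 relative, which is why the output cosine stays at the 0.999999 level reported in the table above.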
## Why not INT8?
INT8 quantization is not recommended for this model. Our testing revealed that INT8 destroys
precision at special token positions ([R], [E]) used for relation extraction:
| Token | FP32 vs INT8 cosine | FP32 vs FP16 cosine |
|---|---|---|
| [R] head | 0.805 | 0.999999 |
| [R] tail | 0.787 | 0.999999 |
| Regular text | 0.87-0.95 | 0.999999 |
This causes RE scores to flip (e.g., "Tim Cook works_at Apple" becomes "Tim Cook works_at Cupertino").
INT8 is acceptable for NER-only use cases but will produce wrong relation extraction results.
The `encoder_int8.onnx` file is included for NER-only scenarios but is not recommended.
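As a toy numpy illustration (not the real model embeddings) of what those cosine numbers mean: a heavy per-dimension distortion of a 768-d vector, comparable to the INT8 error at the special tokens, lands around 0.8 cosine, while an FP16-scale perturbation is indistinguishable from 1.0.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic 768-d "token embedding" plus two noise levels (illustrative only).
rng = np.random.default_rng(42)
clean = rng.standard_normal(768)
noise = rng.standard_normal(768)

cos_int8_like = cosine(clean, clean + 0.8 * noise)   # ~0.8, like [R] under INT8
cos_fp16_like = cosine(clean, clean + 5e-4 * noise)  # ~1.0, like FP16
print(cos_int8_like, cos_fp16_like)
```

At 0.8 cosine the classifier's dot-product scores can reorder candidate spans, which is the head/tail flip described above; at 0.999999 they cannot.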
## Architecture
5 ONNX models (mDeBERTa-v3-base encoder):
| Model | Inputs | Outputs |
|---|---|---|
| encoder.onnx / encoder_fp16.onnx | input_ids, attention_mask | hidden_state (batch, seq, 768) |
| span_rep.onnx | hidden_states, span_start_idx, span_end_idx | span_representations (batch, spans, 768) |
| count_embed.onnx | label_embeddings | transformed_embeddings |
| count_pred.onnx | schema_embedding | count_logits |
| classifier.onnx | hidden_state | logits |
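Downstream of classifier.onnx, a typical GLiNER-style decoding step applies a sigmoid to the logits and keeps span/label pairs that clear the threshold (0.3 in the usage examples below). A minimal sketch of that step; `filter_spans` is an illustrative helper, not a function shipped with this repo, and the real pipeline's decoding may differ in detail.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def filter_spans(logits, labels, threshold=0.3):
    """Keep (span_index, label, score) triples whose sigmoid score clears
    the threshold. Sketch of GLiNER-style postprocessing, not the shipped code."""
    scores = sigmoid(np.asarray(logits, dtype=np.float32))
    hits = []
    for span_idx in range(scores.shape[0]):
        for label_idx in range(scores.shape[1]):
            if scores[span_idx, label_idx] >= threshold:
                hits.append((span_idx, labels[label_idx], float(scores[span_idx, label_idx])))
    return hits

# Toy logits: 2 candidate spans x 2 labels.
hits = filter_spans([[2.0, -3.0], [-2.0, 1.0]], ["person", "company"])
print(hits)
```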
Special tokens: `[P]`=250104, `[E]`=250106, `[R]`=250107, `[SEP_TEXT]`=250103

- NER schema: `( [P] entities ( [E] person [E] company ) ) [SEP_TEXT] <text>`
- RE schema: `( [P] works_at ( [R] head [R] tail ) ) [SEP_TEXT] <text>`
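These schemas are plain strings prepended to the text, with the bracketed markers corresponding to the single special-token IDs listed above. A small sketch of assembling them; `ner_schema`/`re_schema` are illustrative helper names, not part of the released runtime.

```python
def ner_schema(entity_labels):
    """Build the NER schema prefix, e.g. ( [P] entities ( [E] person [E] company ) )."""
    labels = " ".join(f"[E] {label}" for label in entity_labels)
    return f"( [P] entities ( {labels} ) )"

def re_schema(relation_type):
    """Build the RE schema prefix for one relation type."""
    return f"( [P] {relation_type} ( [R] head [R] tail ) )"

# Full encoder input: schema prefix, separator token, then the raw text.
text = "Bill Gates founded Microsoft."
prompt = ner_schema(["person", "company"]) + " [SEP_TEXT] " + text
print(prompt)
```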
## Usage with engram (Rust, in-process, no sidecar)
```rust
let mut backend = Gliner2Backend::load(&model_dir, "fp16")?;

// NER
let entities = backend.extract_entities(text, &["person", "company", "city"], 0.3)?;

// Relation extraction
let relations = backend.extract_relations(text, &["works_at", "headquartered_in"], 0.3)?;

// Combined
let (entities, relations) = backend.extract_all(text, &ner_labels, &rel_types, 0.3, 0.3)?;
```
## Test Results (10/10 pass, FP16 hybrid)
| Task | Lang | Result | Score |
|---|---|---|---|
| NER: Bill Gates (person), Microsoft (company) | EN | PASS | 100% |
| NER: Tim Cook, Apple, Cupertino | DE | PASS | 100% |
| NER: Emmanuel Macron, France, Élysée | FR | PASS | 99% |
| NER: Elon Musk, Tesla, Austin | ES | PASS | 100% |
| RE: Bill Gates founded Microsoft | EN | PASS | h:98% t:97% |
| RE: Tim Cook works_at Apple | DE | PASS | h:100% t:100% |
| RE: Apple headquartered_in Cupertino | DE | PASS | h:100% t:100% |
| RE: NATO supports Ukraine | DE | PASS | h:100% t:99% |
| RE: Macron leads France | FR | PASS | h:100% t:95% |
| RE: Elon Musk works_at Tesla | ES | PASS | detected |
## Re-export from PyTorch
Use the included `export_gliner2_onnx.py` to export any GLiNER2 model:

```bash
python export_gliner2_onnx.py fastino/gliner2-multi-v1 output_dir/ --quantize
python export_gliner2_onnx.py fastino/gliner2-large-v1 output_dir/ --quantize
```

Requirements:

```bash
pip install gliner2 torch onnx onnxscript onnxruntime
```
## Links
- Base model: fastino/gliner2-multi-v1
- Rust integration: engram
- ONNX runtime (Python): gliner2-onnx
## License
Apache-2.0 (same as base model)