# GLiNER4j ONNX

ONNX export of fastino-ai/gliner2-base for Java inference via ONNX Runtime. Part of the GLiNER4j project.
## Supported Tasks
| Task | Description |
|---|---|
| Named Entity Recognition | Extract typed entity spans from text with confidence scores |
| Text Classification | Assign labels to text with multi-label support and confidence scores |
Both tasks support entity/label descriptions for improved accuracy and per-call overrides without model reloading.
## Repository Structure

```
.
├── gliner4j_config.json     # Shared model configuration
├── tokenizer.json           # Shared HuggingFace tokenizer
├── tokenizer_config.json
├── onnx/                    # Base FP32 (~830 MB)
│   ├── encoder.onnx
│   ├── span_rep.onnx
│   ├── scoring_head.onnx
│   └── classifier_head.onnx
├── onnx_fp16/               # FP16 (~416 MB, ~50% smaller)
│   ├── encoder.onnx
│   ├── span_rep.onnx
│   ├── scoring_head.onnx
│   └── classifier_head.onnx
└── onnx_quantized/          # INT8 dynamic quantization (~208 MB, ~75% smaller)
    ├── encoder.onnx
    ├── span_rep.onnx
    ├── scoring_head.onnx
    └── classifier_head.onnx
```
## Model Architecture

The model is split into four ONNX modules for modular inference:

| Module | Description |
|---|---|
| `encoder.onnx` | DeBERTaV2 transformer encoder (shared) |
| `span_rep.onnx` | Span representation layer (NER) |
| `scoring_head.onnx` | Count-aware scoring head (NER) |
| `classifier_head.onnx` | Classifier head MLP (Classification) |
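A rough sketch of how the four modules compose: the encoder runs once, then its hidden states feed either the NER branch (span representations, then scoring) or the classification branch. The stub methods, shapes, and names below are assumptions for illustration only; the real inputs and outputs live in the ONNX graphs.

```java
// Stub sketch of the four-module pipeline. The real modules are ONNX graphs
// executed through ONNX Runtime; these stubs only show which output feeds
// which input. All shapes and method names are illustrative assumptions.
public class PipelineSketch {
    static final int HIDDEN = 768;       // hidden size (see Configuration)
    static final int MAX_SPAN_WIDTH = 8; // max span width (see Configuration)

    // encoder.onnx: shared DeBERTaV2 pass, one hidden vector per token
    static float[][] encode(int numTokens) {
        return new float[numTokens][HIDDEN];
    }

    // span_rep.onnx (NER branch): one vector per candidate span
    static float[][] spanRep(float[][] tokenStates) {
        int n = tokenStates.length, spans = 0;
        for (int start = 0; start < n; start++) {
            spans += Math.min(MAX_SPAN_WIDTH, n - start); // widths 1..8
        }
        return new float[spans][HIDDEN];
    }

    // scoring_head.onnx (NER branch): one score per (span, entity type)
    static float[][] scoringHead(float[][] spanReps, int numEntityTypes) {
        return new float[spanReps.length][numEntityTypes];
    }

    // classifier_head.onnx (classification branch): one score per label
    static float[] classifierHead(float[][] tokenStates, int numLabels) {
        return new float[numLabels];
    }

    public static void main(String[] args) {
        float[][] hidden = encode(10);                         // shared encoder pass
        float[][] nerScores = scoringHead(spanRep(hidden), 4); // NER branch
        float[] labelScores = classifierHead(hidden, 2);       // classification branch
        System.out.println(nerScores.length + " span scores x 4 types, "
                + labelScores.length + " label scores");
    }
}
```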
## Variants

| Variant | Folder | Precision | Size | Use case |
|---|---|---|---|---|
| Base | `onnx/` | FP32 | ~830 MB | Maximum accuracy |
| FP16 | `onnx_fp16/` | FP16 | ~416 MB | Good accuracy/size trade-off |
| Quantized | `onnx_quantized/` | INT8 | ~208 MB | Smallest footprint, fastest on CPU |
To download a specific variant only:

```shell
huggingface-cli download <repo> --include "onnx_fp16/*" "*.json"
```
## Configuration
| Parameter | Value |
|---|---|
| Hidden size | 768 |
| Max span width | 8 |
| Max count | 20 |
| Span mode | SpanMarkerV0 |
| Token pooling | first |
| ONNX opset | 17 |
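To make "Max span width 8" concrete: for NER, every contiguous token run of length 1 to 8 is a candidate span; longer runs are never scored. The enumeration below is a sketch of that rule, not GLiNER4j's actual code.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of NER candidate-span generation under "Max span width 8":
// every contiguous run of 1..maxWidth tokens is a candidate.
public class SpanCandidates {
    static List<int[]> candidates(int numTokens, int maxWidth) {
        List<int[]> spans = new ArrayList<>();
        for (int start = 0; start < numTokens; start++) {
            for (int width = 1; width <= maxWidth && start + width <= numTokens; width++) {
                spans.add(new int[]{start, start + width - 1}); // inclusive end index
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // 4 tokens, max width 8: all 4+3+2+1 = 10 contiguous spans qualify
        System.out.println(candidates(4, 8).size());  // prints 10
        // 32 tokens: spans longer than 8 tokens are excluded
        System.out.println(candidates(32, 8).size()); // prints 228
    }
}
```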
## Usage

Use with GLiNER4j, a Java library for GLiNER2 inference via ONNX Runtime.

### Named Entity Recognition

```java
var entities = List.of(
    new EntityDefinition("person", "Names of individuals"),
    new EntityDefinition("organization", "Company or institution names")
);

var gliner = GLiNER4jNER.load(modelDir, entities);
Map<String, List<EntitySpan>> results = gliner.extract("John works at Google.");
```
### Text Classification

```java
var labels = List.of(
    new ClassificationLabel("positive", "Expresses positive sentiment"),
    new ClassificationLabel("negative", "Expresses negative sentiment")
);

var classifier = GLiNER4jClassifier.load(modelDir, labels);
List<ClassificationResult> results = classifier.classify("Great product!");
```
### Model Variants

```java
// FP16 variant
var glinerFp16 = GLiNER4jNER.load(modelDir, entities, "onnx_fp16");
var classifierFp16 = GLiNER4jClassifier.load(modelDir, labels, "onnx_fp16");

// Quantized variant
var glinerInt8 = GLiNER4jNER.load(modelDir, entities, "onnx_quantized");
var classifierInt8 = GLiNER4jClassifier.load(modelDir, labels, "onnx_quantized");
```
## Features
- Entity/Label Descriptions: Provide natural language descriptions alongside entity types or classification labels to improve model accuracy
- Per-call Overrides: Change entities or labels at inference time without reloading the model
- Batch Processing: Batched encoder calls with virtual thread parallelism for scoring
- OpenTelemetry: Built-in instrumentation for duration, text count, and result count metrics (zero overhead when no OTel SDK is present)
- Runtime Configuration: Control thread pools, graph optimization level, and model caching
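The per-call override feature follows a common pattern: expensive state (ONNX sessions, tokenizer) is loaded once, while the lightweight label set can differ on each call. The stand-alone class below sketches that pattern; its names and behavior are illustrative only, not the actual GLiNER4j API.

```java
import java.util.List;

// Stand-alone sketch of the per-call override pattern: heavy resources are
// created once in the constructor; each call may swap in a different entity
// set without reloading anything. Names here are illustrative, not the
// real GLiNER4j API.
public class OverrideSketch {
    private final List<String> defaultEntities;

    OverrideSketch(List<String> defaultEntities) {
        // In the real library, model sessions would be created here, once.
        this.defaultEntities = List.copyOf(defaultEntities);
    }

    // Uses the entity types supplied at load time.
    List<String> extract(String text) {
        return extract(text, defaultEntities);
    }

    // Per-call override: a different entity set, no model reload.
    List<String> extract(String text, List<String> entities) {
        // The real model is prompted with `entities`; this stub just echoes them.
        return entities;
    }

    public static void main(String[] args) {
        var ner = new OverrideSketch(List.of("person", "organization"));
        System.out.println(ner.extract("John works at Google."));
        System.out.println(ner.extract("Paris, 2024", List.of("location", "date")));
    }
}
```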
## Benchmarks

Measured with JMH (average time, 15 iterations):

### 4 Entity Types
| Batch Size | Avg Latency (ms/op) | Per-Text (ms) | Throughput (texts/s) |
|---|---|---|---|
| 1 | 26.5 | 26.5 | ~37.7 |
| 4 | 143.5 | 35.9 | ~27.9 |
| 8 | 286.6 | 35.8 | ~27.9 |
### 8 Entity Types
| Batch Size | Avg Latency (ms/op) | Per-Text (ms) | Throughput (texts/s) |
|---|---|---|---|
| 1 | 34.1 | 34.1 | ~29.3 |
| 4 | 174.6 | 43.7 | ~22.9 |
| 8 | 339.2 | 42.4 | ~23.6 |
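The derived columns follow directly from the measured averages: per-text latency is the average op latency divided by the batch size, and throughput is its inverse. A quick check against the last row above:

```java
import java.util.Locale;

// Verifies how the derived benchmark columns relate to the measured average:
// per-text latency = avg op latency / batch size; throughput = 1000 / per-text.
public class BenchmarkMath {
    public static void main(String[] args) {
        double avgLatencyMs = 339.2; // 8 entity types, batch size 8 (from table)
        int batchSize = 8;

        double perTextMs = avgLatencyMs / batchSize; // 42.4 ms
        double throughput = 1000.0 / perTextMs;      // ~23.6 texts/s

        System.out.printf(Locale.ROOT, "%.1f ms/text, %.1f texts/s%n",
                perTextMs, throughput);
    }
}
```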
## License
Apache License 2.0