---
license: apache-2.0
library_name: pytorch
pipeline_tag: object-detection
tags:
- object-detection
- instance-segmentation
- real-time
- detection-transformer
- d-fine
- tensorrt
- openvino
datasets:
- visdrone
- taco
- coco
language:
- en
model-index:
- name: D-FINE-seg S (TACO, instance segmentation)
results:
- task:
type: instance-segmentation
name: Instance Segmentation
dataset:
name: TACO
type: taco
metrics:
- type: f1
value: 0.281
name: F1@IoU=0.5
- type: latency
value: 3.7
name: Latency (ms, RTX 5070 Ti, TRT FP16, 640x640)
- name: D-FINE S (VisDrone, object detection)
results:
- task:
type: object-detection
name: Object Detection
dataset:
name: VisDrone
type: visdrone
metrics:
- type: f1
value: 0.584
name: F1@IoU=0.5
- type: latency
value: 2.1
name: Latency (ms, RTX 5070 Ti, TRT FP16, 640x640)
---
# D-FINE-seg
**Real-Time Object Detection and Instance Segmentation.**
A DETR-style detector ([D-FINE](https://arxiv.org/abs/2410.13842)) extended with a lightweight
mask head, segmentation-aware training, and mask-aware Hungarian matching. It outperforms
Ultralytics YOLO26 in fine-tuned F1 score on TACO and VisDrone under a unified TensorRT FP16
end-to-end benchmarking protocol, while maintaining competitive latency.
- 📄 **Paper:** [arXiv:2602.23043](https://arxiv.org/abs/2602.23043)
- 💻 **Code:** [github.com/ArgoHA/D-FINE-seg](https://github.com/ArgoHA/D-FINE-seg)
- 🎬 **Video tutorial:** [YouTube](https://youtu.be/_uEyRRw4miY)
- 🧪 **Colab:** [Open in Colab](https://colab.research.google.com/drive/1ZV12qnUQMpC0g3j-0G-tYhmmdM98a41X?usp=sharing)
- 🪪 **License:** Apache 2.0
<p align="center">
<img src="https://raw.githubusercontent.com/ArgoHA/D-FINE-seg/main/assets/det_benchmark.png" width="48%">
<img src="https://raw.githubusercontent.com/ArgoHA/D-FINE-seg/main/assets/seg_benchmark.png" width="48%">
</p>
## Model description
D-FINE-seg adds an instance segmentation head to D-FINE without changing its detection core.
The mask head fuses HybridEncoder PAN features at strides 8/16/32 to 1/4 resolution; per-query
mask embeddings (3-layer MLP) are dot-producted with shared mask features to produce per-instance
masks. Training adds box-cropped BCE + Dice mask losses, mask-aware contrastive denoising,
and mask costs in the Hungarian matcher.
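To make the query-to-mask step concrete, here is a minimal sketch of the dot-product decoding described above (module names, dimensions, and shapes are illustrative assumptions, not the repo's actual code):

```python
import torch
import torch.nn as nn

class MaskHeadSketch(nn.Module):
    """Illustrative sketch of per-query mask decoding (not the repo's module)."""

    def __init__(self, hidden_dim: int = 256, mask_dim: int = 32):
        super().__init__()
        # 3-layer MLP mapping each decoder query to a mask embedding
        self.mask_embed = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, mask_dim),
        )

    def forward(self, queries: torch.Tensor, mask_features: torch.Tensor) -> torch.Tensor:
        # queries:       (B, Q, hidden_dim)       decoder outputs
        # mask_features: (B, mask_dim, H/4, W/4)  fused PAN features at 1/4 resolution
        embeds = self.mask_embed(queries)  # (B, Q, mask_dim)
        # Dot product of each query embedding with the shared mask features
        masks = torch.einsum("bqc,bchw->bqhw", embeds, mask_features)
        return masks.sigmoid()  # per-instance mask probabilities
```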
This is **not** a fork of D-FINE. The detection core follows the
[original D-FINE implementation](https://github.com/Peterande/D-FINE); everything else
(segmentation head, training pipeline, export, inference, augmentations) was reimplemented
from scratch. The mask head design follows the [Mask DINO](https://arxiv.org/abs/2206.02777) paradigm.
## Available checkpoints
All weights are PyTorch `.pt` files. Filename pattern: `dfine[_seg]_<size>_<dataset>.pt`.
### Object detection (COCO-pretrained)
| File | Params (M) | Notes |
|---|---|---|
| `dfine_n_coco.pt` | 3.8 | Nano |
| `dfine_s_coco.pt` | 10.3 | Small |
| `dfine_m_coco.pt` | 19.6 | Medium |
| `dfine_l_coco.pt` | 31.2 | Large |
| `dfine_x_coco.pt` | 62.6 | Extra-Large |
### Object detection (Objects365 → COCO)
`dfine_{s,m,l,x}_obj2coco.pt`: same architectures, pretrained on Objects365, then fine-tuned
on COCO. Generally a stronger init for downstream fine-tuning.
### Instance segmentation (COCO-pretrained)
| File | Params (M) | Notes |
|---|---|---|
| `dfine_seg_n_coco.pt` | 5.1 | Nano |
| `dfine_seg_s_coco.pt` | 11.9 | Small |
| `dfine_seg_m_coco.pt` | 21.2 | Medium |
| `dfine_seg_l_coco.pt` | 32.8 | Large |
| `dfine_seg_x_coco.pt` | 64.3 | Extra-Large |
## Usage
> **Note on `transformers` integration.** This model is not (yet) wrapped as a
> `transformers.AutoModel`. The recommended path is to use the official
> [training/inference repo](https://github.com/ArgoHA/D-FINE-seg); weights auto-download
> from this Hub repo on first use. For an `AutoModel`-style API on a closely related
> architecture, see [`RTDetrV2ForObjectDetection`](https://huggingface.co/docs/transformers/model_doc/rt_detr_v2).
### Option 1: Official repo (recommended)
```bash
git clone https://github.com/ArgoHA/D-FINE-seg.git
cd D-FINE-seg
pip install -r requirements.txt
```
Weights are auto-downloaded from this repo into `pretrained/` on first use. No manual setup
needed; just point at the size and dataset you want:
```python
from src.infer.torch_model import Torch_model
import cv2
model = Torch_model(
model_name="s", # n / s / m / l / x
model_path="pretrained/dfine_seg_s_coco.pt",
n_outputs=80, # COCO classes
input_width=640,
input_height=640,
conf_thresh=0.5,
enable_mask_head=True, # False for detection checkpoints
device="cuda", # cuda / mps / cpu
)
img = cv2.imread("path/to/image.jpg") # BGR
results = model(img) # [{"boxes", "scores", "labels", "masks"?}]
```
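As a hedged follow-up (the exact result layout may differ from this sketch; check the repo's `src.infer` for the canonical format), drawing the detections is a few lines of OpenCV:

```python
# Overlay boxes on the input image, assuming the result layout shown above.
res = results[0] if isinstance(results, list) else results
for box, score, label in zip(res["boxes"], res["scores"], res["labels"]):
    x1, y1, x2, y2 = map(int, box)
    cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(img, f"{label}: {score:.2f}", (x1, max(y1 - 5, 0)),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
cv2.imwrite("pred.jpg", img)
```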
### Option 2: Direct download with `huggingface_hub`
```python
from huggingface_hub import hf_hub_download
ckpt = hf_hub_download(
repo_id="ArgoSA/D-FINE-seg",
filename="dfine_seg_s_coco.pt",
)
# Then load with the official repo's Torch_model (see Option 1).
```
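With the repo from Option 1 on your `PYTHONPATH`, the downloaded path plugs straight into `Torch_model` (same arguments as shown above):

```python
from src.infer.torch_model import Torch_model

model = Torch_model(
    model_name="s",
    model_path=ckpt,       # path returned by hf_hub_download
    n_outputs=80,
    input_width=640,
    input_height=640,
    conf_thresh=0.5,
    enable_mask_head=True,
    device="cuda",
)
```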
### Option 3: Gradio demo
From the cloned repo root, launch the interactive demo:
```bash
python -m demo.demo
```
## Training data
| Use case | Datasets used |
|---|---|
| COCO detection / segmentation pretraining | [COCO 2017](https://cocodataset.org/) |
| Objects365 → COCO checkpoints | [Objects365](https://www.objects365.org/) → COCO 2017 |
| Reported drone benchmarks | [VisDrone](https://github.com/VisDrone/VisDrone-Dataset) (~6.5k train / ~550 val / ~1.6k test-dev) |
| Reported waste benchmarks | [TACO](http://tacodataset.org/) (1500 images, 59 effective classes, 86/14 batch-ID split) |
## Benchmarks
End-to-end latency (preprocessing + forward + postprocessing), RTX 5070 Ti, TensorRT FP16,
640×640, batch size 1. F1 score at IoU 0.5.
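For a rough reproduction of this protocol with the `Torch_model` wrapper from Option 1 (our sketch, not the paper's exact harness), a warm-up-then-average loop suffices:

```python
import time

import numpy as np

def bench(model, img, warmup=50, iters=200):
    # Torch_model wraps preprocessing + forward + postprocessing,
    # so timing its __call__ approximates end-to-end latency.
    for _ in range(warmup):
        model(img)  # let cuDNN/TensorRT autotuning settle
    t0 = time.perf_counter()
    for _ in range(iters):
        model(img)
    return (time.perf_counter() - t0) / iters * 1e3  # ms per image

img = (np.random.rand(1080, 1920, 3) * 255).astype(np.uint8)  # dummy BGR frame
print(f"{bench(model, img):.2f} ms end-to-end")
```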
### VisDrone: object detection (test-dev)
| Model | F1 | IoU | Latency (ms) |
|---|---|---|---|
| **D-FINE N** | **0.531** | 0.288 | 1.6 |
| YOLO26 N | 0.455 | 0.226 | 2.8 |
| **D-FINE S** | **0.584** | 0.332 | 2.1 |
| YOLO26 S | 0.510 | 0.264 | 3.1 |
| **D-FINE M** | **0.605** | 0.351 | 2.7 |
| YOLO26 M | 0.562 | 0.301 | 3.6 |
| **D-FINE L** | **0.606** | 0.351 | 3.3 |
| YOLO26 L | 0.568 | 0.308 | 4.1 |
| **D-FINE X** | **0.611** | 0.354 | 4.5 |
| YOLO26 X | 0.584 | 0.319 | 5.3 |
### TACO: instance segmentation
| Model | F1 | IoU | Latency (ms) |
|---|---|---|---|
| **D-FINE-seg N** | **0.231** | 0.106 | 3.2 |
| YOLO26-seg N | 0.062 | 0.027 | 3.8 |
| **D-FINE-seg S** | **0.281** | 0.134 | 3.7 |
| YOLO26-seg S | 0.177 | 0.080 | 4.3 |
| **D-FINE-seg M** | **0.296** | 0.140 | 4.5 |
| YOLO26-seg M | 0.267 | 0.128 | 5.3 |
| **D-FINE-seg L** | **0.342** | 0.167 | 5.0 |
| YOLO26-seg L | 0.287 | 0.137 | 5.8 |
| **D-FINE-seg X** | **0.380** | 0.190 | 6.3 |
| YOLO26-seg X | 0.300 | 0.146 | 7.6 |
See the [GitHub README](https://github.com/ArgoHA/D-FINE-seg#benchmarks) for full TACO detection
results, COCO-style mask/box AP, and cross-format (Torch/TRT/OpenVINO/CoreML) comparisons on
desktop, edge (Intel N150), and Apple Silicon.
## Intended use and limitations
**Intended use.** General-purpose object detection and instance segmentation, particularly
when (a) low end-to-end latency matters and (b) the deployment target is GPU (TensorRT),
CPU/iGPU (OpenVINO), or Apple Silicon (CoreML).
**Out of scope.**
- Safety-critical perception (autonomous driving, medical) without independent validation.
- Strong domain shift away from the pretraining distribution. The COCO-pretrained checkpoints
are an init; expect to fine-tune on your own data for non-COCO classes.
- Real-time deployment without first re-exporting the TensorRT engine on the target GPU
  (TRT engines are GPU-specific; see the export sketch after this list).
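Rebuilding an engine on the deployment machine is a one-liner with `trtexec` once you have an ONNX export (file names below are placeholders; use the repo's export tooling to produce the ONNX):

```bash
# Standard TensorRT invocation; the resulting engine is tied to the GPU it was built on.
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16
```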
**Known limitations.**
- Mosaic augmentation is not recommended for the segmentation task; lower
`mosaic_augs.mosaic_prob` toward 0 if masks look wrong.
- INT8 quantization shows a noticeable F1 drop on segmentation; FP16 is the recommended
latency/accuracy trade-off for both GPU and CPU.
## Citation
```bibtex
@article{saakyan2026dfineseg,
title = {D-FINE-seg: Object Detection and Instance Segmentation Framework with Multi-Backend Deployment},
author = {Saakyan, Argo and Solntsev, Dmitry},
journal = {arXiv preprint arXiv:2602.23043},
year = {2026},
eprint = {2602.23043}
}
@misc{peng2024dfine,
title = {D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
author = {Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu},
year = {2024},
eprint = {2410.13842},
archivePrefix = {arXiv},
primaryClass = {cs.CV}
}
```
## Acknowledgements
Detection core based on [D-FINE](https://github.com/Peterande/D-FINE) (Peng et al., 2024).
Mask head design follows [Mask DINO](https://arxiv.org/abs/2206.02777). Benchmarks use
[VisDrone](https://github.com/VisDrone/VisDrone-Dataset) and [TACO](http://tacodataset.org/).