---
license: apache-2.0
library_name: pytorch
pipeline_tag: object-detection
tags:
  - object-detection
  - instance-segmentation
  - real-time
  - detection-transformer
  - d-fine
  - tensorrt
  - openvino
datasets:
  - visdrone
  - taco
  - coco
language:
  - en
model-index:
  - name: D-FINE-seg S (TACO, instance segmentation)
    results:
      - task:
          type: instance-segmentation
          name: Instance Segmentation
        dataset:
          name: TACO
          type: taco
        metrics:
          - type: f1
            value: 0.281
            name: F1@IoU=0.5
          - type: latency
            value: 3.7
            name: Latency (ms, RTX 5070 Ti, TRT FP16, 640x640)
  - name: D-FINE S (VisDrone, object detection)
    results:
      - task:
          type: object-detection
          name: Object Detection
        dataset:
          name: VisDrone
          type: visdrone
        metrics:
          - type: f1
            value: 0.584
            name: F1@IoU=0.5
          - type: latency
            value: 2.1
            name: Latency (ms, RTX 5070 Ti, TRT FP16, 640x640)
---

# D-FINE-seg

**Real-Time Object Detection and Instance Segmentation.**
A DETR-style detector ([D-FINE](https://arxiv.org/abs/2410.13842)) extended with a lightweight
mask head, segmentation-aware training, and mask-aware Hungarian matching. It outperforms
Ultralytics YOLO26 in F1-score when fine-tuned on TACO and VisDrone under a unified
TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency.

- 📄 **Paper:** [arXiv:2602.23043](https://arxiv.org/abs/2602.23043)
- 💻 **Code:** [github.com/ArgoHA/D-FINE-seg](https://github.com/ArgoHA/D-FINE-seg)
- 🎬 **Video tutorial:** [YouTube](https://youtu.be/_uEyRRw4miY)
- 🧪 **Colab:** [Open in Colab](https://colab.research.google.com/drive/1ZV12qnUQMpC0g3j-0G-tYhmmdM98a41X?usp=sharing)
- 🪪 **License:** Apache 2.0

<p align="center">
  <img src="https://raw.githubusercontent.com/ArgoHA/D-FINE-seg/main/assets/det_benchmark.png" width="48%">
  <img src="https://raw.githubusercontent.com/ArgoHA/D-FINE-seg/main/assets/seg_benchmark.png" width="48%">
</p>

## Model description

D-FINE-seg adds an instance segmentation head to D-FINE without changing its detection core.
The mask head fuses HybridEncoder PAN features at strides 8/16/32 to 1/4 resolution; per-query
mask embeddings (3-layer MLP) are dot-producted with shared mask features to produce per-instance
masks. Training adds box-cropped BCE + Dice mask losses, mask-aware contrastive denoising,
and mask costs in the Hungarian matcher.
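The per-query dot-product step can be sketched as follows (shapes and names here are illustrative assumptions, not the repo's actual code; the real head also includes the fusion convolutions and the 3-layer MLP producing the embeddings):

```python
import torch

def predict_masks(mask_embed: torch.Tensor, mask_features: torch.Tensor) -> torch.Tensor:
    """Dot-product per-query embeddings with shared mask features.

    mask_embed:    (B, Q, C)    one C-dim embedding per query (from a 3-layer MLP)
    mask_features: (B, C, H, W) fused PAN features at 1/4 input resolution
    returns:       (B, Q, H, W) per-instance mask logits
    """
    return torch.einsum("bqc,bchw->bqhw", mask_embed, mask_features)

# For 640x640 input, the shared mask features sit at 160x160 (stride 4).
B, Q, C, H, W = 1, 300, 256, 160, 160
logits = predict_masks(torch.randn(B, Q, C), torch.randn(B, C, H, W))
assert logits.shape == (B, Q, H, W)
```

Sigmoid over the logits (optionally cropped to the predicted box, as in the box-cropped BCE + Dice losses) yields per-instance soft masks.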

This is **not** a fork of D-FINE. The detection core is based on the
[original D-FINE paper](https://github.com/Peterande/D-FINE); everything else
(segmentation head, training pipeline, export, inference, augmentations) was reimplemented
from scratch. The mask head design follows the [Mask DINO](https://arxiv.org/abs/2206.02777) paradigm.

## Available checkpoints

All weights are PyTorch `.pt` files. Filename pattern: `dfine[_seg]_<size>_<dataset>.pt`.

### Object detection (COCO-pretrained)

| File | Size (M params) | Notes |
|---|---|---|
| `dfine_n_coco.pt` | 3.8 | Nano |
| `dfine_s_coco.pt` | 10.3 | Small |
| `dfine_m_coco.pt` | 19.6 | Medium |
| `dfine_l_coco.pt` | 31.2 | Large |
| `dfine_x_coco.pt` | 62.6 | Extra-Large |

### Object detection (Objects365 → COCO)

`dfine_{s,m,l,x}_obj2coco.pt` — same architectures, pretrained on Objects365, then fine-tuned
on COCO. Generally a stronger init for downstream fine-tuning.

### Instance segmentation (COCO-pretrained)

| File | Size (M params) | Notes |
|---|---|---|
| `dfine_seg_n_coco.pt` | 5.1 | Nano |
| `dfine_seg_s_coco.pt` | 11.9 | Small |
| `dfine_seg_m_coco.pt` | 21.2 | Medium |
| `dfine_seg_l_coco.pt` | 32.8 | Large |
| `dfine_seg_x_coco.pt` | 64.3 | Extra-Large |

## Usage

> **Note on `transformers` integration.** This model is not (yet) wrapped as a
> `transformers.AutoModel`. The recommended path is to use the official
> [training/inference repo](https://github.com/ArgoHA/D-FINE-seg) — weights auto-download
> from this Hub repo on first use. For an `AutoModel`-style API on a closely related
> architecture, see [`RTDetrV2ForObjectDetection`](https://huggingface.co/docs/transformers/model_doc/rt_detr_v2).

### Option 1 — Official repo (recommended)

```bash
git clone https://github.com/ArgoHA/D-FINE-seg.git
cd D-FINE-seg
pip install -r requirements.txt
```

Weights are auto-downloaded from this repo into `pretrained/` on first use. No manual setup
needed; just point at the size and dataset you want:

```python
from src.infer.torch_model import Torch_model
import cv2

model = Torch_model(
    model_name="s",                         # n / s / m / l / x
    model_path="pretrained/dfine_seg_s_coco.pt",
    n_outputs=80,                           # COCO classes
    input_width=640,
    input_height=640,
    conf_thresh=0.5,
    enable_mask_head=True,                  # False for detection checkpoints
    device="cuda",                          # cuda / mps / cpu
)

img = cv2.imread("path/to/image.jpg")       # BGR
results = model(img)                        # [{"boxes", "scores", "labels", "masks"?}]
```
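For a quick visual check of the outputs, the masks can be alpha-blended onto the image. This sketch assumes each mask is a binary H×W array aligned to the input image (check the repo's inference code for the exact output format):

```python
import numpy as np

def overlay_masks(img, masks, color=(0, 255, 0), alpha=0.5):
    """Alpha-blend binary instance masks onto a BGR uint8 image."""
    out = img.astype(np.float32)
    blend = np.asarray(color, dtype=np.float32)
    for mask in masks:
        m = mask.astype(bool)
        out[m] = (1 - alpha) * out[m] + alpha * blend
    return out.astype(np.uint8)
```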

### Option 2 — Direct download with `huggingface_hub`

```python
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="ArgoSA/D-FINE-seg",
    filename="dfine_seg_s_coco.pt",
)
# Then load with the official repo's Torch_model (see Option 1).
```

### Option 3 — Gradio demo

```bash
python -m demo.demo
```

## Training data

| Use case | Datasets used |
|---|---|
| COCO detection / segmentation pretraining | [COCO 2017](https://cocodataset.org/) |
| Objects365 → COCO checkpoints | [Objects365](https://www.objects365.org/) → COCO 2017 |
| Reported drone benchmarks | [VisDrone](https://github.com/VisDrone/VisDrone-Dataset) (~6.5k train / ~550 val / ~1.6k test-dev) |
| Reported waste benchmarks | [TACO](http://tacodataset.org/) (1500 images, 59 effective classes, 86/14 batch-ID split) |

## Benchmarks

End-to-end latency (preprocessing + forward + postprocessing), RTX 5070 Ti, TensorRT FP16,
640×640, batch size 1. F1-score at IoU 0.5.
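For reference, F1 at IoU 0.5 pairs each prediction with at most one ground-truth box at IoU ≥ 0.5. A simplified single-class sketch with greedy, score-ordered matching (the repo's evaluation may differ in details such as per-class handling):

```python
def iou_xyxy(a, b):
    """IoU between two xyxy boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def f1_at_iou(preds, gts, thr=0.5):
    """Greedy-match predictions (highest score first) to ground truth; return F1.

    preds: list of {"box": [x1, y1, x2, y2], "score": float}
    gts:   list of [x1, y1, x2, y2]
    """
    matched, tp = set(), 0
    for p in sorted(preds, key=lambda p: -p["score"]):
        best, best_iou = None, thr
        for j, g in enumerate(gts):
            if j in matched:
                continue
            v = iou_xyxy(p["box"], g)
            if v >= best_iou:
                best, best_iou = j, v
        if best is not None:
            matched.add(best)
            tp += 1
    fp = len(preds) - tp   # unmatched predictions
    fn = len(gts) - tp     # unmatched ground truths
    return 2 * tp / (2 * tp + fp + fn + 1e-9)
```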

### VisDrone — object detection (test-dev)

| Model | F1 | IoU | Latency (ms) |
|---|---|---|---|
| **D-FINE N** | **0.531** | 0.288 | 1.6 |
| YOLO26 N | 0.455 | 0.226 | 2.8 |
| **D-FINE S** | **0.584** | 0.332 | 2.1 |
| YOLO26 S | 0.510 | 0.264 | 3.1 |
| **D-FINE M** | **0.605** | 0.351 | 2.7 |
| YOLO26 M | 0.562 | 0.301 | 3.6 |
| **D-FINE L** | **0.606** | 0.351 | 3.3 |
| YOLO26 L | 0.568 | 0.308 | 4.1 |
| **D-FINE X** | **0.611** | 0.354 | 4.5 |
| YOLO26 X | 0.584 | 0.319 | 5.3 |

### TACO — instance segmentation

| Model | F1 | IoU | Latency (ms) |
|---|---|---|---|
| **D-FINE-seg N** | **0.231** | 0.106 | 3.2 |
| YOLO26-seg N | 0.062 | 0.027 | 3.8 |
| **D-FINE-seg S** | **0.281** | 0.134 | 3.7 |
| YOLO26-seg S | 0.177 | 0.080 | 4.3 |
| **D-FINE-seg M** | **0.296** | 0.140 | 4.5 |
| YOLO26-seg M | 0.267 | 0.128 | 5.3 |
| **D-FINE-seg L** | **0.342** | 0.167 | 5.0 |
| YOLO26-seg L | 0.287 | 0.137 | 5.8 |
| **D-FINE-seg X** | **0.380** | 0.190 | 6.3 |
| YOLO26-seg X | 0.300 | 0.146 | 7.6 |

See the [GitHub README](https://github.com/ArgoHA/D-FINE-seg#benchmarks) for full TACO detection
results, COCO-style mask/box AP, and cross-format (Torch/TRT/OpenVINO/CoreML) comparisons on
desktop, edge (Intel N150), and Apple Silicon.

## Intended use and limitations

**Intended use.** General-purpose object detection and instance segmentation, particularly
when (a) low end-to-end latency matters and (b) the deployment target is GPU (TensorRT),
CPU/iGPU (OpenVINO), or Apple Silicon (CoreML).

**Out of scope.**
- Safety-critical perception (autonomous driving, medical) without independent validation.
- Strong domain shift away from the pretraining distribution. The COCO-pretrained checkpoints
  are an init; expect to fine-tune on your own data for non-COCO classes.
- Real-time deployment without first re-exporting the TensorRT engine on the target GPU
  (TRT engines are GPU-specific).

**Known limitations.**
- Mosaic augmentation is not recommended for the segmentation task; lower
  `mosaic_augs.mosaic_prob` toward 0 if masks look wrong.
- INT8 quantization shows a noticeable F1 drop on segmentation; FP16 is the recommended
  latency/accuracy trade-off for both GPU and CPU.
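The mosaic caveat translates to a one-line config change. A hypothetical YAML fragment — only the `mosaic_augs.mosaic_prob` key name comes from the repo; the surrounding structure is an assumption:

```yaml
mosaic_augs:
  mosaic_prob: 0.0   # disable mosaic when training the segmentation head
```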

## Citation

```bibtex
@article{saakyan2026dfineseg,
  title   = {D-FINE-seg: Object Detection and Instance Segmentation Framework with Multi-Backend Deployment},
  author  = {Saakyan, Argo and Solntsev, Dmitry},
  journal = {arXiv preprint arXiv:2602.23043},
  year    = {2026},
  eprint  = {2602.23043}
}

@misc{peng2024dfine,
  title         = {D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
  author        = {Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu},
  year          = {2024},
  eprint        = {2410.13842},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
```

## Acknowledgements

Detection core based on [D-FINE](https://github.com/Peterande/D-FINE) (Peng et al., 2024).
Mask head design follows [Mask DINO](https://arxiv.org/abs/2206.02777). Benchmarks use
[VisDrone](https://github.com/VisDrone/VisDrone-Dataset) and [TACO](http://tacodataset.org/).