Commit `72cc0ae` (verified) by ArgoSA, parent `333931d`: Update README.md
---
license: apache-2.0
library_name: pytorch
pipeline_tag: object-detection
tags:
- object-detection
- instance-segmentation
- real-time
- detection-transformer
- d-fine
- tensorrt
- onnx
- openvino
- coreml
datasets:
- visdrone
- taco
- coco
language:
- en
model-index:
- name: D-FINE-seg S (TACO, instance segmentation)
  results:
  - task:
      type: instance-segmentation
      name: Instance Segmentation
    dataset:
      name: TACO
      type: taco
    metrics:
    - type: f1
      value: 0.281
      name: F1@IoU=0.5
    - type: latency
      value: 3.7
      name: Latency (ms, RTX 5070 Ti, TRT FP16, 640x640)
- name: D-FINE S (VisDrone, object detection)
  results:
  - task:
      type: object-detection
      name: Object Detection
    dataset:
      name: VisDrone
      type: visdrone
    metrics:
    - type: f1
      value: 0.584
      name: F1@IoU=0.5
    - type: latency
      value: 2.1
      name: Latency (ms, RTX 5070 Ti, TRT FP16, 640x640)
---

# D-FINE-seg

**Real-Time Object Detection and Instance Segmentation.**
A DETR-style detector ([D-FINE](https://arxiv.org/abs/2410.13842)) extended with a lightweight
mask head, segmentation-aware training, and mask-aware Hungarian matching. Outperforms
Ultralytics YOLO26 in F1 score when fine-tuned on TACO and VisDrone, under a unified
TensorRT FP16 end-to-end benchmarking protocol, while maintaining competitive latency.

- 📄 **Paper:** [arXiv:2602.23043](https://arxiv.org/abs/2602.23043)
- 💻 **Code:** [github.com/ArgoHA/D-FINE-seg](https://github.com/ArgoHA/D-FINE-seg)
- 🎬 **Video tutorial:** [YouTube](https://youtu.be/_uEyRRw4miY)
- 🧪 **Colab:** [Open in Colab](https://colab.research.google.com/drive/1ZV12qnUQMpC0g3j-0G-tYhmmdM98a41X?usp=sharing)
- 🪪 **License:** Apache 2.0

<p align="center">
  <img src="https://raw.githubusercontent.com/ArgoHA/D-FINE-seg/main/assets/det_benchmark.png" width="48%">
  <img src="https://raw.githubusercontent.com/ArgoHA/D-FINE-seg/main/assets/seg_benchmark.png" width="48%">
</p>

## Model description

D-FINE-seg adds an instance segmentation head to D-FINE without changing its detection core.
The mask head fuses HybridEncoder PAN features at strides 8/16/32 into shared mask features at
1/4 input resolution; per-query mask embeddings (from a 3-layer MLP) are dot-producted with the
shared mask features to produce per-instance masks. Training adds box-cropped BCE + Dice mask
losses, mask-aware contrastive denoising, and mask costs in the Hungarian matcher.
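
The per-query dot-product step can be sketched as follows. This is an illustrative NumPy snippet with made-up shapes, not the repo's actual code:

```python
import numpy as np

# Hypothetical shapes for illustration: 300 queries, 32-dim mask embeddings,
# shared mask features at 1/4 resolution of a 640x640 input.
num_queries, embed_dim, feat_h, feat_w = 300, 32, 160, 160

rng = np.random.default_rng(0)
mask_embeddings = rng.standard_normal((num_queries, embed_dim))   # from the 3-layer MLP
mask_features = rng.standard_normal((embed_dim, feat_h, feat_w))  # fused PAN features

# (Q, C) x (C, H, W) -> (Q, H, W): one mask logit map per query
mask_logits = np.einsum("qc,chw->qhw", mask_embeddings, mask_features)
mask_probs = 1.0 / (1.0 + np.exp(-mask_logits))  # per-pixel sigmoid

print(mask_logits.shape)  # (300, 160, 160)
```

Because every query shares the same mask features, adding queries only adds one dot product per pixel, which keeps the head lightweight.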

This is **not** a fork of D-FINE. The detection core is based on the
[original D-FINE](https://github.com/Peterande/D-FINE) (Peng et al., 2024); everything else
(segmentation head, training pipeline, export, inference, augmentations) was reimplemented
from scratch. The mask head design follows the [Mask DINO](https://arxiv.org/abs/2206.02777) paradigm.

## Available checkpoints

All weights are PyTorch `.pt` files. Filenames follow the pattern `dfine[_seg]_<size>_<dataset>.pt`.
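
The naming convention can be spelled out with a tiny helper (`checkpoint_name` is a hypothetical function written for this card, not part of the repo):

```python
def checkpoint_name(size: str, dataset: str, seg: bool = False) -> str:
    """Build a weight filename following the pattern dfine[_seg]_<size>_<dataset>.pt."""
    if size not in {"n", "s", "m", "l", "x"}:
        raise ValueError(f"unknown size: {size!r}")
    stem = "dfine_seg" if seg else "dfine"
    return f"{stem}_{size}_{dataset}.pt"

print(checkpoint_name("s", "coco", seg=True))  # dfine_seg_s_coco.pt
print(checkpoint_name("x", "obj2coco"))        # dfine_x_obj2coco.pt
```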

### Object detection (COCO-pretrained)

| File | Size (M params) | Notes |
|---|---|---|
| `dfine_n_coco.pt` | 3.8 | Nano |
| `dfine_s_coco.pt` | 10.3 | Small |
| `dfine_m_coco.pt` | 19.6 | Medium |
| `dfine_l_coco.pt` | 31.2 | Large |
| `dfine_x_coco.pt` | 62.6 | Extra-Large |

### Object detection (Objects365 → COCO)

`dfine_{s,m,l,x}_obj2coco.pt`: same architectures, pretrained on Objects365 and then fine-tuned
on COCO. Generally a stronger initialization for downstream fine-tuning.

### Instance segmentation (COCO-pretrained)

| File | Size (M params) | Notes |
|---|---|---|
| `dfine_seg_n_coco.pt` | 5.1 | Nano |
| `dfine_seg_s_coco.pt` | 11.9 | Small |
| `dfine_seg_m_coco.pt` | 21.2 | Medium |
| `dfine_seg_l_coco.pt` | 32.8 | Large |
| `dfine_seg_x_coco.pt` | 64.3 | Extra-Large |

## Usage

> **Note on `transformers` integration.** This model is not (yet) wrapped as a
> `transformers.AutoModel`. The recommended path is to use the official
> [training/inference repo](https://github.com/ArgoHA/D-FINE-seg); weights auto-download
> from this Hub repo on first use. For an `AutoModel`-style API on a closely related
> architecture, see [`RTDetrV2ForObjectDetection`](https://huggingface.co/docs/transformers/model_doc/rt_detr_v2).

### Option 1: Official repo (recommended)

```bash
git clone https://github.com/ArgoHA/D-FINE-seg.git
cd D-FINE-seg
pip install -r requirements.txt
```

Weights are auto-downloaded from this repo into `pretrained/` on first use. No manual setup
is needed; just point at the size and dataset you want:

```python
from src.infer.torch_model import Torch_model
import cv2

model = Torch_model(
    model_name="s",               # n / s / m / l / x
    model_path="pretrained/dfine_seg_s_coco.pt",
    n_outputs=80,                 # COCO classes
    input_width=640,
    input_height=640,
    conf_thresh=0.5,
    enable_mask_head=True,        # False for detection checkpoints
    device="cuda",                # cuda / mps / cpu
)

img = cv2.imread("path/to/image.jpg")  # BGR
results = model(img)  # [{"boxes", "scores", "labels", "masks"?}]
```
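
The returned dicts can be post-filtered in plain Python. A minimal sketch, assuming the `boxes` / `scores` / `labels` layout shown above (`filter_detections` is a hypothetical helper, not part of the repo):

```python
import numpy as np

def filter_detections(result, min_score=0.6):
    """Keep only detections whose score meets the threshold."""
    keep = np.asarray(result["scores"]) >= min_score
    return {key: np.asarray(val)[keep] for key, val in result.items()}

# Toy result in the layout shown above
demo = {
    "boxes": [[0, 0, 10, 10], [5, 5, 20, 20]],
    "scores": [0.9, 0.3],
    "labels": [3, 7],
}
filtered = filter_detections(demo)
print(filtered["labels"])  # [3]
```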

### Option 2: Direct download with `huggingface_hub`

```python
from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="ArgoSA/D-FINE-seg",
    filename="dfine_seg_s_coco.pt",
)
# Then load with the official repo's Torch_model (see Option 1).
```

### Option 3: Gradio demo

```bash
python -m demo.demo
```

## Training data

| Use case | Datasets used |
|---|---|
| COCO detection / segmentation pretraining | [COCO 2017](https://cocodataset.org/) |
| Objects365 → COCO checkpoints | [Objects365](https://www.objects365.org/) → COCO 2017 |
| Reported drone benchmarks | [VisDrone](https://github.com/VisDrone/VisDrone-Dataset) (~6.5k train / ~550 val / ~1.6k test-dev) |
| Reported waste benchmarks | [TACO](http://tacodataset.org/) (1500 images, 59 effective classes, 86/14 batch-ID split) |

## Benchmarks

End-to-end latency (preprocessing + forward pass + postprocessing) measured on an RTX 5070 Ti
with TensorRT FP16, 640×640 input, batch size 1. F1 score is computed at IoU 0.5.
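
For reference, F1 at a fixed IoU threshold can be computed by greedily matching predictions (sorted by descending confidence) to ground truth. A minimal sketch of the metric, not the exact benchmark harness:

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def f1_at_iou(preds, gts, iou_thresh=0.5):
    """Greedy one-to-one matching of predictions to ground-truth boxes."""
    matched, tp = set(), 0
    for p in preds:  # assumed sorted by descending confidence
        best_i, best_iou = None, iou_thresh
        for i, g in enumerate(gts):
            if i in matched:
                continue
            iou = box_iou(p, g)
            if iou >= best_iou:
                best_i, best_iou = i, iou
        if best_i is not None:
            matched.add(best_i)
            tp += 1
    fp, fn = len(preds) - tp, len(gts) - tp
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0

print(f1_at_iou([[0, 0, 10, 10]], [[0, 0, 10, 10]]))  # 1.0
```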

### VisDrone: object detection (test-dev)

| Model | F1 | IoU | Latency (ms) |
|---|---|---|---|
| **D-FINE N** | **0.531** | 0.288 | 1.6 |
| YOLO26 N | 0.455 | 0.226 | 2.8 |
| **D-FINE S** | **0.584** | 0.332 | 2.1 |
| YOLO26 S | 0.510 | 0.264 | 3.1 |
| **D-FINE M** | **0.605** | 0.351 | 2.7 |
| YOLO26 M | 0.562 | 0.301 | 3.6 |
| **D-FINE L** | **0.606** | 0.351 | 3.3 |
| YOLO26 L | 0.568 | 0.308 | 4.1 |
| **D-FINE X** | **0.611** | 0.354 | 4.5 |
| YOLO26 X | 0.584 | 0.319 | 5.3 |

### TACO: instance segmentation

| Model | F1 | IoU | Latency (ms) |
|---|---|---|---|
| **D-FINE-seg N** | **0.231** | 0.106 | 3.2 |
| YOLO26-seg N | 0.062 | 0.027 | 3.8 |
| **D-FINE-seg S** | **0.281** | 0.134 | 3.7 |
| YOLO26-seg S | 0.177 | 0.080 | 4.3 |
| **D-FINE-seg M** | **0.296** | 0.140 | 4.5 |
| YOLO26-seg M | 0.267 | 0.128 | 5.3 |
| **D-FINE-seg L** | **0.342** | 0.167 | 5.0 |
| YOLO26-seg L | 0.287 | 0.137 | 5.8 |
| **D-FINE-seg X** | **0.380** | 0.190 | 6.3 |
| YOLO26-seg X | 0.300 | 0.146 | 7.6 |

See the [GitHub README](https://github.com/ArgoHA/D-FINE-seg#benchmarks) for full TACO detection
results, COCO-style mask/box AP, and cross-format (Torch/TRT/OpenVINO/CoreML) comparisons on
desktop, edge (Intel N150), and Apple Silicon.

## Intended use and limitations

**Intended use.** General-purpose object detection and instance segmentation, particularly
when (a) low end-to-end latency matters and (b) the deployment target is a GPU (TensorRT),
CPU/iGPU (OpenVINO), or Apple Silicon (CoreML).

**Out of scope.**
- Safety-critical perception (autonomous driving, medical) without independent validation.
- Strong domain shift away from the pretraining distribution. The COCO-pretrained checkpoints
  are an initialization; expect to fine-tune on your own data for non-COCO classes.
- Real-time deployment without first re-exporting the TensorRT engine on the target GPU
  (TRT engines are GPU-specific).

**Known limitations.**
- Mosaic augmentation is not recommended for the segmentation task; lower
  `mosaic_augs.mosaic_prob` toward 0 if masks look wrong.
- INT8 quantization shows a noticeable F1 drop on segmentation; FP16 is the recommended
  latency/accuracy trade-off on both GPU and CPU.

## Citation

```bibtex
@article{saakyan2026dfineseg,
  title   = {D-FINE-seg: Object Detection and Instance Segmentation Framework with Multi-Backend Deployment},
  author  = {Saakyan, Argo and Solntsev, Dmitry},
  journal = {arXiv preprint arXiv:2602.23043},
  year    = {2026},
  eprint  = {2602.23043}
}

@misc{peng2024dfine,
  title         = {D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement},
  author        = {Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu},
  year          = {2024},
  eprint        = {2410.13842},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV}
}
```

## Acknowledgements

Detection core based on [D-FINE](https://github.com/Peterande/D-FINE) (Peng et al., 2024).
The mask head design follows [Mask DINO](https://arxiv.org/abs/2206.02777). Benchmarks use
[VisDrone](https://github.com/VisDrone/VisDrone-Dataset) and [TACO](http://tacodataset.org/).