| --- |
| license: apache-2.0 |
| library_name: pytorch |
| pipeline_tag: object-detection |
| tags: |
| - object-detection |
| - instance-segmentation |
| - real-time |
| - detection-transformer |
| - d-fine |
| - tensorrt |
| - openvino |
| datasets: |
| - visdrone |
| - taco |
| - coco |
| language: |
| - en |
| model-index: |
| - name: D-FINE-seg S (TACO, instance segmentation) |
| results: |
| - task: |
| type: instance-segmentation |
| name: Instance Segmentation |
| dataset: |
| name: TACO |
| type: taco |
| metrics: |
| - type: f1 |
| value: 0.281 |
| name: F1@IoU=0.5 |
| - type: latency |
| value: 3.7 |
| name: Latency (ms, RTX 5070 Ti, TRT FP16, 640x640) |
| - name: D-FINE S (VisDrone, object detection) |
| results: |
| - task: |
| type: object-detection |
| name: Object Detection |
| dataset: |
| name: VisDrone |
| type: visdrone |
| metrics: |
| - type: f1 |
| value: 0.584 |
| name: F1@IoU=0.5 |
| - type: latency |
| value: 2.1 |
| name: Latency (ms, RTX 5070 Ti, TRT FP16, 640x640) |
| --- |
| |
| # D-FINE-seg |
|
|
| **Real-Time Object Detection and Instance Segmentation.** |
| A DETR-style detector ([D-FINE](https://arxiv.org/abs/2410.13842)) extended with a lightweight |
| mask head, segmentation-aware training, and mask-aware Hungarian matching. Outperforms |
| Ultralytics YOLO26 in fine-tuning F1-score on TACO and VisDrone under a unified TensorRT FP16 |
| end-to-end benchmarking protocol, while maintaining competitive latency. |
|
|
| - 📄 **Paper:** [arXiv:2602.23043](https://arxiv.org/abs/2602.23043) |
| - 💻 **Code:** [github.com/ArgoHA/D-FINE-seg](https://github.com/ArgoHA/D-FINE-seg) |
| - 🎬 **Video tutorial:** [YouTube](https://youtu.be/_uEyRRw4miY) |
| - 🧪 **Colab:** [Open in Colab](https://colab.research.google.com/drive/1ZV12qnUQMpC0g3j-0G-tYhmmdM98a41X?usp=sharing) |
| - 🪪 **License:** Apache 2.0 |
|
|
| <p align="center"> |
| <img src="https://raw.githubusercontent.com/ArgoHA/D-FINE-seg/main/assets/det_benchmark.png" width="48%"> |
| <img src="https://raw.githubusercontent.com/ArgoHA/D-FINE-seg/main/assets/seg_benchmark.png" width="48%"> |
| </p> |
|
|
| ## Model description |
|
|
| D-FINE-seg adds an instance segmentation head to D-FINE without changing its detection core. |
| The mask head fuses HybridEncoder PAN features at strides 8/16/32 to 1/4 resolution; per-query |
| mask embeddings (3-layer MLP) are dot-producted with shared mask features to produce per-instance |
| masks. Training adds box-cropped BCE + Dice mask losses, mask-aware contrastive denoising, |
| and mask costs in the Hungarian matcher. |
|
|
| This is **not** a fork of D-FINE. The detection core is based on the |
| [original D-FINE paper](https://github.com/Peterande/D-FINE); everything else |
| (segmentation head, training pipeline, export, inference, augmentations) was reimplemented |
| from scratch. The mask head design follows the [Mask DINO](https://arxiv.org/abs/2206.02777) paradigm. |
|
|
| ## Available checkpoints |
|
|
| All weights are PyTorch `.pt` files. Filename pattern: `dfine[_seg]_<size>_<dataset>.pt`. |
|
|
| ### Object detection (COCO-pretrained) |
|
|
| | File | Size (M params) | Notes | |
| |---|---|---| |
| | `dfine_n_coco.pt` | 3.8 | Nano | |
| | `dfine_s_coco.pt` | 10.3 | Small | |
| | `dfine_m_coco.pt` | 19.6 | Medium | |
| | `dfine_l_coco.pt` | 31.2 | Large | |
| | `dfine_x_coco.pt` | 62.6 | Extra-Large | |
|
|
| ### Object detection (Objects365 → COCO) |
|
|
| `dfine_{s,m,l,x}_obj2coco.pt` — same architectures, pretrained on Objects365, then fine-tuned |
| on COCO. Generally a stronger init for downstream fine-tuning. |
|
|
| ### Instance segmentation (COCO-pretrained) |
|
|
| | File | Size (M params) | Notes | |
| |---|---|---| |
| | `dfine_seg_n_coco.pt` | 5.1 | Nano | |
| | `dfine_seg_s_coco.pt` | 11.9 | Small | |
| | `dfine_seg_m_coco.pt` | 21.2 | Medium | |
| | `dfine_seg_l_coco.pt` | 32.8 | Large | |
| | `dfine_seg_x_coco.pt` | 64.3 | Extra-Large | |
|
|
| ## Usage |
|
|
| > **Note on `transformers` integration.** This model is not (yet) wrapped as a |
| > `transformers.AutoModel`. The recommended path is to use the official |
| > [training/inference repo](https://github.com/ArgoHA/D-FINE-seg) — weights auto-download |
| > from this Hub repo on first use. For an `AutoModel`-style API on a closely related |
| > architecture, see [`RTDetrV2ForObjectDetection`](https://huggingface.co/docs/transformers/model_doc/rt_detr_v2). |
|
|
| ### Option 1 — Official repo (recommended) |
|
|
| ```bash |
| git clone https://github.com/ArgoHA/D-FINE-seg.git |
| cd D-FINE-seg |
| pip install -r requirements.txt |
| ``` |
|
|
| Weights are auto-downloaded from this repo into `pretrained/` on first use. No manual setup |
| needed; just point at the size and dataset you want: |
|
|
| ```python |
| from src.infer.torch_model import Torch_model |
| import cv2 |
| |
| model = Torch_model( |
| model_name="s", # n / s / m / l / x |
| model_path="pretrained/dfine_seg_s_coco.pt", |
| n_outputs=80, # COCO classes |
| input_width=640, |
| input_height=640, |
| conf_thresh=0.5, |
| enable_mask_head=True, # False for detection checkpoints |
| device="cuda", # cuda / mps / cpu |
| ) |
| |
| img = cv2.imread("path/to/image.jpg") # BGR |
| results = model(img) # [{"boxes", "scores", "labels", "masks"?}] |
| ``` |
|
|
| ### Option 2 — Direct download with `huggingface_hub` |
| |
| ```python |
| from huggingface_hub import hf_hub_download |
|
|
| ckpt = hf_hub_download( |
| repo_id="ArgoSA/D-FINE-seg", |
| filename="dfine_seg_s_coco.pt", |
| ) |
| # Then load with the official repo's Torch_model (see Option 1). |
| ``` |
| |
| ### Option 3 — Gradio demo |
|
|
| ```bash |
| python -m demo.demo |
| ``` |
|
|
| ## Training data |
|
|
| | Use case | Datasets used | |
| |---|---| |
| | COCO detection / segmentation pretraining | [COCO 2017](https://cocodataset.org/) | |
| | Objects365 → COCO checkpoints | [Objects365](https://www.objects365.org/) → COCO 2017 | |
| | Reported drone benchmarks | [VisDrone](https://github.com/VisDrone/VisDrone-Dataset) (~6.5k train / ~550 val / ~1.6k test-dev) | |
| | Reported waste benchmarks | [TACO](http://tacodataset.org/) (1500 images, 59 effective classes, 86/14 batch-ID split) | |
|
|
| ## Benchmarks |
|
|
| End-to-end latency (preprocessing + forward + postprocessing), RTX 5070 Ti, TensorRT FP16, |
| 640×640, batch size 1. F1-score at IoU 0.5. |
|
|
| ### VisDrone — object detection (test-dev) |
|
|
| | Model | F1 | IoU | Latency (ms) | |
| |---|---|---|---| |
| | **D-FINE N** | **0.531** | 0.288 | 1.6 | |
| | YOLO26 N | 0.455 | 0.226 | 2.8 | |
| | **D-FINE S** | **0.584** | 0.332 | 2.1 | |
| | YOLO26 S | 0.510 | 0.264 | 3.1 | |
| | **D-FINE M** | **0.605** | 0.351 | 2.7 | |
| | YOLO26 M | 0.562 | 0.301 | 3.6 | |
| | **D-FINE L** | **0.606** | 0.351 | 3.3 | |
| | YOLO26 L | 0.568 | 0.308 | 4.1 | |
| | **D-FINE X** | **0.611** | 0.354 | 4.5 | |
| | YOLO26 X | 0.584 | 0.319 | 5.3 | |
|
|
| ### TACO — instance segmentation |
|
|
| | Model | F1 | IoU | Latency (ms) | |
| |---|---|---|---| |
| | **D-FINE-seg N** | **0.231** | 0.106 | 3.2 | |
| | YOLO26-seg N | 0.062 | 0.027 | 3.8 | |
| | **D-FINE-seg S** | **0.281** | 0.134 | 3.7 | |
| | YOLO26-seg S | 0.177 | 0.080 | 4.3 | |
| | **D-FINE-seg M** | **0.296** | 0.140 | 4.5 | |
| | YOLO26-seg M | 0.267 | 0.128 | 5.3 | |
| | **D-FINE-seg L** | **0.342** | 0.167 | 5.0 | |
| | YOLO26-seg L | 0.287 | 0.137 | 5.8 | |
| | **D-FINE-seg X** | **0.380** | 0.190 | 6.3 | |
| | YOLO26-seg X | 0.300 | 0.146 | 7.6 | |
|
|
| See the [GitHub README](https://github.com/ArgoHA/D-FINE-seg#benchmarks) for full TACO detection |
| results, COCO-style mask/box AP, and cross-format (Torch/TRT/OpenVINO/CoreML) comparisons on |
| desktop, edge (Intel N150), and Apple Silicon. |
|
|
| ## Intended use and limitations |
|
|
| **Intended use.** General-purpose object detection and instance segmentation, particularly |
| when (a) low end-to-end latency matters and (b) the deployment target is GPU (TensorRT), |
| CPU/iGPU (OpenVINO), or Apple Silicon (CoreML). |
|
|
| **Out of scope.** |
| - Safety-critical perception (autonomous driving, medical) without independent validation. |
| - Strong domain shift away from the pretraining distribution. The COCO-pretrained checkpoints |
| are an init; expect to fine-tune on your own data for non-COCO classes. |
| - Real-time deployment without first re-exporting the TensorRT engine on the target GPU |
| (TRT engines are GPU-specific). |
|
|
| **Known limitations.** |
| - Mosaic augmentation is not recommended for the segmentation task; lower |
| `mosaic_augs.mosaic_prob` toward 0 if masks look wrong. |
| - INT8 quantization shows a noticeable F1 drop on segmentation; FP16 is the recommended |
| latency/accuracy trade-off for both GPU and CPU. |
|
|
| ## Citation |
|
|
| ```bibtex |
| @article{saakyan2026dfineseg, |
| title = {D-FINE-seg: Object Detection and Instance Segmentation Framework with Multi-Backend Deployment}, |
| author = {Saakyan, Argo and Solntsev, Dmitry}, |
| journal = {arXiv preprint arXiv:2602.23043}, |
| year = {2026}, |
| eprint = {2602.23043} |
| } |
| |
| @misc{peng2024dfine, |
| title = {D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement}, |
| author = {Yansong Peng and Hebei Li and Peixi Wu and Yueyi Zhang and Xiaoyan Sun and Feng Wu}, |
| year = {2024}, |
| eprint = {2410.13842}, |
| archivePrefix = {arXiv}, |
| primaryClass = {cs.CV} |
| } |
| ``` |
|
|
| ## Acknowledgements |
|
|
| Detection core based on [D-FINE](https://github.com/Peterande/D-FINE) (Peng et al., 2024). |
| Mask head design follows [Mask DINO](https://arxiv.org/abs/2206.02777). Benchmarks use |
| [VisDrone](https://github.com/VisDrone/VisDrone-Dataset) and [TACO](http://tacodataset.org/). |
|
|