---
license: apache-2.0
library_name: transformers
pipeline_tag: robotics
---

# CosFly-Track

This repository contains model checkpoints associated with the paper [CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization](https://huggingface.co/papers/2605.17776).

## Overview

CosFly-Track is a large-scale multi-modal dataset and scalable generation pipeline designed for UAV (Unmanned Aerial Vehicle) visual tracking in urban environments. While most aerial vision-language navigation (VLN) datasets focus on static goals, CosFly-Track addresses the problem of continuously following a moving target while maintaining visibility and avoiding collisions.

### Dataset Details
- **Scale**: Approximately 12,000 expert and perturbed UAV trajectories.
- **Volume**: 2.4 million timesteps (approximately 334 hours).
- **Aligned Channels**: Seven aligned data channels: RGB, metric depth, semantic segmentation, six-degree-of-freedom (6-DoF) drone pose, target state with visibility flag, bilingual (Chinese-English) instructions, and trajectory-pair metadata.
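
As a rough sanity check on the figures above, the following sketch derives the implied sampling rate and average trajectory length. It assumes uniform sampling across all trajectories and treats the rounded numbers quoted in this card as exact, so the results are only indicative.

```python
# Approximate figures quoted in the dataset details above.
NUM_TRAJECTORIES = 12_000
NUM_TIMESTEPS = 2_400_000
TOTAL_HOURS = 334


def implied_stats(num_trajectories, num_timesteps, total_hours):
    """Derive sampling rate and per-trajectory averages from the totals.

    Assumes every timestep covers the same amount of time.
    """
    total_seconds = total_hours * 3600
    rate_hz = num_timesteps / total_seconds
    steps_per_trajectory = num_timesteps / num_trajectories
    seconds_per_trajectory = total_seconds / num_trajectories
    return rate_hz, steps_per_trajectory, seconds_per_trajectory


rate_hz, steps_per_trajectory, seconds_per_trajectory = implied_stats(
    NUM_TRAJECTORIES, NUM_TIMESTEPS, TOTAL_HOURS
)
print(f"Implied sampling rate: {rate_hz:.2f} Hz")
print(f"Average trajectory: {steps_per_trajectory:.0f} steps, "
      f"{seconds_per_trajectory:.1f} s")
```

Under these assumptions the totals are consistent with roughly 2 Hz sampling and trajectories of about 200 timesteps each.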

## Methodology

The project introduces **MuCO** (Multi-Constraint Optimizer), a planner that operates directly in continuous 3D space. It jointly enforces target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility, avoiding the artifacts of grid-based planners.
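
To make the idea of jointly scoring these five criteria concrete, here is a minimal sketch of a weighted-sum trajectory cost. It is not MuCO's actual formulation, which this card does not specify: the soft-penalty form, the spherical obstacles, the function names (`trajectory_cost`, `segment_hits_sphere`), and all weights and thresholds are hypothetical choices made for illustration.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def segment_hits_sphere(p, q, center, radius):
    """Return True if the segment p->q passes within `radius` of `center`."""
    d = _sub(q, p)
    f = _sub(center, p)
    dd = sum(x * x for x in d)
    # Parameter of the closest point on the segment, clamped to [0, 1].
    t = 0.0 if dd == 0 else max(0.0, min(1.0, sum(a * b for a, b in zip(f, d)) / dd))
    closest = tuple(pi + t * di for pi, di in zip(p, d))
    return _norm(_sub(center, closest)) < radius


def trajectory_cost(drone, target, obstacles, dt=0.5, standoff=5.0,
                    v_max=10.0, weights=(10.0, 1.0, 100.0, 1.0, 10.0)):
    """Hypothetical weighted sum of five soft penalties (illustration only).

    drone, target: equal-length lists of (x, y, z) positions per timestep.
    obstacles: list of (center, radius) spheres.
    """
    w_vis, w_view, w_col, w_smooth, w_kin = weights
    cost = 0.0
    for i, (p, g) in enumerate(zip(drone, target)):
        # Target visibility: fixed penalty when an obstacle blocks line of sight.
        if any(segment_hits_sphere(p, g, c, r) for c, r in obstacles):
            cost += w_vis
        # Viewpoint quality: squared deviation from the desired standoff distance.
        cost += w_view * (_norm(_sub(p, g)) - standoff) ** 2
        # Collision avoidance: squared penetration depth into each sphere.
        for c, r in obstacles:
            cost += w_col * max(0.0, r - _norm(_sub(p, c))) ** 2
        # Smoothness: squared second difference of positions.
        if 0 < i < len(drone) - 1:
            acc = tuple(a - 2 * b + n for a, b, n in zip(drone[i - 1], p, drone[i + 1]))
            cost += w_smooth * sum(x * x for x in acc)
        # Kinematic feasibility: squared speed excess over v_max.
        if i > 0:
            speed = _norm(_sub(p, drone[i - 1])) / dt
            cost += w_kin * max(0.0, speed - v_max) ** 2
    return cost


# Target walks along the x-axis; one obstacle sits on the +y side.
target = [(t, 0, 0) for t in range(5)]
obstacles = [((2, 2.5, 0), 0.5)]
blocked_side = [(t, 5, 0) for t in range(5)]   # line of sight crosses the obstacle once
clear_side = [(t, -5, 0) for t in range(5)]    # same standoff, unobstructed

blocked_cost = trajectory_cost(blocked_side, target, obstacles)
clear_cost = trajectory_cost(clear_side, target, obstacles)
print("blocked side:", blocked_cost)
print("clear side:  ", clear_cost)
```

In this toy setup both candidate paths are equally smooth and keep the same standoff, so the only difference in cost comes from the single timestep where the obstacle occludes the target. A real planner would minimize such an objective over continuous positions rather than compare two fixed candidates.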

## Model Description

The checkpoints in this repository include various vision-language models (VLMs) fine-tuned on the CosFly-Track dataset to act as dynamic target-following agents. Evaluated architectures include:
- Qwen (Qwen3-VL, Qwen3.5)
- InternVL
- GLM-4V
- Gemma 4

Fine-tuning on CosFly-Track significantly improves tracking performance (SR@1 meter) over zero-shot baselines, which supports using the dataset to train robust autonomous tracking agents.
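
The exact definition of SR@1 meter is given in the paper, not in this card. As an illustration only, the sketch below assumes it denotes a success rate: the fraction of evaluated samples whose predicted position lies within 1 meter of the reference position. The function name `success_rate` and the per-sample evaluation scheme are assumptions.

```python
import math


def success_rate(predicted, reference, threshold_m=1.0):
    """Fraction of samples with Euclidean error <= threshold_m.

    Assumed reading of SR@1 meter; the paper's protocol may differ.
    predicted, reference: equal-length lists of (x, y, z) positions in meters.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not predicted:
        return 0.0
    hits = sum(
        1 for p, r in zip(predicted, reference) if math.dist(p, r) <= threshold_m
    )
    return hits / len(predicted)


# Toy data: three of four predictions fall within 1 m of the reference.
predicted = [(0, 0, 0), (1, 1, 1), (3, 0, 0), (0, 0.5, 0)]
reference = [(0, 0, 0.5), (1, 1, 1), (0, 0, 0), (0, 0, 0)]
sr = success_rate(predicted, reference)
print(f"SR@1m = {sr:.2f}")
```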

## Citation

```bibtex
@article{cosflytrack2026,
  title={CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization},
  author={Anonymous Authors},
  journal={arXiv preprint arXiv:2605.17776},
  year={2026}
}
```