🚀 DRONE: Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval
Kun Wang1
Yupeng Hu1✉
Hao Liu1
Jiang Shao1
Liqiang Nie2
1School of Software, Shandong University, Jinan, China
2School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
✉Corresponding author
This repository provides the official implementation, pre-trained model weights, and configuration files for **DRONE**, a point-supervised Video Moment Retrieval (VMR) framework designed to mitigate cross-modal representation shift.
🔗 **Paper:** [Accepted by ACM TOIS 2026](https://dl.acm.org/doi/10.1145/3786606)
🔗 **GitHub Repository:** [iLearn-Lab/DRONE](https://github.com/iLearn-Lab/DRONE)
---
## 📌 Model Information
### 1. Model Name
**DRONE** (Cross-modal Representation Shift Refinement)
### 2. Task Type & Applicable Tasks
- **Task Type:** Point-supervised Video Moment Retrieval (VMR) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Localizing temporal segments in untrimmed videos that match natural language queries, utilizing only point-level supervision to reduce annotation costs while actively addressing cross-modal representation shifts.
### 3. Project Introduction
Point-supervised Video Moment Retrieval (VMR) aims to localize the temporal segment in a video that matches a natural language query using only single-frame annotations. **DRONE** addresses the cross-modal representation shift issue inherent in this setting, progressively improving temporal alignment and semantic consistency between video and text representations.
> 💡 **Method Highlight:** DRONE introduces **Pseudo-Frame Temporal Alignment (PTA)** and **Curriculum-Guided Semantic Refinement (CSR)**. Together, these modules systematically mitigate representation shifts, allowing the model to bridge the semantic gap between visual frames and textual queries effectively.
### 4. Training Data Source
The model supports and is evaluated on three standard VMR datasets:
- **ActivityNet Captions**
- **Charades-STA**
- **TACoS**
*(Follows splits and feature preparation from [ViGA](https://github.com/r-cui/ViGA))*
---
## 🚀 Usage & Basic Inference
### Step 1: Prepare the Environment
Clone the GitHub repository and set up the virtual environment:
```bash
git clone https://github.com/iLearn-Lab/DRONE.git
cd DRONE
```
```bash
python -m venv .venv
source .venv/bin/activate # Linux / Mac
# .venv\Scripts\activate # Windows
```
```bash
pip install numpy scipy pyyaml tqdm
```
### Step 2: Download Model Weights & Data
1. **Pre-trained Checkpoints:** Download the model checkpoints (includes `Act_ckpt/`, `Cha_ckpt/`, and `TACoS_ckpt/`).
2. **Datasets & Features:** Follow [ViGA](https://github.com/r-cui/ViGA)'s dataset preparation guidelines for ActivityNet Captions, Charades-STA, and TACoS.
3. **Configuration:** Before running, ensure you replace the local dataset root and feature paths in `src/config.yaml` and `src/utils/utils.py` with your actual local paths.
### Step 3: Run Training & Evaluation
**Training from Scratch:**
Depending on the dataset you want to train on, run the corresponding command:
#### For ActivityNet Captions
```bash
python -m src.experiment.train --task activitynetcaptions
```
#### For Charades-STA
```bash
python -m src.experiment.train --task charadessta
```
#### For TACoS
```bash
python -m src.experiment.train --task tacos
```
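To reproduce results on all three benchmarks, the commands above can be scripted. This is a minimal sketch that prints each training command; uncomment the line inside the loop to actually launch training (each run may take hours, depending on hardware):

```shell
# Train DRONE on each benchmark in turn; the --task values
# match the individual commands listed above.
train_all() {
  for task in activitynetcaptions charadessta tacos; do
    echo ">> python -m src.experiment.train --task $task"
    # python -m src.experiment.train --task "$task"
  done
}
train_all
```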
**Evaluation:**
To evaluate a trained experiment folder (one that contains `config.yaml` and `model_best.pt`), run:
```bash
python -m src.experiment.eval --exp path/to/your/experiment_folder
```
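Assuming the released checkpoint folders from Step 2 (`Act_ckpt/`, `Cha_ckpt/`, `TACoS_ckpt/`) are laid out as experiment folders, i.e. each contains `config.yaml` and `model_best.pt` (worth verifying before running), they can be evaluated directly. This sketch prints one eval command per checkpoint; uncomment the inner line to execute them:

```shell
# Hypothetical layout: the three checkpoint folders sit in the repo root.
eval_all() {
  for ckpt in Act_ckpt Cha_ckpt TACoS_ckpt; do
    echo ">> python -m src.experiment.eval --exp $ckpt"
    # python -m src.experiment.eval --exp "$ckpt"
  done
}
eval_all
```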
---
## ⚠️ Limitations & Notes
**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
- The model requires access to the original source datasets (ActivityNet Captions, Charades-STA, TACoS) for full evaluation.
- While designed to mitigate cross-modal representation shifts, performance relies on the quality of the point-level annotations and the inherent capacities of the selected visual backbones (C3D, I3D, VGG).
---
## 🤝 Acknowledgements & Contact
- **Acknowledgement:** This implementation and its data organization build on the open-source [ViGA](https://github.com/r-cui/ViGA) codebase. Thanks to all collaborators and contributors of this project.
- **Contact:** If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.
---
## 📝⭐️ Citation
If you find our work or this repository useful in your research, please consider citing our paper:
```bibtex
@article{wang2026cross,
  title={Cross-Modal Representation Shift Refinement for Point-supervised Video Moment Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Shao, Jiang and Nie, Liqiang},
  journal={ACM Transactions on Information Systems},
  volume={44},
  number={3},
  pages={1--30},
  year={2026},
  publisher={ACM New York, NY}
}
```