🚀 DRONE: Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval
Kun Wang1
Yupeng Hu1✉
Hao Liu1
Jiang Shao1
Liqiang Nie2
1School of Software, Shandong University, Jinan, China
2School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
✉Corresponding author
This repository provides the official implementation, pre-trained model weights, and configuration files for **DRONE**, a point-supervised Video Moment Retrieval (VMR) framework designed to mitigate cross-modal representation shift.
🔗 **Paper:** [Accepted by ACM TOIS 2026](https://dl.acm.org/doi/10.1145/3786606)
🔗 **GitHub Repository:** [iLearn-Lab/DRONE](https://github.com/iLearn-Lab/DRONE)
---
## 📌 Model Information
### 1. Model Name
**DRONE** (Cross-modal Representation Shift Refinement)
### 2. Task Type & Applicable Tasks
- **Task Type:** Point-supervised Video Moment Retrieval (VMR) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Localizing temporal segments in untrimmed videos that match natural language queries, utilizing only point-level supervision to reduce annotation costs while actively addressing cross-modal representation shifts.
### 3. Project Introduction
Point-supervised Video Moment Retrieval (VMR) aims to localize the temporal segment in a video that matches a natural language query using only single-frame annotations. **DRONE** addresses the cross-modal representation shift issue inherent in this setting, progressively improving temporal alignment and semantic consistency between video and text representations.
> 💡 **Method Highlight:** DRONE introduces **Pseudo-Frame Temporal Alignment (PTA)** and **Curriculum-Guided Semantic Refinement (CSR)**. Together, these modules systematically mitigate representation shifts, allowing the model to bridge the semantic gap between visual frames and textual queries effectively.
### 4. Training Data Source
The model supports and is evaluated on three standard VMR datasets:
- **ActivityNet Captions**
- **Charades-STA**
- **TACoS**
*(Follows splits and feature preparation from [ViGA](https://github.com/r-cui/ViGA))*
---
## 🚀 Usage & Basic Inference
### Step 1: Prepare the Environment
Clone the GitHub repository and set up the virtual environment:
```bash
git clone https://github.com/iLearn-Lab/DRONE.git
cd DRONE
```
```bash
python -m venv .venv
source .venv/bin/activate # Linux / Mac
# .venv\Scripts\activate # Windows
```
```bash
pip install numpy scipy pyyaml tqdm
```
### Step 2: Download Model Weights & Data
1. **Pre-trained Checkpoints:** Download the model checkpoints (includes `Act_ckpt/`, `Cha_ckpt/`, and `TACoS_ckpt/`).
2. **Datasets & Features:** Follow [ViGA](https://github.com/r-cui/ViGA)'s dataset preparation guidelines for ActivityNet Captions, Charades-STA, and TACoS.
3. **Configuration:** Before running, ensure you replace the local dataset root and feature paths in `src/config.yaml` and `src/utils/utils.py` with your actual local paths.
### Step 3: Run Training & Evaluation
**Training from Scratch:**
Depending on the dataset you want to train on, run the corresponding command:
#### For ActivityNet Captions
```bash
python -m src.experiment.train --task activitynetcaptions
```
#### For Charades-STA
```bash
python -m src.experiment.train --task charadessta
```
#### For TACoS
```bash
python -m src.experiment.train --task tacos
```
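To reproduce results on all three benchmarks, the commands above can be scripted. This is a minimal sketch that prints each training command; uncomment the line inside the loop to actually launch training (each run may take hours, depending on hardware):

```shell
# Train DRONE on each benchmark in turn; the --task values
# match the individual commands listed above.
train_all() {
  for task in activitynetcaptions charadessta tacos; do
    echo ">> python -m src.experiment.train --task $task"
    # python -m src.experiment.train --task "$task"
  done
}
train_all
```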
**Evaluation:**
To evaluate a trained experiment folder (one that contains `config.yaml` and `model_best.pt`), run:
```bash
python -m src.experiment.eval --exp path/to/your/experiment_folder
```
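Assuming the released checkpoint folders from Step 2 (`Act_ckpt/`, `Cha_ckpt/`, `TACoS_ckpt/`) are laid out as experiment folders, i.e. each contains `config.yaml` and `model_best.pt` (worth verifying before running), they can be evaluated directly. This sketch prints one eval command per checkpoint; uncomment the inner line to execute them:

```shell
# Hypothetical layout: the three checkpoint folders sit in the repo root.
eval_all() {
  for ckpt in Act_ckpt Cha_ckpt TACoS_ckpt; do
    echo ">> python -m src.experiment.eval --exp $ckpt"
    # python -m src.experiment.eval --exp "$ckpt"
  done
}
eval_all
```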
---
## ⚠️ Limitations & Notes
**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.
- The model requires access to the original source datasets (ActivityNet Captions, Charades-STA, TACoS) for full evaluation.
- While designed to mitigate cross-modal representation shifts, performance relies on the quality of the point-level annotations and the inherent capacities of the selected visual backbones (C3D, I3D, VGG).
---
## 🤝 Acknowledgements & Contact
- **Acknowledgement:** This implementation and its data organization build on the open-source [ViGA](https://github.com/r-cui/ViGA) codebase. Thanks to all collaborators and contributors of this project.
- **Contact:** If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.
---
## 📝⭐️ Citation
If you find our work or this repository useful in your research, please consider citing our paper:
```bibtex
@article{wang2026cross,
  title={Cross-Modal Representation Shift Refinement for Point-supervised Video Moment Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Shao, Jiang and Nie, Liqiang},
  journal={ACM Transactions on Information Systems},
  volume={44},
  number={3},
  pages={1--30},
  year={2026},
  publisher={ACM New York, NY}
}
```