---
license: apache-2.0
tags:
- pytorch
---

# 🚀 DRONE: Cross-modal Representation Shift Refinement for Point-supervised Video Moment Retrieval

Kun Wang¹, Yupeng Hu¹ ✉, Hao Liu¹, Jiang Shao¹, Liqiang Nie²

¹ School of Software, Shandong University, Jinan, China
² School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, China
✉ Corresponding author

This repository provides the official implementation, pre-trained model weights, and configuration files for **DRONE**, a point-supervised Video Moment Retrieval (VMR) framework designed to mitigate cross-modal representation shift.

🔗 **Paper:** [Accepted by ACM TOIS 2026](https://dl.acm.org/doi/10.1145/3786606)
🔗 **GitHub Repository:** [iLearn-Lab/DRONE](https://github.com/iLearn-Lab/DRONE)

---

## 📌 Model Information

### 1. Model Name

**DRONE** (Cross-modal Representation Shift Refinement)

### 2. Task Type & Applicable Tasks

- **Task Type:** Point-supervised Video Moment Retrieval (VMR) / Vision-Language / Multimodal Learning
- **Applicable Tasks:** Localizing temporal segments in untrimmed videos that match natural language queries, using only point-level supervision to reduce annotation costs while explicitly addressing cross-modal representation shift.

### 3. Project Introduction

Point-supervised Video Moment Retrieval (VMR) aims to localize the temporal segment in a video that matches a natural language query using only single-frame annotations. **DRONE** addresses the cross-modal representation shift issue inherent in this setting, progressively improving temporal alignment and semantic consistency between video and text representations.

> 💡 **Method Highlight:** DRONE introduces **Pseudo-Frame Temporal Alignment (PTA)** and **Curriculum-Guided Semantic Refinement (CSR)**. Together, these modules systematically mitigate representation shift, allowing the model to effectively bridge the semantic gap between visual frames and textual queries.

### 4. Training Data Source

The model supports and is evaluated on three standard VMR datasets:

- **ActivityNet Captions**
- **Charades-STA**
- **TACoS**

*(Follows the splits and feature preparation from [ViGA](https://github.com/r-cui/ViGA))*

---

## 🚀 Usage & Basic Inference

### Step 1: Prepare the Environment

Clone the GitHub repository and set up a virtual environment:

```bash
git clone https://github.com/iLearn-Lab/DRONE.git
cd DRONE
```

```bash
python -m venv .venv
source .venv/bin/activate  # Linux / Mac
# .venv\Scripts\activate   # Windows
```

```bash
pip install numpy scipy pyyaml tqdm
```

### Step 2: Download Model Weights & Data

1. **Pre-trained Checkpoints:** Download the model checkpoints (includes `Act_ckpt/`, `Cha_ckpt/`, and `TACoS_ckpt/`).
2. **Datasets & Features:** Follow [ViGA](https://github.com/r-cui/ViGA)'s dataset preparation guidelines for ActivityNet Captions, Charades-STA, and TACoS.
3. **Configuration:** Before running, replace the dataset root and feature paths in `src/config.yaml` and `src/utils/utils.py` with your actual local paths.

### Step 3: Run Training & Evaluation

**Training from Scratch:** Run the command matching the dataset you want to train on:

```bash
# For ActivityNet Captions
python -m src.experiment.train --task activitynetcaptions

# For Charades-STA
python -m src.experiment.train --task charadessta

# For TACoS
python -m src.experiment.train --task tacos
```

**Evaluation:** To evaluate a trained experiment folder (which should contain `config.yaml` and `model_best.pt`), run:

```bash
python -m src.experiment.eval --exp path/to/your/experiment_folder
```

---

## ⚠️ Limitations & Notes

**Disclaimer:** This framework and its pre-trained weights are intended for **academic research purposes only**.

- The model requires access to the original source datasets (ActivityNet Captions, Charades-STA, TACoS) for full evaluation.
- While DRONE is designed to mitigate cross-modal representation shift, performance still depends on the quality of the point-level annotations and the capacity of the chosen visual backbone (C3D, I3D, or VGG).

---

## 🤝 Acknowledgements & Contact

- **Acknowledgement:** This implementation and data organization are inspired by the [ViGA](https://github.com/r-cui/ViGA) open-source project. Thanks to all collaborators and contributors.
- **Contact:** If you have any questions, feel free to contact me at `khylon.kun.wang@gmail.com`.

---

## 📝⭐️ Citation

If you find our work or this repository useful in your research, please consider citing our paper:

```bibtex
@article{wang2026cross,
  title={Cross-Modal Representation Shift Refinement for Point-supervised Video Moment Retrieval},
  author={Wang, Kun and Hu, Yupeng and Liu, Hao and Shao, Jiang and Nie, Liqiang},
  journal={ACM Transactions on Information Systems},
  volume={44},
  number={3},
  pages={1--30},
  year={2026},
  publisher={ACM}
}
```
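---

## 🧩 Appendix: Scripting the Config Edit

Step 2.3 of the usage section asks you to repoint the dataset root and feature paths in `src/config.yaml`. If you switch machines or datasets often, that edit can be scripted. The sketch below is a minimal, hypothetical helper using only the Python standard library; the key names `dataset_root` and `feature_dir` are illustrative placeholders, not taken from the repository — check the actual keys in your local `src/config.yaml` before using it.

```python
"""Hypothetical helper: repoint path entries in a flat YAML-style config.

The key names used in the example are illustrative only; inspect the real
src/config.yaml for the keys your checkout actually uses.
"""
import re


def set_config_path(text: str, key: str, new_value: str) -> str:
    """Replace the value of a top-level `key: value` line in YAML-like text."""
    pattern = re.compile(rf"^({re.escape(key)}:\s*).*$", flags=re.MULTILINE)
    if not pattern.search(text):
        raise KeyError(f"{key!r} not found in config")
    # Substitute via a lambda so backslashes in Windows paths
    # are not interpreted as regex escape sequences.
    return pattern.sub(lambda m: m.group(1) + new_value, text)


if __name__ == "__main__":
    # In practice: text = Path("src/config.yaml").read_text(), edit, write back.
    cfg = "dataset_root: /old/path\nfeature_dir: /old/feats\n"
    cfg = set_config_path(cfg, "dataset_root", "/data/activitynet")
    cfg = set_config_path(cfg, "feature_dir", "/data/activitynet/features")
    print(cfg)
```

Note this is a line-level text substitution, not a YAML parse: it deliberately avoids a `pyyaml` round-trip so comments and key ordering in the config file are preserved.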