arxiv:2510.05652

SD-MVSum: Script-Driven Multimodal Video Summarization Method and Datasets

Published on May 7

Authors:

Abstract

SD-MVSum method enhances script-driven video summarization by incorporating cross-modal attention between script and audio transcripts alongside visual content, using extended datasets for training and evaluation.

AI-generated summary

In this work, we present a method and two large-scale datasets for Script-Driven Multimodal Video Summarization. The proposed method, SD-MVSum, builds on our earlier SD-VSum method for script-driven video summarization, which considered just the visual content of the video. SD-MVSum takes into account, in addition to the visual modality, the relevance of the user-provided script with the spoken content (i.e., audio transcript) of the video. The dependence between each considered pair of data modalities, i.e., script-video and script-transcript, is modeled using a new weighted cross-modal attention mechanism. This mechanism explicitly exploits the semantic similarity between the paired modalities in order to promote the parts of the full-length video with the highest relevance to the user-provided script. Furthermore, we extend two large-scale datasets for script-driven (S-VideoXum) and generic (MrHiSum) video summarization, to make them suitable for training and evaluation of script-driven multimodal video summarization methods. Experimental comparisons document the competitiveness of the proposed SD-MVSum method against other SotA approaches for script-driven and generic video summarization. Our new method and extended datasets are available at: https://github.com/IDT-ITI/SD-MVSum.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Get this paper in your agent:

hf papers read 2510.05652

Don't have the latest CLI?

curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2510.05652 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2510.05652 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2510.05652 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.