VideoXum: Cross-modal Visual and Textural Summarization of Videos

1University of Rochester 2OPPO US Research Center
* indicates equal contributions


Video summarization aims to distill the most important information from a source video into either an abridged video clip or a textual narrative. Existing methods often treat the generation of video and text summaries as independent tasks, thus neglecting the semantic correlation between visual and textual summarization. In other words, these methods consider only a single output modality and do not produce coherent video and text summaries jointly. In this work, we first introduce a novel task: cross-modal video summarization. This task seeks to condense a long video into a short video clip and a semantically aligned textual summary, collectively referred to as a cross-modal summary. We then establish VideoXum (X refers to different modalities), a new large-scale human-annotated video benchmark for cross-modal video summarization. VideoXum is reannotated from ActivityNet Captions and covers diverse open-domain videos. In the current version, VideoXum provides 14K long videos, with a total of 140K pairs of aligned video and text summaries. Compared to existing datasets, VideoXum offers superior scalability while preserving a comparable level of annotation quality. To validate the dataset's quality, we provide a comprehensive analysis of VideoXum, comparing it with existing datasets. Further, we perform an extensive empirical evaluation of several state-of-the-art methods on this dataset. Our findings highlight the strong generalization capability that the vision-language encoder-decoder framework exhibits on VideoXum. In particular, we propose VTSUM-BLIP, an end-to-end framework, serving as a strong baseline for this novel benchmark. Moreover, we adapt CLIPScore for VideoXum to measure the semantic consistency of cross-modal summaries effectively.


In this study, we first propose VideoXum, an enriched large-scale dataset for cross-modal video summarization. The dataset is built on ActivityNet Captions, a large-scale public video captioning benchmark. We hire workers to annotate ten shortened video summaries for each long source video according to the corresponding captions. VideoXum contains 14K long videos with 140K pairs of aligned video and text summaries.


Illustration of our V2X-SUM task. A long source video (bottom) can be summarized into a shortened video and a text narrative (top). The video and text summaries should be semantically aligned.

Our goal is to extend the traditional single-modal video summarization task to a cross-modal video summarization task, referred to as V2X-SUM, to meet the demands of broader application scenarios (e.g., movie trailer generation and narrative generation). According to the target modality of the generated summaries, we categorize our proposed V2X-SUM task into three subtasks:

  • Video-to-Video Summarization (V2V-SUM). This task requires models to identify the most important segments from the source video and generate an abridged version of the source video.
  • Video-to-Text Summarization (V2T-SUM). In this task, models need to summarize the main content of the source video and generate a short text description.
  • Video-to-Video&Text Summarization (V2VT-SUM). This task requires models to generate a short video summary and the corresponding textual narrative from a source video simultaneously. Moreover, the semantics of these two modalities of summaries should be well aligned.
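For the V2V-SUM subtask, video summarization benchmarks commonly score a model by the F1 overlap between its predicted frame selection and a human reference selection. The sketch below illustrates this standard metric on binary keyframe masks; the mask format and function name are illustrative assumptions, not the benchmark's actual API.

```python
# Minimal sketch of the standard V2V-SUM evaluation: F1 overlap between
# a predicted binary keyframe mask and a reference mask. Illustrative
# only -- the real benchmark operates on annotated frame selections.

def keyframe_f1(pred, ref):
    """F1 score between two equal-length 0/1 frame-selection masks."""
    assert len(pred) == len(ref)
    overlap = sum(p and r for p, r in zip(pred, ref))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred)   # selected frames that are correct
    recall = overlap / sum(ref)       # reference frames that are recovered
    return 2 * precision * recall / (precision + recall)

# Example: 8-frame video; the model keeps frames 2-4, the reference keeps 3-5.
pred = [0, 0, 1, 1, 1, 0, 0, 0]
ref  = [0, 0, 0, 1, 1, 1, 0, 0]
print(round(keyframe_f1(pred, ref), 3))  # → 0.667
```

Because VideoXum provides ten reference summaries per video, a model's score can be averaged (or maximized) over the references, following common practice in video summarization evaluation.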


We propose VTSUM-BLIP, a novel end-to-end cross-modal video summarization model. To leverage the strong vision-understanding and language-modeling capabilities of pretrained vision-language models, we employ BLIP as our backbone. Then, we design an efficient hierarchical video encoding strategy, with a frozen encoder and a temporal modeling module, to encode long videos. Furthermore, we design different task-specific decoders for video and text summarization. The modularized design enables us to perform more complex downstream tasks without changing the structure of the pretrained model.


An overview of our VTSUM-BLIP model (left). The model consists of a hierarchical video encoder (middle), video-sum decoder, and text-sum decoder (right). For V2V-SUM, the video-sum decoder employs a temporal Transformer and local self-attention module to aggregate local context. For V2T-SUM, the text-sum decoder is a pretrained BLIP text decoder.


We evaluate our model by jointly generating visual and textual summaries for a long video. We refer to this task as cross-modal visual and textual summarization of videos (i.e., V2VT-SUM). We evaluate our model on the VideoXum dataset and show example qualitative results from variants of our proposed model (VTSUM-BLIP) in the following figure.


Example results of the generated video and text summaries across different baseline models. Red (both line and box) indicates the ground truth. Green indicates the results of VTSUM-BLIP (Base). Blue indicates the results of VTSUM-BLIP (+TT+CA).
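Beyond qualitative inspection, the semantic consistency between the two output modalities can be quantified with the adapted CLIPScore mentioned above: a text-summary embedding is compared against the embeddings of the selected summary frames. The sketch below illustrates the idea with plain Python vectors; in practice the embeddings would come from a pretrained CLIP model, and the function name here is a hypothetical stand-in, not the paper's released implementation.

```python
# Hedged sketch of a cross-modal consistency score in the spirit of
# CLIPScore: mean (clipped at zero) cosine similarity between a text
# embedding and each summary-frame embedding. Toy vectors stand in for
# real CLIP features.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def video_text_consistency(frame_embs, text_emb):
    """Average max(cos, 0) over summary frames, following CLIPScore's
    convention of clipping negative similarities to zero."""
    sims = [max(cosine(f, text_emb), 0.0) for f in frame_embs]
    return sum(sims) / len(sims)

# Two toy frames: one aligned with the text embedding, one orthogonal.
frames = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
print(video_text_consistency(frames, text))  # → 0.5
```

A higher score indicates that the retained frames and the generated narrative describe the same content, which is exactly the alignment requirement of the V2VT-SUM task.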


@article{lin2023videoxum,
  author    = {Lin, Jingyang and Hua, Hang and Chen, Ming and Li, Yikang and Hsiao, Jenhao and Ho, Chiuman and Luo, Jiebo},
  title     = {VideoXum: Cross-modal Visual and Textural Summarization of Videos},
  journal   = {IEEE Transactions on Multimedia},
  year      = {2023},
}