Video Summarization Generation Network Based on Dynamic Graph Contrastive Learning and Feature Fusion

Jing Zhang, Guangli Wu, Xinlong Bi, Yulong Cui
{"title":"Video Summarization Generation Network Based on Dynamic Graph Contrastive Learning and Feature Fusion","authors":"Jing Zhang, Guangli Wu, Xinlong Bi, Yulong Cui","doi":"10.3390/electronics13112039","DOIUrl":null,"url":null,"abstract":"Video summarization aims to analyze the structure and content of videos and extract key segments to construct summarization that can accurately summarize the main content, allowing users to quickly access the core information without browsing the full video. However, existing methods have difficulties in capturing long-term dependencies when dealing with long videos. On the other hand, there is a large amount of noise in graph structures, which may lead to the influence of redundant information and is not conducive to the effective learning of video features. To solve the above problems, we propose a video summarization generation network based on dynamic graph contrastive learning and feature fusion, which mainly consists of three modules: feature extraction, video encoder, and feature fusion. Firstly, we compute the shot features and construct a dynamic graph by using the shot features as nodes of the graph and the similarity between the shot features as the weights of the edges. In the video encoder, we extract the temporal and structural features in the video using stacked L-G Blocks, where the L-G Block consists of a bidirectional long short-term memory network and a graph convolutional network. Then, the shallow-level features are obtained after processing by L-G Blocks. In order to remove the redundant information in the graph, graph contrastive learning is used to obtain the optimized deep-level features. Finally, to fully exploit the feature information of the video, a feature fusion gate using the gating mechanism is designed to fully fuse the shallow-level features with the deep-level features. Extensive experiments are conducted on two benchmark datasets, TVSum and SumMe, and the experimental results show that our proposed method outperforms most of the current state-of-the-art video summarization methods.","PeriodicalId":504598,"journal":{"name":"Electronics","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Electronics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/electronics13112039","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Video summarization aims to analyze the structure and content of a video and extract key segments to construct a summary that accurately captures the main content, allowing users to quickly access the core information without watching the full video. However, existing methods have difficulty capturing long-term dependencies when dealing with long videos. Moreover, graph structures contain a large amount of noise, which introduces redundant information and hinders the effective learning of video features. To address these problems, we propose a video summarization generation network based on dynamic graph contrastive learning and feature fusion, consisting of three main modules: feature extraction, a video encoder, and feature fusion. First, we compute shot features and construct a dynamic graph in which the shot features serve as nodes and the similarities between shot features serve as edge weights. In the video encoder, we extract the temporal and structural features of the video using stacked L-G Blocks, where each L-G Block consists of a bidirectional long short-term memory network and a graph convolutional network; the shallow-level features are obtained after processing by the L-G Blocks. To remove redundant information from the graph, graph contrastive learning is applied to obtain optimized deep-level features. Finally, to fully exploit the feature information of the video, a feature fusion gate based on a gating mechanism is designed to fuse the shallow-level features with the deep-level features. Extensive experiments on two benchmark datasets, TVSum and SumMe, show that the proposed method outperforms most current state-of-the-art video summarization methods.
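The abstract outlines three mechanisms: a dynamic graph built from shot-feature similarities, L-G Blocks combining a BiLSTM with a graph convolution, and a gated fusion of shallow- and deep-level features. The sketch below is not the authors' code; it is a minimal PyTorch illustration of those three ideas under assumed details (feature dimension, a single linear GCN layer per block, a sigmoid gate of the form g·shallow + (1−g)·deep, and a second block pass standing in for the contrastive-learning branch).

```python
# Hypothetical sketch (not the paper's implementation): dynamic shot graph,
# one L-G Block (BiLSTM + GCN), and a gated fusion of shallow/deep features.
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_dynamic_graph(shot_feats: torch.Tensor) -> torch.Tensor:
    """Shots are nodes; edge weights are pairwise cosine similarities."""
    normed = F.normalize(shot_feats, dim=-1)            # (N, D)
    adj = (normed @ normed.t()).clamp(min=0)            # (N, N) non-negative similarities
    # Symmetric normalization commonly used with GCNs: D^-1/2 A D^-1/2 (an assumption here)
    d_inv_sqrt = adj.sum(dim=-1).clamp(min=1e-6).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * adj * d_inv_sqrt.unsqueeze(0)


class LGBlock(nn.Module):
    """One L-G Block: BiLSTM for temporal features, one GCN layer for structure."""
    def __init__(self, dim: int):
        super().__init__()
        self.bilstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
        self.gcn_weight = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        temporal, _ = self.bilstm(x.unsqueeze(0))        # (1, N, dim)
        structural = adj @ self.gcn_weight(temporal.squeeze(0))   # graph convolution
        return F.relu(structural)


class FusionGate(nn.Module):
    """Gating mechanism fusing shallow and deep features."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate(torch.cat([shallow, deep], dim=-1)))
        return g * shallow + (1 - g) * deep


# Toy usage: 20 shots with 256-dim features (dimensions are illustrative).
shots = torch.randn(20, 256)
adj = build_dynamic_graph(shots)
shallow = LGBlock(256)(shots, adj)
# In the paper, deep-level features come from graph contrastive learning;
# a second block pass is used here only as a stand-in.
deep = LGBlock(256)(shallow, adj)
fused = FusionGate(256)(shallow, deep)
print(fused.shape)  # torch.Size([20, 256])
```

The gate learns, per dimension, how much of the shallow temporal-structural representation versus the contrastively refined representation to keep; the exact gate formulation in the paper may differ from this convex-combination form.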