Multimodal video summarization based on graph contrastive learning with fine-grained graph interaction

IF 3.6 · JCR Q2 · CAS Region 2 (Engineering & Technology) · ENGINEERING, ELECTRICAL & ELECTRONIC
Guangli Wu, Miaomiao Wang, Ning Ma
DOI: 10.1016/j.sigpro.2025.110250
Journal: Signal Processing, Volume 239, Article 110250
Published: 2025-08-20 (Journal Article)
URL: https://www.sciencedirect.com/science/article/pii/S0165168425003640
Citations: 0

Abstract

Video summarization aims to extract the key information from a video and select concise, representative clips to form a summary. However, unimodal video summarization methods struggle to capture the rich semantics in videos, while existing multimodal methods often suffer interference from redundant frames and noise during modal fusion, resulting in insufficient cross-modal interaction. We therefore introduce a multimodal video summarization model based on graph contrastive learning and fine-grained graph interaction. The model first represents the video and the text as graph structures, and uses a spatial–temporal graph network to jointly model spatial–temporal dependencies. Second, the node features of the video and text graphs are refined with graph contrastive learning to suppress redundant frames and noise. In cross-modal graph matching, the similarity between the video and text graphs is modeled in parallel from multiple semantic perspectives to achieve fine-grained cross-modal interaction. In addition, a graph alignment loss further constrains the consistency of cross-modal semantic alignment. Finally, extensive experiments on two benchmark datasets, TVSum and SumMe, verify the effectiveness of the DCGM model, which outperforms current state-of-the-art methods.
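The abstract does not specify the exact form of the graph contrastive objective, but node-level contrastive learning of this kind is commonly an InfoNCE-style loss: aligned video/text nodes are pulled together while all other nodes act as negatives. The sketch below is an illustrative numpy implementation under that assumption, not the paper's actual loss; the function name and the choice of cosine similarity with a temperature are hypothetical.

```python
import numpy as np

def info_nce_node_loss(video_nodes, text_nodes, temperature=0.1):
    """InfoNCE-style contrastive loss over paired video/text graph nodes.

    video_nodes, text_nodes: (N, d) arrays; row i of each is treated as a
    positive pair, and all other rows in the batch serve as negatives.
    """
    # L2-normalize node embeddings so dot products are cosine similarities.
    v = video_nodes / np.linalg.norm(video_nodes, axis=1, keepdims=True)
    t = text_nodes / np.linalg.norm(text_nodes, axis=1, keepdims=True)
    logits = v @ t.T / temperature                 # (N, N) similarity matrix
    # Softmax cross-entropy with the diagonal as the positive targets.
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

Under this objective, redundant or noisy frames that fail to match their textual counterpart receive a high loss, which is one plausible mechanism for the "eliminate redundant frames and noise" effect the abstract describes.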

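The "multiple semantic perspectives" used in cross-modal graph matching are likewise unspecified; one common reading is a set of learned projection heads, each scoring graph similarity in its own subspace, with the scores then aggregated. The following is a minimal numpy sketch under that assumption; the projection matrices and the max-then-mean pooling are stand-ins, not the paper's design.

```python
import numpy as np

def multi_view_graph_similarity(video_nodes, text_nodes, projections):
    """Score video-text graph similarity from several semantic perspectives.

    projections: list of (d, d') matrices, hypothetical stand-ins for learned
    per-perspective projection heads. Returns the mean over perspectives of
    the max-pooled node-pair cosine similarity between the two graphs.
    """
    scores = []
    for W in projections:
        v = video_nodes @ W
        t = text_nodes @ W
        v = v / np.linalg.norm(v, axis=1, keepdims=True)
        t = t / np.linalg.norm(t, axis=1, keepdims=True)
        sim = v @ t.T                  # node-to-node cosine similarities
        # For each video node, keep its best-matching text node, then average
        # over video nodes to get one graph-level score per perspective.
        scores.append(sim.max(axis=1).mean())
    return float(np.mean(scores))
```

Running the same matching in parallel over several projections lets different heads attend to different semantic facets (e.g. objects vs. actions), which is the intuition behind the fine-grained interaction claimed in the abstract.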

Source journal
Signal Processing (Engineering: Electrical & Electronic)
CiteScore: 9.20
Self-citation rate: 9.10%
Articles per year: 309
Review time: 41 days
Aims and scope: Signal Processing incorporates all aspects of the theory and practice of signal processing. It features original research work, tutorial and review articles, and accounts of practical developments. It is intended for rapid dissemination of knowledge and experience to engineers and scientists working in the research, development or practical application of signal processing. Subject areas covered by the journal include: Signal Theory; Stochastic Processes; Detection and Estimation; Spectral Analysis; Filtering; Signal Processing Systems; Software Developments; Image Processing; Pattern Recognition; Optical Signal Processing; Digital Signal Processing; Multi-dimensional Signal Processing; Communication Signal Processing; Biomedical Signal Processing; Geophysical and Astrophysical Signal Processing; Earth Resources Signal Processing; Acoustic and Vibration Signal Processing; Data Processing; Remote Sensing; Signal Processing Technology; Radar Signal Processing; Sonar Signal Processing; Industrial Applications; New Applications.