Title: Multimodal video summarization based on graph contrastive learning with fine-grained graph interaction
Authors: Guangli Wu, Miaomiao Wang, Ning Ma
Journal: Signal Processing, volume 239, Article 110250 (impact factor 3.6; JCR Q2, Engineering, Electrical & Electronic)
Published: 2025-08-20
DOI: 10.1016/j.sigpro.2025.110250
URL: https://www.sciencedirect.com/science/article/pii/S0165168425003640
Citation count: 0
Abstract
Video summarization aims to extract the key information from a video and select concise, representative clips to form a summary. However, unimodal video summarization methods have difficulty capturing the rich semantics of videos, while existing multimodal methods often suffer interference from redundant frames and noise during modal fusion, resulting in insufficient cross-modal interaction. We therefore introduce a multimodal video summarization model based on graph contrastive learning and fine-grained graph interaction. The model first represents the video and the text as graphs and uses a spatial–temporal graph network to jointly model spatial and temporal dependencies. Second, the node features of the video and text graphs are refined with graph contrastive learning to suppress redundant frames and noise. In cross-modal graph matching, the similarity between the video graph and the text graph is modeled in parallel from multiple semantic perspectives to achieve fine-grained cross-modal interaction. In addition, a graph alignment loss further constrains the consistency of cross-modal semantic alignment. Finally, extensive experiments on two benchmark datasets, TVSum and SumMe, verify the effectiveness of the DCGM model, which outperforms current state-of-the-art methods.
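The abstract does not give formulas for the contrastive objective or the multi-perspective matching, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each "semantic perspective" is a weight vector that rescales feature dimensions before a cosine comparison, and that the contrastive term is a standard InfoNCE-style loss. The names `multi_perspective_similarity`, `info_nce`, `perspectives`, and `temperature` are hypothetical and chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def multi_perspective_similarity(video_node, text_node, perspectives):
    """Return one cosine score per perspective.

    Assumption: a perspective is a weight vector that rescales the feature
    dimensions of both nodes before they are compared.
    """
    scores = []
    for weights in perspectives:
        scaled_video = [w * x for w, x in zip(weights, video_node)]
        scaled_text = [w * x for w, x in zip(weights, text_node)]
        scores.append(cosine(scaled_video, scaled_text))
    return scores


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for a single anchor node.

    The loss is small when the anchor is more similar to its positive
    than to any of the negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    # Toy 3-dimensional node features (made up for illustration only).
    video_node = [1.0, 0.0, 1.0]
    text_node = [1.0, 1.0, 0.0]
    perspectives = [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
    print(multi_perspective_similarity(video_node, text_node, perspectives))
    print(info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
```

Scoring each video-text node pair under several perspectives yields a vector of similarities rather than a single number, which is one plausible reading of "fine-grained" interaction. The actual model may use learned projections or a different matching function.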
About the journal:
Signal Processing incorporates all aspects of the theory and practice of signal processing. It features original research work, tutorial and review articles, and accounts of practical developments. It is intended for the rapid dissemination of knowledge and experience to engineers and scientists working in the research, development, or practical application of signal processing.
Subject areas covered by the journal include: Signal Theory; Stochastic Processes; Detection and Estimation; Spectral Analysis; Filtering; Signal Processing Systems; Software Developments; Image Processing; Pattern Recognition; Optical Signal Processing; Digital Signal Processing; Multi-dimensional Signal Processing; Communication Signal Processing; Biomedical Signal Processing; Geophysical and Astrophysical Signal Processing; Earth Resources Signal Processing; Acoustic and Vibration Signal Processing; Data Processing; Remote Sensing; Signal Processing Technology; Radar Signal Processing; Sonar Signal Processing; Industrial Applications; New Applications.