Lixin Dai, Tingting Han, Zhou Yu, Jun Yu, Min Tan, Yang Liu
Title: Modality-aware contrast and fusion for multi-modal summarization
DOI: 10.1016/j.neucom.2025.130094
Journal: Neurocomputing, Volume 639, Article 130094 (JCR Q1, Computer Science, Artificial Intelligence; Impact Factor 5.5)
Publication date: 2025-04-12
URL: https://www.sciencedirect.com/science/article/pii/S0925231225007660
Citations: 0
Abstract
Multimodal Summarization with Multi-modal Output (MSMO) is an emerging field focused on generating reliable and high-quality summaries by integrating various media types, such as text and video. Current methods primarily focus on integrating features from different modalities, but often overlook further enhancement and optimization of the fused features. This limitation can reduce the representational capacity of the fusion, ultimately diminishing overall performance. To address these challenges, a novel Modality-aware Contrast and Fusion (MCF) network has been proposed. This network leverages contrastive learning to preserve the integrity of modality-specific semantics while promoting the complementary integration of different media types. The Multi-Modal Attention (MMA) module captures temporal dependencies and learns discriminative semantics for individual media types through uni-modal semantic attention, while aligning and integrating semantics from multiple sources via cross-modal semantic attention. The Uni-Cross Contrastive Learning (UCC) module minimizes modality-aware contrastive losses to enhance the distinctiveness of semantic representations. The Modality-Aware Fusion (MAF) module dynamically adjusts the contributions of uni-modal and cross-modal outputs during the summarization process, optimizing the integration based on the strengths of each modality. Extensive validation on the Bliss, Daily Mail, and CNN datasets demonstrates the state-of-the-art performance of the MCF network and confirms the effectiveness of its components.
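The abstract names two core mechanisms but this excerpt contains no implementation details, so the following is only an illustrative sketch of the general techniques it refers to: an InfoNCE-style contrastive loss (a common way to realize the modality-aware contrastive objective the UCC module describes) and a learned gate that weighs uni-modal against cross-modal features (one plausible form of the MAF module's dynamic contribution adjustment). All function names, shapes, and parameters here are assumptions, not the authors' actual design.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss over two batches of paired embeddings (assumed form).

    Matched rows of z_a and z_b are treated as positive pairs;
    all other rows in the batch serve as negatives.
    """
    # L2-normalize so the dot product is cosine similarity
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature            # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    # Positive pairs lie on the diagonal; the loss pushes them above the rest
    loss = -np.log(np.diag(exp) / exp.sum(axis=1))
    return loss.mean()

def gated_fusion(h_uni, h_cross, w, b=0.0):
    """Blend uni-modal and cross-modal features with a learned sigmoid gate.

    w is a hypothetical learned weight of shape (2*d, 1); the per-sample
    gate g decides how much each stream contributes to the fused output.
    """
    gate_in = np.concatenate([h_uni, h_cross], axis=1)   # (N, 2*d)
    g = 1.0 / (1.0 + np.exp(-(gate_in @ w + b)))         # (N, 1) in (0, 1)
    return g * h_uni + (1.0 - g) * h_cross               # (N, d)
```

In a trained model the gate parameters and embeddings would come from the network itself; the sketch only shows how a contrastive term and a fusion gate compose, not the paper's specific architecture.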
Journal overview:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing, covering neurocomputing theory, practice, and applications.