{"title":"Multimodal representation fusion method for dense video captioning","authors":"Haojie Fang , Yonggang Li , Yingjian Li","doi":"10.1016/j.knosys.2025.113856","DOIUrl":null,"url":null,"abstract":"<div><div>Dense video captioning aims to locate multiple events from untrimmed videos and generate corresponding captions for each meaningful event. The application of multimodal information(e.g., video, audio) for dense video captioning has recently achieved great success. However, learning the information interactions between different modalities while achieving cross-modal feature alignment is highly challenging for an encoder. Recent studies of several multimodal tasks have shown that multimodal models benefit from shared and individual representations. Thus, in this paper, we propose a novel feature fusion module, which uses shared and individual modality representations to capture commonalities and complementary relationships between modalities. Moreover, the proposed model bridges the gap between shared modality representations, which helps to obtain deeper cross-modal associations for better feature interaction and alignment. Furthermore, to compensate for the limitation that different level proposal heads do not interact sufficiently during event detection, we propose a multilevel information interaction mechanism to dynamically adjust and fuse the information among different level proposal heads in the event detection module. Based on the ActivityNet Captions, subdatasets of ActivityNet Captions and YouCook2, we conducted comprehensive experiments to evaluate the performance of our proposed model. The experimental results show that our model achieves impressive performance compared with state-of-the-art methods.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"324 ","pages":"Article 113856"},"PeriodicalIF":7.2000,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705125009025","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Dense video captioning aims to locate multiple events in untrimmed videos and generate a corresponding caption for each meaningful event. The application of multimodal information (e.g., video, audio) to dense video captioning has recently achieved great success. However, learning the information interactions between different modalities while achieving cross-modal feature alignment is highly challenging for an encoder. Recent studies of several multimodal tasks have shown that multimodal models benefit from shared and individual representations. Thus, in this paper, we propose a novel feature fusion module that uses shared and individual modality representations to capture the commonalities and complementary relationships between modalities. Moreover, the proposed model bridges the gap between shared modality representations, which helps to obtain deeper cross-modal associations for better feature interaction and alignment. Furthermore, to compensate for the limitation that proposal heads at different levels do not interact sufficiently during event detection, we propose a multilevel information interaction mechanism that dynamically adjusts and fuses information among the different-level proposal heads in the event detection module. We conducted comprehensive experiments on ActivityNet Captions, subsets of ActivityNet Captions, and YouCook2 to evaluate the performance of our proposed model. The experimental results show that our model achieves impressive performance compared with state-of-the-art methods.
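The abstract describes two mechanisms without implementation detail, so the following is a minimal, hypothetical sketch of the shared/individual fusion idea, not the authors' implementation. It assumes pre-extracted per-event video and audio features; the module and parameter names (SharedIndividualFusion, d_video, d_audio, d_model) are illustrative. Each modality is projected into a shared space (to capture commonalities) and an individual space (to keep complementary information), and a cosine-similarity term pulls the two shared representations together, one plausible way to "bridge the gap" between shared modality representations.

```python
# Hypothetical sketch of shared/individual multimodal fusion.
# Assumes pre-extracted features: video (batch, d_video), audio (batch, d_audio).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedIndividualFusion(nn.Module):
    def __init__(self, d_video: int, d_audio: int, d_model: int):
        super().__init__()
        # Per-modality projections into a common "shared" space,
        # intended to capture cross-modal commonalities.
        self.shared_video = nn.Linear(d_video, d_model)
        self.shared_audio = nn.Linear(d_audio, d_model)
        # Modality-specific projections keep complementary information.
        self.indiv_video = nn.Linear(d_video, d_model)
        self.indiv_audio = nn.Linear(d_audio, d_model)
        # Fuse the four representations into one event feature.
        self.fuse = nn.Linear(4 * d_model, d_model)

    def forward(self, video: torch.Tensor, audio: torch.Tensor):
        sv, sa = self.shared_video(video), self.shared_audio(audio)
        iv, ia = self.indiv_video(video), self.indiv_audio(audio)
        fused = self.fuse(torch.cat([sv, sa, iv, ia], dim=-1))
        # Alignment term pulls the shared representations together,
        # encouraging a modality-agnostic shared space.
        align_loss = 1.0 - F.cosine_similarity(sv, sa, dim=-1).mean()
        return fused, align_loss

# Usage with random stand-in features:
model = SharedIndividualFusion(d_video=1024, d_audio=128, d_model=512)
fused, align_loss = model(torch.randn(8, 1024), torch.randn(8, 128))
print(fused.shape, align_loss.item())  # torch.Size([8, 512]), scalar
```

The multilevel information interaction mechanism is likewise unspecified in the abstract; one simple reading is a learned mixing of proposal features across levels, so that each level's head sees a convex combination of all levels before predicting. The sketch below assumes L levels of proposal features of shape (batch, d_model); MultiLevelInteraction and mix_logits are invented names.

```python
# Hypothetical sketch of cross-level information interaction among
# proposal heads; not the paper's exact mechanism.
import torch
import torch.nn as nn

class MultiLevelInteraction(nn.Module):
    def __init__(self, num_levels: int, d_model: int):
        super().__init__()
        # One mixing weight per (target level, source level) pair,
        # softmax-normalized so each level takes a convex combination.
        self.mix_logits = nn.Parameter(torch.zeros(num_levels, num_levels))
        self.proj = nn.Linear(d_model, d_model)

    def forward(self, level_feats: list) -> list:
        stacked = torch.stack(level_feats, dim=0)          # (L, B, d)
        weights = torch.softmax(self.mix_logits, dim=-1)   # (L, L)
        mixed = torch.einsum("ls,sbd->lbd", weights, stacked)
        return [self.proj(mixed[i]) for i in range(len(level_feats))]
```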
Journal introduction:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.