Multimodal representation fusion method for dense video captioning

IF 7.2 | Zone 1, Computer Science | Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Knowledge-Based Systems Pub Date : 2025-06-13 DOI:10.1016/j.knosys.2025.113856

Haojie Fang, Yonggang Li, Yingjian Li

Knowledge-Based Systems, volume 324, Article 113856 (Journal Article). https://www.sciencedirect.com/science/article/pii/S0950705125009025

Citations: 0

Abstract

Dense video captioning aims to localize multiple events in untrimmed videos and generate a caption for each meaningful event. Using multimodal information (e.g., video and audio) for dense video captioning has recently achieved great success. However, it is highly challenging for an encoder to learn the information interactions between modalities while also achieving cross-modal feature alignment. Recent studies on several multimodal tasks have shown that multimodal models benefit from both shared and individual representations. In this paper, we therefore propose a novel feature fusion module that uses shared and individual modality representations to capture the commonalities and complementary relationships between modalities. Moreover, the proposed model bridges the gap between the shared modality representations, which helps it obtain deeper cross-modal associations for better feature interaction and alignment. Furthermore, to address the insufficient interaction among proposal heads at different levels during event detection, we propose a multilevel information interaction mechanism that dynamically adjusts and fuses information among these proposal heads in the event detection module. We conducted comprehensive experiments on ActivityNet Captions, subdatasets of ActivityNet Captions, and YouCook2 to evaluate the proposed model. The experimental results show that our model achieves impressive performance compared with state-of-the-art methods.
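The idea of combining shared and individual modality representations can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`fuse`, `linear`, `cosine`), the parameter layout, the cosine-based alignment term, and the concatenation-based fusion are all assumptions chosen for illustration. The sketch also assumes that the video and audio features have already been projected to a common dimension.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Build a rows x cols weight matrix with small Gaussian entries."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine(a, b):
    """Cosine similarity of two vectors, with a small epsilon for stability."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def fuse(video, audio, params):
    """Hypothetical fusion of shared and individual representations."""
    # Shared encoder: the SAME weights for both modalities, so it can
    # only keep what the two modalities have in common.
    shared_v = linear(params["shared"], video)
    shared_a = linear(params["shared"], audio)

    # Individual encoders: modality-specific weights that are free to
    # keep complementary, modality-specific information.
    private_v = linear(params["private_video"], video)
    private_a = linear(params["private_audio"], audio)

    # Assumed alignment term: it shrinks as the two shared
    # representations point in the same direction.
    align_loss = 1.0 - cosine(shared_v, shared_a)

    # Assumed fusion: plain concatenation of all four representations.
    fused = shared_v + shared_a + private_v + private_a
    return fused, align_loss


rng = random.Random(0)
dim, hidden = 8, 4
params = {
    "shared": random_matrix(hidden, dim, rng),
    "private_video": random_matrix(hidden, dim, rng),
    "private_audio": random_matrix(hidden, dim, rng),
}
video = [rng.gauss(0.0, 1.0) for _ in range(dim)]
audio = [rng.gauss(0.0, 1.0) for _ in range(dim)]

fused, align_loss = fuse(video, audio, params)
print(len(fused), round(align_loss, 3))
```

In a trained model the alignment term would be added to the captioning and localization losses, so that the shared encoder is pushed toward modality-invariant features while the individual encoders keep what is specific to each modality.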




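Source journal

Knowledge-Based Systems (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 14.80
Self-citation rate: 12.50%
Articles published: 1245
Review time: 7.8 months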

About the journal: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, provide balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.