CDKM: Common and Distinct Knowledge Mining Network With Content Interaction for Dense Captioning

Hongyu Deng; Yushan Xie; Qi Wang; Jianjun Wang; Weijian Ruan; Wu Liu; Yong-Jin Liu

IEEE Transactions on Multimedia, vol. 26, pp. 10462-10473, published 2024-06-04. DOI: 10.1109/TMM.2024.3407695 (https://ieeexplore.ieee.org/document/10547439/)
Abstract
The dense captioning task aims to detect multiple salient regions of an image and describe each of them in natural language. Although the field has advanced significantly in recent years, existing methods still have limitations. On the one hand, most dense captioning methods lack strong target detection capabilities and struggle to cover all relevant content in target-dense images. On the other hand, current transformer-based methods are powerful but neglect the acquisition and use of contextual information, which hinders visual understanding of local regions. To address these issues, we propose a common and distinct knowledge mining network with content interaction (CDKM) for dense captioning. Our network includes a knowledge mining mechanism that improves the detection of salient targets by capturing common and distinct knowledge from multi-scale features. We further propose a content interaction module that combines region features into a unified context based on their correlations. Experiments on several benchmarks show that the proposed method outperforms current state-of-the-art methods.
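To make the two mechanisms in the abstract concrete, below is a minimal PyTorch sketch of what a common/distinct knowledge mining step over multi-scale features and a correlation-based content interaction step between regions might look like. The module names, shapes, and design choices (the cross-scale mean as common knowledge, scale-specific residuals as distinct knowledge, dot-product attention as the correlation) are our assumptions for illustration only, not the authors' implementation.

```python
import torch
import torch.nn as nn


class KnowledgeMining(nn.Module):
    """Hypothetical sketch: split multi-scale region features into knowledge
    shared by all scales (common) and scale-specific residuals (distinct),
    then fuse the two. Not the paper's actual architecture."""

    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, scales: torch.Tensor) -> torch.Tensor:
        # scales: (num_scales, num_regions, dim)
        common = scales.mean(dim=0)                     # knowledge agreed across scales
        distinct = (scales - common).abs().amax(dim=0)  # strongest scale-specific cue
        return self.fuse(torch.cat([common, distinct], dim=-1))


class ContentInteraction(nn.Module):
    """Hypothetical sketch: combine region features into a shared context via
    their pairwise correlations (scaled dot-product attention)."""

    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (num_regions, dim) features of detected salient regions
        attn = torch.softmax(
            self.q(regions) @ self.k(regions).transpose(-2, -1) * self.scale, dim=-1
        )
        context = attn @ self.v(regions)  # correlation-weighted mixture of regions
        return regions + context          # each region enriched with shared context


# Toy usage: 3 feature scales, 10 candidate regions, 256-d features.
feats = torch.randn(3, 10, 256)
mined = KnowledgeMining(256)(feats)        # -> (10, 256)
enriched = ContentInteraction(256)(mined)  # -> (10, 256)
print(mined.shape, enriched.shape)
```

The residual connection in ContentInteraction keeps each region's own features intact while adding context borrowed from correlated regions, which matches the abstract's goal of aiding visual understanding of local areas without washing out per-region detail.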
Journal Overview
IEEE Transactions on Multimedia covers diverse aspects of multimedia technology and applications, including circuits, networking, signal processing, systems, software, and systems integration. Its scope aligns with the Fields of Interest of its sponsors, ensuring comprehensive coverage of multimedia research.