{"title":"Multi-Modal Graph Aggregation Transformer for image captioning","authors":"Lizhi Chen , Kesen Li","doi":"10.1016/j.neunet.2024.106813","DOIUrl":null,"url":null,"abstract":"<div><div>The current image captioning directly encodes the detected target area and recognizes the objects in the image to correctly describe the image. However, it is unreliable to make full use of regional features because they cannot convey contextual information, such as the relationship between objects and the lack of object predicate level semantics. An effective model should contain multiple modes and explore their interactions to help understand the image. Therefore, we introduce the Multi-Modal Graph Aggregation Transformer (MMGAT), which uses the information of various image modes to fill this gap. It first represents an image as a graph consisting of three sub-graphs, depicting context grid, region, and semantic text modalities respectively. Then, we introduce three aggregators that guide message passing from one graph to another to exploit context in different modalities, so as to refine the features of nodes. The updated nodes have better features for image captioning. We show significant performance scores of 144.6% CIDEr on MS-COCO and 80.3% CIDEr on Flickr30k compared to state of the arts, and conduct a rigorous analysis to demonstrate the importance of each part of our design.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"181 ","pages":"Article 106813"},"PeriodicalIF":6.0000,"publicationDate":"2024-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608024007378","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Current image captioning models directly encode detected target regions and recognize the objects in an image in order to describe it correctly. However, relying on regional features alone is unreliable, because they convey little contextual information: they miss the relationships between objects and lack predicate-level semantics. An effective model should incorporate multiple modalities and exploit their interactions to better understand the image. We therefore introduce the Multi-Modal Graph Aggregation Transformer (MMGAT), which uses information from several image modalities to fill this gap. MMGAT first represents an image as a graph composed of three sub-graphs, depicting the grid-context, region, and semantic-text modalities, respectively. We then introduce three aggregators that guide message passing from one sub-graph to another, exploiting context across modalities to refine the node features; the updated nodes provide better features for caption generation. MMGAT achieves strong performance of 144.6% CIDEr on MS-COCO and 80.3% CIDEr on Flickr30k compared with the state of the art, and a rigorous analysis demonstrates the importance of each part of our design.
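To make the cross-modal aggregation idea concrete, below is a minimal PyTorch sketch of attention-based message passing between three node sets (grid, region, and semantic text). All class names, the chosen source-to-target routing, and the feature dimensions are hypothetical illustrations, not the paper's actual implementation.

```python
# Illustrative sketch (assumed design, not the authors' code): three modality
# node sets and cross-modal aggregators that refine one modality's nodes by
# attending over another's.
import torch
import torch.nn as nn


class CrossModalAggregator(nn.Module):
    """Refines target-modality nodes using messages from source-modality nodes."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (B, N_t, dim) nodes being refined; source: (B, N_s, dim) context nodes
        messages, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + messages)  # residual update of the target nodes


class MultiModalGraphBlock(nn.Module):
    """One round of message passing between grid, region, and text sub-graphs."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # One aggregator per cross-modal edge used in this sketch (a design choice).
        self.grid_from_region = CrossModalAggregator(dim)
        self.region_from_text = CrossModalAggregator(dim)
        self.text_from_grid = CrossModalAggregator(dim)

    def forward(self, grid, region, text):
        grid = self.grid_from_region(grid, region)
        region = self.region_from_text(region, text)
        text = self.text_from_grid(text, grid)
        return grid, region, text


if __name__ == "__main__":
    B, dim = 2, 512
    grid = torch.randn(B, 49, dim)    # e.g. 7x7 grid features
    region = torch.randn(B, 36, dim)  # e.g. detected region features
    text = torch.randn(B, 10, dim)    # e.g. semantic word/tag embeddings
    refined = MultiModalGraphBlock(dim)(grid, region, text)
    print([t.shape for t in refined])
```

In such a design the refined node features from all three sub-graphs would then be fed to a Transformer decoder to generate the caption; the routing between modalities shown here is only one possible wiring.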
Journal Introduction
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.