A novel image captioning model with visual-semantic similarities and visual representations re-weighting

Alaa Thobhani, Beiji Zou, Xiaoyan Kui, Asma A. Al-Shargabi, Zaid Derea, Amr Abdussalam, Mohammed A. Asham

Journal of King Saud University - Computer and Information Sciences, Volume 36, Issue 7, Article 102127 (published 2024-07-14). DOI: 10.1016/j.jksuci.2024.102127. Open access: https://www.sciencedirect.com/science/article/pii/S1319157824002167
Abstract
Image captioning, the task of generating descriptive sentences for images, has seen significant advancements by incorporating semantic information. However, previous studies employed semantic attribute detectors to extract predetermined attributes that are applied uniformly at every time step, resulting in the use of attributes irrelevant to the linguistic context during word generation. Furthermore, the integration between semantic attributes and visual representations in previous works is superficial and ineffective, leading to the neglect of the rich visual-semantic connections that affect the captioning model's performance. To address the limitations of previous models, we introduce a novel framework that adapts attribute usage based on contextual relevance and effectively exploits the similarities between visual features and semantic attributes. Our framework includes an Attribute Detection Component (ADC) that predicts relevant attributes using visual features and attribute embeddings. The Attribute Prediction and Visual Weighting module (APVW) then dynamically adjusts these attributes and generates weights to refine the visual context vector, enhancing semantic alignment. Our approach demonstrated an average improvement of 3.30% in BLEU@1 and 5.24% in CIDEr on MS-COCO, and 6.55% in BLEU@1 and 25.72% in CIDEr on Flickr30K, during the CIDEr optimization phase.
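The abstract describes the ADC/APVW pipeline only at a high level. The following is a minimal, hypothetical PyTorch sketch of how such a design could be wired together: the module names mirror the abstract, but every dimension, layer, and the dot-product similarity and attention-style weighting scheme are illustrative assumptions, not the authors' published implementation.

```python
# Hypothetical sketch of the ADC + APVW idea described in the abstract.
# All shapes, layers, and the similarity/weighting scheme are illustrative
# assumptions, NOT the authors' actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttributeDetectionComponent(nn.Module):
    """Predicts attribute relevance from visual features and attribute embeddings."""

    def __init__(self, feat_dim, attr_dim, num_attrs):
        super().__init__()
        self.attr_embed = nn.Embedding(num_attrs, attr_dim)   # semantic attribute embeddings
        self.vis_proj = nn.Linear(feat_dim, attr_dim)         # map visual features into attribute space

    def forward(self, vis_feats):
        # vis_feats: (B, R, feat_dim) region-level visual features
        v = self.vis_proj(vis_feats.mean(dim=1))               # (B, attr_dim) pooled visual query
        scores = v @ self.attr_embed.weight.t()                # visual-semantic similarity, (B, A)
        return torch.sigmoid(scores)                           # attribute relevance probabilities


class APVW(nn.Module):
    """Re-weights attributes by the decoder context and refines the visual context vector."""

    def __init__(self, feat_dim, attr_dim, num_attrs, hidden_dim):
        super().__init__()
        self.attr_gate = nn.Linear(hidden_dim, num_attrs)     # context-dependent attribute gating
        self.query_proj = nn.Linear(attr_dim, feat_dim)       # semantic summary -> visual query

    def forward(self, vis_feats, attr_probs, attr_embeds, hidden):
        # hidden: (B, H) decoder state at the current time step
        ctx_gate = torch.sigmoid(self.attr_gate(hidden))       # adapt attribute usage to the context
        attrs = attr_probs * ctx_gate                          # (B, A) re-weighted attributes
        sem = attrs @ attr_embeds                              # (B, attr_dim) semantic summary
        query = self.query_proj(sem)                           # (B, feat_dim)
        logits = torch.bmm(vis_feats, query.unsqueeze(-1)).squeeze(-1)      # (B, R) region scores
        weights = F.softmax(logits, dim=-1)                    # visual representation re-weighting
        visual_ctx = torch.bmm(weights.unsqueeze(1), vis_feats).squeeze(1)  # refined (B, feat_dim)
        return visual_ctx, sem


# Toy forward pass with random features (dimensions are arbitrary).
B, R, A = 2, 36, 1000
feat_dim, attr_dim, hidden_dim = 2048, 300, 512
adc = AttributeDetectionComponent(feat_dim, attr_dim, A)
apvw = APVW(feat_dim, attr_dim, A, hidden_dim)
vis = torch.randn(B, R, feat_dim)
hidden = torch.randn(B, hidden_dim)
attr_probs = adc(vis)
visual_ctx, sem = apvw(vis, attr_probs, adc.attr_embed.weight, hidden)
print(visual_ctx.shape, sem.shape)  # torch.Size([2, 2048]) torch.Size([2, 300])
```

In a full captioning model, the refined visual context vector and the semantic summary would presumably feed the caption decoder at each time step, so that both the attended regions and the active attributes track the linguistic context as each word is generated.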
About the Journal
In 2022, the Journal of King Saud University - Computer and Information Sciences will become an author-paid, open-access journal. Authors who submit their manuscripts after October 31st, 2021 will be asked to pay an Article Processing Charge (APC) after acceptance of their paper to make their work immediately, permanently, and freely accessible to all. The Journal of King Saud University - Computer and Information Sciences is a refereed, international journal that covers all aspects of both the foundations of computer science and its practical applications.