{"title":"Fine-tuning with Multi-modal Entity Prompts for News Image Captioning","authors":"Jingjing Zhang, Shancheng Fang, Zhendong Mao, Zhiwei Zhang, Yongdong Zhang","doi":"10.1145/3503161.3547883","DOIUrl":null,"url":null,"abstract":"News Image Captioning aims to generate descriptions for images embedded in news articles, including plentiful real-world concepts, especially about named entities. However, existing methods are limited in the entity-level template. Not only is it labor-intensive to craft the template, but it is error-prone due to local entity-aware, which solely constrains the prediction output at each language model decoding step with corrupted entity relationship. To overcome the problem, we investigate a concise and flexible paradigm to achieve global entity-aware by introducing a prompting mechanism with fine-tuning pre-trained models, named Fine-tuning with Multi-modal Entity Prompts for News Image Captioning (NewsMEP). Firstly, we incorporate two pre-trained models: (i) CLIP, translating the image with open-domain knowledge; (ii) BART, extended to encode article and image simultaneously. Moreover, leveraging the BART architecture, we can easily take the end-to-end fashion. Secondly, we prepend the target caption with two prompts to utilize entity-level lexical cohesion and inherent coherence in the pre-trained language model. Concretely, the visual prompts are obtained by mapping CLIP embeddings, and contextual vectors automatically construct the entity-oriented prompts. Thirdly, we provide an entity chain to control caption generation that focuses on entities of interest. Experiments results on two large-scale publicly available datasets, including detailed ablation studies, show that our NewsMEP not only outperforms state-of-the-art methods in general caption metrics but also achieves significant performance in precision and recall of various named entities.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th ACM International Conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3503161.3547883","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 4
Abstract
News Image Captioning aims to generate descriptions for images embedded in news articles, which involve plentiful real-world concepts, especially named entities. However, existing methods are limited by entity-level templates. Not only is crafting such templates labor-intensive, but they are also error-prone due to local entity awareness, which constrains the prediction only at each individual language-model decoding step and thus corrupts the relationships between entities. To overcome this problem, we investigate a concise and flexible paradigm that achieves global entity awareness by introducing a prompting mechanism while fine-tuning pre-trained models, named Fine-tuning with Multi-modal Entity Prompts for News Image Captioning (NewsMEP). Firstly, we incorporate two pre-trained models: (i) CLIP, which translates the image with open-domain knowledge, and (ii) BART, extended to encode the article and the image simultaneously. Moreover, the BART architecture allows the model to be trained end to end. Secondly, we prepend two prompts to the target caption to exploit entity-level lexical cohesion and the inherent coherence of the pre-trained language model. Concretely, the visual prompts are obtained by mapping CLIP embeddings, and the entity-oriented prompts are constructed automatically from contextual vectors. Thirdly, we provide an entity chain to steer caption generation toward entities of interest. Experimental results on two large-scale publicly available datasets, including detailed ablation studies, show that our NewsMEP not only outperforms state-of-the-art methods on general captioning metrics but also achieves notable gains in the precision and recall of various named entities.
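To make the prompting mechanism concrete, below is a minimal PyTorch sketch of the general idea of mapping a global CLIP image embedding into a sequence of visual prompt vectors that are prepended to the language model's input embeddings. All names, dimensions, and the choice of a single projection layer are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VisualPromptMapper(nn.Module):
    """Maps a CLIP image embedding to a sequence of prompt vectors.

    A minimal sketch: the class name, dimensions, and one-layer
    projection are assumptions for illustration, not the authors'
    exact NewsMEP architecture.
    """
    def __init__(self, clip_dim=512, lm_dim=768, prompt_len=4):
        super().__init__()
        self.prompt_len = prompt_len
        self.lm_dim = lm_dim
        # Project one global embedding to prompt_len vectors of lm_dim each.
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, lm_dim * prompt_len),
            nn.Tanh(),
        )

    def forward(self, clip_embed):
        # clip_embed: (batch, clip_dim) global CLIP image embedding
        batch = clip_embed.size(0)
        return self.proj(clip_embed).view(batch, self.prompt_len, self.lm_dim)

# Usage: prepend the visual prompts to the caption token embeddings,
# using random tensors as stand-ins for CLIP and BART outputs.
mapper = VisualPromptMapper()
clip_embed = torch.randn(2, 512)          # stand-in for a CLIP image embedding
caption_embeds = torch.randn(2, 20, 768)  # stand-in for BART token embeddings
lm_inputs = torch.cat([mapper(clip_embed), caption_embeds], dim=1)
print(lm_inputs.shape)  # torch.Size([2, 24, 768])
```

In this sketch the prompt vectors occupy the first positions of the sequence, so every decoding step can attend to the image-derived context, which is one plausible way to realize the global (rather than per-step) entity awareness the abstract describes.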