Fine-tuning with Multi-modal Entity Prompts for News Image Captioning

Jingjing Zhang, Shancheng Fang, Zhendong Mao, Zhiwei Zhang, Yongdong Zhang
Published in: Proceedings of the 30th ACM International Conference on Multimedia
Publication date: 2022-10-10
DOI: 10.1145/3503161.3547883 (https://doi.org/10.1145/3503161.3547883)
Citations: 4

Abstract

News Image Captioning aims to generate descriptions for images embedded in news articles. Such descriptions involve many real-world concepts, especially named entities. However, existing methods are limited by their reliance on entity-level templates. Not only are these templates labor-intensive to craft, but they are also error-prone because they are only locally entity-aware: they constrain only the prediction output at each language-model decoding step, which corrupts entity relationships. To overcome this problem, we investigate a concise and flexible paradigm that achieves global entity awareness by combining a prompting mechanism with fine-tuning of pre-trained models, named Fine-tuning with Multi-modal Entity Prompts for News Image Captioning (NewsMEP). Firstly, we incorporate two pre-trained models: (i) CLIP, which translates the image with open-domain knowledge; (ii) BART, which is extended to encode the article and the image simultaneously. Moreover, by leveraging the BART architecture, we can easily train the model end to end. Secondly, we prepend two prompts to the target caption to exploit the entity-level lexical cohesion and inherent coherence of the pre-trained language model. Concretely, the visual prompts are obtained by mapping CLIP embeddings, and the entity-oriented prompts are constructed automatically from contextual vectors. Thirdly, we provide an entity chain to control caption generation so that it focuses on entities of interest. Experimental results on two large-scale publicly available datasets, including detailed ablation studies, show that NewsMEP not only outperforms state-of-the-art methods on general captioning metrics but also achieves strong precision and recall for various named entities.
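The abstract gives no implementation details, so the following is only a minimal sketch of two ideas it names: mapping an image embedding to a fixed number of visual prompt vectors, and prepending an entity chain to the target caption. Everything here is an assumption made for illustration: the single linear projection standing in for the mapping step, the function names `map_to_visual_prompts` and `build_decoder_target`, the `<ent>` and `<cap>` marker tokens, and the ` | ` separator. None of it is taken from NewsMEP itself, and plain Python lists stand in for real CLIP and BART tensors.

```python
from typing import List, Sequence

Vector = List[float]


def map_to_visual_prompts(
    image_embedding: Sequence[float],
    weights: Sequence[Sequence[float]],
    num_prompts: int,
) -> List[Vector]:
    """Project one image embedding into `num_prompts` prompt vectors.

    `weights` has shape (num_prompts * hidden_dim, embed_dim). A single
    linear layer is an assumed stand-in for the mapping described in the
    abstract; the real mapping network may differ.
    """
    if num_prompts <= 0 or len(weights) % num_prompts != 0:
        raise ValueError("weight rows must be a positive multiple of num_prompts")
    hidden_dim = len(weights) // num_prompts
    # Matrix-vector product: one output value per weight row.
    flat = [sum(w * x for w, x in zip(row, image_embedding)) for row in weights]
    # Split the flat projection into num_prompts vectors of size hidden_dim.
    return [flat[i * hidden_dim:(i + 1) * hidden_dim] for i in range(num_prompts)]


def build_decoder_target(
    entity_chain: Sequence[str],
    caption: str,
    ent_token: str = "<ent>",  # hypothetical marker token
    cap_token: str = "<cap>",  # hypothetical marker token
) -> str:
    """Prepend the entity chain so entities are emitted before the caption."""
    chain = " | ".join(entity_chain)
    return f"{ent_token} {chain} {cap_token} {caption}"


# Toy usage with a 2-dimensional "embedding" and two prompt vectors.
prompts = map_to_visual_prompts(
    [1.0, 2.0],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]],
    num_prompts=2,
)
target = build_decoder_target(
    ["Angela Merkel", "Berlin"], "Angela Merkel speaks in Berlin."
)
print(prompts)  # [[1.0, 2.0], [3.0, 2.0]]
print(target)   # <ent> Angela Merkel | Berlin <cap> Angela Merkel speaks in Berlin.
```

In a setup like this, the prompt vectors would be concatenated with the token embeddings fed to the language model, and the entity chain in the target lets the decoder commit to the entities of interest before writing the caption. Editing the chain at inference time is one plausible way to steer which entities the caption mentions.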