{"title":"新闻图片字幕的多模态实体提示微调","authors":"Jingjing Zhang, Shancheng Fang, Zhendong Mao, Zhiwei Zhang, Yongdong Zhang","doi":"10.1145/3503161.3547883","DOIUrl":null,"url":null,"abstract":"News Image Captioning aims to generate descriptions for images embedded in news articles, including plentiful real-world concepts, especially about named entities. However, existing methods are limited in the entity-level template. Not only is it labor-intensive to craft the template, but it is error-prone due to local entity-aware, which solely constrains the prediction output at each language model decoding step with corrupted entity relationship. To overcome the problem, we investigate a concise and flexible paradigm to achieve global entity-aware by introducing a prompting mechanism with fine-tuning pre-trained models, named Fine-tuning with Multi-modal Entity Prompts for News Image Captioning (NewsMEP). Firstly, we incorporate two pre-trained models: (i) CLIP, translating the image with open-domain knowledge; (ii) BART, extended to encode article and image simultaneously. Moreover, leveraging the BART architecture, we can easily take the end-to-end fashion. Secondly, we prepend the target caption with two prompts to utilize entity-level lexical cohesion and inherent coherence in the pre-trained language model. Concretely, the visual prompts are obtained by mapping CLIP embeddings, and contextual vectors automatically construct the entity-oriented prompts. Thirdly, we provide an entity chain to control caption generation that focuses on entities of interest. Experiments results on two large-scale publicly available datasets, including detailed ablation studies, show that our NewsMEP not only outperforms state-of-the-art methods in general caption metrics but also achieves significant performance in precision and recall of various named entities.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Fine-tuning with Multi-modal Entity Prompts for News Image Captioning\",\"authors\":\"Jingjing Zhang, Shancheng Fang, Zhendong Mao, Zhiwei Zhang, Yongdong Zhang\",\"doi\":\"10.1145/3503161.3547883\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"News Image Captioning aims to generate descriptions for images embedded in news articles, including plentiful real-world concepts, especially about named entities. However, existing methods are limited in the entity-level template. Not only is it labor-intensive to craft the template, but it is error-prone due to local entity-aware, which solely constrains the prediction output at each language model decoding step with corrupted entity relationship. To overcome the problem, we investigate a concise and flexible paradigm to achieve global entity-aware by introducing a prompting mechanism with fine-tuning pre-trained models, named Fine-tuning with Multi-modal Entity Prompts for News Image Captioning (NewsMEP). Firstly, we incorporate two pre-trained models: (i) CLIP, translating the image with open-domain knowledge; (ii) BART, extended to encode article and image simultaneously. Moreover, leveraging the BART architecture, we can easily take the end-to-end fashion. 
Secondly, we prepend the target caption with two prompts to utilize entity-level lexical cohesion and inherent coherence in the pre-trained language model. Concretely, the visual prompts are obtained by mapping CLIP embeddings, and contextual vectors automatically construct the entity-oriented prompts. Thirdly, we provide an entity chain to control caption generation that focuses on entities of interest. Experiments results on two large-scale publicly available datasets, including detailed ablation studies, show that our NewsMEP not only outperforms state-of-the-art methods in general caption metrics but also achieves significant performance in precision and recall of various named entities.\",\"PeriodicalId\":412792,\"journal\":{\"name\":\"Proceedings of the 30th ACM International Conference on Multimedia\",\"volume\":\"28 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 30th ACM International Conference on Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3503161.3547883\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 30th ACM International Conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3503161.3547883","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Fine-tuning with Multi-modal Entity Prompts for News Image Captioning
News Image Captioning aims to generate descriptions for images embedded in news articles, covering a wealth of real-world concepts, especially named entities. However, existing methods rely on entity-level templates. Not only are such templates labor-intensive to craft, they are also error-prone because their entity awareness is local: they constrain the prediction only at each individual language-model decoding step, which corrupts the relationships between entities. To overcome this problem, we investigate a concise and flexible paradigm that achieves global entity awareness by introducing a prompting mechanism for fine-tuning pre-trained models, named Fine-tuning with Multi-modal Entity Prompts for News Image Captioning (NewsMEP). First, we incorporate two pre-trained models: (i) CLIP, which translates the image with open-domain knowledge, and (ii) BART, which we extend to encode the article and the image simultaneously. Moreover, leveraging the BART architecture, the model is easily trained in an end-to-end fashion. Second, we prepend two prompts to the target caption to exploit entity-level lexical cohesion and the inherent coherence of the pre-trained language model. Concretely, the visual prompts are obtained by mapping CLIP embeddings, and the entity-oriented prompts are constructed automatically from contextual vectors. Third, we provide an entity chain that steers caption generation toward entities of interest. Experimental results on two large-scale publicly available datasets, including detailed ablation studies, show that NewsMEP not only outperforms state-of-the-art methods on general captioning metrics but also achieves significant gains in precision and recall of various named entities.
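The abstract leaves the prompt-construction details to the paper body. As a rough illustration only, the sketch below shows one plausible reading of the visual-prompt step: a CLIP image embedding is mapped through a learned projection into a short sequence of prompt vectors that are prepended to BART's decoder inputs. It assumes the Hugging Face transformers implementations of CLIP and BART; the module name `VisualPromptMapper` and all hyperparameters (`n_prompts`, dimensions) are hypothetical, not taken from the paper.

```python
# Minimal sketch, NOT the authors' implementation: map a CLIP image embedding
# to visual prompt vectors and prepend them to BART's decoder inputs.
import torch
import torch.nn as nn
from transformers import BartForConditionalGeneration, CLIPModel

class VisualPromptMapper(nn.Module):
    """Hypothetical mapper: one CLIP embedding -> n_prompts BART-sized vectors."""
    def __init__(self, clip_dim=512, bart_dim=768, n_prompts=4):
        super().__init__()
        self.n_prompts, self.bart_dim = n_prompts, bart_dim
        self.proj = nn.Linear(clip_dim, bart_dim * n_prompts)

    def forward(self, clip_emb):  # clip_emb: (batch, clip_dim)
        return self.proj(clip_emb).view(-1, self.n_prompts, self.bart_dim)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
bart = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
mapper = VisualPromptMapper()

def forward_with_visual_prompts(pixel_values, article_ids, caption_ids):
    """pixel_values: CLIP-preprocessed image; article_ids: tokenized article
    (encoder side); caption_ids: tokenized target caption (decoder side)."""
    clip_emb = clip.get_image_features(pixel_values=pixel_values)  # (batch, 512)
    visual_prompts = mapper(clip_emb)                              # (batch, n_prompts, 768)
    caption_embeds = bart.model.decoder.embed_tokens(caption_ids)  # (batch, len, 768)
    # Prepend the mapped visual prompts to the caption embeddings so that
    # every decoding step can attend to the same image-derived signal,
    # rather than constraining each step locally.
    decoder_inputs = torch.cat([visual_prompts, caption_embeds], dim=1)
    return bart(input_ids=article_ids,
                decoder_inputs_embeds=decoder_inputs).logits
```

Under this reading, prepending the prompts (instead of constraining the output at each decoding step) is what gives the global entity awareness the abstract contrasts with local, per-step template constraints.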