Keyword-driven image captioning via Context-dependent Bilateral LSTM

Xiaodan Zhang, Shengfeng He, Xinhang Song, Pengxu Wei, Shuqiang Jiang, Qixiang Ye, Jianbin Jiao, Rynson W. H. Lau

2017 IEEE International Conference on Multimedia and Expo (ICME), July 2017. DOI: 10.1109/ICME.2017.8019525
Image captioning has recently received much attention. Existing approaches, however, describe images with only simple contextual information, typically generating a single sentence per image with a single contextual emphasis. In this paper, we address this limitation from a user perspective with a novel approach. Given keywords as additional inputs, the proposed method generates different descriptions according to the provided guidance, so descriptions with different focuses can be produced for the same image. Our method is based on a new Context-dependent Bilateral Long Short-Term Memory (CDB-LSTM) model that predicts a keyword-driven sentence by modeling word dependence. The word dependence is explored externally through a bilateral pipeline, and internally through a unified, joint training process. Experiments on the MS COCO dataset demonstrate that the proposed approach not only significantly outperforms the baseline method but also adapts well and remains consistent across various keywords.
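The bilateral pipeline can be pictured as two LSTM decoders expanding outward from the given keyword: one produces the words to the left of the keyword (generated nearest-first, i.e., in reverse order) and the other produces the words to its right, both conditioned on the image feature. The sketch below is a minimal, hypothetical PyTorch illustration of that idea only; the class name, dimensions, greedy decoding loop, and the way the image feature initializes the decoders are assumptions for illustration, not the authors' implementation or their joint training procedure.

```python
# Hypothetical sketch of keyword-driven bilateral caption decoding.
# Assumption: a backward LSTM generates the words preceding the keyword and a
# forward LSTM generates the words following it, both initialized from the
# image feature. Not the authors' code; names and sizes are illustrative.
import torch
import torch.nn as nn


class BilateralKeywordCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, img_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(img_dim, hidden_dim)  # image feature -> initial hidden state
        # Two decoders sharing the word embedding: one expands leftward from
        # the keyword (backward direction), the other rightward (forward).
        self.backward_lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.forward_lstm = nn.LSTMCell(embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def _decode(self, cell, h, c, start_token, max_len):
        # Greedy decoding in one direction, starting from the keyword token.
        tokens, inp = [], start_token
        for _ in range(max_len):
            h, c = cell(self.embed(inp), (h, c))
            inp = self.out(h).argmax(dim=-1)
            tokens.append(inp)
        return tokens

    def forward(self, img_feat, keyword, max_len=10):
        # Both decoders start from the projected image feature.
        h0 = torch.tanh(self.img_proj(img_feat))
        c0 = torch.zeros_like(h0)
        left = self._decode(self.backward_lstm, h0, c0, keyword, max_len)
        right = self._decode(self.forward_lstm, h0, c0, keyword, max_len)
        # The left context is generated keyword-outward, so reverse it before
        # assembling: [reversed left words] + [keyword] + [right words].
        return list(reversed(left)) + [keyword] + right


# Toy usage: one image feature vector and one keyword index.
model = BilateralKeywordCaptioner(vocab_size=10000)
img = torch.randn(1, 2048)
keyword = torch.tensor([42])
caption_token_ids = model(img, keyword, max_len=5)
```

Under this reading, swapping the keyword while keeping the image feature fixed yields captions with different emphases for the same image, which is the user-controllable behavior the abstract describes.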