Oracle Performance for Visual Captioning

L. Yao, Nicolas Ballas, Kyunghyun Cho, John R. Smith, Yoshua Bengio
{"title":"Oracle Performance for Visual Captioning","authors":"L. Yao, Nicolas Ballas, Kyunghyun Cho, John R. Smith, Yoshua Bengio","doi":"10.5244/C.30.141","DOIUrl":null,"url":null,"abstract":"The task of associating images and videos with a natural language description has attracted a great amount of attention recently. Rapid progress has been made in terms of both developing novel algorithms and releasing new datasets. Indeed, the state-of-the-art results on some of the standard datasets have been pushed into the regime where it has become more and more difficult to make significant improvements. Instead of proposing new models, this work investigates the possibility of empirically establishing performance upper bounds on various visual captioning datasets without extra data labelling effort or human evaluation. In particular, it is assumed that visual captioning is decomposed into two steps: from visual inputs to visual concepts, and from visual concepts to natural language descriptions. One would be able to obtain an upper bound when assuming the first step is perfect and only requiring training a conditional language model for the second step. We demonstrate the construction of such bounds on MS-COCO, YouTube2Text and LSMDC (a combination of M-VAD and MPII-MD). Surprisingly, despite of the imperfect process we used for visual concept extraction in the first step and the simplicity of the language model for the second step, we show that current state-of-the-art models fall short when being compared with the learned upper bounds. Furthermore, with such a bound, we quantify several important factors concerning image and video captioning: the number of visual concepts captured by different models, the trade-off between the amount of visual elements captured and their accuracy, and the intrinsic difficulty and blessing of different datasets.","PeriodicalId":185904,"journal":{"name":"arXiv: Computer Vision and Pattern Recognition","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv: Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5244/C.30.141","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8

Abstract

The task of associating images and videos with a natural language description has attracted a great deal of attention recently. Rapid progress has been made in terms of both developing novel algorithms and releasing new datasets. Indeed, the state-of-the-art results on some of the standard datasets have been pushed into a regime where it has become more and more difficult to make significant improvements. Instead of proposing new models, this work investigates the possibility of empirically establishing performance upper bounds on various visual captioning datasets without extra data labelling effort or human evaluation. In particular, visual captioning is assumed to decompose into two steps: from visual inputs to visual concepts, and from visual concepts to natural language descriptions. An upper bound can then be obtained by assuming the first step is perfect and only training a conditional language model for the second step. We demonstrate the construction of such bounds on MS-COCO, YouTube2Text and LSMDC (a combination of M-VAD and MPII-MD). Surprisingly, despite the imperfect process we used for visual concept extraction in the first step and the simplicity of the language model in the second step, we show that current state-of-the-art models fall short when compared with the learned upper bounds. Furthermore, with such a bound, we quantify several important factors concerning image and video captioning: the number of visual concepts captured by different models, the trade-off between the amount of visual elements captured and their accuracy, and the intrinsic difficulty and blessing of different datasets.
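
The two-step oracle construction can be made concrete with a small sketch. The snippet below is a minimal illustration under assumed choices, not the authors' actual pipeline: it pretends step one is perfect by taking content words directly from the reference caption as the "visual concepts", and conditions a simple LSTM language model on a bag-of-concepts vector for step two. The names (extract_oracle_concepts, ConceptConditionedLM) and the specific architecture are hypothetical.

```python
# Hypothetical sketch of the oracle-bound idea, not the paper's exact model.
import torch
import torch.nn as nn

def extract_oracle_concepts(caption, vocab, k=10):
    """Assume step 1 is perfect: take up to k in-vocabulary words
    straight from the reference caption as the visual-concept set."""
    tokens = [w for w in caption.lower().split() if w in vocab]
    return tokens[:k]

class ConceptConditionedLM(nn.Module):
    """Step 2: an LSTM language model conditioned on a bag-of-concepts vector."""
    def __init__(self, vocab_size, emb=256, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.concept_proj = nn.Linear(vocab_size, hid)  # bag of concepts -> initial state
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def forward(self, concept_bow, caption_ids):
        # concept_bow: (B, vocab_size) multi-hot vector of oracle concepts
        # caption_ids: (B, T) reference caption tokens (teacher forcing)
        h0 = torch.tanh(self.concept_proj(concept_bow)).unsqueeze(0)
        c0 = torch.zeros_like(h0)
        emb = self.embed(caption_ids[:, :-1])
        hidden, _ = self.lstm(emb, (h0, c0))
        return self.out(hidden)  # next-token logits; train with cross-entropy
```

Training such a model on reference captions and scoring its generated samples with the standard caption metrics (BLEU, METEOR, CIDEr) is the kind of empirical upper-bound construction the abstract refers to, since the conditioning information is as accurate as the references themselves.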