A Robust Approach to Open Vocabulary Image Retrieval with Deep Convolutional Neural Networks and Transfer Learning

Vishakh Padmakumar, Rishab Ranga, Srivalya Elluru, S. Kamath S
{"title":"A Robust Approach to Open Vocabulary Image Retrieval with Deep Convolutional Neural Networks and Transfer Learning","authors":"Vishakh Padmakumar, Rishab Ranga, Srivalya Elluru, S. Kamath S","doi":"10.23919/PNC.2018.8579473","DOIUrl":null,"url":null,"abstract":"Enabling computer systems to respond to conversational human language is a challenging problem with wideranging applications in the field of robotics and human computer interaction. Specifically, in image searches, humans tend to describe objects in fine-grained detail like color or company, for which conventional retrieval algorithms have shown poor performance. In this paper, a novel approach for open vocabulary image retrieval, capable of selecting the correct candidate image from among a set of distractions given a query in natural language form, is presented. Our methodology focuses on generating a robust set of image-text projections capable of accurately representing any image, with an objective of achieving high recall. To this end, an ensemble of classifiers is trained on ImageNet for representing high-resolution objects, Cifar 100 for smaller resolution images of objects and Caltech 256 for challenging views of everyday objects, for generating category-based projections. In addition to category based projections, we also make use of an image captioning model trained on MS COCO and Google Image Search (GISS) to capture additional semantic/latent information about the candidate images. To facilitate image retrieval, the natural language query and projection results are converted to a common vector representation using word embeddings, with which query-image similarity is computed. The proposed model when benchmarked on the RefCoco dataset, achieved an accuracy of 68.8%, while retrieving semantically meaningful candidate images.","PeriodicalId":409931,"journal":{"name":"2018 Pacific Neighborhood Consortium Annual Conference and Joint Meetings (PNC)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 Pacific Neighborhood Consortium Annual Conference and Joint Meetings (PNC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.23919/PNC.2018.8579473","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Enabling computer systems to respond to conversational human language is a challenging problem with wide-ranging applications in robotics and human-computer interaction. In image search specifically, humans tend to describe objects in fine-grained detail, such as color or brand, on which conventional retrieval algorithms perform poorly. In this paper, a novel approach to open vocabulary image retrieval is presented, capable of selecting the correct candidate image from among a set of distractors given a query in natural language form. Our methodology focuses on generating a robust set of image-text projections capable of accurately representing any image, with the objective of achieving high recall. To this end, an ensemble of classifiers is trained to generate category-based projections: on ImageNet for high-resolution objects, on CIFAR-100 for lower-resolution images of objects, and on Caltech-256 for challenging views of everyday objects. In addition to the category-based projections, we also make use of an image captioning model trained on MS COCO and Google Image Search (GISS) to capture additional semantic/latent information about the candidate images. To facilitate retrieval, the natural language query and the projection results are converted to a common vector representation using word embeddings, with which query-image similarity is computed. The proposed model, when benchmarked on the RefCOCO dataset, achieved an accuracy of 68.8% while retrieving semantically meaningful candidate images.
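To make the retrieval step concrete, below is a minimal sketch of the similarity computation the abstract describes: each candidate image is reduced to text (classifier labels plus captions), both the query and the projections are embedded by averaging word vectors, and the image with the highest cosine similarity to the query is returned. The tiny three-dimensional embedding table, the embed_text/retrieve helpers, and the sample projections are purely illustrative assumptions; the abstract does not specify which pretrained word-embedding model (e.g. word2vec or GloVe) the authors used.

```python
import numpy as np

# Toy word-embedding table standing in for a pretrained embedding model
# (hypothetical values; the paper's actual embeddings are not specified here).
EMBEDDINGS = {
    "red":   np.array([0.9, 0.1, 0.0]),
    "car":   np.array([0.1, 0.8, 0.3]),
    "dog":   np.array([0.0, 0.2, 0.9]),
    "small": np.array([0.4, 0.3, 0.2]),
}

def embed_text(text: str) -> np.ndarray:
    """Average the embeddings of known words -- a simple bag-of-words
    sentence representation for both queries and image projections."""
    vecs = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity, guarding against zero vectors."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def retrieve(query: str, image_projections: dict) -> str:
    """Return the candidate image whose projected text (classifier labels
    plus captions) is most similar to the query embedding."""
    q = embed_text(query)
    return max(image_projections,
               key=lambda img: cosine(q, embed_text(image_projections[img])))

# Each candidate image is represented by the text produced by the
# classifier ensemble and the captioning model (sample data, hypothetical).
projections = {
    "img_1.jpg": "red car",
    "img_2.jpg": "small dog",
}
print(retrieve("a red car", projections))  # -> img_1.jpg
```

Because every image is projected into text first, the same similarity function handles arbitrary open-vocabulary queries: unseen words simply contribute nothing to the averaged embedding rather than breaking the pipeline.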