Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation

2014 IEEE Conference on Computer Vision and Pattern Recognition. Pub Date: 2013-11-11. DOI: 10.1109/CVPR.2014.81

Ross B. Girshick, Jeff Donahue, Trevor Darrell, J. Malik


Citation count: 22687

Abstract

Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects, and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. We also present experiments that provide insight into what the network learns, revealing a rich hierarchy of image features. Source code for the complete system is available at http://www.cs.berkeley.edu/~rbg/rcnn.
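The relative-improvement claim in the abstract can be unpacked with a little arithmetic. The sketch below is only an illustration, not part of the paper: the function names are hypothetical, and the only inputs taken from the abstract are the reported mAP of 53.3% and the "more than 30% relative" gain. It computes the largest previous-best mAP that would still be consistent with that claim.

```python
def relative_gain(new_map: float, old_map: float) -> float:
    """Relative improvement of new_map over old_map, as a fraction."""
    return (new_map - old_map) / old_map


def implied_baseline_upper_bound(new_map: float, min_relative_gain: float) -> float:
    """Largest previous-best mAP consistent with a relative gain of at
    least min_relative_gain (hypothetical helper, for illustration only)."""
    return new_map / (1.0 + min_relative_gain)


# Values from the abstract: mAP of 53.3%, relative gain of more than 30%.
reported_map = 53.3
claimed_gain = 0.30

bound = implied_baseline_upper_bound(reported_map, claimed_gain)
print(f"previous best mAP must be below about {bound:.1f}%")
print(f"absolute gain must exceed about {reported_map - bound:.1f} points")
```

Under these assumptions, a relative gain above 30% means the previous best result on VOC 2012 was below roughly 41.0% mAP, so the absolute gain is more than about 12.3 points.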

