Patrick Poirson, Phil Ammirato, Cheng-Yang Fu, Wei Liu, J. Kosecka, A. Berg
{"title":"快速单镜头检测和姿态估计","authors":"Patrick Poirson, Phil Ammirato, Cheng-Yang Fu, Wei Liu, J. Kosecka, A. Berg","doi":"10.1109/3DV.2016.78","DOIUrl":null,"url":null,"abstract":"For applications in navigation and robotics, estimating the 3D pose of objects is as important as detection. Many approaches to pose estimation rely on detecting or tracking parts or keypoints [11, 21]. In this paper we build on a recent state-of-the-art convolutional network for sliding-window detection [10] to provide detection and rough pose estimation in a single shot, without intermediate stages of detecting parts or initial bounding boxes. While not the first system to treat pose estimation as a categorization problem, this is the first attempt to combine detection and pose estimation at the same level using a deep learning approach. The key to the architecture is a deep convolutional network where scores for the presence of an object category, the offset for its location, and the approximate pose are all estimated on a regular grid of locations in the image. The resulting system is as accurate as recent work on pose estimation (42.4% 8 View mAVP on Pascal 3D+ [21] ) and significantly faster (46 frames per second (FPS) on a TITAN X GPU). This approach to detection and rough pose estimation is fast and accurate enough to be widely applied as a pre-processing step for tasks including high-accuracy pose estimation, object tracking and localization, and vSLAM.","PeriodicalId":425304,"journal":{"name":"2016 Fourth International Conference on 3D Vision (3DV)","volume":"65 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"107","resultStr":"{\"title\":\"Fast Single Shot Detection and Pose Estimation\",\"authors\":\"Patrick Poirson, Phil Ammirato, Cheng-Yang Fu, Wei Liu, J. Kosecka, A. Berg\",\"doi\":\"10.1109/3DV.2016.78\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"For applications in navigation and robotics, estimating the 3D pose of objects is as important as detection. Many approaches to pose estimation rely on detecting or tracking parts or keypoints [11, 21]. In this paper we build on a recent state-of-the-art convolutional network for sliding-window detection [10] to provide detection and rough pose estimation in a single shot, without intermediate stages of detecting parts or initial bounding boxes. While not the first system to treat pose estimation as a categorization problem, this is the first attempt to combine detection and pose estimation at the same level using a deep learning approach. The key to the architecture is a deep convolutional network where scores for the presence of an object category, the offset for its location, and the approximate pose are all estimated on a regular grid of locations in the image. The resulting system is as accurate as recent work on pose estimation (42.4% 8 View mAVP on Pascal 3D+ [21] ) and significantly faster (46 frames per second (FPS) on a TITAN X GPU). This approach to detection and rough pose estimation is fast and accurate enough to be widely applied as a pre-processing step for tasks including high-accuracy pose estimation, object tracking and localization, and vSLAM.\",\"PeriodicalId\":425304,\"journal\":{\"name\":\"2016 Fourth International Conference on 3D Vision (3DV)\",\"volume\":\"65 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-09-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"107\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 Fourth International Conference on 3D Vision (3DV)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/3DV.2016.78\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 Fourth International Conference on 3D Vision (3DV)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/3DV.2016.78","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 107
摘要
在导航和机器人应用中,估计物体的三维姿态与检测同样重要。许多姿态估计方法依赖于检测或跟踪部件或关键点[11,21]。在本文中,我们建立了一个最新的最先进的卷积网络,用于滑动窗口检测[10],在单个镜头中提供检测和粗略的姿态估计,而不需要检测部件或初始边界框的中间阶段。虽然这不是第一个将姿态估计视为分类问题的系统,但这是第一次尝试使用深度学习方法将检测和姿态估计结合在同一层次上。该架构的关键是一个深度卷积网络,其中对象类别存在的分数,其位置的偏移量和近似姿态都是在图像中位置的规则网格上估计的。由此产生的系统与最近在姿态估计上的工作一样准确(42.4% 8在Pascal 3D+[21]上查看mAVP),并且明显更快(在TITAN X GPU上每秒46帧(FPS))。这种检测和粗略姿态估计方法足够快速和准确,可以广泛应用于高精度姿态估计、目标跟踪和定位以及vSLAM等任务的预处理步骤。
For applications in navigation and robotics, estimating the 3D pose of objects is as important as detection. Many approaches to pose estimation rely on detecting or tracking parts or keypoints [11, 21]. In this paper we build on a recent state-of-the-art convolutional network for sliding-window detection [10] to provide detection and rough pose estimation in a single shot, without intermediate stages of detecting parts or initial bounding boxes. While not the first system to treat pose estimation as a categorization problem, this is the first attempt to combine detection and pose estimation at the same level using a deep learning approach. The key to the architecture is a deep convolutional network where scores for the presence of an object category, the offset for its location, and the approximate pose are all estimated on a regular grid of locations in the image. The resulting system is as accurate as recent work on pose estimation (42.4% 8 View mAVP on Pascal 3D+ [21] ) and significantly faster (46 frames per second (FPS) on a TITAN X GPU). This approach to detection and rough pose estimation is fast and accurate enough to be widely applied as a pre-processing step for tasks including high-accuracy pose estimation, object tracking and localization, and vSLAM.