3D Visual Proxemics: Recognizing Human Interactions in 3D from a Single Image

2013 IEEE Conference on Computer Vision and Pattern Recognition · Pub Date: 2013-06-23 · DOI: 10.1109/CVPR.2013.437

Ishani Chakraborty, Hui Cheng, O. Javed

Pages: 3406-3413

Citations: 21

Abstract

We present a unified framework for detecting and classifying interactions among people in unconstrained user-generated images. Unlike previous approaches that directly map people/face locations in 2D image space into features for classification, we first estimate camera viewpoint and people positions in 3D space and then extract spatial configuration features from the explicit 3D people positions. This approach has several advantages. First, it can accurately estimate relative distances and orientations between people in 3D. Second, it encodes spatial arrangements of people into a richer set of shape descriptors than is afforded in 2D. Our 3D shape descriptors are invariant to the camera pose variations often seen in web images and videos. The proposed approach also estimates camera pose and uses it to capture the intent of the photo. To achieve accurate 3D people layout estimation, we develop an algorithm that robustly fuses semantic constraints about human interpositions into a linear camera model. This enables our model to handle large variations in people's sizes, heights (e.g., due to age), and poses. An accurate 3D layout also allows us to construct features informed by proxemics that improve our semantic classification. To characterize the human interaction space, we introduce visual proxemes, a set of prototypical patterns that represent commonly occurring social interactions in events. We train a discriminative classifier that classifies 3D arrangements of people into visual proxemes and quantitatively evaluate its performance on a large, challenging dataset.
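The benefit of measuring distances between people in 3D rather than in the image plane can be illustrated with a minimal sketch. This is not the algorithm from the paper: it assumes a simple pinhole camera with a known focal length, assumes every person has the same real-world height, and labels pairwise distances with approximate, commonly cited thresholds for Hall's proxemic zones. The names `backproject`, `proxemic_zone`, `pairwise_zones`, and the constant values are all hypothetical and chosen for illustration only.

```python
import math

# Assumptions for this sketch (not taken from the paper):
ASSUMED_HEIGHT_M = 1.7      # every person is treated as about 1.7 m tall
ASSUMED_FOCAL_PX = 1000.0   # focal length of the pinhole camera, in pixels


def backproject(u_px, height_px, cx_px,
                focal_px=ASSUMED_FOCAL_PX, height_m=ASSUMED_HEIGHT_M):
    """Return the (X, Z) ground-plane position in metres of one person.

    Under a pinhole model a person of real height H at depth Z appears
    with image height h = f * H / Z, so Z = f * H / h.
    """
    z = focal_px * height_m / height_px
    x = (u_px - cx_px) * z / focal_px
    return (x, z)


def proxemic_zone(distance_m):
    """Map a distance to an approximate proxemic zone (rounded thresholds)."""
    if distance_m < 0.45:
        return "intimate"
    if distance_m < 1.2:
        return "personal"
    if distance_m < 3.6:
        return "social"
    return "public"


def pairwise_zones(detections, cx_px):
    """detections: list of (horizontal centre in px, person height in px)."""
    positions = [backproject(u, h, cx_px) for u, h in detections]
    zones = {}
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = math.dist(positions[i], positions[j])
            zones[(i, j)] = (d, proxemic_zone(d))
    return zones


if __name__ == "__main__":
    # Two people side by side, and a third standing farther from the camera.
    people = [(500, 850), (700, 850), (500, 425)]
    for pair, (dist, zone) in pairwise_zones(people, cx_px=500).items():
        print(pair, round(dist, 2), zone)
```

In this toy setup the first and third person share the same image column, so a purely 2D measure would place them at zero horizontal separation, while the back-projected positions put them about two metres apart in depth. The fixed-height assumption is exactly what breaks down for children or seated people, which is the kind of variation the abstract says the proposed model is designed to handle.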
