会说话的头:探测人类并识别他们的互动

2014 IEEE Conference on Computer Vision and Pattern Recognition Pub Date : 2014-06-23 DOI:10.1109/CVPR.2014.117

Minh Hoai, Andrew Zisserman

{"title":"会说话的头:探测人类并识别他们的互动","authors":"Minh Hoai, Andrew Zisserman","doi":"10.1109/CVPR.2014.117","DOIUrl":null,"url":null,"abstract":"The objective of this work is to accurately and efficiently detect configurations of one or more people in edited TV material. Such configurations often appear in standard arrangements due to cinematic style, and we take advantage of this to provide scene context. We make the following contributions: first, we introduce a new learnable context aware configuration model for detecting sets of people in TV material that predicts the scale and location of each upper body in the configuration, second, we show that inference of the model can be solved globally and efficiently using dynamic programming, and implement a maximum margin learning framework, and third, we show that the configuration model substantially outperforms a Deformable Part Model (DPM) for predicting upper body locations in video frames, even when the DPM is equipped with the context of other upper bodies. Experiments are performed over two datasets: the TV Human Interaction dataset, and 150 episodes from four different TV shows. We also demonstrate the benefits of the model in recognizing interactions in TV shows.","PeriodicalId":319578,"journal":{"name":"2014 IEEE Conference on Computer Vision and Pattern Recognition","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"60","resultStr":"{\"title\":\"Talking Heads: Detecting Humans and Recognizing Their Interactions\",\"authors\":\"Minh Hoai, Andrew Zisserman\",\"doi\":\"10.1109/CVPR.2014.117\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The objective of this work is to accurately and efficiently detect configurations of one or more people in edited TV material. Such configurations often appear in standard arrangements due to cinematic style, and we take advantage of this to provide scene context. We make the following contributions: first, we introduce a new learnable context aware configuration model for detecting sets of people in TV material that predicts the scale and location of each upper body in the configuration, second, we show that inference of the model can be solved globally and efficiently using dynamic programming, and implement a maximum margin learning framework, and third, we show that the configuration model substantially outperforms a Deformable Part Model (DPM) for predicting upper body locations in video frames, even when the DPM is equipped with the context of other upper bodies. Experiments are performed over two datasets: the TV Human Interaction dataset, and 150 episodes from four different TV shows. We also demonstrate the benefits of the model in recognizing interactions in TV shows.\",\"PeriodicalId\":319578,\"journal\":{\"name\":\"2014 IEEE Conference on Computer Vision and Pattern Recognition\",\"volume\":\"29 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-06-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"60\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 IEEE Conference on Computer Vision and Pattern Recognition\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CVPR.2014.117\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 IEEE Conference on Computer Vision and Pattern Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR.2014.117","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 60

摘要

这项工作的目的是准确有效地检测编辑电视材料中一个或多个人的配置。由于电影风格，这种配置经常出现在标准安排中，我们利用这一点来提供场景背景。我们的贡献如下:首先，我们引入了一个新的可学习的上下文感知配置模型，用于检测电视材料中的人集，该模型可以预测配置中每个上半身的规模和位置;其次，我们证明了该模型的推理可以使用动态规划全局有效地解决，并实现了最大边际学习框架;我们表明，配置模型在预测视频帧中的上半身位置方面大大优于可变形部分模型(DPM)，即使DPM配备了其他上半身的上下文。实验在两个数据集上进行:电视人类互动数据集和来自四个不同电视节目的150集。我们还展示了该模型在识别电视节目中的交互方面的好处。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Talking Heads: Detecting Humans and Recognizing Their Interactions

The objective of this work is to accurately and efficiently detect configurations of one or more people in edited TV material. Such configurations often appear in standard arrangements due to cinematic style, and we take advantage of this to provide scene context. We make the following contributions: first, we introduce a new learnable context aware configuration model for detecting sets of people in TV material that predicts the scale and location of each upper body in the configuration, second, we show that inference of the model can be solved globally and efficiently using dynamic programming, and implement a maximum margin learning framework, and third, we show that the configuration model substantially outperforms a Deformable Part Model (DPM) for predicting upper body locations in video frames, even when the DPM is equipped with the context of other upper bodies. Experiments are performed over two datasets: the TV Human Interaction dataset, and 150 episodes from four different TV shows. We also demonstrate the benefits of the model in recognizing interactions in TV shows.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2014 IEEE Conference on Computer Vision and Pattern Recognition

自引率

0.00%

发文量