基于多尺度时空推理的视频社会关系识别

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Pub Date : 2019-06-01 DOI:10.1109/CVPR.2019.00368

Xinchen Liu, Wu Liu, Meng Zhang, Jingwen Chen, Lianli Gao, C. Yan, Tao Mei

{"title":"基于多尺度时空推理的视频社会关系识别","authors":"Xinchen Liu, Wu Liu, Meng Zhang, Jingwen Chen, Lianli Gao, C. Yan, Tao Mei","doi":"10.1109/CVPR.2019.00368","DOIUrl":null,"url":null,"abstract":"Discovering social relations, e.g., kinship, friendship, etc., from visual contents can make machines better interpret the behaviors and emotions of human beings. Existing studies mainly focus on recognizing social relations from still images while neglecting another important media--video. On one hand, the actions and storylines in videos provide more important cues for social relation recognition. On the other hand, the key persons may appear at arbitrary spatial-temporal locations, even not in one same image from beginning to the end. To overcome these challenges, we propose a Multi-scale Spatial-Temporal Reasoning (MSTR) framework to recognize social relations from videos. For the spatial representation, we not only adopt a temporal segment network to learn global action and scene information, but also design a Triple Graphs model to capture visual relations between persons and objects. For the temporal domain, we propose a Pyramid Graph Convolutional Network to perform temporal reasoning with multi-scale receptive fields, which can obtain both long-term and short-term storylines in videos. By this means, MSTR can comprehensively explore the multi-scale actions and storylines in spatial-temporal dimensions for social relation reasoning in videos. Extensive experiments on a new large-scale Video Social Relation dataset demonstrate the effectiveness of the proposed framework.","PeriodicalId":6711,"journal":{"name":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","volume":"104 1","pages":"3561-3569"},"PeriodicalIF":0.0000,"publicationDate":"2019-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"60","resultStr":"{\"title\":\"Social Relation Recognition From Videos via Multi-Scale Spatial-Temporal Reasoning\",\"authors\":\"Xinchen Liu, Wu Liu, Meng Zhang, Jingwen Chen, Lianli Gao, C. Yan, Tao Mei\",\"doi\":\"10.1109/CVPR.2019.00368\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Discovering social relations, e.g., kinship, friendship, etc., from visual contents can make machines better interpret the behaviors and emotions of human beings. Existing studies mainly focus on recognizing social relations from still images while neglecting another important media--video. On one hand, the actions and storylines in videos provide more important cues for social relation recognition. On the other hand, the key persons may appear at arbitrary spatial-temporal locations, even not in one same image from beginning to the end. To overcome these challenges, we propose a Multi-scale Spatial-Temporal Reasoning (MSTR) framework to recognize social relations from videos. For the spatial representation, we not only adopt a temporal segment network to learn global action and scene information, but also design a Triple Graphs model to capture visual relations between persons and objects. For the temporal domain, we propose a Pyramid Graph Convolutional Network to perform temporal reasoning with multi-scale receptive fields, which can obtain both long-term and short-term storylines in videos. By this means, MSTR can comprehensively explore the multi-scale actions and storylines in spatial-temporal dimensions for social relation reasoning in videos. Extensive experiments on a new large-scale Video Social Relation dataset demonstrate the effectiveness of the proposed framework.\",\"PeriodicalId\":6711,\"journal\":{\"name\":\"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)\",\"volume\":\"104 1\",\"pages\":\"3561-3569\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"60\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CVPR.2019.00368\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CVPR.2019.00368","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 60

摘要

从视觉内容中发现社会关系，如亲情、友谊等，可以让机器更好地解读人类的行为和情感。现有的研究主要集中在从静止图像中识别社会关系，而忽略了另一个重要的媒体——视频。一方面，视频中的动作和故事情节为社会关系识别提供了更重要的线索。另一方面，关键人物可能出现在任意的时空位置，甚至从头到尾都不在同一张图像中。为了克服这些挑战，我们提出了一个多尺度时空推理(MSTR)框架来识别视频中的社会关系。在空间表示方面，我们不仅采用时间段网络来学习全局动作和场景信息，还设计了一个三重图模型来捕捉人与物体之间的视觉关系。在时间域，我们提出了一个金字塔图卷积网络来进行多尺度感受域的时间推理，可以同时获得视频中的长期和短期故事情节。通过这种方式，MSTR可以在时空维度上全面探索视频中社会关系推理的多尺度动作和故事情节。在一个新的大规模视频社会关系数据集上的大量实验证明了该框架的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Social Relation Recognition From Videos via Multi-Scale Spatial-Temporal Reasoning

Discovering social relations, e.g., kinship, friendship, etc., from visual contents can make machines better interpret the behaviors and emotions of human beings. Existing studies mainly focus on recognizing social relations from still images while neglecting another important media--video. On one hand, the actions and storylines in videos provide more important cues for social relation recognition. On the other hand, the key persons may appear at arbitrary spatial-temporal locations, even not in one same image from beginning to the end. To overcome these challenges, we propose a Multi-scale Spatial-Temporal Reasoning (MSTR) framework to recognize social relations from videos. For the spatial representation, we not only adopt a temporal segment network to learn global action and scene information, but also design a Triple Graphs model to capture visual relations between persons and objects. For the temporal domain, we propose a Pyramid Graph Convolutional Network to perform temporal reasoning with multi-scale receptive fields, which can obtain both long-term and short-term storylines in videos. By this means, MSTR can comprehensively explore the multi-scale actions and storylines in spatial-temporal dimensions for social relation reasoning in videos. Extensive experiments on a new large-scale Video Social Relation dataset demonstrate the effectiveness of the proposed framework.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

自引率

0.00%

发文量