Social Context-aware Person Search in Videos via Multi-modal Cues

Dan Li, Tong Xu, Peilun Zhou, Weidong He, Y. Hao, Yi Zheng, Enhong Chen
{"title":"Social Context-aware Person Search in Videos via Multi-modal Cues","authors":"Dan Li, Tong Xu, Peilun Zhou, Weidong He, Y. Hao, Yi Zheng, Enhong Chen","doi":"10.1145/3480967","DOIUrl":null,"url":null,"abstract":"Person search has long been treated as a crucial and challenging task to support deeper insight in personalized summarization and personality discovery. Traditional methods, e.g., person re-identification and face recognition techniques, which profile video characters based on visual information, are often limited by relatively fixed poses or small variation of viewpoints and suffer from more realistic scenes with high motion complexity (e.g., movies). At the same time, long videos such as movies often have logical story lines and are composed of continuously developmental plots. In this situation, different persons usually meet on a specific occasion, in which informative social cues are performed. We notice that these social cues could semantically profile their personality and benefit person search task in two aspects. First, persons with certain relationships usually co-occur in short intervals; in case one of them is easier to be identified, the social relation cues extracted from their co-occurrences could further benefit the identification for the harder ones. Second, social relations could reveal the association between certain scenes and characters (e.g., classmate relationship may only exist among students), which could narrow down candidates into certain persons with a specific relationship. In this way, high-level social relation cues could improve the effectiveness of person search. Along this line, in this article, we propose a social context-aware framework, which fuses visual and social contexts to profile persons in more semantic perspectives and better deal with person search task in complex scenarios. Specifically, we first segment videos into several independent scene units and abstract out social contexts within these scene units. Then, we construct inner-personal links through a graph formulation operation for each scene unit, in which both visual cues and relation cues are considered. Finally, we perform a relation-aware label propagation to identify characters’ occurrences, combining low-level semantic cues (i.e., visual cues) and high-level semantic cues (i.e., relation cues) to further enhance the accuracy. Experiments on real-world datasets validate that our solution outperforms several competitive baselines.","PeriodicalId":6934,"journal":{"name":"ACM Transactions on Information Systems (TOIS)","volume":"91 1","pages":"1 - 25"},"PeriodicalIF":0.0000,"publicationDate":"2021-11-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Information Systems (TOIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3480967","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

Person search has long been treated as a crucial and challenging task that supports deeper insight for personalized summarization and personality discovery. Traditional methods, e.g., person re-identification and face recognition techniques, which profile video characters based on visual information, are often limited to relatively fixed poses or small variations of viewpoint, and suffer in more realistic scenes with high motion complexity (e.g., movies). At the same time, long videos such as movies often have logical story lines composed of continuously developing plots. In this situation, different persons usually meet on specific occasions, in which informative social cues are expressed. We notice that these social cues could semantically profile their personalities and benefit the person search task in two aspects. First, persons with certain relationships usually co-occur within short intervals; if one of them is easier to identify, the social relation cues extracted from their co-occurrences could further benefit the identification of the harder ones. Second, social relations could reveal the association between certain scenes and characters (e.g., a classmate relationship may only exist among students), which could narrow down the candidates to persons with a specific relationship. In this way, high-level social relation cues could improve the effectiveness of person search. Along this line, in this article, we propose a social context-aware framework, which fuses visual and social contexts to profile persons from a more semantic perspective and better handle the person search task in complex scenarios. Specifically, we first segment videos into several independent scene units and abstract out the social contexts within these scene units. Then, we construct inner-personal links through a graph formulation operation for each scene unit, in which both visual cues and relation cues are considered. Finally, we perform relation-aware label propagation to identify characters' occurrences, combining low-level semantic cues (i.e., visual cues) and high-level semantic cues (i.e., relation cues) to further enhance accuracy. Experiments on real-world datasets validate that our solution outperforms several competitive baselines.
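To make the pipeline described above more concrete, the sketch below illustrates the general idea of fusing visual and relation cues into one per-scene affinity graph and then propagating identity labels from "easier" (already identified) tracks to "harder" ones. This is a minimal, hypothetical sketch, not the paper's actual formulation: the weighting scheme (`alpha`, `mu`), the standard label-propagation update rule, and all function and variable names are assumptions introduced for illustration only.

```python
# Illustrative sketch of relation-aware label propagation within one scene unit.
# All weights, names, and the propagation rule are assumptions; the paper's
# actual graph formulation and update rule may differ.
import numpy as np

def build_affinity(visual_sim, relation_sim, alpha=0.5):
    """Fuse low-level visual cues and high-level relation cues into a single
    affinity matrix for one scene unit (hypothetical linear weighting)."""
    W = alpha * visual_sim + (1.0 - alpha) * relation_sim
    np.fill_diagonal(W, 0.0)            # no self-loops
    return W

def propagate_labels(W, Y_init, labeled_mask, n_iters=50, mu=0.9):
    """Standard graph label propagation: identities of confidently recognized
    tracks spread along fused edges to the harder, unlabeled tracks."""
    d = W.sum(axis=1)
    d[d == 0] = 1.0
    S = W / np.sqrt(np.outer(d, d))     # symmetric normalization
    F = Y_init.copy()
    for _ in range(n_iters):
        F = mu * S @ F + (1.0 - mu) * Y_init
        F[labeled_mask] = Y_init[labeled_mask]   # clamp known identities
    return F.argmax(axis=1)             # predicted identity per person track

# Toy usage: 4 person tracks in one scene unit, 2 candidate identities.
visual_sim = np.array([[0.0, 0.9, 0.1, 0.2],
                       [0.9, 0.0, 0.2, 0.1],
                       [0.1, 0.2, 0.0, 0.8],
                       [0.2, 0.1, 0.8, 0.0]])
relation_sim = np.array([[0.0, 0.7, 0.0, 0.1],
                         [0.7, 0.0, 0.1, 0.0],
                         [0.0, 0.1, 0.0, 0.6],
                         [0.1, 0.0, 0.6, 0.0]])
Y = np.zeros((4, 2))
Y[0, 0] = 1.0                           # track 0 identified as person A
Y[2, 1] = 1.0                           # track 2 identified as person B
labels = propagate_labels(build_affinity(visual_sim, relation_sim), Y,
                          labeled_mask=np.array([True, False, True, False]))
print(labels)                           # e.g., [0 0 1 1]: labels reach harder tracks
```

The toy example mirrors the abstract's intuition: because related characters co-occur and are strongly linked in the fused graph, the identity of an easily recognized track flows to the co-occurring, harder-to-recognize one.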