Title: A Multi-Granularity Relation Graph Aggregation Framework With Multimodal Clues for Social Relation Reasoning
Authors: Cong Xu; Feiyu Chen; Qi Jia; Yihua Wang; Liang Jin; Yunji Li; Yaqian Zhao; Changming Zhao
Journal: IEEE Transactions on Multimedia, vol. 27, pp. 4961-4970 (Q1, Computer Science, Information Systems)
DOI: 10.1109/TMM.2025.3543054
Published: 2025-03-05
URL: https://ieeexplore.ieee.org/document/10912770/
Citations: 0
Abstract
Social relations are a fundamental attribute of human life. The ability of humans to form large organizations and institutions stems directly from our complex social networks. Understanding social relationships in the context of multimedia is therefore crucial for building domain-specific or general artificial intelligence systems. The key to reasoning about social relations lies in understanding the interactions between individuals through multimodal representations such as actions and utterances. However, due to video editing techniques and varied narrative sequences, two individuals who share a social relationship may never appear together in the same frame or clip. Additionally, social relations may manifest at different levels of granularity in how videos express them. Previous research has not effectively addressed these challenges. This paper therefore proposes a Multi-Granularity Relation Graph Aggregation Framework (MGRG) to enhance social relation reasoning in multimedia content such as video. Unlike existing methods, ours adopts the paradigm of jointly inferring relations by constructing a social relation graph. We design a hierarchical multimodal relation graph that models the exchange of information among individuals' roles, capturing their complex interactions at multiple levels of granularity, from fine to coarse. Within MGRG, we propose two aggregation modules that cluster multimodal features in each granularity layer of the relation graph, accounting for temporal structure and feature importance. Experimental results show that our method generates a logical, coherent social relation graph and improves accuracy.
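To make the aggregation idea concrete, the following is a minimal illustrative sketch, not the authors' MGRG implementation (which is not reproduced here): for one pair of characters, per-frame (fine-grained) and per-scene (coarse-grained) multimodal features are pooled with importance weights and fused into a single edge embedding of a relation graph. All names, shapes, and the softmax-attention pooling are assumptions made for exposition.

```python
import numpy as np

def attention_pool(features, query):
    """Importance-weighted pooling: features (T, D), query (D,) -> (D,).
    Softmax over dot-product scores plays the role of an importance weighting."""
    scores = features @ query
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ features

def aggregate_pair(fine_feats, coarse_feats, query):
    """Fuse fine-grained (e.g. per-frame) and coarse-grained (e.g. per-scene)
    evidence for one character pair into a single edge embedding."""
    fine = attention_pool(fine_feats, query)      # fine-granularity layer
    coarse = attention_pool(coarse_feats, query)  # coarse-granularity layer
    return (fine + coarse) / 2.0                  # simple cross-layer fusion

# Toy usage: 10 frame-level and 3 scene-level feature vectors of dimension 8.
rng = np.random.default_rng(0)
D = 8
query = rng.normal(size=D)
edge = aggregate_pair(rng.normal(size=(10, D)), rng.normal(size=(3, D)), query)
print(edge.shape)  # (8,)
```

In the paper's setting such edge embeddings would feed a classifier over social relation types; here the fusion is a plain average purely to keep the sketch short.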
About the journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.