Title: A Multi-Granularity Relation Graph Aggregation Framework With Multimodal Clues for Social Relation Reasoning
Authors: Cong Xu; Feiyu Chen; Qi Jia; Yihua Wang; Liang Jin; Yunji Li; Yaqian Zhao; Changming Zhao
Journal: IEEE Transactions on Multimedia, vol. 27, pp. 4961-4970 (Q1, Computer Science, Information Systems)
DOI: 10.1109/TMM.2025.3543054
Published: 2025-03-05
URL: https://ieeexplore.ieee.org/document/10912770/
Citations: 0
Abstract
Social relations are a fundamental attribute of human life. The ability of humans to form large organizations and institutions stems directly from our complex social networks. Understanding social relationships in the context of multimedia is therefore crucial for building domain-specific or general artificial intelligence systems. The key to reasoning about social relations lies in understanding the interactions between individuals through multimodal representations such as actions and utterances. However, due to video editing techniques and varied narrative sequences, two individuals who share a social relationship may never appear together in the same frame or clip. Additionally, social relations may manifest at different levels of granularity in how videos express them. Previous research has not effectively addressed these challenges. This paper therefore proposes a Multi-Granularity Relation Graph Aggregation Framework (MGRG) to enhance social relation reasoning in multimedia content such as video. Unlike existing methods, ours adopts the paradigm of jointly inferring relations by constructing a social relation graph. We design a hierarchical multimodal relation graph that models the exchange of information among individuals' roles, capturing their complex interactions at multiple levels of granularity, from fine to coarse. Within MGRG, we propose two aggregation modules that cluster multimodal features in each granularity layer of the relation graph, accounting for temporal structure and feature importance. Experimental results show that our method generates a logical, coherent social relation graph and improves accuracy.
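To make the aggregation idea concrete, the following is a minimal illustrative sketch, not the authors' MGRG implementation (which is not reproduced here): for one pair of characters, per-frame (fine-grained) and per-scene (coarse-grained) multimodal features are pooled with importance weights and fused into a single edge embedding of a relation graph. All names, shapes, and the softmax-attention pooling are assumptions made for exposition.

```python
import numpy as np

def attention_pool(features, query):
    """Importance-weighted pooling: features (T, D), query (D,) -> (D,).
    Softmax over dot-product scores plays the role of an importance weighting."""
    scores = features @ query
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    return weights @ features

def aggregate_pair(fine_feats, coarse_feats, query):
    """Fuse fine-grained (e.g. per-frame) and coarse-grained (e.g. per-scene)
    evidence for one character pair into a single edge embedding."""
    fine = attention_pool(fine_feats, query)      # fine-granularity layer
    coarse = attention_pool(coarse_feats, query)  # coarse-granularity layer
    return (fine + coarse) / 2.0                  # simple cross-layer fusion

# Toy usage: 10 frame-level and 3 scene-level feature vectors of dimension 8.
rng = np.random.default_rng(0)
D = 8
query = rng.normal(size=D)
edge = aggregate_pair(rng.normal(size=(10, D)), rng.normal(size=(3, D)), query)
print(edge.shape)  # (8,)
```

In the paper's setting such edge embeddings would feed a classifier over social relation types; here the fusion is a plain average purely to keep the sketch short.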
About the journal:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.