Ruomei Wang;Yuanmao Luo;Fuwei Zhang;Mingyang Liu;Xiaonan Luo
{"title":"HSSHG:用于视频质量检测的启发式语义约束时空异构图","authors":"Ruomei Wang;Yuanmao Luo;Fuwei Zhang;Mingyang Liu;Xiaonan Luo","doi":"10.1109/TMM.2024.3443661","DOIUrl":null,"url":null,"abstract":"Video question answering is a challenging task that requires models to recognize visual information in videos and perform spatio-temporal reasoning. Current models increasingly focus on enabling objects spatio-temporal reasoning via graph neural networks. However, the existing graph network-based models still have deficiencies when constructing the spatio-temporal relationship between objects: (1) The lack of consideration of the spatio-temporal constraints between objects when defining the adjacency relationship; (2) The semantic correlation between objects is not fully considered when generating edge weights. These make the model lack representation of spatio-temporal interaction between objects, which directly affects the ability of object relation reasoning. To solve the above problems, this paper designs a heuristic semantics-constrained spatio-temporal heterogeneous graph, employing a semantic consistency-aware strategy to construct the spatio-temporal interaction between objects. The spatio-temporal relationship between objects is constrained by the object co-occurrence relationship and the object consistency. The plot summaries and object locations are used as heuristic semantic priors to constrain the weights of spatial and temporal edges. The spatio-temporal heterogeneity graph more accurately restores the spatio-temporal relationship between objects and strengthens the model's object spatio-temporal reasoning ability. Based on the spatio-temporal heterogeneous graph, this paper proposes Heuristic Semantics-constrained Spatio-temporal Heterogeneous Graph for VideoQA (HSSHG), which achieves state-of-the-art performance on benchmark MSVD-QA and FrameQA datasets, and demonstrates competitive results on benchmark MSRVTT-QA and ActivityNet-QA dataset. Extensive ablation experiments verify the effectiveness of each component in the network and the rationality of hyperparameter settings, and qualitative analysis verifies the object-level spatio-temporal reasoning ability of HSSHG.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11176-11190"},"PeriodicalIF":8.4000,"publicationDate":"2024-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"HSSHG: Heuristic Semantics-Constrained Spatio-Temporal Heterogeneous Graph for VideoQA\",\"authors\":\"Ruomei Wang;Yuanmao Luo;Fuwei Zhang;Mingyang Liu;Xiaonan Luo\",\"doi\":\"10.1109/TMM.2024.3443661\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video question answering is a challenging task that requires models to recognize visual information in videos and perform spatio-temporal reasoning. Current models increasingly focus on enabling objects spatio-temporal reasoning via graph neural networks. However, the existing graph network-based models still have deficiencies when constructing the spatio-temporal relationship between objects: (1) The lack of consideration of the spatio-temporal constraints between objects when defining the adjacency relationship; (2) The semantic correlation between objects is not fully considered when generating edge weights. These make the model lack representation of spatio-temporal interaction between objects, which directly affects the ability of object relation reasoning. To solve the above problems, this paper designs a heuristic semantics-constrained spatio-temporal heterogeneous graph, employing a semantic consistency-aware strategy to construct the spatio-temporal interaction between objects. The spatio-temporal relationship between objects is constrained by the object co-occurrence relationship and the object consistency. The plot summaries and object locations are used as heuristic semantic priors to constrain the weights of spatial and temporal edges. The spatio-temporal heterogeneity graph more accurately restores the spatio-temporal relationship between objects and strengthens the model's object spatio-temporal reasoning ability. Based on the spatio-temporal heterogeneous graph, this paper proposes Heuristic Semantics-constrained Spatio-temporal Heterogeneous Graph for VideoQA (HSSHG), which achieves state-of-the-art performance on benchmark MSVD-QA and FrameQA datasets, and demonstrates competitive results on benchmark MSRVTT-QA and ActivityNet-QA dataset. Extensive ablation experiments verify the effectiveness of each component in the network and the rationality of hyperparameter settings, and qualitative analysis verifies the object-level spatio-temporal reasoning ability of HSSHG.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"26 \",\"pages\":\"11176-11190\"},\"PeriodicalIF\":8.4000,\"publicationDate\":\"2024-08-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10636771/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10636771/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
摘要
视频问题解答是一项具有挑战性的任务,需要模型识别视频中的视觉信息并进行时空推理。目前的模型越来越注重通过图神经网络实现物体的时空推理。然而,现有的基于图网络的模型在构建对象间的时空关系时仍存在不足:(1)在定义邻接关系时没有考虑对象间的时空约束;(2)在生成边权重时没有充分考虑对象间的语义相关性。这些都使得模型缺乏对象间时空交互的表征,直接影响了对象关系推理的能力。为解决上述问题,本文设计了一种启发式语义约束时空异质图,采用语义一致性感知策略来构建对象间的时空交互关系。对象间的时空关系受对象共现关系和对象一致性的约束。情节摘要和对象位置被用作启发式语义先验,以限制空间和时间边缘的权重。时空异质性图更准确地还原了对象之间的时空关系,增强了模型的对象时空推理能力。基于时空异构图,本文提出了启发式语义约束时空异构图(Heuristic Semantics-constrained Spatio-temporal Heterogeneous Graph for VideoQA,HSSHG),它在基准MSVD-QA和FrameQA数据集上取得了最先进的性能,并在基准MSRVTT-QA和ActivityNet-QA数据集上展示了有竞争力的结果。广泛的消融实验验证了网络中每个组件的有效性和超参数设置的合理性,定性分析验证了 HSSHG 的对象级时空推理能力。
HSSHG: Heuristic Semantics-Constrained Spatio-Temporal Heterogeneous Graph for VideoQA
Video question answering is a challenging task that requires models to recognize visual information in videos and perform spatio-temporal reasoning. Current models increasingly focus on enabling objects spatio-temporal reasoning via graph neural networks. However, the existing graph network-based models still have deficiencies when constructing the spatio-temporal relationship between objects: (1) The lack of consideration of the spatio-temporal constraints between objects when defining the adjacency relationship; (2) The semantic correlation between objects is not fully considered when generating edge weights. These make the model lack representation of spatio-temporal interaction between objects, which directly affects the ability of object relation reasoning. To solve the above problems, this paper designs a heuristic semantics-constrained spatio-temporal heterogeneous graph, employing a semantic consistency-aware strategy to construct the spatio-temporal interaction between objects. The spatio-temporal relationship between objects is constrained by the object co-occurrence relationship and the object consistency. The plot summaries and object locations are used as heuristic semantic priors to constrain the weights of spatial and temporal edges. The spatio-temporal heterogeneity graph more accurately restores the spatio-temporal relationship between objects and strengthens the model's object spatio-temporal reasoning ability. Based on the spatio-temporal heterogeneous graph, this paper proposes Heuristic Semantics-constrained Spatio-temporal Heterogeneous Graph for VideoQA (HSSHG), which achieves state-of-the-art performance on benchmark MSVD-QA and FrameQA datasets, and demonstrates competitive results on benchmark MSRVTT-QA and ActivityNet-QA dataset. Extensive ablation experiments verify the effectiveness of each component in the network and the rationality of hyperparameter settings, and qualitative analysis verifies the object-level spatio-temporal reasoning ability of HSSHG.
期刊介绍:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.