Spatial-temporal graph-guided global attention network for video-based person re-identification

IF 2.4 | CAS Tier 4 (Computer Science) | JCR Q3 (Computer Science, Artificial Intelligence)
Xiaobao Li, Wen Wang, Qingyong Li, Jiang Zhang
{"title":"Spatial-temporal graph-guided global attention network for video-based person re-identification","authors":"Xiaobao Li, Wen Wang, Qingyong Li, Jiang Zhang","doi":"10.1007/s00138-023-01489-w","DOIUrl":null,"url":null,"abstract":"<p>Global attention learning has been extensively applied in video-based person re-identification due to its superiority in capturing contextual correlations. However, existing global attention learning methods usually adopt the conventional neural network to model non-Euclidean contextual correlations, resulting in a limited representation ability. Inspired by the graph-structure property of the contextual correlations, we propose a spatial-temporal graph-guided global attention network (STG<span>\\(^3\\)</span>A) for video-based person re-identification. STG<span>\\(^3\\)</span>A comprises two graph-guided attention modules to capture the spatial contexts within a frame and temporal contexts across all frames in a sequence for global attention learning. Furthermore, the graphs from both modules are encoded as graph representations, which combine with weighted representations to grasp the spatial-temporal contextual information adequately for video feature learning. To reduce the effect of noisy graph nodes and learn robust graph representations, a graph node attention is developed to trade-off the importance of each graph node, leading to noise-tolerant graph models. Finally, we design a graph-guided fusion scheme to integrate the representations output by these two attentive modules for a more compact video feature. Extensive experiments on MARS and DukeMTMCVideoReID datasets demonstrate the superior performance of the STG<span>\\(^3\\)</span>A.</p>","PeriodicalId":51116,"journal":{"name":"Machine Vision and Applications","volume":"55 1","pages":""},"PeriodicalIF":2.4000,"publicationDate":"2023-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Machine Vision and Applications","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00138-023-01489-w","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Global attention learning has been widely applied to video-based person re-identification because of its strength in capturing contextual correlations. However, existing global attention learning methods usually adopt conventional neural networks to model non-Euclidean contextual correlations, which limits their representation ability. Inspired by the graph-structured nature of these contextual correlations, we propose a spatial-temporal graph-guided global attention network (STG³A) for video-based person re-identification. STG³A comprises two graph-guided attention modules that capture the spatial contexts within a frame and the temporal contexts across all frames of a sequence for global attention learning. Furthermore, the graphs from both modules are encoded as graph representations, which are combined with the attention-weighted representations to fully capture spatial-temporal contextual information for video feature learning. To reduce the effect of noisy graph nodes and learn robust graph representations, a graph node attention mechanism is developed to trade off the importance of each graph node, yielding noise-tolerant graph models. Finally, we design a graph-guided fusion scheme that integrates the representations output by the two attention modules into a more compact video feature. Extensive experiments on the MARS and DukeMTMC-VideoReID datasets demonstrate the superior performance of STG³A.
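To make the architecture described above concrete, the sketch below shows one plausible way to structure a graph-guided attention module with graph node attention in PyTorch, plus a gated fusion of the spatial and temporal graph representations. All names, dimensions, and design choices here (the soft adjacency built from pairwise affinities, the single message-passing step, the sigmoid attention, the learned fusion gate) are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal, self-contained sketch of the abstract's ideas (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphGuidedAttention(nn.Module):
    """One graph-guided attention module over N nodes: spatial regions of a
    frame for the spatial module, or frames of a sequence for the temporal
    module. Returns attention-weighted node features and a pooled graph
    representation weighted by graph node attention."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj_q = nn.Linear(dim, dim)
        self.proj_k = nn.Linear(dim, dim)
        self.gcn = nn.Linear(dim, dim)       # one graph-propagation step
        self.to_attn = nn.Linear(dim, 1)     # global attention logits
        self.node_attn = nn.Linear(dim, 1)   # graph node attention logits

    def forward(self, x):
        # x: (B, N, C) node features.
        # Soft adjacency from scaled pairwise affinities, modeling the
        # non-Euclidean contextual correlations as a graph.
        adj = torch.softmax(
            self.proj_q(x) @ self.proj_k(x).transpose(1, 2) / x.size(-1) ** 0.5,
            dim=-1)                                     # (B, N, N)
        h = F.relu(self.gcn(adj @ x))                   # message passing, (B, N, C)
        weighted = torch.sigmoid(self.to_attn(h)) * x   # attention-weighted features
        # Graph node attention down-weights noisy nodes before pooling.
        node_w = torch.softmax(self.node_attn(h), dim=1)  # (B, N, 1)
        graph_repr = (node_w * h).sum(dim=1)            # (B, C)
        return weighted, graph_repr


# Usage sketch: a spatial module over H*W regions per frame, a temporal
# module over T frames, and a learned gate standing in for the paper's
# graph-guided fusion scheme.
B, T, C, H, W = 2, 8, 256, 8, 4
feats = torch.randn(B, T, C, H, W)

spatial = GraphGuidedAttention(C)
temporal = GraphGuidedAttention(C)
gate = nn.Linear(2 * C, C)

# Spatial: each frame's H*W locations are graph nodes.
s_in = feats.permute(0, 1, 3, 4, 2).reshape(B * T, H * W, C)
_, s_graph = spatial(s_in)
s_graph = s_graph.view(B, T, C).mean(dim=1)             # pool over frames, (B, C)

# Temporal: the T frame descriptors are graph nodes.
t_in = feats.mean(dim=(3, 4))                           # (B, T, C)
_, t_graph = temporal(t_in)

# Gated fusion of the two graph representations into one video feature.
g = torch.sigmoid(gate(torch.cat([s_graph, t_graph], dim=-1)))
video_feat = g * s_graph + (1 - g) * t_graph            # (B, C)
print(video_feat.shape)                                 # torch.Size([2, 256])
```

The gate simply blends the spatial and temporal graph representations; the actual fusion scheme may also incorporate the attention-weighted features, which the abstract does not specify.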

Source journal: Machine Vision and Applications (Engineering & Technology; Engineering: Electrical & Electronic)
CiteScore: 6.30
Self-citation rate: 3.00%
Articles per year: 84
Review time: 8.7 months
About the journal: Machine Vision and Applications publishes high-quality technical contributions in machine vision research and development. Specifically, the editors encourage submissions in all applications and engineering aspects of image-related computing. In particular, original contributions dealing with scientific, commercial, industrial, military, and biomedical applications of machine vision are all within the scope of the journal. Particular emphasis is placed on engineering and technology aspects of image processing and computer vision. The following aspects of machine vision applications are of interest: algorithms, architectures, VLSI implementations, AI techniques and expert systems for machine vision, front-end sensing, multidimensional and multisensor machine vision, real-time techniques, image databases, virtual reality and visualization. Papers must include a significant experimental validation component.