Pose-Skeleton Guided Cross-Attention Representation Fusion for Occluded Pedestrian Re-Identification

IF 11.1 · Zone 1 (Engineering & Technology) · Q1 ENGINEERING, ELECTRICAL & ELECTRONIC

IEEE Transactions on Circuits and Systems for Video Technology · Pub Date: 2025-03-31 · DOI: 10.1109/TCSVT.2025.3556250

Shuze Geng; Yifan Liu; Zijin Wang; Gang Yan; Yang Yu; Yingchun Guo

Volume 35, Issue 9, pp. 8598-8613 · Journal Article · Full text: https://ieeexplore.ieee.org/document/10945980/


Abstract

Most methods address occluded pedestrian Re-Identification (Re-ID) by applying external auxiliary models at the feature output stage of the backbone network to locate visible appearance areas. However, these approaches suffer from occlusion information diffusion and from imprecise masks generated by the external models, which indicates that decoupling pedestrian features from occlusion information needs further exploration. To address these challenges, we propose the Pose-Skeleton guided Cross-attention Representation fusion (PSCR) method. First, we introduce the Visible Appearance Region Attention (VARA) model, which uses pose information to guide the backbone network in distinguishing occlusion information from pedestrian features at the intermediate layer. Through a suppression strategy, the model suppresses occlusion interference and alleviates the diffusion of occlusion information. Next, to localize pedestrian-specific semantic regions precisely, we propose Skeletal Area Modeling (SAM). Using mathematical modeling together with human keypoint confidence, this module generates fine-grained masks for local skeleton regions and extracts a comprehensive set of local features. Finally, under the constraints imposed by spatial attention masks, a cross-attention mechanism fuses the features obtained in the previous two steps with local features, producing enhanced local features that integrate aligned high-level semantic information. Extensive experiments show that the proposed algorithm improves notably over existing methods.
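The abstract does not give the formulas behind SAM or the mask-constrained fusion, so the following is only an illustrative NumPy sketch of the general idea, not the PSCR implementation. It assumes a soft Gaussian band around the segment joining two keypoints, scaled by the lower of the two keypoint confidences, and a single-head cross-attention in which each local query may attend only to positions inside its own mask. All function names, the `sigma`, `conf_thresh`, and `mask_thresh` parameters, and the toy shapes are hypothetical.

```python
import numpy as np

def limb_mask(p, q, conf_p, conf_q, height, width, sigma=1.5, conf_thresh=0.3):
    """Soft mask around segment p->q; keypoints are (x, y) on the feature grid.
    Assumption: a limb whose weaker keypoint is below conf_thresh is treated as occluded."""
    conf = min(conf_p, conf_q)
    if conf < conf_thresh:
        return np.zeros((height, width))
    ys, xs = np.mgrid[0:height, 0:width]
    pts = np.stack([xs, ys], axis=-1).astype(float)      # (H, W, 2)
    p, q = np.asarray(p, float), np.asarray(q, float)
    d = q - p
    # Project every grid point onto the segment and clamp to its endpoints.
    t = np.clip(((pts - p) @ d) / max(float(d @ d), 1e-8), 0.0, 1.0)
    nearest = p + t[..., None] * d
    dist2 = ((pts - nearest) ** 2).sum(axis=-1)
    return conf * np.exp(-dist2 / (2.0 * sigma ** 2))

def masked_pool(feat, mask):
    """Mask-weighted average of a (C, H, W) feature map; zeros for an empty mask."""
    total = mask.sum()
    if total <= 1e-8:
        return np.zeros(feat.shape[0])
    return (feat * mask).sum(axis=(1, 2)) / total

def masked_cross_attention(queries, feat, masks, mask_thresh=0.1):
    """queries: (R, C), feat: (C, H, W), masks: (R, H, W).
    Each query attends only to positions where its mask exceeds mask_thresh."""
    channels = feat.shape[0]
    keys = feat.reshape(channels, -1).T                  # (H*W, C), also used as values
    allowed = masks.reshape(masks.shape[0], -1) > mask_thresh
    has_region = allowed.any(axis=1)
    scores = queries @ keys.T / np.sqrt(channels)
    scores = np.where(allowed, scores, -np.inf)
    scores[~has_region] = 0.0                            # avoid all -inf rows
    scores = scores - scores.max(axis=1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=1, keepdims=True)
    attn[~has_region] = 0.0                              # occluded regions contribute nothing
    return attn @ keys, attn

# Toy usage: one visible limb and one low-confidence (treated as occluded) limb.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 8))                   # (C, H, W)
masks = np.stack([
    limb_mask((2, 3), (2, 8), 0.9, 0.8, 16, 8),
    limb_mask((5, 3), (5, 8), 0.9, 0.1, 16, 8),
])
queries = np.stack([masked_pool(feat, m) for m in masks])
fused, attn = masked_cross_attention(queries, feat, masks)
```

In this sketch a low-confidence limb yields an all-zero mask, so its pooled feature and its fused feature are both zero vectors; a real system would more likely learn the projections and handle missing regions in the loss or the matching stage.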


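Source journal

IEEE Transactions on Circuits and Systems for Video Technology (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 13.80
Self-citation rate: 27.40%
Articles published: 660
Review time: 5 months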

Journal description: The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.