Joint Convolutional and Self-Attention Network for Occluded Person Re-Identification

Chuxia Yang, Wanshu Fan, D. Zhou, Qiang Zhang
{"title":"Joint Convolutional and Self-Attention Network for Occluded Person Re-Identification","authors":"Chuxia Yang, Wanshu Fan, D. Zhou, Qiang Zhang","doi":"10.1109/MSN57253.2022.00123","DOIUrl":null,"url":null,"abstract":"Occluded person Re-Identification (Re-ID) is built on cross views, which aims to retrieve a target person in occlusion scenes. Under the condition that occlusion leads to the interference of other objects and the loss of personal information, the efficient extraction of personal feature representation is crucial to the recognition accuracy of the system. Most of the existing methods solve this problem by designing various deep networks, which are called convolutional neural networks (CNN)-based methods. Although these methods have the powerful ability to mine local features, they may fail to capture features containing global information due to the limitation of the gaussian distribution property of convolution operation. Recently, methods based on Vision Transformer (ViT) have been successfully employed to person Re-ID task and achieved good performance. However, since ViT-based methods lack the capability of extracting local information from person images, the generated results may severely lose local details. To address these deficiencies, we design a convolution and self-attention aggregation network (CSNet) by combining the advantages of both CNN and ViT. The proposed CSNet consists of three parts. First, to better capture personal information, we adopt Dual-Branch Encoder (DBE) to encode person images. Then, we also embed a Local Information Aggregation Module (LIAM) in the feature map, which effectively leverages the useful information in the local feature map. Finally, a Multi-Head Global-to-Local Attention (MHGLA) module is designed to transmit global information to local features. Experimental results demonstrate the superiority of the proposed method compared with the state-of-the-art (SOTA) methods on both the occluded person Re-ID datasets and the holistic person Re-ID datasets.","PeriodicalId":114459,"journal":{"name":"2022 18th International Conference on Mobility, Sensing and Networking (MSN)","volume":"164 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 18th International Conference on Mobility, Sensing and Networking (MSN)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/MSN57253.2022.00123","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Occluded person re-identification (Re-ID) aims to retrieve a target person across camera views in scenes with occlusion. Because occlusion introduces interference from other objects and causes the loss of personal information, efficiently extracting discriminative person feature representations is crucial to the recognition accuracy of the system. Most existing methods address this problem by designing various deep networks, commonly referred to as convolutional neural network (CNN)-based methods. Although such methods are powerful at mining local features, they may fail to capture features containing global information owing to the Gaussian distribution property of the convolution operation. Recently, methods based on the Vision Transformer (ViT) have been successfully applied to the person Re-ID task and achieved good performance. However, since ViT-based methods lack the ability to extract local information from person images, their results may severely lose local details. To address these deficiencies, we design a convolution and self-attention aggregation network (CSNet) that combines the advantages of both CNNs and ViT. The proposed CSNet consists of three parts. First, to better capture personal information, we adopt a Dual-Branch Encoder (DBE) to encode person images. Second, we embed a Local Information Aggregation Module (LIAM) in the feature map, which effectively leverages the useful information in the local feature map. Finally, a Multi-Head Global-to-Local Attention (MHGLA) module is designed to transmit global information to local features. Experimental results demonstrate the superiority of the proposed method over state-of-the-art (SOTA) methods on both occluded and holistic person Re-ID datasets.
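The paper itself provides no code, but the abstract outlines a concrete architecture: a dual-branch encoder (CNN for local detail, self-attention for global context) fused by a global-to-local attention module. The following is a minimal PyTorch sketch of that general idea only; the module names (CNNBranch, GlobalToLocalAttention, CSNetSketch) and all hyperparameters are assumptions made for illustration, not the authors' DBE/LIAM/MHGLA implementation.

```python
# Illustrative sketch of a hybrid CNN + self-attention Re-ID encoder.
# NOT the authors' CSNet: all names and sizes are assumptions.
import torch
import torch.nn as nn


class CNNBranch(nn.Module):
    """Convolutional branch: extracts a local feature map from the image."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):  # (B, 3, H, W) -> (B, dim, H/8, W/8)
        return self.net(x)


class GlobalToLocalAttention(nn.Module):
    """Multi-head cross-attention: local features act as queries over
    global tokens, injecting global context into each spatial location."""
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_feat, global_tokens):
        # local_feat: (B, dim, h, w); global_tokens: (B, N, dim)
        B, C, h, w = local_feat.shape
        q = local_feat.flatten(2).transpose(1, 2)         # (B, h*w, dim)
        out, _ = self.attn(q, global_tokens, global_tokens)
        out = self.norm(q + out)                          # residual + norm
        return out.transpose(1, 2).reshape(B, C, h, w)


class CSNetSketch(nn.Module):
    """Dual-branch encoder: CNN for local detail, Transformer encoder for
    global context, fused by global-to-local cross-attention."""
    def __init__(self, dim: int = 256, tokens: int = 64):
        super().__init__()
        self.cnn = CNNBranch(dim)
        self.tokens = nn.Parameter(torch.randn(1, tokens, dim))
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc, num_layers=2)
        self.fuse = GlobalToLocalAttention(dim)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        local_feat = self.cnn(x)                          # local branch
        # Seed learnable global tokens with pooled image statistics,
        # then let them self-attend to form a global representation.
        seed = self.pool(local_feat).flatten(1).unsqueeze(1)
        g = self.transformer(self.tokens.expand(x.size(0), -1, -1) + seed)
        fused = self.fuse(local_feat, g)                  # global -> local
        return self.pool(fused).flatten(1)                # ID embedding


if __name__ == "__main__":
    model = CSNetSketch()
    emb = model(torch.randn(2, 3, 256, 128))  # typical Re-ID input size
    print(emb.shape)                          # torch.Size([2, 256])
```

The key design point mirrored from the abstract is the direction of the cross-attention: local features query global tokens (global-to-local), so occluded or ambiguous local regions can be disambiguated by global context rather than the reverse.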