{"title":"车辆再识别的局部引导全局协同学习转换器","authors":"Yanling Shi, Xiaofei Zhang, X. Tan","doi":"10.1109/ICTAI56018.2022.00122","DOIUrl":null,"url":null,"abstract":"Vehicle reidentification(ReID) has attracted much attention and is significant for traffic security surveillance. Due to the variety of views of the same vehicle captured by different camera and the great similarity in the visual appearance of different vehicles, it is necessary to explore how to effectively utilize local detail information to achieve collaborative perception to highlight discriminative appearance features. Different from existing local feature exploration methods that focus on using extra part or keypoint information, we propose a global collaborative learning Transformer guided by local abstract features, named LG-CoT, which aims to highlight the highest-attention regions of vehicle images. We adopt Vision Transformer(ViT) as our backbone to extract global features and obtain all local tokens. To reduce the distribution from the background and drive the network to focus more on details, all attention maps containing low-level texture information and high-level semantic information are multiplied to obtain the local regions with highest-attention. Finally, we design a local-attention-guided pose-optimization feature encoding module, which can help the global features focus on local regions adaptively. Extensive experiments on two popular datasets and a dataset we built in a T-junction traffic scene suggest that our method can achieve comparable performance.","PeriodicalId":354314,"journal":{"name":"2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Local-guided Global Collaborative Learning Transformer for Vehicle Reidentification\",\"authors\":\"Yanling Shi, Xiaofei Zhang, X. Tan\",\"doi\":\"10.1109/ICTAI56018.2022.00122\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Vehicle reidentification(ReID) has attracted much attention and is significant for traffic security surveillance. Due to the variety of views of the same vehicle captured by different camera and the great similarity in the visual appearance of different vehicles, it is necessary to explore how to effectively utilize local detail information to achieve collaborative perception to highlight discriminative appearance features. Different from existing local feature exploration methods that focus on using extra part or keypoint information, we propose a global collaborative learning Transformer guided by local abstract features, named LG-CoT, which aims to highlight the highest-attention regions of vehicle images. We adopt Vision Transformer(ViT) as our backbone to extract global features and obtain all local tokens. To reduce the distribution from the background and drive the network to focus more on details, all attention maps containing low-level texture information and high-level semantic information are multiplied to obtain the local regions with highest-attention. Finally, we design a local-attention-guided pose-optimization feature encoding module, which can help the global features focus on local regions adaptively. 
Extensive experiments on two popular datasets and a dataset we built in a T-junction traffic scene suggest that our method can achieve comparable performance.\",\"PeriodicalId\":354314,\"journal\":{\"name\":\"2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI)\",\"volume\":\"31 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICTAI56018.2022.00122\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICTAI56018.2022.00122","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Local-guided Global Collaborative Learning Transformer for Vehicle Reidentification
Vehicle re-identification (ReID) has attracted much attention and is important for traffic security surveillance. Because the same vehicle appears under widely varying views across different cameras while different vehicles can look visually very similar, it is necessary to explore how to effectively exploit local detail information for collaborative perception that highlights discriminative appearance features. Unlike existing local-feature methods that rely on extra part or keypoint information, we propose a global collaborative learning Transformer guided by local abstract features, named LG-CoT, which aims to highlight the highest-attention regions of vehicle images. We adopt a Vision Transformer (ViT) as the backbone to extract global features and obtain all local tokens. To suppress interference from the background and drive the network to focus on details, the attention maps, which carry both low-level texture information and high-level semantic information, are multiplied together to locate the local regions with the highest attention. Finally, we design a local-attention-guided pose-optimization feature encoding module that helps the global features focus on local regions adaptively. Extensive experiments on two popular datasets and on a dataset we built for a T-junction traffic scene suggest that our method achieves comparable performance.
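The paper itself provides no code; the following is a minimal sketch of the attention-aggregation idea described in the abstract, assuming a ViT backbone that exposes its per-layer attention maps. The function names (aggregate_attention, top_attention_mask), the head-averaging step, and the keep_ratio parameter are illustrative assumptions in the spirit of attention rollout, not the authors' actual implementation.

```python
import torch

def aggregate_attention(attn_maps, cls_index=0):
    """Multiply per-layer attention maps to combine low-level and high-level
    attention into a single map per image (attention-rollout style).

    attn_maps: list of tensors of shape (batch, heads, tokens, tokens),
               one per Transformer layer.
    Returns:   (batch, tokens - 1) attention of the class token over patch tokens.
    """
    rollout = None
    for attn in attn_maps:
        a = attn.mean(dim=1)                              # average over heads -> (B, T, T)
        a = a + torch.eye(a.size(-1), device=a.device)    # account for residual connections
        a = a / a.sum(dim=-1, keepdim=True)               # re-normalize rows
        rollout = a if rollout is None else rollout @ a   # multiply attention across layers
    cls_attn = rollout[:, cls_index, :]                   # class token's attention over all tokens
    # drop the class token's attention to itself, keep patch tokens only
    return torch.cat([cls_attn[:, :cls_index], cls_attn[:, cls_index + 1:]], dim=1)

def top_attention_mask(cls_attn, keep_ratio=0.3):
    """Keep only the highest-attention patch tokens as the 'local regions'."""
    k = max(1, int(cls_attn.size(1) * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices
    mask = torch.zeros_like(cls_attn, dtype=torch.bool)
    mask.scatter_(1, idx, True)
    return mask  # (batch, num_patches), True where attention is highest
```

A downstream module could then pool only the masked patch tokens and fuse them with the global class-token feature, in the spirit of the local-attention-guided encoding the abstract describes; how that fusion is performed in LG-CoT is detailed only in the full paper.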