Dynamic Attention Vision-Language Transformer Network for Person Re-identification

IF 11.6 · Zone 2 (Computer Science) · JCR Q1: Computer Science, Artificial Intelligence

International Journal of Computer Vision Pub Date : 2024-10-26 DOI:10.1007/s11263-024-02277-3

Guifang Zhang, Shijun Tan, Zhe Ji, Yuming Fang


Citation count: 0

Abstract

Multimodal person re-identification (ReID) has garnered increasing attention in recent years. However, integrating visual and textual information remains challenging. Existing methods frequently exhibit biases in feature integration, resulting in suboptimal performance and restricted generalization across a spectrum of ReID tasks. In addition, the domain gap between the datasets used to pretrain the model and the ReID datasets degrades performance. In response to these challenges, we propose a dynamic attention vision-language transformer network for the ReID task. In this network, a novel image-text dynamic attention module (ITDA) promotes unbiased feature integration by dynamically assigning importance to the image and text representations. Additionally, an adapter module addresses the domain gap between the pretraining datasets and the ReID datasets. Our network captures complex connections between visual and textual information and achieves satisfactory performance. We conducted numerous experiments on ReID benchmarks to demonstrate the efficacy of the proposed method. The experimental results show that our method achieves state-of-the-art performance, surpassing existing integration strategies. These findings underscore the critical role of unbiased, dynamic feature integration in enhancing the capabilities of multimodal ReID models.
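
The abstract gives no equations for the ITDA or adapter modules, so the following is only a minimal sketch of the two general ideas it names: weighting image and text features dynamically, and adding a small residual adapter to a pretrained feature. It is not the authors' implementation. The function names, the softmax-over-two-scores weighting, the `query` vector, and the bottleneck adapter form are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dynamic_fusion(image_feat, text_feat, query):
    """Fuse two modality features with input-dependent weights.

    Assumption: each modality is scored by a scaled dot product with a
    (hypothetical) learned query vector, and the two scores are normalized
    with a softmax so the weights sum to 1.
    """
    scale = math.sqrt(len(query))
    weights = softmax([
        dot(image_feat, query) / scale,
        dot(text_feat, query) / scale,
    ])
    fused = [weights[0] * i + weights[1] * t
             for i, t in zip(image_feat, text_feat)]
    return fused, weights


def adapter(x, w_down, w_up):
    """Generic bottleneck adapter: x + W_up * relu(W_down * x).

    Assumption: this is the common residual-adapter pattern, not
    necessarily the adapter used in the paper.
    """
    hidden = [max(0.0, dot(row, x)) for row in w_down]
    delta = [dot(row, hidden) for row in w_up]
    return [xi + di for xi, di in zip(x, delta)]
```

In this sketch, a modality whose feature aligns more strongly with `query` receives a larger weight, and an adapter whose up-projection is all zeros leaves the pretrained feature unchanged, which is a common way to initialize such modules.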


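Source journal

International Journal of Computer Vision (Engineering and Technology: Computer Science, Artificial Intelligence)

CiteScore: 29.80

Self-citation rate: 2.10%

Articles published: 163

Review time: 6 months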

About the journal: The International Journal of Computer Vision (IJCV) serves as a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal accepts several article types. Regular articles, of up to 25 journal pages, report significant technical advances of broad interest to the field. Short articles, limited to 10 pages, offer a faster publication path for novel research outcomes. Survey articles, of up to 30 pages, provide critical evaluations of the current state of the art in computer vision or tutorial presentations of relevant topics. In addition to technical articles, the journal includes book reviews, position papers, and editorials by prominent scientific figures. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software, to improve the understanding and reproducibility of the published research.