Fusion for Visual-Infrared Person ReID in Real-World Surveillance Using Corrupted Multimodal Data

Impact factor: 9.3; Zone 2 (Computer Science); Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

International Journal of Computer Vision Pub Date : 2025-03-18 DOI:10.1007/s11263-025-02396-5

Arthur Josi, Mahdi Alehdaghi, Rafael M. O. Cruz, Eric Granger


Citations: 0

Abstract

Visible-infrared person re-identification (V-I ReID) seeks to match images of individuals captured over a distributed network of RGB and IR cameras. The task is challenging due to the significant differences between the visible (V) and infrared (I) modalities, especially under real-world conditions, where images are subject to corruptions such as blur, noise, and weather. Despite their practical relevance, deep learning models for multimodal V-I ReID remain far less investigated than those for single-modality and cross-modal (V to I) settings. Moreover, state-of-the-art V-I ReID models cannot leverage corrupted modality information to sustain a high level of accuracy. In this paper, we propose an efficient model for multimodal V-I ReID, named Multimodal Middle Stream Fusion (MMSF), that preserves modality-specific knowledge for improved robustness to corrupted multimodal images. In addition, three state-of-the-art attention-based multimodal fusion models are adapted to address corrupted multimodal data in V-I ReID, allowing for dynamic balancing of the importance of each modality. The literature typically reports ReID performance on clean datasets, but more recently, evaluation protocols have been proposed to assess the robustness of ReID models under challenging real-world scenarios, using data with realistic corruptions. However, these protocols are limited to unimodal V settings. For realistic evaluation of multimodal (and cross-modal) V-I person ReID models, we propose new challenging corrupted datasets for scenarios where V and I cameras are co-located (CL) and not co-located (NCL). Finally, the benefits of our Masking and Local Multimodal Data Augmentation (ML-MDA) strategy are explored to improve the robustness of ReID models to multimodal corruption. Our experiments on clean and corrupted versions of the SYSU-MM01, RegDB, and ThermalWORLD datasets indicate which multimodal V-I ReID models are more likely to perform well in real-world operational conditions. In particular, the proposed ML-MDA is shown to be essential for a V-I person ReID system to sustain high accuracy and robustness in the face of corrupted multimodal images. Our multimodal ReID models attain the best trade-off between accuracy and complexity under both CL and NCL settings compared to state-of-the-art unimodal ReID systems, except on the ThermalWORLD dataset, due to its low-quality I images. Our MMSF model outperforms every other method under CL and NCL camera scenarios. GitHub code: https://github.com/art2611/MREiD-UCD-CCD.git.
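
The abstract does not spell out how ML-MDA operates, so the following is only a generic sketch of the two ideas its name suggests: masking a whole modality and corrupting a local region of each image. It is not the authors' implementation. All function names, the probability `p_mask`, and the patch size are hypothetical, and images are modeled as plain nested lists to keep the example dependency-free.

```python
import random

# A grayscale image as rows of pixel intensities in [0, 1] (illustrative type).
Image = list[list[float]]


def mask_modality(visible, infrared, p_mask, rng):
    """With probability p_mask, blank out exactly one of the two modalities."""
    if rng.random() < p_mask:
        if rng.random() < 0.5:
            visible = [[0.0] * len(row) for row in visible]
        else:
            infrared = [[0.0] * len(row) for row in infrared]
    return visible, infrared


def erase_patch(image, patch_h, patch_w, rng):
    """Return a copy of image with one randomly placed patch set to zero."""
    height, width = len(image), len(image[0])
    top = rng.randrange(height - patch_h + 1)
    left = rng.randrange(width - patch_w + 1)
    out = [row[:] for row in image]  # copy so the input is left untouched
    for r in range(top, top + patch_h):
        for c in range(left, left + patch_w):
            out[r][c] = 0.0
    return out


def augment_pair(visible, infrared, rng, p_mask=0.25, patch=(2, 2)):
    """Hypothetical augmentation of a paired V-I sample (assumed design)."""
    visible, infrared = mask_modality(visible, infrared, p_mask, rng)
    # Patch positions are drawn independently for each modality.
    visible = erase_patch(visible, patch[0], patch[1], rng)
    infrared = erase_patch(infrared, patch[0], patch[1], rng)
    return visible, infrared


if __name__ == "__main__":
    rng = random.Random(0)
    v = [[1.0] * 4 for _ in range(4)]
    i = [[1.0] * 4 for _ in range(4)]
    v_aug, i_aug = augment_pair(v, i, rng)
    print(v_aug)
    print(i_aug)
```

Training a fusion model on pairs augmented this way is one plausible route to discouraging over-reliance on a single modality. The linked repository is the reference for what the paper actually does.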


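Source journal

International Journal of Computer Vision (Engineering and Technology; Computer Science: Artificial Intelligence)

CiteScore: 29.80
Self-citation rate: 2.10%
Articles published: 163
Review time: 6 months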

Journal description: The International Journal of Computer Vision (IJCV) publishes high-quality, original contributions to the science and engineering of computer vision, in 12 issues per year. Regular articles (up to 25 journal pages) present significant technical advances of broad interest to the field. Short articles (up to 10 pages) offer a faster publication path for novel results. Survey articles (up to 30 pages) critically evaluate the state of the art or give tutorial presentations of relevant topics. The journal also publishes book reviews, position papers, and editorials by prominent scientific figures. Authors are encouraged to provide online supplementary material, such as images, video sequences, data sets, and software, to aid understanding and reproducibility of the published research.