Extended excitation backprop with gradient weighting: A general visualization solution for understanding heterogeneous face recognition
Yan Wang, Sivapriyaa Kannappan, Fangliang Bai, Stuart Gibson, Christopher Solomon
Pattern Recognition Letters 192 (2025), pp. 136-143; published online 2025-04-01. DOI: 10.1016/j.patrec.2025.03.032

Abstract: Visualization methods have been used to reveal the areas of an image that influence the decision making of machine learning models, thereby helping to understand and diagnose learned models and to suggest ways of improving their performance; this concept is termed Explainable Artificial Intelligence. In this work, we focus on visualization methods for metric-learning-based neural networks. We propose a gradient-weighted extended Excitation Back-Propagation (gweEBP) method that integrates gradient information during backpropagation for the accurate investigation of embedding networks. We perform an extensive evaluation of gweEBP and seven other visualization methods on two neural networks trained for heterogeneous face recognition. The evaluation is carried out on two publicly available cross-modality datasets using two evaluation protocols, the "hiding game" and the "inpainting game". Our experiments show that the proposed method outperforms the competing methods in both games in most cases. Our comprehensive study also provides a benchmark for comparing visualization techniques, which may help other researchers develop new techniques and perform comparative studies.
Generating visual-adaptive audio representation for audio recognition
Jongsu Youn, Dae Ung Jo, Seungmo Seo, Sukhyun Kim, Jongwon Choi
Pattern Recognition Letters 192 (2025), pp. 65-71; published online 2025-03-28. DOI: 10.1016/j.patrec.2025.03.020

Abstract: We propose "Visual-adaptive Audio Spectrogram Generation" (VASG), an audio feature generation method that preserves the Mel-spectrogram's structure while enhancing its discriminability. VASG maintains the spatio-temporal information of the Mel-spectrogram without degrading the performance of existing audio recognition models, and improves intra-class discriminability by incorporating the relational knowledge of images. Images are used only during the training phase; once trained, VASG serves as a converter that takes an input Mel-spectrogram and outputs an enhanced Mel-spectrogram, improving the discriminability of audio spectrograms without requiring further training at deployment. To effectively increase the discriminability of the encoded audio features, we introduce a novel audio-visual correlation learning loss, the "Batch-wise Correlation Transfer" loss, which aligns the inter-correlation between the audio and visual modalities. When applying the pre-trained VASG to environmental sound classification benchmarks, we observed performance improvements in various audio classification models. Using the enhanced Mel-spectrograms produced by VASG, as opposed to the original Mel-spectrogram input, led to performance gains in recent state-of-the-art models, with accuracy increases of up to 4.27%.
Robust camera-independent color chart localization using YOLO
Luca Cogo, Marco Buzzelli, Simone Bianco, Raimondo Schettini
Pattern Recognition Letters 192 (2025), pp. 51-58; published online 2025-03-28. DOI: 10.1016/j.patrec.2025.03.022

Abstract: Accurate color information plays a critical role in numerous computer vision tasks, with the Macbeth ColorChecker being a widely used reference target due to its colorimetrically characterized color patches. However, automating the precise extraction of color information in complex scenes remains a challenge. In this paper, we propose a novel method for the automatic detection and accurate extraction of color information from Macbeth ColorCheckers in challenging environments. Our approach involves two distinct phases: (i) a chart localization step using a deep learning model to identify the presence of the ColorChecker, and (ii) a consensus-based pose estimation and color extraction phase that ensures precise localization and description of individual color patches. We rigorously evaluate our method on the widely adopted NUS and ColorChecker datasets. Comparative results show that our method outperforms the best state-of-the-art solution, achieving about a 5% improvement on the ColorChecker dataset and about 17% on the NUS dataset. Furthermore, the design of our approach enables it to handle multiple ColorCheckers in complex scenes. Code will be made available after publication at: https://github.com/LucaCogo/ColorChartLocalization.
{"title":"Geometrical preservation and correlation learning for multi-source unsupervised domain adaptation","authors":"Huiling Fu, Yuwu Lu","doi":"10.1016/j.patrec.2025.03.018","DOIUrl":"10.1016/j.patrec.2025.03.018","url":null,"abstract":"<div><div>Multi-source unsupervised domain adaptation (MUDA) aims to improve the performance of the model on the target domain by utilizing useful information from several source domains with distinct distributions. However, due to the diverse information in each domain, how to extract and transfer useful information from source domains is essential for MUDA. Most existing MUDA methods simply minimized the distribution incongruity among multiple domains, without fully considering the unique information within each domain and the relationships between different domains. In response to these challenges, we propose a novel MUDA approach named geometrical preservation correlation learning (GPCL). Specifically, GPCL integrates graph regularization and correlation learning within the nonnegative matrix factorization (NMF) structure, leveraging the inherent geometry of the data distribution to acquire discriminative features while maintaining both the local and global geometrical structures of the original data. Meanwhile, GPCL extracts the maximum correlation information from each source domain and target domain to further narrow their domain discrepancy and ensure positive knowledge transfer. Integrated experimental results across multiple benchmarks verify that GPCL performs better than several existing MUDA approaches, showcasing the efficiency of our method in MUDA. For example, on the Office-Home dataset, GPCL outperforms the SOTA by an average of 1.58%. On the ImageCLEF-DA dataset, GPCL achieves the best results across multiple sub-tasks and the average performance, outperforming the single-source SOTA by 2.3%, 2%, and 1.26%, respectively.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"192 ","pages":"Pages 72-78"},"PeriodicalIF":3.9,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143760747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FoodMem: Near real-time and precise food video segmentation","authors":"Ahmad AlMughrabi , Adrián Galán , Ricardo Marques , Petia Radeva","doi":"10.1016/j.patrec.2025.03.014","DOIUrl":"10.1016/j.patrec.2025.03.014","url":null,"abstract":"<div><div>Food segmentation, including in videos, is vital for addressing real-world health, agriculture, and food biotechnology issues. Current limitations lead to inaccurate nutritional analysis, inefficient crop management, and suboptimal food processing, impacting food security and public health. Improving segmentation techniques can enhance dietary assessments, agricultural productivity, and the food production process. This study introduces the development of a robust framework for high-quality, near-real-time segmentation and tracking of food items in videos, using minimal hardware resources. We present FoodMem, a novel framework designed to segment food items from video sequences of 360-degree unbounded scenes. FoodMem can consistently generate masks of food portions in a video sequence, overcoming the limitations of existing semantic segmentation models, such as flickering and prohibitive inference speeds in video processing contexts. To address these issues, FoodMem leverages a two-phase solution: a transformer segmentation phase to create initial segmentation masks and a memory-based tracking phase to monitor food masks in complex scenes. Our framework outperforms current state-of-the-art food segmentation models, yielding superior performance across various conditions, such as camera angles, lighting, reflections, scene complexity, and food diversity.<span><span><sup>2</sup></span></span>This results in reduced segmentation noise, elimination of artifacts, and completion of missing segments. We also introduce a new annotated food dataset encompassing challenging scenarios absent in previous benchmarks. Extensive experiments conducted on MetaFood3D, Nutrition5k, and Vegetables & Fruits datasets demonstrate that FoodMem enhances the state-of-the-art by 2.5% mean average precision in food video segmentation and is <span><math><mrow><mn>58</mn><mo>×</mo></mrow></math></span> faster on average. The source code is available at: <span><span><sup>3</sup></span></span>.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"192 ","pages":"Pages 59-64"},"PeriodicalIF":3.9,"publicationDate":"2025-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143747841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Group commonality graph: Multimodal pedestrian trajectory prediction via deep group features
Di Zhou, Ying Gao, Hui Li, Xiaoya Liu, Qinghua Lin
Pattern Recognition Letters 192 (2025), pp. 36-42; published online 2025-03-24. DOI: 10.1016/j.patrec.2025.03.019

Abstract: Pedestrian trajectory prediction is a challenging task in domains such as autonomous driving and robot motion planning. Existing methods often aggregate nearby individuals into a single group, neglecting individual differences and the risks of unreliable interactions. We therefore propose a novel framework termed the group commonality graph, which comprises a group feature capture network and a spatial-temporal graph sparse connected network. The former groups and pools pedestrians based on their characteristics, capturing and integrating deep group features to generate the final prediction; the latter learns pedestrian motion patterns and simulates their interactive relationships. The framework not only addresses the limitations of overly simplistic aggregation methods but also ensures reliable interactions with sparse directionality. Additionally, to evaluate the effectiveness of our model, we introduce a new evaluation metric termed collision prediction error, which incorporates map environment information to assess the comprehensiveness of multimodal prediction results. Experimental results on a public pedestrian trajectory prediction benchmark demonstrate that our method outperforms state-of-the-art methods.
FSMT: Few-shot object detection via Multi-Task Decoupled
Jiahui Qin, Yang Xu, Yifan Fu, Zebin Wu, Zhihui Wei
Pattern Recognition Letters 192 (2025), pp. 8-14; published online 2025-03-21. DOI: 10.1016/j.patrec.2025.03.016

Abstract: With the advancement of object detection technology, few-shot object detection (FSOD) has become a research hotspot. Existing methods face two major challenges: first, base models generalize poorly to unseen categories, especially with limited few-shot data, where a shared feature representation fails to meet the distinct needs of the classification and regression tasks; second, FSOD is susceptible to overfitting during training. To address these issues, this paper proposes a Multi-Task Decoupled Method (MTDM), which enhances the model's generalization to new categories by separating the feature extraction processes for the different tasks. Additionally, a dynamic adjustment strategy adaptively modifies the IoU threshold and loss function parameters based on variations in the training data, reducing the risk of overfitting and maximizing the utilization of limited data resources. Experimental results show that the proposed hybrid model performs well on multiple few-shot datasets, effectively overcoming the challenges posed by limited annotated data.
Integrating large language models with explainable fuzzy inference systems for trusty steel defect detection
Kening Zhang, Yung Po Tsang, Carman K.M. Lee, C.H. Wu
Pattern Recognition Letters 192 (2025), pp. 29-35; published online 2025-03-20. DOI: 10.1016/j.patrec.2025.03.017

Abstract: In industrial applications, the complexity of machine learning models often makes their decision-making processes difficult to interpret and lacking in transparency, particularly in the steel manufacturing sector. Understanding these processes is crucial for ensuring quality control, regulatory compliance, and the trust of stakeholders. To address this issue, this paper proposes LE-FIS, a large language model (LLM)-based Explainable Fuzzy Inference System that interprets black-box models for steel defect detection. The method introduces a locally trained, globally predicted deep detection approach (LTGP), which segments the image into small parts for local training and then tests on the entire image for steel defect detection. LE-FIS is then designed to explain LTGP by automatically generating rules and membership functions, with a genetic algorithm (GA) used to optimize the parameters. Furthermore, state-of-the-art LLMs are employed to interpret the results of LE-FIS, and evaluation metrics are established for comparison and analysis. Experimental results demonstrate that LTGP performs well in defect detection tasks, and that LE-FIS supported by LLMs provides a trustworthy and interpretable model for steel defect detection, enhancing transparency and reliability in industrial environments.
{"title":"Advancing video self-supervised learning via image foundation models","authors":"Jingwei Wu , Zhewei Huang , Chang Liu","doi":"10.1016/j.patrec.2025.03.015","DOIUrl":"10.1016/j.patrec.2025.03.015","url":null,"abstract":"<div><div>In the past decade, image foundation models (IFMs) have achieved unprecedented progress. However, the potential of directly using IFMs for video self-supervised representation learning has largely been overlooked. In this study, we propose an advancing video self-supervised learning (AdViSe) approach, aimed at significantly reducing the training overhead of video representation models using pre-trained IFMs. Specifically, we first introduce temporal modeling modules (ResNet3D) to IFMs, constructing a video representation model. We then employ a video self-supervised learning approach, playback rate perception, to train temporal modules while freezing the IFM components. Experiments on UCF101 demonstrate that AdViSe achieves performance comparable to state-of-the-art methods while reducing training time by <span><math><mrow><mn>3</mn><mo>.</mo><mn>4</mn><mo>×</mo></mrow></math></span> and GPU memory usage by <span><math><mrow><mn>8</mn><mo>.</mo><mn>2</mn><mo>×</mo></mrow></math></span>. This study offers fresh insights into low-cost video self-supervised learning based on pre-trained IFMs. Code is available at <span><span>https://github.com/JingwWu/advise-video-ssl</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"192 ","pages":"Pages 22-28"},"PeriodicalIF":3.9,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143696940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cross-modal contrastive learning with multi-hierarchical tracklet clustering for multi object tracking","authors":"Ru Hong, Jiming Yang, Zeyu Cai, Feipeng Da","doi":"10.1016/j.patrec.2025.02.032","DOIUrl":"10.1016/j.patrec.2025.02.032","url":null,"abstract":"<div><div>The tracklet-based offline multi-object tracking (MOT) paradigm addresses the challenge of long-term association in online mode by utilizing global optimization for tracklet clustering in videos. The key to accurate offline MOT lies in establishing robust similarity between tracklets by leveraging both their temporal motion and appearance cues. To this end, we propose a multi-hierarchical tracklet clustering method based on cross-modal contrastive learning, called MHCM2DMOT. This method incorporates three key techniques: (I) A tracklet generation strategy based on motion association uniqueness, which ensures efficient object association across consecutive frames while preserving identity uniqueness; (II) Encoding tracklet motion and appearance cues through both language and visual models, enhancing interaction between different modal features via cross-modal contrastive learning to produce more distinct multi-modal fusion similarities; (III) A multi-hierarchical tracklet clustering method using graph attention network, which balances tracking performance with inference speed. Our tracker achieves state-of-the-art results on popular MOT datasets, ensuring accurate tracking performance.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"192 ","pages":"Pages 1-7"},"PeriodicalIF":3.9,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143686423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}