{"title":"Eye-movement-prompted large image captioning model","authors":"Zheng Yang , Bing Han , Xinbo Gao , Zhi-Hui Zhan","doi":"10.1016/j.patcog.2024.111097","DOIUrl":"10.1016/j.patcog.2024.111097","url":null,"abstract":"<div><div>Pretrained large vision-language models have shown outstanding performance on the task of image captioning. However, owing to the insufficient decoding of image features, existing large models sometimes lose important information, such as objects, scenes, and their relationships. In addition, the complex “black-box” nature of these models makes their mechanisms difficult to explain. Research shows that humans learn richer representations than machines do, which inspires us to improve the accuracy and interpretability of large image captioning models by combining human observation patterns. We built a new dataset, called saliency in image captioning (SIC), to explore relationships between human vision and language representation. One thousand images with rich context information were selected as image data of SIC. Each image was annotated with five caption labels and five eye-movement labels. Through analysis of the eye-movement data, we found that humans efficiently captured comprehensive information for image captioning during their observations. Therefore, we propose an eye-movement-prompted large image captioning model, which is embedded with two carefully designed modules: the eye-movement simulation module (EMS) and the eye-movement analyzing module (EMA). EMS combines the human observation pattern to simulate eye-movement features, including the positions and scan paths of eye fixations. EMA is a graph neural network (GNN) based module, which decodes graphical eye-movement data and abstracts image features as a directed graph. More accurate descriptions can be predicted by decoding the generated graph. Extensive experiments were conducted on the MS-COCO and NoCaps datasets to validate our model. 
The experimental results showed that our network was interpretable, and could achieve superior results compared with state-of-the-art methods, <em>i.e.</em>, 84.2% BLEU-4 and 145.1% CIDEr-D on MS-COCO Karpathy test split, indicating its strong potential for use in image captioning.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111097"},"PeriodicalIF":7.5,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GBMOD: A granular-ball mean-shift outlier detector","authors":"Shitong Cheng , Xinyu Su , Baiyang Chen , Hongmei Chen , Dezhong Peng , Zhong Yuan","doi":"10.1016/j.patcog.2024.111115","DOIUrl":"10.1016/j.patcog.2024.111115","url":null,"abstract":"<div><div>Outlier detection is a crucial data mining task involving identifying abnormal objects, errors, or emerging trends. Mean-shift-based outlier detection techniques evaluate the abnormality of an object by calculating the mean distance between the object and its <span><math><mi>k</mi></math></span>-nearest neighbors. However, in datasets with significant noise, the presence of noise in the <span><math><mi>k</mi></math></span>-nearest neighbors of some objects makes the model ineffective in detecting outliers. Additionally, the mean-shift outlier detection technique depends on finding the <span><math><mi>k</mi></math></span>-nearest neighbors of an object, which can be time-consuming. To address these issues, we propose a granular-ball computing-based mean-shift outlier detection method (GBMOD). Specifically, we first generate high-quality granular-balls to cover the data. By using the centers of granular-balls as anchors, the subsequent mean-shift process can effectively avoid the influence of noise points in the neighborhood. Finally, outliers are detected based on the distance from each object to the shifted center of the granular-ball to which it belongs, and these distances serve as the outlier scores of the objects.
Subsequent experiments demonstrate the effectiveness, efficiency, and robustness of the method proposed in this paper.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111115"},"PeriodicalIF":7.5,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142593963","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
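As a rough illustration of the scoring idea in the abstract above, the sketch below approximates granular-balls with plain k-means clusters and applies one mean-shift step to the ball centers; the ball-generation procedure, neighbor counts, and function names are illustrative assumptions, not the paper's actual GBMOD algorithm.

```python
import numpy as np

def kmeans_balls(X, n_balls, n_iter=20, seed=0):
    """Crude stand-in for granular-ball generation: partition X with k-means
    and treat each cluster as a 'ball' whose center is its mean."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=n_balls, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(n_balls):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
    return centers, labels

def gbmod_scores(X, n_balls=8, k_centers=3):
    """Outlier score = distance from each object to the shifted center of its ball."""
    centers, labels = kmeans_balls(X, n_balls)
    # One mean-shift step over ball centers: move each center to the mean of
    # its k_centers nearest centers (itself included), so an isolated ball's
    # center is pulled toward the data mass and its members score highly.
    d = ((centers[:, None] - centers[None]) ** 2).sum(-1)
    shifted = np.empty_like(centers)
    for j in range(len(centers)):
        shifted[j] = centers[np.argsort(d[j])[:k_centers]].mean(axis=0)
    return np.linalg.norm(X - shifted[labels], axis=1)
```

Because the shift operates on ball centers rather than raw points, noisy neighbors of an individual object never enter the computation, which is the efficiency/robustness argument the abstract makes.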
{"title":"Improving the sparse coding model via hybrid Gaussian priors","authors":"Lijian Yang , Jianxun Mi , Weisheng Li , Guofen Wang , Bin Xiao","doi":"10.1016/j.patcog.2024.111102","DOIUrl":"10.1016/j.patcog.2024.111102","url":null,"abstract":"<div><div>Sparse Coding (SC) imposes a sparse prior on the representation coefficients under a dictionary or a sensing matrix. However, the sparse regularization, approximately expressed as the <span><math><msub><mrow><mi>L</mi></mrow><mrow><mn>1</mn></mrow></msub></math></span>-norm, is not strongly convex. The uniqueness of the optimal solution requires the dictionary to be of low mutual coherence. As a specialized form of SC, Convolutional Sparse Coding (CSC) encounters the same issue. Inspired by the Elastic Net, this paper proposes to learn an additional anisotropic Gaussian prior for the sparse codes, thus improving the convexity of the SC problem and enabling the modeling of feature correlation. As a result, the SC problem is modified by the proposed elastic projection. We thereby analyze the effectiveness of the proposed method under the framework of LISTA and demonstrate that this simple technique has the potential to correct bad codes and reduce the error bound, especially in noisy scenarios. Furthermore, we extend this technique to the CSC model for the vision practice of image denoising. Extensive experimental results show that the learned Gaussian prior significantly improves the performance of both the SC and CSC models. 
Source codes are available at <span><span>https://github.com/eeejyang/EPCSCNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111102"},"PeriodicalIF":7.5,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
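The elastic-net flavor of the idea above can be illustrated with a plain ISTA solver in which an isotropic ridge term stands in for the paper's learned anisotropic Gaussian prior; this is a minimal sketch under that simplification, not the authors' EPCSCNet implementation.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def elastic_net_ista(D, y, lam=0.1, mu=0.1, n_iter=200):
    """ISTA for 0.5*||y - Dx||^2 + lam*||x||_1 + 0.5*mu*||x||^2.
    The ridge term mu is an isotropic stand-in for a learned Gaussian prior;
    mu > 0 makes the smooth part strongly convex, so the minimizer is unique
    even when D has high mutual coherence."""
    L = np.linalg.norm(D, 2) ** 2 + mu   # Lipschitz constant of the smooth part
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ x - y) + mu * x
        x = soft_threshold(x - grad / L, lam / L)
    return x
```

For an orthonormal dictionary the solution is available in closed form, `soft_threshold(y, lam) / (1 + mu)`, which makes the shrink-then-rescale effect of the extra Gaussian prior easy to see.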
{"title":"Data augmentation strategies for semi-supervised medical image segmentation","authors":"Jiahui Wang , Dongsheng Ruan , Yang Li , Zefeng Wang , Yongquan Wu , Tao Tan , Guang Yang , Mingfeng Jiang","doi":"10.1016/j.patcog.2024.111116","DOIUrl":"10.1016/j.patcog.2024.111116","url":null,"abstract":"<div><div>Exploiting unlabeled and labeled data augmentations has become considerably important for semi-supervised medical image segmentation tasks. However, existing data augmentation methods, such as Cut-mix and generative models, typically depend on consistency regularization or ignore data correlation between slices. To address these problems, we propose two novel data augmentation strategies and a Dual Attention-guided Consistency network (DACNet) to significantly improve semi-supervised medical image segmentation performance. For labeled data augmentation, we randomly crop and stitch annotated data rather than unlabeled data to create mixed annotated data, which breaks the anatomical structures and introduces voxel-level uncertainty in limited annotated data. For unlabeled data augmentation, we combine the diffusion model with the Laplacian pyramid fusion strategy to generate unlabeled data with higher slice correlation. To encourage the decoders to learn distinct yet discriminative semantic features, we propose the DACNet to achieve structural differentiation by introducing spatial and channel attention into the decoders. Extensive experiments are conducted to show the effectiveness and generalization of our approach. Specifically, our proposed labeled and unlabeled data augmentation strategies improved accuracy by 0.3% to 16.49% and 0.22% to 1.72%, respectively, when compared with various state-of-the-art semi-supervised methods. Furthermore, our DACNet outperforms existing methods on three medical datasets (91.72% Dice score with 20% labeled data on the LA dataset).
Source code will be publicly available at <span><span>https://github.com/Oubit1/DACNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111116"},"PeriodicalIF":7.5,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FedKT: Federated learning with knowledge transfer for non-IID data","authors":"Wenjie Mao , Bin Yu , Chen Zhang , A.K. Qin , Yu Xie","doi":"10.1016/j.patcog.2024.111143","DOIUrl":"10.1016/j.patcog.2024.111143","url":null,"abstract":"<div><div>Federated Learning enables clients to train a joint model collaboratively without disclosing raw data. However, learning over non-IID data can cause performance degradation, which has become a fundamental bottleneck. Despite numerous efforts to address this issue, challenges such as excessive local computational burdens and reliance on shared data persist, rendering existing solutions impractical in real-world scenarios. In this paper, we propose a novel federated knowledge transfer framework to overcome data heterogeneity issues. Specifically, a model segmentation distillation method and a learnable aggregation network are developed for server-side knowledge ensemble and transfer, while a client-side consistency-constrained loss is devised to rectify local updates, thereby enhancing both global and client models. The framework considers both diversity and consistency among clients and can serve as a general solution for extracting knowledge from distributed nodes.
Extensive experiments on four datasets demonstrate our framework’s effectiveness, achieving superior performance compared to advanced competitors in high-heterogeneity settings.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111143"},"PeriodicalIF":7.5,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142593958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
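A minimal sketch of a client-side consistency-constrained objective in the spirit described above: cross-entropy on the local labels plus a KL term pulling local predictions toward the global model's. The exact loss, its weighting, and the function names here are assumptions, not FedKT's published formulation.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def consistency_loss(local_logits, global_logits, labels, beta=0.5):
    """Hypothetical client objective: CE on local labels + beta * KL(global || local).
    The KL term rectifies local updates by keeping the client's predictions
    close to the (frozen) global model's on the same batch."""
    p_local = softmax(local_logits)
    p_global = softmax(global_logits)
    n = len(labels)
    ce = -np.log(p_local[np.arange(n), labels] + 1e-12).mean()
    kl = (p_global * (np.log(p_global + 1e-12)
                      - np.log(p_local + 1e-12))).sum(-1).mean()
    return ce + beta * kl
```

When local and global predictions agree the penalty vanishes and the loss reduces to plain cross-entropy; disagreement is charged in proportion to how far the client drifts.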
{"title":"Radar gait recognition using Dual-branch Swin Transformer with Asymmetric Attention Fusion","authors":"Wentao He , Jianfeng Ren , Ruibin Bai , Xudong Jiang","doi":"10.1016/j.patcog.2024.111101","DOIUrl":"10.1016/j.patcog.2024.111101","url":null,"abstract":"<div><div>Video-based gait recognition suffers from potential privacy issues and performance degradation due to dim environments, partial occlusions, or camera view changes. Radar has recently become increasingly popular and overcome various challenges presented by vision sensors. To capture tiny differences in radar gait signatures of different people, a dual-branch Swin Transformer is proposed, where one branch captures the time variations of the radar micro-Doppler signature and the other captures the repetitive frequency patterns in the spectrogram. Unlike natural images where objects can be translated, rotated, or scaled, the spatial coordinates of spectrograms and CVDs have unique physical meanings, and there is no affine transformation for radar targets in these synthetic images. The patch splitting mechanism in Vision Transformer makes it ideal to extract discriminant information from patches, and learn the attentive information across patches, as each patch carries some unique physical properties of radar targets. Swin Transformer consists of a set of cascaded Swin blocks to extract semantic features from shallow to deep representations, further improving the classification performance. Lastly, to highlight the branch with larger discriminant power, an Asymmetric Attention Fusion is proposed to optimally fuse the discriminant features from the two branches. To enrich the research on radar gait recognition, a large-scale NTU-RGR dataset is constructed, containing 45,768 radar frames of 98 subjects. The proposed method is evaluated on the NTU-RGR dataset and the MMRGait-1.0 database. It consistently and significantly outperforms all the compared methods on both datasets. 
<em>The codes are available at:</em> <span><span>https://github.com/wentaoheunnc/NTU-RGR</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111101"},"PeriodicalIF":7.5,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
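The fusion step described above can be caricatured as a learned softmax gate over the two branch features, with a bias term supplying the asymmetry that favors the more discriminant branch; the shapes and names below are illustrative, not the paper's exact design.

```python
import numpy as np

def asymmetric_attention_fusion(f_td, f_cvd, W, bias):
    """Toy fusion of two branch features (micro-Doppler and CVD branches).
    A scalar attention score per branch comes from a learned projection W of
    the concatenated features; 'bias' skews the softmax toward the branch
    with larger discriminant power (the asymmetry)."""
    h = np.concatenate([f_td, f_cvd])
    scores = W @ h + bias                  # shape (2,): one score per branch
    a = np.exp(scores - scores.max())      # stable softmax over the 2 scores
    a = a / a.sum()
    fused = a[0] * f_td + a[1] * f_cvd
    return fused, a
```

With a zero projection and zero bias the gate degenerates to an even average; a positive bias on the first score lets that branch dominate regardless of the input, which is the "asymmetric" knob.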
{"title":"MWVOS: Mask-Free Weakly Supervised Video Object Segmentation via promptable foundation model","authors":"Zhenghao Zhang , Shengfan Zhang , Zuozhuo Dai , Zilong Dong , Siyu Zhu","doi":"10.1016/j.patcog.2024.111100","DOIUrl":"10.1016/j.patcog.2024.111100","url":null,"abstract":"<div><div>The current state-of-the-art techniques for video object segmentation necessitate extensive training on video datasets with mask annotations, thereby constraining their ability to transfer zero-shot learning to new image distributions and tasks. However, recent advancements in foundation models, particularly in the domain of image segmentation, have showcased robust generalization capabilities, introducing a novel prompt-driven paradigm for a variety of downstream segmentation challenges on new data distributions. This study delves into the potential of vision foundation models using diverse prompt strategies and proposes a mask-free approach for unsupervised video object segmentation. To further improve the efficacy of prompt learning in diverse and complex video scenes, we introduce a spatial–temporal decoupled deformable attention mechanism to establish an effective correlation between intra- and inter-frame features. 
Extensive experiments conducted on the DAVIS2017-unsupervised, YoutubeVIS19&21, and OIVS datasets demonstrate the superior performance of the proposed approach without mask supervision when compared to existing mask-supervised methods, as well as its capacity to generalize to weakly-annotated video datasets.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111100"},"PeriodicalIF":7.5,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652241","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Jointly stochastic fully symmetric interpolatory rules and local approximation for scalable Gaussian process regression","authors":"Hongli Zhang, Jinglei Liu","doi":"10.1016/j.patcog.2024.111125","DOIUrl":"10.1016/j.patcog.2024.111125","url":null,"abstract":"<div><div>When exploring the broad application prospects of large-scale Gaussian process regression (GPR), three core challenges constrain its effectiveness: first, the <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>n</mi></mrow><mrow><mn>3</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> time complexity of computing the inverse covariance matrix of <span><math><mi>n</mi></math></span> training points becomes a performance bottleneck on large-scale datasets; second, traditional local approximation methods, although widely used, often suffer from inconsistent predictions; and third, many aggregation strategies lack discrimination when evaluating the importance of experts (i.e., local models), which reduces overall prediction accuracy. To address these challenges, this article proposes a comprehensive method that integrates third-degree stochastic fully symmetric interpolatory rules (TDSFSI), local approximation, and Tsallis mutual information (TDSFSIRLA). Specifically, TDSFSIRLA introduces an efficient third-degree stochastic fully symmetric interpolatory rule, which accurately approximates Gaussian kernel functions by generating adaptive dimensional feature maps. This reduces the number of required orthogonal nodes and lowers computational cost while maintaining high approximation accuracy, providing a solid theoretical foundation for processing large-scale datasets.
Furthermore, to overcome the inconsistency of local approximation methods, this paper adopts the Generalized Robust Bayesian Committee Machine (GRBCM) as the aggregation framework for local experts; its inherent consistency and robustness keep the predictions of the local models in agreement, improving the stability and reliability of the overall prediction. To address the uneven distribution of expert weights, the article further introduces Tsallis mutual information as the metric for weight allocation. Because Tsallis mutual information is sensitive to information complexity, it assigns each local expert a weight that matches its contribution, mitigating the prediction bias caused by uneven weight distribution and further improving accuracy. Comprehensive experiments on multiple synthetic datasets and seven representative real datasets show that TDSFSIRLA significantly reduces time complexity while delivering excellent prediction accuracy, verifying its advantages in large-scale Gaussian process regression.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111125"},"PeriodicalIF":7.5,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
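For reference, the (robust) Bayesian-committee-machine family that GRBCM builds on combines local experts by precision weighting. Below is a minimal sketch at a single test point, with the per-expert weights left generic; the paper derives its weights from Tsallis mutual information and uses a refined GRBCM correction, so this is the classic textbook form, not the authors' method.

```python
import numpy as np

def aggregate_experts(means, variances, prior_var, betas=None):
    """Robust-BCM-style combination of M local GP experts at one test point.
    means/variances: per-expert predictive mean and variance.
    betas: per-expert weights (uniform weights recover the standard RBCM);
    prior_var: the GP prior variance at the test point."""
    means = np.asarray(means, float)
    variances = np.asarray(variances, float)
    betas = np.ones_like(means) if betas is None else np.asarray(betas, float)
    # Combined precision: weighted expert precisions plus a prior correction,
    # so experts that are no more informative than the prior contribute little.
    prec = (betas / variances).sum() + (1.0 - betas.sum()) / prior_var
    var = 1.0 / prec
    mean = var * (betas * means / variances).sum()
    return mean, var
```

Down-weighting an expert (beta -> 0) removes both its mean pull and its precision contribution, which is exactly the lever a mutual-information-based weighting scheme turns.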
{"title":"Privacy-preserving speaker verification system using Ranking-of-Element hashing","authors":"Hong-Hanh Nguyen-Le , Lam Tran , Dinh Song An Nguyen , Nhien-An Le-Khac , Thuc Nguyen","doi":"10.1016/j.patcog.2024.111107","DOIUrl":"10.1016/j.patcog.2024.111107","url":null,"abstract":"<div><div>The advancements in automatic speaker recognition have led to the exploration of voice data for verification systems. This raises concerns about the security of storing voice templates in plaintext. In this paper, we propose a novel cancellable biometric scheme that does not require users to manage random matrices or tokens. First, we pre-process the raw voice data and feed it into a deep feature extraction module to obtain embeddings. Next, we propose a hashing scheme, Ranking-of-Elements, which generates compact hashed codes by recording the number of elements whose values are lower than that of a random element. This approach captures more information from smaller-valued elements and prevents the adversary from guessing the ranking value through Attacks via Record Multiplicity. Lastly, we introduce a fuzzy matching method to mitigate the variations in templates resulting from environmental noise. We evaluate the performance and security of our method on two datasets: TIMIT and VoxCeleb1.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111107"},"PeriodicalIF":7.5,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
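A toy version of the ranking idea described in the abstract above: each code element counts how many embedding entries are smaller than the entry at a randomly chosen position, which makes the code invariant to any strictly increasing transform of the embedding. The seed handling, parameter names, and sizes are illustrative assumptions, not the paper's scheme.

```python
import numpy as np

def roe_hash(embedding, n_codes=16, seed=42):
    """Toy Ranking-of-Elements hash: for each of n_codes randomly chosen
    positions, emit the count of embedding elements strictly smaller than
    the element at that position. The fixed seed stands in for a system-wide
    randomness source that the user never has to store or manage."""
    emb = np.asarray(embedding)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(emb), size=n_codes)  # random reference positions
    return np.array([(emb < emb[i]).sum() for i in idx])
```

Because only relative order survives, the raw embedding values cannot be read off the code, and small additive noise that does not reorder elements leaves the code unchanged, which is what the fuzzy matching step then tolerates.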
{"title":"HDR reconstruction from a single exposure LDR using texture and structure dual-stream generation","authors":"Yu-Hsiang Chen, Shanq-Jang Ruan","doi":"10.1016/j.patcog.2024.111127","DOIUrl":"10.1016/j.patcog.2024.111127","url":null,"abstract":"<div><div>Reconstructing high dynamic range (HDR) imagery from a single low dynamic range (LDR) photograph presents substantial challenges. The challenges are primarily due to the loss of details and information in regions of underexposure or overexposure due to quantization and saturation inherent to camera sensors. Traditional learning-based approaches often struggle with distinguishing overexposed regions within an object from the background, leading to compromised detail retention in these critical areas. Our methodology focuses on meticulously reconstructing structural and textural details to preserve the integrity of the structural information. We propose a new two-stage model architecture for HDR image reconstruction, including a dual-stream network and a feature fusion stage. The dual-stream network is designed to reconstruct structural and textural details, while the feature fusion stage aims to minimize artifacts by utilizing the reconstructed information. We have demonstrated that our proposed method performs better than other state-of-the-art single-image HDR reconstruction algorithms in various quality metrics.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111127"},"PeriodicalIF":7.5,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142652240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}