{"title":"KSOF: Leveraging kinematics and spatio-temporal optimal fusion for human motion prediction","authors":"Rui Ding , KeHua Qu , Jin Tang","doi":"10.1016/j.patcog.2024.111206","DOIUrl":"10.1016/j.patcog.2024.111206","url":null,"abstract":"<div><div>Ignoring the meaningful kinematics law, which generates improbable or impractical predictions, is one of the obstacles to human motion prediction. Current methods attempt to tackle this problem by taking simple kinematics information as auxiliary features to improve predictions. However, it remains challenging to utilize human prior knowledge deeply, such as the trajectory formed by the same joint should be smooth and continuous in this task. In this paper, we advocate explicitly describing kinematics information via velocity and acceleration by proposing a novel loss called joint point smoothness (JPS) loss, which calculates the acceleration of joints to smooth the sudden change in joint velocity. In addition, capturing spatio-temporal dependencies to make feature representations more informative is also one of the obstacles in this task. Therefore, we propose a dual-path network (KSOF) that models the temporal and spatial dependencies from kinematic temporal convolutional network (K-TCN) and spatial graph convolutional networks (S-GCN), respectively. Moreover, we propose a novel multi-scale fusion module named spatio-temporal optimal fusion (SOF) to enhance extraction of the essential correlation and important features at different scales from spatio-temporal coupling features. We evaluate our approach on three standard benchmark datasets, including Human3.6M, CMU-Mocap, and 3DPW datasets. For both short-term and long-term predictions, our method achieves outstanding performance on all these datasets. The code is available at <span><span>https://github.com/qukehua/KSOF</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111206"},"PeriodicalIF":7.5,"publicationDate":"2024-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142759374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Camera-aware graph multi-domain adaptive learning for unsupervised person re-identification","authors":"Zhidan Ran, Xiaobo Lu, Xuan Wei, Wei Liu","doi":"10.1016/j.patcog.2024.111217","DOIUrl":"10.1016/j.patcog.2024.111217","url":null,"abstract":"<div><div>Recently, unsupervised person re-identification (Re-ID) has gained much attention due to its important practical significance in real-world application scenarios without pairwise labeled data. A key challenge for unsupervised person Re-ID is learning discriminative and robust feature representations under cross-camera scene variation. Contrastive learning approaches treat unsupervised representation learning as a dictionary look-up task. However, existing methods ignore both intra- and inter-camera semantic associations during training. In this paper, we propose a novel unsupervised person Re-ID framework, Camera-Aware Graph Multi-Domain Adaptive Learning (CGMAL), which can conduct multi-domain feature transfer with semantic propagation for learning discriminative domain-invariant representations. Specifically, we treat each camera as a distinct domain and extract image samples from every camera domain to form a mini-batch. A heterogeneous graph is constructed for representing the relationships between all instances in a mini-batch. Then a Graph Convolutional Network (GCN) is employed to fuse the image samples into a unified space and implement promising semantic transfer for providing ideal feature representations. Subsequently, we construct the memory-based non-parametric contrastive loss to train the model. In particular, we design an adversarial training scheme for transferring the knowledge learned by GCN to the feature extractor. Experimental experiments on three benchmarks validate that our proposed approach is superior to the state-of-the-art unsupervised methods.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111217"},"PeriodicalIF":7.5,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142759285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RSANet: Relative-sequence quality assessment network for gait recognition in the wild","authors":"Guozhen Peng , Yunhong Wang , Shaoxiong Zhang , Rui Li , Yuwei Zhao , Annan Li","doi":"10.1016/j.patcog.2024.111219","DOIUrl":"10.1016/j.patcog.2024.111219","url":null,"abstract":"<div><div>Gait recognition in the wild has received increasing attention since the gait pattern is hard to disguise and can be captured in a long distance. However, due to occlusions and segmentation errors, low-quality silhouettes are common and inevitable. To mitigate this low-quality problem, some prior arts propose absolute-single quality assessment models. Although these methods obtain a good performance, they only focus on the silhouette quality of a single frame, lacking consideration of the variation state of the entire sequence. In this paper, we propose a Relative-Sequence Quality Assessment Network, named RSANet. It uses the Average Feature Similarity Module (AFSM) to evaluate silhouette quality by calculating the similarity between one silhouette and all other silhouettes in the same silhouette sequence. The silhouette quality is based on the sequence, reflecting a relative quality. Furthermore, RSANet uses Multi-Temporal-Receptive-Field Residual Blocks (MTB) to extend temporal receptive fields without parameter increases. It achieves a Rank-1 accuracy of 75.2% on Gait3D, 81.8% on GREW, and 77.6% on BUAA-Duke-Gait datasets respectively. The code is available at <span><span>https://github.com/PGZ-Sleepy/RSANet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111219"},"PeriodicalIF":7.5,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142759375","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised evaluation for out-of-distribution detection","authors":"Yuhang Zhang , Jiani Hu , Dongchao Wen , Weihong Deng","doi":"10.1016/j.patcog.2024.111212","DOIUrl":"10.1016/j.patcog.2024.111212","url":null,"abstract":"<div><div>We need to acquire labels for test sets to evaluate the performance of existing out-of-distribution (OOD) detection methods. In real-world deployment, it is laborious to label each new test set as there are various OOD data with different difficulties. However, we need to use different OOD data to evaluate OOD detection methods as their performance varies widely. Thus, we propose evaluating OOD detection methods on unlabeled test sets, which can free us from labeling each new OOD test set. It is a non-trivial task as we do not know which sample is correctly detected without OOD labels, and the evaluation metric like AUROC cannot be calculated. In this paper, we address this important yet untouched task for the first time. Inspired by the bimodal distribution of OOD detection test sets, we propose an unsupervised indicator named Gscore that has a certain relationship with the OOD detection performance; thus, we could use neural networks to learn that relationship to predict OOD detection performance without OOD labels. Through extensive experiments, we validate that there does exist a strong quantitative correlation, which is almost linear, between Gscore and the OOD detection performance. Additionally, we introduce Gbench, a new benchmark consisting of 200 different real-world OOD datasets, to test the performance of Gscore. Our results show that Gscore achieves state-of-the-art performance compared with other unsupervised evaluation methods and generalizes well with different in-distribution (ID)/OOD datasets, OOD detection methods, backbones, and ID:OOD ratios. Furthermore, we conduct analyses on Gbench to study the effects of backbones and ID/OOD datasets on OOD detection performance. The dataset and code will be available.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"160 ","pages":"Article 111212"},"PeriodicalIF":7.5,"publicationDate":"2024-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142743134","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic decomposition and enhancement hashing for deep cross-modal retrieval","authors":"Lunke Fei , Zhihao He , Wai Keung Wong , Qi Zhu , Shuping Zhao , Jie Wen","doi":"10.1016/j.patcog.2024.111225","DOIUrl":"10.1016/j.patcog.2024.111225","url":null,"abstract":"<div><div>Deep hashing has garnered considerable interest and has shown impressive performance in the domain of retrieval. However, the majority of the current hashing techniques rely solely on binary similarity evaluation criteria to assess the semantic relationships between multi-label instances, which presents a challenge in overcoming the feature gap across various modalities. In this paper, we propose semantic decomposition and enhancement hashing (SDEH) by extensively exploring the multi-label semantic information shared by different modalities for cross-modal retrieval. Specifically, we first introduce two independent attention-based feature learning subnetworks to capture the modality-specific features with both global and local details. Subsequently, we exploit the semantic features from multi-label vectors by decomposing the shared semantic information among multi-modal features such that the associations of different modalities can be established. Finally, we jointly learn the common hash code representations of multimodal information under the guidelines of quadruple losses, making the hash codes informative while simultaneously preserving multilevel semantic relationships and feature distribution consistency. Comprehensive experiments on four commonly used multimodal datasets offer strong support for the exceptional effectiveness of our proposed SDEH.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"160 ","pages":"Article 111225"},"PeriodicalIF":7.5,"publicationDate":"2024-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142743133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UM-CAM: Uncertainty-weighted multi-resolution class activation maps for weakly-supervised segmentation","authors":"Jia Fu , Guotai Wang , Tao Lu , Qiang Yue , Tom Vercauteren , Sébastien Ourselin , Shaoting Zhang","doi":"10.1016/j.patcog.2024.111204","DOIUrl":"10.1016/j.patcog.2024.111204","url":null,"abstract":"<div><div>Weakly-supervised medical image segmentation methods utilizing image-level labels have gained attention for reducing the annotation cost. They typically use Class Activation Maps (CAM) from a classification network but struggle with incomplete activation regions due to low-resolution localization without detailed boundaries. Differently from most of them that only focus on improving the quality of CAMs, we propose a more unified weakly-supervised segmentation framework with image-level supervision. Firstly, an Uncertainty-weighted Multi-resolution Class Activation Map (UM-CAM) is proposed to generate high-quality pixel-level pseudo-labels. Subsequently, a Geodesic distance-based Seed Expansion (GSE) strategy is introduced to rectify ambiguous boundaries in the UM-CAM by leveraging contextual information. To train a final segmentation model from noisy pseudo-labels, we introduce a Random-View Consensus (RVC) training strategy to suppress unreliable pixel/voxels and encourage consistency between random-view predictions. Extensive experiments on 2D fetal brain segmentation and 3D brain tumor segmentation tasks showed that our method significantly outperforms existing weakly-supervised methods. Code is available at: <span><span>https://github.com/HiLab-git/UM-CAM</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"160 ","pages":"Article 111204"},"PeriodicalIF":7.5,"publicationDate":"2024-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142743156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing out-of-distribution detection via diversified multi-prototype contrastive learning","authors":"Yulong Jia , Jiaming Li , Ganlong Zhao , Shuangyin Liu , Weijun Sun , Liang Lin , Guanbin Li","doi":"10.1016/j.patcog.2024.111214","DOIUrl":"10.1016/j.patcog.2024.111214","url":null,"abstract":"<div><div>Detecting out-of-distribution (OOD) inputs is critical for safely deploying deep neural networks in the open world. Recent distance-based contrastive learning methods demonstrated their effectiveness by learning improved feature representations in the embedding space. However, those methods might lead to an incomplete and ambiguous representation of a class, thereby resulting in the loss of intra-class semantic information. In this work, we propose a novel diversified multi-prototype contrastive learning framework, which preserves the semantic knowledge within each class’s embedding space by introducing multiple fine-grained prototypes for each class. This preserves intrinsic features within the in-distribution data, promoting discrimination against OOD samples. We also devise an activation constraints technique to mitigate the impact of extreme activation values on other dimensions and facilitate the computation of distance-based scores. Extensive experiments on several benchmarks show that our proposed method is effective and beneficial for OOD detection, outperforming previous state-of-the-art methods.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111214"},"PeriodicalIF":7.5,"publicationDate":"2024-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142759372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ClickTrack: Towards real-time interactive single object tracking","authors":"Kuiran Wang , Xuehui Yu , Wenwen Yu , Guorong Li , Xiangyuan Lan , Qixiang Ye , Jianbin Jiao , Zhenjun Han","doi":"10.1016/j.patcog.2024.111211","DOIUrl":"10.1016/j.patcog.2024.111211","url":null,"abstract":"<div><div>Single object tracking (SOT) relies on precise object bounding box initialization. In this paper, we reconsidered the deficiencies in the current approaches to initializing single object trackers and propose a new paradigm for single object tracking algorithms, ClickTrack, a new paradigm using clicking interaction for real-time scenarios. Moreover, click as an input type inherently lack hierarchical information. To address ambiguity in certain special scenarios, we designed the Guided Click Refiner (GCR), which accepts point and optional textual information as inputs, transforming the point into the bounding box expected by the operator. The bounding box will be used as input of single object trackers. Experiments on LaSOT and GOT-10k benchmarks show that tracker combined with GCR achieves stable performance in real-time interactive scenarios. Furthermore, we explored the integration of GCR into the Segment Anything model (SAM), significantly reducing ambiguity issues when SAM receives point inputs.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111211"},"PeriodicalIF":7.5,"publicationDate":"2024-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142759286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SEMACOL: Semantic-enhanced multi-scale approach for text-guided grayscale image colorization","authors":"Chaochao Niu, Ming Tao, Bing-Kun Bao","doi":"10.1016/j.patcog.2024.111203","DOIUrl":"10.1016/j.patcog.2024.111203","url":null,"abstract":"<div><div>High-quality colorization of grayscale images using text descriptions presents a significant challenge, especially in accurately coloring small objects. The existing methods have two major flaws. First, text descriptions typically omit size information of objects, resulting in text features that often lack semantic information reflecting object sizes. Second, these methods identify coloring areas by relying solely on low-resolution visual features from the Unet encoder and fail to leverage the fine-grained information provided by high-resolution visual features effectively. To address these issues, we introduce the Semantic-Enhanced Multi-scale Approach for Text-Guided Grayscale Image Colorization (SEMACOL). We first introduce a Cross-Modal Text Augmentation module that incorporates grayscale images into text features, which enables accurate perception of object sizes in text descriptions. Subsequently, we propose a Multi-scale Content Location module, which utilizes multi-scale features to precisely identify coloring areas within grayscale images. Meanwhile, we incorporate a Text-Influenced Colorization Adjustment module to effectively adjust colorization based on text descriptions. Finally, we implement a Dynamic Feature Fusion Strategy, which dynamically refines outputs from both the Multi-scale Content Location and Text-Influenced Colorization Adjustment modules, ensuring a coherent colorization process. SEMACOL demonstrates remarkable performance improvements over existing state-of-the-art methods on public datasets. Specifically, SEMACOL achieves a PSNR of 25.695, SSIM of 0.92240, LPIPS of 0.156, and FID of 17.54, surpassing the previous best results (PSNR: 25.511, SSIM: 0.92104, LPIPS: 0.157, FID: 26.93). The code will be available at <span><span>https://github.com/ChchNiu/SEMACOL</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"160 ","pages":"Article 111203"},"PeriodicalIF":7.5,"publicationDate":"2024-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142743132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Diffusion-based framework for weakly-supervised temporal action localization","authors":"Yuanbing Zou , Qingjie Zhao , Prodip Kumar Sarker , Shanshan Li , Lei Wang , Wangwang Liu","doi":"10.1016/j.patcog.2024.111207","DOIUrl":"10.1016/j.patcog.2024.111207","url":null,"abstract":"<div><div>Weakly supervised temporal action localization aims to localize action instances with only video-level supervision. Due to the absence of frame-level annotation supervision, how effectively separate action snippets and backgrounds from semantically ambiguous features becomes an arduous challenge for this task. To address this issue from a generative modeling perspective, we propose a novel diffusion-based network with two stages. Firstly, we design a local masking mechanism module to learn the local semantic information and generate binary masks at the early stage, which (1) are used to perform action-background separation and (2) serve as pseudo-ground truth required by the diffusion module. Then, we propose a diffusion module to generate high-quality action predictions under the pseudo-ground truth supervision in the second stage. In addition, we further optimize the new-refining operation in the local masking module to improve the operation efficiency. The experimental results demonstrate that the proposed method achieves a promising performance on the publicly available mainstream datasets THUMOS14 and ActivityNet. The code is available at <span><span>https://github.com/Rlab123/action_diff</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"160 ","pages":"Article 111207"},"PeriodicalIF":7.5,"publicationDate":"2024-11-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142743131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}