{"title":"UniParser: Multi-Human Parsing With Unified Correlation Representation Learning","authors":"Jiaming Chu;Lei Jin;Yinglei Teng;Jianshu Li;Yunchao Wei;Zheng Wang;Junliang Xing;Shuicheng Yan;Jian Zhao","doi":"10.1109/TIP.2024.3456004","DOIUrl":"10.1109/TIP.2024.3456004","url":null,"abstract":"Multi-human parsing is an image segmentation task necessitating both instance-level and fine-grained category-level information. However, prior research has typically processed these two types of information through distinct branch types and output formats, leading to inefficient and redundant frameworks. This paper introduces UniParser, which integrates instance-level and category-level representations in three key aspects: 1) we propose a unified correlation representation learning approach, allowing our network to learn instance and category features within the cosine space; 2) we unify the form of outputs of each modules as pixel-level results while supervising instance and category features using a homogeneous label accompanied by an auxiliary loss; and 3) we design a joint optimization procedure to fuse instance and category representations. By unifying instance-level and category-level output, UniParser circumvents manually designed post-processing techniques and surpasses state-of-the-art methods, achieving 49.3% AP on MHPv2.0 and 60.4% AP on CIHP. We have released our source code, pretrained models, and demos to facilitate future studies on \u0000<uri>https://github.com/cjm-sfw/Uniparser</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5159-5171"},"PeriodicalIF":0.0,"publicationDate":"2024-09-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142174677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Target Before Shooting: Accurate Anomaly Detection and Localization Under One Millisecond via Cascade Patch Retrieval","authors":"Hanxi Li;Jianfei Hu;Bo Li;Hao Chen;Yongbin Zheng;Chunhua Shen","doi":"10.1109/TIP.2024.3448263","DOIUrl":"10.1109/TIP.2024.3448263","url":null,"abstract":"In this work, by re-examining the “matching” nature of Anomaly Detection (AD), we propose a novel AD framework that simultaneously enjoys new records of AD accuracy and dramatically high running speed. In this framework, the anomaly detection problem is solved via a cascade patch retrieval procedure that retrieves the nearest neighbors for each test image patch in a coarse-to-fine fashion. Given a test sample, the top-K most similar training images are first selected based on a robust histogram matching process. Secondly, the nearest neighbor of each test patch is retrieved over the similar geometrical locations on those “most similar images”, by using a carefully trained local metric. Finally, the anomaly score of each test image patch is calculated based on the distance to its “nearest neighbor” and the “non-background” probability. The proposed method is termed “Cascade Patch Retrieval” (CPR) in this work. Different from the previous patch-matching-based AD algorithms, CPR selects proper “targets” (reference images and patches) before “shooting” (patch-matching). On the well-acknowledged MVTec AD, BTAD and MVTec-3D AD datasets, the proposed algorithm consistently outperforms all the comparing SOTA methods by remarkable margins, measured by various AD metrics. Furthermore, CPR is extremely efficient. It runs at the speed of 113 FPS with the standard setting while its simplified version only requires less than 1 ms to process an image at the cost of a trivial accuracy drop. The code of CPR is available at \u0000<uri>https://github.com/flyinghu123/CPR</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5606-5621"},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142170998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unified and Real-Time Image Geo-Localization via Fine-Grained Overlap Estimation","authors":"Ze Song;Xudong Kang;Xiaohui Wei;Shutao Li;Haibo Liu","doi":"10.1109/TIP.2024.3453008","DOIUrl":"10.1109/TIP.2024.3453008","url":null,"abstract":"Image geo-localization aims to locate a query image from source platform (e.g., drones, street vehicle) by matching it with Geo-tagged reference images from the target platforms (e.g., different satellites). Achieving cross-modal or cross-view real-time (>30fps) image localization with the guaranteed accuracy in a unified framework remains a challenge due to the huge differences in modalities and views between the two platforms. In order to solve this problem, a novel fine-grained overlap estimation based image geo-localization method is proposed in this paper, the core of which is to estimate the salient and subtle overlapping regions in image pairs to ensure correct matching. Specifically, the high-level semantic features of input images are extracted by a deep convolutional neural network. Then, a novel overlap scanning module (OSM) is presented to mine the long-range spatial and channel dependencies of semantic features in various subspaces, thereby identifying fine-grained overlapping regions. Finally, we adopt the triplet ranking loss to guide the proposed network optimization so that the matching regions are as close as possible and the most mismatched regions are as far away as possible. To demonstrate the effectiveness of our FOENet, comprehensive experiments are conducted on three cross-view benchmarks and one cross-modal benchmark. Our FOENet yields better performance in various metrics and the recall accuracy at top 1 (R@1) is significantly improved, with a maximum improvement of 70.6%. In addition, the proposed model runs fast on a single RTX 6000, reaching real-time inference speed on all datasets, with the fastest being 82.3 FPS.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5060-5072"},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142160430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TMP: Temporal Motion Propagation for Online Video Super-Resolution","authors":"Zhengqiang Zhang;Ruihuang Li;Shi Guo;Yang Cao;Lei Zhang","doi":"10.1109/TIP.2024.3453048","DOIUrl":"10.1109/TIP.2024.3453048","url":null,"abstract":"Online video super-resolution (online-VSR) highly relies on an effective alignment module to aggregate temporal information, while the strict latency requirement makes accurate and efficient alignment very challenging. Though much progress has been achieved, most of the existing online-VSR methods estimate the motion fields of each frame separately to perform alignment, which is computationally redundant and ignores the fact that the motion fields of adjacent frames are correlated. In this work, we propose an efficient Temporal Motion Propagation (TMP) method, which leverages the continuity of motion field to achieve fast pixel-level alignment among consecutive frames. Specifically, we first propagate the offsets from previous frames to the current frame, and then refine them in the neighborhood, significantly reducing the matching space and speeding up the offset estimation process. Furthermore, to enhance the robustness of alignment, we perform spatial-wise weighting on the warped features, where the positions with more precise offsets are assigned higher importance. Experiments on benchmark datasets demonstrate that the proposed TMP method achieves leading online-VSR accuracy as well as inference speed. The source code of TMP can be found at \u0000<uri>https://github.com/xtudbxk/TMP</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5014-5028"},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142160431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling of Multiple Spatial-Temporal Relations for Robust Visual Object Tracking","authors":"Shilei Wang;Zhenhua Wang;Qianqian Sun;Gong Cheng;Jifeng Ning","doi":"10.1109/TIP.2024.3453028","DOIUrl":"10.1109/TIP.2024.3453028","url":null,"abstract":"Recently, one-stream trackers have achieved parallel feature extraction and relation modeling through the exploitation of Transformer-based architectures. This design greatly improves the performance of trackers. However, as one-stream trackers often overlook crucial tracking cues beyond the template, they prone to give unsatisfactory results against complex tracking scenarios. To tackle these challenges, we propose a multi-cue single-stream tracker, dubbed MCTrack here, which seamlessly integrates template information, historical trajectory, historical frame, and the search region for synchronized feature extraction and relation modeling. To achieve this, we employ two types of encoders to convert the template, historical frames, search region, and historical trajectory into tokens, which are then collectively fed into a Transformer architecture. To distill temporal and spatial cues, we introduce a novel adaptive update mechanism, which incorporates a thresholding component and a local multi-peak component to filter out less accurate and overly disturbed tracking cues. Empirically, MCTrack achieves leading performance on mainstream benchmark datasets, surpassing the most advanced SeqTrack by 2.0% in terms of the AO metric on GOT-10k. The code is available at \u0000<uri>https://github.com/wsumel/MCTrack</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5073-5085"},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142160552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning a Non-Locally Regularized Convolutional Sparse Representation for Joint Chromatic and Polarimetric Demosaicking","authors":"Yidong Luo;Junchao Zhang;Jianbo Shao;Jiandong Tian;Jiayi Ma","doi":"10.1109/TIP.2024.3451693","DOIUrl":"10.1109/TIP.2024.3451693","url":null,"abstract":"Division of focal plane color polarization camera becomes the mainstream in polarimetric imaging for it directly captures color polarization mosaic image by one snapshot, so image demosaicking is an essential task. Current color polarization demosaicking (CPDM) methods are prone to unsatisfied results since it’s difficult to recover missed 15 or 14 pixels out of 16 pixels in color polarization mosaic images. To address this problem, a non-locally regularized convolutional sparse regularization model, which is advantaged in denoising and edge maintaining, is proposed to recall more information for CPDM task, and the CPDM task is transformed into an energy function to be solved by ADMM optimization. Finally, the optimal model generates informative and clear results. The experimental results, including reconstructed synthetic and real-world scenes, demonstrate that our proposed method outperforms the current state-of-the-art methods in terms of quantitative measurements and visual quality. The source code is available at \u0000<uri>https://github.com/roydon-luo/NLCSR-CPDM</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5029-5044"},"PeriodicalIF":0.0,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142160433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UVaT: Uncertainty Incorporated View-Aware Transformer for Robust Multi-View Classification","authors":"Yapeng Li;Yong Luo;Bo Du","doi":"10.1109/TIP.2024.3451931","DOIUrl":"10.1109/TIP.2024.3451931","url":null,"abstract":"Existing multi-view classification algorithms usually assume that all examples have observations on all views, and the data in different views are clean. However, in real-world applications, we are often provided with data that have missing representations or contain noise on some views (i.e., missing or noise views). This may lead to significant performance degeneration, and thus many algorithms are proposed to address the incomplete view or noisy view issues. However, most of existing algorithms deal with the two issues separately, and hence may fail when both missing and noisy views exist. They are also usually not flexible in that the view or feature significance cannot be adaptively identified. Besides, the view missing patterns may vary in the training and test phases, and such difference is often ignored. To remedy these drawbacks, we propose a novel multi-view classification framework that is simultaneously robust to both incomplete and noisy views. This is achieved by integrating early fusion and late fusion in a single framework. Specifically, in our early fusion module, we propose a view-aware transformer to mask the missing views and adaptively explore the relationships between views and target tasks to deal with missing views. Considering that view missing patterns may change from the training to the test phase, we also design single-view classification and category-consistency constraints to reduce the dependence of our model on view-missing patterns. In our late fusion module, we quantify the view uncertainty in an ensemble way to estimate the noise level of that view. Then the uncertainty and prediction logits of different views are integrated to make our model robust to noisy views. The framework is trained in an end-to-end manner. Experimental results on diverse datasets demonstrate the robustness and effectiveness of our model for both incomplete and noisy views. Codes are available at \u0000<uri>https://github.com/li-yapeng/UVaT</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5129-5143"},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142142174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blind Video Quality Prediction by Uncovering Human Video Perceptual Representation","authors":"Liang Liao;Kangmin Xu;Haoning Wu;Chaofeng Chen;Wenxiu Sun;Qiong Yan;C.-C. Jay Kuo;Weisi Lin","doi":"10.1109/TIP.2024.3445738","DOIUrl":"10.1109/TIP.2024.3445738","url":null,"abstract":"Blind video quality assessment (VQA) has become an increasingly demanding problem in automatically assessing the quality of ever-growing in-the-wild videos. Although efforts have been made to measure temporal distortions, the core to distinguish between VQA and image quality assessment (IQA), the lack of modeling of how the human visual system (HVS) relates to the temporal quality of videos hinders the precise mapping of predicted temporal scores to the human perception. Inspired by the recent discovery of the temporal straightness law of natural videos in the HVS, this paper intends to model the complex temporal distortions of in-the-wild videos in a simple and uniform representation by describing the geometric properties of videos in the visual perceptual domain. A novel videolet, with perceptual representation embedding of a few consecutive frames, is designed as the basic quality measurement unit to quantify temporal distortions by measuring the angular and linear displacements from the straightness law. By combining the predicted score on each videolet, a perceptually temporal quality evaluator (PTQE) is formed to measure the temporal quality of the entire video. Experimental results demonstrate that the perceptual representation in the HVS is an efficient way of predicting subjective temporal quality. Moreover, when combined with spatial quality metrics, PTQE achieves top performance over popular in-the-wild video datasets. More importantly, PTQE requires no additional information beyond the video being assessed, making it applicable to any dataset without parameter tuning. Additionally, the generalizability of PTQE is evaluated on video frame interpolation tasks, demonstrating its potential to benefit temporal-related enhancement tasks.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"4998-5013"},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142142132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"M2GCNet: Multi-Modal Graph Convolution Network for Precise Brain Tumor Segmentation Across Multiple MRI Sequences","authors":"Tongxue Zhou","doi":"10.1109/TIP.2024.3451936","DOIUrl":"10.1109/TIP.2024.3451936","url":null,"abstract":"Accurate segmentation of brain tumors across multiple MRI sequences is essential for diagnosis, treatment planning, and clinical decision-making. In this paper, I propose a cutting-edge framework, named multi-modal graph convolution network (M2GCNet), to explore the relationships across different MR modalities, and address the challenge of brain tumor segmentation. The core of M2GCNet is the multi-modal graph convolution module (M2GCM), a pivotal component that represents MR modalities as graphs, with nodes corresponding to image pixels and edges capturing latent relationships between pixels. This graph-based representation enables the effective utilization of both local and global contextual information. Notably, M2GCM comprises two important modules: the spatial-wise graph convolution module (SGCM), adept at capturing extensive spatial dependencies among distinct regions within an image, and the channel-wise graph convolution module (CGCM), dedicated to modelling intricate contextual dependencies among different channels within the image. Additionally, acknowledging the intrinsic correlation present among different MR modalities, a multi-modal correlation loss function is introduced. This novel loss function aims to capture specific nonlinear relationships between correlated modality pairs, enhancing the model’s ability to achieve accurate segmentation results. The experimental evaluation on two brain tumor datasets demonstrates the superiority of the proposed M2GCNet over other state-of-the-art segmentation methods. Furthermore, the proposed method paves the way for improved tumor diagnosis, multi-modal information fusion, and a deeper understanding of brain tumor pathology.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"4896-4910"},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142142168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Privacy-Preserving Autoencoder for Collaborative Object Detection","authors":"Bardia Azizian;Ivan V. Bajić","doi":"10.1109/TIP.2024.3451938","DOIUrl":"10.1109/TIP.2024.3451938","url":null,"abstract":"Privacy is a crucial concern in collaborative machine vision where a part of a Deep Neural Network (DNN) model runs on the edge, and the rest is executed on the cloud. In such applications, the machine vision model does not need the exact visual content to perform its task. Taking advantage of this potential, private information could be removed from the data insofar as it does not significantly impair the accuracy of the machine vision system. In this paper, we present an autoencoder-style network integrated within an object detection pipeline, which generates a latent representation of the input image that preserves task-relevant information while removing private information. Our approach employs an adversarial training strategy that not only removes private information from the bottleneck of the autoencoder but also promotes improved compression efficiency for feature channels coded by conventional codecs like VVC-Intra. We assess the proposed system using a realistic evaluation framework for privacy, directly measuring face and license plate recognition accuracy. Experimental results show that our proposed method is able to reduce the bitrate significantly at the same object detection accuracy compared to coding the input images directly, while keeping the face and license plate recognition accuracy on the images recovered from the bottleneck features low, implying strong privacy protection. Our code is available at \u0000<uri>https://github.com/bardia-az/ppa-code</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"4937-4951"},"PeriodicalIF":0.0,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142142169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}