{"title":"Learning Cross-Attention Point Transformer With Global Porous Sampling","authors":"Yueqi Duan;Haowen Sun;Juncheng Yan;Jiwen Lu;Jie Zhou","doi":"10.1109/TIP.2024.3486612","DOIUrl":"10.1109/TIP.2024.3486612","url":null,"abstract":"In this paper, we propose a point-based cross-attention transformer named CrossPoints with parametric Global Porous Sampling (GPS) strategy. The attention module is crucial to capture the correlations between different tokens for transformers. Most existing point-based transformers design multi-scale self-attention operations with down-sampled point clouds by the widely-used Farthest Point Sampling (FPS) strategy. However, FPS only generates sub-clouds with holistic structures, which fails to fully exploit the flexibility of points to generate diversified tokens for the attention module. To address this, we design a cross-attention module with parametric GPS and Complementary GPS (C-GPS) strategies to generate series of diversified tokens through controllable parameters. We show that FPS is a degenerated case of GPS, and the network learns more abundant relational information of the structure and geometry when we perform consecutive cross-attention over the tokens generated by GPS as well as C-GPS sampled points. More specifically, we set evenly-sampled points as queries and design our cross-attention layers with GPS and C-GPS sampled points as keys and values. In order to further improve the diversity of tokens, we design a deformable operation over points to adaptively adjust the points according to the input. Extensive experimental results on both shape classification and indoor scene segmentation tasks indicate promising boosts over the recent point cloud transformers. We also conduct ablation studies to show the effectiveness of our proposed cross-attention module with GPS strategy.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6283-6297"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142562960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Salient Object Detection From Arbitrary Modalities","authors":"Nianchang Huang;Yang Yang;Ruida Xi;Qiang Zhang;Jungong Han;Jin Huang","doi":"10.1109/TIP.2024.3486225","DOIUrl":"10.1109/TIP.2024.3486225","url":null,"abstract":"Toward desirable saliency prediction, the types and numbers of inputs for a salient object detection (SOD) algorithm may dynamically change in many real-life applications. However, existing SOD algorithms are mainly designed or trained for one particular type of inputs, failing to be generalized to other types of inputs. Consequentially, more types of SOD algorithms need to be prepared in advance for handling different types of inputs, raising huge hardware and research costs. Differently, in this paper, we propose a new type of SOD task, termed Arbitrary Modality SOD (AM SOD). The most prominent characteristics of AM SOD are that the modality types and modality numbers will be arbitrary or dynamically changed. The former means that the inputs to the AM SOD algorithm may be arbitrary modalities such as RGB, depths, or even any combination of them. While, the latter indicates that the inputs may have arbitrary modality numbers as the input type is changed, e.g. single-modality RGB image, dual-modality RGB-Depth (RGB-D) images or triple-modality RGB-Depth-Thermal (RGB-D-T) images. Accordingly, a preliminary solution to the above challenges, i.e. a modality switch network (MSN), is proposed in this paper. In particular, a modality switch feature extractor (MSFE) is first designed to extract discriminative features from each modality effectively by introducing some modality indicators, which will generate some weights for modality switching. Subsequently, a dynamic fusion module (DFM) is proposed to adaptively fuse features from a variable number of modalities based on a novel Transformer structure. Finally, a new dataset, named AM-XD, is constructed to facilitate research on AM SOD. Extensive experiments demonstrate that our AM SOD method can effectively cope with changes in the type and number of input modalities for robust salient object detection. Our code and AM-XD dataset will be released on \u0000<uri>https://github.com/nexiakele/AMSODFirst</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6268-6282"},"PeriodicalIF":0.0,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142562961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GSSF: Generalized Structural Sparse Function for Deep Cross-Modal Metric Learning","authors":"Haiwen Diao;Ying Zhang;Shang Gao;Jiawen Zhu;Long Chen;Huchuan Lu","doi":"10.1109/TIP.2024.3485498","DOIUrl":"10.1109/TIP.2024.3485498","url":null,"abstract":"Cross-modal metric learning is a prominent research topic that bridges the semantic heterogeneity between vision and language. Existing methods frequently utilize simple cosine or complex distance metrics to transform the pairwise features into a similarity score, which suffers from an inadequate or inefficient capability for distance measurements. Consequently, we propose a Generalized Structural Sparse Function to dynamically capture thorough and powerful relationships across modalities for pair-wise similarity learning while remaining concise but efficient. Specifically, the distance metric delicately encapsulates two formats of diagonal and block-diagonal terms, automatically distinguishing and highlighting the cross-channel relevancy and dependency inside a structured and organized topology. Hence, it thereby empowers itself to adapt to the optimal matching patterns between the paired features and reaches a sweet spot between model complexity and capability. Extensive experiments on cross-modal and two extra uni-modal retrieval tasks (image-text retrieval, person re-identification, fine-grained image retrieval) have validated its superiority and flexibility over various popular retrieval frameworks. More importantly, we further discover that it can be seamlessly incorporated into multiple application scenarios, and demonstrates promising prospects from Attention Mechanism to Knowledge Distillation in a plug-and-play manner.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6241-6252"},"PeriodicalIF":0.0,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142559850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AnlightenDiff: Anchoring Diffusion Probabilistic Model on Low Light Image Enhancement","authors":"Cheuk-Yiu Chan;Wan-Chi Siu;Yuk-Hee Chan;H. Anthony Chan","doi":"10.1109/TIP.2024.3486610","DOIUrl":"10.1109/TIP.2024.3486610","url":null,"abstract":"Low-light image enhancement aims to improve the visual quality of images captured under poor illumination. However, enhancing low-light images often introduces image artifacts, color bias, and low SNR. In this work, we propose AnlightenDiff, an anchoring diffusion model for low light image enhancement. Diffusion models can enhance the low light image to well-exposed image by iterative refinement, but require anchoring to ensure that enhanced results remain faithful to the input. We propose a Dynamical Regulated Diffusion Anchoring mechanism and Sampler to anchor the enhancement process. We also propose a Diffusion Feature Perceptual Loss tailored for diffusion based model to utilize different loss functions in image domain. AnlightenDiff demonstrates the effect of diffusion models for low-light enhancement and achieving high perceptual quality results. Our techniques show a promising future direction for applying diffusion models to image enhancement.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6324-6339"},"PeriodicalIF":0.0,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10740586","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142560350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection","authors":"Yifan Xu;Mengdan Zhang;Xiaoshan Yang;Changsheng Xu","doi":"10.1109/TIP.2024.3485518","DOIUrl":"10.1109/TIP.2024.3485518","url":null,"abstract":"We explore multi-modal contextual knowledge learned through multi-modal masked language modeling to provide explicit localization guidance for novel classes in open-vocabulary object detection (OVD). Intuitively, a well-modeled and correctly predicted masked concept word should effectively capture the textual contexts, visual contexts, and the cross-modal correspondence between texts and regions, thereby automatically activating high attention on corresponding regions. In light of this, we propose a multi-modal contextual knowledge distillation framework, MMC-Det, to explicitly supervise a student detector with the context-aware attention of the masked concept words in a teacher fusion transformer. The teacher fusion transformer is trained with our newly proposed diverse multi-modal masked language modeling (D-MLM) strategy, which significantly enhances the fine-grained region-level visual context modeling in the fusion transformer. The proposed distillation process provides additional contextual guidance to the concept-region matching of the detector, thereby further improving the OVD performance. Extensive experiments performed upon various detection datasets show the effectiveness of our multi-modal context learning strategy.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6253-6267"},"PeriodicalIF":0.0,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142541295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rethinking Noise Sampling in Class-Imbalanced Diffusion Models","authors":"Chenghao Xu;Jiexi Yan;Muli Yang;Cheng Deng","doi":"10.1109/TIP.2024.3485484","DOIUrl":"10.1109/TIP.2024.3485484","url":null,"abstract":"In the practical application of image generation, dealing with long-tailed data distributions is a common challenge for diffusion-based generative models. To tackle this issue, we investigate the head-class accumulation effect in diffusion models’ latent space, particularly focusing on its correlation to the noise sampling strategy. Our experimental analysis indicates that employing a consistent sampling distribution for the noise prior across all classes leads to a significant bias towards head classes in the noise sampling distribution, which results in poor quality and diversity of the generated images. Motivated by this observation, we propose a novel sampling strategy named Bias-aware Prior Adjusting (BPA) to debias diffusion models in the class-imbalanced scenario. With BPA, each class is automatically assigned an adaptive noise sampling distribution prior during training, effectively mitigating the influence of class imbalance on the generation process. Extensive experiments on several benchmarks demonstrate that images generated using our proposed BPA showcase elevated diversity and superior quality.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6298-6308"},"PeriodicalIF":0.0,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142541349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"λ-Domain Rate Control via Wavelet-Based Residual Neural Network for VVC HDR Intra Coding","authors":"Feng Yuan;Jianjun Lei;Zhaoqing Pan;Bo Peng;Haoran Xie","doi":"10.1109/TIP.2024.3484173","DOIUrl":"10.1109/TIP.2024.3484173","url":null,"abstract":"High dynamic range (HDR) video offers a more realistic visual experience than standard dynamic range (SDR) video, while introducing new challenges to both compression and transmission. Rate control is an effective technology to overcome these challenges, and ensure optimal HDR video delivery. However, the rate control algorithm in the latest video coding standard, versatile video coding (VVC), is tailored to SDR videos, and does not produce well coding results when encoding HDR videos. To address this problem, a data-driven \u0000<inline-formula> <tex-math>$lambda $ </tex-math></inline-formula>\u0000-domain rate control algorithm is proposed for VVC HDR intra frames in this paper. First, the coding characteristics of HDR intra coding are analyzed, and a piecewise R-\u0000<inline-formula> <tex-math>$lambda $ </tex-math></inline-formula>\u0000 model is proposed to accurately determine the correlation between the rate (R) and the Lagrange parameter \u0000<inline-formula> <tex-math>$lambda $ </tex-math></inline-formula>\u0000 for HDR intra frames. Then, to optimize bit allocation at the coding tree unit (CTU)-level, a wavelet-based residual neural network (WRNN) is developed to accurately predict the parameters of the piecewise R-\u0000<inline-formula> <tex-math>$lambda $ </tex-math></inline-formula>\u0000 model for each CTU. Third, a large-scale HDR dataset is established for training WRNN, which facilitates the applications of deep learning in HDR intra coding. Extensive experimental results show that our proposed HDR intra frame rate control algorithm achieves superior coding results than the state-of-the-art algorithms. The source code of this work will be released at \u0000<uri>https://github.com/TJU-Videocoding/WRNN.git</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6189-6203"},"PeriodicalIF":0.0,"publicationDate":"2024-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142490613","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-Based Domain Adaptation Without Intermediate Domain Dataset for Foggy Scene Segmentation","authors":"Donggon Jang;Sunhyeok Lee;Gyuwon Choi;Yejin Lee;Sanghyeok Son;Dae-Shik Kim","doi":"10.1109/TIP.2024.3483566","DOIUrl":"10.1109/TIP.2024.3483566","url":null,"abstract":"Robust segmentation performance under dense fog is crucial for autonomous driving, but collecting labeled real foggy scene datasets is burdensome in the real world. To this end, existing methods have adapted models trained on labeled clear weather images to the unlabeled real foggy domain. However, these approaches require intermediate domain datasets (e.g. synthetic fog) and involve multi-stage training, making them cumbersome and less practical for real-world applications. In addition, the issue of overconfident pseudo-labels by a confidence score remains less explored in self-training for foggy scene adaptation. To resolve these issues, we propose a new framework, named DAEN, which Directly Adapts without additional datasets or multi-stage training and leverages an ENergy score in self-training. Notably, we integrate a High-order Style Matching (HSM) module into the network to match high-order statistics between clear weather features and real foggy features. HSM enables the network to implicitly learn complex fog distributions without relying on intermediate domain datasets or multi-stage training. Furthermore, we introduce Energy Score-based Pseudo-Labeling (ESPL) to mitigate the overconfidence issue of the confidence score in self-training. ESPL generates more reliable pseudo-labels through a pixel-wise energy score, thereby alleviating bias and preventing the model from assigning pseudo-labels exclusively to head classes. Extensive experiments demonstrate that DAEN achieves state-of-the-art performance on three real foggy scene datasets and exhibits a generalization ability to other adverse weather conditions. Code is available at \u0000<uri>https://github.com/jdg900/daen</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6143-6157"},"PeriodicalIF":0.0,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142489425","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MA-ST3D: Motion Associated Self-Training for Unsupervised Domain Adaptation on 3D Object Detection","authors":"Chi Zhang;Wenbo Chen;Wei Wang;Zhaoxiang Zhang","doi":"10.1109/TIP.2024.3482976","DOIUrl":"10.1109/TIP.2024.3482976","url":null,"abstract":"Recently, unsupervised domain adaptation (UDA) for 3D object detectors has increasingly garnered attention as a method to eliminate the prohibitive costs associated with generating extensive 3D annotations, which are crucial for effective model training. Self-training (ST) has emerged as a simple and effective technique for UDA. The major issue involved in ST-UDA for 3D object detection is refining the imprecise predictions caused by domain shift and generating accurate pseudo labels as supervisory signals. This study presents a novel ST-UDA framework to generate high-quality pseudo labels by associating predictions of 3D point cloud sequences during ego-motion according to spatial and temporal consistency, named motion-associated self-training for 3D object detection (MA-ST3D). MA-ST3D maintains a global-local pathway (GLP) architecture to generate high-quality pseudo-labels by leveraging both intra-frame and inter-frame consistencies along the spatial dimension of the LiDAR’s ego-motion. It also equips two memory modules for both global and local pathways, called global memory and local memory, to suppress the temporal fluctuation of pseudo-labels during self-training iterations. In addition, a motion-aware loss is introduced to impose discriminated regulations on pseudo labels with different motion statuses, which mitigates the harmful spread of false positive pseudo labels. Finally, our method is evaluated on three representative domain adaptation tasks on authoritative 3D benchmark datasets (i.e. Waymo, Kitti, and nuScenes). MA-ST3D achieved SOTA performance on all evaluated UDA settings and even surpassed the weakly supervised DA methods on the Kitti and NuScenes object detection benchmark.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6227-6240"},"PeriodicalIF":0.0,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142489490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Deblurring Videos Using Spatial-Temporal Contextual Transformer With Feature Propagation","authors":"Liyan Zhang;Boming Xu;Zhongbao Yang;Jinshan Pan","doi":"10.1109/TIP.2024.3482176","DOIUrl":"10.1109/TIP.2024.3482176","url":null,"abstract":"We present a simple and effective approach to explore both local spatial-temporal contexts and non-local temporal information for video deblurring. First, we develop an effective spatial-temporal contextual transformer to explore local spatial-temporal contexts from videos. As the features extracted by the spatial-temporal contextual transformer does not model the non-local temporal information of video well, we then develop a feature propagation method to aggregate useful features from the long-range frames so that both local spatial-temporal contexts and non-local temporal information can be better utilized for video deblurring. Finally, we formulate the spatial-temporal contextual transformer with the feature propagation into a unified deep convolutional neural network (CNN) and train it in an end-to-end manner. We show that using the spatial-temporal contextual transformer with the feature propagation is able to generate useful features and makes the deep CNN model more compact and effective for video deblurring. Extensive experimental results show that the proposed method performs favorably against state-of-the-art ones on the benchmark datasets in terms of accuracy and model parameters.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6354-6366"},"PeriodicalIF":0.0,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142489748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}