{"title":"HomuGAN: A 3D-Aware GAN With the Method of Cylindrical Spatial-Constrained Sampling","authors":"Haochen Yu;Weixi Gong;Jiansheng Chen;Huimin Ma","doi":"10.1109/TIP.2024.3520423","DOIUrl":"10.1109/TIP.2024.3520423","url":null,"abstract":"Controllable 3D-aware scene synthesis seeks to disentangle the various latent codes in the implicit space enabling the generation network to create highly realistic images with 3D consistency. Recent approaches often integrate Neural Radiance Fields with the upsampling method of StyleGAN2, employing Convolutions with style modulation to transform spatial coordinates into frequency domain representations. Our analysis indicates that this approach can give rise to a bubble phenomenon in StyleNeRF. We argue that the style modulation introduces extraneous information into the implicit space, disrupting 3D implicit modeling and degrading image quality. We introduce HomuGAN, incorporating two key improvements. First, we disentangle the style modulation applied to implicit modeling from that utilized for super-resolution, thus alleviating the bubble phenomenon. Second, we introduce Cylindrical Spatial-Constrained Sampling and Parabolic Sampling. The latter sampling method, as an alternative method to the former, specifically contributes to the performance of foreground modeling of vehicles. We evaluate HomuGAN on publicly available datasets, comparing its performance to existing methods. Empirical results demonstrate that our model achieves the best performance, exhibiting relatively outstanding disentanglement capability. Moreover, HomuGAN addresses the training instability problem observed in StyleNeRF and reduces the bubble phenomenon.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"320-334"},"PeriodicalIF":0.0,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888342","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wenhao Li;Mengyuan Liu;Hong Liu;Bin Ren;Xia Li;Yingxuan You;Nicu Sebe
{"title":"HYRE: Hybrid Regressor for 3D Human Pose and Shape Estimation","authors":"Wenhao Li;Mengyuan Liu;Hong Liu;Bin Ren;Xia Li;Yingxuan You;Nicu Sebe","doi":"10.1109/TIP.2024.3515872","DOIUrl":"10.1109/TIP.2024.3515872","url":null,"abstract":"Regression-based 3D human pose and shape estimation often fall into one of two different paradigms. Parametric approaches, which regress the parameters of a human body model, tend to produce physically plausible but image-mesh misalignment results. In contrast, non-parametric approaches directly regress human mesh vertices, resulting in pixel-aligned but unreasonable predictions. In this paper, we consider these two paradigms together for a better overall estimation. To this end, we propose a novel HYbrid REgressor (HYRE) that greatly benefits from the joint learning of both paradigms. The core of our HYRE is a hybrid intermediary across paradigms that provides complementary clues to each paradigm at the shared feature level and fuses their results at the part-based decision level, thereby bridging the gap between the two. We demonstrate the effectiveness of the proposed method through both quantitative and qualitative experimental analyses, resulting in improvements for each approach and ultimately leading to better hybrid results. Our experiments show that HYRE outperforms previous methods on challenging 3D human pose and shape benchmarks.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"235-246"},"PeriodicalIF":0.0,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142888260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chao Xie;Linfeng Fei;Huanjie Tao;Yaocong Hu;Wei Zhou;Jiun Tian Hoe;Weipeng Hu;Yap-Peng Tan
{"title":"Residual Quotient Learning for Zero-Reference Low-Light Image Enhancement","authors":"Chao Xie;Linfeng Fei;Huanjie Tao;Yaocong Hu;Wei Zhou;Jiun Tian Hoe;Weipeng Hu;Yap-Peng Tan","doi":"10.1109/TIP.2024.3519997","DOIUrl":"10.1109/TIP.2024.3519997","url":null,"abstract":"Recently, neural networks have become the dominant approach to low-light image enhancement (LLIE), with at least one-third of them adopting a Retinex-related architecture. However, through in-depth analysis, we contend that this most widely accepted LLIE structure is suboptimal, particularly when addressing the non-uniform illumination commonly observed in natural images. In this paper, we present a novel variant learning framework, termed residual quotient learning, to substantially alleviate this issue. Instead of following the existing Retinex-related decomposition-enhancement-reconstruction process, our basic idea is to explicitly reformulate the light enhancement task as adaptively predicting the latent quotient with reference to the original low-light input using a residual learning fashion. By leveraging the proposed residual quotient learning, we develop a lightweight yet effective network called ResQ-Net. This network features enhanced non-uniform illumination modeling capabilities, making it more suitable for real-world LLIE tasks. Moreover, due to its well-designed structure and reference-free loss function, ResQ-Net is flexible in training as it allows for zero-reference optimization, which further enhances the generalization and adaptability of our entire framework. Extensive experiments on various benchmark datasets demonstrate the merits and effectiveness of the proposed residual quotient learning, and our trained ResQ-Net outperforms state-of-the-art methods both qualitatively and quantitatively. Furthermore, a practical application in dark face detection is explored, and the preliminary results confirm the potential and feasibility of our method in real-world scenarios.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"365-378"},"PeriodicalIF":0.0,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142884230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rebalanced Vision-Language Retrieval Considering Structure-Aware Distillation","authors":"Yang Yang;Wenjuan Xi;Luping Zhou;Jinhui Tang","doi":"10.1109/TIP.2024.3518759","DOIUrl":"10.1109/TIP.2024.3518759","url":null,"abstract":"Vision-language retrieval aims to search for similar instances in one modality based on queries from another modality. The primary objective is to learn cross-modal matching representations in a latent common space. Actually, the assumption underlying cross-modal matching is modal balance, where each modality contains sufficient information to represent the others. However, noise interference and modality insufficiency often lead to modal imbalance, making it a common phenomenon in practice. The impact of imbalance on retrieval performance remains an open question. In this paper, we first demonstrate that ultimate cross-modal matching is generally sub-optimal for cross-modal retrieval when imbalanced modalities exist. The structure of instances in the common space is inherently influenced when facing imbalanced modalities, posing a challenge to cross-modal similarity measurement. To address this issue, we emphasize the importance of meaningful structure-preserved matching. Accordingly, we propose a simple yet effective method to rebalance cross-modal matching by learning structure-preserved matching representations. Specifically, we design a novel multi-granularity cross-modal matching that incorporates structure-aware distillation alongside the cross-modal matching loss. While the cross-modal matching loss constraints instance-level matching, the structure-aware distillation further regularizes the geometric consistency between learned matching representations and intra-modal representations through the developed relational matching. Extensive experiments on different datasets affirm the superior cross-modal retrieval performance of our approach, simultaneously enhancing single-modal retrieval capabilities compared to the baseline models.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6881-6892"},"PeriodicalIF":0.0,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142879658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiarui Zhang;Ruixu Geng;Xiaolong Du;Yan Chen;Houqiang Li;Yang Hu
{"title":"Passive Non-Line-of-Sight Imaging With Light Transport Modulation","authors":"Jiarui Zhang;Ruixu Geng;Xiaolong Du;Yan Chen;Houqiang Li;Yang Hu","doi":"10.1109/TIP.2024.3518097","DOIUrl":"10.1109/TIP.2024.3518097","url":null,"abstract":"Passive non-line-of-sight (NLOS) imaging has witnessed rapid development in recent years, due to its ability to image objects that are out of sight. The light transport condition plays an important role in this task since changing the conditions will lead to different imaging models. Existing learning-based NLOS methods usually train independent models for different light transport conditions, which is computationally inefficient and impairs the practicality of the models. In this work, we propose NLOS-LTM, a novel passive NLOS imaging method that effectively handles multiple light transport conditions with a single network. We achieve this by inferring a latent light transport representation from the projection image and using this representation to modulate the network that reconstructs the hidden image from the projection image. We train a light transport encoder together with a vector quantizer to obtain the light transport representation. To further regulate this representation, we jointly learn both the reconstruction network and the reprojection network during training. A set of light transport modulation blocks is used to modulate the two jointly trained networks in a multi-scale way. Extensive experiments on a large-scale passive NLOS dataset demonstrate the superiority of the proposed method. The code is available at <uri>https://github.com/JerryOctopus/NLOS-LTM</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"410-424"},"PeriodicalIF":0.0,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142879933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhao Wang;Bolin Chen;Shurun Wang;Shiqi Wang;Yan Ye;Siwei Ma
{"title":"Ultra-Low Bitrate Face Video Compression Based on Conversions From 3D Keypoints to 2D Motion Map","authors":"Zhao Wang;Bolin Chen;Shurun Wang;Shiqi Wang;Yan Ye;Siwei Ma","doi":"10.1109/TIP.2024.3518100","DOIUrl":"10.1109/TIP.2024.3518100","url":null,"abstract":"How to compress face video is a crucial problem for a series of online applications, such as video chat/conference, live broadcasting and remote education. Compared to other natural videos, these face-centric videos owning abundant structural information can be compactly represented and high-quality reconstructed via deep generative models, such that the promising compression performance can be achieved. However, the existing generative face video compression schemes are faced with the inconsistency between the 3D facial motion in the physical world and the face content evolution in the 2D view. To solve this drawback, we propose a 3D-Keypoint-and-2D-Motion based generative method for Face Video Compression, namely FVC-3K2M, which can well ensure perceptual compensation and visual consistency between motion description and face reconstruction. In particular, the temporal evolution of face video can be characterized into separate 3D keypoints from the global and local perspectives, entailing great coding flexibility and accurate motion representation. Moreover, a cascade motion conversion mechanism is further proposed to internally convert 3D keypoints to 2D dense motion, enforcing the face video reconstruction to be perceptually realistic. Finally, an adaptive reference frame selection scheme is developed to enhance the adaptation of various temporal movements. Experimental results show that the proposed scheme can realize reliable video communication in the extremely limited bandwidth, e.g., 2 kbps. Compared to the state-of-the-art video coding standards and the latest face video compression methods, extensive comparisons demonstrate that our proposed scheme achieves superior compression performance in terms of multiple quality evaluations.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6850-6864"},"PeriodicalIF":0.0,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142867126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Dual-Exposure Quad-Bayer Patterns for Joint Denoising and Deblurring","authors":"Yuzhi Zhao;Lai-Man Po;Xin Ye;Yongzhe Xu;Qiong Yan","doi":"10.1109/TIP.2024.3515873","DOIUrl":"10.1109/TIP.2024.3515873","url":null,"abstract":"Image degradation caused by noise and blur remains a persistent challenge in imaging systems, stemming from limitations in both hardware and methodology. Single-image solutions face an inherent tradeoff between noise reduction and motion blur. While short exposures can capture clear motion, they suffer from noise amplification. Long exposures reduce noise but introduce blur. Learning-based single-image enhancers tend to be over-smooth due to the limited information. Multi-image solutions using burst mode avoid this tradeoff by capturing more spatial-temporal information but often struggle with misalignment from camera/scene motion. To address these limitations, we propose a physical-model-based image restoration approach leveraging a novel dual-exposure Quad-Bayer pattern sensor. By capturing pairs of short and long exposures at the same starting point but with varying durations, this method integrates complementary noise-blur information within a single image. We further introduce a Quad-Bayer synthesis method (B2QB) to simulate sensor data from Bayer patterns to facilitate training. Based on this dual-exposure sensor model, we design a hierarchical convolutional neural network called QRNet to recover high-quality RGB images. The network incorporates input enhancement blocks and multi-level feature extraction to improve restoration quality. Experiments demonstrate superior performance over state-of-the-art deblurring and denoising methods on both synthetic and real-world datasets. The code, model, and datasets are publicly available at <uri>https://github.com/zhaoyuzhi/QRNet</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"350-364"},"PeriodicalIF":0.0,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142867124","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zeya Song;Liqi Xue;Jun Xu;Baoping Zhang;Chao Jin;Jian Yang;Changliang Zou
{"title":"Real-World Low-Dose CT Image Denoising by Patch Similarity Purification","authors":"Zeya Song;Liqi Xue;Jun Xu;Baoping Zhang;Chao Jin;Jian Yang;Changliang Zou","doi":"10.1109/TIP.2024.3515878","DOIUrl":"10.1109/TIP.2024.3515878","url":null,"abstract":"Reducing the radiation dose in CT scanning is important to alleviate the damage to the human health in clinical scenes. A promising way is to replace the normal-dose CT (NDCT) imaging by low-dose CT (LDCT) imaging with lower tube voltage and tube current. This often brings severe noise to the LDCT images, which adversely affects the diagnosis accuracy. Most of existing LDCT image denoising networks are trained either with synthetic LDCT images or real-world LDCT and NDCT image pairs with huge spatial misalignment. However, the synthetic noise is very different from the complex noise in real-world LDCT images, while the huge spatial misalignment brings inaccurate predictions of tissue structures in the denoised LDCT images. To well utilize real-world LDCT and NDCT image pairs for LDCT image denoising, in this paper, we introduce a new Patch Similarity Purification (PSP) strategy to construct high-quality training dataset for network training. Specifically, our PSP strategy first perform binarization for each pair of image patches cropped from the corresponding LDCT and NDCT image pairs. For each pair of binary masks, it then computes their similarity ratio by common mask calculation, and the patch pair can be selected as a training sample if their mask similarity ratio is higher than a threshold. By using our PSP strategy, each training set of our Rabbit and Patient datasets contain hundreds of thousands of real-world LDCT and NDCT image patch pairs with negligible misalignment. Extensive experiments demonstrate the usefulness of our PSP strategy on purifying the training data and the effectiveness of training LDCT image denoising networks on our datasets. The code and dataset are provided at <uri>https://github.com/TuTusong/PSP</uri>.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"196-208"},"PeriodicalIF":0.0,"publicationDate":"2024-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142840915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Key-Axis-Based Localization of Symmetry Axes in 3D Objects Utilizing Geometry and Texture","authors":"Yulin Wang;Chen Luo","doi":"10.1109/TIP.2024.3515801","DOIUrl":"10.1109/TIP.2024.3515801","url":null,"abstract":"In pose estimation for objects with rotational symmetry, ambiguous poses may arise, and the symmetry axes of objects are crucial for eliminating such ambiguities. Currently, in pose estimation, reliance on manual settings of symmetry axes decreases the accuracy of pose estimation. To address this issue, this method proposes determining the orders of symmetry axes and angles between axes based on a given rotational symmetry type or polyhedron, reducing the need for manual settings of symmetry axes. Subsequently, two key axes with the highest orders are defined and localized, then three orthogonal axes are generated based on key axes, while each symmetry axis can be computed utilizing orthogonal axes. Compared to localizing symmetry axes one by one, the key-axis-based symmetry axis localization is more efficient. To support geometric and texture symmetry, the method utilizes the ADI metric for key axis localization in geometrically symmetric objects and proposes a novel metric, ADI-C, for objects with texture symmetry. Experimental results on the LM-O and HB datasets demonstrate a 9.80% reduction in symmetry axis localization error and a 1.64% improvement in pose estimation accuracy. Additionally, the method introduces a new dataset, DSRSTO, to illustrate its performance across seven types of geometrically and texturally symmetric objects. The GitHub link for the open-source tool based on this method is \u0000<uri>https://github.com/WangYuLin-SEU/KASAL</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6720-6733"},"PeriodicalIF":0.0,"publicationDate":"2024-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142840914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Unlabeled Videos for Video-Text Retrieval via Pseudo-Supervised Learning","authors":"Yu Lu;Ruijie Quan;Linchao Zhu;Yi Yang","doi":"10.1109/TIP.2024.3514352","DOIUrl":"10.1109/TIP.2024.3514352","url":null,"abstract":"Large-scale pre-trained vision-language models (e.g., CLIP) have shown incredible generalization performance in downstream tasks such as video-text retrieval (VTR). Traditional approaches have leveraged CLIP’s robust multi-modal alignment ability for VTR by directly fine-tuning vision and text encoders with clean video-text data. Yet, these techniques rely on carefully annotated video-text pairs, which are expensive and require significant manual effort. In this context, we introduce a new approach, Pseudo-Supervised Selective Contrastive Learning (PS-SCL). PS-SCL minimizes the dependency on manually-labeled text annotations by generating pseudo-supervisions from unlabeled video data for training. We first exploit CLIP’s visual recognition capabilities to generate pseudo-texts automatically. These pseudo-texts contain diverse visual concepts from the video and serve as weak textual guidance. Moreover, we introduce Selective Contrastive Learning (SeLeCT), which prioritizes and selects highly correlated video-text pairs from pseudo-supervised video-text pairs. By doing so, SeLeCT enables more effective multi-modal learning under weak pairing supervision. Experimental results demonstrate that our method outperforms CLIP zero-shot performance by a large margin on multiple video-text retrieval benchmarks, e.g., 8.2% R@1 for video-to-text on MSRVTT, 12.2% R@1 for video-to-text on DiDeMo, and 10.9% R@1 for video-to-text on ActivityNet, respectively.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6748-6760"},"PeriodicalIF":0.0,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142832331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}