{"title":"Point Clouds Matching Based on Discrete Optimal Transport","authors":"Litao Ma;Wei Bian;Xiaoping Xue","doi":"10.1109/TIP.2024.3459594","DOIUrl":"10.1109/TIP.2024.3459594","url":null,"abstract":"Matching is an important prerequisite for point clouds registration, which is to establish a reliable correspondence between two point clouds. This paper aims to improve recent theoretical and algorithmic results on discrete optimal transport (DOT), since it lacks robustness for the point clouds matching problems with large-scale affine or even nonlinear transformation. We first consider the importance of the used prior probability for accurate matching and give some theoretical analysis. Then, to solve the point clouds matching problems with complex deformation and noise, we propose an improved DOT model, which introduces an orthogonal matrix and a diagonal matrix into the classical DOT model. To enhance its capability of dealing with cases with outliers, we further bring forward a relaxed and regularized DOT model. Meantime, we propose two algorithms to solve the brought forward two models. Finally, extensive experiments on some real datasets are designed in the presence of reflection, large-scale rotation, stretch, noise, and outliers. Some state-of-the-art methods, including CPD, APM, RANSAC, TPS-ICP, TPS-RPM, RPMNet, and classical DOT methods, are to be discussed and compared. For different levels of degradation, the numerical results demonstrate that the proposed methods perform more favorably and robustly than the other methods.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5650-5662"},"PeriodicalIF":0.0,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142373921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CPI-Parser: Integrating Causal Properties Into Multiple Human Parsing","authors":"Xuanhan Wang;Xiaojia Chen;Lianli Gao;Jingkuan Song;Heng Tao Shen","doi":"10.1109/TIP.2024.3469579","DOIUrl":"10.1109/TIP.2024.3469579","url":null,"abstract":"Existing methods of multiple human parsing (MHP) apply deep models to learn instance-level representations for segmenting each person into non-overlapped body parts. However, learned representations often contain many spurious correlations that degrade model generalization, leading learned models to be vulnerable to visually contextual variations in images (e.g., unseen image styles/external interventions). To tackle this, we present a causal property integrated parsing model termed CPI-Parser, which is driven by fundamental causal principles involving two causal properties for human parsing (i.e., the causal diversity and the causal invariance). Specifically, we assume that an image is constructed by a mix of causal factors (the characteristics of body parts) and non-causal factors (external contexts), where only the former ones decide the essence of human parsing. Since causal/non-causal factors are unobservable, the proposed CPI-Parser is required to separate key factors that satisfy the causal properties from an image. In this way, the parser is able to rely on causal factors w.r.t relevant evidence rather than non-causal factors w.r.t spurious correlations, thus alleviating model degradation and yielding improved parsing ability. Notably, the CPI-Parser is designed in a flexible way and can be integrated into any existing MHP frameworks. Extensive experiments conducted on three widely used benchmarks demonstrate the effectiveness and generalizability of our method. Code and models are released (\u0000<uri>https://github.com/HAG-uestc/CPI-Parser</uri>\u0000) for research purpose.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5771-5782"},"PeriodicalIF":0.0,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142373919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Subjective and Objective Quality Assessment of Rendered Human Avatar Videos in Virtual Reality","authors":"Yu-Chih Chen;Avinab Saha;Alexandre Chapiro;Christian Häne;Jean-Charles Bazin;Bo Qiu;Stefano Zanetti;Ioannis Katsavounidis;Alan C. Bovik","doi":"10.1109/TIP.2024.3468881","DOIUrl":"10.1109/TIP.2024.3468881","url":null,"abstract":"We study the visual quality judgments of human subjects on digital human avatars (sometimes referred to as “holograms” in the parlance of virtual reality [VR] and augmented reality [AR] systems) that have been subjected to distortions. We also study the ability of video quality models to predict human judgments. As streaming human avatar videos in VR or AR become increasingly common, the need for more advanced human avatar video compression protocols will be required to address the tradeoffs between faithfully transmitting high-quality visual representations while adjusting to changeable bandwidth scenarios. During transmission over the internet, the perceived quality of compressed human avatar videos can be severely impaired by visual artifacts. To optimize trade-offs between perceptual quality and data volume in practical workflows, video quality assessment (VQA) models are essential tools. However, there are very few VQA algorithms developed specifically to analyze human body avatar videos, due, at least in part, to the dearth of appropriate and comprehensive datasets of adequate size. Towards filling this gap, we introduce the LIVE-Meta Rendered Human Avatar VQA Database, which contains 720 human avatar videos processed using 20 different combinations of encoding parameters, labeled by corresponding human perceptual quality judgments that were collected in six degrees of freedom VR headsets. To demonstrate the usefulness of this new and unique video resource, we use it to study and compare the performances of a variety of state-of-the-art Full Reference and No Reference video quality prediction models, including a new model called HoloQA. As a service to the research community, we publicly releases the metadata of the new database at \u0000<uri>https://live.ece.utexas.edu/research/LIVE-Meta-rendered-human-avatar/index.html</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5740-5754"},"PeriodicalIF":0.0,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142368033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"INformer: Inertial-Based Fusion Transformer for Camera Shake Deblurring","authors":"Wenqi Ren;Linrui Wu;Yanyang Yan;Shengyao Xu;Feng Huang;Xiaochun Cao","doi":"10.1109/TIP.2024.3461967","DOIUrl":"10.1109/TIP.2024.3461967","url":null,"abstract":"Inertial measurement units (IMU) in the capturing device can record the motion information of the device, with gyroscopes measuring angular velocity and accelerometers measuring acceleration. However, conventional deblurring methods seldom incorporate IMU data, and existing approaches that utilize IMU information often face challenges in fully leveraging this valuable data, resulting in noise issues from the sensors. To address these issues, in this paper, we propose a multi-stage deblurring network named INformer, which combines inertial information with the Transformer architecture. Specifically, we design an IMU-image Attention Fusion (IAF) block to merge motion information derived from inertial measurements with blurry image features at the attention level. Furthermore, we introduce an Inertial-Guided Deformable Attention (IGDA) block for utilizing the motion information features as guidance to adaptively adjust the receptive field, which can further refine the corresponding blur kernel for pixels. Extensive experiments on comprehensive benchmarks demonstrate that our proposed method performs favorably against state-of-the-art deblurring approaches.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"6045-6056"},"PeriodicalIF":0.0,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142368032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Addressing Challenges of Incorporating Appearance Cues Into Heuristic Multi-Object Tracker via a Novel Feature Paradigm","authors":"Chongwei Liu;Haojie Li;Zhihui Wang;Rui Xu","doi":"10.1109/TIP.2024.3468901","DOIUrl":"10.1109/TIP.2024.3468901","url":null,"abstract":"In the field of Multi-Object Tracking (MOT), the incorporation of appearance cues into tracking-by-detection heuristic trackers using re-identification (ReID) features has posed limitations on its advancement. The existing ReID paradigm involves the extraction of coarse-grained object-level feature vectors from cropped objects at a fixed input size using a ReID model, and similarity computation through a simple normalized inner product. However, MOT requires fine-grained features from different object regions and more accurate similarity measurements to identify individuals, especially in the presence of occlusion. To address these limitations, we propose a novel feature paradigm. In this paradigm, we extract the feature map from the entire frame image to preserve object sizes and represent objects using a set of fine-grained features from different object regions. These features are sampled from adaptive patches within the object bounding box on the feature map to effectively capture local appearance cues. We introduce Mutual Ratio Similarity (MRS) to accurately measure the similarity of the most discriminative region between two objects based on the sampled patches, which proves effective in handling occlusion. Moreover, we propose absolute Intersection over Union (AIoU) to consider object sizes in feature cost computation. We integrate our paradigm with advanced motion techniques to develop a heuristic Motion-Feature joint multi-object tracker, MoFe. Within it, we reformulate the track state transition of tracklets to better model their life cycle, and firstly introduce a runtime recorder after MoFe to refine trajectories. Extensive experiments on five benchmarks, i.e., GMOT-40, BDD100k, DanceTrack, MOT17, and MOT20, demonstrate that MoFe achieves state-of-the-art performance in robustness and generalizability without any fine-tuning, and even surpasses the performance of fine-tuned ReID features.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5727-5739"},"PeriodicalIF":0.0,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142368031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Video Instance Shadow Detection Under the Sun and Sky","authors":"Zhenghao Xing;Tianyu Wang;Xiaowei Hu;Haoran Wu;Chi-Wing Fu;Pheng-Ann Heng","doi":"10.1109/TIP.2024.3468877","DOIUrl":"10.1109/TIP.2024.3468877","url":null,"abstract":"Instance shadow detection, crucial for applications such as photo editing and light direction estimation, has undergone significant advancements in predicting shadow instances, object instances, and their associations. The extension of this task to videos presents challenges in annotating diverse video data and addressing complexities arising from occlusion and temporary disappearances within associations. In response to these challenges, we introduce ViShadow, a semi-supervised video instance shadow detection framework that leverages both labeled image data and unlabeled video data for training. ViShadow features a two-stage training pipeline: the first stage, utilizing labeled image data, identifies shadow and object instances through contrastive learning for cross-frame pairing. The second stage employs unlabeled videos, incorporating an associated cycle consistency loss to enhance tracking ability. A retrieval mechanism is introduced to manage temporary disappearances, ensuring tracking continuity. The SOBA-VID dataset, comprising unlabeled training videos and labeled testing videos, along with the SOAP-VID metric, is introduced for the quantitative evaluation of VISD solutions. The effectiveness of ViShadow is further demonstrated through various video-level applications such as video inpainting, instance cloning, shadow editing, and text-instructed shadow-object manipulation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5715-5726"},"PeriodicalIF":0.0,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142367959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adapting Vision-Language Models via Learning to Inject Knowledge","authors":"Shiyu Xuan;Ming Yang;Shiliang Zhang","doi":"10.1109/TIP.2024.3468884","DOIUrl":"10.1109/TIP.2024.3468884","url":null,"abstract":"Pre-trained vision-language models (VLM) such as CLIP, have demonstrated impressive zero-shot performance on various vision tasks. Trained on millions or even billions of image-text pairs, the text encoder has memorized a substantial amount of appearance knowledge. Such knowledge in VLM is usually leveraged by learning specific task-oriented prompts, which may limit its performance in unseen tasks. This paper proposes a new knowledge injection framework to pursue a generalizable adaption of VLM to downstream vision tasks. Instead of learning task-specific prompts, we extract task-agnostic knowledge features, and insert them into features of input images or texts. The fused features hence gain better discriminative capability and robustness to intra-category variances. Those knowledge features are generated by inputting learnable prompt sentences into text encoder of VLM, and extracting its multi-layer features. A new knowledge injection module (KIM) is proposed to refine text features or visual features using knowledge features. This knowledge injection framework enables both modalities to benefit from the rich knowledge memorized in the text encoder. Experiments show that our method outperforms recently proposed methods under few-shot learning, base-to-new classes generalization, cross-dataset transfer, and domain generalization settings. For instance, it outperforms CoOp by 4.5% under the few-shot learning setting, and CoCoOp by 4.4% under the base-to-new classes generalization setting. Our code will be released.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5798-5809"},"PeriodicalIF":0.0,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142368030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Event-Assisted Blurriness Representation Learning for Blurry Image Unfolding","authors":"Pengyu Zhang;Hao Ju;Lei Yu;Weihua He;Yaoyuan Wang;Ziyang Zhang;Qi Xu;Shengming Li;Dong Wang;Huchuan Lu;Xu Jia","doi":"10.1109/TIP.2024.3468023","DOIUrl":"10.1109/TIP.2024.3468023","url":null,"abstract":"The goal of blurry image deblurring and unfolding task is to recover a single sharp frame or a sequence from a blurry one. Recently, its performance is greatly improved with introduction of a bio-inspired visual sensor, event camera. Most existing event-assisted deblurring methods focus on the design of powerful network architectures and effective training strategy, while ignoring the role of blur modeling in removing various blur in dynamic scenes. In this work, we propose to implicitly model blur in an image by computing blurriness representation with an event-assisted blurriness encoder. The learning of blurriness representation is formulated as a ranking problem based on specially synthesized pairs. Blurriness-aware image unfolding is achieved by integrating blur relevant information contained in the representation into a base unfolding network. The integration is mainly realized by the proposed blurriness-guided modulation and multi-scale aggregation modules. Experiments on GOPRO and HQF datasets show favorable performance of the proposed method against state-of-the-art approaches. More results on real-world data validate its effectiveness in recovering a sequence of latent sharp frames from a blurry image.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5824-5836"},"PeriodicalIF":0.0,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142362791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Camera-Based 3D Semantic Scene Completion With Sparse Guidance Network","authors":"Jianbiao Mei;Yu Yang;Mengmeng Wang;Junyu Zhu;Jongwon Ra;Yukai Ma;Laijian Li;Yong Liu","doi":"10.1109/TIP.2024.3461989","DOIUrl":"10.1109/TIP.2024.3461989","url":null,"abstract":"Semantic scene completion (SSC) aims to predict the semantic occupancy of each voxel in the entire 3D scene from limited observations, which is an emerging and critical task for autonomous driving. Recently, many studies have turned to camera-based SSC solutions due to the richer visual cues and cost-effectiveness of cameras. However, existing methods usually rely on sophisticated and heavy 3D models to process the lifted 3D features directly, which are not discriminative enough for clear segmentation boundaries. In this paper, we adopt the dense-sparse-dense design and propose a one-stage camera-based SSC framework, termed SGN, to propagate semantics from the semantic-aware seed voxels to the whole scene based on spatial geometry cues. Firstly, to exploit depth-aware context and dynamically select sparse seed voxels, we redesign the sparse voxel proposal network to process points generated by depth prediction directly with the coarse-to-fine paradigm. Furthermore, by designing hybrid guidance (sparse semantic and geometry guidance) and effective voxel aggregation for spatial geometry cues, we enhance the feature separation between different categories and expedite the convergence of semantic propagation. Finally, we devise the multi-scale semantic propagation module for flexible receptive fields while reducing the computation resources. Extensive experimental results on the SemanticKITTI and SSCBench-KITTI-360 datasets demonstrate the superiority of our SGN over existing state-of-the-art methods. And even our lightweight version SGN-L achieves notable scores of 14.80% mIoU and 45.45% IoU on SeamnticKITTI validation with only 12.5 M parameters and 7.16 G training memory. Code is available at \u0000<uri>https://github.com/Jieqianyu/SGN</uri>\u0000.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5468-5481"},"PeriodicalIF":0.0,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142325157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SWFormer: Stochastic Windows Convolutional Transformer for Hybrid Modality Hyperspectral Classification","authors":"Jiaojiao Li;Zhiyuan Zhang;Yuzhe Liu;Rui Song;Yunsong Li;Qian Du","doi":"10.1109/TIP.2024.3465038","DOIUrl":"10.1109/TIP.2024.3465038","url":null,"abstract":"Joint classification of hyperspectral images with hybrid modality can significantly enhance interpretation potentials, particularly when elevation information from the LiDAR sensor is integrated for outstanding performance. Recently, the transformer architecture was introduced to the HSI and LiDAR classification task, which has been verified as highly efficient. However, the existing naive transformer architectures suffer from two main drawbacks: 1) Inadequacy extraction for local spatial information and multi-scale information from HSI simultaneously. 2) The matrix calculation in the transformer consumes vast amounts of computing power. In this paper, we propose a novel Stochastic Window Transformer (SWFormer) framework to resolve these issues. First, the effective spatial and spectral feature projection networks are built independently based on hybrid-modal heterogeneous data composition using parallel feature extraction, which is conducive to excavating the perceptual features more representative along different dimensions. Furthermore, to construct local-global nonlinear feature maps more flexibly, we implement multi-scale strip convolution coupled with a transformer strategy. Moreover, in an innovative random window transformer structure, features are randomly masked to achieve sparse window pruning, alleviating the problem of information density redundancy, and reducing the parameters required for intensive attention. Finally, we designed a plug-and-play feature aggregation module that adapts domain offset between modal features adaptively to minimize semantic gaps between them and enhance the representational ability of the fusion feature. Three fiducial datasets demonstrate the effectiveness of the SWFormer in determining classification results.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"33 ","pages":"5482-5495"},"PeriodicalIF":0.0,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142325161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}