{"title":"For Overall Nighttime Visibility: Integrate Irregular Glow Removal With Glow-Aware Enhancement","authors":"Wanyu Wu;Wei Wang;Zheng Wang;Kui Jiang;Zhengguo Li","doi":"10.1109/TCSVT.2024.3465670","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3465670","url":null,"abstract":"Current low-light image enhancement (LLIE) techniques truly enhance luminance but have limited exploration on another harmful factor of nighttime visibility, the glow effects with multiple shapes in the real world. The presence of glow is inevitable due to widespread artificial light sources, and direct enhancement can cause further glow diffusion. In the pursuit of Overall Nighttime Visibility Enhancement (ONVE), we propose a physical model guided framework ONVE to derive a Nighttime Imaging Model with Near-Field Light Sources (NIM-NLS), whose APSF prior generator is validated efficiently in six categories of glow shapes. Guided by this physical-world model as domain knowledge, we subsequently develop an extensible Light-aware Blind Deconvolution Network (LBDN) to face the blind decomposition challenge on direct transmission map D and light source map G based on APSF. Then, an innovative Glow-guided Retinex-based progressive Enhancement module (GRE) is introduced as a further optimization on reflection R from D to harmonize the conflict of glow removal and brightness boost. Notably, ONVE is an unsupervised framework based on a zero-shot learning strategy and uses physical domain knowledge to form the overall pipeline and network. Empirical evaluations on multiple datasets validate the remarkable efficacy of the proposed ONVE in improving nighttime visibility and performance of high-level vision tasks.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"823-837"},"PeriodicalIF":8.3,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143369817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3D Video Conferencing via On-Hand Devices","authors":"Yili Jin;Xize Duan;Kaiyuan Hu;Fangxin Wang;Xue Liu","doi":"10.1109/TCSVT.2024.3465848","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3465848","url":null,"abstract":"Video conferencing has become indispensable in human communication. Researchers are exploring immersive capabilities to enhance video conferencing experiences by delivering realistic interactions. However, existing methods have stringent and extra hardware beyond a typical video conference, including multiple depth cameras, large screens, and headsets, which pose obstacles to the widespread adoption due to high costs and complex setups. Thus, there is an urgent demand for light-weight systems using only on-hand devices including single RGB camera and standard screen, without additional hardware. We propose DVCO, a novel 3D video conferencing system via on-hand devices. With DVCO, users can experience lifelike virtual conferencing that includes natural contact and interactive features. To achieve this, DVCO has two main components. Virtual Camera Transformation (VCT) and New View Generator (NVG). VCT computes a downscaled sender image from tracking to determine viewpoint and gaze vector, enhancing virtual presence on standard screens. NVG takes an input frame and desired view angle to produce an output reflecting the new view from a single RGB camera. Together, these provide an affordable, easy-to-integrate enhancement for current video conferencing systems without expensive upgrades. Through a user study, it has been demonstrated that DVCO offers an exceptional level of immersion when compared to traditional systems. Experiments are conducted to showcase the superior performance of VCT and NVG in comparison to baseline methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"900-910"},"PeriodicalIF":8.3,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143369985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Asynchronous Joint-Based Temporal Pooling for Skeleton-Based Action Recognition","authors":"Shanaka Ramesh Gunasekara;Wanqing Li;Jack Yang;Philip O. Ogunbona","doi":"10.1109/TCSVT.2024.3465845","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3465845","url":null,"abstract":"Deep neural networks for skeleton-based human action recognition (HAR) often utilize traditional averaging or maximum temporal pooling to aggregate features by treating all joints and frames equally. However, this approach can excessively aggregate less discriminative or even indiscriminative features into the final feature vectors for recognition. To address this issue, a novel method called asynchronous joint adaptive temporal pooling (AJTP) is introduced in this paper. The method aims to enhance action recognition by identifying a set of informative joints across the temporal dimension and applying a joint-based and asynchronous motion-preservative pooling rather than conventional frame-based pooling. The effectiveness of the proposed AJTP has been empirically validated by integrating it with popular Graph Convolutional Network (GCN) models on three benchmark datasets: NTU RGB+D 120, PKUMMD, and Kinetic400. The results have shown that a GCN model with AJTP substantially improves performance compared to its counterpart GCN model with conventional temporal pooling techniques. The source code is available at <uri>https://github.com/ShanakaRG/AJTP</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"357-366"},"PeriodicalIF":8.3,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143107144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"All-Inclusive Image Enhancement for Degraded Images Exhibiting Low-Frequency Corruption","authors":"Mingye Ju;Chunming He;Can Ding;Wenqi Ren;Lin Zhang;Kai-Kuang Ma","doi":"10.1109/TCSVT.2024.3465875","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3465875","url":null,"abstract":"In this paper, a novel image enhancement method, called the all-inclusive image enhancement (AIIE), is proposed that can effectively enhance the degraded images for improving the visibility of image content. These imageries were acquired under various types of weather conditions such as haze, low-light, underwater, and sandstorm, etc. One commonality shared by this class of noise is that the resulted degradations on visual quality or visibility are caused by low-frequency interference. Existing image enhancement methods lack the ability to deal with all types of degradations from this class, while our proposed AIIE offers a unified treatment for them. To achieve this goal, a statistical property is obtained from the study of the discrete cosine transform (DCT) of 1,000 high- and 1000 low-quality images on their DCT domains. It shows that the normalized DCT coefficients (between 0 and 1) of high-quality images has about 95% fall in the interval [0, 0.2]; for low-quality images, almost all the coefficients are in the same interval. This fundamental property, called the DCT prior (DCT-P), is instrumental to the development of our AIIE algorithm proposed in this paper. Since the proposed DCT-P delineates the attributes of high- and low-quality images clearly, it becomes a highly effective ‘tool’ to convert low-quality images to its enhanced version. Extensive experimental results have clearly validated the superior performance of the AIIE conducted on different types of deteriorated images in terms of visual quality and efficiency as well as significant advantages on computational complexity, which is essential for real-time applications.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"838-856"},"PeriodicalIF":8.3,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143369883","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel SO(3) Rotational Equivariant Masked Autoencoder for 3D Mesh Object Analysis","authors":"Min Xie;Jieyu Zhao;Kedi Shen","doi":"10.1109/TCSVT.2024.3465041","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3465041","url":null,"abstract":"Equivariant networks have recently made significant strides in computer vision tasks related to robotic grasping, molecule generation, and 6D pose tracking. In this paper, we explore 3D mesh object analysis based on an equivariant masked autoencoder to reduce the model dependence on large datasets and predict the pose transformation. We employ 3D reconstruction tasks under rotation and masking operations, such as segmentation tasks after rotation, as pretraining to enhance downstream task performance. To mitigate the computational complexity of the algorithm, we first utilize multiple non-overlapping 3D mesh patches with a fixed face size. We then design a rotation-equivariant self-attention mechanism to obtain advanced features. To improve the throughput of the encoder, we design a sparse token merging strategy. Our method achieves comparable performance on equivariant analysis tasks of mesh objects, such as 3D mesh pose transformation estimation, object classification and part segmentation on the ShapeNetCore16, Manifold40, COSEG-aliens, COSEG-vases and Human Body datasets. In the object classification task, we achieve superior performance even when only 10% of the original sample is used. We perform extensive ablation experiments to demonstrate the efficacy of critical design choices in our approach.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"329-342"},"PeriodicalIF":8.3,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143107147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Purify Then Guide: A Bi-Directional Bridge Network for Open-Vocabulary Semantic Segmentation","authors":"Yuwen Pan;Rui Sun;Yuan Wang;Wenfei Yang;Tianzhu Zhang;Yongdong Zhang","doi":"10.1109/TCSVT.2024.3464631","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3464631","url":null,"abstract":"Open-vocabulary semantic segmentation (OVSS) aims to segment an image into regions of corresponding semantic vocabularies, without being limited to a predefined set of object categories. Existing works mainly utilize large-scale vision-language models (e.g., CLIP) to leverage their superior open-vocabulary classification abilities in a two-stage manner. However, their heavy reliance on the first-stage segmentation network leaves the full potential of CLIP untapped, creating an unresolved gap between the rich pre-training knowledge and the challenging per-pixel classification task. Although the recent one-stage paradigm has further leveraged pre-trained vision knowledge from CLIP, it fails to effectively utilize text information due to the inclusion of numerous unrelated semantics in the vocabulary list. How to avoid noise interference in text information and utilize language guidance remains a Gordian knot. In this paper, we propose a bi-directional bridge network (BBN) to bridge the gap between upstream pre-trained models and downstream segmentation tasks. It first purifies the noisy text embedding and then guides semantics-vision aggregation with the purified information in a purification-then-guidance manner, thereby facilitating effective semantic utilization. Specifically, we design an optimal purification modulator to purify noisy text information via the optimal transport algorithm, and a reliable guidance modulator to integrate proper textual information into vision embedding via the designed reliable attention in an adaptive manner. Extensive experimental results on five challenging benchmarks demonstrate that our BBN performs favorably against state-of-the-art open-vocabulary semantic segmentation methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"343-356"},"PeriodicalIF":8.3,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143107146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Boosting Deepfake Detection Generalizability via Expansive Learning and Confidence Judgement","authors":"Kuiyuan Zhang;Zeming Hou;Zhongyun Hua;Yifeng Zheng;Leo Yu Zhang","doi":"10.1109/TCSVT.2024.3462985","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3462985","url":null,"abstract":"As deepfake technology poses severe threats to information security, significant efforts have been devoted to deepfake detection. To enable model generalization for detecting new types of deepfakes, it is required that the existing models should learn knowledge about new types of deepfakes without losing prior knowledge, a challenge known as catastrophic forgetting (CF). Existing methods mainly utilize domain adaptation to learn about the new deepfakes for addressing this issue. However, these methods are constrained to utilizing a small portion of data samples from the new deepfakes, and they suffer from CF when the size of the data samples used for domain adaptation increases. This resulted in poor average performance in source and target domains. In this paper, we introduce a novel approach to boost the generalizability of deepfake detection. Our approach follows a two-stage training process: training in the source domain (prior deepfakes that have been used for training) and domain adaptation to the target domain (new types of deepfakes). In the first stage, we employ expansive learning to train our expanded model from a well-trained teacher model. In the second stage, we transfer the expanded model to the target domain while removing assistant components. For model architecture, we propose the frequency extraction module to extract frequency features as complementary to spatial features and introduce spatial-frequency contrastive loss to enhance feature learning ability. Moreover, we develop a confidence judgement module to eliminate conflicts between new and prior knowledge. Experimental results demonstrate that our method can achieve better average accuracy in source and target domains even when using large-scale data samples of the target domain, and it exhibits superior generalizability compared to state-of-the-art methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"953-966"},"PeriodicalIF":8.3,"publicationDate":"2024-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143369818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"L2A: Learning Affinity From Attention for Weakly Supervised Continual Semantic Segmentation","authors":"Hao Liu;Yong Zhou;Bing Liu;Ming Yan;Joey Tianyi Zhou","doi":"10.1109/TCSVT.2024.3462946","DOIUrl":"10.1109/TCSVT.2024.3462946","url":null,"abstract":"Despite significant advances in continual semantic segmentation (CSS), they still rely on the pixel-level annotation to train models, which is time-consuming and labor-intensive. Continual learning from image-level labels is an emerging scheme in continual semantic segmentation to reduce the annotation cost. However, the incomplete and coarse pseudo-labels are insufficient to train a model to maintain a balance between stability and plasticity. To solve these issues, we propose a novel end-to-end framework based on Transformer, called L2A, for Weakly Supervised Continual Semantic Segmentation (WSCSS). In particular, to generate reliable annotations from the image-level supervision, we introduce a semantic affinity from multi-head self-attention (SA-MHSA) module to capture the semantic relationships among adjacent image coordinates. Subsequently, this acquired semantic affinity is employed to refine the initial pseudo labels of new classes trained with the image-level annotations. Furthermore, to minimize catastrophic forgetting, we propose a semantic drift compensation (SDC) strategy to optimize the pseudo-label generation process, which can effectively improve the alignment of object boundaries across both new and old categories. Comprehensive experiments conducted on the PASCAL VOC 2012 and COCO datasets demonstrate the superiority of our framework in existing WSCSS scenarios and a newly proposed challenge protocol, as well as remains competitive compared to the pixel-level supervised CSS methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"315-328"},"PeriodicalIF":8.3,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255169","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Patch-Aware Batch Normalization for Improving Cross-Domain Robustness","authors":"Lei Qi;Dongjia Zhao;Yinghuan Shi;Xin Geng","doi":"10.1109/TCSVT.2024.3462501","DOIUrl":"10.1109/TCSVT.2024.3462501","url":null,"abstract":"Despite the significant success of deep learning in computer vision tasks, cross-domain tasks still present a challenge in which the model’s performance will degrade when the training set and the test set follow different distributions. Most existing methods employ adversarial learning or instance normalization for achieving data augmentation to solve this task. In contrast, considering that the batch normalization (BN) layer may not be robust for unseen domains and there exist the differences between local patches of an image, we propose a novel method called patch-aware batch normalization (PBN). To be specific, we first split feature maps of a batch into non-overlapping patches along the spatial dimension, and then independently normalize each patch to jointly optimize the shared BN parameter at each iteration. By exploiting the differences between local patches of an image, our proposed PBN can effectively enhance the robustness of the model’s parameters. Besides, considering the statistics from each patch may be inaccurate due to their smaller size compared to the global feature maps, we incorporate the globally accumulated statistics with the statistics from each batch to obtain the final statistics for normalizing each patch. Since the proposed PBN can replace the typical BN, it can be integrated into most existing state-of-the-art methods. Extensive experiments and analysis demonstrate the effectiveness of our PBN in multiple computer vision tasks, including classification, object detection, instance retrieval, and semantic segmentation.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"800-810"},"PeriodicalIF":8.3,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-Supervised Learning for Rolling Shutter Temporal Super-Resolution","authors":"Bin Fan;Ying Guo;Yuchao Dai;Chao Xu;Boxin Shi","doi":"10.1109/TCSVT.2024.3462520","DOIUrl":"10.1109/TCSVT.2024.3462520","url":null,"abstract":"Most cameras on portable devices adopt a rolling shutter (RS) mechanism, encoding sufficient temporal dynamic information through sequential readouts. This advantage can be exploited to recover a temporal sequence of latent global shutter (GS) images. Existing methods rely on fully supervised learning, necessitating specialized optical devices to collect paired RS-GS images as ground-truth, which is too costly to scale. In this paper, we propose a self-supervised learning framework for the first time to produce a high frame rate GS video from two consecutive RS images, unleashing the potential of RS cameras. Specifically, we first develop the unified warping model of RS2GS and GS2RS, enabling the complement conversions of RS2GS and GS2RS to be incorporated into a uniform network model. Then, based on the cycle consistency constraint, given a triplet of consecutive RS frames, we minimize the discrepancy between the input middle RS frame and its cycle reconstruction, generated by interpolating back from the predicted two intermediate GS frames. Experiments on various benchmarks show that our approach achieves comparable or better performance than state-of-the-art supervised methods while enjoying stronger generalization capabilities. Moreover, our approach makes it possible to recover smooth and distortion-free videos from two adjacent RS frames in the real-world BS-RSC dataset, surpassing prior limitations.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"769-782"},"PeriodicalIF":8.3,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142255172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}