{"title":"Perceptual Depth Quality Assessment of Stereoscopic Omnidirectional Images","authors":"Wei Zhou;Zhou Wang","doi":"10.1109/TCSVT.2024.3449696","DOIUrl":"10.1109/TCSVT.2024.3449696","url":null,"abstract":"Depth perception plays an essential role in the viewer experience for immersive virtual reality (VR) visual environments. However, previous research investigations in the depth quality of 3D/stereoscopic images are rather limited, and in particular, are largely lacking for 3D viewing of 360-degree omnidirectional content. In this work, we make one of the first attempts to develop an objective quality assessment model named depth quality index (DQI) for efficient no-reference (NR) depth quality assessment of stereoscopic omnidirectional images. Motivated by the perceptual characteristics of the human visual system (HVS), the proposed DQI is built upon multi-color-channel, adaptive viewport selection, and interocular discrepancy features. Experimental results demonstrate that the proposed method outperforms state-of-the-art image quality assessment (IQA) and depth quality assessment (DQA) approaches in predicting the perceptual depth quality when tested using both single-viewport and omnidirectional stereoscopic image databases. Furthermore, we demonstrate that combining the proposed depth quality model with existing IQA methods significantly boosts the performance in predicting the overall quality of 3D omnidirectional images.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34 12","pages":"13452-13462"},"PeriodicalIF":8.3,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142177195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adjustable Visible and Infrared Image Fusion","authors":"Boxiong Wu;Jiangtao Nie;Wei Wei;Lei Zhang;Yanning Zhang","doi":"10.1109/TCSVT.2024.3449638","DOIUrl":"10.1109/TCSVT.2024.3449638","url":null,"abstract":"The visible and infrared image fusion (VIF) method aims to utilize the complementary information between these two modalities to synthesize a new image containing richer information. Although it has been extensively studied, the synthesized image that has the best visual results is difficult to reach consensus since users have different opinions. To address this problem, we propose an adjustable VIF framework termed AdjFusion, which introduces a global controlling coefficient into VIF to enforce it can interact with users. Within AdjFusion, a semantic-aware modulation module is proposed to transform the global controlling coefficient into a semantic-aware controlling coefficient, which provides pixel-wise guidance for AdjFusion considering both interactivity and semantic information within visible and infrared images. In addition, the introduced global controlling coefficient not only can be utilized as an external interface for interaction with users but also can be easily customized by the downstream tasks (e.g., VIF-based detection and segmentation), which can help to select the best fusion result for the downstream tasks. Taking advantage of this, we further propose a lightweight adaptation module for AdjFusion to learn the global controlling coefficient to be suitable for the downstream tasks better. Experimental results demonstrate the proposed AdjFusion can 1) provide ways to dynamically synthesize images to meet the diverse demands of users; and 2) outperform the previous state-of-the-art methods on both VIF-based detection and segmentation tasks, with the constructed lightweight adaptation method. Our code will be released after accepted at \u0000<uri>https://github.com/BearTo2/AdjFusion</uri>\u0000.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34 12","pages":"13463-13477"},"PeriodicalIF":8.3,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142177196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams","authors":"Bochen Xie;Yongjian Deng;Zhanpeng Shao;Qingsong Xu;Youfu Li","doi":"10.1109/TCSVT.2024.3448615","DOIUrl":"10.1109/TCSVT.2024.3448615","url":null,"abstract":"Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams. Most event-based methods project events into dense frames and process them using conventional vision models, resulting in high computational complexity. A recent trend is to develop point-based networks that achieve efficient event processing by learning sparse representations. However, existing works may lack robust local information aggregators and effective feature interaction operations, thus limiting their modeling capabilities. To this end, we propose an attention-aware model named Event Voxel Set Transformer (EVSTr) for efficient spatiotemporal representation learning on event streams. It first converts the event stream into voxel sets and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that consists of two well-designed components, including the Multi-Scale Neighbor Embedding Layer (MNEL) for local information aggregation and the Voxel Self-Attention Layer (VSAL) for global feature interaction. Enabling the network to incorporate a long-range temporal structure, we introduce a segment modeling strategy (S2TM) to learn motion patterns from a sequence of segmented voxel sets. The proposed model is evaluated on two recognition tasks, including object classification and action recognition. To provide a convincing model evaluation, we present a new event-based action recognition dataset (NeuroHAR) recorded in challenging scenarios. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34 12","pages":"13427-13440"},"PeriodicalIF":8.3,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142223515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual Protection for Image Privacy and Copyright via Traceable Adversarial Examples","authors":"Ming Li;Zhaoli Yang;Tao Wang;Yushu Zhang;Wenying Wen","doi":"10.1109/TCSVT.2024.3448351","DOIUrl":"10.1109/TCSVT.2024.3448351","url":null,"abstract":"In recent years, the uploading of massive personal images has increased the security risks, mainly including privacy breaches and copyright infringement. Adversarial examples provide a novel solution for protecting image privacy, as they can evade the detection by deep neural network (DNN)-based recognizers. However, the perturbations in the adversarial examples typically meaningless and therefore cannot be extracted as traceable information to support copyright protection. In this paper, we designed a dual protection scheme for image privacy and copyright via traceable adversarial examples. Specifically, a traceable adversarial model is proposed, which can be used to embed the invisible copyright information into images for copyright protection while fooling DNN-based recognizers for privacy protection. Inspired by the training method of generative adversarial networks (GANs), a new dynamic adversarial training strategy is designed, which allows our model for achieving stable multi-objective learning. Experimental results show that our scheme is exceptionally robust in the face of a variety of noise conditions and image processing methods, while exhibiting good model migration and defense robustness.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34 12","pages":"13401-13412"},"PeriodicalIF":8.3,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142177197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic, Robust, and Blind Video Watermarking Resisting Camera Recording","authors":"Lina Lin;Deyang Wu;Jiayan Wang;Yanli Chen;Xinpeng Zhang;Hanzhou Wu","doi":"10.1109/TCSVT.2024.3448502","DOIUrl":"10.1109/TCSVT.2024.3448502","url":null,"abstract":"As a secondary generation method, video recording will cause irreversible damage to the watermark within the video, which has always been challenging in video forensics. Although many video watermarking methods are reported in the literature, these methods, however, still cannot well resist camera recording. This has motivated the authors in this paper to introduce a new video watermarking method to resist camera recording. For the proposed method, two watermarks, i.e., copyright watermark and synchronization watermark, are embedded into the well-selected frequency domain coefficients. The synchronization watermark is used to ensure that the copyright watermark can be successfully extracted at the decoder side. To extract the copyright watermark without manual assistance, a neural network based segmentation model is applied to identify the watermarked video-playing region in the camera-recorded video. Meanwhile, automatic perspective correction is performed on the watermarked video-playing region so that the watermark information can be extracted accurately. The experiments show that the watermark data can be embedded into the raw video successfully and extracted from the camera-recorded video accurately by applying the proposed method. And, the proposed method significantly outperforms related works in terms of robustness in different scenarios, which has verified the superiority and applicability of the proposed method.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34 12","pages":"13413-13426"},"PeriodicalIF":8.3,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142177200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"F2CENet: Single-Image Object Counting Based on Block Co-Saliency Density Map Estimation","authors":"Xuehui Wu;Huanliang Xu;Henry Leung;Xiaobo Lu;Yanbin Li","doi":"10.1109/TCSVT.2024.3449070","DOIUrl":"10.1109/TCSVT.2024.3449070","url":null,"abstract":"This paper presents a novel single-image object counting method based on block co-saliency density map estimation, called free-to-count everything network (F2CENet). Image block co-saliency attention is introduced to promote density estimation adaptation, allowing to input any image with arbitrary size for accurate counting using the learned model without requiring manually labeled few shots. The proposed network also outperforms existing crowd counting methods based on geometry-adaptive kernels in complex scenes. A novel module generates multilevel & scale block correlation maps to guide the co-saliency density map estimation. Co-saliency attention maps are then fused for accurately locating block-wise salient objects under guidance of the initial cues. Hence, accurate density maps are generated via comprehensive learning of internal relations in block co-salient features and progressive optimization of local details with saliency-oriented scene understanding. Results from extensive experiments on existing density map estimation datasets with arbitrary challenges verify the effectiveness of the proposed F2CENet and show that it outperforms various state-of-the-art few-shot and crowd counting methods. Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) are used as evaluation metrics to measure the accuracy which are commonly used metrics for counting task. The average predicted MAE and RMSE are 10.88% and 8.44% less compared with the state-of-the-art evaluated on dataset contains sufficiently large and diverse categories used for few-shot and crowd counting.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34 12","pages":"13141-13151"},"PeriodicalIF":8.3,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142177201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Continual Learning of Image Classes With Language Guidance From a Vision-Language Model","authors":"Wentao Zhang;Yujun Huang;Weizhuo Zhang;Tong Zhang;Qicheng Lao;Yue Yu;Wei-Shi Zheng;Ruixuan Wang","doi":"10.1109/TCSVT.2024.3449109","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3449109","url":null,"abstract":"Current deep learning models often catastrophically forget the knowledge of old classes when continually learning new ones. State-of-the-art approaches to continual learning of image classes often require retaining a small subset of old data to partly alleviate the catastrophic forgetting issue, and their performance would be degraded sharply when no old data can be stored due to privacy or safety concerns. In this study, inspired by human learning of visual knowledge with the effective help of language, we propose a novel continual learning framework based on a pre-trained vision-language model (VLM) without retaining any old data. Rich prior knowledge of each new image class is effectively encoded by the frozen text encoder of the VLM, which is then used to guide the learning of new image classes. The output space of the frozen text encoder is unchanged over the whole process of continual learning, through which image representations of different classes become comparable during model inference even when the image classes are learned at different times. Extensive empirical evaluations on multiple image classification datasets under various settings confirm the superior performance of our method over existing ones. The source code is available at \u0000<uri>https://github.com/Fatflower/CIL_LG_VLM/</uri>\u0000.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34 12","pages":"13152-13163"},"PeriodicalIF":8.3,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142859153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Multi-Camera Gymnast Tracking Through Domain Knowledge Integration","authors":"Fan Yang;Shigeyuki Odashima;Shoichi Masui;Ikuo Kusajima;Sosuke Yamao;Shan Jiang","doi":"10.1109/TCSVT.2024.3447670","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3447670","url":null,"abstract":"We present a robust multi-camera gymnast tracking, which has been applied at international gymnastics championships for gymnastics judging. Despite considerable progress in multi-camera tracking algorithms, tracking gymnasts presents unique challenges: 1) due to space restrictions, only a limited number of cameras can be installed in the gymnastics stadium; and 2) due to variations in lighting, background, uniforms, and occlusions, multi-camera gymnast detection may fail in certain views and only provide valid detections from two opposing views. These factors complicate the accurate determination of a gymnast’s 3D trajectory using conventional multi-camera triangulation. To alleviate this issue, we incorporate gymnastics domain knowledge into our tracking solution. Given that a gymnast’s 3D center typically lies within a predefined vertical plane during much of their performance, we can apply a ray-plane intersection to generate coplanar 3D trajectory candidates for opposing-view detections. More specifically, we propose a novel cascaded data association (DA) paradigm that employs triangulation to generate 3D trajectory candidates when cross-view detections are sufficient, and resort to the ray-plane intersection when they are insufficient. Consequently, coplanar candidates are used to compensate for uncertain trajectories, thereby minimizing tracking failures. The robustness of our method is validated through extensive experimentation, demonstrating its superiority over existing methods in challenging scenarios. Furthermore, our gymnastics judging system, equipped with this tracking method, has been successfully applied to recent Gymnastics World Championships, earning significant recognition from the International Gymnastics Federation.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34 12","pages":"13386-13400"},"PeriodicalIF":8.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142859119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cover Selection in Encrypted Images","authors":"Jiang Yu;Jing Zhang;Zichi Wang;Fengyong Li;Xinpeng Zhang","doi":"10.1109/TCSVT.2024.3447913","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3447913","url":null,"abstract":"Existing effective cover selection methods aim to select the complex images as covers to achieve the highly security with the aid of the embedding distortion computed from a natural image. However, the calculation of the embedding distortion divulges the image content to a steganographer. To overcome this issue, this work proposes a novel cover selection scheme in encrypted images to achieve the image content-protection and cover-selection simultaneously. In the first phase, the content owner encrypts several most significant bits (MSBs) of each image using an encryption key and the encrypted image is shuffled by block. Meanwhile, with a sampling key, the content owner selects some encrypted blocks and outputs them to the steganographer. In the second phase, the steganographer calculates first-order noise residuals of adjacent pixels of the acquired blocks along different directions. Importantly, we design a texture descriptor named as structured Local binary pattern (SLBP) to encode all the residuals by which the images owing the maximal SLBP values are chosen as the optimal covers. We demonstrate the security of our proposed scheme on multiple steganographic and steganalytic methods and the extensive results show that our scheme exhibits excellent performance without knowing of the original image content. Moreover, the results testify that the designed SLBP achieves the perfect evaluation of image complexity.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34 12","pages":"13626-13641"},"PeriodicalIF":8.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142859151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-Throughput and Memory-Efficient Deblocking Filter Hardware Architecture for VVC","authors":"Bingjing Hou;Leilei Huang;Minge Jing;Yibo Fan","doi":"10.1109/TCSVT.2024.3447698","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3447698","url":null,"abstract":"Video coding has become more and more important since high-resolution and high-quality videos have been used in a variety of application areas. Deblocking filter (DBF) is a video coding technology which can improve both video quality and coding efficiency. However, its hardware architecture design suffers from huge computations and high memory requirements. Moreover, the latest Versatile Video Coding (VVC) standard extends DBF with several complex enhancements, which makes the design more difficult. In this paper, a high-throughput and memory-efficient DBF hardware architecture for VVC systems is presented. By analyz-ing the DBF algorithm, we firstly propose a unified filter core to perform edge filtering process with low complexity, and two resource sharing techniques are utilized to reduce hardware costs. Furthermore, we propose a whole DBF architecture to process all the edges in a coding tree unit (CTU). To improve its throughput, we propose novel pre-calculation processing flow and double processing flow to fully utilize pipelining and parallel processing techniques. At the same time, to reduce its memory requirements, we propose four novel data reuse approaches to fully utilize intermediate data reusabilities. Synthesis results show that our proposed hardware architecture can support real-time VVC DBF processing of \u0000<inline-formula> <tex-math>$7680times 4320$ </tex-math></inline-formula>\u0000 at 158 frames/s at 500 MHz working frequency. The hardware costs are only 163.2k gate count and three two-port on-chip SRAMs with data width of 128 bits and depth of 32. Compared with other state-of-the-art works for previous standards, our proposed VVC DBF hardware architecture achieves good results in performance, area efficiency and memory efficiency.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"34 12","pages":"13569-13583"},"PeriodicalIF":8.3,"publicationDate":"2024-08-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142859049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}