{"title":"Spatial Quality Oriented Rate Control for Volumetric Video Streaming via Deep Reinforcement Learning","authors":"Xi Wang;Wei Liu;Shimin Gong;Zhi Liu;Jing Xu;Yuming Fang","doi":"10.1109/TCSVT.2024.3523348","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3523348","url":null,"abstract":"Volumetric videos offer an incredibly immersive viewing experience but encounters challenges in maintaining quality of experience (QoE) due to its ultra-high bandwidth requirements. One significant challenge stems from user’s spatial interactions, potentially leading to discrepancies between transmission bitrates and the actual quality of rendered viewports. In this study, we conduct comprehensive measurement experiments to investigate the impact of six degrees of freedom information on received video quality. Our results indicate that the correlation between spatial quality and transmission bitrates is influenced by the user’s viewing distance, exhibiting variability among users. To address this, we propose a spatial quality oriented rate control system, namely sparkle, that aims to satisfy spatial quality requirements while maximizing long-term QoE for volumetric video streaming services. Leveraging richer user interaction information, we devise a tailored learning-based algorithm to enhance long-term QoE. To address the complexity brought by richer state input and precise allocation, we integrate pre-constraints derived from three-dimensional displays to intervene action selection, efficiently reducing the action space and speeding up convergence. Extensive experimental results illustrate that sparkle significantly enhances the averaged QoE by up to 29% under practical network and user tracking scenarios.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"5092-5108"},"PeriodicalIF":8.3,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GMTNet: Dense Object Detection via Global Dynamically Matching Transformer Network","authors":"Chaojun Dong;Chengxuan Wang;Yikui Zhai;Ye Li;Jianhong Zhou;Pasquale Coscia;Angelo Genovese;Vincenzo Piuri;Fabio Scotti","doi":"10.1109/TCSVT.2024.3522661","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3522661","url":null,"abstract":"In recent years, object detection models have been extensively applied across various industries, leveraging learned samples to recognize and locate objects. However, industrial environments present unique challenges, including complex backgrounds, dense object distributions, object stacking, and occlusion. To address these challenges, we propose the Global Dynamic Matching Transformer Network (GMTNet). GMTNet partitions images into blocks and employs a sliding window approach to capture information from each block and their interrelationships, mitigating background interference while acquiring global information for dense object recognition. By reweighting key-value pairs in multi-scale feature maps, GMTNet enhances global information relevance and effectively handles occlusion and overlap between objects. Furthermore, we introduce a dynamic sample matching method to tackle the issue of excessive candidate boxes in dense detection tasks. This method adaptively adjusts the number of matched positive samples according to the specific detection task, enabling the model to reduce the learning of irrelevant features and simplify post-processing. Experimental results demonstrate that GMTNet excels in dense detection tasks and outperforms current mainstream algorithms. The code will be available at <uri>http://github.com/yikuizhai/GMTNet</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4923-4936"},"PeriodicalIF":8.3,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BEVUDA++: Geometric-Aware Unsupervised Domain Adaptation for Multi-View 3D Object Detection","authors":"Rongyu Zhang;Jiaming Liu;Xiaoqi Li;Xiaowei Chi;Dan Wang;Li Du;Yuan Du;Shanghang Zhang","doi":"10.1109/TCSVT.2024.3523049","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3523049","url":null,"abstract":"Vision-centric Bird’s Eye View (BEV) perception holds considerable promise for autonomous driving. Recent studies have prioritized efficiency or accuracy enhancements, yet the issue of domain shift has been overlooked, leading to substantial performance degradation upon transfer. We identify major domain gaps in real-world cross-domain scenarios and initiate the first effort to address the Domain Adaptation (DA) challenge in multi-view 3D object detection for BEV perception. Given the complexity of BEV perception approaches with their multiple components, domain shift accumulation across multi-geometric spaces (e.g., 2D, 3D Voxel, BEV) poses a significant challenge for BEV domain adaptation. In this paper, we introduce an innovative geometric-aware teacher-student framework, BEVUDA++, to diminish this issue, comprising a Reliable Depth Teacher (RDT) and a Geometric Consistent Student (GCS) model. Specifically, RDT effectively blends target LiDAR with dependable depth predictions to generate depth-aware information based on uncertainty estimation, enhancing the extraction of Voxel and BEV features that are essential for understanding the target domain. To collaboratively reduce the domain shift, GCS maps features from multiple spaces into a unified geometric embedding space, thereby narrowing the gap in data distribution between the two domains. Additionally, we introduce a novel Uncertainty-guided Exponential Moving Average (UEMA) to further reduce error accumulation due to domain shifts informed by previously obtained uncertainty guidance. To demonstrate the superiority of our proposed method, we execute comprehensive experiments in four cross-domain scenarios, securing state-of-the-art performance in BEV 3D object detection tasks, e.g., 12.9% NDS and 9.5% mAP enhancement on Day-Night adaptation.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"5109-5122"},"PeriodicalIF":8.3,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PCTrack: Accurate Object Tracking for Live Video Analytics on Resource-Constrained Edge Devices","authors":"Xinyi Zhang;Haoran Xu;Chenyun Yu;Guang Tan","doi":"10.1109/TCSVT.2024.3523204","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3523204","url":null,"abstract":"The task of live video analytics relies on real-time object tracking that typically involves computationally expensive deep neural network (DNN) models. In practice, it has become essential to process video data on edge devices deployed near the cameras. However, these edge devices often have very limited computing resources and thus suffer from poor tracking accuracy. Through a measurement study, we identify three major factors contributing to the performance issue: outdated detection results, tracking error accumulation, and ignorance of new objects. We introduce a novel approach, called Predict & Correct based Tracking, or <monospace>PCTrack</monospace>, to systematically address these problems. Our design incorporates three innovative components: 1) a Predictive Detection Propagator that rapidly updates outdated object bounding boxes to match the current frame through a lightweight prediction model; 2) a Frame Difference Corrector that refines the object bounding boxes based on frame difference information; and 3) a New Object Detector that efficiently discovers newly appearing objects during tracking. Experimental results show that our approach achieves remarkable accuracy improvements, ranging from 19.4% to 34.7%, across diverse traffic scenarios, compared to state of the art methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"3969-3982"},"PeriodicalIF":8.3,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Forgery-Aware Adaptive Learning With Vision Transformer for Generalized Face Forgery Detection","authors":"Anwei Luo;Rizhao Cai;Chenqi Kong;Yakun Ju;Xiangui Kang;Jiwu Huang;Alex C. Kot","doi":"10.1109/TCSVT.2024.3522091","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3522091","url":null,"abstract":"With the rapid progress of generative models, the current challenge in face forgery detection is how to effectively detect realistic manipulated faces from different unseen domains. Though previous studies show that pre-trained Vision Transformer (ViT) based models can achieve some promising results after fully fine-tuning on the Deepfake dataset, their generalization performances are still unsatisfactory. To this end, we present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm for generalized face forgery detection, where the parameters in the pre-trained ViT are kept fixed while the designed adaptive modules are optimized to capture forgery features. Specifically, a global adaptive module is designed to model long-range interactions among input tokens, which takes advantage of self-attention mechanism to mine global forgery clues. To further explore essential local forgery clues, a local adaptive module is proposed to expose local inconsistencies by enhancing the local contextual association. In addition, we introduce a fine-grained adaptive learning module that emphasizes the common compact representation of genuine faces through relationship learning in fine-grained pairs, driving these proposed adaptive modules to be aware of fine-grained forgery-aware information. Extensive experiments demonstrate that our FA-ViT achieves state-of-the-arts results in the cross-dataset evaluation, and enhances the robustness against unseen perturbations. Particularly, FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation. The code and trained model have been released at: <uri>https://github.com/LoveSiameseCat/FAViT</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4116-4129"},"PeriodicalIF":8.3,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bi-Direction Label-Guided Semantic Enhancement for Cross-Modal Hashing","authors":"Lei Zhu;Runbing Wu;Xinghui Zhu;Chengyuan Zhang;Lin Wu;Shichao Zhang;Xuelong Li","doi":"10.1109/TCSVT.2024.3521646","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3521646","url":null,"abstract":"Supervised cross-modal hashing has gained significant attention due to its efficiency in reducing storage and computation costs while maintaining rich semantic information. Despite substantial progress in generating compact binary codes, two key challenges remain: (1) insufficient utilization of labels to mine and fuse multi-grained semantic information, and (2) unreliable cross-modal interaction, which does not fully leverage multi-grained semantics or accurately capture sample relationships. To address these limitations, we propose a novel method called Bi-direction Label-Guided Semantic Enhancement for cross-modal Hashing (BiLGSEH). To tackle the first challenge, we introduce a label-guided semantic fusion strategy that extracts and integrates multi-grained semantic features guided by multi-labels. For the second challenge, we propose a semantic-enhanced relation aggregation strategy that constructs and aggregates multi-modal relational information through bi-directional similarity. Additionally, we incorporate CLIP features to improve the alignment between multi-modal content and complex semantics. In summary, BiLGSEH generates discriminative hash codes by effectively aligning semantic distribution and relational structure across modalities. Extensive performance evaluations against 18 competitive methods demonstrate the superiority of our approach. The source code for our method is publicly available at: <uri>https://github.com/yileicc/BiLGSEH</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"3983-3999"},"PeriodicalIF":8.3,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cognition Transferring and Decoupling for Text-Supervised Egocentric Semantic Segmentation","authors":"Zhaofeng Shi;Heqian Qiu;Lanxiao Wang;Fanman Meng;Qingbo Wu;Hongliang Li","doi":"10.1109/TCSVT.2024.3521955","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3521955","url":null,"abstract":"In this paper, we explore a novel Text-supervised Egocentic Semantic Segmentation (TESS) task that aims to assign pixel-level categories to egocentric images weakly supervised by texts from image-level labels. In this task with prospective potential, the egocentric scenes contain dense wearer-object relations and inter-object interference. However, most recent third-view methods leverage the frozen Contrastive Language-Image Pre-training (CLIP) model, which is pre-trained on the semantic-oriented third-view data and lapses in the egocentric view due to the “relation insensitive” problem. Hence, we propose a Cognition Transferring and Decoupling Network (CTDN) that first learns the egocentric wearer-object relations via correlating the image and text. Besides, a Cognition Transferring Module (CTM) is developed to distill the cognitive knowledge from the large-scale pre-trained model to our model for recognizing egocentric objects with various semantics. Based on the transferred cognition, the Foreground-background Decoupling Module (FDM) disentangles the visual representations to explicitly discriminate the foreground and background regions to mitigate false activation areas caused by foreground-background interferential objects during egocentric relation learning. Extensive experiments on four TESS benchmarks demonstrate the effectiveness of our approach, which outperforms many recent related methods by a large margin. Code will be available at <uri>https://github.com/ZhaofengSHI/CTDN</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4144-4157"},"PeriodicalIF":8.3,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLIP-Based Camera-Agnostic Feature Learning for Intra-Camera Supervised Person Re-Identification","authors":"Xuan Tan;Xun Gong;Yang Xiang","doi":"10.1109/TCSVT.2024.3522178","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3522178","url":null,"abstract":"Contrastive Language-Image Pre-Training (CLIP) model excels in traditional person re-identification (ReID) tasks due to its inherent advantage in generating textual descriptions for pedestrian images. However, applying CLIP directly to intra-camera supervised person re-identification (ICS ReID) presents challenges. ICS ReID requires independent identity labeling within each camera, without associations across cameras. This limits the effectiveness of text-based enhancements. To address this, we propose a novel framework called CLIP-based Camera-Agnostic Feature Learning (CCAFL) for ICS ReID. Accordingly, two custom modules are designed to guide the model to actively learn camera-agnostic pedestrian features: Intra-Camera Discriminative Learning (ICDL) and Inter-Camera Adversarial Learning (ICAL). Specifically, we first establish learnable textual prompts for intra-camera pedestrian images to obtain crucial semantic supervision signals for subsequent intra- and inter-camera learning. Then, we design ICDL to increase inter-class variation by considering the hard positive and hard negative samples within each camera, thereby learning intra-camera finer-grained pedestrian features. Additionally, we propose ICAL to reduce inter-camera pedestrian feature discrepancies by penalizing the model’s ability to predict the camera from which a pedestrian image originates, thus enhancing the model’s capability to recognize pedestrians from different viewpoints. Extensive experiments on popular ReID datasets demonstrate the effectiveness of our approach. Especially, on the challenging MSMT17 dataset, we arrive at 58.9% in terms of mAP accuracy, surpassing state-of-the-art methods by 7.6%. Code is available at <uri>https://gitee.com/swjtugx/classmate/tree/master/OurGroup/CCAFL</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4100-4115"},"PeriodicalIF":8.3,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Better Fit: Accommodate Variations in Clothing Types for Virtual Try-On","authors":"Dan Song;Xuanpu Zhang;Jianhao Zeng;Pengxin Zhan;Qingguo Chen;Weihua Luo;An-An Liu","doi":"10.1109/TCSVT.2024.3521299","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3521299","url":null,"abstract":"Image-based virtual try-on aims to transfer target in-shop clothing to a dressed model image, the objectives of which are totally taking off original clothing while preserving the contents outside of the try-on area, naturally wearing target clothing and correctly inpainting the gap between target clothing and original clothing. Tremendous efforts have been made to facilitate this popular research area, but cannot keep the type of target clothing with the try-on area affected by original clothing. In this paper, we focus on the unpaired virtual try-on situation where target clothing and original clothing on the model are different, i.e., the practical scenario. To break the correlation between the try-on area and the original clothing and make the model learn the correct information to inpaint, we propose an adaptive mask training paradigm that dynamically adjusts training masks. It not only improves the alignment and fit of clothing but also significantly enhances the fidelity of virtual try-on experience. Furthermore, we for the first time propose two metrics for unpaired try-on evaluation, the Semantic-Densepose-Ratio (SDR) and Skeleton-LPIPS (S-LPIPS), to evaluate the correctness of clothing type and the accuracy of clothing texture. For unpaired try-on validation, we construct a comprehensive cross-try-on benchmark (Cross-27) with distinctive clothing items and model physiques, covering a broad try-on scenarios. Experiments demonstrate the effectiveness of the proposed methods, contributing to the advancement of virtual try-on technology and offering new insights and tools for future research in the field. The code, model and benchmark will be publicly released.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4287-4299"},"PeriodicalIF":8.3,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RaLiBEV: Radar and LiDAR BEV Fusion Learning for Anchor Box Free Object Detection Systems","authors":"Yanlong Yang;Jianan Liu;Tao Huang;Qing-Long Han;Gang Ma;Bing Zhu","doi":"10.1109/TCSVT.2024.3521375","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3521375","url":null,"abstract":"In autonomous driving, LiDAR and radar are crucial for environmental perception. LiDAR offers precise 3D spatial sensing information but struggles in adverse weather like fog. Conversely, radar signals can penetrate rain or mist due to their specific wavelength but are prone to noise disturbances. Recent state-of-the-art works reveal that the fusion of radar and LiDAR can lead to robust detection in adverse weather. Current approaches typically fuse features from various data sources using basic convolutional/transformer network architectures and employ straightforward label assignment strategies for object detection. However, these methods have two main limitations: they fail to adequately capture feature interactions and lack consistent regression constraints. In this paper, we propose a bird’s-eye view fusion learning-based anchor box-free object detection system. Our approach introduces a novel interactive transformer module for enhanced feature fusion and an advanced label assignment strategy for more consistent regression, addressing key limitations in existing methods. Specifically, experiments show that, our approach’s average precision ranks <inline-formula> <tex-math>$1^{st}$ </tex-math></inline-formula> and significantly outperforms the state-of-the-art method by 13.1% and 19.0% at Intersection of Union (IoU) of 0.8 under “Clear+Foggy” training conditions for “Clear” and “Foggy” testing, respectively. Our code repository is available at: <uri>https://github.com/yyxr75/RaLiBEV</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4130-4143"},"PeriodicalIF":8.3,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}