IEEE Transactions on Circuits and Systems for Video Technology: Latest Articles

Cross-Domain Animal Pose Estimation With Skeleton Anomaly-Aware Learning
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-04 DOI: 10.1109/TCSVT.2025.3557844
Le Han; Kaixuan Chen; Lei Zhao; Yangbo Jiang; Pengfei Wang; Nenggan Zheng
{"title":"Cross-Domain Animal Pose Estimation With Skeleton Anomaly-Aware Learning","authors":"Le Han;Kaixuan Chen;Lei Zhao;Yangbo Jiang;Pengfei Wang;Nenggan Zheng","doi":"10.1109/TCSVT.2025.3557844","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557844","url":null,"abstract":"Animal pose estimation is often constrained by the scarcity of annotations and the diversity of scenarios and species. The pseudo-label generation based unsupervised domain adaptation paradigm, which discriminates the predicted keypoints of unlabeled data based on the skeleton position consistency, has demonstrated effectiveness for such problems. However, existing methods generate pseudo-labels with massive false positives, because they cannot effectively distinguish sample pairs with the same errors. In this study, we propose a cross-domain animal pose estimation model from a novel perspective of skeleton anomaly learning. We construct a graph contrastive learning mechanism to acquire the skeleton anomaly-aware knowledge, which enables the generation of accurate pseudo-labels for target domain and imposes graph constraint on unlabeled data. And a skeleton anomaly-feedback based domain adaptation framework is designed to facilitate implicit alignment of object-specific features and joint training of cross-domain. Besides, we propose a novel rat pose dataset named UDARP-9.4K to address the gap of small-sized animal pose datasets encompassing diverse experimental scenarios. The related datasets are reviewed and evaluated in detail. Extensive experiments are conducted on UDARP-9.4K and two public datasets to demonstrate the superiority of the proposed model in cross-scenarios and cross-species animal pose estimation tasks. Further analysis reveals the effectiveness of the proposed model for skeleton structure feature learning. <italic>The UDARP-9.4K dataset is available here</i> <uri>https://github.com/CSDLLab/UDARP-9.4K-Dataset</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9148-9160"},"PeriodicalIF":11.1,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
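The core of the method above is a graph contrastive mechanism that makes skeleton embeddings sensitive to anomalous joint configurations. The following is a minimal PyTorch sketch of that general idea, not the authors' implementation: a tiny graph encoder embeds skeletons, and an InfoNCE loss treats two light augmentations of a clean skeleton as positives and heavily perturbed ("anomalous") skeletons as extra negatives. The encoder, the perturbation scheme, and all names and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def skeleton_embed(keypoints, adj, w1, w2):
    """Tiny two-layer graph encoder. keypoints: (B, J, 2); adj: (J, J)."""
    h = torch.relu(adj @ keypoints @ w1)        # propagate over joints, project 2 -> H
    h = adj @ h @ w2                            # second propagation, H -> D
    return F.normalize(h.mean(dim=1), dim=-1)   # mean-pool joints -> (B, D)

def anomaly_contrastive_loss(view1, view2, anomalous, adj, w1, w2, tau=0.1):
    """view1/view2: two augmentations of the same clean skeletons (B, J, 2);
    anomalous: skeletons with perturbed joints, used as extra negatives."""
    z1 = skeleton_embed(view1, adj, w1, w2)
    z2 = skeleton_embed(view2, adj, w1, w2)
    zn = skeleton_embed(anomalous, adj, w1, w2)
    bank = torch.cat([z2, zn], dim=0)           # (2B, D) candidate matches
    logits = z1 @ bank.T / tau                  # (B, 2B) similarities
    labels = torch.arange(z1.size(0))           # positive of view1[i] is view2[i]
    return F.cross_entropy(logits, labels)

# Toy usage: batch of 8 skeletons with 17 joints each.
B, J, H, D = 8, 17, 32, 64
adj = torch.eye(J)                              # stand-in adjacency matrix
w1, w2 = torch.randn(2, H), torch.randn(H, D)
clean = torch.randn(B, J, 2)
loss = anomaly_contrastive_loss(clean + 0.01 * torch.randn_like(clean),
                                clean + 0.01 * torch.randn_like(clean),
                                clean + torch.randn_like(clean),  # heavy jitter = "anomalous"
                                adj, w1, w2)
```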
MambaVT: Spatio-Temporal Contextual Modeling for Robust RGB-T Tracking
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-04 DOI: 10.1109/TCSVT.2025.3557992
Simiao Lai; Chang Liu; Jiawen Zhu; Ben Kang; Yang Liu; Dong Wang; Huchuan Lu
{"title":"MambaVT: Spatio-Temporal Contextual Modeling for Robust RGB-T Tracking","authors":"Simiao Lai;Chang Liu;Jiawen Zhu;Ben Kang;Yang Liu;Dong Wang;Huchuan Lu","doi":"10.1109/TCSVT.2025.3557992","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557992","url":null,"abstract":"Existing RGB-T tracking algorithms have made remarkable progress by leveraging the global interaction capability and extensive pre-trained models of the Transformer architecture. Nonetheless, these methods mainly adopt image-pair appearance matching and face challenges of the intrinsic high quadratic complexity of the attention mechanism, resulting in constrained exploitation of temporal information. Inspired by the recently emerged State Space Model Mamba, renowned for its impressive long sequence modeling capabilities and linear computational complexity, this work innovatively proposes a pure Mamba-based framework (<bold>MambaVT</b>) to fully exploit spatio-temporal contextual modeling for robust <bold>v</b>isible-<bold>t</b>hermal tracking. Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations, and introduce short-term historical trajectory prompts to predict the subsequent target states based on local temporal location clues. Extensive experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks while requiring lower computational costs. We aim for this work to serve as a simple yet strong baseline, stimulating future research in this field. The code and pre-trained models will be made available.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9312-9323"},"PeriodicalIF":11.1,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
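MambaVT's linear-complexity claim rests on state-space sequence modeling. As a rough illustration of why an SSM scan is O(T) in sequence length, here is a generic diagonal state-space recurrence in PyTorch; it is not the selective-scan kernel used by Mamba or by this paper, and all shapes and parameter names are assumptions.

```python
import torch

def ssm_scan(x, A, B, C):
    """x: (T, D) input tokens; A: (N,) diagonal decay; B: (N, D); C: (D, N).
    Runs h_t = A * h_{t-1} + B x_t ; y_t = C h_t in O(T) time with O(N) state."""
    T = x.shape[0]
    h = torch.zeros(A.shape[0])
    ys = []
    for t in range(T):
        h = A * h + B @ x[t]        # state update: elementwise decay + input drive
        ys.append(C @ h)            # readout back to the feature dimension
    return torch.stack(ys)          # (T, D)

# Toy usage: 100 frames of 64-d fused RGB-T tokens, 16-d hidden state.
x = torch.randn(100, 64)
A = torch.rand(16) * 0.9            # stable decay factors in (0, 0.9)
B = torch.randn(16, 64) * 0.1
C = torch.randn(64, 16) * 0.1
y = ssm_scan(x, A, B, C)            # (100, 64)
```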
DinoQuery: Promoting Small 3D Object Detection With Textual Prompt
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-04 DOI: 10.1109/TCSVT.2025.3557950
Tong Ning; Ke Lu; Xirui Jiang; Hongjuan Pei; Jian Xue
{"title":"DinoQuery: Promoting Small 3D Object Detection With Textual Prompt","authors":"Tong Ning;Ke Lu;Xirui Jiang;Hongjuan Pei;Jian Xue","doi":"10.1109/TCSVT.2025.3557950","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557950","url":null,"abstract":"Query-based 3D object detection has gained significant success in the application of autonomous driving due to its ability to achieve good performance while maintaining low computational cost. However, it still struggles with the reliable detection of small objects such as bicycles and pedestrians. To address this challenge, this paper introduces a novel sparse query-based approach, termed DinoQuery. This approach utilizes Grounding-DINO with textual prompts to select small-sized objects and generate 2D category-aware queries. These 2D category-aware queries combined with 2D global queries are then lifted to 3D queries by associating each sampled query with its respective 3D position, orientation, and size. The validity of these 3D queries, along with the 2D queries, is verified by the Comprehensive Contrastive Learning (CCL) mechanism. This is achieved by aligning all 2D and 3D queries with their respective 2D and 3D ground truth labels, and computing similarity to select true positive and false positive queries. Then a contrastive loss is introduced to enhance true positive queries and weaken false positive ones based on geometric and semantic similarity. The DinoQuery was tested on the nuScenes dataset and demonstrated excellent performance. Notably, the largest increase of our method is 3.2% on NDS and 3.1% on mAP.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8639-8652"},"PeriodicalIF":11.1,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021177","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
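The CCL mechanism labels queries as true or false positives by similarity to ground truth and then applies a contrastive objective to them. Below is a hedged sketch of that pattern; the threshold, the margin, and the shared embedding space are illustrative assumptions, not the paper's design.

```python
import torch
import torch.nn.functional as F

def query_contrastive_loss(queries, gt_embed, tp_thresh=0.5, margin=0.2):
    """queries: (Q, D) decoder queries; gt_embed: (G, D) ground-truth embeddings."""
    q = F.normalize(queries, dim=-1)
    g = F.normalize(gt_embed, dim=-1)
    best, _ = (q @ g.T).max(dim=1)              # each query's best similarity to any GT
    tp_mask = best > tp_thresh                  # true positives: close enough to some GT
    zero = best.new_zeros(())
    loss_tp = (1.0 - best[tp_mask]).mean() if tp_mask.any() else zero          # pull TPs closer
    loss_fp = F.relu(best[~tp_mask] - margin).mean() if (~tp_mask).any() else zero  # push FPs away
    return loss_tp + loss_fp

# Toy usage: 50 queries, 5 ground-truth objects, 256-d embeddings.
loss = query_contrastive_loss(torch.randn(50, 256), torch.randn(5, 256))
```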
Fully Semantic Gap Recovery for End-to-End Image Captioning
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-04 DOI: 10.1109/TCSVT.2025.3558088
Jingchun Gao; Lei Zhang; Jingyu Li; Zhendong Mao
{"title":"Fully Semantic Gap Recovery for End-to-End Image Captioning","authors":"Jingchun Gao;Lei Zhang;Jingyu Li;Zhendong Mao","doi":"10.1109/TCSVT.2025.3558088","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558088","url":null,"abstract":"Image captioning (IC) involves the comprehension of images from the visual domain to generate descriptions that are grounded in visual elements within the linguistic domain. Current image captioning methods typically rely on pre-trained unimodal visual backbones or vision-language models to identify visual entities. Subsequently, these methods employ unimodal self-attention fusion to uncover high-level semantic associations. However, we uncover this paradigm suffers from the inherent intra-modal semantic gap from the input features. Unimodal pre-trained visual features lack sufficient linguistic semantic information due to the modality misalignment. Furthermore, contrastive pre-trained vision-language models, such as CLIP, confine to the global cross-modal alignment, leading to local visual features belonging to the same object exhibiting distinct semantics. Given the semantically insufficient visual features, unimodal self-attention fusion struggles to accurately capture semantic associations among visual patches, thereby exacerbating the semantic gap. This gap results in inaccurate visual entities and associations in the generated captions. Therefore, we propose a novel Fully Semantic Gap Recovery (FSGR) method to broaden the robust cross-modal bridge of CLIP into a fine-grained level and consolidate vision-language semantic associations for more precise visual comprehension. Technically, we first propose a local contrastive learning method to aggregate the semantically similar visual patches. Next, we design a semantic quantification module to abstract the language-bridged visual map from the enhanced local visual features. Finally, fine-grained cross-modal interaction consolidates the image patches with their corresponding linguistic semantics, allowing the generation of plausible captions based on the aggregated features. Extensive experiments on comprehensive metrics demonstrate that our model has achieved new state-of-the-art performance on the MSCOCO dataset, while also exhibiting competitive cross-domain capability on the Nocaps dataset. Source code released at <uri>https://github.com/gjc0824/FSGR</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9365-9383"},"PeriodicalIF":11.1,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
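The "local contrastive learning" step aggregates semantically similar patches. Below is a minimal patch-level InfoNCE sketch under the assumption that a positive patch index is already available for each anchor (e.g., from a frozen encoder); FSGR's actual grouping rule is more involved, so this is illustrative only.

```python
import torch
import torch.nn.functional as F

def local_patch_contrastive(patches, pos_index, tau=0.1):
    """patches: (P, D) patch features from one image; pos_index: (P,) index of an
    assumed positive patch for each anchor."""
    z = F.normalize(patches, dim=-1)
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    logits = (z @ z.T / tau).masked_fill(eye, float('-inf'))  # exclude self-matches
    return F.cross_entropy(logits, pos_index)

# Toy usage: 196 patches of dim 512; a dummy cyclic assignment stands in for real positives.
patches = torch.randn(196, 512)
pos_index = (torch.arange(196) + 1) % 196
loss = local_patch_contrastive(patches, pos_index)
```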
fMRI2GES: Co-Speech Gesture Reconstruction From fMRI Signal With Dual Brain Decoding Alignment
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-04 DOI: 10.1109/TCSVT.2025.3558125
Chunzheng Zhu; Jialin Shao; Jianxin Lin; Yijun Wang; Jing Wang; Jinhui Tang; Kenli Li
{"title":"fMRI2GES: Co-Speech Gesture Reconstruction From fMRI Signal With Dual Brain Decoding Alignment","authors":"Chunzheng Zhu;Jialin Shao;Jianxin Lin;Yijun Wang;Jing Wang;Jinhui Tang;Kenli Li","doi":"10.1109/TCSVT.2025.3558125","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3558125","url":null,"abstract":"Understanding how the brain responds to external stimuli and decoding this process has been a significant challenge in neuroscience. While previous studies typically concentrated on brain-to-image and brain-to-language reconstruction, our work strives to reconstruct gestures associated with speech stimuli perceived by brain. Unfortunately, the lack of paired {brain, speech, gesture} data hinders the deployment of deep learning models for this purpose. In this paper, we introduce a novel approach, fMRI2GES, that allows training of fMRI-to-gesture reconstruction networks on unpaired data using Dual Brain Decoding Alignment. This method relies on two key components: 1) observed texts that elicit brain responses, and 2) textual descriptions associated with the gestures. Then, instead of training models in a completely supervised manner to find a mapping relationship among the three modalities, we harness an fMRI-to-text model, a text-to-gesture model with paired data and an fMRI-to-gesture model with unpaired data, establishing dual fMRI-to-gesture reconstruction patterns. Afterward, we explicitly align two outputs and train our model in a self-supervision way. We show that our proposed method can reconstruct expressive gestures directly from fMRI recordings. We also investigate fMRI signals from different ROIs in the cortex and how they affect generation results. Overall, we provide new insights into decoding co-speech gestures, thereby advancing our understanding of neuroscience and cognitive science.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9017-9029"},"PeriodicalIF":11.1,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
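The dual brain decoding alignment idea can be illustrated with two reconstruction paths whose outputs are aligned by a simple regression loss: an indirect fMRI-to-text-to-gesture path built from models trained on paired data, and a direct fMRI-to-gesture path trained on unpaired data. The sketch below uses linear layers as stand-ins for all three models; every module name and dimension is a placeholder, not the authors' architecture.

```python
import torch
import torch.nn as nn

class DualDecodingAlignment(nn.Module):
    def __init__(self, fmri_dim=4096, text_dim=512, gesture_dim=128):
        super().__init__()
        self.fmri_to_text = nn.Linear(fmri_dim, text_dim)        # stand-in decoders
        self.text_to_gesture = nn.Linear(text_dim, gesture_dim)
        self.fmri_to_gesture = nn.Linear(fmri_dim, gesture_dim)

    def forward(self, fmri):
        g_via_text = self.text_to_gesture(self.fmri_to_text(fmri))  # indirect path
        g_direct = self.fmri_to_gesture(fmri)                        # direct path
        # self-supervised alignment: the direct path regresses onto the indirect one
        align_loss = nn.functional.mse_loss(g_direct, g_via_text.detach())
        return g_direct, align_loss

# Toy usage: a batch of 8 fMRI vectors.
model = DualDecodingAlignment()
gesture, loss = model(torch.randn(8, 4096))
```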
Meta-Learning With Task-Adaptive Selection
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-03 DOI: 10.1109/TCSVT.2025.3557706
Quan Wan; Maofa Wang; Weifeng Shan; Bin Wang; Lu Zhang; Zhixiong Leng; Bingchen Yan; Yanlin Xu; Huiling Chen
{"title":"Meta-Learning With Task-Adaptive Selection","authors":"Quan Wan;Maofa Wang;Weifeng Shan;Bin Wang;Lu Zhang;Zhixiong Leng;Bingchen Yan;Yanlin Xu;Huiling Chen","doi":"10.1109/TCSVT.2025.3557706","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557706","url":null,"abstract":"The gradient-based meta-learning algorithm gains meta-learning parameters from a pool of tasks. Starting from the obtained meta-learning parameters, it can achieve better results through fast fine-tuning with only a few gradient descent updates. The two-layer meta-learning approach that shares initialization parameters has achieved good results in solving few-shot learning domain. However, in the training of multiple similar tasks in the inner layer, the difficulty and benefits of the tasks have been consistently overlooked, resulting in conflicts between tasks and ultimately compromising the model to unexpected positions. Therefore, this paper proposes a task-adaptive selection meta-learning algorithm called TSML. Specifically, we construct a task selection trainer to assess the difficulty of tasks and calculate their future benefits. Designing more optimal training strategies for each task based on difficulty and benefit, altering the current compromise in multi-task settings, and balancing the impact of tasks on meta-learning parameters. Additionally, the outer meta-parameter updating method for traditional meta-learning has been adjusted, enabling the meta-parameters to attain a better position. By doing so, we can rapidly improve the generalization and convergence of the meta-learning parameters on unknown tasks. Experimental results indicate a 2.1% improvement over the base model in the 4-conv setting, with a more pronounced effect as the neural network is progressively complexified, reaching a 4.1% improvement in resnet12.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8627-8638"},"PeriodicalIF":11.1,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
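TSML weights inner-loop tasks by difficulty and expected benefit before the outer update. The sketch below shows the general shape of such a weighted MAML-style step on a deliberately tiny linear model; the softmax-over-support-loss weighting rule is an illustrative stand-in for the paper's task-selection trainer, not its actual criterion.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def weighted_maml_step(model, tasks, meta_opt, inner_lr=0.01):
    """model: a single nn.Linear (kept simple so the inner update can be written
    functionally); tasks: list of (x_s, y_s, x_q, y_q) support/query tensors."""
    support_losses, query_losses = [], []
    for x_s, y_s, x_q, y_q in tasks:
        s_loss = F.mse_loss(model(x_s), y_s)
        w, b = model.weight, model.bias
        gw, gb = torch.autograd.grad(s_loss, (w, b), create_graph=True)
        q_pred = F.linear(x_q, w - inner_lr * gw, b - inner_lr * gb)  # one adapted step
        query_losses.append(F.mse_loss(q_pred, y_q))
        support_losses.append(s_loss.detach())
    # toy rule: harder tasks (higher support loss) get more weight in the outer update
    weights = torch.softmax(torch.stack(support_losses), dim=0)
    meta_loss = (weights * torch.stack(query_losses)).sum()
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
    return meta_loss.item()

# Toy usage: 4 regression tasks with 5-d inputs.
model = nn.Linear(5, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
tasks = [(torch.randn(10, 5), torch.randn(10, 1),
          torch.randn(10, 5), torch.randn(10, 1)) for _ in range(4)]
weighted_maml_step(model, tasks, opt)
```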
FRPGS: Fast, Robust, and Photorealistic Monocular Dynamic Scene Reconstruction With Deformable 3D Gaussians
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-02 DOI: 10.1109/TCSVT.2025.3557012
Wan Li; Xiao Pan; Jiaxin Lin; Ping Lu; Daquan Feng; Wenzhe Shi
{"title":"FRPGS: Fast, Robust, and Photorealistic Monocular Dynamic Scene Reconstruction With Deformable 3D Gaussians","authors":"Wan Li;Xiao Pan;Jiaxin Lin;Ping Lu;Daquan Feng;Wenzhe Shi","doi":"10.1109/TCSVT.2025.3557012","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557012","url":null,"abstract":"Dynamic reconstruction technology presents significant promise for applications in visual and interactive fields. Current techniques utilizing 3D Gaussian Splatting show favorable results and fast reconstruction speed. However, as scene expanding, using individual Gaussian structure 1) leads to instability in large-scale dynamic reconstruction, marked by abrupt deformation, and 2) the heuristic densification of individuals suffers significant redundancy. Tackling these issues, we propose a jointed Gaussian representation method named FRPGS, which learns the global information and the deformation using center Gaussians and generates the neural Gaussians around them for local detail. Specifically, FRPGS employs center Gaussians initialized from point clouds, which are learned with a deformation field for representing global relationships and dynamic motion over time. Then, for each center Gaussian, attribute networks generate neural Gaussians that move under the linked center Gaussian driving, thereby ensuring structural integrity during movement within this joint-based representation. Finally, to reduce Gaussian redundancy, a densification strategy is developed based on the average cumulative gradient of the associated neural Gaussians, imposing strict limits on the growing of center Gaussians without compromising accuracy. Additionally, we established a large-scale dynamic indoor dataset at the MuLong Laboratory of ZTE Corporation. Evaluations demonstrate that FRPGS significantly outperforms state-of-the-art methods in both training efficiency and reconstruction quality, achieving over a 50% (up to 74%) improvement in efficiency on an RTX 4090. FRPGS also supports the 4K resolution reconstruction of 60 frames simultaneously.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9119-9131"},"PeriodicalIF":11.1,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021175","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
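The densification strategy keys on the average cumulative gradient of the neural Gaussians attached to each center Gaussian, with a strict cap on growth. A toy version of that bookkeeping is sketched below; the threshold, jitter, cap, and clone rule are all assumptions rather than the paper's actual procedure.

```python
import torch

def densify_centers(centers, accum_grad, num_steps, grad_thresh=2e-4, max_centers=100_000):
    """centers: (M, 3) center-Gaussian positions; accum_grad: (M,) summed gradient
    magnitudes of the associated neural Gaussians over num_steps iterations."""
    avg_grad = accum_grad / max(num_steps, 1)
    grow_mask = avg_grad > grad_thresh
    budget = max(max_centers - centers.shape[0], 0)      # strict limit on growth
    if budget == 0 or not grow_mask.any():
        return centers
    idx = torch.nonzero(grow_mask).squeeze(1)
    idx = idx[avg_grad[idx].argsort(descending=True)][:budget]   # highest-gradient first
    new_centers = centers[idx] + 0.01 * torch.randn_like(centers[idx])  # jittered clones
    return torch.cat([centers, new_centers], dim=0)

# Toy usage: 1000 centers with random accumulated gradients over 100 iterations.
centers = torch.rand(1000, 3)
accum_grad = torch.rand(1000) * 0.1
centers = densify_centers(centers, accum_grad, num_steps=100)
```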
On Modulating Motion-Aware Visual-Language Representation for Few-Shot Action Recognition
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-02 DOI: 10.1109/TCSVT.2025.3557009
Pengfei Fang; Qiang Xu; Zixuan Lin; Hui Xue
{"title":"On Modulating Motion-Aware Visual-Language Representation for Few-Shot Action Recognition","authors":"Pengfei Fang;Qiang Xu;Zixuan Lin;Hui Xue","doi":"10.1109/TCSVT.2025.3557009","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557009","url":null,"abstract":"This paper focuses on few-shot action recognition (FSAR), where the machine is required to understand human actions, with each only seeing a few video samples. Even with only a few explorations, the most cutting-edge methods employ the action textual features, pre-trained by a visual-language model (VLM), as a cue to optimize video prototypes. However, the action textual features used in these methods are generated from a static prompt, causing the network to overlook rich motion cues within videos. To tackle this issue, we propose a novel framework, namely, <underline>mo</u>tion-aware <underline>v</u>isual-language r<underline>e</u>presentation modulation <underline>net</u>work (MoveNet). The proposed MoveNet utilizes dynamic motion cues within videos to integrate motion-aware textual and visual feature representations, as a way to modulate the video prototypes. In doing so, a long short motion aggregation module (LSMAM) is first proposed to capture diverse motion cues. Having the motion cues at hand, a motion-conditional prompting module (MCPM) utilizes the motion cues as conditions to boost the semantic associations between textual features and action classes. One further develops a motion-guided visual refinement module (MVRM) that adopts motion cues as guidance in enhancing local frame features. The proposed components compensate for each other and contribute to significant performance gains over the FASR task. Thorough experiments on five standard benchmarks demonstrate the effectiveness of the proposed method, considerably outperforming current state-of-the-art methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"8614-8626"},"PeriodicalIF":11.1,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
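The motion-conditional prompting module (MCPM) conditions learnable prompt tokens on motion cues. The snippet below sketches one plausible realization using FiLM-style scale-and-shift modulation of a prompt bank; it is an assumed stand-in, not the paper's module.

```python
import torch
import torch.nn as nn

class MotionConditionalPrompt(nn.Module):
    def __init__(self, n_tokens=8, dim=512, motion_dim=256):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)  # learnable context tokens
        self.to_scale = nn.Linear(motion_dim, dim)
        self.to_shift = nn.Linear(motion_dim, dim)

    def forward(self, motion_feat):
        """motion_feat: (B, motion_dim) aggregated motion cue -> (B, n_tokens, dim)."""
        scale = self.to_scale(motion_feat).unsqueeze(1)   # (B, 1, dim)
        shift = self.to_shift(motion_feat).unsqueeze(1)
        return self.prompt.unsqueeze(0) * (1 + scale) + shift

# Toy usage: prompts for a batch of 4 clips, ready to feed a frozen text encoder.
prompts = MotionConditionalPrompt()(torch.randn(4, 256))   # (4, 8, 512)
```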
Hybrid Siamese Masked Autoencoders as Unsupervised Video Summarizer
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-02 DOI: 10.1109/TCSVT.2025.3557254
Yifei Xu; Zaiqiang Wu; Li Li; Siqi Li; Wenlong Li; Mingqi Li; Yuan Rao; Shuiguang Deng
{"title":"Hybrid Siamese Masked Autoencoders as Unsupervised Video Summarizer","authors":"Yifei Xu;Zaiqiang Wu;Li Li;Siqi Li;Wenlong Li;Mingqi Li;Yuan Rao;Shuiguang Deng","doi":"10.1109/TCSVT.2025.3557254","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557254","url":null,"abstract":"Video summarization aims to seek the most important information from a source video while still retaining its primary content. In practical application, unsupervised video summarizers are acknowledged for their flexibility and superiority without requiring annotated data. However, they are looking for the determined rules on how much each frame is essential enough to be selected as a summary. Unlike conventional frame-based scoring methods, we propose a shot-level unsupervised video summarizer termed Hybrid Siamese Masked Autoencoders (H-SMAE) from a higher semantic perspective. Specifically, our method consists of Multi-view Siamese Masked Autoencoders (MV-SMAE) and Shot Diversity Enhancer (SDE). MV-SMAE tries to recover the masked shots from original frame feature and three unmasked shot subsets with elaborate Siamese masked autoencoders. Inspired by the masking idea in MAE, MV-SMAE introduces a Siamese architecture to model prior references to guide the reconstruction of masked shots. Besides, SDE improves the diversity of generated summary by minimizing the repelling loss among selected shots. Afterward, these two modules are fused followed by 0-1 knapsack algorithm to produce a video summary. Experiments on two challenging and diverse datasets demonstrate that our approach outperforms other state-of-the-art unsupervised and weakly-supervised methods, and even generates comparable results with several excellent supervised methods. The source code of H-SMAE is available at <uri>https://github.com/wzq0214/H-SMAE</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9487-9501"},"PeriodicalIF":11.1,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021440","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
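The final summary assembly uses a 0-1 knapsack over shot scores and lengths, a standard step in shot-level summarization: pick the subset of shots that maximizes total importance within a duration budget (commonly around 15% of the video). A self-contained dynamic-programming version follows; the budget and toy numbers are illustrative.

```python
def knapsack_summary(scores, lengths, budget):
    """scores, lengths: per-shot lists; budget: maximum total length (e.g., in frames)."""
    n = len(scores)
    dp = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for c in range(budget + 1):
            dp[i][c] = dp[i - 1][c]                       # skip shot i-1
            if lengths[i - 1] <= c:                       # or take it if it fits
                dp[i][c] = max(dp[i][c], dp[i - 1][c - lengths[i - 1]] + scores[i - 1])
    # backtrack to recover the selected shot indices
    selected, c = [], budget
    for i in range(n, 0, -1):
        if dp[i][c] != dp[i - 1][c]:
            selected.append(i - 1)
            c -= lengths[i - 1]
    return sorted(selected)

# Toy usage: 6 shots, keep at most 150 frames of a 1000-frame video (15%).
print(knapsack_summary([0.9, 0.2, 0.7, 0.4, 0.8, 0.1], [60, 40, 50, 30, 70, 20], 150))
```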
Learning Language Prompt for Vision-Language Tracking
IF 11.1 | CAS Tier 1 (Engineering & Technology)
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2025-04-02 DOI: 10.1109/TCSVT.2025.3557053
Chengao Zong; Jie Zhao; Xin Chen; Huchuan Lu; Dong Wang
{"title":"Learning Language Prompt for Vision-Language Tracking","authors":"Chengao Zong;Jie Zhao;Xin Chen;Huchuan Lu;Dong Wang","doi":"10.1109/TCSVT.2025.3557053","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3557053","url":null,"abstract":"Vision-language object tracking integrates advanced linguistic information, enhancing its robustness and accuracy in complex scenarios. Nevertheless, current methods are constrained by a lack of sufficient vision-language data, making it challenging for the model to learn generalized knowledge. To alleviate this issue, we propose a new prompt-based framework for vision-language tracking, named ProVLT. This framework casts language information as a prompt for pretrained vision-based tracking models, thereby leveraging the knowledge from extensive tracking data. Experiments demonstrate that ProVLT achieves competitive performance while training only a fraction of parameters (approximately 29% of modal parameters). For instance, ProVLT achieves competitive performance, attaining AUC of 59.8% on TNL2K benchmark. Furthermore, we augment five mainstream vision-only tracking benchmarks with language annotations, and find that the inclusion of linguistic information consistently improves tracking performance. On these benchmarks, the linguistic information improves the performance by an average of 2.9% compared with the vision-based tracker. We will release the code, models, and benchmarks for the community.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 9","pages":"9287-9299"},"PeriodicalIF":11.1,"publicationDate":"2025-04-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145021431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
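ProVLT trains only a language-prompt branch on top of a frozen pretrained vision tracker, which is why only a fraction of the parameters receive gradients. The sketch below shows that freezing pattern with placeholder modules; it is not the ProVLT architecture, and the tiny stand-in backbone is purely for the demo.

```python
import torch.nn as nn

class PromptedTracker(nn.Module):
    def __init__(self, vision_tracker: nn.Module, text_dim=512, feat_dim=256):
        super().__init__()
        self.vision_tracker = vision_tracker
        self.language_prompt = nn.Sequential(              # trainable prompt branch
            nn.Linear(text_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim))
        for p in self.vision_tracker.parameters():         # freeze pretrained weights
            p.requires_grad_(False)

    def trainable_parameters(self):
        return [p for p in self.parameters() if p.requires_grad]

# Toy usage: count how much of the model is actually trained.
tracker = PromptedTracker(vision_tracker=nn.Linear(256, 256))   # stand-in backbone
n_train = sum(p.numel() for p in tracker.trainable_parameters())
n_total = sum(p.numel() for p in tracker.parameters())
print(f"training {n_train / n_total:.1%} of parameters")
```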