IEEE Transactions on Circuits and Systems for Video Technology: Latest Articles

Task–Adapted Learnable Embedded Quantization for Scalable Human-Machine Image Compression
IF 8.3, CAS Q1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-01-03 DOI: 10.1109/TCSVT.2025.3525664
Shaohui Li;Shuoyu Ma;Wenrui Dai;Nuowen Kan;Fan Cheng;Chenglin Li;Junni Zou;Hongkai Xiong
{"title":"Task–Adapted Learnable Embedded Quantization for Scalable Human-Machine Image Compression","authors":"Shaohui Li;Shuoyu Ma;Wenrui Dai;Nuowen Kan;Fan Cheng;Chenglin Li;Junni Zou;Hongkai Xiong","doi":"10.1109/TCSVT.2025.3525664","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3525664","url":null,"abstract":"Image compression for both human and machine vision has become prevailing to accommodate to rising demands for machine-machine and human-machine communications. Scalable human-machine image compression is recently emerging as an efficient alternative to simultaneously achieve high accuracy for machine vision in the base layer and obtain high-fidelity reconstruction for human vision in the enhancement layer. However, existing methods achieve scalable coding with heuristic mechanisms, which cannot fully exploit the inter-layer correlations and evidently sacrifice rate-distortion performance. In this paper, we propose task-adapted learnable embedded quantization to address this problem in an analytically optimized fashion. We first reveal the relationship between the latent representations for machine and human vision and demonstrate that optimal representation for machine vision can be approximated with post-training optimization on the learned representation for human vision. On such basis, we propose task-adapted learnable embedded quantization that leverages learnable step predictor to adaptively determine the optimal quantization step for diverse machine vision tasks such that inter-layer correlations between representations for human and machine vision are sufficiently exploited using embedded quantization. Furthermore, we develop a human-machine scalable coding framework by incorporating the proposed embedded quantization into pre-trained learned image compression models. Experimental results demonstrate that the proposed framework achieves state-of-the-art performance on machine vision tasks like object detection, instance segmentation, and panoptic segmentation with negligible loss in rate-distortion performance for human vision.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4768-4783"},"PeriodicalIF":8.3,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
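The learnable step predictor at the core of this approach can be illustrated with a minimal sketch: a small network maps a task embedding to a positive quantization step that re-quantizes a latent from a pre-trained learned image compression model into a coarser base layer. Everything below (the task embedding, the MLP predictor, and the straight-through rounding) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TaskAdaptedQuantizer(nn.Module):
    """Sketch of a learnable step predictor for embedded quantization.

    A small MLP maps a task embedding to a per-channel quantization step that
    re-quantizes a latent learned for human-vision reconstruction into a
    coarser base layer for a machine-vision task.
    """
    def __init__(self, num_tasks: int, latent_channels: int):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, 64)
        self.step_predictor = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, latent_channels), nn.Softplus(),  # steps must be positive
        )

    def forward(self, y: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
        # y: latent from a pre-trained learned image compression encoder, (B, C, H, W)
        step = self.step_predictor(self.task_embed(task_id))    # (B, C)
        step = step.unsqueeze(-1).unsqueeze(-1) + 1e-6          # broadcast to (B, C, 1, 1)
        # Straight-through rounding keeps the step predictor trainable.
        y_scaled = y / step
        y_hat = y_scaled + (torch.round(y_scaled) - y_scaled).detach()
        return y_hat * step                                     # coarsely quantized base-layer latent

# Usage: quantize a dummy latent for task 0 (e.g., detection).
quant = TaskAdaptedQuantizer(num_tasks=3, latent_channels=192)
y = torch.randn(2, 192, 16, 16)
y_base = quant(y, torch.tensor([0, 0]))
```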
Enhancing Real-Time Object Detection With Optical Flow-Guided Streaming Perception
IF 8.3, CAS Q1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-01-03 DOI: 10.1109/TCSVT.2025.3525796
Tongbo Wang;Lin Zhu;Hua Huang
{"title":"Enhancing Real-Time Object Detection With Optical Flow-Guided Streaming Perception","authors":"Tongbo Wang;Lin Zhu;Hua Huang","doi":"10.1109/TCSVT.2025.3525796","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3525796","url":null,"abstract":"Real-time object detection in Unmanned Aerial Vehicle (UAV) videos remains a significant challenge due to the fast motion and small scale of objects. Existing streaming perception models struggle to accurately capture fine-grained motion cues between consecutive frames, leading to suboptimal performance in dynamic UAV scenarios. To address these challenges, StreamFlow is proposed to integrate optical flow information and enhance real-time object detection in UAV videos. StreamFlow incorporates Flow-Guided Dynamic Prediction (FGDP) to refine position predictions using local optical flow information and Optical Flow Guided Optimization (OFGO) to optimize model parameters considering both localization loss and optical flow reliability. Central to OFGO is the Adaptive Flow Weighting (AFW) module, which focuses on reliable flow samples during training. The proposed integration of optical flow and adaptive weighting scheme significantly enhances the ability of streaming perception models to handle fast-moving objects in dynamic UAV environments. Extensive experiments on four challenging UAV video datasets demonstrate the superior performance of StreamFlow compared to state-of-the-art methods in terms of accuracy.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4816-4830"},"PeriodicalIF":8.3,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
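As a rough illustration of flow-guided position prediction, the sketch below shifts each detected box by the mean optical flow inside it to anticipate its next-frame position. This is a simplified stand-in for FGDP; the use of a plain per-box mean flow is an assumption.

```python
import numpy as np

def flow_guided_box_shift(boxes: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Shift each box (x1, y1, x2, y2) by the mean optical flow inside it.

    boxes: (N, 4) array of pixel coordinates.
    flow:  (H, W, 2) dense flow from the current frame to the next frame.
    Returns the (N, 4) boxes anticipated for the next frame.
    """
    h, w = flow.shape[:2]
    shifted = boxes.astype(np.float32).copy()
    for i, (x1, y1, x2, y2) in enumerate(boxes.astype(int)):
        x1, y1 = max(x1, 0), max(y1, 0)
        x2, y2 = min(x2, w - 1), min(y2, h - 1)
        if x2 <= x1 or y2 <= y1:
            continue  # degenerate box: leave unchanged
        local = flow[y1:y2, x1:x2]                       # flow vectors inside the box
        dx, dy = local[..., 0].mean(), local[..., 1].mean()
        shifted[i] += np.array([dx, dy, dx, dy], dtype=np.float32)
    return shifted

# Usage with random data standing in for a frame's detections and flow.
boxes = np.array([[30, 40, 80, 120], [200, 150, 260, 230]], dtype=np.float32)
flow = np.random.randn(480, 640, 2).astype(np.float32)
print(flow_guided_box_shift(boxes, flow))
```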
Flexible Temperature Parallel Distillation for Dense Object Detection: Make Response-Based Knowledge Distillation Great Again
IF 8.3, CAS Q1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-01-03 DOI: 10.1109/TCSVT.2024.3525051
Yaoye Song;Peng Zhang;Wei Huang;Yufei Zha;Yanning Zhang
{"title":"Flexible Temperature Parallel Distillation for Dense Object Detection: Make Response-Based Knowledge Distillation Great Again","authors":"Yaoye Song;Peng Zhang;Wei Huang;Yufei Zha;Yanning Zhang","doi":"10.1109/TCSVT.2024.3525051","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3525051","url":null,"abstract":"Feature-based approaches have been the focal point of previous research on knowledge distillation (KD) for dense object detection. These methods employ feature imitation and result in competitive performance. Despite being able to achieve comparable performance in image recognition, response-based KD methods can not reach the same level in dense object detection. Inspired by improving distillation performance from two key aspects: where to distill and how to distill, in this paper, a parallel distillation (PD) is introduced to fully utilize the sophisticated detection head and transfer all the output responses from the teacher to the student efficiently. In particular, the proposed PD takes an important consideration of the specific location of distillation, which is crucial for effective knowledge transfer. Regarding the discrepancies in output responses between the localization branch and the classification branch, we propose a novel Dynamic Localization Temperature (DLT) module to enhance the precision of distilling localization information. As for the classification branch, a Classification Temperature-Free (CTF) module is also designed to increase the robustness of distillation in heterogeneous networks. By incorporating the DLT and CTF into the PD framework to avoid setting temperature values manually, the Flexible Temperature Parallel Distillation (FTPD) is proposed to achieve a state-of-the-art (SOTA) performance, which can also be further combined with mainstream feature-based methods for better results. In terms of accuracy and robustness with extensive experiments, the proposed FTPD outperforms other KD methods in the task of dense object detection.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4963-4975"},"PeriodicalIF":8.3,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
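A minimal sketch of parallel response distillation is given below: classification logits are distilled after a temperature-free normalization (each logit vector scaled by its own standard deviation), while discretized localization distributions are distilled with a per-sample temperature derived from the teacher's entropy. Both scaling rules are heuristics standing in for the paper's CTF and DLT modules, and the generalized-focal-loss-style box representation is an assumption.

```python
import torch
import torch.nn.functional as F

def response_kd_loss(stu_cls, tea_cls, stu_loc, tea_loc):
    """Sketch of parallel response distillation for a dense detector.

    stu_cls / tea_cls: (N, num_classes) classification logits.
    stu_loc / tea_loc: (N, 4, bins) discretized box-side distributions.
    """
    # Classification branch: "temperature-free" by standardizing each logit
    # vector, so no manual temperature is set (heuristic stand-in for CTF).
    stu_n = stu_cls / stu_cls.std(dim=-1, keepdim=True).clamp_min(1e-6)
    tea_n = tea_cls / tea_cls.std(dim=-1, keepdim=True).clamp_min(1e-6)
    cls_loss = F.kl_div(F.log_softmax(stu_n, dim=-1),
                        F.softmax(tea_n, dim=-1), reduction="batchmean")

    # Localization branch: per-sample temperature grows with teacher
    # uncertainty, i.e. distribution entropy (heuristic stand-in for DLT).
    tea_prob = F.softmax(tea_loc, dim=-1)
    entropy = -(tea_prob * tea_prob.clamp_min(1e-8).log()).sum(-1, keepdim=True)
    t_loc = 1.0 + entropy                               # (N, 4, 1), broadcasts over bins
    loc_loss = F.kl_div(F.log_softmax(stu_loc / t_loc, dim=-1),
                        F.softmax(tea_loc / t_loc, dim=-1), reduction="batchmean")
    return cls_loss + loc_loss

# Usage with random logits standing in for teacher/student head outputs.
loss = response_kd_loss(torch.randn(128, 80), torch.randn(128, 80),
                        torch.randn(128, 4, 16), torch.randn(128, 4, 16))
```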
S3F2Net: Spatial-Spectral-Structural Feature Fusion Network for Hyperspectral Image and LiDAR Data Classification
IF 8.3, CAS Q1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-01-03 DOI: 10.1109/TCSVT.2025.3525734
Xianghai Wang;Liyang Song;Yining Feng;Junheng Zhu
{"title":"S3F2Net: Spatial-Spectral-Structural Feature Fusion Network for Hyperspectral Image and LiDAR Data Classification","authors":"Xianghai Wang;Liyang Song;Yining Feng;Junheng Zhu","doi":"10.1109/TCSVT.2025.3525734","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3525734","url":null,"abstract":"The continuous development of Earth observation (EO) technology has significantly increased the availability of multi-sensor remote sensing (RS) data. The fusion of hyperspectral image (HSI) and light detection and ranging (LiDAR) data has become a research hotspot. Current mainstream convolutional neural networks (CNNs) excel at extracting local features from images but have limitations in modeling global information, which may affect the performance of classification tasks. In contrast, modern graph convolutional networks (GCNs) excel at capturing global information, particularly demonstrating significant advantages when processing RS images with irregular topological structures. By integrating these two frameworks, features can be fused from multiple perspectives, enabling a more comprehensive capture of multimodal data attributes and improving classification performance. The paper proposes a spatial-spectral-structural feature fusion network (S3F2Net) for HSI and LiDAR data classification. S3F2Net utilizes multiple architectures to extract rich features of multimodal data from different perspectives. On one hand, local spatial and spectral features of multimodal data are extracted using CNN, enhancing interactions among heterogeneous data through shared-weight convolution to achieve detailed representations of land cover. On the other hand, the global topological structure is learned using GCN, which models the spatial relationships between land cover types through graph structure constructed from LiDAR data, thereby enhancing the model’s understanding of scene content. Furthermore, the dynamic node updating strategy within the GCN enhances the model’s ability to identify representative nodes for specific land cover types while facilitating information aggregation among remote nodes, thereby strengthening adaptability to complex topological structures. By employing a multi-level information fusion strategy to integrate data representations from both global and local perspectives, the accuracy and reliability of the results are ensured. Compared with state-of-the-art (SOTA) methods, the framework’s validity is verified on three real multimodal RS datasets. The source code will be available at <uri>https://github.com/slylnnu/S3F2Net</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4801-4815"},"PeriodicalIF":8.3,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
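The dual-branch idea, a shared-weight CNN for local spatial-spectral features plus a graph convolution for global structure, can be sketched as follows. The layer widths, the single-layer GCN, and the concatenation-based fusion are illustrative assumptions rather than the S3F2Net architecture.

```python
import torch
import torch.nn as nn

class SimpleGCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(A_norm @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj_norm):
        return torch.relu(adj_norm @ self.lin(x))

class DualBranchFusion(nn.Module):
    """Sketch: shared-weight CNN for local HSI/LiDAR patches + GCN for global structure."""
    def __init__(self, hsi_bands, num_classes, node_dim, gcn_dim=64):
        super().__init__()
        self.proj_hsi = nn.Conv2d(hsi_bands, 32, 1)   # map both modalities to a
        self.proj_lidar = nn.Conv2d(1, 32, 1)         # shared channel width
        self.shared_cnn = nn.Sequential(              # weights shared across modalities
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.gcn = SimpleGCNLayer(node_dim, gcn_dim)
        self.classifier = nn.Linear(64 + 64 + gcn_dim, num_classes)

    def forward(self, hsi_patch, lidar_patch, node_feats, adj_norm, node_idx):
        f_hsi = self.shared_cnn(self.proj_hsi(hsi_patch))
        f_lidar = self.shared_cnn(self.proj_lidar(lidar_patch))
        f_graph = self.gcn(node_feats, adj_norm)[node_idx]   # global feature of each pixel's node
        return self.classifier(torch.cat([f_hsi, f_lidar, f_graph], dim=-1))

# Usage with random tensors standing in for patches and a normalized adjacency.
model = DualBranchFusion(hsi_bands=144, num_classes=15, node_dim=16)
out = model(torch.randn(8, 144, 9, 9), torch.randn(8, 1, 9, 9),
            torch.randn(100, 16), torch.eye(100), torch.randint(0, 100, (8,)))
```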
Enriched Image Captioning Based on Knowledge Divergence and Focus
IF 8.3, CAS Q1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-01-03 DOI: 10.1109/TCSVT.2024.3525158
An-An Liu;Quanhan Wu;Ning Xu;Hongshuo Tian;Lanjun Wang
{"title":"Enriched Image Captioning Based on Knowledge Divergence and Focus","authors":"An-An Liu;Quanhan Wu;Ning Xu;Hongshuo Tian;Lanjun Wang","doi":"10.1109/TCSVT.2024.3525158","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3525158","url":null,"abstract":"Image captioning is a fundamental task in computer vision that aims to generate precise and comprehensive descriptions of images automatically. Intuitively, humans initially rely on the image content, e.g., “cake on a plate”, to gradually gather relevant knowledge facts e.g., “birthday party”, “candles”, which is a process referred to as divergence. Then, we perform step-by-step reasoning based on the images to refine, and rearrange these knowledge facts for explicit sentence generation, a process referred to as focus. However, existing image captioning methods mainly rely on the encode-decode framework that does not well fit the “divergence-focus” nature of the task. To this end, we propose the knowledge “divergence-focus” method for Image Captioning (K-DFIC) to gather and polish knowledge facts for image understanding, which consists of two components: 1) Knowledge Divergence Module aims to leverage the divergence capability of large-scale pre-trained model to acquire knowledge facts relevant to the image content. To achieve this, we design a scene-graph-aware prompt that serves as a “trigger” for GPT-3.5, encouraging it to “diverge” and generate more sophisticated, human-like knowledge. 2) Knowledge Focus Module aims to refine acquired knowledge facts and further rearrange them in a coherent manner. We design the interactive refining network to encode knowledge, which is refined with the visual features to remove irrelevant words. Then, to generate fluent image descriptions, we design the large-scale pre-trained model-based rearrangement method to estimate the importance of each knowledge word for an image. Finally, we fuse the refined knowledge and visual features to assist the decoder in generating captions. We demonstrate the superiority of our approach through extensive experiments on the MSCOCO dataset. Our approach surpasses state-of-the-art performance across all metrics in the Karpathy split. For example, our model obtains the best CIDEr-D score of 148.4%. Additional ablation studies and visualization further validate our effectiveness.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4937-4948"},"PeriodicalIF":8.3,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913465","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
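A minimal sketch of the divergence step is shown below: scene-graph triplets extracted from the image are turned into a prompt that asks a large language model for associated knowledge facts. The prompt wording is purely illustrative and is not the paper's actual scene-graph-aware prompt.

```python
def build_divergence_prompt(triplets, max_facts: int = 10) -> str:
    """Sketch of a scene-graph-aware prompt for knowledge divergence.

    triplets: list of (subject, relation, object) tuples extracted from the image,
              e.g. [("cake", "on", "plate"), ("candle", "on", "cake")].
    The returned string would be sent to a large language model (GPT-3.5 in the
    paper) to elicit related knowledge facts; the exact wording here is illustrative.
    """
    scene = "; ".join(f"{s} {r} {o}" for s, r, o in triplets)
    return (
        "The following relations were detected in an image: "
        f"{scene}. "
        f"List up to {max_facts} short, commonsense facts or concepts that a person "
        "would associate with this scene (e.g. events, typical objects, activities), "
        "one per line."
    )

# Usage: build a prompt for a toy scene graph.
print(build_divergence_prompt([("cake", "on", "plate"), ("candle", "on", "cake")]))
```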
Bridging Domain Gap of Point Cloud Representations via Self-Supervised Geometric Augmentation
IF 8.3, CAS Q1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-01-03 DOI: 10.1109/TCSVT.2024.3525052
Li Yu;Hongchao Zhong;Longkun Zou;Ke Chen;Pan Gao
{"title":"Bridging Domain Gap of Point Cloud Representations via Self-Supervised Geometric Augmentation","authors":"Li Yu;Hongchao Zhong;Longkun Zou;Ke Chen;Pan Gao","doi":"10.1109/TCSVT.2024.3525052","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3525052","url":null,"abstract":"Recent progress of semantic point clouds analysis is largely driven by synthetic data (e.g., the ModelNet and the ShapeNet), which are typically complete, well-aligned and noisy-free. Therefore, representations of those ideal synthetic point clouds have limited variations in the geometric perspective and can gain good performance on a number of 3D vision tasks such as point cloud classification. In the context of unsupervised domain adaptation (UDA), representation learning designed for synthetic point clouds can hardly capture domain invariant geometric patterns from incomplete and noisy point clouds. To address such a problem, we introduce a novel scheme for induced geometric invariance of point cloud representations across domains, via regularizing representation learning with two self-supervised geometric augmentation tasks. On one hand, a novel pretext task of predicting translation distances of augmented samples is proposed to alleviate centroid shift of point clouds due to occlusion and noises. On the other hand, we pioneer an integration of the self-supervised relational learning on geometrically-augmented point clouds in a cascade manner, utilizing the intrinsic relationship of augmented variants and other samples as extra constraints of cross-domain geometric features. Experiments on the PointDA-10 dataset demonstrate the effectiveness of the proposed method, achieving the state-of-the-art performance.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4846-4856"},"PeriodicalIF":8.3,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
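The translation-distance pretext task can be sketched in a few lines: randomly translate each point cloud, encode it with a small point encoder, and regress the length of the applied offset. The tiny PointNet-style encoder and the MSE objective below are assumptions made for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TinyPointEncoder(nn.Module):
    """Minimal PointNet-style encoder: per-point MLP followed by max pooling."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim), nn.ReLU())

    def forward(self, pts):                           # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values        # (B, feat_dim)

encoder = TinyPointEncoder()
dist_head = nn.Linear(256, 1)                         # pretext head: regress translation distance

def translation_pretext_loss(pts: torch.Tensor, max_shift: float = 0.5) -> torch.Tensor:
    """Translate each cloud by a random offset and predict the offset's length."""
    offset = (torch.rand(pts.size(0), 1, 3) * 2 - 1) * max_shift   # (B, 1, 3)
    target = offset.squeeze(1).norm(dim=-1, keepdim=True)          # (B, 1) true distance
    pred = dist_head(encoder(pts + offset))                        # (B, 1) predicted distance
    return nn.functional.mse_loss(pred, target)

# Usage: one pretext step on a random batch standing in for real point clouds.
loss = translation_pretext_loss(torch.randn(4, 1024, 3))
loss.backward()
```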
A Unified Framework for Adversarial Patch Attacks Against Visual 3D Object Detection in Autonomous Driving
IF 8.3, CAS Q1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-01-03 DOI: 10.1109/TCSVT.2025.3525725
Jian Wang;Fan Li;Lijun He
{"title":"A Unified Framework for Adversarial Patch Attacks Against Visual 3D Object Detection in Autonomous Driving","authors":"Jian Wang;Fan Li;Lijun He","doi":"10.1109/TCSVT.2025.3525725","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3525725","url":null,"abstract":"The rapid development of vision-based 3D perceptions, in conjunction with the inherent vulnerability of deep neural networks to adversarial examples, motivates us to investigate realistic adversarial attacks for the 3D detection models in autonomous driving scenarios. Due to the perspective transformation from 3D space to the image and object occlusion, current 2D image attacks are difficult to generalize to 3D detectors and are limited by physical feasibility. In this work, we propose a unified framework to generate physically printable adversarial patches with different attack goals: 1) instance-level hiding—pasting the learned patches to any target vehicle allows it to evade the detection process; 2) scene-level creating—placing the adversarial patch in the scene induces the detector to perceive plenty of fake objects. Both crafted patches are universal, which can take effect across a wide range of objects and scenes. To achieve above attacks, we first introduce the differentiable image-3D rendering algorithm that makes it possible to learn a patch located in 3D space. Then, two novel designs are devised to promote effective learning of patch content: 1) a Sparse Object Sampling Strategy is proposed to ensure that the rendered patches follow the perspective criterion and avoid being occluded during training, and 2) a Patch-Oriented Adversarial Optimization is used to facilitate the learning process focused on the patch areas. Both digital and physical-world experiments are conducted and demonstrate the effectiveness of our approaches, revealing potential threats when confronted with malicious attacks. We also investigate the defense strategy using adversarial augmentation to further improve the model’s robustness.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4949-4962"},"PeriodicalIF":8.3,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
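The general shape of a patch-optimization loop is sketched below for the hiding attack: patch pixels are updated by gradient descent to suppress a detector's confidence. The naive 2D paste stands in for the paper's differentiable image-3D rendering, and the toy detector and loss are placeholders, not the authors' pipeline.

```python
import torch

def optimize_hiding_patch(detector, images, patch_size=64, steps=200, lr=0.01):
    """Sketch of learning a universal 'hiding' patch.

    detector: a differentiable model mapping an image batch to per-image
              confidence scores (a stand-in for a 3D detector; the paper
              renders the patch into 3D space rather than pasting in 2D).
    images:   (B, 3, H, W) tensor with values in [0, 1].
    The patch is pasted at a fixed location here purely for illustration.
    """
    patch = torch.rand(3, patch_size, patch_size, requires_grad=True)
    opt = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        x = images.clone()
        x[:, :, :patch_size, :patch_size] = patch.clamp(0, 1)   # naive 2D "rendering"
        scores = detector(x)                  # higher score = object detected
        loss = scores.mean()                  # hiding attack: suppress detections
        opt.zero_grad()
        loss.backward()
        opt.step()
    return patch.detach().clamp(0, 1)

# Usage with a toy differentiable "detector" standing in for a real 3D detector.
toy_detector = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.Flatten(),
                                   torch.nn.LazyLinear(1))
patch = optimize_hiding_patch(toy_detector, torch.rand(2, 3, 128, 128))
```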
ReferSAM: Unleashing Segment Anything Model for Referring Image Segmentation
IF 8.3, CAS Q1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-01-01 DOI: 10.1109/TCSVT.2024.3524543
Sun-Ao Liu;Hongtao Xie;Jiannan Ge;Yongdong Zhang
{"title":"ReferSAM: Unleashing Segment Anything Model for Referring Image Segmentation","authors":"Sun-Ao Liu;Hongtao Xie;Jiannan Ge;Yongdong Zhang","doi":"10.1109/TCSVT.2024.3524543","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3524543","url":null,"abstract":"The Segment Anything Model (SAM) has demonstrated remarkable capability as a general segmentation model given visual prompts such as points or boxes. While SAM is conceptually compatible with text prompts, it merely employs linguistic features from vision-language models as prompt embeddings and lacks fine-grained cross-modal interaction. This deficiency limits its application in referring image segmentation (RIS), where the targets are specified by free-form natural language expressions. In this paper, we introduce ReferSAM, a novel SAM-based framework that enhances cross-modal interaction and reformulates prompt encoding, thereby unleashing SAM’s segmentation capability for RIS. Specifically, ReferSAM incorporates the Vision-Language Interactor (VLI) to integrate linguistic features with visual features during the image encoding stage of SAM. This interactor introduces fine-grained alignment between linguistic features and multi-scale visual representations without altering the architecture of pre-trained models. Additionally, we present the Vision-Language Prompter (VLP) to generate dense and sparse prompt embeddings by aggregating the aligned linguistic and visual features. Consequently, the generated embeddings sufficiently prompt SAM’s mask decoder to provide precise segmentation results. Extensive experiments on five public benchmarks demonstrate that ReferSAM achieves state-of-the-art performance on both classic and generalized RIS tasks. The code and models are available at <uri>https://github.com/lsa1997/ReferSAM</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4910-4922"},"PeriodicalIF":8.3,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
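The idea of injecting linguistic features into visual tokens without touching the pre-trained weights can be sketched with a single cross-attention block, as below. The dimensions, the zero-initialized output projection, and the residual connection are common adapter-style assumptions, not the actual VLI design.

```python
import torch
import torch.nn as nn

class VisionLanguageInteractor(nn.Module):
    """Sketch: fuse word features into image tokens with cross-attention.

    Visual tokens attend to linguistic tokens; a zero-initialized output
    projection leaves the pre-trained visual features unchanged at the start
    of training (an adapter trick assumed here, not taken from the paper).
    """
    def __init__(self, vis_dim=256, txt_dim=512, heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        self.attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.out = nn.Linear(vis_dim, vis_dim)
        nn.init.zeros_(self.out.weight)
        nn.init.zeros_(self.out.bias)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, N_img, vis_dim); txt_tokens: (B, N_words, txt_dim)
        txt = self.txt_proj(txt_tokens)
        fused, _ = self.attn(query=vis_tokens, key=txt, value=txt)
        return vis_tokens + self.out(fused)   # residual keeps the original features

# Usage with random tokens standing in for SAM image features and text embeddings.
vli = VisionLanguageInteractor()
out = vli(torch.randn(2, 64 * 64, 256), torch.randn(2, 20, 512))
```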
SparseTrack: Multi-Object Tracking by Performing Scene Decomposition Based on Pseudo-Depth
IF 8.3, CAS Q1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-01-01 DOI: 10.1109/TCSVT.2024.3524670
Zelin Liu;Xinggang Wang;Cheng Wang;Wenyu Liu;Xiang Bai
{"title":"SparseTrack: Multi-Object Tracking by Performing Scene Decomposition Based on Pseudo-Depth","authors":"Zelin Liu;Xinggang Wang;Cheng Wang;Wenyu Liu;Xiang Bai","doi":"10.1109/TCSVT.2024.3524670","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3524670","url":null,"abstract":"Exploring robust and efficient association methods has always been an important issue in multi-object tracking (MOT). Although existing tracking methods have achieved impressive performance, congestion and frequent occlusions still pose challenging problems in multi-object tracking. We reveal that performing sparse decomposition on dense scenes is a crucial step to enhance the performance of associating occluded targets. To this end, we propose a pseudo-depth estimation method for obtaining the relative depth of targets from 2D images. Secondly, we design a depth cascading matching (DCM) algorithm, which can use the obtained depth information to convert a dense target set into multiple sparse target subsets and perform data association on these sparse target subsets in order from near to far. By integrating the pseudo-depth method and the DCM strategy into the data association process, we propose a new tracker, called SparseTrack. SparseTrack provides a new perspective for solving the challenging crowded scene MOT problem. Only using IoU matching, SparseTrack achieves comparable performance with the state-of-the-art (SOTA) methods on the MOT17 and MOT20 benchmarks. Code and models are publicly available at <uri>https://github.com/hustvl/SparseTrack</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4870-4882"},"PeriodicalIF":8.3,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
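A compact sketch of depth cascading matching is given below: each box receives a pseudo-depth, boxes are split into depth levels, and IoU-based assignment runs level by level from near to far. The pseudo-depth proxy used here (distance from a box's bottom edge to the image bottom) and the per-level Hungarian matching are assumptions about the general idea, not SparseTrack's exact procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(a, b):
    """Pairwise IoU between boxes a (N, 4) and b (M, 4) in (x1, y1, x2, y2) format."""
    tl = np.maximum(a[:, None, :2], b[None, :, :2])
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter = np.clip(br - tl, 0, None).prod(axis=2)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-9)

def depth_cascaded_match(tracks, dets, img_h, n_levels=3, iou_thr=0.3):
    """Sketch of depth cascading matching on pseudo-depth.

    Pseudo-depth is approximated by the distance from a box's bottom edge to the
    image bottom (an assumption). Tracks and detections are split into depth
    levels and matched level by level, from near to far, with IoU costs.
    """
    def pseudo_depth(boxes):
        return img_h - boxes[:, 3]

    t_lvl = np.clip((pseudo_depth(tracks) / img_h * n_levels).astype(int), 0, n_levels - 1)
    d_lvl = np.clip((pseudo_depth(dets) / img_h * n_levels).astype(int), 0, n_levels - 1)
    matches = []
    for lvl in range(n_levels):              # level 0 = nearest to the camera
        ti = np.where(t_lvl == lvl)[0]
        di = np.where(d_lvl == lvl)[0]
        if len(ti) == 0 or len(di) == 0:
            continue
        cost = 1.0 - iou_matrix(tracks[ti], dets[di])
        rows, cols = linear_sum_assignment(cost)
        matches += [(ti[r], di[c]) for r, c in zip(rows, cols) if cost[r, c] < 1.0 - iou_thr]
    return matches

# Usage with toy boxes.
tracks = np.array([[10, 10, 50, 100], [200, 50, 260, 180]], dtype=float)
dets = np.array([[12, 12, 52, 102], [205, 55, 262, 182]], dtype=float)
print(depth_cascaded_match(tracks, dets, img_h=480))
```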
T2EA: Target-Aware Taylor Expansion Approximation Network for Infrared and Visible Image Fusion
IF 8.3, CAS Q1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date : 2025-01-01 DOI: 10.1109/TCSVT.2024.3524794
Zhenghua Huang;Cheng Lin;Biyun Xu;Menghan Xia;Qian Li;Yansheng Li;Nong Sang
{"title":"T2EA: Target-Aware Taylor Expansion Approximation Network for Infrared and Visible Image Fusion","authors":"Zhenghua Huang;Cheng Lin;Biyun Xu;Menghan Xia;Qian Li;Yansheng Li;Nong Sang","doi":"10.1109/TCSVT.2024.3524794","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3524794","url":null,"abstract":"In the image fusion mission, the crucial task is to generate high-quality images for highlighting the key objects while enhancing the scenes to be understood. To complete this task and provide a powerful interpretability as well as a strong generalization ability in producing enjoyable fusion results which are comfortable for vision tasks (such as objects detection and their segmentation), we present a novel interpretable decomposition scheme and develop a target-aware Taylor expansion approximation (T2EA) network for infrared and visible image fusion, where our T2EA includes the following key procedures: Firstly, visible and infrared images are both decomposed into feature maps through a designed Taylor expansion approximation (TEA) network. Then, the Taylor feature maps are hierarchically fused by a dual-branch feature fusion (DBFF) network. Next, the fused map of each layer is contributed to synthesize an enjoyable fusion result by the inverse Taylor expansion. Finally, a segmentation network is jointed to refine the fusion network parameters which can promote the pleasing fusion results to be more suitable for segmenting the objects. To validate the effectiveness of our reported T2EA network, we first discuss the selection of Taylor expansion layers and fusion strategies. Then, both quantitatively and qualitatively experimental results generated by the selected SOTA approaches on three datasets (MSRS, TNO, and LLVIP) are compared in testing, generalization, and target detection and segmentation, demonstrating that our T2EA can produce more competitive fusion results for vision tasks and is more powerful for image adaption. The code will be available at <uri>https://github.com/MysterYxby/T2EA</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4831-4845"},"PeriodicalIF":8.3,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
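The decompose-fuse-reconstruct pattern can be illustrated with a toy module: each input is split into K "expansion terms", terms of the same order from the two modalities are fused, and the output is rebuilt by summing the fused terms in place of the inverse expansion. All layer choices below are assumptions for illustration and do not reflect the actual TEA/DBFF networks.

```python
import torch
import torch.nn as nn

class ToyTaylorFusion(nn.Module):
    """Toy decompose-fuse-reconstruct pipeline for IR/visible fusion.

    Each input is decomposed into K 'expansion terms' by a small CNN; terms of
    the same order from the two modalities are fused by a 1x1 convolution, and
    the fused image is reconstructed by summing the fused terms (standing in
    for the inverse expansion). Layer choices are illustrative only.
    """
    def __init__(self, k_terms: int = 3):
        super().__init__()
        self.k = k_terms
        self.decompose = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, k_terms, 3, padding=1),
        )
        # One fusion conv per expansion order, each mixing the two modalities.
        self.fuse = nn.ModuleList(nn.Conv2d(2, 1, 1) for _ in range(k_terms))

    def forward(self, ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
        ir_terms = self.decompose(ir)          # (B, K, H, W)
        vis_terms = self.decompose(vis)        # (B, K, H, W)
        fused_terms = [
            self.fuse[i](torch.cat([ir_terms[:, i:i+1], vis_terms[:, i:i+1]], dim=1))
            for i in range(self.k)
        ]
        return torch.stack(fused_terms, dim=0).sum(dim=0)   # "inverse expansion" by summation

# Usage with random single-channel images standing in for IR/visible inputs.
model = ToyTaylorFusion()
fused = model(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64))
```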