{"title":"Spatial Quality Oriented Rate Control for Volumetric Video Streaming via Deep Reinforcement Learning","authors":"Xi Wang;Wei Liu;Shimin Gong;Zhi Liu;Jing Xu;Yuming Fang","doi":"10.1109/TCSVT.2024.3523348","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3523348","url":null,"abstract":"Volumetric videos offer an incredibly immersive viewing experience but encounters challenges in maintaining quality of experience (QoE) due to its ultra-high bandwidth requirements. One significant challenge stems from user’s spatial interactions, potentially leading to discrepancies between transmission bitrates and the actual quality of rendered viewports. In this study, we conduct comprehensive measurement experiments to investigate the impact of six degrees of freedom information on received video quality. Our results indicate that the correlation between spatial quality and transmission bitrates is influenced by the user’s viewing distance, exhibiting variability among users. To address this, we propose a spatial quality oriented rate control system, namely sparkle, that aims to satisfy spatial quality requirements while maximizing long-term QoE for volumetric video streaming services. Leveraging richer user interaction information, we devise a tailored learning-based algorithm to enhance long-term QoE. To address the complexity brought by richer state input and precise allocation, we integrate pre-constraints derived from three-dimensional displays to intervene action selection, efficiently reducing the action space and speeding up convergence. Extensive experimental results illustrate that sparkle significantly enhances the averaged QoE by up to 29% under practical network and user tracking scenarios.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"5092-5108"},"PeriodicalIF":8.3,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GMTNet: Dense Object Detection via Global Dynamically Matching Transformer Network","authors":"Chaojun Dong;Chengxuan Wang;Yikui Zhai;Ye Li;Jianhong Zhou;Pasquale Coscia;Angelo Genovese;Vincenzo Piuri;Fabio Scotti","doi":"10.1109/TCSVT.2024.3522661","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3522661","url":null,"abstract":"In recent years, object detection models have been extensively applied across various industries, leveraging learned samples to recognize and locate objects. However, industrial environments present unique challenges, including complex backgrounds, dense object distributions, object stacking, and occlusion. To address these challenges, we propose the Global Dynamic Matching Transformer Network (GMTNet). GMTNet partitions images into blocks and employs a sliding window approach to capture information from each block and their interrelationships, mitigating background interference while acquiring global information for dense object recognition. By reweighting key-value pairs in multi-scale feature maps, GMTNet enhances global information relevance and effectively handles occlusion and overlap between objects. Furthermore, we introduce a dynamic sample matching method to tackle the issue of excessive candidate boxes in dense detection tasks. This method adaptively adjusts the number of matched positive samples according to the specific detection task, enabling the model to reduce the learning of irrelevant features and simplify post-processing. Experimental results demonstrate that GMTNet excels in dense detection tasks and outperforms current mainstream algorithms. The code will be available at <uri>http://github.com/yikuizhai/GMTNet</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4923-4936"},"PeriodicalIF":8.3,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BEVUDA++: Geometric-Aware Unsupervised Domain Adaptation for Multi-View 3D Object Detection","authors":"Rongyu Zhang;Jiaming Liu;Xiaoqi Li;Xiaowei Chi;Dan Wang;Li Du;Yuan Du;Shanghang Zhang","doi":"10.1109/TCSVT.2024.3523049","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3523049","url":null,"abstract":"Vision-centric Bird’s Eye View (BEV) perception holds considerable promise for autonomous driving. Recent studies have prioritized efficiency or accuracy enhancements, yet the issue of domain shift has been overlooked, leading to substantial performance degradation upon transfer. We identify major domain gaps in real-world cross-domain scenarios and initiate the first effort to address the Domain Adaptation (DA) challenge in multi-view 3D object detection for BEV perception. Given the complexity of BEV perception approaches with their multiple components, domain shift accumulation across multi-geometric spaces (e.g., 2D, 3D Voxel, BEV) poses a significant challenge for BEV domain adaptation. In this paper, we introduce an innovative geometric-aware teacher-student framework, BEVUDA++, to diminish this issue, comprising a Reliable Depth Teacher (RDT) and a Geometric Consistent Student (GCS) model. Specifically, RDT effectively blends target LiDAR with dependable depth predictions to generate depth-aware information based on uncertainty estimation, enhancing the extraction of Voxel and BEV features that are essential for understanding the target domain. To collaboratively reduce the domain shift, GCS maps features from multiple spaces into a unified geometric embedding space, thereby narrowing the gap in data distribution between the two domains. Additionally, we introduce a novel Uncertainty-guided Exponential Moving Average (UEMA) to further reduce error accumulation due to domain shifts informed by previously obtained uncertainty guidance. To demonstrate the superiority of our proposed method, we execute comprehensive experiments in four cross-domain scenarios, securing state-of-the-art performance in BEV 3D object detection tasks, e.g., 12.9% NDS and 9.5% mAP enhancement on Day-Night adaptation.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"5109-5122"},"PeriodicalIF":8.3,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PCTrack: Accurate Object Tracking for Live Video Analytics on Resource-Constrained Edge Devices","authors":"Xinyi Zhang;Haoran Xu;Chenyun Yu;Guang Tan","doi":"10.1109/TCSVT.2024.3523204","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3523204","url":null,"abstract":"The task of live video analytics relies on real-time object tracking that typically involves computationally expensive deep neural network (DNN) models. In practice, it has become essential to process video data on edge devices deployed near the cameras. However, these edge devices often have very limited computing resources and thus suffer from poor tracking accuracy. Through a measurement study, we identify three major factors contributing to the performance issue: outdated detection results, tracking error accumulation, and ignorance of new objects. We introduce a novel approach, called Predict & Correct based Tracking, or <monospace>PCTrack</monospace>, to systematically address these problems. Our design incorporates three innovative components: 1) a Predictive Detection Propagator that rapidly updates outdated object bounding boxes to match the current frame through a lightweight prediction model; 2) a Frame Difference Corrector that refines the object bounding boxes based on frame difference information; and 3) a New Object Detector that efficiently discovers newly appearing objects during tracking. Experimental results show that our approach achieves remarkable accuracy improvements, ranging from 19.4% to 34.7%, across diverse traffic scenarios, compared to state of the art methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"3969-3982"},"PeriodicalIF":8.3,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Forgery-Aware Adaptive Learning With Vision Transformer for Generalized Face Forgery Detection","authors":"Anwei Luo;Rizhao Cai;Chenqi Kong;Yakun Ju;Xiangui Kang;Jiwu Huang;Alex C. Kot","doi":"10.1109/TCSVT.2024.3522091","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3522091","url":null,"abstract":"With the rapid progress of generative models, the current challenge in face forgery detection is how to effectively detect realistic manipulated faces from different unseen domains. Though previous studies show that pre-trained Vision Transformer (ViT) based models can achieve some promising results after fully fine-tuning on the Deepfake dataset, their generalization performances are still unsatisfactory. To this end, we present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm for generalized face forgery detection, where the parameters in the pre-trained ViT are kept fixed while the designed adaptive modules are optimized to capture forgery features. Specifically, a global adaptive module is designed to model long-range interactions among input tokens, which takes advantage of self-attention mechanism to mine global forgery clues. To further explore essential local forgery clues, a local adaptive module is proposed to expose local inconsistencies by enhancing the local contextual association. In addition, we introduce a fine-grained adaptive learning module that emphasizes the common compact representation of genuine faces through relationship learning in fine-grained pairs, driving these proposed adaptive modules to be aware of fine-grained forgery-aware information. Extensive experiments demonstrate that our FA-ViT achieves state-of-the-arts results in the cross-dataset evaluation, and enhances the robustness against unseen perturbations. Particularly, FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation. The code and trained model have been released at: <uri>https://github.com/LoveSiameseCat/FAViT</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4116-4129"},"PeriodicalIF":8.3,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bi-Direction Label-Guided Semantic Enhancement for Cross-Modal Hashing","authors":"Lei Zhu;Runbing Wu;Xinghui Zhu;Chengyuan Zhang;Lin Wu;Shichao Zhang;Xuelong Li","doi":"10.1109/TCSVT.2024.3521646","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3521646","url":null,"abstract":"Supervised cross-modal hashing has gained significant attention due to its efficiency in reducing storage and computation costs while maintaining rich semantic information. Despite substantial progress in generating compact binary codes, two key challenges remain: (1) insufficient utilization of labels to mine and fuse multi-grained semantic information, and (2) unreliable cross-modal interaction, which does not fully leverage multi-grained semantics or accurately capture sample relationships. To address these limitations, we propose a novel method called Bi-direction Label-Guided Semantic Enhancement for cross-modal Hashing (BiLGSEH). To tackle the first challenge, we introduce a label-guided semantic fusion strategy that extracts and integrates multi-grained semantic features guided by multi-labels. For the second challenge, we propose a semantic-enhanced relation aggregation strategy that constructs and aggregates multi-modal relational information through bi-directional similarity. Additionally, we incorporate CLIP features to improve the alignment between multi-modal content and complex semantics. In summary, BiLGSEH generates discriminative hash codes by effectively aligning semantic distribution and relational structure across modalities. Extensive performance evaluations against 18 competitive methods demonstrate the superiority of our approach. The source code for our method is publicly available at: <uri>https://github.com/yileicc/BiLGSEH</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"3983-3999"},"PeriodicalIF":8.3,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cognition Transferring and Decoupling for Text-Supervised Egocentric Semantic Segmentation","authors":"Zhaofeng Shi;Heqian Qiu;Lanxiao Wang;Fanman Meng;Qingbo Wu;Hongliang Li","doi":"10.1109/TCSVT.2024.3521955","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3521955","url":null,"abstract":"In this paper, we explore a novel Text-supervised Egocentic Semantic Segmentation (TESS) task that aims to assign pixel-level categories to egocentric images weakly supervised by texts from image-level labels. In this task with prospective potential, the egocentric scenes contain dense wearer-object relations and inter-object interference. However, most recent third-view methods leverage the frozen Contrastive Language-Image Pre-training (CLIP) model, which is pre-trained on the semantic-oriented third-view data and lapses in the egocentric view due to the “relation insensitive” problem. Hence, we propose a Cognition Transferring and Decoupling Network (CTDN) that first learns the egocentric wearer-object relations via correlating the image and text. Besides, a Cognition Transferring Module (CTM) is developed to distill the cognitive knowledge from the large-scale pre-trained model to our model for recognizing egocentric objects with various semantics. Based on the transferred cognition, the Foreground-background Decoupling Module (FDM) disentangles the visual representations to explicitly discriminate the foreground and background regions to mitigate false activation areas caused by foreground-background interferential objects during egocentric relation learning. Extensive experiments on four TESS benchmarks demonstrate the effectiveness of our approach, which outperforms many recent related methods by a large margin. Code will be available at <uri>https://github.com/ZhaofengSHI/CTDN</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4144-4157"},"PeriodicalIF":8.3,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CLIP-Based Camera-Agnostic Feature Learning for Intra-Camera Supervised Person Re-Identification","authors":"Xuan Tan;Xun Gong;Yang Xiang","doi":"10.1109/TCSVT.2024.3522178","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3522178","url":null,"abstract":"Contrastive Language-Image Pre-Training (CLIP) model excels in traditional person re-identification (ReID) tasks due to its inherent advantage in generating textual descriptions for pedestrian images. However, applying CLIP directly to intra-camera supervised person re-identification (ICS ReID) presents challenges. ICS ReID requires independent identity labeling within each camera, without associations across cameras. This limits the effectiveness of text-based enhancements. To address this, we propose a novel framework called CLIP-based Camera-Agnostic Feature Learning (CCAFL) for ICS ReID. Accordingly, two custom modules are designed to guide the model to actively learn camera-agnostic pedestrian features: Intra-Camera Discriminative Learning (ICDL) and Inter-Camera Adversarial Learning (ICAL). Specifically, we first establish learnable textual prompts for intra-camera pedestrian images to obtain crucial semantic supervision signals for subsequent intra- and inter-camera learning. Then, we design ICDL to increase inter-class variation by considering the hard positive and hard negative samples within each camera, thereby learning intra-camera finer-grained pedestrian features. Additionally, we propose ICAL to reduce inter-camera pedestrian feature discrepancies by penalizing the model’s ability to predict the camera from which a pedestrian image originates, thus enhancing the model’s capability to recognize pedestrians from different viewpoints. Extensive experiments on popular ReID datasets demonstrate the effectiveness of our approach. Especially, on the challenging MSMT17 dataset, we arrive at 58.9% in terms of mAP accuracy, surpassing state-of-the-art methods by 7.6%. Code is available at <uri>https://gitee.com/swjtugx/classmate/tree/master/OurGroup/CCAFL</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4100-4115"},"PeriodicalIF":8.3,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Better Fit: Accommodate Variations in Clothing Types for Virtual Try-On","authors":"Dan Song;Xuanpu Zhang;Jianhao Zeng;Pengxin Zhan;Qingguo Chen;Weihua Luo;An-An Liu","doi":"10.1109/TCSVT.2024.3521299","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3521299","url":null,"abstract":"Image-based virtual try-on aims to transfer target in-shop clothing to a dressed model image, the objectives of which are totally taking off original clothing while preserving the contents outside of the try-on area, naturally wearing target clothing and correctly inpainting the gap between target clothing and original clothing. Tremendous efforts have been made to facilitate this popular research area, but cannot keep the type of target clothing with the try-on area affected by original clothing. In this paper, we focus on the unpaired virtual try-on situation where target clothing and original clothing on the model are different, i.e., the practical scenario. To break the correlation between the try-on area and the original clothing and make the model learn the correct information to inpaint, we propose an adaptive mask training paradigm that dynamically adjusts training masks. It not only improves the alignment and fit of clothing but also significantly enhances the fidelity of virtual try-on experience. Furthermore, we for the first time propose two metrics for unpaired try-on evaluation, the Semantic-Densepose-Ratio (SDR) and Skeleton-LPIPS (S-LPIPS), to evaluate the correctness of clothing type and the accuracy of clothing texture. For unpaired try-on validation, we construct a comprehensive cross-try-on benchmark (Cross-27) with distinctive clothing items and model physiques, covering a broad try-on scenarios. Experiments demonstrate the effectiveness of the proposed methods, contributing to the advancement of virtual try-on technology and offering new insights and tools for future research in the field. The code, model and benchmark will be publicly released.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4287-4299"},"PeriodicalIF":8.3,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RaLiBEV: Radar and LiDAR BEV Fusion Learning for Anchor Box Free Object Detection Systems","authors":"Yanlong Yang;Jianan Liu;Tao Huang;Qing-Long Han;Gang Ma;Bing Zhu","doi":"10.1109/TCSVT.2024.3521375","DOIUrl":"https://doi.org/10.1109/TCSVT.2024.3521375","url":null,"abstract":"In autonomous driving, LiDAR and radar are crucial for environmental perception. LiDAR offers precise 3D spatial sensing information but struggles in adverse weather like fog. Conversely, radar signals can penetrate rain or mist due to their specific wavelength but are prone to noise disturbances. Recent state-of-the-art works reveal that the fusion of radar and LiDAR can lead to robust detection in adverse weather. Current approaches typically fuse features from various data sources using basic convolutional/transformer network architectures and employ straightforward label assignment strategies for object detection. However, these methods have two main limitations: they fail to adequately capture feature interactions and lack consistent regression constraints. In this paper, we propose a bird’s-eye view fusion learning-based anchor box-free object detection system. Our approach introduces a novel interactive transformer module for enhanced feature fusion and an advanced label assignment strategy for more consistent regression, addressing key limitations in existing methods. Specifically, experiments show that, our approach’s average precision ranks <inline-formula> <tex-math>$1^{st}$ </tex-math></inline-formula> and significantly outperforms the state-of-the-art method by 13.1% and 19.0% at Intersection of Union (IoU) of 0.8 under “Clear+Foggy” training conditions for “Clear” and “Foggy” testing, respectively. Our code repository is available at: <uri>https://github.com/yyxr75/RaLiBEV</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4130-4143"},"PeriodicalIF":8.3,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143913501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}