{"title":"An Attention-Locating Algorithm for Eliminating Background Effects in Fine-Grained Visual Classification","authors":"Yueting Huang;Zhenzhe Hechen;Mingliang Zhou;Zhengguo Li;Sam Kwong","doi":"10.1109/TCSVT.2025.3535818","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3535818","url":null,"abstract":"Fine-grained visual classification (FGVC) is a challenging task characterized by interclass similarity and intraclass diversity and has broad application prospects. Recently, several methods have adopted the vision Transformer (ViT) in FGVC tasks since the data specificity of the multihead self-attention (MSA) mechanism in ViT is beneficial for extracting discriminative feature representations. However, these works focus on integrating feature dependencies at a high level, which leads to the model being easily disturbed by low-level background information. To address this issue, we propose a fine-grained attention-locating vision Transformer (FAL-ViT) and an attention selection module (ASM). First, FAL-ViT contains a two-stage framework to identify crucial regions effectively within images and enhance features by strategically reusing parameters. Second, the ASM accurately locates important target regions via the natural scores of the MSA, extracting finer low-level features to offer more comprehensive information through position mapping. Extensive experiments on public datasets demonstrate that FAL-ViT outperforms the other methods in terms of performance, confirming the effectiveness of our proposed methods. The source code is available at <uri>https://github.com/Yueting-Huang/FAL-ViT</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 6","pages":"5993-6006"},"PeriodicalIF":8.3,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144243917","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Relighting Scenes With Object Insertions in Neural Radiance Fields","authors":"Xuening Zhu;Renjiao Yi;Xin Wen;Chenyang Zhu;Kai Xu","doi":"10.1109/TCSVT.2025.3535599","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3535599","url":null,"abstract":"Inserting objects into scenes and performing realistic relighting are common applications in augmented reality (AR). Previous methods focused on inserting virtual objects using CAD models or real objects from single-view images, resulting in highly limited AR application scenarios. We introduce a novel pipeline based on Neural Radiance Fields (NeRFs) for seamlessly integrating objects into scenes, from two sets of images depicting the object and scene. This approach enables novel view synthesis, realistic relighting, and supports physical interactions such as shadow casting between objects. The lighting environment is in a hybrid representation of Spherical Harmonics and Spherical Gaussians, representing both high- and low-frequency lighting components very well, and supporting non-Lambertian surfaces. Specifically, we leverage the benefits of volume rendering and introduce an innovative approach for efficient shadow rendering by comparing the depth maps between the camera view and the light source view and generating vivid soft shadows. The proposed method achieves realistic relighting effects in extensive experimental evaluations.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 7","pages":"6787-6802"},"PeriodicalIF":8.3,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144558014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexible ViG: Learning the Self-Saliency for Flexible Object Recognition","authors":"Kunshan Yang;Lin Zuo;Mengmeng Jing;Xianlong Tian;Kunbin He;Yongqi Ding","doi":"10.1109/TCSVT.2025.3534204","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3534204","url":null,"abstract":"Existing computer vision methods mainly focus on the recognition of rigid objects, whereas the recognition of flexible objects remains unexplored. Recognizing flexible objects poses significant challenges due to their inherently diverse shapes and sizes, translucent attributes, ambiguous boundaries, and subtle inter-class differences. In this paper, we claim that these problems primarily arise from the lack of object saliency. To this end, we propose the Flexible Vision Graph Neural Network (FViG) to optimize the self-saliency and thereby improve the discrimination of the representations for flexible objects. Specifically, on one hand, we propose to maximize the channel-aware saliency by extracting the weight of neighboring graph nodes, which is employed to identify flexible objects with minimal inter-class differences. On the other hand, we maximize the spatial-aware saliency based on clustering to aggregate neighborhood information for the centroid graph nodes. This introduces local context information and enables extracting of consistent representation, effectively adapting to the shape and size variations in flexible objects. To verify the performance of flexible objects recognition thoroughly, for the first time we propose the Flexible Dataset (FDA), which consists of various images of flexible objects collected from real-world scenarios or online. Extensive experiments evaluated on our FDA, FireNet, CIFAR-100 and ImageNet-Hard datasets demonstrate the effectiveness of our method on enhancing the discrimination of flexible objects.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 7","pages":"6424-6436"},"PeriodicalIF":8.3,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144557907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SGFormer: Spherical Geometry Transformer for 360° Depth Estimation","authors":"Junsong Zhang;Zisong Chen;Chunyu Lin;Zhijie Shen;Lang Nie;Kang Liao;Yao Zhao","doi":"10.1109/TCSVT.2025.3534220","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3534220","url":null,"abstract":"Panoramic distortion poses a significant challenge in 360° depth estimation, particularly pronounced at the north and south poles. Existing methods either adopt a bi-projection fusion strategy to remove distortions or model long-range dependencies to capture global structures, resulting in either unclear structure or insufficient local perception. In this paper, we propose a spherical geometry transformer, named SGFormer, to address the above issues, with an innovative step to integrate spherical geometric priors into vision transformers. To this end, we retarget the transformer decoder to a spherical prior decoder (termed SPDecoder), which endeavors to uphold the integrity of spherical structures during decoding. Concretely, we leverage bipolar reprojection, circular rotation, and curve local embedding to preserve the spherical characteristics of equidistortion, continuity, and surface distance, respectively. Furthermore, we present a query-based global conditional position embedding to compensate for spatial structure at varying resolutions. It not only boosts the global perception of spatial position but also sharpens the depth structure across different patches. Finally, we conduct extensive experiments on popular benchmarks, demonstrating our superiority over state-of-the-art solutions. Our code will be made publicly at <uri>https://github.com/iuiuJaon/SGFormer</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 6","pages":"5738-5748"},"PeriodicalIF":8.3,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144272936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TRNet: Two-Tier Recursion Network for Co-Salient Object Detection","authors":"Runmin Cong;Ning Yang;Hongyu Liu;Dingwen Zhang;Qingming Huang;Sam Kwong;Wei Zhang","doi":"10.1109/TCSVT.2025.3534908","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3534908","url":null,"abstract":"Co-salient object detection (CoSOD) is to find the salient and recurring objects from a series of relevant images, where modeling inter-image relationships plays a crucial role. Different from the commonly used direct learning structure that inputs all the intra-image features into some well-designed modules to represent the inter-image relationship, we resort to adopting a recursive structure for inter-image modeling, and propose a two-tier recursion network (TRNet) to achieve CoSOD in this paper. The two-tier recursive structure of the proposed TRNet is embodied in two stages of inter-image extraction and distribution. On the one hand, considering the task adaptability and inter-image correlation, we design an inter-image exploration with recursive reinforcement module to learn the local and global inter-image correspondences, guaranteeing the validity and discriminativeness of the information in the step-by-step propagation. On the other hand, we design a dynamic recursion distribution module to fully exploit the role of inter-image correspondences in a recursive structure, adaptively assigning common attributes to each individual image through an improved semi-dynamic convolution. Experimental results on five prevailing CoSOD benchmarks demonstrate that our TRNet outperforms other competitors in terms of various evaluation metrics. The code and results of our method are available at <uri>https://github.com/rmcong/TRNet_TCSVT2025</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 6","pages":"5844-5857"},"PeriodicalIF":8.3,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144243686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Facial Depression Estimation via Multi-Cue Contrastive Learning","authors":"Xinke Wang;Jingyuan Xu;Xiao Sun;Mingzheng Li;Bin Hu;Wei Qian;Dan Guo;Meng Wang","doi":"10.1109/TCSVT.2025.3533543","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3533543","url":null,"abstract":"Vision-based depression estimation is an emerging yet impactful task, whose challenge lies in predicting the severity of depression from facial videos lasting at least several minutes. Existing methods primarily focus on fusing frame-level features to create comprehensive representations. However, they often overlook two crucial aspects: 1) inter- and intra-cue correlations, and 2) variations among samples. Hence, simply characterizing sample embeddings while ignoring to mine the relation among multiple cues leads to limitations. To address this problem, we propose a novel Multi-Cue Contrastive Learning (MCCL) framework to mine the relation among multiple cues for discriminative representation. Specifically, we first introduce a novel cross-characteristic attentive interaction module to model the relationship among multiple cues from four facial features (e.g., 3D landmarks, head poses, gazes, FAUs). Then, we propose a temporal segment attentive interaction module to capture the temporal relationships within each facial feature over time intervals. Moreover, we integrate contrastive learning to leverage the variations among samples by regarding the embeddings of inter-cue and intra-cue as positive pairs while considering embeddings from other samples as negative. In this way, the proposed MCCL framework leverages the relationships among the facial features and the variations among samples to enhance the process of multi-cue mining, thereby achieving more accurate facial depression estimation. Extensive experiments on public datasets, DAIC-WOZ, CMDC, and E-DAIC, demonstrate that our model not only outperforms the advanced depression methods but that the discriminative representations of facial behaviors provide potential insights about depression. Our code is available at: <uri>https://github.com/xkwangcn/MCCL.git</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 6","pages":"6007-6020"},"PeriodicalIF":8.3,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144243858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Underwater Image Quality Assessment Using Feature Disentanglement and Dynamic Content-Distortion Guidance","authors":"Junjie Zhu;Liquan Shen;Zhengyong Wang;Yihan Yu","doi":"10.1109/TCSVT.2025.3533598","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3533598","url":null,"abstract":"Due to the complex underwater imaging process, underwater images contain a variety of unique distortions. While existing underwater image quality assessment (UIQA) methods have made progress by highlighting these distortions, they overlook the fact that image content also affects how distortions are perceived, as different content exhibits varying sensitivities to different types of distortions. Both the characteristics of the content itself and the properties of the distortions determine the quality of underwater images. Additionally, the intertwined nature of content and distortion features in underwater images complicates the accurate extraction of both. In this paper, we address these issues by comprehensively accounting for both content and distortion information and explicitly disentangling underwater image features into content and distortion components. To achieve this, we introduce a dynamic content-distortion guiding and feature disentanglement network (DysenNet), composed of three main components: the feature disentanglement sub-network (FDN), the dynamic content guidance module (DCM), and the dynamic distortion guidance module (DDM). Specifically, the FDN disentangles underwater features into content and distortion elements, allowing us to more clearly measure their respective contributions to image quality. The DCM generates dynamic multi-scale convolutional kernels tailored to the unique content of each image, enabling content-adaptive feature extraction for quality perception. The DDM, on the other hand, addresses both global and local underwater distortions by identifying distortion cues from both channel and spatial perspectives, focusing on regions and channels with severe degradation. Extensive experiments on UIQA datasets demonstrate the state-of-the-art performance of the proposed method.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 6","pages":"5602-5616"},"PeriodicalIF":8.3,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144272998","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Edge Guided Network With Motion Enhancement for Few-Shot Action Recognition","authors":"Kaiwen Du;Weirong Ye;Hanyu Guo;Yan Yan;Hanzi Wang","doi":"10.1109/TCSVT.2025.3533573","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3533573","url":null,"abstract":"Existing state-of-the-art methods for few-shot action recognition (FSAR) achieve promising performance by spatial and temporal modeling. However, most current methods ignore the importance of edge information and motion cues, leading to inferior performance. For the few-shot task, it is important to effectively explore limited data. Additionally, effectively utilizing edge information is beneficial for exploring motion cues, and vice versa. In this paper, we propose a novel edge guided network with motion enhancement (EGME) for FSAR. To the best of our knowledge, this is the first work to utilize the edge information as guidance in the FSAR task. Our EGME contains two crucial components, including an edge information extractor (EIE) and a motion enhancement module (ME). Specifically, EIE is used to obtain edge information on video frames. Afterward, the edge information is used as guidance to fuse with the frame features. In addition, ME can adaptively capture motion-sensitive features of videos. It adopts a self-gating mechanism to highlight motion-sensitive regions in videos from a large temporal receptive field. Based on the above designed components, EGME can capture edge information and motion cues, resulting in superior recognition performance. Experimental results on four challenging benchmarks show that EGME performs favorably against recent advanced methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 6","pages":"5331-5342"},"PeriodicalIF":8.3,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144243804","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantic Concept Perception Network With Interactive Prompting for Cross-View Image Geo-Localization","authors":"Yuan Gao;Haibo Liu;Xiaohui Wei","doi":"10.1109/TCSVT.2025.3533574","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3533574","url":null,"abstract":"Cross-view image geo-localization aims to estimate the geographic position of a query image from the ground platform (such as mobile phone, vehicle camera) by matching it with geo-tagged reference images from the aerial platform (such as drone, satellite). Although existing studies have achieved promising results, they usually rely only on depth features and fail to effectively handle the serious changes in geometric shape and appearance caused by view differences. In this paper, a novel Semantic Concept Perception Network (SCPNet) with interactive prompting is proposed, whose core is to extract and integrate semantic concept information reflecting spatial position relationship between objects. Specifically, for a given of pair input images, a CNN stem with positional embedding is first adopted to extract depth features. Meanwhile, a semantic concept mining module is designed to distinguish different objects and capture the associations between them, thereby achieving the purpose of extracting semantic concept information. Furthermore, to obtain global descriptions of different views, a feature bidirectional injection fusion module based on attention mechanism is proposed to exploit the long-range dependencies of semantic concept and depth features. Finally, a triplet loss with a flexible hard sample mining strategy is used to guide the optimization of the network. Experimental results have shown that our proposed method can achieve better performance compared with state-of-the-art methods on mainstream cross-view datasets.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 6","pages":"5343-5354"},"PeriodicalIF":8.3,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144243877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Conditional Dual Diffusion for Multimodal Clustering of Optical and SAR Images","authors":"Shujun Liu;Ling Chang","doi":"10.1109/TCSVT.2025.3533301","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3533301","url":null,"abstract":"Acknowledging different wavelengths by imaging mechanisms, optical images usually embed higher low-dimensional manifolds into ambient spaces than SAR images do. How to utilize their complementarity remains challenging for multimodal clustering. In this study, we devise a conditional dual diffusion (CDD) model for multimodal clustering of optical and SAR images, and theoretically prove that it is equivalent to a probability flow ordinary differential equation (ODE) having a unique solution. Different from vanilla diffusion models, the CDD model is equipped with a decoupling autoencoder to predict noises and clear images simultaneously, preserving data manifolds embedded in latent space. To the fuse manifolds of optical and SAR images, we train the model to generate optical images conditioned by SAR images, mapping them into a unified latent space. The learned features extracted from the model are fed to K-means algorithm to produce resulting clusters. To the best of our knowledge, this study could be one of the first diffusion models for multimodal clustering. Extensive comparison experiments on three large-scale optical-SAR pair datasets show the superiority of our method over state-of-the-art (SOTA) methods overall in terms of clustering performance and time consumption. The source code is available at <uri>https://github.com/suldier/CDD</uri>.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 6","pages":"5318-5330"},"PeriodicalIF":8.3,"publicationDate":"2025-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144243918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}