IEEE transactions on image processing : a publication of the IEEE Signal Processing Society, latest articles (page 3)

Class-Customized Domain Adaptation: Unlock Each Customer-Specific Class With Single Annotation

IF 13.7

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2025-08-27 DOI: 10.1109/TIP.2025.3597036

Kaixin Chen;Huiying Chang;Mengqiu Xu;Ruoyi Du;Ming Wu;Zhanyu Ma;Chuang Zhang

Abstract: Model customization mitigates the inadequate performance, resource wastage, and privacy risks associated with using general-purpose models in specialized domains and well-defined tasks. However, achieving customization at a low annotation cost remains a challenge. Existing domain adaptation research has addressed cases where all customized classes are present in the labeled database, yet scenarios involving customer-specific classes are still unresolved. This paper therefore proposes a novel Class-Customized Domain Adaptation (CCDA) method that addresses the latter scenario with just one additional annotation for each customer-specific class. CCDA adopts the classic adaptation training framework and comprises two innovative techniques. First, to ensure that the shared class knowledge from the database and the private class knowledge from the additional annotations are transferred and propagated to the correct regions within the target domain, we design a partial-feature alignment strategy based on the mechanical properties of feature alignment. Second, we propose soft-balanced sampling to tackle the long-tail distribution problem in labeled data, preventing the model from overfitting to the labeled samples of customer-specific classes. The effectiveness of CCDA has been validated across 48 tasks simulated on domain adaptation benchmarks and two real-world customization scenarios, consistently showing excellent performance. Additionally, extensive analytical experiments illustrate the contributions of the two techniques. The code is available at https://github.com/CHEN-kx/ClassCustomizedDA

Vol. 34, pp. 5527-5542. Journal Article. Semantic Scholar paper ID: 144911064.
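The abstract does not say how soft-balanced sampling is defined. As a rough illustration of the general idea only, the sketch below uses a common softened re-sampling scheme in which each sample is drawn with weight proportional to n_c^(-alpha), where n_c is the size of its class. The function names and the `alpha` knob are hypothetical and are not taken from the CCDA code.

```python
import random
from collections import Counter

def soft_balanced_weights(labels, alpha=0.5):
    """Per-sample sampling weights proportional to n_c ** (-alpha).

    alpha = 0 gives uniform sampling over samples (no balancing);
    alpha = 1 gives uniform sampling over classes (hard balancing);
    values in between soften the balancing (hypothetical knob).
    """
    counts = Counter(labels)
    return [counts[y] ** (-alpha) for y in labels]

def class_probabilities(labels, alpha):
    """Probability that a single draw lands in each class."""
    weights = soft_balanced_weights(labels, alpha)
    total = sum(weights)
    probs = Counter()
    for y, w in zip(labels, weights):
        probs[y] += w / total
    return dict(probs)

# Toy long-tailed label set: 99 shared-class samples, 1 customer-specific sample.
labels = ["shared"] * 99 + ["private"]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, class_probabilities(labels, alpha))

# Drawing a mini-batch with the softened weights (seeded for repeatability).
rng = random.Random(0)
batch = rng.choices(labels, weights=soft_balanced_weights(labels, 0.5), k=8)
```

With hard balancing (`alpha = 1`) the single annotated sample would fill about half of every batch, which is the overfitting risk the abstract mentions; an intermediate `alpha` raises its share above 1% without letting it dominate.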

Citations: 0

TripleNet: Exploiting Complementary Features and Pseudo-Labels for Semi-Supervised Salient Object Detection

IF 13.7

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2025-08-27 DOI: 10.1109/TIP.2025.3601334

Liyuan Chen;Ming-Hsuan Yang;Jian Pu;Zhonglong Zheng

Abstract: Due to the limited output categories, semi-supervised salient object detection faces challenges in adapting conventional semi-supervised strategies. To address this limitation, we propose a multi-branch architecture that extracts complementary features from labeled data. Specifically, we introduce TripleNet, a three-branch network architecture designed for contour, content, and holistic saliency prediction. The supervision signals for the contour and content branches are derived by decomposing the limited ground truths. After training on the labeled data, the model produces pseudo-labels for unlabeled images, including contour, content, and salient objects. By leveraging the complementarity between the contour and content branches, we construct coupled pseudo-saliency labels by integrating the pseudo-contour and pseudo-content labels, which differ from the model-inferred pseudo-saliency labels. We further develop an enhanced pseudo-labeling mechanism that generates enhanced pseudo-saliency labels by combining reliable regions from both pseudo-saliency labels. Moreover, we incorporate a partial binary cross-entropy loss function to guide the learning of the saliency branch to focus on effective regions within the enhanced pseudo-saliency labels, which are identified through our adaptive thresholding approach. Extensive experiments demonstrate that the proposed method achieves state-of-the-art performance using only 329 labeled training images.

Vol. 34, pp. 5628-5641. Journal Article. Semantic Scholar paper ID: 144911063.
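The partial binary cross-entropy loss mentioned in the abstract can be pictured as ordinary BCE averaged only over pixels marked as reliable. The sketch below is a minimal stand-alone version under that assumption; the mask is simply given as input here, whereas the paper derives it with an adaptive thresholding step that the abstract does not detail. The function name and toy values are illustrative.

```python
import math

def partial_bce(pred, target, mask, eps=1e-7):
    """Mean binary cross-entropy over pixels where mask is 1; others are ignored."""
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if not m:
            continue  # unreliable pixel: contributes no loss
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0

# Flattened toy "image" of four pixels.
pred = [0.9, 0.2, 0.6, 0.4]   # predicted saliency probabilities
pseudo = [1, 0, 0, 1]         # pseudo-saliency labels
mask = [1, 1, 0, 0]           # 1 = region treated as reliable
loss = partial_bce(pred, pseudo, mask)
print(loss)
```

Only the first two pixels contribute, so the two poorly predicted but masked-out pixels do not pull the model toward possibly wrong pseudo-labels.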

Citations: 0

A Real-World Animation Super-Resolution Benchmark With Color Degradation and Multi-Scale Multi-Frequency Alignment

IF 13.7

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2025-08-25 DOI: 10.1109/TIP.2025.3599946

Yu Jiang;Yongji Zhang;Siqi Li;Yang Huang;Yuehang Wang;Yutong Yao;Yue Gao

Abstract: Animation super-resolution (SR) aims to generate high-resolution (HR) animation frames from degraded low-resolution (LR) inputs, constituting an important task in real-world SR. Existing animation SR methods typically follow a photorealistic real-world SR computational paradigm. However, digital animation frames commonly suffer from compression and transmission-related degradation, distinct from degradations in camera-captured real-world images. In this paper, we introduce a novel real-world animation super-resolution benchmark designed explicitly for animation frames, named ADASR, featuring both 2D and modern 3D animation content to facilitate industry applications. Additionally, we propose a Color-Aware Animation Super-Resolution (CAASR) method. CAASR, for the first time, incorporates a color degradation simulation mechanism tailored for animations, addressing color banding, blocking, and color shift. Furthermore, we develop a multi-scale multi-frequency alignment mechanism to robustly extract degradation-invariant features. Extensive experiments conducted on both the existing AVC dataset and our newly constructed ADASR dataset demonstrate that our proposed CAASR achieves state-of-the-art performance in restoring HR frames for both 2D and 3D animations. Code and data are available at https://github.com/huangyang-666/CAASR.

Vol. 34, pp. 5598-5613. Journal Article. Semantic Scholar paper ID: 144898578.
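To make two of the named artifacts concrete, the toy sketch below imitates color banding by reducing the bit depth of an 8-bit channel and color shift by adding a clipped per-channel offset. This is not the CAASR degradation model, whose details the abstract does not give; the function names and parameters are illustrative assumptions.

```python
def band(value, bits):
    """Keep only the top `bits` bits of an 8-bit channel value (coarser levels)."""
    shift = 8 - bits
    return (value >> shift) << shift

def shift_color(rgb, offset):
    """Add a per-channel offset and clip the result to the 8-bit range."""
    return tuple(min(255, max(0, c + o)) for c, o in zip(rgb, offset))

# A smooth 0..255 ramp collapses to a handful of flat bands.
gradient = list(range(256))
banded = [band(v, 3) for v in gradient]
print(len(set(banded)))

# A small offset changes hue and saturates at the range limits.
print(shift_color((250, 128, 4), (10, 0, -10)))
```

Flat-shaded animation frames contain large smooth gradients, which is why quantization of this kind shows up as visible bands rather than as noise.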

Citations: 0

HDRSL Net for Accurate High Dynamic Range Imaging-Based Structured Light 3D Reconstruction

IF 13.7

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2025-08-25 DOI: 10.1109/TIP.2025.3599934

Hao Wang;Chaobo Zhang;Xiang Qian;Xiaohao Wang;Weihua Gui;Wen Gao;Xiaojun Liang;Xinghui Li

Abstract: In fringe projection profilometry systems, accurately reconstructing 3D objects with varying surface reflectivity requires high dynamic range (HDR) imaging. However, the limited dynamic range of single-exposure cameras poses challenges for capturing HDR fringe patterns efficiently. This paper introduces a deep learning-based HDR structured light 3D reconstruction pipeline, comprising an HDR Fringe Generation Module and a Phase Calculation Module. The HDR Fringe Generation Module employs an end-to-end network with attention guidance and feature distillation to reconstruct HDR fringe images from short- and long-exposure low dynamic range (LDR) inputs. The Phase Calculation Module processes the phase information from HDR fringes to enable 3D reconstruction. On a metallic HDR dataset, the method achieved a phase error of 0.105, comparable to the 4-exposure 6-step Phase Shifting Profilometry (PSP) method (0.069), with only 8.3% of the projection time. Experimental results demonstrate the robustness of our approach under diverse object geometries, exposure levels, and challenging global illumination environments. In quantitative measurements, our method achieved sub-50 $\mu$m accuracy on ceramic spheres, flat plates, and a metal step object. Ablation experiments confirmed that the feature distillation and attention modules effectively enhance the HDR Fringe Generation Module, producing high-quality HDR fringe patterns critical for reconstructing objects with HDR surface reflectivity. Furthermore, we constructed an HDR imaging metal dataset comprising 1,700 samples of machined metal parts with diverse shapes, sizes, and materials, making it a benchmark in the field of HDR structured light measurement. Our method offers a general HDR imaging-based structured light 3D reconstruction approach, integrating the two modules into an efficient, end-to-end solution for objects with HDR reflective surfaces.

Vol. 34, pp. 5486-5499. Journal Article. Semantic Scholar paper ID: 144898581.
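For readers unfamiliar with the PSP baseline named in the abstract, the sketch below shows the textbook N-step phase-shifting formula, not the paper's learned Phase Calculation Module. It assumes the standard fringe model I_n = A + B cos(phi + 2 pi n / N) with N >= 3; the intensity values are made up.

```python
import math

def wrapped_phase(intensities):
    """Recover the wrapped phase from N equally shifted fringe intensities.

    Assumes I_n = A + B*cos(phi + 2*pi*n/N) with N >= 3, for which
    sum(I_n*sin(d_n)) = -(N*B/2)*sin(phi) and sum(I_n*cos(d_n)) = (N*B/2)*cos(phi).
    """
    n_steps = len(intensities)
    s = sum(i * math.sin(2 * math.pi * n / n_steps) for n, i in enumerate(intensities))
    c = sum(i * math.cos(2 * math.pi * n / n_steps) for n, i in enumerate(intensities))
    return math.atan2(-s, c)

# Simulate one pixel of a six-step sequence with a known phase, then recover it.
true_phi = 1.0
six_step = [120 + 80 * math.cos(true_phi + 2 * math.pi * n / 6) for n in range(6)]
recovered = wrapped_phase(six_step)
print(recovered)
```

A 4-exposure 6-step capture involves 4 x 6 = 24 fringe images. If the proposed pipeline needed only two captures (one short and one long exposure, an assumption the abstract does not confirm), the ratio 2/24 is about 8.3%, which would be consistent with the quoted projection-time figure.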

Citations: 0

CompletionMamba: Taming State Space Model for Point Cloud Completion

IF 13.7

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2025-08-25 DOI: 10.1109/TIP.2025.3597041

Zhiheng Fu;Jiehua Zhang;Longguang Wang;Lian Xu;Hamid Laga;Yulan Guo;Farid Boussaid;Mohammed Bennamoun

Abstract: Point cloud completion aims to reconstruct complete 3D shapes from partial scans. The long-range dependencies between points and shape perception are crucial for this task. While Transformers are effective due to their global processing ability, the quadratic complexity of their attention mechanism makes them unsuitable for long sequences when computational resources are constrained. As an alternative, State Space Models (SSMs) provide a memory-efficient solution for handling long-range dependencies, yet applying them directly to unordered point clouds presents challenges because of their intrinsic causality requirements. Existing methods attempt to address this by sorting points along a single axis. This, however, often overlooks complex causal relationships in 3D space, since adjacency relationships based on the Euclidean distance between points may not be preserved by this linear arrangement. To overcome this issue, we introduce CompletionMamba, a novel SSM-based network designed to harness SSMs for capturing both global and local dependencies within a point cloud. Initially, the input point cloud is causally structured by rearranging its coordinates. Then, a local SSM framework is proposed that defines neighborhood spaces around each point based on Euclidean distance, enhancing the causal structure. Although the local SSM enhances relationships in short- and long-distance sequences, it still lacks full shape modeling of the point cloud. To address this, we propose a novel shape-aware Mamba by integrating the shape code of each 3D shape into the model, enabling shape information propagation to all points. Our experiments show that CompletionMamba achieves state-of-the-art performance on both the MVP and PCN datasets.

Vol. 34, pp. 5473-5485. Journal Article. Semantic Scholar paper ID: 144898582.
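The claim that single-axis sorting can break Euclidean adjacency is easy to reproduce on a toy example. The sketch below sorts four hand-picked points along x and checks how far each point's true nearest neighbor ends up in the resulting sequence. The points are invented for illustration, and nothing here reflects CompletionMamba's actual ordering scheme.

```python
import math

# Two tight pairs, (0, 2) and (1, 3), interleaved along the x axis.
points = [(0.0, 0.0, 0.0), (0.1, 5.0, 0.0), (0.2, 0.0, 0.0), (0.3, 5.0, 0.0)]

# Serialize the cloud by sorting along a single axis (x).
order = sorted(range(len(points)), key=lambda i: points[i][0])
position = {idx: pos for pos, idx in enumerate(order)}

def nearest_neighbor(i):
    """Index of the Euclidean nearest neighbor of point i."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: math.dist(points[i], points[j]))

# Distance in the sequence between each point and its true nearest neighbor.
gaps = []
for i in range(len(points)):
    j = nearest_neighbor(i)
    gaps.append(abs(position[i] - position[j]))
print(gaps)
```

Every gap is larger than 1, so no point sits next to its spatial neighbor in the sorted sequence, even though each pair is only 0.2 apart in space.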

Citations: 0

Enhancing Multimodal Learning via Hierarchical Fusion Architecture Search With Inconsistency Mitigation

IF 13.7

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2025-08-22 DOI: 10.1109/TIP.2025.3599673

Kaifang Long;Guoyang Xie;Lianbo Ma;Qing Li;Min Huang;Jianhui Lv;Zhichao Lu

Abstract: The design of effective multimodal feature fusion strategies is the key task for multimodal learning, and it often requires huge computational costs and extensive expertise. In this paper, we seek to enhance multimodal learning via hierarchical fusion architecture search with inconsistency mitigation. Different from previous works, our Hierarchical Fusion Multimodal Neural Architecture Search (HF-MNAS) considers the inconsistency in modalities and labels, and fine-grained exploitation in multi-level fusion architectures. Specifically, it disentangles the hierarchical fusion problem into two-level (macro- and micro-level) search spaces. In the macro-level search space, the high-level and low-level features are extracted and then connected in a fine-grained way, where the inconsistency mitigation module is designed to minimize discrepancies between modalities and labels in cell outputs. In the micro-level search space, we find that different intermediate nodes in the cells exhibit different degrees of importance. We then propose an importance-based node selection mechanism to form the optimal cells for feature fusion. We evaluate HF-MNAS on a series of multimodal classification tasks. Empirical evidence shows that HF-MNAS achieves a competitive trade-off across accuracy, search time, and inference speed. In particular, HF-MNAS consumes minimal computational cost compared with state-of-the-art MNASs. Furthermore, we theoretically and experimentally verify that the modality-label inconsistency deteriorates the overall fusion performance of models, as measured by accuracy and F1 score, and demonstrate that the proposed inconsistency mitigation module can effectively mitigate this phenomenon.

Vol. 34, pp. 5458-5472. Journal Article. Semantic Scholar paper ID: 144898584.

Citations: 0

Rate-Distortion-Complexity Optimized Framework for Multi-Model Image Compression

IF 13.7

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2025-08-21 DOI: 10.1109/TIP.2025.3598916

Xinyu Hang;Ziqing Ge;Hongfei Fan;Chuanmin Jia;Siwei Ma;Wen Gao

Abstract: Learned Image Compression (LIC) has experienced rapid growth with the emergence of diverse frameworks. However, the variability in model design and training datasets poses a challenge for the universal application of a single coding model. To address this problem, this paper introduces a multi-model image coding framework that integrates various image codecs to overcome these limitations. By dynamically allocating codecs to different image regions, our framework optimizes reconstruction quality within the constraints of limited bitrate and decoding time, offering a high-performance, ubiquitous solution for the rate-distortion-complexity trade-off. Our framework features a detailed codec assignment algorithm based on the Simulated Annealing (SA) method, selected for its proven efficacy in managing the discrete and intricate nature of codec assignment optimization. We have implemented a coarse-to-fine strategy, which significantly enhances efficiency. Notably, our framework maintains compatibility with all standard image codecs without necessitating structural modifications. Empirical results indicate that our framework establishes a new standard in LIC, advancing the Pareto frontier for performance-complexity trade-offs. It achieves a 70% reduction in decoding time compared to current state-of-the-art methods, without compromising reconstruction quality. Furthermore, under comparable conditions, our approach substantially outperforms existing Rate-Distortion-Complexity (RDC) optimized codecs, with decoding speeds up to 30 times faster.

Vol. 34, pp. 5385-5399. Journal Article. Semantic Scholar paper ID: 144898585.
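As a rough picture of SA-based codec assignment, the sketch below anneals a per-region codec choice against a weighted sum of distortion, bits, and decoding time. It is a simplification under stated assumptions: the abstract describes bitrate and decoding-time constraints and a coarse-to-fine strategy, neither of which is modeled here, and the cost table, weights, and function names are all invented for illustration.

```python
import math
import random

# Hypothetical table: COST[r][c] = (distortion, bits, decode_ms) when image
# region r is coded with codec c. All numbers are made up.
COST = [
    [(4.0, 10, 1.0), (2.5, 12, 6.0), (2.0, 11, 20.0)],
    [(1.0, 8, 1.0), (0.9, 9, 6.0), (0.8, 9, 20.0)],
    [(6.0, 14, 1.0), (3.0, 15, 6.0), (2.2, 13, 20.0)],
    [(3.0, 9, 1.0), (2.8, 9, 6.0), (1.0, 10, 20.0)],
]

def objective(assign, lam_rate=0.1, lam_time=0.05):
    """Weighted rate-distortion-complexity cost of a codec assignment."""
    total = 0.0
    for region, codec in enumerate(assign):
        distortion, bits, decode_ms = COST[region][codec]
        total += distortion + lam_rate * bits + lam_time * decode_ms
    return total

def anneal(n_iters=2000, t0=1.0, cooling=0.995, seed=0):
    """Simulated annealing over per-region codec indices; returns the best found."""
    rng = random.Random(seed)
    current = [0] * len(COST)
    cur_cost = objective(current)
    best, best_cost = current[:], cur_cost
    temp = t0
    for _ in range(n_iters):
        cand = current[:]
        region = rng.randrange(len(COST))
        cand[region] = rng.randrange(len(COST[region]))  # re-assign one region
        delta = objective(cand) - cur_cost
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, cur_cost = cand, cur_cost + delta
            if cur_cost < best_cost:
                best, best_cost = current[:], cur_cost
        temp *= cooling
    return best, best_cost

best, best_cost = anneal()
print(best, round(best_cost, 3))
```

This toy objective is separable per region, so exhaustive search would solve it directly; annealing becomes worthwhile once global constraints or interactions between regions couple the choices, which is the discrete setting the abstract refers to.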

Citations: 0

UpGen: Unleashing Potential of Foundation Models for Training-Free Camouflage Detection via Generative Models

IF 13.7

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2025-08-21 DOI: 10.1109/TIP.2025.3599101

Ji Du;Jiesheng Wu;Desheng Kong;Weiyun Liang;Fangwei Hao;Jing Xu;Bin Wang;Guiling Wang;Ping Li

Abstract: Camouflaged Object Detection (COD) aims to segment objects resembling their environment. To address the challenges of extensive annotations and complex optimizations in supervised learning, recent prompt-based segmentation methods excavate insightful prompts from Large Vision-Language Models (LVLMs) and refine them using various foundation models. These are subsequently fed into the Segment Anything Model (SAM) for segmentation. However, due to the hallucinations of LVLMs and insufficient image-prompt interactions during the refinement stage, these prompts often struggle to capture well-established class differentiation and localization of camouflaged objects, resulting in performance degradation. To provide SAM with more informative prompts, we present UpGen, a pipeline that prompts SAM with generative prompts without requiring training, marking a novel integration of generative models with LVLMs. Specifically, we propose the Multi-Student-Single-Teacher (MSST) knowledge integration framework to alleviate hallucinations of LVLMs. This framework integrates insights from multiple sources to enhance the classification of camouflaged objects. To enhance interactions during the prompt refinement stage, we are the first to leverage generative models on real camouflage images to produce SAM-style prompts without fine-tuning. By capitalizing on the unique learning mechanism and structure of generative models, we effectively enable image-prompt interactions and generate highly informative prompts for SAM. Our extensive experiments demonstrate that UpGen outperforms weakly-supervised models and its SAM-based counterparts. We also integrate our framework into existing weakly-supervised methods to generate pseudo-labels, resulting in consistent performance gains. Moreover, with minor adjustments, UpGen shows promising results in open-vocabulary COD, referring COD, salient object detection, marine animal segmentation, and transparent object segmentation.

Vol. 34, pp. 5400-5413. Journal Article. Semantic Scholar paper ID: 144898586.

Citations: 0

Enhancing the Two-Stream Framework for Efficient Visual Tracking

IF 13.7

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2025-08-20 DOI: 10.1109/TIP.2025.3598934

Chengao Zong;Xin Chen;Jie Zhao;Yang Liu;Huchuan Lu;Dong Wang

Abstract: Practical deployments, especially on resource-limited edge devices, necessitate high speed for visual object trackers. To meet this demand, we introduce a new efficient tracker with a Two-Stream architecture, named ToS. While the recent one-stream tracking framework, which employs a unified backbone to process the template and search region simultaneously, has demonstrated exceptional efficacy, we find that the conventional two-stream tracking framework, which employs two separate backbones for the template and search region, offers inherent advantages. The two-stream tracking framework is more compatible with advanced lightweight backbones and can efficiently utilize benefits from large templates. We demonstrate that the two-stream setup can exceed the one-stream tracking model in both speed and accuracy through strategic designs. Our methodology rejuvenates the two-stream tracking paradigm with lightweight pre-trained backbones and three proposed efficient strategies: 1) a feature-aggregation module that improves the representation capability of the backbone, 2) a channel-wise approach for feature fusion, presenting a more effective and lighter alternative to spatial concatenation techniques, and 3) an expanded template strategy to boost tracking accuracy with negligible additional computational cost. Extensive evaluations across multiple tracking benchmarks demonstrate that the proposed method sets a new state-of-the-art performance in efficient visual tracking.

Vol. 34, pp. 5500-5512. Journal Article. Semantic Scholar paper ID: 144898588.

Citations: 0

Improving Video Summarization by Exploring the Coherence Between Corresponding Captions

IF 13.7

IEEE transactions on image processing : a publication of the IEEE Signal Processing Society Pub Date : 2025-08-20 DOI: 10.1109/TIP.2025.3598709

Cheng Ye;Weidong Chen;Bo Hu;Lei Zhang;Yongdong Zhang;Zhendong Mao

Abstract: Video summarization aims to generate a compact summary of the original video by selecting and combining the most representative parts. Most existing approaches only focus on recognizing key video segments to generate the summary, which lacks holistic considerations. The transitions between selected video segments are usually abrupt and inconsistent, making the summary confusing. Indeed, the coherence of video summaries is crucial to improve the quality and user viewing experience. However, the coherence between video segments is hard to measure and optimize from a pure vision perspective. To this end, we propose a Language-guided Segment Coherence-Aware Network (LS-CAN), which integrates overall coherence considerations into key segment recognition. The main idea of LS-CAN is to explore the coherence of the corresponding text modality to facilitate the overall coherence of the video summary, leveraging the natural property of language that contextual coherence is easy to measure. For the text coherence measure, we propose the multi-graph correlated neural network module (MGCNN), which constructs a graph for each sentence based on three key components, i.e., subject, attribute, and action words. For each sentence pair, the node features are then discriminatively learned by incorporating neighbors of its own graph and information of its dual graph, reducing the error of synonyms or reference relationships in measuring the correlation between sentences, as well as the error caused by considering each component separately. In doing so, MGCNN utilizes subject agreement, attribute coherence, and action succession to measure text coherence. In addition, with the help of large language models, we augment the original text coherence annotations, improving the ability of MGCNN to judge coherence. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and of each proposed module, in particular improving the latest records by +3.8%, +14.2%, and +12% in terms of F1 score, $\tau$, and $\rho$, respectively, on the BLiSS dataset.

Vol. 34, pp. 5369-5384. Journal Article. Semantic Scholar paper ID: 144898587.
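The abstract does not define the tau and rho metrics. In video summarization evaluation these symbols commonly denote Kendall's tau and Spearman's rho rank correlations between predicted and human importance scores, and the sketch below assumes that reading. It implements the tie-free forms only, and the score lists are invented.

```python
def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs, no tie correction."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant
    return score / (n * (n - 1) / 2)

def ranks(values):
    """Ranks 1..n in ascending order of value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid when there are no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

predicted = [0.9, 0.1, 0.5, 0.7]  # model importance scores per segment
human = [0.8, 0.2, 0.6, 0.3]      # reference importance scores
tau = kendall_tau(predicted, human)
rho = spearman_rho(predicted, human)
print(tau, rho)
```

Here five of the six segment pairs are ordered the same way by both score lists, giving tau = 4/6, and the rank differences give rho = 0.8. Real evaluations usually need tie handling, for which a statistics library is the safer choice.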

Citations: 0