IEEE Transactions on Image Processing: A Publication of the IEEE Signal Processing Society (Latest Articles)

UpGen: Unleashing Potential of Foundation Models for Training-Free Camouflage Detection via Generative Models
IF 13.7
Ji Du;Jiesheng Wu;Desheng Kong;Weiyun Liang;Fangwei Hao;Jing Xu;Bin Wang;Guiling Wang;Ping Li
{"title":"UpGen: Unleashing Potential of Foundation Models for Training-Free Camouflage Detection via Generative Models","authors":"Ji Du;Jiesheng Wu;Desheng Kong;Weiyun Liang;Fangwei Hao;Jing Xu;Bin Wang;Guiling Wang;Ping Li","doi":"10.1109/TIP.2025.3599101","DOIUrl":"10.1109/TIP.2025.3599101","url":null,"abstract":"Camouflaged Object Detection (COD) aims to segment objects resembling their environment. To address the challenges of extensive annotations and complex optimizations in supervised learning, recent prompt-based segmentation methods excavate insightful prompts from Large Vision-Language Models (LVLMs) and refine them using various foundation models. These are subsequently fed into the Segment Anything Model (SAM) for segmentation. However, due to the hallucinations of LVLMs and insufficient image-prompt interactions during the refinement stage, these prompts often struggle to capture well-established class differentiation and localization of camouflaged objects, resulting in performance degradation. To provide SAM with more informative prompts, we present UpGen, a pipeline that prompts SAM with generative prompts without requiring training, marking a novel integration of generative models with LVLMs. Specifically, we propose the Multi-Student-Single-Teacher (MSST) knowledge integration framework to alleviate hallucinations of LVLMs. This framework integrates insights from multiple sources to enhance the classification of camouflaged objects. To enhance interactions during the prompt refinement stage, we are the first to leverage generative models on real camouflage images to produce SAM-style prompts without fine-tuning. By capitalizing on the unique learning mechanism and structure of generative models, we effectively enable image-prompt interactions and generate highly informative prompts for SAM. Our extensive experiments demonstrate that UpGen outperforms weakly-supervised models and its SAM-based counterparts. We also integrate our framework into existing weakly-supervised methods to generate pseudo-labels, resulting in consistent performance gains. Moreover, with minor adjustments, UpGen shows promising results in open-vocabulary COD, referring COD, salient object detection, marine animal segmentation, and transparent object segmentation.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5400-5413"},"PeriodicalIF":13.7,"publicationDate":"2025-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Enhancing the Two-Stream Framework for Efficient Visual Tracking
IF 13.7
Chengao Zong;Xin Chen;Jie Zhao;Yang Liu;Huchuan Lu;Dong Wang
{"title":"Enhancing the Two-Stream Framework for Efficient Visual Tracking","authors":"Chengao Zong;Xin Chen;Jie Zhao;Yang Liu;Huchuan Lu;Dong Wang","doi":"10.1109/TIP.2025.3598934","DOIUrl":"10.1109/TIP.2025.3598934","url":null,"abstract":"Practical deployments, especially on resource-limited edge devices, necessitate high speed for visual object trackers. To meet this demand, we introduce a new efficient tracker with a Two-Stream architecture, named ToS. While the recent one-stream tracking framework, employing a unified backbone for simultaneous processing of both the template and search region, has demonstrated exceptional efficacy, we find the conventional two-stream tracking framework, which employs two separate backbones for the template and search region, offers inherent advantages. The two-stream tracking framework is more compatible with advanced lightweight backbones and can efficiently utilize benefits from large templates. We demonstrate that the two-stream setup can exceed the one-stream tracking model in both speed and accuracy through strategic designs. Our methodology rejuvenates the two-stream tracking paradigm with lightweight pre-trained backbones and the proposed three efficient strategies: 1) A feature-aggregation module that improves the representation capability of the backbone, 2) A channel-wise approach for feature fusion, presenting a more effective and lighter alternative to spatial concatenation techniques, and 3) An expanded template strategy to boost tracking accuracy with negligible additional computational cost. Extensive evaluations across multiple tracking benchmarks demonstrate that the proposed method sets a new state-of-the-art performance in efficient visual tracking.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5500-5512"},"PeriodicalIF":13.7,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Improving Video Summarization by Exploring the Coherence Between Corresponding Captions
IF 13.7
Cheng Ye;Weidong Chen;Bo Hu;Lei Zhang;Yongdong Zhang;Zhendong Mao
{"title":"Improving Video Summarization by Exploring the Coherence Between Corresponding Captions","authors":"Cheng Ye;Weidong Chen;Bo Hu;Lei Zhang;Yongdong Zhang;Zhendong Mao","doi":"10.1109/TIP.2025.3598709","DOIUrl":"10.1109/TIP.2025.3598709","url":null,"abstract":"Video summarization aims to generate a compact summary of the original video by selecting and combining the most representative parts. Most existing approaches only focus on recognizing key video segments to generate the summary, which lacks holistic considerations. The transitions between selected video segments are usually abrupt and inconsistent, making the summary confusing. Indeed, the coherence of video summaries is crucial to improve the quality and user viewing experience. However, the coherence between video segments is hard to measure and optimize from a pure vision perspective. To this end, we propose a Language-guided Segment Coherence-Aware Network (LS-CAN), which integrates entire coherence considerations into the key segment recognition. The main idea of LS-CAN is to explore the coherence of corresponding text modality to facilitate the entire coherence of the video summary, which leverages the natural property in the language that contextual coherence is easy to measure. In terms of text coherence measures, specifically, we propose the multi-graph correlated neural network module (MGCNN), which constructs a graph for each sentence based on three key components, i.e., subject, attribute, and action words. For each sentence pair, the node features are then discriminatively learned by incorporating neighbors of its own graph and information of its dual graph, reducing the error of synonyms or reference relationships in measuring the correlation between sentences, as well as the error caused by considering each component separately. In doing so, MGCNN utilizes subject agreement, attribute coherence, and action succession to measure text coherence. Besides, with the help of large language models, we augment the original text coherence annotations, improving the ability of MGCNN to judge coherence. Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module, especially improving the latest records by +3.8%, +14.2% and +12% w.r.t. F1 scores, <inline-formula> <tex-math>$tau $ </tex-math></inline-formula> and <inline-formula> <tex-math>$rho $ </tex-math></inline-formula> metrics on the BLiSS dataset.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5369-5384"},"PeriodicalIF":13.7,"publicationDate":"2025-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144898587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Hyperspectral Information Extraction With Full Resolution From Arbitrary Photographs
IF 13.7
Semin Kwon;Sang Mok Park;Yuhyun Ji;Haripriya Sakthivel;Jung Woo Leem;Young L. Kim
{"title":"Hyperspectral Information Extraction With Full Resolution From Arbitrary Photographs","authors":"Semin Kwon;Sang Mok Park;Yuhyun Ji;Haripriya Sakthivel;Jung Woo Leem;Young L. Kim","doi":"10.1109/TIP.2025.3597038","DOIUrl":"10.1109/TIP.2025.3597038","url":null,"abstract":"Because optical spectrometers capture abundant molecular, biological, and physical information beyond images, ongoing efforts focus on both algorithmic and hardware approaches to obtain detailed spectral information. Spectral reconstruction from red-green-blue (RGB) values acquired by conventional trichromatic cameras has been an active area of study. However, the resultant spectral profile is often affected not only by the unknown spectral properties of the sample itself, but also by light conditions, device characteristics, and image file formats. Existing machine learning models for spectral reconstruction are further limited in generalizability due to their reliance on task-specific training data or fixed models. Advanced spectrometer hardware employing sophisticated nanofabricated components also constrains scalability and affordability. Here we introduce a general computational framework, co-designed with spectrally incoherent color reference charts, to recover the spectral information of an arbitrary sample from a single-shot photo in the visible range. The mutual optimization of reference color selection and the computational algorithm eliminates the need for training data or pretrained models. In transmission mode, altered RGB values of reference colors are used to recover the spectral intensity of the sample, achieving spectral resolution comparable to that of scientific spectrometers. In reflection mode, a spectral hypercube of the sample can be constructed from a single-shot photo, analogous to hyperspectral imaging. The reported computational photography spectrometry has the potential to make optical spectroscopy and hyperspectral imaging accessible using off-the-shelf smartphones.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5429-5441"},"PeriodicalIF":13.7,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11125864","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144884588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Semi-Supervised Medical Hyperspectral Image Segmentation Using Adversarial Consistency Constraint Learning and Cross Indication Network
IF 13.7
Geng Qin;Huan Liu;Xueyu Zhang;Wei Li;Yuxing Guo;Chuanbin Guo
{"title":"Semi-Supervised Medical Hyperspectral Image Segmentation Using Adversarial Consistency Constraint Learning and Cross Indication Network","authors":"Geng Qin;Huan Liu;Xueyu Zhang;Wei Li;Yuxing Guo;Chuanbin Guo","doi":"10.1109/TIP.2025.3598499","DOIUrl":"10.1109/TIP.2025.3598499","url":null,"abstract":"Hyperspectral imaging technology is considered a new paradigm for high-precision pathological image segmentation due to its ability to obtain spatial and spectral information of the detected object simultaneously. However, due to the time-consuming and laborious manual annotation, precise annotation of medical hyperspectral images is difficult to obtain. Therefore, there is an urgent need for a semi-supervised learning framework that can fully utilize unlabeled data for medical hyperspectral image segmentation. In this work, we propose an adversarial consistency constraint learning cross indication network (ACCL-CINet), which achieves accurate pathological image segmentation through adversarial consistency constraint learning training strategies. The ACCL-CINet comprises a contextual and structural encoder to form the spatial-spectral feature encoding part. The contextual and structural indications are aggregated into features through a cross indication attention module and finally decoded by a pixel decoder to generate prediction results. For the semi-supervised training strategy, a pixel perceptual consistency module encourages the two models to generate consistent and low-entropy predictions. Secondly, a pixel maximum neighborhood probability adversarial constraint strategy is designed, which produces high-quality pseudo labels for cross supervision training. The proposed ACCL-CINet has been rigorously evaluated on both public and private datasets, with experimental results demonstrating that it outperforms state-of-the-art semi-supervised methods. The code is available at: <uri>https://github.com/Qugeryolo/ACCL-CINet</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5414-5428"},"PeriodicalIF":13.7,"publicationDate":"2025-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144884637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Alternating Direction Unfolding With a Cross Spectral Attention Prior for Dual-Camera Compressive Hyperspectral Imaging
IF 13.7
Yubo Dong;Dahua Gao;Danhua Liu;Yanli Liu;Guangming Shi
{"title":"Alternating Direction Unfolding With a Cross Spectral Attention Prior for Dual-Camera Compressive Hyperspectral Imaging","authors":"Yubo Dong;Dahua Gao;Danhua Liu;Yanli Liu;Guangming Shi","doi":"10.1109/TIP.2025.3597775","DOIUrl":"10.1109/TIP.2025.3597775","url":null,"abstract":"Coded Aperture Snapshot Spectral Imaging (CASSI) multiplexes 3D Hyperspectral Images (HSIs) into a 2D sensor to capture dynamic spectral scenes, which, however, sacrifices the spatial information. Dual-Camera Compressive Hyperspectral Imaging (DCCHI) enhances CASSI by incorporating a Panchromatic (PAN) camera to compensate for the loss of spatial information in CASSI. However, the dual-camera structure of DCCHI disrupts the diagonal property of the product of the sensing matrix and its transpose, making it difficult to efficiently and accurately solve the data subproblem in closed-form and thereby hindering the application of model-based methods and Deep Unfolding Networks (DUNs) that rely on such a closed-form solution. To address this issue, we propose an Alternating Direction DUN, named ADRNN, which decouples the imaging model of DCCHI into a CASSI subproblem and a PAN subproblem. The ADRNN alternately solves data terms analytically and a joint prior term in these subproblems. Additionally, we propose a Cross Spectral Transformer (XST) to exploit the joint prior. The XST utilizes cross spectral attention to exploit the correlation between the compressed HSI and the PAN image, and incorporates Grouped-Query Attention (GQA) to alleviate the burden of parameters and computational cost brought by impartially treating the compressed HSI and the PAN image. Furthermore, we built a real DCCHI system and captured large-scale indoor and outdoor scenes for future academic research. Extensive experiments on both simulation and real datasets demonstrate that the proposed method achieves state-of-the-art (SOTA) performance. The code and datasets have been open-sourced at: <uri>https://github.com/ShawnDong98/ADRNN-XST</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5325-5340"},"PeriodicalIF":13.7,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144877627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Color Spike Camera Reconstruction via Long Short-Term Temporal Aggregation of Spike Signals
IF 13.7
Yanchen Dong;Ruiqin Xiong;Jing Zhao;Xiaopeng Fan;Xinfeng Zhang;Tiejun Huang
{"title":"Color Spike Camera Reconstruction via Long Short-Term Temporal Aggregation of Spike Signals","authors":"Yanchen Dong;Ruiqin Xiong;Jing Zhao;Xiaopeng Fan;Xinfeng Zhang;Tiejun Huang","doi":"10.1109/TIP.2025.3595368","DOIUrl":"10.1109/TIP.2025.3595368","url":null,"abstract":"With the prevalence of emerging computer vision applications, the demand for capturing dynamic scenes with high-speed motion has increased. A kind of neuromorphic sensor called spike camera shows great potential in this aspect since it generates a stream of binary spikes to describe the dynamic light intensity with a very high temporal resolution. Color spike camera (CSC) was recently invented to capture the color information of dynamic scenes via a color filter array (CFA) on the sensor. This paper proposes a long short-term temporal aggregation strategy of spike signals. First, we utilize short-term temporal correlation to adaptively extract temporal features of each time point. Then we align the features and aggregate them to exploit long-term temporal correlation, suppressing undesired motion blur. To implement the strategy, we design a CSC reconstruction network. Based on adaptive short-term temporal aggregation, we propose a spike representation module to extract temporal features of each color channel, leveraging multiple temporal scales. Considering the long-term temporal correlation, we develop an alignment module to align the temporal features. In particular, we perform motion alignment of red and blue channels with the guidance of the higher-sampling-rate green channel, leveraging motion consistency among color channels. Besides, we propose a module to aggregate the aligned temporal features for the restored color image, which exploits color channel correlation. We have also developed a CSC simulator for data generation. Experimental results demonstrate that our method can restore color images with fine texture details, achieving state-of-the-art CSC reconstruction performance.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5312-5324"},"PeriodicalIF":13.7,"publicationDate":"2025-08-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144877628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Geometric-Aware Low-Light Image and Video Enhancement via Depth Guidance
IF 13.7
Yingqi Lin;Xiaogang Xu;Jiafei Wu;Yan Han;Zhe Liu
{"title":"Geometric-Aware Low-Light Image and Video Enhancement via Depth Guidance","authors":"Yingqi Lin;Xiaogang Xu;Jiafei Wu;Yan Han;Zhe Liu","doi":"10.1109/TIP.2025.3597046","DOIUrl":"10.1109/TIP.2025.3597046","url":null,"abstract":"Low-Light Enhancement (LLE) is aimed at improving the quality of photos/videos captured under low-light conditions. It is worth noting that most existing LLE methods do not take advantage of geometric modeling. We believe that incorporating geometric information can enhance LLE performance, as it provides insights into the physical structure of the scene that influences illumination conditions. To address this, we propose a Geometry-Guided Low-Light Enhancement Refine Framework (GG-LLERF) designed to assist low-light enhancement models in learning improved features by integrating geometric priors into the feature representation space. In this paper, we employ depth priors as the geometric representation. Our approach focuses on the integration of depth priors into various LLE frameworks using a unified methodology. This methodology comprises two key novel modules. First, a depth-aware feature extraction module is designed to inject depth priors into the image representation. Then, the Hierarchical Depth-Guided Feature Fusion Module (HDGFFM) is formulated with a cross-domain attention mechanism, which combines depth-aware features with the original image features within LLE models. We conducted extensive experiments on public low-light image and video enhancement benchmarks. The results illustrate that our framework significantly enhances existing LLE methods. The source code and pre-trained models are available at <uri>https://github.com/Estheryingqi/GG-LLERF</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5442-5457"},"PeriodicalIF":13.7,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144857248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Confound Controlled Multimodal Neuroimaging Data Fusion and Its Application to Developmental Disorders
IF 13.7
Chuang Liang;Rogers F. Silva;Tülay Adali;Rongtao Jiang;Daoqiang Zhang;Shile Qi;Vince D. Calhoun
{"title":"Confound Controlled Multimodal Neuroimaging Data Fusion and Its Application to Developmental Disorders","authors":"Chuang Liang;Rogers F. Silva;Tülay Adali;Rongtao Jiang;Daoqiang Zhang;Shile Qi;Vince D. Calhoun","doi":"10.1109/TIP.2025.3597045","DOIUrl":"10.1109/TIP.2025.3597045","url":null,"abstract":"Multimodal fusion provides multiple benefits over single modality analysis by leveraging both shared and complementary information from different modalities. Notably, supervised fusion enjoys extensive interest for capturing multimodal co-varying patterns associated with clinical measures. A key challenge of brain data analysis is how to handle confounds, which, if unaddressed, can lead to an unrealistic description of the relationship between the brain and clinical measures. Current approaches often rely on linear regression to remove covariate effects prior to fusion, which may lead to information loss, rather than pursue the more global strategy of optimizing both fusion and covariates removal simultaneously. Thus, we propose “CR-mCCAR” to jointly optimize for confounds within a guided fusion model, capturing co-varying multimodal patterns associated with a specific clinical domain while also discounting covariate effects. Simulations show that CR-mCCAR separate the reference and covariate factors accurately. Functional and structural neuroimaging data fusion reveals co-varying patterns in attention deficit/hyperactivity disorder (ADHD, striato-thalamo-cortical and salience areas) and in autism spectrum disorder (ASD, salience and fronto-temporal areas) that link with core symptoms but uncorrelate with age and motion. These results replicate in an independent cohort. Downstream classification accuracy between ADHD/ASD and controls is markedly higher for CR-mCCAR compared to fusion and regression separately. CR-mCCAR can be extended to include multiple targets and multiple covariates. Overall, results demonstrate CR-mCCAR can jointly optimize for target components that correlate with the reference(s) while removing nuisance covariates. This approach can improve the meaningful detection of reliable phenotype-linked multimodal biomarkers for brain disorders.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5271-5284"},"PeriodicalIF":13.7,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144851240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Parts2Whole: Generalizable Multi-Part Portrait Customization
IF 13.7
Hongxing Fan;Zehuan Huang;Lipeng Wang;Haohua Chen;Li Yin;Lu Sheng
{"title":"Parts2Whole: Generalizable Multi-Part Portrait Customization","authors":"Hongxing Fan;Zehuan Huang;Lipeng Wang;Haohua Chen;Li Yin;Lu Sheng","doi":"10.1109/TIP.2025.3597037","DOIUrl":"10.1109/TIP.2025.3597037","url":null,"abstract":"Multi-part portrait customization aims to generate realistic human images by assembling specified body parts from multiple reference images, with significant applications in digital human creation. Existing customization methods typically follow two approaches: 1) test-time fine-tuning, which learn concepts effectively but is time-consuming and struggles with multi-part composition; 2) generalizable feed-forward methods, which offer efficiency but lack fine control over appearance specifics. To address these limitations, we present Parts2Whole, a diffusion-based generalizable portrait generator that harmoniously integrates multiple reference parts into high-fidelity human images by our proposed multi-reference mechanism. To adequately characterize each part, we propose a detail-aware appearance encoder, which is initialized and inherits powerful image priors from the pre-trained denoising U-Net, enabling the encoding of detailed information from reference images. The extracted features are incorporated into the denoising U-Net by a shared self-attention mechanism, enhanced by mask information for precise part selection. Additionally, we integrate pose map conditioning to control the target posture of generated portraits, facilitating more flexible customization. Extensive experiments demonstrate the superiority of our approach over existing methods and applicability to related tasks like pose transfer and pose-guided human image generation, showcasing its versatile conditioning. Our project is available at <uri>https://huanngzh.github.io/Parts2Whole/</uri>","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":"34 ","pages":"5241-5256"},"PeriodicalIF":13.7,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144857249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0