Image and Vision Computing: Latest Articles

Mining fine-grained attributes for vision–semantics integration in few-shot learning
IF 4.2 | Tier 3 (Computer Science)
Image and Vision Computing, Volume 163, Article 105739 | Pub Date: 2025-09-18 | DOI: 10.1016/j.imavis.2025.105739
Authors: Juan Zhao, Lili Kong, Deshang Sun, Deng Xiong, Jiancheng Lv
Abstract: Recent advancements in Few-Shot Learning (FSL) have been significantly driven by leveraging semantic descriptions to enhance feature discrimination and recognition performance. However, existing methods, such as SemFew, often rely on verbose or manually curated attributes and apply semantic guidance only to the support set, limiting their effectiveness in distinguishing fine-grained categories. Inspired by human visual perception, which emphasizes crucial features for accurate recognition, this study introduces concise, fine-grained semantic attributes to address these limitations. We propose a Visual Attribute Enhancement (VAE) mechanism that integrates enriched semantic information into visual features, enabling the model to highlight the most relevant visual attributes and better distinguish visually similar samples. This module enhances visual features by aligning them with semantic attribute embeddings through a cross-attention mechanism and optimizes this alignment using an attribute-based cross-entropy loss. Furthermore, to mitigate the performance degradation caused by methods that supply semantic information exclusively to the support set, we propose a semantic attribute reconstruction (SAR) module. This module predicts and integrates semantic features for query samples, ensuring balanced information distribution between the support and query sets. Specifically, SAR enhances query representations by aligning and reconstructing semantic and visual attributes through regression and optimal transport losses to ensure semantic–visual consistency. Experiments on five benchmark datasets, covering both general and more challenging fine-grained few-shot datasets, consistently demonstrate that our proposed method outperforms state-of-the-art methods in both 5-way 1-shot and 5-way 5-shot settings.
Novel extraction of discriminative fine-grained feature to improve retinal vessel segmentation
IF 4.2 | Tier 3 (Computer Science)
Image and Vision Computing, Volume 163, Article 105729 | Pub Date: 2025-09-18 | DOI: 10.1016/j.imavis.2025.105729
Authors: Shuang Zeng, Chee Hong Lee, Micky C. Nnamdi, Wenqi Shi, J. Ben Tamo, Hangzhou He, Xinliang Zhang, Qian Chen, May D. Wang, Lei Zhu, Yanye Lu, Qiushi Ren
Abstract: Retinal vessel segmentation is a vital early detection method for several severe ocular diseases. Despite significant progress with the advancement of neural networks, challenges remain. Retinal vessel segmentation aims to predict the class label for every pixel within a fundus image, with a primary focus on intra-image discrimination, so it is vital for models to extract discriminative features. Nevertheless, existing methods primarily focus on minimizing the difference between the decoder output and the label, and do not fully exploit the feature-level fine-grained representations from the encoder. To address these issues, we propose an Attention U-shaped Kolmogorov–Arnold Network, AttUKAN, along with a novel Label-guided Pixel-wise Contrastive Loss for retinal vessel segmentation. Specifically, we add Attention Gates to Kolmogorov–Arnold Networks to enhance model sensitivity by suppressing irrelevant feature activations and to improve interpretability through the non-linear modeling of KAN blocks. Additionally, the Label-guided Pixel-wise Contrastive Loss supervises AttUKAN to extract more discriminative features by distinguishing foreground vessel-pixel pairs from background pairs. Experiments are conducted on four public datasets (DRIVE, STARE, CHASE_DB1, HRF) and our private dataset. AttUKAN achieves F1 scores of 82.50%, 81.14%, 81.34%, 80.21% and 80.09%, along with MIoU scores of 70.24%, 68.64%, 68.59%, 67.21% and 66.94% on these datasets, the highest among 11 compared retinal vessel segmentation networks. Quantitative and qualitative results show that AttUKAN achieves state-of-the-art performance and outperforms existing retinal vessel segmentation methods. Our code will be available at https://github.com/stevezs315/AttUKAN.
Your image generator is your new private dataset
IF 4.2 | Tier 3 (Computer Science)
Image and Vision Computing, Volume 163, Article 105727 | Pub Date: 2025-09-18 | DOI: 10.1016/j.imavis.2025.105727
Authors: Nicolò Francesco Resmini, Eugenio Lomurno, Cristian Sbrolli, Matteo Matteucci
Abstract: Generative diffusion models have emerged as powerful tools for synthetically producing training data, offering potential solutions to data scarcity and reducing labelling costs for downstream supervised deep learning applications. However, existing approaches to synthetic dataset generation face significant limitations: previous methods such as Knowledge Recycling rely on label-conditioned generation with models trained from scratch, limiting flexibility and requiring extensive computational resources, while simple class-based conditioning fails to capture the semantic diversity and intra-class variation found in real datasets. Effectively leveraging text-conditioned image generation for building classifier training sets also requires addressing key issues: constructing informative textual prompts, adapting generative models to specific domains, and ensuring robust performance. This paper proposes the Text-Conditioned Knowledge Recycling (TCKR) pipeline to tackle these challenges. TCKR combines dynamic image captioning, parameter-efficient diffusion model fine-tuning, and Generative Knowledge Distillation to create synthetic datasets tailored for image classification. The pipeline is rigorously evaluated on ten diverse image classification benchmarks. The results demonstrate that models trained solely on TCKR-generated data achieve classification accuracies on par with, and in several cases exceeding, models trained on real images. Furthermore, these synthetic-data-trained models exhibit substantially enhanced privacy characteristics: their vulnerability to Membership Inference Attacks is significantly reduced, with the membership inference AUC lowered by 5.49 points on average compared to using real training data, a substantial improvement in the performance-privacy trade-off. These findings indicate that high-fidelity synthetic data can effectively replace real data for training classifiers, yielding strong performance while providing improved privacy protection as a valuable emergent property. The code and trained models are available in the accompanying open-source repository.
InceptionWTMNet: A hybrid network for Alzheimer’s Disease detection using wavelet transform convolution and Mixed Local Channel Attention on finely fused multimodal images
IF 4.2 | Tier 3 (Computer Science)
Image and Vision Computing, Volume 163, Article 105693 | Pub Date: 2025-09-18 | DOI: 10.1016/j.imavis.2025.105693
Authors: Zenan Xu, Zhengyao Bai, Han Ma, Mingqiang Xu, Qiqin Huang, Tao Lin
Abstract: Multimodal fusion has emerged as a critical technique for the diagnosis of Alzheimer’s Disease (AD), aiming to effectively extract and utilise complementary information from diverse modalities. Current fusion methods frequently require precise alignment of source images and do not adequately address parallax issues, which can introduce artifacts during fusion when images are misaligned. In response to this challenge, we propose a refined registration-fusion technique, termed MURF, which integrates multimodal image registration and fusion within a cohesive framework. The Vision Transformer (ViT) has inspired the application of large-kernel convolutions in AD diagnosis because of their ability to model long-range dependencies, expanding the receptive field and enhancing the performance of diagnostic models. Despite requiring few floating-point operations (FLOPs), such operators suffer from over-parameterisation and high memory access costs, which ultimately compromise computational efficiency. Using wavelet transform convolutions (WTConv), we decompose large-kernel depth-wise convolutions into four parallel branches: one branch employs a wavelet-transform convolution with a square kernel, two branches incorporate orthogonal wavelet-transform kernels, and the remaining branch is an identity mapping. Combined with a Mixed Local Channel Attention mechanism, this design yields the InceptionWTConvolutions network, which maintains a receptive field comparable to that of large-kernel convolutions while reducing over-parameterisation and improving computational efficiency. InceptionWTMNet classifies AD, MCI, and NC using MRI and PET data from the ADNI dataset with 98.69% accuracy, 98.65% recall, 98.70% F1-score, and 98.98% AUC.
Explainable deepfake detection across different modalities: An overview of methods and challenges
IF 4.2 | Tier 3 (Computer Science)
Image and Vision Computing, Volume 163, Article 105738 | Pub Date: 2025-09-16 | DOI: 10.1016/j.imavis.2025.105738
Authors: MD Sarfaraz Momin, Abu Sufian, Debaditya Barman, Marco Leo, Cosimo Distante, Naser Damer
Abstract: The increasing use of deepfake technology enables the creation of realistic and deceptive content, raising concerns around biometric authentication, misinformation, politics, privacy, and trust. Many Deepfake Detection (DD) models are entering the market to combat the misuse of deepfakes. With these developments, a primary issue is ensuring the explainability of the proposed detection models so that the rationale behind their decisions can be understood. This paper investigates state-of-the-art explainable DD models across multiple modalities, including image, video, audio, and text. Unlike existing surveys that focus on detection methodologies with minimal attention to explainability and limited modality coverage, this paper directly targets these gaps. It offers a comprehensive analysis of advanced explainability techniques, including Grad-CAM, LIME, SHAP, LRP, Saliency Maps, and Anchors, for detecting deceptive content across modalities. It identifies the strengths and limitations of existing models and outlines research directions to enhance explainability and interpretability in future work. By exploring these models, we aim to enhance transparency, provide deeper insight into model decisions, and bridge the gap between detection accuracy and explainability in DD models.
Artificial intelligence content detection techniques using watermarking: A survey
IF 4.2 | Tier 3 (Computer Science)
Image and Vision Computing, Volume 163, Article 105728 | Pub Date: 2025-09-15 | DOI: 10.1016/j.imavis.2025.105728
Authors: Nishant Kumar, Amit Kumar Singh
Abstract: The rapid advancement of AI-generated content has catalyzed artistic creation, advertising, and media dissemination. Despite its widespread applications across several domains, AI-generated content inherently poses risks of identity fraud, copyright violation, and unauthorized use. Watermarking has emerged as a critical tool for copyright protection, allowing identification information to be embedded in AI-generated content and enhancing traceability and verification without hurting the user experience. In this study, we provide a systematic literature review of techniques for detecting AI-generated content, especially text and images, using watermarking, spanning studies from 2010 to 2025. Studies included in this review were peer-reviewed articles that applied watermarking to distinguish AI-generated content from real or human-written content. We report notable past and current approaches to watermarking-based detection of AI-generated text and images, including an analysis of how watermarking methods are applied to AI-generated content, their role in enhancing performance, and a detailed comparative analysis of notable techniques. Furthermore, we discuss how these methods have been evaluated, identify research gaps, and outline potential solutions. Our findings provide valuable insights for future researchers and for organizations seeking to implement watermarking-based AI content detection in practical applications. To the best of our knowledge, this is the first survey to explore the detection of AI-generated content, especially text and images, using watermarking.
UNIR-Net: A novel approach for restoring underwater images with non-uniform illumination using synthetic data
IF 4.2 | Tier 3 (Computer Science)
Image and Vision Computing, Volume 163, Article 105734 | Pub Date: 2025-09-15 | DOI: 10.1016/j.imavis.2025.105734
Authors: Ezequiel Pérez-Zarate, Chunxiao Liu, Oscar Ramos-Soto, Diego Oliva, Marco Pérez-Cisneros
Abstract: Restoring underwater images affected by non-uniform illumination (NUI) is essential for improving visual quality and usability in marine applications. Conventional methods often fall short in handling complex illumination patterns, while learning-based approaches face challenges due to the lack of targeted datasets. To address these limitations, the Underwater Non-uniform Illumination Restoration Network (UNIR-Net) is proposed. UNIR-Net integrates multiple components, including illumination enhancement, attention mechanisms, visual refinement, and contrast correction, to effectively restore underwater images affected by NUI. In addition, the Paired Underwater Non-uniform Illumination (PUNI) dataset is introduced, specifically designed for training and evaluating models under NUI conditions. Experimental results on PUNI and the large-scale real-world Non-Uniform Illumination Dataset (NUID) show that UNIR-Net achieves superior performance in both quantitative metrics and visual outcomes. UNIR-Net also improves downstream tasks such as underwater semantic segmentation, highlighting its practical relevance. The code is available at https://github.com/xingyumex/UNIR-Net.
MITS: A large-scale multimodal benchmark dataset for Intelligent Traffic Surveillance
IF 4.2 | Tier 3 (Computer Science)
Image and Vision Computing, Volume 163, Article 105736 | Pub Date: 2025-09-15 | DOI: 10.1016/j.imavis.2025.105736
Authors: Kaikai Zhao, Zhaoxiang Liu, Peng Wang, Xin Wang, Zhicheng Ma, Yajun Xu, Wenjing Zhang, Yibing Nan, Kai Wang, Shiguo Lian
Abstract: General-domain large multimodal models (LMMs) have achieved significant advances in various image-text tasks. However, their performance in the Intelligent Traffic Surveillance (ITS) domain remains limited due to the absence of dedicated multimodal datasets. To address this gap, we introduce MITS (Multimodal Intelligent Traffic Surveillance), the first large-scale multimodal benchmark dataset specifically designed for ITS. MITS includes 170,400 independently collected real-world ITS images sourced from traffic surveillance cameras, annotated with eight main categories and 24 subcategories of ITS-specific objects and events under diverse environmental conditions. Additionally, through a systematic data generation pipeline, we generate high-quality image captions and 5 million instruction-following visual question-answer pairs addressing five critical ITS tasks: object and event recognition, object counting, object localization, background analysis, and event reasoning. To demonstrate MITS’s effectiveness, we fine-tune mainstream LMMs on this dataset, enabling the development of ITS-specific applications. Experimental results show that MITS significantly improves LMM performance on ITS applications, increasing LLaVA-1.5’s score from 0.494 to 0.905 (+83.2%), LLaVA-1.6’s from 0.678 to 0.921 (+35.8%), Qwen2-VL’s from 0.584 to 0.926 (+58.6%), and Qwen2.5-VL’s from 0.732 to 0.930 (+27.0%). We release the dataset, code, and models as open source, providing high-value resources to advance both ITS and LMM research.
UpAttTrans: Upscaled attention based transformer for facial image super-resolution
IF 4.2 | Tier 3 (Computer Science)
Image and Vision Computing, Volume 163, Article 105731 | Pub Date: 2025-09-13 | DOI: 10.1016/j.imavis.2025.105731
Authors: Neeraj Baghel, Shiv Ram Dubey, Satish Kumar Singh
Abstract: Image super-resolution (SR) aims to reconstruct high-quality images from low-resolution inputs, a task that is particularly challenging in face-related applications due to extreme degradation and modality differences (e.g., visible, low-resolution, near-infrared). Conventional convolutional neural networks (CNNs) and GAN-based approaches have achieved notable success; however, they often struggle to preserve identity and fine structural details at high upscaling factors. In this work, we introduce UpAttTrans, a vision-transformer-based SR model built on a novel attention mechanism that connects original and upsampled features for better detail recovery. The core generator leverages a custom UpAttTrans module that translates input image patches into embeddings, processes them through transformer layers enhanced with connector-up attention, and reconstructs high-resolution outputs with improved detail retention. We evaluate the model on the CelebA dataset across multiple upscaling factors (4×, 8×, 16×, 32×, and 64×). UpAttTrans achieves a 24.63% increase in PSNR, a 21.56% increase in SSIM, and a 19.61% reduction in FID for 4× and 8× SR, outperforming state-of-the-art baselines. At higher magnification levels, the model maintains strong performance, with average gains of 6.20% in PSNR and 21.49% in SSIM, indicating robustness in extreme SR settings. These findings suggest that UpAttTrans holds significant promise for real-world applications such as face recognition in surveillance, forensic image enhancement, and cross-spectral matching, where high-quality reconstruction from severely degraded inputs is critical.
Flexible disentangled representation learning with soft-splitting for multi-view data
IF 4.2 | Tier 3 (Computer Science)
Image and Vision Computing, Volume 162, Article 105722 | Pub Date: 2025-09-13 | DOI: 10.1016/j.imavis.2025.105722
Authors: Xunzhan Yao, Ming Yin, Yonghua Wang, Yi Guo
Abstract: Multi-view representation learning has gained significant attention in the machine learning and computer vision communities. However, existing approaches often fail to fully exploit the complementary parts of different views during the fusion process, which can lead to representation entanglement and consequently degrade performance on downstream tasks. To this end, we propose a novel flexible disentangled representation learning method for multi-view data. Representation learning is performed by an adaptive soft-splitting multi-view gated fusion auto-encoder network, ASS-MVGFAE, which separates the complementary and consistent parts in a soft way rather than through the hard splitting used in traditional methods. The decoupled common features are then fed into a Gated Fusion Unit (GFU) to be aligned and fused, yielding a shared latent representation for downstream clustering. Extensive experiments on several real-world datasets demonstrate that our method outperforms the state of the art on several evaluation metrics.