{"title":"UpAttTrans: Upscaled attention based transformer for facial image super-resolution","authors":"Neeraj Baghel, Shiv Ram Dubey, Satish Kumar Singh","doi":"10.1016/j.imavis.2025.105731","DOIUrl":null,"url":null,"abstract":"<div><div>Image super-resolution (SR) aims to reconstruct high-quality images from low-resolution inputs, a task particularly challenging in face-related applications due to extreme degradations and modality differences (e.g., visible, low-resolution, near-infrared). Conventional convolutional neural networks (CNNs) and GAN-based approaches have achieved notable success; however, they often struggle with preserving identity and fine structural details at high upscaling factors. In this work, we introduce UpAttTrans, a novel attention mechanism that connects original and upsampled features for better detail recovery based on vision transformer for SR. The core generator leverages a custom UpAttTrans module that translates input image patches into embeddings, processes them through transformer layers enhanced with connector-up attention, and reconstructs high-resolution outputs with improved detail retention. We evaluate our model on the CelebA dataset across multiple upscaling factors (<span><math><mrow><mn>4</mn><mo>×</mo></mrow></math></span>, <span><math><mrow><mn>8</mn><mo>×</mo></mrow></math></span>, <span><math><mrow><mn>16</mn><mo>×</mo></mrow></math></span>, <span><math><mrow><mn>32</mn><mo>×</mo></mrow></math></span>, and <span><math><mrow><mn>64</mn><mo>×</mo></mrow></math></span>). UpAttTrans achieves a 24.63% increase in PSNR, 21.56% in SSIM, and 19.61% reduction in FID for <span><math><mrow><mn>4</mn><mo>×</mo></mrow></math></span> and <span><math><mrow><mn>8</mn><mo>×</mo></mrow></math></span> SR, outperforming state-of-the-art baselines. Additionally, for higher magnification levels, our model maintains strong performance, with average gains of 6.20% in PSNR and 21.49% in SSIM, indicating its robustness in extreme SR settings. These findings suggest that UpAttTrans holds significant promise for real-world applications such as face recognition in surveillance, forensic image enhancement, and cross-spectral matching, where high-quality reconstruction from severely degraded inputs is critical.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"163 ","pages":"Article 105731"},"PeriodicalIF":4.2000,"publicationDate":"2025-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Image and Vision Computing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0262885625003191","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Image super-resolution (SR) aims to reconstruct high-quality images from low-resolution inputs, a task particularly challenging in face-related applications due to extreme degradations and modality differences (e.g., visible, low-resolution, near-infrared). Conventional convolutional neural networks (CNNs) and GAN-based approaches have achieved notable success; however, they often struggle with preserving identity and fine structural details at high upscaling factors. In this work, we introduce UpAttTrans, a novel attention mechanism that connects original and upsampled features for better detail recovery based on vision transformer for SR. The core generator leverages a custom UpAttTrans module that translates input image patches into embeddings, processes them through transformer layers enhanced with connector-up attention, and reconstructs high-resolution outputs with improved detail retention. We evaluate our model on the CelebA dataset across multiple upscaling factors (, , , , and ). UpAttTrans achieves a 24.63% increase in PSNR, 21.56% in SSIM, and 19.61% reduction in FID for and SR, outperforming state-of-the-art baselines. Additionally, for higher magnification levels, our model maintains strong performance, with average gains of 6.20% in PSNR and 21.49% in SSIM, indicating its robustness in extreme SR settings. These findings suggest that UpAttTrans holds significant promise for real-world applications such as face recognition in surveillance, forensic image enhancement, and cross-spectral matching, where high-quality reconstruction from severely degraded inputs is critical.
期刊介绍:
Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.