Multi-Layer Cross-Modal Prompt Fusion for No-Reference Image Quality Assessment

IF 3.4, Zone 2 (Engineering & Technology), JCR Q1: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE

Displays Pub Date : 2025-04-02 DOI:10.1016/j.displa.2025.103045

Yang Lu, Zilu Zhou, Zifan Yang, Shuangyao Han, Xiaoheng Jiang, Mingliang Xu

Displays, volume 88, Article 103045. https://www.sciencedirect.com/science/article/pii/S0141938225000824


Abstract

No-Reference Image Quality Assessment (NR-IQA) predicts image quality without reference images and exhibits high consistency with human visual perception. Multi-modal approaches based on vision-language (VL) models, such as CLIP, have demonstrated remarkable generalization capabilities in NR-IQA tasks. While prompt learning has improved CLIP’s adaptation to downstream tasks, existing methods often lack synergy between textual and visual prompts, which limits their ability to capture complex cross-modal semantics. To address this limitation, this paper proposes MCPF-IQA, a framework with multi-layer cross-modal prompt fusion that further enhances the performance of the CLIP model on NR-IQA tasks. Specifically, we introduce multi-layer prompt learning in both the text and visual branches of CLIP to improve the model’s comprehension of visual features and image quality. Additionally, we design a novel cross-modal prompt fusion module that deeply integrates text and visual prompts to enhance the accuracy of image quality assessment. We also develop five auxiliary quality-related category labels to describe image quality more precisely. Experimental results demonstrate that the MCPF-IQA model delivers exceptional performance on natural image datasets, with an SRCC of 0.988 on the LIVE dataset (1.8% higher than the second-best method) and 0.913 on the LIVEC dataset (1.0% higher than the second-best method). It also exhibits strong performance on AI-generated image datasets. Ablation study results demonstrate the effectiveness and advantages of our method.
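The results above are reported as SRCC, the Spearman rank-order correlation coefficient between predicted quality scores and human ratings. As a minimal sketch of how such a figure is typically computed (not the authors' evaluation code), the pure-Python example below ranks both lists with tie averaging and takes the Pearson correlation of the ranks. The function names `average_ranks` and `srcc`, and the assumption that the human ratings are mean opinion scores (MOS), are illustrative and do not come from the paper.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def srcc(predicted: Sequence[float], mos: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(predicted) != len(mos) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(mos)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("SRCC is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Hypothetical model outputs and human ratings, invented for illustration.
    predicted_quality = [0.2, 0.9, 0.5, 0.7]
    human_mos = [30.0, 80.0, 60.0, 55.0]
    print(f"SRCC = {srcc(predicted_quality, human_mos):.3f}")
```

Because SRCC depends only on rank order, a model can reach a high value without its scores being on the same scale as the human ratings; it only has to order the images the way human raters do.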


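Source journal: Displays (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 4.60

Self-citation rate: 25.60%

Articles published: 138

Review time: 92 days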

Journal description: Displays is the international journal covering the research and development of display technology, its effective presentation and perception of information, and applications and systems including display-human interface. Technical papers on practical developments in Displays technology provide an effective channel to promote greater understanding and cross-fertilization across the diverse disciplines of the Displays community. Original research papers solving ergonomics issues at the display-human interface advance effective presentation of information. Tutorial papers covering fundamentals intended for display technologies and human factor engineers new to the field will also occasionally be featured.