Jin Qi, Can Su, Xiaoxuan Hu, Mengwei Chen, Yanfei Sun, Zhenjiang Dong, Tianliang Liu, Jiebo Luo
{"title":"AMFMER: A multimodal full transformer for unifying aesthetic assessment tasks","authors":"Jin Qi , Can Su , Xiaoxuan Hu , Mengwei Chen , Yanfei Sun , Zhenjiang Dong , Tianliang Liu , Jiebo Luo","doi":"10.1016/j.image.2025.117320","DOIUrl":null,"url":null,"abstract":"<div><div>Computational aesthetics aims to simulate the human visual perception process via the computers to automatically evaluate aesthetic quality with automatic methods. This topic has been widely studied by numerous researchers. However, existing research mostly focuses on image content while disregarding high-level semantics in the related image comments. In addition, most major assessment methods are based on convolutional neural networks (CNNs) for learning the distinctive features, which lack representational power and modeling capabilities for multimodal assessment requirement. Furthermore, many transformer-based model approaches suffer from limited information flow between different parts of the assumed model, and many multimodal fusion methods are used to extract image features and text features, and cannot handle multi-modal information well. Inspired by the above questions, in this paper, A novel Multimodal Full transforMER (AMFMER) evaluation model without aesthetic style information is proposed, consisting of three components: visual stream, textual stream and multimodal fusion layer. Firstly, the visual stream exploits the improved Swin transformer to extract the distinctive layer features of the input image. Secondly, the textual stream is based on the robustly optimized bidirectional encoder representations from transformers (RoBERTa) text encoder to extract semantic information from the corresponding comments. Thirdly, the multimodal fusion layer fuses visual features, textual features and low-layer salient features in a cross-attention manner to extract the multimodal distinctive features. Experimental results show that the proposed AMFMER approach in this paper outperforms current mainstream methods in a unified aesthetic prediction task, especially in terms of the correlation between the objective model evaluation and subjective human evaluation.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"138 ","pages":"Article 117320"},"PeriodicalIF":3.4000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Signal Processing-Image Communication","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0923596525000670","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Computational aesthetics aims to simulate the human visual perception process so that computers can automatically evaluate aesthetic quality. This topic has been widely studied by numerous researchers. However, existing research mostly focuses on image content while disregarding the high-level semantics contained in the associated image comments. In addition, most mainstream assessment methods rely on convolutional neural networks (CNNs) to learn distinctive features, and these lack the representational power and modeling capability required for multimodal assessment. Furthermore, many transformer-based approaches suffer from limited information flow between different parts of the model, and many multimodal fusion methods extract image features and text features separately and therefore cannot handle multimodal information well. Motivated by these issues, this paper proposes A novel Multimodal Full transforMER (AMFMER) evaluation model that requires no aesthetic style information and consists of three components: a visual stream, a textual stream and a multimodal fusion layer. First, the visual stream exploits an improved Swin transformer to extract distinctive layer-wise features from the input image. Second, the textual stream uses the robustly optimized bidirectional encoder representations from transformers (RoBERTa) text encoder to extract semantic information from the corresponding comments. Third, the multimodal fusion layer fuses the visual features, textual features and low-layer salient features in a cross-attention manner to extract multimodal distinctive features. Experimental results show that the proposed AMFMER approach outperforms current mainstream methods on a unified aesthetic prediction task, especially in terms of the correlation between the objective model evaluation and subjective human evaluation.
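The abstract only outlines the fusion idea, so the following is a minimal sketch, not the authors' implementation: text tokens (as would come from a RoBERTa-style encoder) attend to visual tokens (as would come from a Swin-style encoder) via cross-attention, and the fused representation is pooled into an aesthetic score. All dimensions, the residual/normalization layout and the single-score head are illustrative assumptions.

```python
# Hedged sketch of cross-attention multimodal fusion (PyTorch).
# Encoder outputs are stand-ins (random tensors); in the paper the visual
# stream is an improved Swin transformer and the textual stream is RoBERTa.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Queries come from the text tokens, keys/values from the visual tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.score_head = nn.Linear(dim, 1)  # assumed single aesthetic-score output

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens:   (B, T, dim) from a RoBERTa-style text encoder
        # visual_tokens: (B, V, dim) from a Swin-style visual encoder (flattened patches)
        fused, _ = self.cross_attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        fused = self.norm(fused + text_tokens)   # residual connection over the text stream
        pooled = fused.mean(dim=1)               # simple mean pooling over tokens
        return self.score_head(pooled)           # (B, 1) predicted score


if __name__ == "__main__":
    text = torch.randn(2, 32, 768)    # 2 samples, 32 text tokens
    image = torch.randn(2, 49, 768)   # 2 samples, 49 visual patch tokens
    print(CrossAttentionFusion()(text, image).shape)  # torch.Size([2, 1])
```

In the paper the fusion layer additionally incorporates low-layer salient features from the visual stream; here only the text-to-image cross-attention path is shown to keep the sketch self-contained.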
Journal description:
Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are the following:
To present a forum for the advancement of theory and practice of image communication.
To stimulate cross-fertilization between areas similar in nature which have traditionally been separated, for example, various aspects of visual communications and information systems.
To contribute to a rapid information exchange between the industrial and academic environments.
The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements. Advertisements are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world.
Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments.
Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, architectures for image/video processing and communication.