Jin Qi, Can Su, Xiaoxuan Hu, Mengwei Chen, Yanfei Sun, Zhenjiang Dong, Tianliang Liu, Jiebo Luo
{"title":"AMFMER: A multimodal full transformer for unifying aesthetic assessment tasks","authors":"Jin Qi , Can Su , Xiaoxuan Hu , Mengwei Chen , Yanfei Sun , Zhenjiang Dong , Tianliang Liu , Jiebo Luo","doi":"10.1016/j.image.2025.117320","DOIUrl":null,"url":null,"abstract":"<div><div>Computational aesthetics aims to simulate the human visual perception process via the computers to automatically evaluate aesthetic quality with automatic methods. This topic has been widely studied by numerous researchers. However, existing research mostly focuses on image content while disregarding high-level semantics in the related image comments. In addition, most major assessment methods are based on convolutional neural networks (CNNs) for learning the distinctive features, which lack representational power and modeling capabilities for multimodal assessment requirement. Furthermore, many transformer-based model approaches suffer from limited information flow between different parts of the assumed model, and many multimodal fusion methods are used to extract image features and text features, and cannot handle multi-modal information well. Inspired by the above questions, in this paper, A novel Multimodal Full transforMER (AMFMER) evaluation model without aesthetic style information is proposed, consisting of three components: visual stream, textual stream and multimodal fusion layer. Firstly, the visual stream exploits the improved Swin transformer to extract the distinctive layer features of the input image. Secondly, the textual stream is based on the robustly optimized bidirectional encoder representations from transformers (RoBERTa) text encoder to extract semantic information from the corresponding comments. Thirdly, the multimodal fusion layer fuses visual features, textual features and low-layer salient features in a cross-attention manner to extract the multimodal distinctive features. Experimental results show that the proposed AMFMER approach in this paper outperforms current mainstream methods in a unified aesthetic prediction task, especially in terms of the correlation between the objective model evaluation and subjective human evaluation.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"138 ","pages":"Article 117320"},"PeriodicalIF":3.4000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Signal Processing-Image Communication","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0923596525000670","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Computational aesthetics aims to simulate the human visual perception process so that computers can automatically evaluate aesthetic quality. This topic has been widely studied by numerous researchers. However, existing research mostly focuses on image content while disregarding the high-level semantics contained in the associated image comments. In addition, most mainstream assessment methods rely on convolutional neural networks (CNNs) to learn distinctive features, and these lack the representational power and modeling capability required for multimodal assessment. Furthermore, many transformer-based approaches suffer from limited information flow between different parts of the model, and many multimodal fusion methods extract image features and text features separately and therefore cannot handle multimodal information well. Motivated by these issues, this paper proposes A novel Multimodal Full transforMER (AMFMER) evaluation model that requires no aesthetic style information and consists of three components: a visual stream, a textual stream and a multimodal fusion layer. First, the visual stream exploits an improved Swin transformer to extract distinctive layer-wise features from the input image. Second, the textual stream uses the robustly optimized bidirectional encoder representations from transformers (RoBERTa) text encoder to extract semantic information from the corresponding comments. Third, the multimodal fusion layer fuses the visual features, textual features and low-layer salient features in a cross-attention manner to extract multimodal distinctive features. Experimental results show that the proposed AMFMER approach outperforms current mainstream methods on a unified aesthetic prediction task, especially in terms of the correlation between the objective model evaluation and subjective human evaluation.
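The abstract only outlines the fusion idea, so the following is a minimal sketch, not the authors' implementation: text tokens (as would come from a RoBERTa-style encoder) attend to visual tokens (as would come from a Swin-style encoder) via cross-attention, and the fused representation is pooled into an aesthetic score. All dimensions, the residual/normalization layout and the single-score head are illustrative assumptions.

```python
# Hedged sketch of cross-attention multimodal fusion (PyTorch).
# Encoder outputs are stand-ins (random tensors); in the paper the visual
# stream is an improved Swin transformer and the textual stream is RoBERTa.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Queries come from the text tokens, keys/values from the visual tokens.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.score_head = nn.Linear(dim, 1)  # assumed single aesthetic-score output

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        # text_tokens:   (B, T, dim) from a RoBERTa-style text encoder
        # visual_tokens: (B, V, dim) from a Swin-style visual encoder (flattened patches)
        fused, _ = self.cross_attn(query=text_tokens, key=visual_tokens, value=visual_tokens)
        fused = self.norm(fused + text_tokens)   # residual connection over the text stream
        pooled = fused.mean(dim=1)               # simple mean pooling over tokens
        return self.score_head(pooled)           # (B, 1) predicted score


if __name__ == "__main__":
    text = torch.randn(2, 32, 768)    # 2 samples, 32 text tokens
    image = torch.randn(2, 49, 768)   # 2 samples, 49 visual patch tokens
    print(CrossAttentionFusion()(text, image).shape)  # torch.Size([2, 1])
```

In the paper the fusion layer additionally incorporates low-layer salient features from the visual stream; here only the text-to-image cross-attention path is shown to keep the sketch self-contained.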
Journal description:
Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are the following:
To present a forum for the advancement of theory and practice of image communication.
To stimulate cross-fertilization between areas similar in nature which have traditionally been separated, for example, various aspects of visual communications and information systems.
To contribute to a rapid information exchange between the industrial and academic environments.
The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements. Advertisements are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world.
Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments.
Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, architectures for image/video processing and communication.