AMFMER: A multimodal full transformer for unifying aesthetic assessment tasks

IF 3.4 · CAS Zone 3 (Engineering & Technology) · JCR Q2, ENGINEERING, ELECTRICAL & ELECTRONIC
Jin Qi, Can Su, Xiaoxuan Hu, Mengwei Chen, Yanfei Sun, Zhenjiang Dong, Tianliang Liu, Jiebo Luo
{"title":"AMFMER:用于统一美学评估任务的多模态全变压器","authors":"Jin Qi ,&nbsp;Can Su ,&nbsp;Xiaoxuan Hu ,&nbsp;Mengwei Chen ,&nbsp;Yanfei Sun ,&nbsp;Zhenjiang Dong ,&nbsp;Tianliang Liu ,&nbsp;Jiebo Luo","doi":"10.1016/j.image.2025.117320","DOIUrl":null,"url":null,"abstract":"<div><div>Computational aesthetics aims to simulate the human visual perception process via the computers to automatically evaluate aesthetic quality with automatic methods. This topic has been widely studied by numerous researchers. However, existing research mostly focuses on image content while disregarding high-level semantics in the related image comments. In addition, most major assessment methods are based on convolutional neural networks (CNNs) for learning the distinctive features, which lack representational power and modeling capabilities for multimodal assessment requirement. Furthermore, many transformer-based model approaches suffer from limited information flow between different parts of the assumed model, and many multimodal fusion methods are used to extract image features and text features, and cannot handle multi-modal information well. Inspired by the above questions, in this paper, A novel Multimodal Full transforMER (AMFMER) evaluation model without aesthetic style information is proposed, consisting of three components: visual stream, textual stream and multimodal fusion layer. Firstly, the visual stream exploits the improved Swin transformer to extract the distinctive layer features of the input image. Secondly, the textual stream is based on the robustly optimized bidirectional encoder representations from transformers (RoBERTa) text encoder to extract semantic information from the corresponding comments. Thirdly, the multimodal fusion layer fuses visual features, textual features and low-layer salient features in a cross-attention manner to extract the multimodal distinctive features. Experimental results show that the proposed AMFMER approach in this paper outperforms current mainstream methods in a unified aesthetic prediction task, especially in terms of the correlation between the objective model evaluation and subjective human evaluation.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"138 ","pages":"Article 117320"},"PeriodicalIF":3.4000,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"AMFMER: A multimodal full transformer for unifying aesthetic assessment tasks\",\"authors\":\"Jin Qi ,&nbsp;Can Su ,&nbsp;Xiaoxuan Hu ,&nbsp;Mengwei Chen ,&nbsp;Yanfei Sun ,&nbsp;Zhenjiang Dong ,&nbsp;Tianliang Liu ,&nbsp;Jiebo Luo\",\"doi\":\"10.1016/j.image.2025.117320\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Computational aesthetics aims to simulate the human visual perception process via the computers to automatically evaluate aesthetic quality with automatic methods. This topic has been widely studied by numerous researchers. However, existing research mostly focuses on image content while disregarding high-level semantics in the related image comments. In addition, most major assessment methods are based on convolutional neural networks (CNNs) for learning the distinctive features, which lack representational power and modeling capabilities for multimodal assessment requirement. 
Furthermore, many transformer-based model approaches suffer from limited information flow between different parts of the assumed model, and many multimodal fusion methods are used to extract image features and text features, and cannot handle multi-modal information well. Inspired by the above questions, in this paper, A novel Multimodal Full transforMER (AMFMER) evaluation model without aesthetic style information is proposed, consisting of three components: visual stream, textual stream and multimodal fusion layer. Firstly, the visual stream exploits the improved Swin transformer to extract the distinctive layer features of the input image. Secondly, the textual stream is based on the robustly optimized bidirectional encoder representations from transformers (RoBERTa) text encoder to extract semantic information from the corresponding comments. Thirdly, the multimodal fusion layer fuses visual features, textual features and low-layer salient features in a cross-attention manner to extract the multimodal distinctive features. Experimental results show that the proposed AMFMER approach in this paper outperforms current mainstream methods in a unified aesthetic prediction task, especially in terms of the correlation between the objective model evaluation and subjective human evaluation.</div></div>\",\"PeriodicalId\":49521,\"journal\":{\"name\":\"Signal Processing-Image Communication\",\"volume\":\"138 \",\"pages\":\"Article 117320\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2025-04-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Signal Processing-Image Communication\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0923596525000670\",\"RegionNum\":3,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Signal Processing-Image Communication","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0923596525000670","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0

Abstract

Computational aesthetics aims to simulate the human visual perception process so that computers can evaluate aesthetic quality automatically. The topic has been widely studied, but existing work mostly focuses on image content while disregarding the high-level semantics contained in the associated image comments. In addition, most mainstream assessment methods rely on convolutional neural networks (CNNs) to learn discriminative features, and these lack the representational power and modeling capacity that multimodal assessment requires. Furthermore, many transformer-based approaches suffer from limited information flow between different parts of the model, and many multimodal fusion methods extract image features and text features separately and therefore cannot handle multimodal information well. Motivated by these issues, this paper proposes a novel evaluation model, A Multimodal Full transforMER (AMFMER), which requires no aesthetic style information and consists of three components: a visual stream, a textual stream, and a multimodal fusion layer. First, the visual stream exploits an improved Swin Transformer to extract distinctive layer features from the input image. Second, the textual stream builds on the robustly optimized bidirectional encoder representations from transformers (RoBERTa) text encoder to extract semantic information from the corresponding comments. Third, the multimodal fusion layer fuses the visual features, textual features, and low-layer salient features in a cross-attention manner to obtain multimodal distinctive features. Experimental results show that the proposed AMFMER approach outperforms current mainstream methods on a unified aesthetic prediction task, especially in terms of the correlation between objective model predictions and subjective human evaluations.
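The abstract describes a three-branch architecture: a Swin-Transformer-based visual stream, a RoBERTa-based textual stream, and a cross-attention fusion layer. The paper's exact layers, dimensions, pooling, and prediction head are not given here, so the following PyTorch sketch only illustrates that general structure; the module names, the `CrossAttentionFusion` helper, the mean pooling, and the score-distribution head are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a three-branch multimodal aesthetic model
# (visual stream + textual stream + cross-attention fusion), loosely
# following the structure described in the abstract. All dimensions,
# module names, and the fusion/head design are assumptions.
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse one modality's tokens with another's via cross-attention (hypothetical design)."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens: torch.Tensor, context_tokens: torch.Tensor) -> torch.Tensor:
        # Queries (e.g. visual tokens) attend to the other modality's tokens.
        fused, _ = self.attn(query_tokens, context_tokens, context_tokens)
        return self.norm(query_tokens + fused)


class AestheticModelSketch(nn.Module):
    def __init__(self, visual_encoder: nn.Module, text_encoder: nn.Module,
                 dim: int = 768, num_score_bins: int = 10):
        super().__init__()
        # visual_encoder: e.g. a Swin Transformer backbone returning tokens of shape (B, Nv, dim)
        # text_encoder:   e.g. a RoBERTa encoder returning tokens of shape (B, Nt, dim)
        self.visual_encoder = visual_encoder
        self.text_encoder = text_encoder
        self.fusion = CrossAttentionFusion(dim)
        # Predict a distribution over discrete score bins (a common choice for
        # unified aesthetic assessment; the paper's actual head may differ).
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, num_score_bins))

    def forward(self, image, input_ids, attention_mask):
        vis_tokens = self.visual_encoder(image)                    # (B, Nv, dim)
        txt_tokens = self.text_encoder(input_ids, attention_mask)  # (B, Nt, dim)
        fused = self.fusion(vis_tokens, txt_tokens)                # (B, Nv, dim)
        pooled = fused.mean(dim=1)                                 # simple token pooling
        return self.head(pooled).softmax(dim=-1)                   # score distribution
```

In practice the visual encoder would be a Swin Transformer backbone and the text encoder a pretrained RoBERTa model, each projected to a shared token dimension; the paper additionally injects low-layer salient features into the fusion, which this sketch omits for brevity.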
Source journal
Signal Processing: Image Communication (Engineering: Electrical & Electronic)
CiteScore: 8.40
Self-citation rate: 2.90%
Articles per year: 138
Review time: 5.2 months
Journal description: Signal Processing: Image Communication is an international journal for the development of the theory and practice of image communication. Its primary objectives are to present a forum for the advancement of theory and practice of image communication; to stimulate cross-fertilization between areas similar in nature which have traditionally been separated, for example, various aspects of visual communications and information systems; and to contribute to a rapid information exchange between the industrial and academic environments.
The editorial policy and the technical content of the journal are the responsibility of the Editor-in-Chief, the Area Editors and the Advisory Editors. The Journal is self-supporting from subscription income and contains a minimum amount of advertisements, which are subject to the prior approval of the Editor-in-Chief. The journal welcomes contributions from every country in the world.
Signal Processing: Image Communication publishes articles relating to aspects of the design, implementation and use of image communication systems. The journal features original research work, tutorial and review articles, and accounts of practical developments. Subjects of interest include image/video coding, 3D video representations and compression, 3D graphics and animation compression, HDTV and 3DTV systems, video adaptation, video over IP, peer-to-peer video networking, interactive visual communication, multi-user video conferencing, wireless video broadcasting and communication, visual surveillance, 2D and 3D image/video quality measures, pre/post processing, video restoration and super-resolution, multi-camera video analysis, motion analysis, content-based image/video indexing and retrieval, face and gesture processing, video synthesis, 2D and 3D image/video acquisition and display technologies, and architectures for image/video processing and communication.