MIGF-Net: Multimodal interaction-guided fusion network for image aesthetics assessment

IF 7.6 | Zone 1 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Pattern Recognition | Pub Date: 2025-09-05 | DOI: 10.1016/j.patcog.2025.112401

Yun Liu, Zhipeng Wen, Leida Li, Peiguang Jing, Daoxin Fan

Pattern Recognition, Volume 172, Article 112401. Full text: https://www.sciencedirect.com/science/article/pii/S0031320325010623

Citations: 0

Abstract

With the growth of social media, people routinely post images together with comments, which provides rich visual and textual semantic information for image aesthetics assessment (IAA). However, most previous works either extract unimodal aesthetic features from the image alone, because comments are difficult to obtain, or combine multimodal information while ignoring the interactive relationship between image and comment, which limits overall performance. To address these problems, we propose a Multimodal Interaction-Guided Fusion Network (MIGF-Net) for image aesthetics assessment based on both image and comment semantic information, which not only addresses the challenge of comment generation but also provides interactive multimodal feature information. Specifically, considering the coupling mechanism of the image theme, we construct a visual semantic fusion module that extracts visual semantic features from visual attributes and theme features. Then, a textual semantic feature extractor is designed to mine the semantic information hidden in comments, which not only addresses the issue of missing comments but also effectively complements the visual semantic features. Furthermore, we establish a Dual-Stream Interaction-Guided Fusion module to fuse the semantic features of images and comments, fully exploring the interactive relationship between images and comments in the human brain’s perception mechanism. Experimental results on two public image aesthetics evaluation datasets demonstrate that our model outperforms current state-of-the-art methods. Our code will be released at https://github.com/wenzhipeng123/MIGF-Net.
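
The abstract does not say how the Dual-Stream Interaction-Guided Fusion module works internally. As a rough illustration of what fusing two modality streams under mutual guidance can look like, the sketch below uses generic bidirectional scaled dot-product cross-attention with a residual connection. This is an assumed stand-in technique, not the authors' method; the names `cross_attend`, `mean_pool`, and `dual_stream_fuse`, the token counts, and the feature dimension are all hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention: every query attends over keys_values."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * kv[j] for w, kv in zip(weights, keys_values))
             for j in range(dim)]
        )
    return attended


def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    count = len(tokens)
    return [sum(t[j] for t in tokens) / count for j in range(len(tokens[0]))]


def dual_stream_fuse(image_tokens, text_tokens):
    """Hypothetical fusion (assumption, not from the paper):
    the image stream is guided by text, the text stream by the image,
    each with a residual connection, then both are pooled and concatenated."""
    img_guided = cross_attend(image_tokens, text_tokens)
    txt_guided = cross_attend(text_tokens, image_tokens)
    img_stream = [[a + b for a, b in zip(x, y)]
                  for x, y in zip(image_tokens, img_guided)]
    txt_stream = [[a + b for a, b in zip(x, y)]
                  for x, y in zip(text_tokens, txt_guided)]
    return mean_pool(img_stream) + mean_pool(txt_stream)


# Toy inputs: 4 image tokens and 6 comment tokens, each of dimension 8.
rng = random.Random(0)
image_tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(4)]
text_tokens = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(6)]

fused = dual_stream_fuse(image_tokens, text_tokens)
print(len(fused))  # 16: pooled image stream (8) + pooled text stream (8)
```

In a real model the fused vector would feed a regression or distribution head that predicts the aesthetic score, and the attention would use learned projections; both are omitted here to keep the sketch small.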


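Source journal

Pattern Recognition (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 14.40

Self-citation rate: 16.20%

Articles published: 683

Review time: 5.6 months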

Journal description: The field of Pattern Recognition is both mature and rapidly evolving, playing a crucial role in various related fields such as computer vision, image processing, text analysis, and neural networks. It closely intersects with machine learning and is being applied in emerging areas like biometrics, bioinformatics, multimedia data analysis, and data science. The journal Pattern Recognition, established half a century ago during the early days of computer science, has since grown significantly in scope and influence.