Multi-Modality Multi-Attribute Contrastive Pre-Training for Image Aesthetics Computing

Yipo Huang, Leida Li, Pengfei Chen, Haoning Wu, Weisi Lin, Guangming Shi
{"title":"Multi-Modality Multi-Attribute Contrastive Pre-Training for Image Aesthetics Computing.","authors":"Yipo Huang, Leida Li, Pengfei Chen, Haoning Wu, Weisi Lin, Guangming Shi","doi":"10.1109/TPAMI.2024.3492259","DOIUrl":null,"url":null,"abstract":"<p><p>In the Image Aesthetics Computing (IAC) field, most prior methods leveraged the off-the-shelf backbones pre-trained on the large-scale ImageNet database. While these pre-trained backbones have achieved notable success, they often overemphasize object-level semantics and fail to capture the high-level concepts of image aesthetics, which may only achieve suboptimal performances. To tackle this long-neglected problem, we propose a multi-modality multi-attribute contrastive pre-training framework, targeting at constructing an alternative to ImageNet-based pre-training for IAC. Specifically, the proposed framework consists of two main aspects. (1) We build a multi-attribute image description database with human feedback, leveraging the competent image understanding capability of the multi-modality large language model to generate rich aesthetic descriptions. (2) To better adapt models to aesthetic computing tasks, we integrate the image-based visual features with the attribute-based text features, and map the integrated features into different embedding spaces, based on which the multi-attribute contrastive learning is proposed for obtaining more comprehensive aesthetic representation. To alleviate the distribution shift encountered when transitioning from the general visual domain to the aesthetic domain, we further propose a semantic affinity loss to restrain the content information and enhance model generalization. Extensive experiments demonstrate that the proposed framework sets new state-of-the-arts for IAC tasks. The code, database and pre-trained weights will be available at https://github.com/yipoh/AesNet.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TPAMI.2024.3492259","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In the field of Image Aesthetics Computing (IAC), most prior methods leverage off-the-shelf backbones pre-trained on the large-scale ImageNet database. While these pre-trained backbones have achieved notable success, they often overemphasize object-level semantics and fail to capture the high-level concepts of image aesthetics, which can lead to suboptimal performance. To tackle this long-neglected problem, we propose a multi-modality multi-attribute contrastive pre-training framework, aiming to construct an alternative to ImageNet-based pre-training for IAC. Specifically, the proposed framework consists of two main aspects. (1) We build a multi-attribute image description database with human feedback, leveraging the strong image understanding capability of multi-modality large language models to generate rich aesthetic descriptions. (2) To better adapt models to aesthetic computing tasks, we integrate the image-based visual features with the attribute-based text features and map the integrated features into different embedding spaces, on which multi-attribute contrastive learning is performed to obtain a more comprehensive aesthetic representation. To alleviate the distribution shift encountered when transitioning from the general visual domain to the aesthetic domain, we further propose a semantic affinity loss that restrains the content information and enhances model generalization. Extensive experiments demonstrate that the proposed framework sets a new state of the art for IAC tasks. The code, database, and pre-trained weights will be available at https://github.com/yipoh/AesNet.
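
The abstract describes mapping fused image-text features into separate per-attribute embedding spaces and applying contrastive learning in each. Below is a minimal sketch of one plausible form of this multi-attribute contrastive objective; the module names, dimensions, number of attributes, and the use of a symmetric InfoNCE loss are all assumptions for illustration, not the authors' exact architecture (see the repository above for the official code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAttributeContrastive(nn.Module):
    """Sketch: one projection head per aesthetic attribute (e.g., color,
    composition, lighting), each defining its own embedding space in which
    an image-text contrastive loss is computed."""

    def __init__(self, feat_dim=768, embed_dim=256, num_attributes=4, temperature=0.07):
        super().__init__()
        self.img_heads = nn.ModuleList(
            [nn.Linear(feat_dim, embed_dim) for _ in range(num_attributes)]
        )
        self.txt_heads = nn.ModuleList(
            [nn.Linear(feat_dim, embed_dim) for _ in range(num_attributes)]
        )
        self.temperature = temperature

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, feat_dim) visual features.
        # txt_feats: (num_attributes, B, feat_dim), one text feature per attribute.
        total_loss = 0.0
        for k, (ih, th) in enumerate(zip(self.img_heads, self.txt_heads)):
            v = F.normalize(ih(img_feats), dim=-1)     # (B, embed_dim)
            t = F.normalize(th(txt_feats[k]), dim=-1)  # (B, embed_dim)
            logits = v @ t.T / self.temperature        # (B, B) similarity matrix
            targets = torch.arange(v.size(0), device=v.device)
            # Symmetric image-to-text and text-to-image InfoNCE terms:
            # matched image/description pairs sit on the diagonal.
            loss = 0.5 * (F.cross_entropy(logits, targets)
                          + F.cross_entropy(logits.T, targets))
            total_loss = total_loss + loss
        return total_loss / len(self.img_heads)
```

Using separate heads (rather than one shared space) lets each attribute organize the embedding around its own notion of similarity, which is the "different embedding spaces" idea in the abstract.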
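The semantic affinity loss is only characterized in the abstract as "restraining the content information" under domain shift. One common realization of such a constraint, sketched below under that assumption, is to match the batch-wise feature affinity matrix of the aesthetic model against that of a frozen, semantics-oriented teacher encoder; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def semantic_affinity_loss(student_feats: torch.Tensor,
                           teacher_feats: torch.Tensor) -> torch.Tensor:
    """Hypothetical affinity-matching loss.

    student_feats: (B, D) features from the aesthetic model being trained.
    teacher_feats: (B, D) features from a frozen semantic encoder
                   (e.g., an ImageNet-pre-trained backbone) on the same batch.
    """
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats, dim=-1)
    # Pairwise cosine affinity matrices over the batch.
    affinity_s = s @ s.T  # (B, B)
    affinity_t = t @ t.T  # (B, B)
    # Penalize deviation of the student's semantic-similarity structure from
    # the teacher's, preserving content information while the model adapts
    # to the aesthetic domain.
    return F.mse_loss(affinity_s, affinity_t)
```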
