GalleryGPT: Analyzing Paintings with Large Multimodal Models

arXiv - CS - Multimedia, Pub Date: 2024-08-01, DOI: arxiv-2408.00491

Yi Bin, Wenhao Shi, Yujuan Ding, Zhiqiang Hu, Zheng Wang, Yang Yang, See-Kiong Ng, Heng Tao Shen

Abstract

Artwork analysis is an important and fundamental skill for art appreciation, one that can enrich personal aesthetic sensibility and foster critical thinking. Understanding artworks is challenging because of their subjective nature, diverse interpretations, and complex visual elements, and it requires expertise in art history, cultural background, and aesthetic theory. However, limited by data collection and model capability, previous work on automatically analyzing artworks has mainly focused on classification, retrieval, and other simple tasks, which fall far short of the goal of AI. To advance research in this area, this paper goes a step further and composes comprehensive analyses, inspired by the remarkable perception and generation abilities of large multimodal models. Specifically, we first propose the task of composing paragraph-level analyses of artworks (paintings, in this paper) that focus only on visual characteristics, in order to formulate a more comprehensive understanding of artworks. To support research on formal analysis, we collect a large dataset, PaintingForm, with about 19k painting images and 50k analysis paragraphs. We further introduce a large multimodal model for composing painting analyses, dubbed GalleryGPT, which is based on the LLaVA architecture, slightly modified and fine-tuned on our collected data. We conduct formal analysis generation and zero-shot experiments across several datasets to assess the capacity of our model. The results show remarkable performance improvements over powerful baseline LMMs, demonstrating the model's strong art analysis and generalization abilities. The code and model are available at: https://github.com/steven640pixel/GalleryGPT.
