Yi Bin, Wenhao Shi, Yujuan Ding, Zhiqiang Hu, Zheng Wang, Yang Yang, See-Kiong Ng, Heng Tao Shen
{"title":"GalleryGPT:用大型多模态模型分析绘画作品","authors":"Yi Bin, Wenhao Shi, Yujuan Ding, Zhiqiang Hu, Zheng Wang, Yang Yang, See-Kiong Ng, Heng Tao Shen","doi":"arxiv-2408.00491","DOIUrl":null,"url":null,"abstract":"Artwork analysis is important and fundamental skill for art appreciation,\nwhich could enrich personal aesthetic sensibility and facilitate the critical\nthinking ability. Understanding artworks is challenging due to its subjective\nnature, diverse interpretations, and complex visual elements, requiring\nexpertise in art history, cultural background, and aesthetic theory. However,\nlimited by the data collection and model ability, previous works for\nautomatically analyzing artworks mainly focus on classification, retrieval, and\nother simple tasks, which is far from the goal of AI. To facilitate the\nresearch progress, in this paper, we step further to compose comprehensive\nanalysis inspired by the remarkable perception and generation ability of large\nmultimodal models. Specifically, we first propose a task of composing paragraph\nanalysis for artworks, i.e., painting in this paper, only focusing on visual\ncharacteristics to formulate more comprehensive understanding of artworks. To\nsupport the research on formal analysis, we collect a large dataset\nPaintingForm, with about 19k painting images and 50k analysis paragraphs. We\nfurther introduce a superior large multimodal model for painting analysis\ncomposing, dubbed GalleryGPT, which is slightly modified and fine-tuned based\non LLaVA architecture leveraging our collected data. We conduct formal analysis\ngeneration and zero-shot experiments across several datasets to assess the\ncapacity of our model. The results show remarkable performance improvements\ncomparing with powerful baseline LMMs, demonstrating its superb ability of art\nanalysis and generalization. \\textcolor{blue}{The codes and model are available\nat: https://github.com/steven640pixel/GalleryGPT.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"186 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"GalleryGPT: Analyzing Paintings with Large Multimodal Models\",\"authors\":\"Yi Bin, Wenhao Shi, Yujuan Ding, Zhiqiang Hu, Zheng Wang, Yang Yang, See-Kiong Ng, Heng Tao Shen\",\"doi\":\"arxiv-2408.00491\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Artwork analysis is important and fundamental skill for art appreciation,\\nwhich could enrich personal aesthetic sensibility and facilitate the critical\\nthinking ability. Understanding artworks is challenging due to its subjective\\nnature, diverse interpretations, and complex visual elements, requiring\\nexpertise in art history, cultural background, and aesthetic theory. However,\\nlimited by the data collection and model ability, previous works for\\nautomatically analyzing artworks mainly focus on classification, retrieval, and\\nother simple tasks, which is far from the goal of AI. To facilitate the\\nresearch progress, in this paper, we step further to compose comprehensive\\nanalysis inspired by the remarkable perception and generation ability of large\\nmultimodal models. Specifically, we first propose a task of composing paragraph\\nanalysis for artworks, i.e., painting in this paper, only focusing on visual\\ncharacteristics to formulate more comprehensive understanding of artworks. To\\nsupport the research on formal analysis, we collect a large dataset\\nPaintingForm, with about 19k painting images and 50k analysis paragraphs. We\\nfurther introduce a superior large multimodal model for painting analysis\\ncomposing, dubbed GalleryGPT, which is slightly modified and fine-tuned based\\non LLaVA architecture leveraging our collected data. We conduct formal analysis\\ngeneration and zero-shot experiments across several datasets to assess the\\ncapacity of our model. The results show remarkable performance improvements\\ncomparing with powerful baseline LMMs, demonstrating its superb ability of art\\nanalysis and generalization. \\\\textcolor{blue}{The codes and model are available\\nat: https://github.com/steven640pixel/GalleryGPT.\",\"PeriodicalId\":501480,\"journal\":{\"name\":\"arXiv - CS - Multimedia\",\"volume\":\"186 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.00491\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.00491","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
GalleryGPT: Analyzing Paintings with Large Multimodal Models
Artwork analysis is important and fundamental skill for art appreciation,
which could enrich personal aesthetic sensibility and facilitate the critical
thinking ability. Understanding artworks is challenging due to its subjective
nature, diverse interpretations, and complex visual elements, requiring
expertise in art history, cultural background, and aesthetic theory. However,
limited by the data collection and model ability, previous works for
automatically analyzing artworks mainly focus on classification, retrieval, and
other simple tasks, which is far from the goal of AI. To facilitate the
research progress, in this paper, we step further to compose comprehensive
analysis inspired by the remarkable perception and generation ability of large
multimodal models. Specifically, we first propose a task of composing paragraph
analysis for artworks, i.e., painting in this paper, only focusing on visual
characteristics to formulate more comprehensive understanding of artworks. To
support the research on formal analysis, we collect a large dataset
PaintingForm, with about 19k painting images and 50k analysis paragraphs. We
further introduce a superior large multimodal model for painting analysis
composing, dubbed GalleryGPT, which is slightly modified and fine-tuned based
on LLaVA architecture leveraging our collected data. We conduct formal analysis
generation and zero-shot experiments across several datasets to assess the
capacity of our model. The results show remarkable performance improvements
comparing with powerful baseline LMMs, demonstrating its superb ability of art
analysis and generalization. \textcolor{blue}{The codes and model are available
at: https://github.com/steven640pixel/GalleryGPT.