Benchmarking VLMs' Reasoning About Persuasive Atypical Images

arXiv - CS - Multimedia. Pub Date: 2024-09-16. DOI: arxiv-2409.10719

Sina Malakouti, Aysan Aghazadeh, Ashmit Khandelwal, Adriana Kovashka


Citations: 0

Abstract


Benchmarking VLMs' Reasoning About Persuasive Atypical Images

Vision language models (VLMs) have shown strong zero-shot generalization across various tasks, especially when integrated with large language models (LLMs). However, their ability to comprehend rhetorical and persuasive visual media, such as advertisements, remains understudied. Ads often employ atypical imagery, using surprising object juxtapositions to convey shared properties. For example, Fig. 1(e) shows a beer with a feather-like texture. Deducing that this atypical representation signifies the beer's lightness requires advanced reasoning. We introduce three novel tasks, Multi-label Atypicality Classification, Atypicality Statement Retrieval, and Atypical Object Recognition, to benchmark VLMs' understanding of atypicality in persuasive images. We evaluate how well VLMs use atypicality to infer an ad's message and test their reasoning abilities by employing semantically challenging negatives. Finally, we pioneer atypicality-aware verbalization by extracting comprehensive image descriptions sensitive to atypical elements. Our findings reveal that: (1) VLMs lack advanced reasoning capabilities compared to LLMs; (2) simple, effective strategies can extract atypicality-aware information, leading to comprehensive image verbalization; (3) atypicality aids persuasive advertisement understanding. Code and data will be made available.

