A Causality-Aware Paradigm for Evaluating Creativity of Multimodal Large Language Models

Zhongzhan Huang;Shanshan Zhong;Pan Zhou;Shanghua Gao;Marinka Zitnik;Liang Lin
{"title":"A Causality-Aware Paradigm for Evaluating Creativity of Multimodal Large Language Models","authors":"Zhongzhan Huang;Shanshan Zhong;Pan Zhou;Shanghua Gao;Marinka Zitnik;Liang Lin","doi":"10.1109/TPAMI.2025.3539433","DOIUrl":null,"url":null,"abstract":"Recently, numerous benchmarks have been developed to evaluate the logical reasoning abilities of large language models (LLMs). However, assessing the equally important creative capabilities of LLMs is challenging due to the subjective, diverse, and data-scarce nature of creativity, especially in multimodal scenarios. In this paper, we consider the comprehensive pipeline for evaluating the creativity of multimodal LLMs, with a focus on suitable evaluation platforms and methodologies. First, we find the Oogiri game—a creativity-driven task requiring humor, associative thinking, and the ability to produce unexpected responses to text, images, or both. This game aligns well with the input-output structure of modern multimodal LLMs and benefits from a rich repository of high-quality, human-annotated creative responses, making it an ideal platform for studying LLM creativity. Next, beyond using the Oogiri game for standard evaluations like ranking and selection, we propose LoTbench, an interactive, causality-aware evaluation framework, to further address some intrinsic risks in standard evaluations, such as information leakage and limited interpretability. The proposed LoTbench not only quantifies LLM creativity more effectively but also visualizes the underlying creative thought processes. Our results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable. Furthermore, we observe a strong correlation between results from the multimodal cognition benchmark MMMU and LoTbench, but only a weak connection with traditional creativity metrics. This suggests that LoTbench better aligns with human cognitive theories, highlighting cognition as a critical foundation in the early stages of creativity and enabling the bridging of diverse concepts. Project Page.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 5","pages":"3830-3846"},"PeriodicalIF":0.0000,"publicationDate":"2025-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10876763/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Recently, numerous benchmarks have been developed to evaluate the logical reasoning abilities of large language models (LLMs). However, assessing the equally important creative capabilities of LLMs is challenging due to the subjective, diverse, and data-scarce nature of creativity, especially in multimodal scenarios. In this paper, we consider the comprehensive pipeline for evaluating the creativity of multimodal LLMs, with a focus on suitable evaluation platforms and methodologies. First, we identify the Oogiri game: a creativity-driven task requiring humor, associative thinking, and the ability to produce unexpected responses to text, images, or both. This game aligns well with the input-output structure of modern multimodal LLMs and benefits from a rich repository of high-quality, human-annotated creative responses, making it an ideal platform for studying LLM creativity. Next, beyond using the Oogiri game for standard evaluations such as ranking and selection, we propose LoTbench, an interactive, causality-aware evaluation framework that addresses intrinsic risks in standard evaluations, such as information leakage and limited interpretability. LoTbench not only quantifies LLM creativity more effectively but also visualizes the underlying creative thought processes. Our results show that while most LLMs exhibit constrained creativity, the performance gap between LLMs and humans is not insurmountable. Furthermore, we observe a strong correlation between results from the multimodal cognition benchmark MMMU and LoTbench, but only a weak connection with traditional creativity metrics. This suggests that LoTbench aligns better with human cognitive theories, highlighting cognition as a critical foundation in the early stages of creativity and enabling the bridging of diverse concepts.
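To make the "ranking and selection" style of standard evaluation concrete, the sketch below shows one plausible harness for a selection task on Oogiri-style data: the model must pick the human-annotated winning response from a shuffled set of candidates. This is a minimal illustration under stated assumptions, not the paper's actual protocol; `model_choose`, the item layout, and the scoring rule are all hypothetical.

```python
import random

def selection_accuracy(model_choose, items, seed=0):
    """Fraction of items where the model picks the human-preferred response.

    Assumptions (hypothetical, not from the paper):
    - `model_choose(prompt, candidates)` returns the index of the candidate
      the model judges most creative/humorous.
    - Each item has an Oogiri-style `prompt`, one human-annotated winning
      response (`human_best`), and a list of `distractors`.
    """
    rng = random.Random(seed)  # fixed seed for reproducible shuffles
    correct = 0
    for item in items:
        candidates = [item["human_best"]] + item["distractors"]
        rng.shuffle(candidates)  # shuffle to avoid position bias
        answer = candidates.index(item["human_best"])
        if model_choose(item["prompt"], candidates) == answer:
            correct += 1
    return correct / len(items)
```

A harness like this makes the abstract's "information leakage" concern visible: if the human-annotated responses were seen during training, a model can score well by recall rather than creativity, which is one motivation for the interactive LoTbench design.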
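The reported relationship between benchmarks can be checked with a simple rank-correlation computation. The sketch below uses invented placeholder scores (not results from the paper) to show how a Spearman correlation between LoTbench and MMMU scores across models might be computed.

```python
from scipy.stats import spearmanr

# Hypothetical per-model scores; placeholders for illustration only,
# not results reported in the paper.
lotbench_scores = {"model_a": 0.42, "model_b": 0.35, "model_c": 0.58,
                   "model_d": 0.29, "model_e": 0.51}
mmmu_scores     = {"model_a": 51.2, "model_b": 47.9, "model_c": 62.3,
                   "model_d": 44.0, "model_e": 57.6}

# Align the two score lists by model name before correlating.
models = sorted(lotbench_scores)
rho, p_value = spearmanr([lotbench_scores[m] for m in models],
                         [mmmu_scores[m] for m in models])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

Rank correlation is a natural choice here because the two benchmarks report scores on different scales; only the relative ordering of models matters.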