Do Multimodal Large Language Models and Humans Ground Language Similarly?

IF 9.3 · Zone 2, Computer Science

Computational Linguistics · Pub Date: 2024-07-30 · DOI: 10.1162/coli_a_00531

Cameron Jones, Benjamin Bergen, Sean Trott

Abstract

Large Language Models (LLMs) have been criticized for failing to connect linguistic meaning to the world—for failing to solve the “symbol grounding problem.” Multimodal Large Language Models (MLLMs) offer a potential solution to this challenge by combining linguistic representations and processing with other modalities. However, much is still unknown about exactly how and to what degree MLLMs integrate their distinct modalities—and whether the way they do so mirrors the mechanisms believed to underpin grounding in humans. In humans, it has been hypothesized that linguistic meaning is grounded through “embodied simulation,” the activation of sensorimotor and affective representations reflecting described experiences. Across four pre-registered studies, we adapt experimental techniques originally developed to investigate embodied simulation in human comprehenders to ask whether MLLMs are sensitive to sensorimotor features that are implied but not explicit in descriptions of an event. In Experiment 1, we find sensitivity to some features (color and shape) but not others (size, orientation, and volume). In Experiment 2, we identify likely bottlenecks to explain an MLLM’s lack of sensitivity. In Experiment 3, we find that despite sensitivity to implicit sensorimotor features, MLLMs cannot fully account for human behavior on the same task. Finally, in Experiment 4, we compare the psychometric predictive power of different MLLM architectures and find that ViLT, a single-stream architecture, is more predictive of human responses to one sensorimotor feature (shape) than CLIP, a dual-encoder architecture—despite being trained on orders of magnitude less data. These results reveal strengths and limitations in the ability of current MLLMs to integrate language with other modalities, and also shed light on the likely mechanisms underlying human language comprehension.
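The match-versus-mismatch logic that the abstract describes can be illustrated with a minimal sketch. This is not the authors' code: the vectors below are made-up toy embeddings, and `cosine` and `match_effect` are hypothetical helper names introduced only for illustration. The assumption is that a model counts as "sensitive" to an implied feature when a sentence is, on average, more similar to the image that matches the implied feature than to the image that mismatches it. In a real analysis the vectors would come from an MLLM's text and image encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def match_effect(items):
    """Mean similarity advantage of matching over mismatching images.

    Each item is a tuple (sentence_vec, matching_image_vec, mismatching_image_vec).
    A positive value suggests sensitivity to the implied feature.
    """
    diffs = [cosine(s, match) - cosine(s, mismatch) for s, match, mismatch in items]
    return sum(diffs) / len(diffs)


# Toy data (assumed, not from the paper): two items with 2-d embeddings.
toy_items = [
    ([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]),
    ([0.0, 1.0], [0.2, 0.8], [0.8, 0.2]),
]

effect = match_effect(toy_items)
print(f"mean match advantage: {effect:.3f}")
```

Whether such a difference is reliable would be a statistical question; this sketch only computes the raw mean difference and makes no claim about the analyses used in the paper.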

Source journal

Computational Linguistics Computer Science-Artificial Intelligence

Self-citation rate

0.00%

About the journal: Computational Linguistics is the longest-running publication devoted exclusively to the computational and mathematical properties of language and the design and analysis of natural language processing systems. This highly regarded quarterly offers university and industry linguists, computational linguists, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and philosophers the latest information about the computational aspects of all the facets of research on language.