From Form(s) to Meaning: Probing the Semantic Depths of Language Models Using Multisense Consistency

IF 9.3 · Zone 2, Computer Science

Computational Linguistics Pub Date : 2024-07-30 DOI:10.1162/coli_a_00529

Xenia Ohmer, Elia Bruni, Dieuwke Hupkes

Abstract

The staggering pace with which the capabilities of large language models (LLMs) are increasing, as measured by a range of commonly used natural language understanding (NLU) benchmarks, raises many questions regarding what “understanding” means for a language model and how it compares to human understanding. This is especially true since many LLMs are exclusively trained on text, casting doubt on whether their stellar benchmark performances are reflective of a true understanding of the problems represented by these benchmarks, or whether LLMs simply excel at uttering textual forms that correlate with what someone who understands the problem would say. In this philosophically inspired work, we aim to create some separation between form and meaning, with a series of tests that leverage the idea that world understanding should be consistent across presentational modes (inspired by Fregean senses) of the same meaning. Specifically, we focus on consistency across languages as well as paraphrases. Taking GPT-3.5 as our object of study, we evaluate multisense consistency across five different languages and various tasks. We start the evaluation in a controlled setting, asking the model for simple facts, and then proceed with an evaluation on four popular NLU benchmarks. We find that the model’s multisense consistency is lacking and run several follow-up analyses to verify that this lack of consistency is due to a sense-dependent task understanding. We conclude that, in this respect, the understanding of LLMs is still quite far from being consistent and human-like, and deliberate on how this impacts their utility in the context of learning about human language and understanding.
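To make the notion of consistency across senses concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' metric or code: it assumes the model's answers for each sense (for example, an English and a German version of the same items) have already been collected and mapped to a shared label space, and it scores consistency as the fraction of items on which the two senses receive the same answer. The function names and the toy answers are hypothetical.

```python
from typing import Sequence


def consistency(answers_a: Sequence[str], answers_b: Sequence[str]) -> float:
    """Fraction of items on which two senses yield the same answer.

    Assumption: both sequences are aligned item by item and use the
    same label space (e.g. translated answers already normalized).
    """
    if len(answers_a) != len(answers_b):
        raise ValueError("both senses must cover the same items")
    if not answers_a:
        raise ValueError("need at least one item")
    matches = sum(a == b for a, b in zip(answers_a, answers_b))
    return matches / len(answers_a)


def accuracy(answers: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of answers that match the gold labels."""
    return consistency(answers, gold)


# Toy, made-up answers for four items (not data from the paper).
gold = ["yes", "no", "yes", "no"]
english = ["yes", "no", "no", "no"]    # answers to the English sense
german = ["yes", "yes", "yes", "no"]   # answers to the German sense

print(accuracy(english, gold))       # accuracy on the English sense
print(accuracy(german, gold))        # accuracy on the German sense
print(consistency(english, german))  # agreement between the two senses
```

In this toy case both senses reach the same accuracy while agreeing on only half of the items, which is why a consistency score of this kind carries information that per-language accuracy alone does not.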

Source journal

Computational Linguistics (Computer Science: Artificial Intelligence)

Self-citation rate: 0.00%

Journal description: Computational Linguistics is the longest-running publication devoted exclusively to the computational and mathematical properties of language and the design and analysis of natural language processing systems. This highly regarded quarterly offers university and industry linguists, computational linguists, artificial intelligence and machine learning investigators, cognitive scientists, speech specialists, and philosophers the latest information about the computational aspects of all the facets of research on language.