Do large language models and humans have similar behaviors in causal inference with script knowledge?

arXiv (Cornell University) Pub Date : 2023-11-13 DOI:10.48550/arxiv.2311.07311

Hong, Xudong, Ryzhova, Margarita, Biondi, Daniel Adrian, Demberg, Vera

Abstract

Recently, large pre-trained language models (LLMs) have demonstrated superior language understanding abilities, including zero-shot causal reasoning. However, it is unclear to what extent their capabilities are similar to human ones. Here we study the processing of an event $B$ in a script-based story, where $B$ causally depends on a previous event $A$. In our manipulation, event $A$ is stated, negated, or omitted in an earlier section of the text. We first conducted a self-paced reading experiment, which showed that humans exhibit significantly longer reading times when causal conflicts exist ($\neg A \rightarrow B$) than under logical conditions ($A \rightarrow B$). However, reading times remain similar when cause $A$ is not explicitly mentioned, indicating that humans can easily infer event $B$ from their script knowledge. We then tested a variety of LLMs on the same data to check to what extent the models replicate human behavior. Our experiments show that (1) only recent LLMs, like GPT-3 or Vicuna, correlate with human behavior in the $\neg A \rightarrow B$ condition, and (2) despite this correlation, all models still fail to predict that $nil \rightarrow B$ is less surprising than $\neg A \rightarrow B$, indicating that LLMs still have difficulties integrating script knowledge. Our code and collected data set are available at https://github.com/tony-hong/causal-script.
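
The comparison across the three conditions can be illustrated with a minimal sketch. It assumes that a model's "surprise" at event $B$ is measured as surprisal, the negative log-probability of the tokens describing $B$. The abstract does not specify the exact metric, so this is an assumption. The probabilities, condition labels, and function names below are hypothetical placeholders, not data or code from the paper.

```python
import math

def surprisal_bits(token_probs):
    """Total surprisal in bits of a text region, given each token's
    conditional probability under a language model."""
    return sum(-math.log2(p) for p in token_probs)

# Hypothetical placeholder probabilities for the tokens of event B
# in each condition. These are NOT results from the paper.
toy_probs = {
    "A->B": [0.5, 0.25],        # cause A stated
    "nil->B": [0.5, 0.125],     # cause A omitted
    "notA->B": [0.25, 0.0625],  # cause A negated (causal conflict)
}

surprisal = {cond: surprisal_bits(p) for cond, p in toy_probs.items()}

def shows_conflict_effect(s):
    """True if B is more surprising after a negated cause than after a stated one."""
    return s["notA->B"] > s["A->B"]

def ranks_nil_below_conflict(s):
    """True if B is less surprising when A is omitted than when A is negated."""
    return s["nil->B"] < s["notA->B"]

for cond, bits in surprisal.items():
    print(f"{cond}: {bits:.1f} bits")
print("conflict effect:", shows_conflict_effect(surprisal))
print("nil below conflict:", ranks_nil_below_conflict(surprisal))
```

In this toy setup both checks pass. According to the abstract, the second check is the one that all tested models fail on the real data.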
