Age against the machine—susceptibility of large language models to cognitive impairment: cross sectional analysis

The BMJ. Pub Date: 2024-12-20. DOI: 10.1136/bmj-2024-081948

Roy Dayan, Benjamin Uliel, Gal Koplewitz



Abstract

Objective To evaluate the cognitive abilities of the leading large language models and identify their susceptibility to cognitive impairment, using the Montreal Cognitive Assessment (MoCA) and additional tests.

Design Cross sectional analysis.

Setting Online interaction with large language models via text based prompts.

Participants Publicly available large language models, or “chatbots”: ChatGPT versions 4 and 4o (developed by OpenAI), Claude 3.5 “Sonnet” (developed by Anthropic), and Gemini versions 1 and 1.5 (developed by Alphabet).

Assessments The MoCA test (version 8.1) was administered to the leading large language models with instructions identical to those given to human patients. Scoring followed official guidelines and was evaluated by a practising neurologist. Additional assessments included the Navon figure, cookie theft picture, Poppelreuter figure, and Stroop test.

Main outcome measures MoCA scores, performance in visuospatial/executive tasks, and Stroop test results.

Results ChatGPT 4o achieved the highest score on the MoCA test (26/30), followed by ChatGPT 4 and Claude (25/30), with Gemini 1.0 scoring lowest (16/30). All large language models showed poor performance in visuospatial/executive tasks. Gemini models failed at the delayed recall task. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test.

Conclusions With the exception of ChatGPT 4o, almost all large language models subjected to the MoCA test showed signs of mild cognitive impairment. Moreover, as in humans, age is a key determinant of cognitive decline: “older” chatbots, like older patients, tend to perform worse on the MoCA test. These findings challenge the assumption that artificial intelligence will soon replace human doctors, as the cognitive impairment evident in leading chatbots may affect their reliability in medical diagnostics and undermine patients’ confidence.

No additional data available.

