Michael Jaworski, Jacob Balconi, Celeste Santivasci, Matthew Calamia
{"title":"人工智能评分的可行性:大型语言模型能否取代人类评分员?","authors":"Michael Jaworski, Jacob Balconi, Celeste Santivasci, Matthew Calamia","doi":"10.1080/13854046.2025.2552289","DOIUrl":null,"url":null,"abstract":"<p><p><b>Objective:</b> To assess the feasibility, accuracy, and reliability of using ChatGPT-4.5 (early-access), a large language model (LLM), for automated scoring of Brief International Cognitive Assessment for Multiple Sclerosis (BICAMS) protocols. Performance of ChatGPT-4.5 was compared against human raters on scoring record forms (i.e. word lists, numeric tables, and drawing responses). <b>Method:</b> Thirty-five deidentified BICAMS protocols, including the Symbol Digit Modalities Test (SDMT), California Verbal Learning Test-II (CVLT-II), and Brief Visuospatial Memory Test-Revised (BVMT-R), were independently scored by two trained human raters and ChatGPT-4.5. Scoring with ChatGPT-4.5 involved uploading protocol scans and structured prompts. Scoring discrepancies were resolved by a blinded third rater. Intraclass correlation coefficients (ICCs), paired samples <i>t</i>-tests, and descriptive statistics evaluated interrater reliability, accuracy, and speed. <b>Results:</b> Before public release of ChatGPT-4.5, strong interrater reliability was found between ChatGPT-4.5 and human raters on all total scores (e.g. CVLT-II ICC = 0.992; SDMT ICC = 1.000; BVMT-R ICC = 0.822-853), with minimal scoring discrepancies per test (CVLT = 1.05, SDMT = 0.05, BVMT-<i>R</i> = 1.05-1.19). ChatGPT-4.5 identified scoring errors overlooked by two human raters and completed scoring of each BICAMS protocol in under 9 min. After ChatGPT-4.5 was publicly released, reliability decreased notably (e.g. ICC = -0.046 for BVMT-R Trial 3), and average scoring discrepancies per test increased (e.g. SDMT = 6.79). <b>Conclusions:</b> ChatGPT-4.5 demonstrated comparable accuracy relative to human raters, though performance variability emerged after public release. With adequate computational resources and prompt/model optimization, LLMs may streamline neuropsychological assessment, enhancing clinical efficiency, and reducing human errors.</p>","PeriodicalId":55250,"journal":{"name":"Clinical Neuropsychologist","volume":" ","pages":"1-14"},"PeriodicalIF":2.7000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Feasibility of AI-powered assessment scoring: Can large language models replace human raters?\",\"authors\":\"Michael Jaworski, Jacob Balconi, Celeste Santivasci, Matthew Calamia\",\"doi\":\"10.1080/13854046.2025.2552289\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><b>Objective:</b> To assess the feasibility, accuracy, and reliability of using ChatGPT-4.5 (early-access), a large language model (LLM), for automated scoring of Brief International Cognitive Assessment for Multiple Sclerosis (BICAMS) protocols. Performance of ChatGPT-4.5 was compared against human raters on scoring record forms (i.e. word lists, numeric tables, and drawing responses). <b>Method:</b> Thirty-five deidentified BICAMS protocols, including the Symbol Digit Modalities Test (SDMT), California Verbal Learning Test-II (CVLT-II), and Brief Visuospatial Memory Test-Revised (BVMT-R), were independently scored by two trained human raters and ChatGPT-4.5. Scoring with ChatGPT-4.5 involved uploading protocol scans and structured prompts. Scoring discrepancies were resolved by a blinded third rater. 
Intraclass correlation coefficients (ICCs), paired samples <i>t</i>-tests, and descriptive statistics evaluated interrater reliability, accuracy, and speed. <b>Results:</b> Before public release of ChatGPT-4.5, strong interrater reliability was found between ChatGPT-4.5 and human raters on all total scores (e.g. CVLT-II ICC = 0.992; SDMT ICC = 1.000; BVMT-R ICC = 0.822-853), with minimal scoring discrepancies per test (CVLT = 1.05, SDMT = 0.05, BVMT-<i>R</i> = 1.05-1.19). ChatGPT-4.5 identified scoring errors overlooked by two human raters and completed scoring of each BICAMS protocol in under 9 min. After ChatGPT-4.5 was publicly released, reliability decreased notably (e.g. ICC = -0.046 for BVMT-R Trial 3), and average scoring discrepancies per test increased (e.g. SDMT = 6.79). <b>Conclusions:</b> ChatGPT-4.5 demonstrated comparable accuracy relative to human raters, though performance variability emerged after public release. With adequate computational resources and prompt/model optimization, LLMs may streamline neuropsychological assessment, enhancing clinical efficiency, and reducing human errors.</p>\",\"PeriodicalId\":55250,\"journal\":{\"name\":\"Clinical Neuropsychologist\",\"volume\":\" \",\"pages\":\"1-14\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2025-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Clinical Neuropsychologist\",\"FirstCategoryId\":\"102\",\"ListUrlMain\":\"https://doi.org/10.1080/13854046.2025.2552289\",\"RegionNum\":3,\"RegionCategory\":\"心理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"CLINICAL NEUROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical Neuropsychologist","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.1080/13854046.2025.2552289","RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
Feasibility of AI-powered assessment scoring: Can large language models replace human raters?
Objective: To assess the feasibility, accuracy, and reliability of using ChatGPT-4.5 (early access), a large language model (LLM), for automated scoring of Brief International Cognitive Assessment for Multiple Sclerosis (BICAMS) protocols. Performance of ChatGPT-4.5 was compared against human raters on scoring record forms (i.e., word lists, numeric tables, and drawing responses).

Method: Thirty-five deidentified BICAMS protocols, including the Symbol Digit Modalities Test (SDMT), California Verbal Learning Test-II (CVLT-II), and Brief Visuospatial Memory Test-Revised (BVMT-R), were independently scored by two trained human raters and ChatGPT-4.5. Scoring with ChatGPT-4.5 involved uploading protocol scans along with structured prompts. Scoring discrepancies were resolved by a blinded third rater. Intraclass correlation coefficients (ICCs), paired-samples t-tests, and descriptive statistics were used to evaluate interrater reliability, accuracy, and speed.

Results: Before the public release of ChatGPT-4.5, strong interrater reliability was found between ChatGPT-4.5 and human raters on all total scores (e.g., CVLT-II ICC = 0.992; SDMT ICC = 1.000; BVMT-R ICC = 0.822-0.853), with minimal average scoring discrepancies per test (CVLT-II = 1.05, SDMT = 0.05, BVMT-R = 1.05-1.19). ChatGPT-4.5 identified scoring errors overlooked by both human raters and completed scoring of each BICAMS protocol in under 9 minutes. After ChatGPT-4.5 was publicly released, reliability decreased notably (e.g., ICC = -0.046 for BVMT-R Trial 3), and average scoring discrepancies per test increased (e.g., SDMT = 6.79).

Conclusions: ChatGPT-4.5 demonstrated accuracy comparable to human raters, though performance variability emerged after public release. With adequate computational resources and prompt/model optimization, LLMs may streamline neuropsychological assessment, enhance clinical efficiency, and reduce human errors.
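For concreteness, the scan-plus-structured-prompt workflow described in the Method section could look roughly like the following Python sketch using the OpenAI SDK. The model identifier, prompt wording, and file name are assumptions for illustration; the study's actual structured prompts are not reproduced here.

```python
# Hypothetical sketch of the scan-plus-structured-prompt scoring workflow.
# The model name, prompt text, and file path are assumptions, not the
# authors' actual materials.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCORING_PROMPT = (
    "You are scoring a Symbol Digit Modalities Test (SDMT) record form. "
    "Count the number of correct symbol-digit substitutions and report "
    "the total as an integer."
)  # placeholder prompt; the study's structured prompts are not published here


def score_protocol_scan(image_path: str) -> str:
    """Send one scanned record form plus a structured prompt to the model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4.5-preview",  # assumed identifier for ChatGPT-4.5
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": SCORING_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# Example usage with a hypothetical scan file.
print(score_protocol_scan("sdmt_form_01.png"))
```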
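Likewise, the reliability analysis (ICCs and paired-samples t-tests) can be sketched with standard tools. The column names and toy scores below are illustrative rather than the study's data, and pingouin/scipy are assumed stand-ins for whatever statistical software the authors used.

```python
# A minimal sketch of the reliability analysis: ICCs between the model and a
# human rater, plus a paired-samples t-test. The toy data are illustrative.
import pandas as pd
import pingouin as pg
from scipy import stats

# Long-format toy data: one row per (protocol, rater) pair of SDMT totals.
scores = pd.DataFrame({
    "protocol": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "rater":    ["human", "gpt"] * 6,
    "sdmt":     [52, 52, 47, 48, 61, 61, 39, 40, 55, 55, 44, 45],
})

# Intraclass correlations (pingouin reports several ICC variants).
icc = pg.intraclass_corr(data=scores, targets="protocol",
                         raters="rater", ratings="sdmt")
print(icc[["Type", "ICC"]])

# Paired-samples t-test on the same protocols scored by both raters.
human = scores.loc[scores.rater == "human", "sdmt"].to_numpy()
gpt = scores.loc[scores.rater == "gpt", "sdmt"].to_numpy()
t, p = stats.ttest_rel(human, gpt)
print(f"t = {t:.2f}, p = {p:.3f}")
```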
Journal overview:
The Clinical Neuropsychologist (TCN) serves as the premier forum for (1) state-of-the-art, clinically relevant scientific research, (2) in-depth professional discussions of matters germane to evidence-based practice, and (3) clinical case studies in neuropsychology. Of particular interest are papers that can make definitive statements about a given topic (thereby having implications for the standards of clinical practice) and those with the potential to expand today's clinical frontiers. Research on all age groups, and on both clinical and normal populations, is considered.