Michael Jaworski, Jacob Balconi, Celeste Santivasci, Matthew Calamia
{"title":"人工智能评分的可行性:大型语言模型能否取代人类评分员?","authors":"Michael Jaworski, Jacob Balconi, Celeste Santivasci, Matthew Calamia","doi":"10.1080/13854046.2025.2552289","DOIUrl":null,"url":null,"abstract":"<p><p><b>Objective:</b> To assess the feasibility, accuracy, and reliability of using ChatGPT-4.5 (early-access), a large language model (LLM), for automated scoring of Brief International Cognitive Assessment for Multiple Sclerosis (BICAMS) protocols. Performance of ChatGPT-4.5 was compared against human raters on scoring record forms (i.e. word lists, numeric tables, and drawing responses). <b>Method:</b> Thirty-five deidentified BICAMS protocols, including the Symbol Digit Modalities Test (SDMT), California Verbal Learning Test-II (CVLT-II), and Brief Visuospatial Memory Test-Revised (BVMT-R), were independently scored by two trained human raters and ChatGPT-4.5. Scoring with ChatGPT-4.5 involved uploading protocol scans and structured prompts. Scoring discrepancies were resolved by a blinded third rater. Intraclass correlation coefficients (ICCs), paired samples <i>t</i>-tests, and descriptive statistics evaluated interrater reliability, accuracy, and speed. <b>Results:</b> Before public release of ChatGPT-4.5, strong interrater reliability was found between ChatGPT-4.5 and human raters on all total scores (e.g. CVLT-II ICC = 0.992; SDMT ICC = 1.000; BVMT-R ICC = 0.822-853), with minimal scoring discrepancies per test (CVLT = 1.05, SDMT = 0.05, BVMT-<i>R</i> = 1.05-1.19). ChatGPT-4.5 identified scoring errors overlooked by two human raters and completed scoring of each BICAMS protocol in under 9 min. After ChatGPT-4.5 was publicly released, reliability decreased notably (e.g. ICC = -0.046 for BVMT-R Trial 3), and average scoring discrepancies per test increased (e.g. SDMT = 6.79). <b>Conclusions:</b> ChatGPT-4.5 demonstrated comparable accuracy relative to human raters, though performance variability emerged after public release. With adequate computational resources and prompt/model optimization, LLMs may streamline neuropsychological assessment, enhancing clinical efficiency, and reducing human errors.</p>","PeriodicalId":55250,"journal":{"name":"Clinical Neuropsychologist","volume":" ","pages":"1-14"},"PeriodicalIF":2.7000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Feasibility of AI-powered assessment scoring: Can large language models replace human raters?\",\"authors\":\"Michael Jaworski, Jacob Balconi, Celeste Santivasci, Matthew Calamia\",\"doi\":\"10.1080/13854046.2025.2552289\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><b>Objective:</b> To assess the feasibility, accuracy, and reliability of using ChatGPT-4.5 (early-access), a large language model (LLM), for automated scoring of Brief International Cognitive Assessment for Multiple Sclerosis (BICAMS) protocols. Performance of ChatGPT-4.5 was compared against human raters on scoring record forms (i.e. word lists, numeric tables, and drawing responses). <b>Method:</b> Thirty-five deidentified BICAMS protocols, including the Symbol Digit Modalities Test (SDMT), California Verbal Learning Test-II (CVLT-II), and Brief Visuospatial Memory Test-Revised (BVMT-R), were independently scored by two trained human raters and ChatGPT-4.5. Scoring with ChatGPT-4.5 involved uploading protocol scans and structured prompts. Scoring discrepancies were resolved by a blinded third rater. 
Intraclass correlation coefficients (ICCs), paired samples <i>t</i>-tests, and descriptive statistics evaluated interrater reliability, accuracy, and speed. <b>Results:</b> Before public release of ChatGPT-4.5, strong interrater reliability was found between ChatGPT-4.5 and human raters on all total scores (e.g. CVLT-II ICC = 0.992; SDMT ICC = 1.000; BVMT-R ICC = 0.822-853), with minimal scoring discrepancies per test (CVLT = 1.05, SDMT = 0.05, BVMT-<i>R</i> = 1.05-1.19). ChatGPT-4.5 identified scoring errors overlooked by two human raters and completed scoring of each BICAMS protocol in under 9 min. After ChatGPT-4.5 was publicly released, reliability decreased notably (e.g. ICC = -0.046 for BVMT-R Trial 3), and average scoring discrepancies per test increased (e.g. SDMT = 6.79). <b>Conclusions:</b> ChatGPT-4.5 demonstrated comparable accuracy relative to human raters, though performance variability emerged after public release. With adequate computational resources and prompt/model optimization, LLMs may streamline neuropsychological assessment, enhancing clinical efficiency, and reducing human errors.</p>\",\"PeriodicalId\":55250,\"journal\":{\"name\":\"Clinical Neuropsychologist\",\"volume\":\" \",\"pages\":\"1-14\"},\"PeriodicalIF\":2.7000,\"publicationDate\":\"2025-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Clinical Neuropsychologist\",\"FirstCategoryId\":\"102\",\"ListUrlMain\":\"https://doi.org/10.1080/13854046.2025.2552289\",\"RegionNum\":3,\"RegionCategory\":\"心理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"CLINICAL NEUROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical Neuropsychologist","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.1080/13854046.2025.2552289","RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
Feasibility of AI-powered assessment scoring: Can large language models replace human raters?
Objective: To assess the feasibility, accuracy, and reliability of using ChatGPT-4.5 (early access), a large language model (LLM), for automated scoring of Brief International Cognitive Assessment for Multiple Sclerosis (BICAMS) protocols. Performance of ChatGPT-4.5 was compared against human raters on scoring record forms (i.e., word lists, numeric tables, and drawing responses).

Method: Thirty-five deidentified BICAMS protocols, including the Symbol Digit Modalities Test (SDMT), California Verbal Learning Test-II (CVLT-II), and Brief Visuospatial Memory Test-Revised (BVMT-R), were independently scored by two trained human raters and ChatGPT-4.5. Scoring with ChatGPT-4.5 involved uploading protocol scans along with structured prompts. Scoring discrepancies were resolved by a blinded third rater. Intraclass correlation coefficients (ICCs), paired-samples t-tests, and descriptive statistics were used to evaluate interrater reliability, accuracy, and speed.

Results: Before the public release of ChatGPT-4.5, strong interrater reliability was found between ChatGPT-4.5 and human raters on all total scores (e.g., CVLT-II ICC = 0.992; SDMT ICC = 1.000; BVMT-R ICC = 0.822-0.853), with minimal average scoring discrepancies per test (CVLT-II = 1.05, SDMT = 0.05, BVMT-R = 1.05-1.19). ChatGPT-4.5 identified scoring errors overlooked by both human raters and completed scoring of each BICAMS protocol in under 9 minutes. After ChatGPT-4.5 was publicly released, reliability decreased notably (e.g., ICC = -0.046 for BVMT-R Trial 3), and average scoring discrepancies per test increased (e.g., SDMT = 6.79).

Conclusions: ChatGPT-4.5 demonstrated accuracy comparable to human raters, though performance variability emerged after public release. With adequate computational resources and prompt/model optimization, LLMs may streamline neuropsychological assessment, enhance clinical efficiency, and reduce human errors.
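For concreteness, the scan-plus-structured-prompt workflow described in the Method section could look roughly like the following Python sketch using the OpenAI SDK. The model identifier, prompt wording, and file name are assumptions for illustration; the study's actual structured prompts are not reproduced here.

```python
# Hypothetical sketch of the scan-plus-structured-prompt scoring workflow.
# The model name, prompt text, and file path are assumptions, not the
# authors' actual materials.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCORING_PROMPT = (
    "You are scoring a Symbol Digit Modalities Test (SDMT) record form. "
    "Count the number of correct symbol-digit substitutions and report "
    "the total as an integer."
)  # placeholder prompt; the study's structured prompts are not published here


def score_protocol_scan(image_path: str) -> str:
    """Send one scanned record form plus a structured prompt to the model."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4.5-preview",  # assumed identifier for ChatGPT-4.5
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": SCORING_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# Example usage with a hypothetical scan file.
print(score_protocol_scan("sdmt_form_01.png"))
```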
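Likewise, the reliability analysis (ICCs and paired-samples t-tests) can be sketched with standard tools. The column names and toy scores below are illustrative rather than the study's data, and pingouin/scipy are assumed stand-ins for whatever statistical software the authors used.

```python
# A minimal sketch of the reliability analysis: ICCs between the model and a
# human rater, plus a paired-samples t-test. The toy data are illustrative.
import pandas as pd
import pingouin as pg
from scipy import stats

# Long-format toy data: one row per (protocol, rater) pair of SDMT totals.
scores = pd.DataFrame({
    "protocol": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "rater":    ["human", "gpt"] * 6,
    "sdmt":     [52, 52, 47, 48, 61, 61, 39, 40, 55, 55, 44, 45],
})

# Intraclass correlations (pingouin reports several ICC variants).
icc = pg.intraclass_corr(data=scores, targets="protocol",
                         raters="rater", ratings="sdmt")
print(icc[["Type", "ICC"]])

# Paired-samples t-test on the same protocols scored by both raters.
human = scores.loc[scores.rater == "human", "sdmt"].to_numpy()
gpt = scores.loc[scores.rater == "gpt", "sdmt"].to_numpy()
t, p = stats.ttest_rel(human, gpt)
print(f"t = {t:.2f}, p = {p:.3f}")
```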
Journal overview:
The Clinical Neuropsychologist (TCN) serves as the premier forum for (1) state-of-the-art, clinically relevant scientific research, (2) in-depth professional discussions of matters germane to evidence-based practice, and (3) clinical case studies in neuropsychology. Of particular interest are papers that can make definitive statements about a given topic (thereby having implications for the standards of clinical practice) and those with the potential to expand today's clinical frontiers. Research on all age groups, and on both clinical and normal populations, is considered.