Title: 8. Agreement of Large Language Models with Humans in Extracting Data from Unstructured Records of Multiple Sclerosis Patients
Authors: Zainab Ali Alnhwi, Faisal Saleh Aleisa, Walaa Fahad Almutawa, Hamad Saleh Alamro, Atheer Hussain Alanazi, Abdulrhman Aljouie, Ahmad Abulaban
Journal: Multiple Sclerosis and Related Disorders, volume 92, Article 105969 (Journal Article, published 2024-12-01)
DOI: 10.1016/j.msard.2024.105969
URL: https://www.sciencedirect.com/science/article/pii/S2211034824005455
Journal metrics: impact factor 2.9; JCR Q2 (Clinical Neurology)
Citation count: 0
Abstract
Background/Objective(s)
Multiple Sclerosis (MS) is a chronic autoimmune disorder of the central nervous system that leads to progressive neurological dysfunction and most often occurs in individuals aged 20-40 years. MS progression is often monitored with the Expanded Disability Status Scale (EDSS), though its accuracy can be compromised in elderly patients with comorbidities. Artificial Intelligence (AI) and Large Language Models (LLMs) have revolutionized data analysis and decision-making processes in healthcare. Machine Learning (ML), a core component of AI, improves through data-driven learning, particularly in Natural Language Processing (NLP) applications that extract information from narrative data. LLMs, exemplified by models like GPT-4, generate human-like responses by using self-supervised learning techniques and training on extensive text.
Recent studies have highlighted the capabilities of LLMs in medical data tasks, such as calculating medical scores and extracting information from clinical notes. However, no studies have compared LLMs with human performance in extracting data specifically from unstructured MS patient records.
This study aims to measure the agreement between LLMs and humans in extracting data from unstructured records of MS patients at the National Guard Health Affairs (NGHA) in Saudi Arabia, highlighting the potential and limitations of LLMs in this context.
Material(s) and Method(s)
This cross-sectional study was conducted at King Abdulaziz Medical City (KAMC) and King Abdullah International Medical Research Centre (KAIMRC). A total of 382 patients diagnosed with MS, aged over 14 years, were included. Patients with atypical MS presentations or incomplete medical records were excluded. The primary variables of interest were EDSS scores, Disease-Modifying Therapies (DMTs), and the number and characteristics of relapses. Secondary variables included patient demographics, smoking status, body metrics, date of first MS symptoms, diagnosis date, secondary diseases, and MRI findings. Data were extracted from clinical notes by a Phi-3 Mini 128K LLM running on a local AI workstation and, for comparison, manually by trained students. Statistical analyses included the Kappa coefficient for categorical variables and the Intraclass Correlation Coefficient (ICC) for numerical variables, with significance set at p < 0.05. Analyses were conducted using SAS 9.4 on an AI workstation equipped with an AMD Threadripper Pro CPU, 258 GB of memory, and two NVIDIA RTX A6000 GPUs.
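The Kappa coefficient for a categorical variable can be illustrated with a short sketch. It is written in Python rather than the SAS 9.4 used in the study, and it assumes the two-rater (Cohen's) form of kappa, which the abstract does not specify. Kappa compares the observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The function name `cohen_kappa` and the relapse labels are hypothetical and are not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over labels.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined (0/0), so report perfect agreement by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical relapse-type labels for six records (not study data).
human = ["sensory", "motor", "sensory", "visual", "motor", "sensory"]
model = ["sensory", "sensory", "sensory", "visual", "motor", "motor"]

kappa = cohen_kappa(human, model)
print(round(kappa, 4))  # 0.4545
```

In this toy example the raters agree on 4 of 6 records (p_o = 0.667), but chance alone would give p_e = 0.389, so kappa is only about 0.45. This is why kappa is preferred over raw percent agreement when comparing a model with human annotators.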
Result(s)
The analysis revealed that the agreement between human annotators and the Phi-3 Mini LLM for extracting clinical data from unstructured MS patient records was lower than expected. The ICCs for the EDSS and the number of clinical relapses were -0.4123 and 0.3848, respectively, indicating weak agreement. The model demonstrated limited ability to identify the first MS attack symptoms, with a Kappa value of 0.3976, and to recognize the initial DMT used, with a particularly low Kappa value of 0.0077. These findings highlight the challenges the LLM faced in accurately interpreting and extracting information from unstructured narrative data. However, the LLM performed better in extracting data related to sensory clinical relapses, achieving a higher Kappa value than for other data types. This suggests that standardized reporting practices in medical records for specific clinical characteristics, such as sensory relapses, may enhance LLM accuracy. Overall, human annotators outperformed the LLM in most clinical data extraction tasks.
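A negative ICC, like the one reported for the EDSS, arises when the two raters differ more on the same patient than patients differ from one another. The sketch below illustrates this in Python rather than SAS. It assumes the one-way random-effects form, ICC(1,1) = (MSB - MSW) / (MSB + (k - 1) MSW), because the abstract does not state which ICC form was used. The function name `icc_oneway` and all scores are hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC(1,1).

    `ratings` is a list with one tuple per subject, each holding the
    scores given by the k raters (here: human, model).
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two subjects")
    k = len(ratings[0])
    if k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    subject_means = [sum(r) / k for r in ratings]

    # Between-subject and within-subject sums of squares.
    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, subject_means) for x in r
    )

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("all ratings are identical; ICC is undefined")
    return (ms_between - ms_within) / denom


# Hypothetical EDSS-like scores as (human, model) pairs; not study data.
agree = [(1.0, 1.0), (2.0, 2.0), (3.5, 3.5), (6.0, 6.0)]
opposed = [(1.0, 6.0), (2.0, 5.0), (5.0, 2.0), (6.0, 1.0)]

icc_agree = icc_oneway(agree)
icc_opposed = icc_oneway(opposed)
print(icc_agree)    # 1.0
print(icc_opposed)  # -1.0
```

In the second data set every patient has the same mean score, so all the variance is disagreement between the two raters and the ICC reaches its lower bound of -1 for two raters. A value such as -0.4123 lies between this extreme and zero, which points to systematic disagreement rather than merely noisy agreement.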
Conclusion(s)
Agreement between trained students and the LLM was weak for the EDSS, the number of attacks, first-attack symptoms, and first DMT used. Because of computational limitations, we used a small language model (SLM) with few parameters, which might explain the poor results. Further study with larger language models, such as Llama 3.1 (405B), could yield more positive results.
To comply with the Personal Data Protection Law (PDPL) and protect patients' privacy, a limited version of an LLM was used. This was a limitation of the study, as were time constraints and the load that the sample size placed on the Phi-3 Mini model.
The Kappa inter-rater agreement showed that Phi-3 Mini performed better on sensory clinical relapses than on other relapse types.
About the journal:
Multiple Sclerosis is an area of ever-expanding research and escalating publications. Multiple Sclerosis and Related Disorders is a wide-ranging international journal supported by key researchers from all neuroscience domains that focus on MS and associated diseases of the central nervous system. The primary aim of this new journal is the rapid publication of high-quality original research in the field. Important secondary aims will be timely updates and editorials on important scientific and clinical care advances, controversies in the field, and invited opinion articles from current thought leaders on topical issues. One section of the journal will focus on teaching, written to enhance the practice of community and academic neurologists involved in the care of MS patients. Summaries of key articles written for a lay audience will be provided as an online resource.
A team of four chief editors is supported by leading section editors who will commission and appraise original and review articles concerning: clinical neurology, neuroimaging, neuropathology, neuroepidemiology, therapeutics, genetics / transcriptomics, experimental models, neuroimmunology, biomarkers, neuropsychology, neurorehabilitation, measurement scales, teaching, neuroethics and lay communication.