Exploring the potential of large language models in identifying metabolic dysfunction-associated steatotic liver disease: A comparative study of non-invasive tests and artificial intelligence-generated responses
Authors: Wanying Wu, Yuhu Guo, Qi Li, Congzhuo Jia
Journal: Liver International, 45(4). Published 2024-11-11. Impact factor 6.0; JCR Q1, Gastroenterology & Hepatology.
DOI: 10.1111/liv.16112
URL: https://onlinelibrary.wiley.com/doi/10.1111/liv.16112
Citations: 0
Abstract
Background and Aims
This study sought to assess the capabilities of large language models (LLMs) in identifying clinically significant metabolic dysfunction-associated steatotic liver disease (MASLD).
Methods
We included individuals from NHANES 2017–2018. The validity and reliability of MASLD diagnosis by GPT-3.5 and GPT-4 were quantitatively examined and compared with those of the Fatty Liver Index (FLI) and the United States FLI (USFLI). Receiver operating characteristic (ROC) curve analysis was performed to assess the accuracy of MASLD diagnosis by each scoring system. Additionally, GPT-4V's potential in clinical diagnosis was evaluated using ultrasound images from MASLD patients, so that LLM capabilities were assessed in both textual and visual data interpretation.
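For readers unfamiliar with the FLI, the sketch below computes it from the widely cited formula of Bedogni et al. (2006). The abstract does not state the exact implementation or cutoffs used in this study, so the function name, argument names and units (triglycerides in mg/dL, GGT in U/L, waist circumference in cm) are assumptions made for illustration. The USFLI is not shown because its coefficients are not given in the text.

```python
import math


def fatty_liver_index(triglycerides_mg_dl, bmi, ggt_u_l, waist_cm):
    """Fatty Liver Index on a 0-100 scale (Bedogni et al., 2006).

    Hypothetical helper for illustration; not the study's own code.
    """
    # Linear predictor of the published logistic model.
    linear = (
        0.953 * math.log(triglycerides_mg_dl)
        + 0.139 * bmi
        + 0.718 * math.log(ggt_u_l)
        + 0.053 * waist_cm
        - 15.745
    )
    # Logistic transform, scaled to 0-100.
    return 100.0 / (1.0 + math.exp(-linear))


# Made-up values for one hypothetical individual (not NHANES data).
example_fli = fatty_liver_index(
    triglycerides_mg_dl=150, bmi=30, ggt_u_l=40, waist_cm=100
)
print(round(example_fli, 1))
```

Higher values indicate a higher estimated likelihood of hepatic steatosis; the index increases with each of the four inputs.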
Results
GPT-4 demonstrated performance comparable to FLI and USFLI in MASLD diagnosis, with area under the receiver operating characteristic curve (AUROC) values of .831 (95% CI .796–.867), .817 (95% CI .797–.837) and .827 (95% CI .807–.848), respectively. Based on clinician evaluation, GPT-4 exhibited a trend toward enhanced accuracy, clinical relevance and efficiency compared to GPT-3.5. Additionally, Pearson's r values between GPT-4 and FLI and between GPT-4 and USFLI were .718 and .695, respectively, indicating robust and moderate correlations. Moreover, GPT-4V showed potential in understanding characteristics from hepatic ultrasound imaging but exhibited limited interpretive accuracy in diagnosing MASLD compared to skilled radiologists.
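The two statistics reported above can be illustrated with a minimal pure-Python sketch. It assumes each method yields a numeric risk score per individual and that a binary reference label for MASLD is available. The function names and the toy numbers are hypothetical and are not the study's data. The confidence intervals are omitted because the abstract does not say how they were computed.

```python
import math


def auroc(scores, labels):
    """AUROC as the probability that a randomly chosen positive case
    scores higher than a randomly chosen negative case (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Toy data only: six hypothetical individuals.
toy_labels = [0, 0, 1, 0, 1, 1]          # 1 = MASLD on the reference standard
toy_score_a = [10, 35, 40, 62, 70, 88]   # e.g. one method's risk score
toy_score_b = [12, 30, 45, 50, 75, 80]   # e.g. another method's risk score

print(auroc(toy_score_a, toy_labels))
print(pearson_r(toy_score_a, toy_score_b))
```

The pairwise definition of AUROC used here is equivalent to the area under the empirical ROC curve, and it avoids any dependency on a statistics library.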
Conclusions
GPT-4 achieved performance comparable to traditional risk scores in diagnosing MASLD and exhibited improved convenience, versatility and the capacity to offer user-friendly outputs. The integration of GPT-4V highlights the capacities of LLMs in handling both textual and visual medical data, reinforcing their expansive utility in healthcare practice.
About the journal:
Liver International promotes all aspects of the science of hepatology from basic research to applied clinical studies. Providing an international forum for the publication of high-quality original research in hepatology, it is an essential resource for everyone working on normal and abnormal structure and function in the liver and its constituent cells, including clinicians and basic scientists involved in the multi-disciplinary field of hepatology. The journal welcomes articles from all fields of hepatology, which may be published as original articles, brief definitive reports, reviews, mini-reviews, images in hepatology and letters to the Editor.