The actual performance of large language models in providing liver cirrhosis-related information: A comparative study
Yanqiu Li, Zhuojun Li, Jinze Li, Long Liu, Yao Liu, Bingbing Zhu, Ke Shi, Yu Lu, Yongqi Li, Xuanwei Zeng, Ying Feng, Xianbo Wang
International Journal of Medical Informatics, Volume 201, Article 105961 (2025). DOI: 10.1016/j.ijmedinf.2025.105961
Abstract
Objective
As large language models (LLMs) become increasingly prevalent in the medical field, patients managing the long-term course of liver cirrhosis are increasingly turning to these advanced online resources for disease-related information. A comprehensive evaluation of the real-world performance of LLMs in this specialized medical area is therefore necessary.
Methods
This study evaluates the performance of four mainstream LLMs (ChatGPT-4o, Claude-3.5 Sonnet, Gemini-1.5 Pro, and Llama-3.1) in answering 39 questions related to liver cirrhosis. Information quality, readability and accuracy were assessed using the Ensuring Quality Information for Patients (EQIP) tool, Flesch-Kincaid metrics and consensus scoring, respectively. The LLMs' ability to simplify complex information and to self-correct was also assessed.
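Flesch-Kincaid readability is derived from average sentence length and syllables per word. A minimal sketch of how such scores could be computed is shown below; the vowel-group syllable counter and the sample `answer` string are illustrative assumptions, not the study's actual pipeline.

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic; dictionary-based counters are more accurate."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1  # drop a silent trailing 'e'
    return max(n, 1)

def flesch_kincaid(text: str) -> dict:
    """Return Flesch Reading Ease and Flesch-Kincaid Grade Level for a passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / max(len(sentences), 1)   # words per sentence
    spw = syllables / max(len(words), 1)        # syllables per word
    return {
        "reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "grade_level": 0.39 * wps + 11.8 * spw - 15.59,
    }

# Illustrative usage: a grade level around 13 or above implies college-level reading.
answer = "Cirrhosis is advanced scarring of the liver caused by long-term damage."
print(flesch_kincaid(answer))
```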
Results
Significant performance differences were observed among the models. Gemini scored highest in providing high-quality information. While the readability of all four LLMs' answers was generally low, requiring college-level reading comprehension, the models exhibited strong capabilities in simplifying complex information. ChatGPT performed best in terms of accuracy, with a “Good” rating of 80%, higher than Claude (72%), Gemini (49%), and Llama (64%). All models received high scores for comprehensiveness. Each of the four LLMs demonstrated some degree of self-correction, improving the accuracy of initial answers when given simple prompts: ChatGPT’s and Llama’s accuracy improved by 100%, Claude’s by 50% and Gemini’s by 67%.
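As an illustration of how the per-model accuracy figures above could be derived, the sketch below tallies consensus ratings and reports the share rated “Good”. The rating labels and the `ratings` data are hypothetical stand-ins for the study's actual scoring sheet of 39 questions.

```python
from collections import Counter

# Hypothetical consensus ratings (one label per question, per model);
# the real study scored 39 liver cirrhosis questions per model.
ratings = {
    "ChatGPT-4o":        ["Good", "Good", "Acceptable", "Good"],
    "Claude-3.5 Sonnet": ["Good", "Acceptable", "Good", "Poor"],
}

def share_rated_good(labels: list[str]) -> float:
    """Percentage of questions whose consensus rating is 'Good'."""
    counts = Counter(labels)
    return 100.0 * counts["Good"] / len(labels)

for model, labels in ratings.items():
    print(f"{model}: {share_rated_good(labels):.0f}% rated Good")
```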
Conclusion
LLMs demonstrate excellent performance in generating health information related to liver cirrhosis, yet they differ in answer quality, readability and accuracy. Future research should focus on enhancing their value in healthcare, with the ultimate goal of reliable, accessible and patient-centered dissemination of medical information.
Journal introduction:
International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings.
The scope of the journal covers:
Information systems, including national or international registration systems, hospital information systems, departmental and/or physician's office systems, document handling systems, electronic medical record systems, standardization, systems integration etc.;
Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.;
Educational computer-based programs pertaining to medical informatics or medicine in general;
Organizational, economic, social, clinical impact, ethical and cost-benefit aspects of IT applications in health care.