Madunil A Niriella, Pathum Premaratna, Mananjala Senanayake, Senerath Kodisinghe, Uditha Dassanayake, Anuradha Dassanayake, Dileepa S Ediriweera, H Janaka de Silva
{"title":"自由获取的、基线的、通用的大型语言模型生成肝病常见问题患者信息的可靠性:一项初步横断面研究。","authors":"Madunil A Niriella, Pathum Premaratna, Mananjala Senanayake, Senerath Kodisinghe, Uditha Dassanayake, Anuradha Dassanayake, Dileepa S Ediriweera, H Janaka de Silva","doi":"10.1080/17474124.2025.2471874","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>We assessed the use of large language models (LLMs) like ChatGPT-3.5 and Gemini against human experts as sources of patient information.</p><p><strong>Research design and methods: </strong>We compared the accuracy, completeness and quality of freely accessible, baseline, general-purpose LLM-generated responses to 20 frequently asked questions (FAQs) on liver disease, with those from two gastroenterologists, using the Kruskal-Wallis test. Three independent gastroenterologists blindly rated each response.</p><p><strong>Results: </strong>The expert and AI-generated responses displayed high mean scores across all domains, with no statistical difference between the groups for accuracy [H(2) = 0.421, <i>p</i> = 0.811], completeness [H(2) = 3.146, <i>p</i> = 0.207], or quality [H(2) = 3.350, <i>p</i> = 0.187]. We found no statistical difference between rank totals in accuracy [H(2) = 5.559, <i>p</i> = 0.062], completeness [H(2) = 0.104, <i>p</i> = 0.949], or quality [H(2) = 0.420, <i>p</i> = 0.810] between the three raters (R1, R2, R3).</p><p><strong>Conclusion: </strong>Our findings outline the potential of freely accessible, baseline, general-purpose LLMs in providing reliable answers to FAQs on liver disease.</p>","PeriodicalId":12257,"journal":{"name":"Expert Review of Gastroenterology & Hepatology","volume":" ","pages":"437-442"},"PeriodicalIF":3.8000,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"The reliability of freely accessible, baseline, general-purpose large language model generated patient information for frequently asked questions on liver disease: a preliminary cross-sectional study.\",\"authors\":\"Madunil A Niriella, Pathum Premaratna, Mananjala Senanayake, Senerath Kodisinghe, Uditha Dassanayake, Anuradha Dassanayake, Dileepa S Ediriweera, H Janaka de Silva\",\"doi\":\"10.1080/17474124.2025.2471874\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>We assessed the use of large language models (LLMs) like ChatGPT-3.5 and Gemini against human experts as sources of patient information.</p><p><strong>Research design and methods: </strong>We compared the accuracy, completeness and quality of freely accessible, baseline, general-purpose LLM-generated responses to 20 frequently asked questions (FAQs) on liver disease, with those from two gastroenterologists, using the Kruskal-Wallis test. Three independent gastroenterologists blindly rated each response.</p><p><strong>Results: </strong>The expert and AI-generated responses displayed high mean scores across all domains, with no statistical difference between the groups for accuracy [H(2) = 0.421, <i>p</i> = 0.811], completeness [H(2) = 3.146, <i>p</i> = 0.207], or quality [H(2) = 3.350, <i>p</i> = 0.187]. 
We found no statistical difference between rank totals in accuracy [H(2) = 5.559, <i>p</i> = 0.062], completeness [H(2) = 0.104, <i>p</i> = 0.949], or quality [H(2) = 0.420, <i>p</i> = 0.810] between the three raters (R1, R2, R3).</p><p><strong>Conclusion: </strong>Our findings outline the potential of freely accessible, baseline, general-purpose LLMs in providing reliable answers to FAQs on liver disease.</p>\",\"PeriodicalId\":12257,\"journal\":{\"name\":\"Expert Review of Gastroenterology & Hepatology\",\"volume\":\" \",\"pages\":\"437-442\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Expert Review of Gastroenterology & Hepatology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1080/17474124.2025.2471874\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/2/27 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q2\",\"JCRName\":\"GASTROENTEROLOGY & HEPATOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Review of Gastroenterology & Hepatology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1080/17474124.2025.2471874","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/2/27 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"GASTROENTEROLOGY & HEPATOLOGY","Score":null,"Total":0}
The reliability of freely accessible, baseline, general-purpose large language model generated patient information for frequently asked questions on liver disease: a preliminary cross-sectional study.
Background: We assessed the use of large language models (LLMs) like ChatGPT-3.5 and Gemini against human experts as sources of patient information.
Research design and methods: We compared the accuracy, completeness, and quality of freely accessible, baseline, general-purpose LLM-generated responses to 20 frequently asked questions (FAQs) on liver disease with those from two gastroenterologists, using the Kruskal-Wallis test. Three independent gastroenterologists rated each response in a blinded manner.
Results: The expert and AI-generated responses displayed high mean scores across all domains, with no statistically significant difference between the groups for accuracy [H(2) = 0.421, p = 0.811], completeness [H(2) = 3.146, p = 0.207], or quality [H(2) = 3.350, p = 0.187]. We also found no statistically significant difference in rank totals for accuracy [H(2) = 5.559, p = 0.062], completeness [H(2) = 0.104, p = 0.949], or quality [H(2) = 0.420, p = 0.810] among the three raters (R1, R2, R3).
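The comparisons above rely on the Kruskal-Wallis test, a rank-based test for differences across three or more independent groups, where H(2) denotes the test statistic with two degrees of freedom (three groups). As a minimal sketch, assuming hypothetical 1-5 ratings for three response sources (illustrative values only, not the study's data), such a comparison can be run with scipy.stats.kruskal:

# Minimal sketch of a Kruskal-Wallis comparison, using hypothetical 1-5
# ratings for three response sources (illustrative values only; not the
# study's actual data).
from scipy.stats import kruskal

expert_1 = [5, 4, 5, 4, 5, 5, 4, 5, 4, 5]  # hypothetical ratings
expert_2 = [4, 5, 5, 4, 4, 5, 5, 4, 5, 4]  # hypothetical ratings
llm      = [5, 4, 4, 5, 5, 4, 5, 4, 4, 5]  # hypothetical ratings

# kruskal returns the H statistic (df = number of groups - 1, i.e. H(2) for
# three groups) and the p-value; p > 0.05 indicates no detectable difference
# in the rank distributions, as in the Results reported above.
h_stat, p_value = kruskal(expert_1, expert_2, llm)
print(f"H(2) = {h_stat:.3f}, p = {p_value:.3f}")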
Conclusion: Our findings outline the potential of freely accessible, baseline, general-purpose LLMs in providing reliable answers to FAQs on liver disease.
Journal introduction:
The enormous health and economic burden of gastrointestinal disease worldwide warrants a sharp focus on the etiology, epidemiology, prevention, diagnosis, treatment and development of new therapies. By the end of the last century we had seen enormous advances, both in technologies to visualize disease and in curative therapies in areas such as gastric ulcer, with the advent first of the H2-antagonists and then the proton pump inhibitors - clear examples of how advances in medicine can massively benefit the patient. Nevertheless, specialists face ongoing challenges from a wide array of diseases of diverse etiology.