{"title":"探讨大型语言模型在乙型肝炎感染相关问题上的表现:一项比较研究。","authors":"Yu Li, Chen-Kai Huang, Yi Hu, Xiao-Dong Zhou, Cong He, Jia-Wei Zhong","doi":"10.3748/wjg.v31.i3.101092","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Patients with hepatitis B virus (HBV) infection require chronic and personalized care to improve outcomes. Large language models (LLMs) can potentially provide medical information for patients.</p><p><strong>Aim: </strong>To examine the performance of three LLMs, ChatGPT-3.5, ChatGPT-4.0, and Google Gemini, in answering HBV-related questions.</p><p><strong>Methods: </strong>LLMs' responses to HBV-related questions were independently graded by two medical professionals using a four-point accuracy scale, and disagreements were resolved by a third reviewer. Each question was run three times using three LLMs. Readability was assessed <i>via</i> the Gunning Fog index and Flesch-Kincaid grade level.</p><p><strong>Results: </strong>Overall, all three LLM chatbots achieved high average accuracy scores for subjective questions (ChatGPT-3.5: 3.50; ChatGPT-4.0: 3.69; Google Gemini: 3.53, out of a maximum score of 4). With respect to objective questions, ChatGPT-4.0 achieved an 80.8% accuracy rate, compared with 62.9% for ChatGPT-3.5 and 73.1% for Google Gemini. Across the six domains, ChatGPT-4.0 performed better in terms of diagnosis, whereas Google Gemini demonstrated excellent clinical manifestations. Notably, in the readability analysis, the mean Gunning Fog index and Flesch-Kincaid grade level scores of the three LLM chatbots were significantly higher than the standard level eight, far exceeding the reading level of the normal population.</p><p><strong>Conclusion: </strong>Our results highlight the potential of LLMs, especially ChatGPT-4.0, for delivering responses to HBV-related questions. LLMs may be an adjunctive informational tool for patients and physicians to improve outcomes. Nevertheless, current LLMs should not replace personalized treatment recommendations from physicians in the management of HBV infection.</p>","PeriodicalId":23778,"journal":{"name":"World Journal of Gastroenterology","volume":"31 3","pages":"101092"},"PeriodicalIF":4.3000,"publicationDate":"2025-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11684168/pdf/","citationCount":"0","resultStr":"{\"title\":\"Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study.\",\"authors\":\"Yu Li, Chen-Kai Huang, Yi Hu, Xiao-Dong Zhou, Cong He, Jia-Wei Zhong\",\"doi\":\"10.3748/wjg.v31.i3.101092\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Patients with hepatitis B virus (HBV) infection require chronic and personalized care to improve outcomes. Large language models (LLMs) can potentially provide medical information for patients.</p><p><strong>Aim: </strong>To examine the performance of three LLMs, ChatGPT-3.5, ChatGPT-4.0, and Google Gemini, in answering HBV-related questions.</p><p><strong>Methods: </strong>LLMs' responses to HBV-related questions were independently graded by two medical professionals using a four-point accuracy scale, and disagreements were resolved by a third reviewer. Each question was run three times using three LLMs. Readability was assessed <i>via</i> the Gunning Fog index and Flesch-Kincaid grade level.</p><p><strong>Results: </strong>Overall, all three LLM chatbots achieved high average accuracy scores for subjective questions (ChatGPT-3.5: 3.50; ChatGPT-4.0: 3.69; Google Gemini: 3.53, out of a maximum score of 4). With respect to objective questions, ChatGPT-4.0 achieved an 80.8% accuracy rate, compared with 62.9% for ChatGPT-3.5 and 73.1% for Google Gemini. Across the six domains, ChatGPT-4.0 performed better in terms of diagnosis, whereas Google Gemini demonstrated excellent clinical manifestations. Notably, in the readability analysis, the mean Gunning Fog index and Flesch-Kincaid grade level scores of the three LLM chatbots were significantly higher than the standard level eight, far exceeding the reading level of the normal population.</p><p><strong>Conclusion: </strong>Our results highlight the potential of LLMs, especially ChatGPT-4.0, for delivering responses to HBV-related questions. LLMs may be an adjunctive informational tool for patients and physicians to improve outcomes. Nevertheless, current LLMs should not replace personalized treatment recommendations from physicians in the management of HBV infection.</p>\",\"PeriodicalId\":23778,\"journal\":{\"name\":\"World Journal of Gastroenterology\",\"volume\":\"31 3\",\"pages\":\"101092\"},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2025-01-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11684168/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"World Journal of Gastroenterology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.3748/wjg.v31.i3.101092\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"GASTROENTEROLOGY & HEPATOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"World Journal of Gastroenterology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.3748/wjg.v31.i3.101092","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"GASTROENTEROLOGY & HEPATOLOGY","Score":null,"Total":0}
Exploring the performance of large language models on hepatitis B infection-related questions: A comparative study.
Background: Patients with hepatitis B virus (HBV) infection require chronic and personalized care to improve outcomes. Large language models (LLMs) can potentially provide medical information for patients.
Aim: To examine the performance of three LLMs, ChatGPT-3.5, ChatGPT-4.0, and Google Gemini, in answering HBV-related questions.
Methods: LLMs' responses to HBV-related questions were independently graded by two medical professionals using a four-point accuracy scale, and disagreements were resolved by a third reviewer. Each question was run three times using three LLMs. Readability was assessed via the Gunning Fog index and Flesch-Kincaid grade level.
Results: Overall, all three LLM chatbots achieved high average accuracy scores for subjective questions (ChatGPT-3.5: 3.50; ChatGPT-4.0: 3.69; Google Gemini: 3.53, out of a maximum score of 4). With respect to objective questions, ChatGPT-4.0 achieved an 80.8% accuracy rate, compared with 62.9% for ChatGPT-3.5 and 73.1% for Google Gemini. Across the six domains, ChatGPT-4.0 performed better in terms of diagnosis, whereas Google Gemini demonstrated excellent clinical manifestations. Notably, in the readability analysis, the mean Gunning Fog index and Flesch-Kincaid grade level scores of the three LLM chatbots were significantly higher than the standard level eight, far exceeding the reading level of the normal population.
Conclusion: Our results highlight the potential of LLMs, especially ChatGPT-4.0, for delivering responses to HBV-related questions. LLMs may be an adjunctive informational tool for patients and physicians to improve outcomes. Nevertheless, current LLMs should not replace personalized treatment recommendations from physicians in the management of HBV infection.
期刊介绍:
The primary aims of the WJG are to improve diagnostic, therapeutic and preventive modalities and the skills of clinicians and to guide clinical practice in gastroenterology and hepatology.