Do large language model chatbots perform better than established patient information resources in answering patient questions? A comparative study on melanoma.

Impact factor: 11.0 · CAS Region 1 (Medicine) · JCR Q1 (Dermatology)
Nadia C W Kamminga, June E C Kievits, Peter W Plaisier, Jako S Burgers, Astrid M van der Veldt, Jan A G J van den Brand, Mark Mulder, Marlies Wakkee, Marjolein Lugtenberg, Tamar Nijsten
British Journal of Dermatology, pages 306-315. Published 2025-01-24. DOI: 10.1093/bjd/ljae377
Citations: 0

Abstract


Background: Large language models (LLMs) have a potential role in providing adequate patient information.

Objectives: To compare the quality of LLM responses with established Dutch patient information resources (PIRs) in answering patient questions regarding melanoma.

Methods: Responses from ChatGPT versions 3.5 and 4.0, Gemini, and three leading Dutch melanoma PIRs to 50 melanoma-specific questions were examined at baseline and, for LLMs, again after 8 months. Outcomes included (medical) accuracy, completeness, personalization, readability and, additionally, reproducibility for LLMs. Comparative analyses were performed within LLMs and PIRs using Friedman's ANOVA, and between best-performing LLMs and gold-standard (GS) PIRs using the Wilcoxon signed-rank test.
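The two-stage comparison design described in the Methods (Friedman's ANOVA within the group of LLMs, then a Wilcoxon signed-rank test pairing the best-performing LLM against the gold-standard PIR per question) can be sketched with SciPy. The per-question ratings below are made-up illustrative integers, not the study's data, and all variable names are my own:

```python
# Sketch of the study's statistical design with simulated ratings.
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
n_questions = 50  # the study used 50 melanoma-specific questions

# Hypothetical per-question quality ratings (e.g. accuracy on a 1-5 scale),
# paired by question across the three LLMs.
chatgpt35 = rng.integers(3, 6, n_questions)
chatgpt40 = rng.integers(3, 6, n_questions)
gemini = rng.integers(2, 6, n_questions)

# Within-group comparison: Friedman's ANOVA across the three LLMs.
stat, p_within = friedmanchisquare(chatgpt35, chatgpt40, gemini)
print(f"Friedman chi2 = {stat:.2f}, p = {p_within:.3f}")

# Between-group comparison: best-performing LLM vs. the gold-standard PIR,
# paired per question, using the Wilcoxon signed-rank test.
gs_pir = rng.integers(2, 6, n_questions)
res = wilcoxon(chatgpt35, gs_pir)
print(f"Wilcoxon W = {res.statistic:.1f}, p = {res.pvalue:.3f}")
```

Both tests are non-parametric and operate on paired (per-question) observations, which matches ordinal quality ratings that cannot be assumed normally distributed.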

Results: Within LLMs, ChatGPT-3.5 demonstrated the highest accuracy (P = 0.009). Gemini performed best in completeness (P < 0.001), personalization (P = 0.007) and readability (P < 0.001). PIRs were consistent in accuracy and completeness, with the general practitioner's website excelling in personalization (P = 0.013) and readability (P < 0.001). The best-performing LLM outperformed the GS-PIR on completeness and personalization, yet was less accurate and less readable. Over time, response reproducibility decreased for all LLMs, showing variability across outcomes.

Conclusions: Although LLMs show potential in providing highly personalized and complete responses to patient questions regarding melanoma, improving and safeguarding accuracy, reproducibility and accessibility is crucial before they can replace or complement conventional PIRs.

Source journal: British Journal of Dermatology (Medicine - Dermatology)
CiteScore: 16.30
Self-citation rate: 3.90%
Articles per year: 1062
Review turnaround: 2-4 weeks
About the journal: The British Journal of Dermatology (BJD) is committed to publishing the highest quality dermatological research. Through its publications, the journal seeks to advance the understanding, management, and treatment of skin diseases, ultimately aiming to improve patient outcomes.