Thyroid Eye Disease and Artificial Intelligence: A Comparative Study of ChatGPT-3.5, ChatGPT-4o, and Gemini in Patient Information Delivery

IF 1.2 | Region 4 (Medicine) | Q3 OPHTHALMOLOGY
Daniel Bahir, Morris Hartstein, Ofira Zloto, Cat Burkat, Jimmy Uddin, Shirin Hamed Azzam
{"title":"甲状腺眼病与人工智能:ChatGPT-3.5、chatgpt - 40和Gemini在患者信息传递中的比较研究","authors":"Daniel Bahir, Morris Hartstein, Ofira Zloto, Cat Burkat, Jimmy Uddin, Shirin Hamed Azzam","doi":"10.1097/IOP.0000000000002882","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study aimed to compare the effectiveness of 3 artificial intelligence language models-GPT-3.5, GPT-4o, and Gemini, in delivering patient-centered information about thyroid eye disease (TED). We evaluated their performance based on the accuracy and comprehensiveness of their responses to common patient inquiries regarding TED. The study did not assess the repeatability of artificial intelligence responses, focusing on single-session evaluations per model.</p><p><strong>Methods: </strong>Five experienced oculoplastic surgeons assessed the responses generated by the artificial intelligence models to 12 key questions frequently asked by TED patients. These questions addressed TED pathophysiology, risk factors, clinical presentation, diagnostic testing, and treatment options. Each response was rated for correctness and reliability on a 7-point Likert scale, where 1 indicated incorrect or unreliable information and 7 indicated highly accurate and reliable information. Correctness referred to factual accuracy, while reliability assessed trustworthiness for patient use. The evaluations were anonymized, and the final scores were averaged across the surgeons to facilitate model comparisons.</p><p><strong>Results: </strong>GPT-3.5 emerged as the top performer, achieving an average correctness score of 5.75 and a reliability score of 5.68, excelling in delivering detailed information on complex topics such as TED treatment and surgical interventions. GPT-4o followed with scores of 5.32 for correctness and 5.25 for reliability, generally providing accurate but less detailed information. Gemini trailed with scores of 5.10 for correctness and 4.70 for reliability, often providing sufficient responses for simpler questions but lacking detail in complex areas like second-line immunosuppressive treatments. Statistical analysis using the Friedman test showed significant differences between models (p < 0.05) for key topics, with GPT-3.5 consistently leading.</p><p><strong>Conclusions: </strong>GPT-3.5 was the most effective model for delivering reliable and comprehensive patient information, particularly for complex treatment and surgical topics. GPT-4o provided reliable general information but lacked the necessary depth for specialized topics, while Gemini was suitable for addressing basic patient inquiries but insufficient for detailed medical information. 
This study highlights the role of artificial intelligence in patient education, suggesting that models like GPT-3.5 can be valuable tools for clinicians in enhancing patient understanding of TED.</p>","PeriodicalId":19588,"journal":{"name":"Ophthalmic Plastic and Reconstructive Surgery","volume":" ","pages":""},"PeriodicalIF":1.2000,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Thyroid Eye Disease and Artificial Intelligence: A Comparative Study of ChatGPT-3.5, ChatGPT-4o, and Gemini in Patient Information Delivery.\",\"authors\":\"Daniel Bahir, Morris Hartstein, Ofira Zloto, Cat Burkat, Jimmy Uddin, Shirin Hamed Azzam\",\"doi\":\"10.1097/IOP.0000000000002882\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>This study aimed to compare the effectiveness of 3 artificial intelligence language models-GPT-3.5, GPT-4o, and Gemini, in delivering patient-centered information about thyroid eye disease (TED). We evaluated their performance based on the accuracy and comprehensiveness of their responses to common patient inquiries regarding TED. The study did not assess the repeatability of artificial intelligence responses, focusing on single-session evaluations per model.</p><p><strong>Methods: </strong>Five experienced oculoplastic surgeons assessed the responses generated by the artificial intelligence models to 12 key questions frequently asked by TED patients. These questions addressed TED pathophysiology, risk factors, clinical presentation, diagnostic testing, and treatment options. Each response was rated for correctness and reliability on a 7-point Likert scale, where 1 indicated incorrect or unreliable information and 7 indicated highly accurate and reliable information. Correctness referred to factual accuracy, while reliability assessed trustworthiness for patient use. The evaluations were anonymized, and the final scores were averaged across the surgeons to facilitate model comparisons.</p><p><strong>Results: </strong>GPT-3.5 emerged as the top performer, achieving an average correctness score of 5.75 and a reliability score of 5.68, excelling in delivering detailed information on complex topics such as TED treatment and surgical interventions. GPT-4o followed with scores of 5.32 for correctness and 5.25 for reliability, generally providing accurate but less detailed information. Gemini trailed with scores of 5.10 for correctness and 4.70 for reliability, often providing sufficient responses for simpler questions but lacking detail in complex areas like second-line immunosuppressive treatments. Statistical analysis using the Friedman test showed significant differences between models (p < 0.05) for key topics, with GPT-3.5 consistently leading.</p><p><strong>Conclusions: </strong>GPT-3.5 was the most effective model for delivering reliable and comprehensive patient information, particularly for complex treatment and surgical topics. GPT-4o provided reliable general information but lacked the necessary depth for specialized topics, while Gemini was suitable for addressing basic patient inquiries but insufficient for detailed medical information. 
This study highlights the role of artificial intelligence in patient education, suggesting that models like GPT-3.5 can be valuable tools for clinicians in enhancing patient understanding of TED.</p>\",\"PeriodicalId\":19588,\"journal\":{\"name\":\"Ophthalmic Plastic and Reconstructive Surgery\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.2000,\"publicationDate\":\"2024-12-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Ophthalmic Plastic and Reconstructive Surgery\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1097/IOP.0000000000002882\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"OPHTHALMOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ophthalmic Plastic and Reconstructive Surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/IOP.0000000000002882","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Citations: 0

Abstract


Purpose: This study aimed to compare the effectiveness of three artificial intelligence language models (GPT-3.5, GPT-4o, and Gemini) in delivering patient-centered information about thyroid eye disease (TED). We evaluated their performance based on the accuracy and comprehensiveness of their responses to common patient inquiries regarding TED. The study did not assess the repeatability of artificial intelligence responses, focusing instead on a single evaluation session per model.

Methods: Five experienced oculoplastic surgeons assessed the artificial intelligence models' responses to 12 key questions frequently asked by TED patients. These questions addressed TED pathophysiology, risk factors, clinical presentation, diagnostic testing, and treatment options. Each response was rated for correctness and reliability on a 7-point Likert scale, where 1 indicated incorrect or unreliable information and 7 indicated highly accurate and reliable information. Correctness referred to factual accuracy, while reliability assessed trustworthiness for patient use. The evaluations were anonymized, and the final scores were averaged across the surgeons to facilitate model comparisons.
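As a concrete illustration of this scoring scheme (a minimal sketch, not the authors' analysis code; the rating matrix below is invented for illustration), each model's ratings can be arranged as a 5 × 12 matrix of raters by questions and averaged:

```python
import numpy as np

# Hypothetical 7-point Likert ratings for one model: 5 raters (rows)
# x 12 questions (columns). Values are invented placeholders, not study data.
ratings = np.array([
    [6, 6, 5, 6, 7, 5, 6, 6, 5, 6, 6, 5],
    [6, 5, 6, 6, 6, 5, 6, 7, 5, 6, 6, 5],
    [5, 6, 6, 5, 6, 6, 6, 6, 6, 5, 6, 6],
    [6, 6, 5, 6, 6, 5, 7, 6, 5, 6, 5, 6],
    [6, 5, 6, 6, 6, 6, 6, 6, 5, 6, 6, 5],
])

# Per-question means (averaged across the 5 raters), then the single
# summary score of the kind reported per model in the Results.
per_question = ratings.mean(axis=0)
overall = ratings.mean()
print(per_question.round(2))
print(f"overall = {overall:.2f}")
```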

Results: GPT-3.5 emerged as the top performer, achieving an average correctness score of 5.75 and a reliability score of 5.68, excelling in delivering detailed information on complex topics such as TED treatment and surgical interventions. GPT-4o followed with scores of 5.32 for correctness and 5.25 for reliability, generally providing accurate but less detailed information. Gemini trailed with scores of 5.10 for correctness and 4.70 for reliability, often providing sufficient responses for simpler questions but lacking detail in complex areas like second-line immunosuppressive treatments. Statistical analysis using the Friedman test showed significant differences between models (p < 0.05) for key topics, with GPT-3.5 consistently leading.
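The Friedman comparison can be sketched as follows. This is a hedged illustration under stated assumptions: the per-question mean scores serve as matched blocks across the three models, and the numbers are placeholders chosen only so that the column means roughly match the reported averages (5.75, 5.32, 5.10); they are not the study's data.

```python
import numpy as np
from scipy.stats import friedmanchisquare

# Hypothetical per-question mean correctness scores (1-7 Likert, already
# averaged over the 5 raters) for the 12 questions; illustrative only.
gpt35  = np.array([6.0, 5.8, 5.6, 6.2, 5.4, 5.8, 6.0, 5.6, 5.8, 5.4, 6.0, 5.4])
gpt4o  = np.array([5.6, 5.2, 5.0, 5.8, 5.0, 5.4, 5.6, 5.2, 5.4, 5.0, 5.6, 5.0])
gemini = np.array([5.4, 5.0, 4.8, 5.6, 4.8, 5.2, 5.4, 5.0, 5.2, 4.8, 5.4, 4.8])

# Friedman test: nonparametric repeated-measures comparison of the three
# models, treating each question as a matched block.
stat, p = friedmanchisquare(gpt35, gpt4o, gemini)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```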

Conclusions: GPT-3.5 was the most effective model for delivering reliable and comprehensive patient information, particularly for complex treatment and surgical topics. GPT-4o provided reliable general information but lacked the necessary depth for specialized topics, while Gemini was suitable for addressing basic patient inquiries but insufficient for detailed medical information. This study highlights the role of artificial intelligence in patient education, suggesting that models like GPT-3.5 can be valuable tools for clinicians in enhancing patient understanding of TED.

Source journal: Ophthalmic Plastic and Reconstructive Surgery
CiteScore: 2.50
Self-citation rate: 10.00%
Articles published: 322
Review turnaround: 3-8 weeks
Journal description: Ophthalmic Plastic and Reconstructive Surgery features original articles and reviews on topics such as ptosis, eyelid reconstruction, orbital diagnosis and surgery, lacrimal problems, and eyelid malposition. Update reports on diagnostic techniques, surgical equipment and instrumentation, and medical therapies are included, as well as detailed analyses of recent research findings and their clinical applications.