Comparative assessment of artificial intelligence chatbots' performance in responding to healthcare professionals' and caregivers' questions about Dravet syndrome.

IF 2.8 · Medicine (CAS Tier 3) · Q2 CLINICAL NEUROLOGY
Epilepsia Open · Pub Date: 2025-04-01 · DOI: 10.1002/epi4.70022
Joana Jesus-Ribeiro, Eugenia Roza, Bárbara Oliveiros, Joana Barbosa Melo, Mar Carreño
{"title":"人工智能聊天机器人在回答医疗专业人员和护理人员关于德拉韦综合征的问题时的表现比较评估。","authors":"Joana Jesus-Ribeiro, Eugenia Roza, Bárbara Oliveiros, Joana Barbosa Melo, Mar Carreño","doi":"10.1002/epi4.70022","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>Artificial intelligence chatbots have been a game changer in healthcare, providing immediate, round-the-clock assistance. However, their accuracy across specific medical domains remains under-evaluated. Dravet syndrome remains one of the most challenging epileptic encephalopathies, with new data continuously emerging in the literature. This study aims to evaluate and compare the performance of ChatGPT 3.5 and Perplexity in responding to questions about Dravet Syndrome.</p><p><strong>Methods: </strong>We curated 96 questions about Dravet syndrome, 43 from healthcare professionals and 53 from caregivers. Two epileptologists independently graded the chatbots' responses, with a third senior epileptologist resolving any disagreements to reach a final consensus. Accuracy and completeness of correct answers were rated on predefined 3-point scales. Incorrect responses were prompted for self-correction and re-evaluated. Readability was assessed using Flesch reading ease and Flesch-Kincaid grade level.</p><p><strong>Results: </strong>Both chatbots had the majority of their responses rated as \"correct\" (ChatGPT 3.5: 66.7%, Perplexity: 81.3%), with no significant difference in performance between the two (χ<sup>2</sup> = 5.30, p = 0.071). ChatGPT 3.5 performed significantly better for caregivers than for healthcare professionals (χ<sup>2</sup> = 7.27, p = 0.026). The topic with the poorest performance was Dravet syndrome's treatment, particularly for healthcare professional questions. Both models exhibited exemplary completeness, with most responses rated as \"complete\" to \"comprehensive\" (ChatGPT 3.5: 73.4%, Perplexity: 75.7%). Substantial self-correction capabilities were observed: ChatGPT 3.5 improved 55.6% of responses and Perplexity 80%. The texts were generally very difficult to read, requiring an advanced reading level. However, Perplexity's responses were significantly more readable than ChatGPT 3.5's [Flesch reading ease: 29.0 (SD 13.9) vs. 24.1 (SD 15.0), p = 0.018].</p><p><strong>Significance: </strong>Our findings underscore the potential of AI chatbots in delivering accurate and complete responses to Dravet syndrome queries. However, they have limitations, particularly in complex areas like treatment. Continuous efforts to update information and improve readability are essential.</p><p><strong>Plain language summary: </strong>Artificial intelligence chatbots have the potential to improve access to medical information, including on conditions like Dravet syndrome, but the quality of this information is still unclear. In this study, ChatGPT 3.5 and Perplexity correctly answered most questions from healthcare professionals and caregivers, with ChatGPT 3.5 performing better for caregivers. Treatment-related questions had the most incorrect answers, particularly those from healthcare professionals. Both chatbots demonstrated the ability to correct previous incorrect responses, particularly Perplexity. Both chatbots produced text requiring advanced reading skills. 
Further improvements are needed to make the text easier to understand and address difficult medical topics.</p>","PeriodicalId":12038,"journal":{"name":"Epilepsia Open","volume":" ","pages":""},"PeriodicalIF":2.8000,"publicationDate":"2025-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Comparative assessment of artificial intelligence chatbots' performance in responding to healthcare professionals' and caregivers' questions about Dravet syndrome.\",\"authors\":\"Joana Jesus-Ribeiro, Eugenia Roza, Bárbara Oliveiros, Joana Barbosa Melo, Mar Carreño\",\"doi\":\"10.1002/epi4.70022\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objective: </strong>Artificial intelligence chatbots have been a game changer in healthcare, providing immediate, round-the-clock assistance. However, their accuracy across specific medical domains remains under-evaluated. Dravet syndrome remains one of the most challenging epileptic encephalopathies, with new data continuously emerging in the literature. This study aims to evaluate and compare the performance of ChatGPT 3.5 and Perplexity in responding to questions about Dravet Syndrome.</p><p><strong>Methods: </strong>We curated 96 questions about Dravet syndrome, 43 from healthcare professionals and 53 from caregivers. Two epileptologists independently graded the chatbots' responses, with a third senior epileptologist resolving any disagreements to reach a final consensus. Accuracy and completeness of correct answers were rated on predefined 3-point scales. Incorrect responses were prompted for self-correction and re-evaluated. Readability was assessed using Flesch reading ease and Flesch-Kincaid grade level.</p><p><strong>Results: </strong>Both chatbots had the majority of their responses rated as \\\"correct\\\" (ChatGPT 3.5: 66.7%, Perplexity: 81.3%), with no significant difference in performance between the two (χ<sup>2</sup> = 5.30, p = 0.071). ChatGPT 3.5 performed significantly better for caregivers than for healthcare professionals (χ<sup>2</sup> = 7.27, p = 0.026). The topic with the poorest performance was Dravet syndrome's treatment, particularly for healthcare professional questions. Both models exhibited exemplary completeness, with most responses rated as \\\"complete\\\" to \\\"comprehensive\\\" (ChatGPT 3.5: 73.4%, Perplexity: 75.7%). Substantial self-correction capabilities were observed: ChatGPT 3.5 improved 55.6% of responses and Perplexity 80%. The texts were generally very difficult to read, requiring an advanced reading level. However, Perplexity's responses were significantly more readable than ChatGPT 3.5's [Flesch reading ease: 29.0 (SD 13.9) vs. 24.1 (SD 15.0), p = 0.018].</p><p><strong>Significance: </strong>Our findings underscore the potential of AI chatbots in delivering accurate and complete responses to Dravet syndrome queries. However, they have limitations, particularly in complex areas like treatment. Continuous efforts to update information and improve readability are essential.</p><p><strong>Plain language summary: </strong>Artificial intelligence chatbots have the potential to improve access to medical information, including on conditions like Dravet syndrome, but the quality of this information is still unclear. In this study, ChatGPT 3.5 and Perplexity correctly answered most questions from healthcare professionals and caregivers, with ChatGPT 3.5 performing better for caregivers. 
Treatment-related questions had the most incorrect answers, particularly those from healthcare professionals. Both chatbots demonstrated the ability to correct previous incorrect responses, particularly Perplexity. Both chatbots produced text requiring advanced reading skills. Further improvements are needed to make the text easier to understand and address difficult medical topics.</p>\",\"PeriodicalId\":12038,\"journal\":{\"name\":\"Epilepsia Open\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.8000,\"publicationDate\":\"2025-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Epilepsia Open\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1002/epi4.70022\",\"RegionNum\":3,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"CLINICAL NEUROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Epilepsia Open","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1002/epi4.70022","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"CLINICAL NEUROLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Objective: Artificial intelligence chatbots have been a game changer in healthcare, providing immediate, round-the-clock assistance. However, their accuracy across specific medical domains remains under-evaluated. Dravet syndrome is one of the most challenging epileptic encephalopathies, with new data continuously emerging in the literature. This study aims to evaluate and compare the performance of ChatGPT 3.5 and Perplexity in responding to questions about Dravet syndrome.

Methods: We curated 96 questions about Dravet syndrome, 43 from healthcare professionals and 53 from caregivers. Two epileptologists independently graded the chatbots' responses, with a third senior epileptologist resolving any disagreements to reach a final consensus. Accuracy and completeness of correct answers were rated on predefined 3-point scales. For incorrect responses, the chatbots were prompted to self-correct and the revised answers were re-evaluated. Readability was assessed using the Flesch reading ease and Flesch-Kincaid grade level.
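
The abstract does not include the authors' analysis code, so the following is only a minimal Python sketch of the two readability formulas named above (Flesch reading ease and Flesch-Kincaid grade level), using a crude vowel-group syllable heuristic; dedicated tools such as textstat apply more careful syllable rules and will give somewhat different scores.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, dropping one for a silent final 'e'."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch reading ease, Flesch-Kincaid grade level) for a chatbot response."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / max(len(sentences), 1)
    syllables_per_word = syllables / max(len(words), 1)
    flesch_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fk_grade = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return flesch_ease, fk_grade

if __name__ == "__main__":
    sample = ("Dravet syndrome is a severe developmental and epileptic "
              "encephalopathy that usually begins in the first year of life.")
    ease, grade = readability(sample)
    print(f"Flesch reading ease: {ease:.1f}, Flesch-Kincaid grade: {grade:.1f}")
```

Higher Flesch reading ease means easier text; scores in the 24-29 range, as reported in the Results, fall in the "very difficult" band typically associated with university-level reading material.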

Results: Both chatbots had the majority of their responses rated as "correct" (ChatGPT 3.5: 66.7%, Perplexity: 81.3%), with no significant difference in performance between the two (χ2 = 5.30, p = 0.071). ChatGPT 3.5 performed significantly better for caregivers than for healthcare professionals (χ2 = 7.27, p = 0.026). The topic with the poorest performance was the treatment of Dravet syndrome, particularly for questions from healthcare professionals. Both models exhibited exemplary completeness, with most responses rated as "complete" to "comprehensive" (ChatGPT 3.5: 73.4%, Perplexity: 75.7%). Substantial self-correction capabilities were observed: ChatGPT 3.5 improved 55.6% of its re-prompted responses and Perplexity 80%. The texts were generally very difficult to read, requiring an advanced reading level. However, Perplexity's responses were significantly more readable than ChatGPT 3.5's [Flesch reading ease: 29.0 (SD 13.9) vs. 24.1 (SD 15.0), p = 0.018].
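
As a rough illustration of the kind of test behind the χ2 values above, the sketch below runs a standard contingency-table comparison with scipy. Only the "correct" counts are derived from the reported percentages (66.7% and 81.3% of 96 questions); the remaining cells are hypothetical placeholders, so the output will not reproduce the paper's χ2 = 5.30 exactly.

```python
from scipy.stats import chi2_contingency

# Rows: ChatGPT 3.5, Perplexity; columns: correct / partially correct / incorrect.
# The "correct" counts follow from 66.7% and 81.3% of 96 questions; the split of
# the remaining answers across the other two columns is invented for illustration.
counts = [
    [64, 20, 12],  # ChatGPT 3.5
    [78, 12,  6],  # Perplexity
]
chi2, p, dof, _expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.3f}")
```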

Significance: Our findings underscore the potential of AI chatbots in delivering accurate and complete responses to Dravet syndrome queries. However, they have limitations, particularly in complex areas like treatment. Continuous efforts to update information and improve readability are essential.

Plain language summary: Artificial intelligence chatbots have the potential to improve access to medical information, including information on conditions like Dravet syndrome, but the quality of that information is still unclear. In this study, ChatGPT 3.5 and Perplexity correctly answered most questions from healthcare professionals and caregivers, with ChatGPT 3.5 performing better for caregivers. Treatment-related questions had the most incorrect answers, particularly those posed by healthcare professionals. Both chatbots demonstrated the ability to correct previous incorrect responses, particularly Perplexity. Both chatbots produced text requiring advanced reading skills. Further improvements are needed to make the text easier to understand and to address difficult medical topics.

Source journal: Epilepsia Open (Medicine-Neurology, clinical)
CiteScore: 4.40 · Self-citation rate: 6.70% · Articles per year: 104 · Review time: 8 weeks