Assessing ChatGPT's Responses to Otolaryngology Patient Questions

Jonathan M. Carnino, William R. Pellegrini, Megan Willis, Michael B. Cohen, Marianella Paz-Lansberg, Elizabeth M. Davis, Gregory A. Grillone, Jessica R. Levi

Annals of Otology, Rhinology & Laryngology. Published April 27, 2024. DOI: 10.1177/00034894241249621
Abstract
Objective: This study evaluates ChatGPT's performance in addressing real-world otolaryngology patient questions, focusing on accuracy, comprehensiveness, and patient safety, to assess its suitability for integration into healthcare.

Methods: A cross-sectional study was conducted using patient questions from Reddit's public r/AskDocs forum, where users seek medical advice from healthcare professionals. Patient questions were input into ChatGPT (GPT-3.5), and the responses were reviewed by 5 board-certified otolaryngologists. Evaluation criteria included difficulty, accuracy, comprehensiveness, and bedside manner/empathy. Statistical analysis explored the relationship between patient question characteristics and ChatGPT response scores. Potentially dangerous responses were also identified.

Results: Patient questions averaged 224.93 words, while ChatGPT responses averaged 414.93 words. ChatGPT responses received mean scores of 3.76/5 for accuracy, 3.59/5 for comprehensiveness, and 4.28/5 for bedside manner/empathy. Longer patient questions did not correlate with higher response ratings; however, longer ChatGPT responses scored higher in bedside manner/empathy. Higher question difficulty correlated with lower comprehensiveness. Five responses were flagged as potentially dangerous.

Conclusion: While ChatGPT shows promise in addressing otolaryngology patient questions, this study demonstrates its limitations, particularly in accuracy and comprehensiveness. The identification of potentially dangerous responses underscores the need for a cautious approach to AI-generated medical advice. Responsible integration of AI into healthcare requires thorough assessment of model performance and ethical consideration of patient safety.
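The Methods and Results describe a straightforward pipeline: submit each patient question to GPT-3.5, collect reviewer ratings, and test whether response characteristics such as length track those ratings. Below is a minimal Python sketch of that kind of analysis. The `ask_gpt35` helper, the placeholder questions and scores, and the choice of Spearman rank correlation are all illustrative assumptions, not the authors' published code or statistical plan.

```python
# A minimal sketch of the study's pipeline, assuming the official openai
# (>=1.0) and scipy packages. The helper name, placeholder questions, the
# reviewer scores, and the use of Spearman rank correlation are illustrative
# assumptions; the paper does not publish its code or statistical details.
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_gpt35(question: str) -> str:
    """Submit one patient question to GPT-3.5 and return its reply."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content


# Hypothetical stand-ins: the real questions came from Reddit's r/AskDocs.
questions = [
    "My ear has felt blocked for two weeks and hearing is muffled. Worried?",
    "Recurring nosebleeds every morning for a month. What could cause this?",
    "Hoarse voice for six weeks after a cold. Should I see a specialist?",
]
responses = [ask_gpt35(q) for q in questions]

# Hypothetical mean ratings (1-5) across the 5 otolaryngologist reviewers.
empathy_scores = [4.2, 3.8, 4.6]

# The Results note that longer responses scored higher on empathy; a rank
# correlation between word count and rating is one way to test that.
word_counts = [len(r.split()) for r in responses]
rho, p = spearmanr(word_counts, empathy_scores)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```

A rank-based statistic suits ordinal 1-to-5 ratings better than Pearson correlation would, though the paper does not state which test the authors used.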