Assessing ChatGPT's Responses to Otolaryngology Patient Questions

Jonathan M. Carnino, William R. Pellegrini, Megan Willis, Michael B. Cohen, Marianella Paz-Lansberg, Elizabeth M. Davis, Gregory A. Grillone, Jessica R. Levi

Annals of Otology, Rhinology & Laryngology. Published April 27, 2024. DOI: 10.1177/00034894241249621
Abstract
Objective: This study evaluates ChatGPT's performance in addressing real-world otolaryngology patient questions, focusing on accuracy, comprehensiveness, and patient safety, to assess its suitability for integration into healthcare.

Methods: A cross-sectional study was conducted using patient questions from Reddit's public r/AskDocs forum, where users seek medical advice from healthcare professionals. Patient questions were input into ChatGPT (GPT-3.5), and the responses were reviewed by 5 board-certified otolaryngologists. Evaluation criteria included difficulty, accuracy, comprehensiveness, and bedside manner/empathy. Statistical analysis explored the relationship between patient question characteristics and ChatGPT response scores. Potentially dangerous responses were also identified.

Results: Patient questions averaged 224.93 words, while ChatGPT responses averaged 414.93 words. ChatGPT responses received mean scores of 3.76/5 for accuracy, 3.59/5 for comprehensiveness, and 4.28/5 for bedside manner/empathy. Longer patient questions did not correlate with higher response ratings; however, longer ChatGPT responses scored higher in bedside manner/empathy. Higher question difficulty correlated with lower comprehensiveness. Five responses were flagged as potentially dangerous.

Conclusion: While ChatGPT shows promise in addressing otolaryngology patient questions, this study demonstrates its limitations, particularly in accuracy and comprehensiveness. The identification of potentially dangerous responses underscores the need for a cautious approach to AI-generated medical advice. Responsible integration of AI into healthcare requires thorough assessment of model performance and ethical consideration of patient safety.
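The Methods and Results describe a straightforward pipeline: submit each patient question to GPT-3.5, collect reviewer ratings, and test whether response characteristics such as length track those ratings. Below is a minimal Python sketch of that kind of analysis. The `ask_gpt35` helper, the placeholder questions and scores, and the choice of Spearman rank correlation are all illustrative assumptions, not the authors' published code or statistical plan.

```python
# A minimal sketch of the study's pipeline, assuming the official openai
# (>=1.0) and scipy packages. The helper name, placeholder questions, the
# reviewer scores, and the use of Spearman rank correlation are illustrative
# assumptions; the paper does not publish its code or statistical details.
from openai import OpenAI
from scipy.stats import spearmanr

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_gpt35(question: str) -> str:
    """Submit one patient question to GPT-3.5 and return its reply."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content


# Hypothetical stand-ins: the real questions came from Reddit's r/AskDocs.
questions = [
    "My ear has felt blocked for two weeks and hearing is muffled. Worried?",
    "Recurring nosebleeds every morning for a month. What could cause this?",
    "Hoarse voice for six weeks after a cold. Should I see a specialist?",
]
responses = [ask_gpt35(q) for q in questions]

# Hypothetical mean ratings (1-5) across the 5 otolaryngologist reviewers.
empathy_scores = [4.2, 3.8, 4.6]

# The Results note that longer responses scored higher on empathy; a rank
# correlation between word count and rating is one way to test that.
word_counts = [len(r.split()) for r in responses]
rho, p = spearmanr(word_counts, empathy_scores)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")
```

A rank-based statistic suits ordinal 1-to-5 ratings better than Pearson correlation would, though the paper does not state which test the authors used.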