Taylor Kring, Soumil Prasad, Supriya Dadi, Eric Sokhn, Elizabeth Franzmann
{"title":"人工智能聊天机器人在头颈癌分诊中的质量和可读性比较","authors":"Taylor Kring , Soumil Prasad , Supriya Dadi , Eric Sokhn , Elizabeth Franzmann","doi":"10.1016/j.amjoto.2025.104710","DOIUrl":null,"url":null,"abstract":"<div><h3>Objective</h3><div>Head and neck cancers (HNCs) are a significant global health concern, contributing to substantial morbidity and mortality. AI-powered chatbots such as ChatGPT, Google Gemini, Microsoft Copilot, and Open Evidence are increasingly used by patients seeking health information. While these tools provide immediate access to medical content, concerns remain regarding their reliability, readability, and potential impact on patient outcomes.</div></div><div><h3>Methods</h3><div>Responses to 25 patient-like HNC symptom queries were assessed using four leading AI platforms: ChatGPT, Google Gemini, Microsoft Copilot, and Open Evidence. Responses were evaluated using modified DISCERN criteria for quality and SMOG scoring for readability, with ANOVA and post hoc analysis conducted afterward.</div></div><div><h3>Results</h3><div>Microsoft Copilot achieved the highest mean DISCERN score of 41.40 (95 % CI: 40.31 to 42.49) and the lowest mean SMOG reading levels of 12.56 (95 % CI: 11.82 to 13.31), outperforming ChatGPT, Google Gemini, and Open Evidence in overall quality and accessibility (p < .001). Open Evidence scored lowest in both quality averaging 30.52 (95 % CI: 27.52 to 33.52) and readability of 17.49 (95 % CI: 16.66 to 18.31), reflecting a graduate reading level.</div></div><div><h3>Conclusion</h3><div>Significant variability exists in the readability and quality of AI-generated responses to HNC-related queries, highlighting the need for platform-specific validation and oversight to ensure accurate, patient-centered communication.</div></div><div><h3>Level of evidence</h3><div>Our study is a cross-sectional analysis that evaluates chatbot responses using established grading tools. This aligns best with level 4 evidence.</div></div>","PeriodicalId":7591,"journal":{"name":"American Journal of Otolaryngology","volume":"46 5","pages":"Article 104710"},"PeriodicalIF":1.7000,"publicationDate":"2025-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A comparison of quality and readability of Artificial Intelligence chatbots in triage for head and neck cancer\",\"authors\":\"Taylor Kring , Soumil Prasad , Supriya Dadi , Eric Sokhn , Elizabeth Franzmann\",\"doi\":\"10.1016/j.amjoto.2025.104710\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Objective</h3><div>Head and neck cancers (HNCs) are a significant global health concern, contributing to substantial morbidity and mortality. AI-powered chatbots such as ChatGPT, Google Gemini, Microsoft Copilot, and Open Evidence are increasingly used by patients seeking health information. While these tools provide immediate access to medical content, concerns remain regarding their reliability, readability, and potential impact on patient outcomes.</div></div><div><h3>Methods</h3><div>Responses to 25 patient-like HNC symptom queries were assessed using four leading AI platforms: ChatGPT, Google Gemini, Microsoft Copilot, and Open Evidence. 
Responses were evaluated using modified DISCERN criteria for quality and SMOG scoring for readability, with ANOVA and post hoc analysis conducted afterward.</div></div><div><h3>Results</h3><div>Microsoft Copilot achieved the highest mean DISCERN score of 41.40 (95 % CI: 40.31 to 42.49) and the lowest mean SMOG reading levels of 12.56 (95 % CI: 11.82 to 13.31), outperforming ChatGPT, Google Gemini, and Open Evidence in overall quality and accessibility (p < .001). Open Evidence scored lowest in both quality averaging 30.52 (95 % CI: 27.52 to 33.52) and readability of 17.49 (95 % CI: 16.66 to 18.31), reflecting a graduate reading level.</div></div><div><h3>Conclusion</h3><div>Significant variability exists in the readability and quality of AI-generated responses to HNC-related queries, highlighting the need for platform-specific validation and oversight to ensure accurate, patient-centered communication.</div></div><div><h3>Level of evidence</h3><div>Our study is a cross-sectional analysis that evaluates chatbot responses using established grading tools. This aligns best with level 4 evidence.</div></div>\",\"PeriodicalId\":7591,\"journal\":{\"name\":\"American Journal of Otolaryngology\",\"volume\":\"46 5\",\"pages\":\"Article 104710\"},\"PeriodicalIF\":1.7000,\"publicationDate\":\"2025-07-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"American Journal of Otolaryngology\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0196070925001139\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"OTORHINOLARYNGOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"American Journal of Otolaryngology","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0196070925001139","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"OTORHINOLARYNGOLOGY","Score":null,"Total":0}
A comparison of quality and readability of Artificial Intelligence chatbots in triage for head and neck cancer
Objective
Head and neck cancers (HNCs) are a significant global health concern, contributing to substantial morbidity and mortality. AI-powered chatbots such as ChatGPT, Google Gemini, Microsoft Copilot, and Open Evidence are increasingly used by patients seeking health information. While these tools provide immediate access to medical content, concerns remain regarding their reliability, readability, and potential impact on patient outcomes.
Methods
Responses to 25 patient-like HNC symptom queries were assessed across four leading AI platforms: ChatGPT, Google Gemini, Microsoft Copilot, and Open Evidence. Responses were evaluated using modified DISCERN criteria for quality and SMOG scoring for readability; scores were then compared across platforms using ANOVA followed by post hoc analysis.
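For context, the SMOG index estimates the U.S. grade level needed to understand a text from its density of polysyllabic words. A minimal Python sketch of the published formula follows; the vowel-group syllable counter is a rough heuristic, and the abstract does not state which SMOG calculator the authors used.

import math
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups, dropping a trailing silent 'e'.
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def smog_grade(text: str) -> float:
    # SMOG grade = 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291

A score near 12 corresponds to a high-school senior reading level, while scores of 17 and above indicate graduate-level text.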
Results
Microsoft Copilot achieved the highest mean DISCERN score of 41.40 (95% CI: 40.31 to 42.49) and the lowest mean SMOG reading level of 12.56 (95% CI: 11.82 to 13.31), outperforming ChatGPT, Google Gemini, and Open Evidence in overall quality and accessibility (p < .001). Open Evidence performed worst on both measures, with the lowest mean DISCERN score of 30.52 (95% CI: 27.52 to 33.52) and the highest mean SMOG level of 17.49 (95% CI: 16.66 to 18.31), reflecting a graduate reading level.
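The group comparison behind these p-values can be reproduced with a one-way ANOVA followed by a post hoc test. The sketch below uses Tukey's HSD as one common choice (the abstract does not name the specific post hoc test), and the scores are arbitrary placeholders, not the study's per-response data.

import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(seed=1)
# Arbitrary placeholder DISCERN scores (25 responses per platform);
# the study's per-response data are not reproduced here.
platforms = {
    "ChatGPT": rng.normal(38, 3, 25),
    "Google Gemini": rng.normal(36, 3, 25),
    "Microsoft Copilot": rng.normal(41, 3, 25),
    "Open Evidence": rng.normal(31, 3, 25),
}

f_stat, p_value = stats.f_oneway(*platforms.values())
print(f"One-way ANOVA: F = {f_stat:.2f}, p = {p_value:.4g}")

# Post hoc pairwise comparisons between platforms.
scores = np.concatenate(list(platforms.values()))
labels = np.repeat(list(platforms.keys()), 25)
print(pairwise_tukeyhsd(scores, labels))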
Conclusion
Significant variability exists in the readability and quality of AI-generated responses to HNC-related queries, highlighting the need for platform-specific validation and oversight to ensure accurate, patient-centered communication.
Level of evidence
Our study is a cross-sectional analysis that evaluates chatbot responses using established grading tools. This aligns best with level 4 evidence.
Journal introduction:
Be fully informed about developments in otology, neurotology, audiology, rhinology, allergy, laryngology, speech science, bronchoesophagology, facial plastic surgery, and head and neck surgery. Featured sections include original contributions, grand rounds, current reviews, case reports and socioeconomics.