Sholem Hack, Shibli Alsleibi, Naseem Saleh, Eran E Alon, Naomi Rabinovics, Eric Remer
European Archives of Oto-Rhino-Laryngology, pp. 4273-4282. Published 2025-08-01 (epub 2025-04-30). DOI: 10.1007/s00405-025-09433-6
Are chatbots a reliable source for patient frequently asked questions on neck masses?
Purpose: To evaluate the reliability and accuracy of Large Language Models in answering patient Frequently Asked Questions about adult neck masses.
Methods: Twenty-four questions from the American Academy of Otolaryngology-Head and Neck Surgery were presented to ChatGPT, Claude, and Gemini. Five independent otolaryngologists evaluated responses using six criteria: accuracy, extensiveness, misleading information, resource quality, guideline citations, and overall reliability. Statistical analysis used Fisher's exact tests and Fleiss' kappa.
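Fleiss' kappa, named in the Methods, measures agreement among a fixed number of raters who each assign every item to one of several categories, corrected for the agreement expected by chance. The sketch below implements the standard formula in plain Python. It is not the authors' analysis code: the input layout (one row per rated item, one column per category, each cell holding the number of raters who chose that category) and the example table are assumptions for illustration, not the study's data.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for N items, each rated by the same n raters into k categories.

    counts[i][j] = number of raters who assigned item i to category j
    (an assumed input layout for this sketch).
    """
    N = len(counts)
    if N == 0:
        raise ValueError("need at least one item")
    n = sum(counts[0])
    if n < 2 or any(sum(row) != n for row in counts):
        raise ValueError("every item must be rated by the same number (>= 2) of raters")
    k = len(counts[0])

    # p_j: share of all N * n assignments that went to category j
    p = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    # P_i: fraction of rater pairs that agree on item i
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]

    P_bar = sum(P) / N               # mean observed agreement
    P_e = sum(pj * pj for pj in p)   # agreement expected by chance
    if P_e == 1:
        raise ValueError("kappa is undefined when every rating falls in one category")
    return (P_bar - P_e) / (1 - P_e)


# Illustrative data only: 10 items, 14 raters, 5 categories
example_counts = [
    [0, 0, 0, 0, 14],
    [0, 2, 6, 4, 2],
    [0, 0, 3, 5, 6],
    [0, 3, 9, 2, 0],
    [2, 2, 8, 1, 1],
    [7, 7, 0, 0, 0],
    [3, 2, 6, 3, 0],
    [2, 5, 3, 2, 2],
    [6, 5, 2, 1, 0],
    [0, 2, 2, 3, 7],
]
print(round(fleiss_kappa(example_counts), 3))  # → 0.21
```

Under the commonly used Landis and Koch benchmarks, values above 0.80 are described as "almost perfect" agreement, which is the range of the κ = 0.95 reported below.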
Results: All models showed high reliability (91.7–100%). Paid GPT and Gemini achieved the highest accuracy (95.8%). Extensiveness varied significantly (p = 0.012), with Gemini scoring lowest (62.5%). Resource quality ranged from 58.3% (Claude) to 100% (Paid GPT). Guideline citations were most frequent for the GPT models (50%) and least frequent for Gemini (16.7%). Misleading information was rare (0–16.7%). Inter-rater reliability was near-perfect across the five reviewers (κ = 0.95).
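Every percentage in the Results is consistent with a whole number of the 24 questions, so it can be converted back to a count. The sketch below does that arithmetic. It assumes, which the abstract does not state explicitly, that each percentage is a share of all 24 questions rather than of individual ratings; the function name `implied_count` is hypothetical.

```python
N_QUESTIONS = 24  # from the Methods: twenty-four AAO-HNS questions

# Percentages quoted in the Results (assumed to be shares of all 24 questions)
reported = {
    "reliability, lowest model": 91.7,
    "accuracy, Paid GPT / Gemini": 95.8,
    "extensiveness, Gemini": 62.5,
    "resource quality, Claude": 58.3,
    "guideline citations, GPT models": 50.0,
    "guideline citations, Gemini": 16.7,
    "misleading information, highest model": 16.7,
}


def implied_count(percent: float, total: int = N_QUESTIONS) -> int:
    """Hypothetical helper: whole number of questions matching a percentage rounded to 0.1."""
    count = round(percent / 100 * total)
    # Recomputing the percentage must reproduce the reported value to one decimal place
    if round(100 * count / total, 1) != percent:
        raise ValueError(f"{percent}% is not a whole-number share of {total}")
    return count


for label, pct in reported.items():
    print(f"{label}: {pct}% = {implied_count(pct)}/{N_QUESTIONS}")
```

For example, 95.8% corresponds to 23 of 24 questions and 62.5% to 15 of 24.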
Conclusion: Large Language Models demonstrate high reliability and accuracy for neck mass patient education, with paid versions showing marginally better performance. Although they are promising educational tools, their variable guideline adherence and occasional misinformation suggest they should complement rather than replace professional medical advice.
Journal description:
Official Journal of the European Union of Medical Specialists – ORL Section and Board
Official Journal of the Confederation of European Oto-Rhino-Laryngology Head and Neck Surgery
"European Archives of Oto-Rhino-Laryngology" publishes original clinical reports and clinically relevant experimental studies, as well as short communications presenting new results of special interest. With peer review by a respected international editorial board and prompt English-language publication, the journal provides rapid dissemination of information by authors from around the world. This particular feature makes it the journal of choice for readers who want to be informed about the continuing state of the art concerning basic sciences and the diagnosis and management of diseases of the head and neck on an international level.
European Archives of Oto-Rhino-Laryngology was founded in 1864 as "Archiv für Ohrenheilkunde" by A. von Tröltsch, A. Politzer and H. Schwartze.