To trust or not to trust: evaluating the reliability and safety of AI responses to laryngeal cancer queries.

IF 1.9 3区医学 Q2 OTORHINOLARYNGOLOGY

European Archives of Oto-Rhino-Laryngology Pub Date : 2024-11-01 Epub Date: 2024-04-23 DOI:10.1007/s00405-024-08643-8

Magdalena Ostrowska, Paulina Kacała, Deborah Onolememen, Katie Vaughan-Lane, Anitta Sisily Joseph, Adam Ostrowski, Wioletta Pietruszewska, Jacek Banaszewski, Maciej J Wróbel

{"title":"To trust or not to trust: evaluating the reliability and safety of AI responses to laryngeal cancer queries.","authors":"Magdalena Ostrowska, Paulina Kacała, Deborah Onolememen, Katie Vaughan-Lane, Anitta Sisily Joseph, Adam Ostrowski, Wioletta Pietruszewska, Jacek Banaszewski, Maciej J Wróbel","doi":"10.1007/s00405-024-08643-8","DOIUrl":null,"url":null,"abstract":"Purpose: As online health information-seeking surges, concerns mount over the quality and safety of accessible content, potentially leading to patient harm through misinformation. On one hand, the emergence of Artificial Intelligence (AI) in healthcare could prevent it; on the other hand, questions raise regarding the quality and safety of the medical information provided. As laryngeal cancer is a prevalent head and neck malignancy, this study aims to evaluate the utility and safety of three large language models (LLMs) as sources of patient information about laryngeal cancer.Methods: A cross-sectional study was conducted using three LLMs (ChatGPT 3.5, ChatGPT 4.0, and Bard). A questionnaire comprising 36 inquiries about laryngeal cancer was categorised into diagnosis (11 questions), treatment (9 questions), novelties and upcoming treatments (4 questions), controversies (8 questions), and sources of information (4 questions). The population of reviewers consisted of 3 groups, including ENT specialists, junior physicians, and non-medicals, who graded the responses. Each physician evaluated each question twice for each model, while non-medicals only once. Everyone was blinded to the model type, and the question order was shuffled. Outcome evaluations were based on a safety score (1-3) and a Global Quality Score (GQS, 1-5). Results were compared between LLMs. The study included iterative assessments and statistical validations.Results: Analysis revealed that ChatGPT 3.5 scored highest in both safety (mean: 2.70) and GQS (mean: 3.95). ChatGPT 4.0 and Bard had lower safety scores of 2.56 and 2.42, respectively, with corresponding quality scores of 3.65 and 3.38. Inter-rater reliability was consistent, with less than 3% discrepancy. About 4.2% of responses fell into the lowest safety category (1), particularly in the novelty category. Non-medical reviewers' quality assessments correlated moderately (r = 0.67) with response length.Conclusions: LLMs can be valuable resources for patients seeking information on laryngeal cancer. ChatGPT 3.5 provided the most reliable and safe responses among the models evaluated.","PeriodicalId":11952,"journal":{"name":"European Archives of Oto-Rhino-Laryngology","volume":null,"pages":null},"PeriodicalIF":1.9000,"publicationDate":"2024-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11512842/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Archives of Oto-Rhino-Laryngology","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s00405-024-08643-8","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/4/23 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"OTORHINOLARYNGOLOGY","Score":null,"Total":0}

引用次数: 0

Abstract

Purpose: As online health information-seeking surges, concerns mount over the quality and safety of accessible content, potentially leading to patient harm through misinformation. On one hand, the emergence of Artificial Intelligence (AI) in healthcare could prevent it; on the other hand, questions raise regarding the quality and safety of the medical information provided. As laryngeal cancer is a prevalent head and neck malignancy, this study aims to evaluate the utility and safety of three large language models (LLMs) as sources of patient information about laryngeal cancer.

Methods: A cross-sectional study was conducted using three LLMs (ChatGPT 3.5, ChatGPT 4.0, and Bard). A questionnaire comprising 36 inquiries about laryngeal cancer was categorised into diagnosis (11 questions), treatment (9 questions), novelties and upcoming treatments (4 questions), controversies (8 questions), and sources of information (4 questions). The population of reviewers consisted of 3 groups, including ENT specialists, junior physicians, and non-medicals, who graded the responses. Each physician evaluated each question twice for each model, while non-medicals only once. Everyone was blinded to the model type, and the question order was shuffled. Outcome evaluations were based on a safety score (1-3) and a Global Quality Score (GQS, 1-5). Results were compared between LLMs. The study included iterative assessments and statistical validations.

Results: Analysis revealed that ChatGPT 3.5 scored highest in both safety (mean: 2.70) and GQS (mean: 3.95). ChatGPT 4.0 and Bard had lower safety scores of 2.56 and 2.42, respectively, with corresponding quality scores of 3.65 and 3.38. Inter-rater reliability was consistent, with less than 3% discrepancy. About 4.2% of responses fell into the lowest safety category (1), particularly in the novelty category. Non-medical reviewers' quality assessments correlated moderately (r = 0.67) with response length.

Conclusions: LLMs can be valuable resources for patients seeking information on laryngeal cancer. ChatGPT 3.5 provided the most reliable and safe responses among the models evaluated.

查看原文本刊更多论文

相信还是不相信：评估人工智能对喉癌询问的回答的可靠性和安全性。

目的：随着在线健康信息搜索量的激增，人们对可获取内容的质量和安全性越来越担忧，因为错误信息可能会对患者造成伤害。一方面，人工智能（AI）在医疗保健领域的出现可以防止这种情况的发生；另一方面，人们对所提供的医疗信息的质量和安全性提出了质疑。喉癌是一种常见的头颈部恶性肿瘤，本研究旨在评估三种大型语言模型（LLM）作为喉癌患者信息来源的实用性和安全性：使用三种大型语言模型（ChatGPT 3.5、ChatGPT 4.0 和 Bard）进行了一项横断面研究。调查问卷由 36 个有关喉癌的问题组成，分为诊断（11 个问题）、治疗（9 个问题）、新治疗方法和即将采用的治疗方法（4 个问题）、争议（8 个问题）和信息来源（4 个问题）。审阅者包括耳鼻喉科专家、初级医师和非医师等三组，他们对回答进行评分。每位医生对每个模型的每个问题评估两次，而非医生只评估一次。每个人对机型类型都是盲评，问题顺序也是随机的。结果评估基于安全性评分（1-3 分）和总体质量评分（GQS，1-5 分）。结果在 LLM 之间进行比较。研究包括迭代评估和统计验证：分析表明，ChatGPT 3.5 在安全性（平均值：2.70）和 GQS（平均值：3.95）方面得分最高。ChatGPT 4.0 和 Bard 的安全性得分较低，分别为 2.56 和 2.42，相应的质量得分分别为 3.65 和 3.38。评分者之间的可靠性保持一致，差异不到 3%。约有 4.2% 的回复属于安全性最低的类别 (1)，尤其是在新颖性类别中。非医学审稿人的质量评估与回复长度呈中度相关（r = 0.67）：结论：对于寻求喉癌相关信息的患者来说，LLM 是非常有价值的资源。ChatGPT 3.5 提供的回复是所有评估模型中最可靠、最安全的。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

European Archives of Oto-Rhino-Laryngology 医学-耳鼻喉科学

CiteScore

5.30

自引率

7.70%

发文量

537

审稿时长

2-4 weeks

期刊介绍： Official Journal of European Union of Medical Specialists – ORL Section and Board Official Journal of Confederation of European Oto-Rhino-Laryngology Head and Neck Surgery "European Archives of Oto-Rhino-Laryngology" publishes original clinical reports and clinically relevant experimental studies, as well as short communications presenting new results of special interest. With peer review by a respected international editorial board and prompt English-language publication, the journal provides rapid dissemination of information by authors from around the world. This particular feature makes it the journal of choice for readers who want to be informed about the continuing state of the art concerning basic sciences and the diagnosis and management of diseases of the head and neck on an international level. European Archives of Oto-Rhino-Laryngology was founded in 1864 as "Archiv für Ohrenheilkunde" by A. von Tröltsch, A. Politzer and H. Schwartze.