How accurate are ChatGPT-4 responses in chronic urticaria? A critical analysis with information quality metrics

IF 4.3 | Medicine (CAS Zone 2) | JCR Q2, Allergy
Ivan Cherrez-Ojeda Ph.D.(c), Marco Faytong-Haro Ph.D.(c), Patricio Alvarez-Muñoz Ph.D., José Ignacio Larco MD, Erika de Arruda Chaves MD, Isabel Rojo MD, Carol Vivian Moncayo MD, German D. Ramon MD, Gabriela Rodas-Valero MD, Emek Kocatürk MD, Giselle S. Mosnaim MD, Karla Robles-Velasco MD
DOI: 10.1016/j.waojou.2025.101071
Journal: World Allergy Organization Journal, Vol. 18, No. 7, Article 101071
Published: 2025-06-14 (Journal Article)
Full text: https://www.sciencedirect.com/science/article/pii/S1939455125000481
Citations: 0

Abstract

Background

The increasing use of artificial intelligence (AI) in healthcare, especially in delivering medical information, prompts concerns over the reliability and accuracy of AI-generated responses. This study evaluates the quality, reliability, and readability of ChatGPT-4 responses for chronic urticaria (CU) care, considering the potential implications of inaccurate medical information.

Objective

The goal of the study was to assess the quality, reliability, and readability of ChatGPT-4 responses to inquiries on CU management in accordance with international guidelines, utilizing validated metrics to evaluate the effectiveness of ChatGPT-4 as a resource for medical information acquisition.

Methods

Twenty-four questions were derived from the EAACI/GA2LEN/EuroGuiDerm/APAAACI recommendations and used as prompts for ChatGPT-4, with each question submitted in a separate chat. The inquiries were categorized into 3 groups: (A) Classification and Diagnosis, (B) Assessment and Monitoring, and (C) Treatment and Management Recommendations. The responses were independently evaluated by allergy specialists using the DISCERN instrument for quality assessment, the Journal of the American Medical Association (JAMA) benchmark criteria for reliability evaluation, and Flesch scores for readability analysis. The scores were further examined with median calculations and Intraclass Correlation Coefficient assessments.
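Of the three metrics above, only the Flesch Reading Ease score is a closed-form formula (DISCERN and the JAMA benchmarks are rater-scored checklists). As a rough illustration of how such a readability score is computed, here is a minimal sketch in Python; this is not the authors' actual tooling, and the vowel-group syllable counter is a naive heuristic of my own rather than a dictionary-based count:

```python
import re

def count_syllables(word: str) -> int:
    """Naive syllable estimate: count vowel groups, dropping a silent final 'e'."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # treat a trailing 'e' as silent
    return max(n, 1)

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Higher is easier; scores below roughly 30 fall in the
    'very difficult / confusing' band (college-graduate level).
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))
```

Short, monosyllabic sentences score above 90, while dense clinical prose of the kind ChatGPT-4 tends to produce drops well below 30, which is consistent with the "confusing" readability levels reported in the Results.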

Results

Categories A and C exhibited insufficient reliability according to JAMA, with median scores of 1 and 0, respectively. Category B exhibited a low reliability score (median 2, interquartile range 2). The information quality from category C questions was satisfactory (median 51.5, IQR 12.5). All 3 groups exhibited confusing readability levels according to the Flesch assessment.

Limitations

The study's limitations encompass the emphasis on CU, possible bias in question selection, the use of particular instruments such as DISCERN, JAMA, and Flesch, as well as reliance on expert opinion for assessment.

Conclusion

ChatGPT-4 demonstrates potential for producing medical content; nonetheless, its reliability is inconsistent, underscoring the need for caution and verification when using AI-generated medical information, especially in the management of CU.
Source Journal: World Allergy Organization Journal
Subject area: Immunology and Microbiology (Immunology)
CiteScore: 9.10
Self-citation rate: 5.90%
Articles per year: 91
Review time: 9 weeks
About the journal: The official publication of the World Allergy Organization, the World Allergy Organization Journal (WAOjournal) publishes original mechanistic, translational, and clinical research on the topics of allergy, asthma, anaphylaxis, and clinical immunology, as well as reviews, guidelines, and position papers that contribute to the improvement of patient care. WAOjournal publishes research on the growth of allergy prevalence within the scope of single countries, country comparisons, and practical global issues and regulations, or threats to the allergy specialty. The Journal invites submissions from all authors interested in publishing on current global problems in allergy, asthma, anaphylaxis, and immunology. Of particular interest are the immunological consequences of climate change and the subsequent systematic transformations in food habits and their consequences for the allergy/immunology discipline.