How accurate are ChatGPT-4 responses in chronic urticaria? A critical analysis with information quality metrics

Ivan Cherrez-Ojeda Ph.D.(c), Marco Faytong-Haro Ph.D.(c), Patricio Alvarez-Muñoz Ph.D., José Ignacio Larco MD, Erika de Arruda Chaves MD, Isabel Rojo MD, Carol Vivian Moncayo MD, German D. Ramon MD, Gabriela Rodas-Valero MD, Emek Kocatürk MD, Giselle S. Mosnaim MD, Karla Robles-Velasco MD

World Allergy Organization Journal, 18(7), Article 101071, published 2025-06-14. DOI: 10.1016/j.waojou.2025.101071
https://www.sciencedirect.com/science/article/pii/S1939455125000481
Abstract
Background
The growing use of artificial intelligence (AI) in healthcare, particularly for delivering medical information, raises concerns about the reliability and accuracy of AI-generated responses. This study evaluates the quality, reliability, and readability of ChatGPT-4 responses on chronic urticaria (CU) care, given the potential consequences of inaccurate medical information.
Objective
The study aimed to assess the quality, reliability, and readability of ChatGPT-4 responses to questions on CU management, benchmarked against international guidelines and scored with validated instruments, in order to gauge the usefulness of ChatGPT-4 as a source of medical information.
Methods
Twenty-four questions were derived from the EAACI/GA²LEN/EuroGuiDerm/APAAACI recommendations and submitted to ChatGPT-4 as prompts, with each question posed in a separate chat. The questions were grouped into 3 categories: A) Classification and Diagnosis, B) Assessment and Monitoring, and C) Treatment and Management Recommendations. Allergy specialists independently rated the responses using the DISCERN instrument for quality, the Journal of the American Medical Association (JAMA) benchmark criteria for reliability, and Flesch Reading Ease scores for readability. Scores were summarized as medians, and inter-rater agreement was assessed with intraclass correlation coefficients.
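The readability step of this pipeline is straightforward to reproduce. The sketch below is not the authors' code, and the sample response text is invented for illustration; it applies the standard Flesch Reading Ease formula with a naive vowel-group syllable counter, so the scores it produces are approximate (dictionary-based tools such as the textstat package are more careful).

```python
import re

def count_syllables(word: str) -> int:
    # Very rough syllable count: contiguous vowel groups.
    # For illustration only; real readability tools use
    # dictionaries and finer rules.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    # Standard formula:
    # 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / max(1, len(sentences)))
            - 84.6 * (syllables / max(1, len(words))))

# Hypothetical ChatGPT-4-style response snippet (not from the study).
response = ("Chronic spontaneous urticaria is characterized by "
            "recurrent wheals, angioedema, or both, persisting for "
            "more than six weeks without an identifiable trigger.")
print(f"Flesch Reading Ease: {flesch_reading_ease(response):.1f}")
```

Lower scores indicate harder text: roughly 60–70 corresponds to plain English, while guideline-style clinical prose typically lands far lower on the scale.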
Results
Categories A and C showed insufficient reliability on the JAMA benchmark, with median scores of 1 and 0, respectively; category B also scored low (median 2, interquartile range [IQR] 2). DISCERN information quality for category C questions was satisfactory (median 51.5, IQR 12.5). All 3 categories fell into the "confusing" readability band of the Flesch assessment.
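For readers unfamiliar with the scale, the conventional Flesch interpretation bands explain the "confusing" verdict. A minimal helper with coarsened bands and my own labels, not part of the study:

```python
def flesch_band(score: float) -> str:
    # Conventional Flesch Reading Ease bands, coarsened here
    # into four levels for illustration.
    if score >= 90.0:
        return "very easy (about 5th grade)"
    if score >= 60.0:
        return "plain English (8th-9th grade)"
    if score >= 30.0:
        return "difficult (college level)"
    return "very confusing (college graduate level)"

for s in (75.0, 45.0, 20.0):
    print(s, "->", flesch_band(s))
```

A "confusing" rating therefore implies scores toward the bottom of the scale, i.e., text pitched at college-graduate readers rather than at patients.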
Limitations
The study's limitations include its focus on CU, possible bias in question selection, dependence on particular instruments (DISCERN, the JAMA benchmark, and Flesch scoring), and reliance on expert opinion for the assessments.
Conclusion
ChatGPT-4 shows promise for producing medical content; nonetheless, its reliability is limited, underscoring the need for caution and independent verification when using AI-generated medical information, especially in the management of CU.
About the Journal
The official publication of the World Allergy Organization, the World Allergy Organization Journal (WAOjournal) publishes original mechanistic, translational, and clinical research on allergy, asthma, anaphylaxis, and clinical immunology, as well as reviews, guidelines, and position papers that contribute to the improvement of patient care. WAOjournal publishes research on the growth of allergy prevalence within single countries, country comparisons, and practical global issues and regulations or threats to the allergy specialty. The Journal invites submissions from all authors interested in publishing on current global problems in allergy, asthma, anaphylaxis, and immunology. Of particular interest are the immunological consequences of climate change and the subsequent systematic transformations in food habits and their consequences for the allergy/immunology discipline.