Fatih Dolu, Oğuzhan Fatih Ay, Aydın Hakan Kupeli, Enes Karademir, Muhammed Huseyin Büyükavcı
Title: Evaluation of ChatGPT-4 as an Online Outpatient Assistant in Puerperal Mastitis Management: Content Analysis of an Observational Study.
Journal: JMIR Medical Informatics, volume 13, e68980 (Journal Article). Published 2025-07-24.
DOI: 10.2196/68980 (https://doi.org/10.2196/68980)
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12288767/pdf/
Impact factor: 3.8. JCR: Q2 (Medical Informatics).
Citation count: 0
Abstract
Background: The integration of artificial intelligence (AI) into clinical workflows holds promise for enhancing outpatient decision-making and patient education. ChatGPT, a large language model developed by OpenAI, has gained attention for its potential to support both clinicians and patients. However, its performance in the outpatient setting of general surgery remains underexplored.
Objective: This study aimed to evaluate whether ChatGPT-4 can function as a virtual outpatient assistant in the management of puerperal mastitis by assessing the accuracy, clarity, and clinical safety of its responses to frequently asked patient questions in Turkish.
Methods: Fifteen questions about puerperal mastitis were sourced from public health care websites and online forums. These questions were categorized into general information (n=2), symptoms and diagnosis (n=6), treatment (n=2), and prognosis (n=5). Each question was entered into ChatGPT-4 (September 3, 2024), and a single Turkish-language response was obtained. The responses were evaluated by a panel consisting of 3 board-certified general surgeons and 2 general surgery residents, using five criteria: sufficient length, patient-understandable language, accuracy, adherence to current guidelines, and patient safety. Quantitative metrics included the DISCERN score, the Flesch-Kincaid readability score, and interrater reliability assessed using the intraclass correlation coefficient (ICC).
Results: A total of 15 questions were evaluated. ChatGPT's responses were rated as "excellent" overall by the evaluators, with higher scores observed for treatment- and prognosis-related questions. A statistically significant difference was found in DISCERN scores across question types (P=.01), with treatment and prognosis questions receiving higher ratings. In contrast, no significant differences were detected in evaluator-based ratings (sufficient length, understandability, accuracy, guideline compliance, and patient safety), JAMA benchmark scores, or Flesch-Kincaid readability levels (P>.05 for all). Interrater agreement was good across all evaluation parameters (ICC=0.772); however, agreement varied when assessed by individual criteria. Correlation analyses revealed no significant overall associations between subjective ratings and objective quality measures, although a strong positive correlation between literature compliance and patient safety was identified for one question (r=0.968, P<.001).
Conclusions: ChatGPT demonstrated adequate capability in providing information on puerperal mastitis, particularly for treatment and prognosis. However, evaluator variability and the subjective nature of assessments highlight the need for further optimization of AI tools. Future research should emphasize iterative questioning and dynamic updates to AI knowledge bases to enhance reliability and accessibility.
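Illustrative note on the ICC: the abstract reports interrater agreement as an intraclass correlation coefficient but does not state which ICC form was used. The sketch below assumes a two-way random-effects model with absolute agreement, commonly written ICC(2,1) for a single rater and ICC(2,k) for the mean of k raters. The function name `icc_two_way_random` and the example table are hypothetical; the numbers are a small textbook-style data set (6 items, 4 raters) widely reproduced from Shrout and Fleiss (1979), not data from this study.

```python
from typing import List, Tuple


def icc_two_way_random(ratings: List[List[float]]) -> Tuple[float, float]:
    """Return (ICC(2,1), ICC(2,k)) for an n-items x k-raters table.

    Assumes a two-way random-effects model with absolute agreement;
    the study's actual ICC form is not stated in the abstract.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least 2 rated items")
    k = len(ratings[0])
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete table with at least 2 raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Single-rater and average-of-k-raters absolute-agreement coefficients.
    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


# Hypothetical example table: rows are rated items, columns are raters.
example = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
single, average = icc_two_way_random(example)
print(f"{single:.3f} {average:.3f}")  # prints "0.290 0.620"
```

In the study's setting the table would have one row per ChatGPT response and one column per evaluator (15 x 5). The single-rater and averaged coefficients can differ substantially, as in the example, so the form of ICC matters when interpreting a value such as 0.772.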
About the journal:
JMIR Medical Informatics (JMI, ISSN 2291-9694) is a top-rated, tier A journal that focuses on clinical informatics, big data in health and health care, decision support for health professionals, electronic health records, and eHealth infrastructures and implementation. It emphasizes applied, translational research and has a broad readership that includes clinicians, CIOs, engineers, and industry and health informatics professionals.
Published by JMIR Publications, publisher of the Journal of Medical Internet Research (JMIR), the leading eHealth/mHealth journal (Impact Factor 2016: 5.175), JMIR Med Inform has a slightly different scope: it places more emphasis on applications for clinicians and health professionals than on consumers/citizens, who are the focus of JMIR. It also publishes even faster and accepts papers that are more technical or more formative than those published in the Journal of Medical Internet Research.