Artificial intelligence-generated responses to frequently asked questions on coccydynia: Evaluating the accuracy and consistency of GPT-4o's performance.
{"title":"Artificial intelligence-generated responses to frequently asked questions on coccydynia: Evaluating the accuracy and consistency of GPT-4o's performance.","authors":"Aslinur Keles, Ozge Gulsum Illeez, Berkay Erbagci, Esra Giray","doi":"10.46497/ArchRheumatol.2025.10966","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>This study aimed to assess whether GPT-4o's responses to patient-centered frequently asked questions about coccydynia are accurate and consistent when asked at different times and from different accounts.</p><p><strong>Materials and methods: </strong>Questions were collected from medical websites, forums, and patient support groups and posed to GPT-4o. The responses were evaluated by two physiatrists for accuracy and consistency. Responses were categorized: <i>(i)</i> correct and comprehensive, <i>(ii)</i> correct but not inadequate, <i>(iii)</i> partially correct and partially incorrect, and <i>(iv)</i> completely incorrect. Inconsistencies in scoring were resolved by an additional reviewer as needed. Statistical analysis, including Cohen's kappa for interreviewer reliability, was performed.</p><p><strong>Results: </strong>Of the 81 responses, 45.7% were rated as correct and comprehensive, while 49.4% were correct but incomplete. Only 4.9% of the responses contained partially incorrect information, and no responses were completely incorrect. The interreviewer agreement was substantial (kappa=0.67), but 75% of the responses differed between the two rounds. Notably, 34.9% of initially incomplete answers improved in the second round.</p><p><strong>Conclusion: </strong>GPT-4o shows promise in providing accurate and generally reliable information about coccydynia. However, the variability observed in response consistency across repeated queries suggests that while the model is useful for patient education and general inquiries, it may not be suitable for providing specialized clinical knowledge without human oversight.</p>","PeriodicalId":93884,"journal":{"name":"Archives of rheumatology","volume":"40 1","pages":"63-71"},"PeriodicalIF":1.1000,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12010271/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Archives of rheumatology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.46497/ArchRheumatol.2025.10966","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/3/1 0:00:00","PubModel":"eCollection","JCR":"Q4","JCRName":"RHEUMATOLOGY","Score":null,"Total":0}
Abstract
Objectives: This study aimed to assess whether GPT-4o's responses to patient-centered frequently asked questions about coccydynia are accurate and consistent when asked at different times and from different accounts.
Materials and methods: Questions were collected from medical websites, forums, and patient support groups and posed to GPT-4o. The responses were evaluated by two physiatrists for accuracy and consistency and categorized as: (i) correct and comprehensive, (ii) correct but inadequate, (iii) partially correct and partially incorrect, and (iv) completely incorrect. Inconsistencies in scoring were resolved by an additional reviewer as needed. Statistical analysis, including Cohen's kappa for interreviewer reliability, was performed.
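For readers unfamiliar with the agreement statistic used here, the sketch below shows how Cohen's kappa is computed for two reviewers rating the same set of responses on the study's four-point scale. This is a minimal illustration using scikit-learn; the rating vectors are hypothetical, since the study's raw scoring data are not reproduced in the abstract.

```python
# Minimal sketch of an interreviewer agreement check, assuming two reviewers
# rate the same responses on the four-point scale described above.
# The ratings below are hypothetical placeholder data.
from sklearn.metrics import cohen_kappa_score

# 1 = correct and comprehensive, 2 = correct but inadequate,
# 3 = partially correct and partially incorrect, 4 = completely incorrect
reviewer_a = [1, 2, 1, 2, 3, 1, 2, 2, 1, 1]
reviewer_b = [1, 2, 2, 2, 3, 1, 1, 2, 1, 2]

# Cohen's kappa adjusts raw percent agreement for the agreement
# expected by chance alone.
kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")
# By the common Landis-Koch convention, 0.61-0.80 is "substantial"
# agreement, the range reported in this study (kappa = 0.67).
```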
Results: Of the 81 responses, 45.7% were rated as correct and comprehensive, while 49.4% were correct but incomplete. Only 4.9% of the responses contained partially incorrect information, and no responses were completely incorrect. The interreviewer agreement was substantial (kappa=0.67), but 75% of the responses differed between the two rounds. Notably, 34.9% of initially incomplete answers improved in the second round.
Conclusion: GPT-4o shows promise in providing accurate and generally reliable information about coccydynia. However, the variability observed in response consistency across repeated queries suggests that while the model is useful for patient education and general inquiries, it may not be suitable for providing specialized clinical knowledge without human oversight.