ChatGPT-4 in Nursing Research: A Methodological Evaluation of Bias Risk in Randomized Controlled Trials.

IF 2.9 · Zone 3 (Medicine) · JCR Q1 (Nursing)
Metin Tuncer, Gülsüm Zekiye Tuncer
Journal of Nursing Scholarship. Published 2025-09-16. Journal Article.
DOI: 10.1111/jnu.70048 (https://doi.org/10.1111/jnu.70048)
Citations: 0

Abstract

Background: Conducting bias assessments in systematic reviews is a time-consuming process that involves subjective judgments. The use of artificial intelligence (AI) technologies to perform these assessments can potentially save time and enhance consistency. Nevertheless, the efficacy of AI technologies in conducting bias assessments remains inadequately explored.

Aim: This study aims to evaluate the efficacy of ChatGPT-4o in assessing bias using the revised Cochrane RoB2 tool, focusing on randomized controlled trials in nursing.

Methods: ChatGPT-4o was provided with the RoB2 assessment guide in the form of a PDF document and instructed to perform bias assessments for the 80 open-access RCTs included in the study. The results of the bias assessments conducted by ChatGPT-4o for each domain were then compared with those of the meta-analysis authors using Cohen's weighted kappa analysis.
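The agreement statistic named in the Methods can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the three ordinal RoB2 judgment levels, the function name `weighted_kappa`, and the choice of linear weights are assumptions, since the abstract does not say whether linear or quadratic weighting was used.

```python
# Minimal sketch of Cohen's weighted kappa for two raters on an ordinal scale.
# Assumptions (not stated in the abstract): three RoB2 levels, linear weights.

LEVELS = ["low", "some concerns", "high"]  # assumed ordinal order


def weighted_kappa(rater_a, rater_b, levels=LEVELS, power=1):
    """Return weighted kappa; power=1 gives linear, power=2 quadratic weights."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(levels)
    index = {level: i for i, level in enumerate(levels)}
    n = len(rater_a)

    # Observed joint proportions of (rater A level, rater B level).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        # 0 on the diagonal, 1 for the two most distant levels.
        return (abs(i - j) / (k - 1)) ** power

    d_obs = sum(disagreement(i, j) * observed[i][j]
                for i in range(k) for j in range(k))
    d_exp = sum(disagreement(i, j) * row[i] * col[j]
                for i in range(k) for j in range(k))
    if d_exp == 0:
        raise ValueError("kappa is undefined when both raters use one level")
    return 1.0 - d_obs / d_exp


if __name__ == "__main__":
    # Hypothetical ratings for four trials, not data from the study.
    model = ["low", "low", "high", "high"]
    reviewers = ["low", "high", "low", "high"]
    print(weighted_kappa(model, reviewers))
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean the raters agreed less often than chance would predict.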

Results: Weighted Cohen's kappa values were highest for bias in the measurement of the outcome (D4, 0.22) and bias arising from the randomization process (D1, 0.20), while the negative values for bias due to missing outcome data (D3, -0.12) and bias in the selection of the reported result (D5, -0.09) indicated poor agreement, below the level expected by chance. The highest accuracy was observed in D5 (0.81), and the lowest in D1 (0.60). F1 scores were highest in bias due to deviations from intended interventions (D2, 0.74) and lowest in D3 (0.00) and D5 (0.00). Specificity was high in D5 (0.93) and D3 (0.82), whereas sensitivity and precision were low in these domains.
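The combination reported for some domains, high accuracy and specificity alongside an F1 score of 0.00, can look contradictory. The sketch below shows how it arises when the positive class is rare and never correctly identified. The confusion-matrix counts are hypothetical and are not taken from the study; treating "positive" as a trial flagged as being at risk of bias is likewise an assumption, since the abstract does not describe how ratings were binarized.

```python
# Minimal sketch: classification metrics from a 2x2 confusion matrix.
# The counts used below are hypothetical, not the study's data.


def binary_metrics(tp, fp, fn, tn):
    """Return accuracy, sensitivity, specificity, precision and F1."""

    def ratio(numerator, denominator):
        # Report 0.0 instead of failing when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)  # also called recall
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, tp + fp + fn + tn)
    f1 = ratio(2 * precision * sensitivity, precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical domain: 8 of 80 trials truly at risk, none of them caught.
    for name, value in binary_metrics(tp=0, fp=4, fn=8, tn=68).items():
        print(f"{name}: {value:.2f}")
```

With these counts, most trials are correctly labeled as not at risk, so accuracy and specificity stay high, while sensitivity, precision, and F1 are all zero because no true positive is ever found. Accuracy alone is therefore a weak indicator of usefulness in domains where high-risk judgments are uncommon.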

Conclusions: Agreement between ChatGPT-4o and the meta-analysis authors on assessments of the same RCTs was generally low. This indicates that ChatGPT-4o requires substantial improvement before it can be used as a reliable tool for risk-of-bias assessments.

Clinical relevance: AI-based tools have the potential to expedite bias assessment in systematic reviews. However, this study demonstrates that ChatGPT-4o, in its current form, lacks sufficient consistency, indicating that such tools should be integrated cautiously and used under continuous human oversight, particularly in evidence-based evaluations that inform clinical decision-making.

Source journal
CiteScore: 6.30 · Self-citation rate: 5.90% · Articles published: 85 · Review time: 6-12 weeks
About the journal: This widely read and respected journal features peer-reviewed, thought-provoking articles representing research by some of the world’s leading nurse researchers. Reaching health professionals, faculty and students in 103 countries, the Journal of Nursing Scholarship is focused on health of people throughout the world. It is the official journal of Sigma Theta Tau International and it reflects the society’s dedication to providing the tools necessary to improve nursing care around the world.