Novel AI applications in systematic review: GPT-4 assisted data extraction, analysis, review of bias.

Impact factor: 9; Zone 3 (Medicine); JCR Q1 (MEDICINE, GENERAL & INTERNAL)
Jin Kyu Kim, Michael Erlano Chua, Tian Ge Li, Mandy Rickard, Armando J Lorenzo
Journal: BMJ Evidence-Based Medicine
Publication date: 2025-04-08
Publication type: Journal Article
DOI: 10.1136/bmjebm-2024-113066 (https://doi.org/10.1136/bmjebm-2024-113066)
Citations: 0

Abstract

Objective: To assess custom GPT-4 performance in extracting and evaluating data from medical literature to assist in the systematic review (SR) process.

Design: A proof-of-concept comparative study was conducted to assess the accuracy and precision of custom GPT-4 models against human-performed reviews of randomised controlled trials (RCTs).

Setting: Four custom GPT-4 models were developed, each specialising in one of the following areas: (1) extraction of study characteristics, (2) extraction of outcomes, (3) extraction of bias assessment domains and (4) evaluation of risk of bias using results from the third GPT-4 model. Model outputs were compared against data from four SRs conducted by human authors. The evaluation focused on accuracy in data extraction, precision in replicating outcomes and agreement levels in risk of bias assessments.

Participants: Among four SRs chosen, 43 studies were retrieved for data extraction evaluation. Additionally, 17 RCTs were selected for comparison of risk of bias assessments, where both human comparator SRs and an analogous SR provided assessments for comparison.

Intervention: Custom GPT-4 models were deployed to extract data and evaluate risk of bias from selected studies, and their outputs were compared to those generated by human reviewers.

Main outcome measures: Concordance rates between GPT-4 outputs and human-performed SRs in data extraction, effect size comparability and inter/intra-rater agreement in risk of bias assessments.
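
The concordance rate named in the outcome measures is a simple proportion of compared fields. The following is a minimal sketch, assuming each extracted field has already been adjudicated by hand into one category. The function name, the category names and the counts are hypothetical illustrations, not values or methods taken from the study.

```python
from collections import Counter


def concordance_summary(labels):
    """Return the percentage of compared fields falling in each category."""
    total = len(labels)
    if total == 0:
        raise ValueError("no fields to compare")
    counts = Counter(labels)
    return {category: 100.0 * n / total for category, n in counts.items()}


# Hypothetical adjudication of 20 extracted fields (not study data).
labels = (
    ["concordant"] * 17          # model output matches the published table
    + ["model_error"] * 1        # model value inaccurate or omitted
    + ["model_more_accurate"] * 1  # model value correct, published table wrong
    + ["unresolved"] * 1         # could not be adjudicated
)

summary = concordance_summary(labels)
for category, pct in summary.items():
    print(f"{category}: {pct:.1f}%")
```

Reporting each category separately, rather than a single match rate, keeps model errors distinct from cases where the model disagreed with the human review but was correct.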

Results: When the automatically extracted data were compared with the first table of study characteristics from the published review, GPT-4 showed 88.6% concordance with the original review, with <5% discrepancies due to inaccuracies or omissions. It exceeded human accuracy in 2.5% of instances. Study outcomes were extracted, and pooling of results showed effect sizes comparable to those of the comparator SRs. A review of bias assessment using GPT-4 showed fair-to-moderate but significant intra-rater agreement (ICC=0.518, p<0.001) and inter-rater agreement with the human comparator SR (weighted kappa=0.237) and the analogous SR (weighted kappa=0.296). In contrast, there was poor agreement between the two human-performed SRs (weighted kappa=0.094).
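
For readers unfamiliar with the weighted kappa statistic reported above, the sketch below shows how Cohen's weighted kappa can be computed for two raters on an ordinal scale. It is an illustration under stated assumptions: the abstract does not say whether linear or quadratic weights were used, so the `power` parameter exposes both, and the three risk-of-bias labels and the ratings are hypothetical, not study data.

```python
def weighted_kappa(rater_a, rater_b, categories, power=1):
    """Cohen's weighted kappa for two raters on an ordinal scale.

    categories lists the labels in ascending order.
    power=1 gives linear disagreement weights, power=2 quadratic.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must supply the same non-zero number of ratings")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")

    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions of (rater A category, rater B category).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal proportions give the chance-expected joint proportions.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power  # 0 on the diagonal
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]

    if weighted_expected == 0.0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical ordinal scale and ratings (illustrative only).
scale = ["low", "some concerns", "high"]
model_ratings = ["low", "some concerns", "high", "low", "high"]
human_ratings = ["low", "high", "high", "some concerns", "high"]

print(weighted_kappa(model_ratings, human_ratings, scale))
print(weighted_kappa(model_ratings, human_ratings, scale, power=2))
```

Because the weights grow with the distance between categories, a low-versus-high disagreement is penalised more than a low-versus-some-concerns disagreement, which is why a weighted statistic suits ordinal risk-of-bias judgements.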

Conclusion: Customized GPT-4 models perform well in extracting precise data from medical literature with potential for utilization in review of bias. While the evaluated tasks are simpler than the broader range of SR methodologies, they provide an important initial assessment of GPT-4's capabilities.

Source journal: BMJ Evidence-Based Medicine (MEDICINE, GENERAL & INTERNAL)
CiteScore: 8.90
Self-citation rate: 3.40%
Articles published: 48
About the journal: BMJ Evidence-Based Medicine (BMJ EBM) publishes original evidence-based research, insights and opinions on what matters for health care. We focus on the tools, methods, and concepts that are basic and central to practising evidence-based medicine and deliver relevant, trustworthy and impactful evidence. BMJ EBM is a Plan S compliant Transformative Journal and adheres to the highest possible industry standards for editorial policies and publication ethics.