Using GPT-4 for Title and Abstract Screening in a Literature Review of Public Policies: A Feasibility Study

Max Rubinstein, Sean Grant, Beth Ann Griffin, Seema Choksy Pessar, Bradley D. Stein
{"title":"Using GPT-4 for Title and Abstract Screening in a Literature Review of Public Policies: A Feasibility Study","authors":"Max Rubinstein,&nbsp;Sean Grant,&nbsp;Beth Ann Griffin,&nbsp;Seema Choksy Pessar,&nbsp;Bradley D. Stein","doi":"10.1002/cesm.70031","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Introduction</h3>\n \n <p>We describe the first known use of large language models (LLMs) to screen titles and abstracts in a review of public policy literature. Our objective was to assess the percentage of articles GPT-4 recommended for exclusion that should have been included (“false exclusion rate”).</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>We used GPT-4 to exclude articles from a database for a literature review of quantitative evaluations of federal and state policies addressing the opioid crisis. We exported our bibliographic database to a CSV file containing titles, abstracts, and keywords and asked GPT-4 to recommend whether to exclude each article. We conducted a preliminary testing of these recommendations using a subset of articles and a final test on a sample of the entire database. We designated a false exclusion rate of 10% as an adequate performance threshold.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>GPT-4 recommended excluding 41,742 of the 43,480 articles (96%) containing an abstract. Our preliminary test identified only one false exclusion; our final test identified no false exclusions, yielding an estimated false exclusion rate of 0.00 [0.00, 0.05]. Fewer than 1%—417 of the 41,742 articles—were incorrectly excluded. After manually assessing the eligibility of all remaining articles, we identified 608 of the 1738 articles that GPT-4 did not exclude: 65% of the articles recommended for inclusion should have been excluded.</p>\n </section>\n \n <section>\n \n <h3> Discussion/Conclusions</h3>\n \n <p>GPT-4 performed well at recommending articles to exclude from our literature review, resulting in substantial time and cost savings. A key limitation is that we did not use GPT-4 to determine inclusions, nor did our model perform well on this task. However, GPT-4 dramatically reduced the number of articles requiring review. Systematic reviewers should conduct performance evaluations to ensure that an LLM meets a minimally acceptable quality standard before relying on its recommendations.</p>\n </section>\n </div>","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 3","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70031","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cochrane Evidence Synthesis and Methods","FirstCategoryId":"1085","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/cesm.70031","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Introduction

We describe the first known use of large language models (LLMs) to screen titles and abstracts in a review of public policy literature. Our objective was to assess the percentage of articles GPT-4 recommended for exclusion that should have been included (“false exclusion rate”).

Methods

We used GPT-4 to exclude articles from a database for a literature review of quantitative evaluations of federal and state policies addressing the opioid crisis. We exported our bibliographic database to a CSV file containing titles, abstracts, and keywords and asked GPT-4 to recommend whether to exclude each article. We conducted preliminary testing of these recommendations using a subset of articles and a final test on a sample of the entire database. We designated a false exclusion rate of 10% as an adequate performance threshold.
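The authors do not publish their prompt or code. As a minimal sketch of the per-article workflow described above, assuming the OpenAI Python client and hypothetical CSV column names (title, abstract, keywords), screening could look like this:

# Sketch of per-article exclusion screening with GPT-4 (illustrative;
# the authors' actual prompt, model settings, and CSV schema are not published).
import csv
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "You are screening articles for a literature review of quantitative "
    "evaluations of U.S. federal and state policies addressing the opioid "
    "crisis. Based on the title, abstract, and keywords below, answer "
    "EXCLUDE or INCLUDE only.\n\n"
)

def screen(row: dict) -> str:
    """Ask GPT-4 whether one bibliographic record should be excluded."""
    text = (
        f"Title: {row['title']}\n"
        f"Abstract: {row['abstract']}\n"
        f"Keywords: {row['keywords']}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,  # minimize run-to-run variation in recommendations
        messages=[{"role": "user", "content": PROMPT + text}],
    )
    return response.choices[0].message.content.strip()

with open("bibliography.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        print(row["title"], "->", screen(row))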

Results

GPT-4 recommended excluding 41,742 of the 43,480 articles (96%) containing an abstract. Our preliminary test identified only one false exclusion; our final test identified no false exclusions, yielding an estimated false exclusion rate of 0.00 [0.00, 0.05]. Fewer than 1% (417 of the 41,742 articles) were incorrectly excluded. After manually assessing the eligibility of all remaining articles, we identified 608 of the 1738 articles that GPT-4 did not exclude as eligible: 65% of the articles recommended for inclusion should have been excluded.
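The paper does not state how the reported interval was computed. One reading consistent with [0.00, 0.05] is an exact (Clopper-Pearson) binomial upper bound with zero false exclusions observed in a final validation sample of roughly 70 excluded articles; the sample size of 72 below is our assumption, chosen only because it reproduces the reported bound:

# Exact (Clopper-Pearson) 95% upper bound on the false exclusion rate
# when a manual re-check of n sampled exclusions finds zero errors.
def upper_bound(n: int, alpha: float = 0.05) -> float:
    # With zero observed events, the two-sided Clopper-Pearson upper
    # limit reduces to 1 - (alpha/2) ** (1/n).
    return 1 - (alpha / 2) ** (1 / n)

print(round(upper_bound(72), 3))  # 0.05, matching the reported interval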

Discussion/Conclusions

GPT-4 performed well at recommending articles to exclude from our literature review, resulting in substantial time and cost savings. A key limitation is that we used GPT-4 only to recommend exclusions; it did not perform well at identifying articles to include. However, GPT-4 dramatically reduced the number of articles requiring manual review. Systematic reviewers should conduct performance evaluations to ensure that an LLM meets a minimally acceptable quality standard before relying on its recommendations.
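As a hedged illustration of such a performance evaluation, the sketch below (function name and interface are ours, not the authors') turns the study's 10% criterion into an explicit gate on the upper confidence bound of the false exclusion rate, generalizing the zero-error case above to any error count:

# Illustrative pre-deployment check, per the paper's recommendation:
# sample the LLM's exclusions, have humans re-screen them, and rely on
# the LLM only if the upper confidence bound on the false exclusion
# rate stays under a prespecified threshold (this study used 10%).
from scipy.stats import beta

def false_exclusion_gate(n_checked: int, n_errors: int,
                         threshold: float = 0.10,
                         alpha: float = 0.05) -> bool:
    """True if the Clopper-Pearson upper bound on the false exclusion
    rate (human says include, LLM said exclude) is below the threshold."""
    if n_errors == n_checked:
        upper = 1.0
    else:
        upper = beta.ppf(1 - alpha / 2, n_errors + 1, n_checked - n_errors)
    return upper < threshold

print(false_exclusion_gate(72, 0))   # True: upper bound ~0.05 < 0.10
print(false_exclusion_gate(72, 10))  # False: upper bound ~0.24 > 0.10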
