Using GPT-4 for Title and Abstract Screening in a Literature Review of Public Policies: A Feasibility Study
Max Rubinstein, Sean Grant, Beth Ann Griffin, Seema Choksy Pessar, Bradley D. Stein
Cochrane Evidence Synthesis and Methods, 3(3), published 2025-05-22. DOI: 10.1002/cesm.70031
https://onlinelibrary.wiley.com/doi/10.1002/cesm.70031
Abstract
Introduction
We describe the first known use of large language models (LLMs) to screen titles and abstracts in a review of public policy literature. Our objective was to assess the percentage of articles GPT-4 recommended for exclusion that should have been included (“false exclusion rate”).
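As a concrete illustration of this metric (the function and variable names below are ours, not the authors'), the false exclusion rate is the share of LLM-excluded records that human reviewers would have judged eligible:

def false_exclusion_rate(excluded_ids: set, human_included_ids: set) -> float:
    """Share of articles the LLM excluded that human reviewers would have included."""
    if not excluded_ids:
        return 0.0
    return len(excluded_ids & human_included_ids) / len(excluded_ids)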
Methods
We used GPT-4 to exclude articles from a database for a literature review of quantitative evaluations of federal and state policies addressing the opioid crisis. We exported our bibliographic database to a CSV file containing titles, abstracts, and keywords, and asked GPT-4 to recommend whether to exclude each article. We conducted a preliminary test of these recommendations using a subset of articles and a final test on a sample of the entire database. We designated a false exclusion rate of 10% as an adequate performance threshold.
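The abstract does not reproduce the authors' prompt, model configuration, or file layout. The sketch below is only an illustrative way to screen each row of such a CSV with the OpenAI Python SDK; the file name, column names, model string, and prompt wording are all assumptions, not the study's actual setup.

import csv
from openai import OpenAI  # assumes the OpenAI Python SDK (v1+) and an API key in the environment

client = OpenAI()

PROMPT = (
    "You are screening articles for a review of quantitative evaluations of US federal "
    "and state policies addressing the opioid crisis. Based on the title, abstract, and "
    "keywords below, answer with a single word: EXCLUDE or INCLUDE.\n\n"
    "Title: {title}\nAbstract: {abstract}\nKeywords: {keywords}"
)

def screen_article(row: dict) -> str:
    """Ask GPT-4 for an exclude/include recommendation on one bibliographic record."""
    response = client.chat.completions.create(
        model="gpt-4",  # hypothetical model string; the exact version is not specified here
        messages=[{"role": "user", "content": PROMPT.format(**row)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper()

with open("bibliography.csv", newline="", encoding="utf-8") as f:  # hypothetical file and column names
    recommendations = {row["title"]: screen_article(row) for row in csv.DictReader(f)}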
Results
GPT-4 recommended excluding 41,742 of the 43,480 articles (96%) containing an abstract. Our preliminary test identified only one false exclusion; our final test identified no false exclusions, yielding an estimated false exclusion rate of 0.00 [0.00, 0.05]. Fewer than 1% of the 41,742 excluded articles (417) were incorrectly excluded. After manually assessing the eligibility of all 1738 articles that GPT-4 did not exclude, we included 608: 65% of the articles recommended for inclusion should have been excluded.
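The abstract does not state the interval method or the size of the final test sample. As an illustration only, an exact (Clopper-Pearson) binomial interval for zero observed false exclusions can be computed as below; the sample size of 70 is hypothetical, chosen merely to give an upper bound near 0.05, and 0.10 is the performance threshold designated in the Methods.

from scipy.stats import beta  # SciPy's beta distribution gives the exact (Clopper-Pearson) interval

def clopper_pearson(false_exclusions: int, n: int, alpha: float = 0.05):
    """Two-sided exact confidence interval for a binomial proportion."""
    lower = 0.0 if false_exclusions == 0 else beta.ppf(alpha / 2, false_exclusions, n - false_exclusions + 1)
    upper = 1.0 if false_exclusions == n else beta.ppf(1 - alpha / 2, false_exclusions + 1, n - false_exclusions)
    return lower, upper

# Hypothetical final test: 0 false exclusions among 70 sampled excluded articles.
lower, upper = clopper_pearson(0, 70)
print(round(lower, 3), round(upper, 3))  # approximately 0.0 0.051
print(upper <= 0.10)                     # True: below the 10% performance threshold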
Discussion/Conclusions
GPT-4 performed well at recommending articles to exclude from our literature review, resulting in substantial time and cost savings. A key limitation is that we did not use GPT-4 to determine final inclusions, and its inclusion recommendations were far less reliable: most articles it recommended for inclusion should have been excluded. However, GPT-4 dramatically reduced the number of articles requiring manual review. Systematic reviewers should conduct performance evaluations to ensure that an LLM meets a minimally acceptable quality standard before relying on its recommendations.