Using GPT-4 for Title and Abstract Screening in a Literature Review of Public Policies: A Feasibility Study
Max Rubinstein, Sean Grant, Beth Ann Griffin, Seema Choksy Pessar, Bradley D. Stein
Cochrane Evidence Synthesis and Methods, 3(3), published 2025-05-22. DOI: 10.1002/cesm.70031
https://onlinelibrary.wiley.com/doi/10.1002/cesm.70031
Abstract
Introduction
We describe the first known use of large language models (LLMs) to screen titles and abstracts in a review of public policy literature. Our objective was to assess the percentage of articles GPT-4 recommended for exclusion that should have been included (“false exclusion rate”).
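As a concrete illustration of this metric (the function and variable names below are ours, not the authors'), the false exclusion rate is the share of LLM-excluded records that human reviewers would have judged eligible:

def false_exclusion_rate(excluded_ids: set, human_included_ids: set) -> float:
    """Share of articles the LLM excluded that human reviewers would have included."""
    if not excluded_ids:
        return 0.0
    return len(excluded_ids & human_included_ids) / len(excluded_ids)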
Methods
We used GPT-4 to exclude articles from a database for a literature review of quantitative evaluations of federal and state policies addressing the opioid crisis. We exported our bibliographic database to a CSV file containing titles, abstracts, and keywords, and asked GPT-4 to recommend whether to exclude each article. We conducted a preliminary test of these recommendations using a subset of articles and a final test on a sample of the entire database. We designated a false exclusion rate of 10% as an adequate performance threshold.
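The abstract does not reproduce the authors' prompt, model configuration, or file layout. The sketch below is only an illustrative way to screen each row of such a CSV with the OpenAI Python SDK; the file name, column names, model string, and prompt wording are all assumptions, not the study's actual setup.

import csv
from openai import OpenAI  # assumes the OpenAI Python SDK (v1+) and an API key in the environment

client = OpenAI()

PROMPT = (
    "You are screening articles for a review of quantitative evaluations of US federal "
    "and state policies addressing the opioid crisis. Based on the title, abstract, and "
    "keywords below, answer with a single word: EXCLUDE or INCLUDE.\n\n"
    "Title: {title}\nAbstract: {abstract}\nKeywords: {keywords}"
)

def screen_article(row: dict) -> str:
    """Ask GPT-4 for an exclude/include recommendation on one bibliographic record."""
    response = client.chat.completions.create(
        model="gpt-4",  # hypothetical model string; the exact version is not specified here
        messages=[{"role": "user", "content": PROMPT.format(**row)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper()

with open("bibliography.csv", newline="", encoding="utf-8") as f:  # hypothetical file and column names
    recommendations = {row["title"]: screen_article(row) for row in csv.DictReader(f)}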
Results
GPT-4 recommended excluding 41,742 of the 43,480 articles (96%) containing an abstract. Our preliminary test identified only one false exclusion; our final test identified no false exclusions, yielding an estimated false exclusion rate of 0.00 [0.00, 0.05]. Fewer than 1% of the 41,742 excluded articles (417) were incorrectly excluded. After manually assessing the eligibility of all 1738 articles that GPT-4 did not exclude, we included 608: 65% of the articles recommended for inclusion should have been excluded.
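The abstract does not state the interval method or the size of the final test sample. As an illustration only, an exact (Clopper-Pearson) binomial interval for zero observed false exclusions can be computed as below; the sample size of 70 is hypothetical, chosen merely to give an upper bound near 0.05, and 0.10 is the performance threshold designated in the Methods.

from scipy.stats import beta  # SciPy's beta distribution gives the exact (Clopper-Pearson) interval

def clopper_pearson(false_exclusions: int, n: int, alpha: float = 0.05):
    """Two-sided exact confidence interval for a binomial proportion."""
    lower = 0.0 if false_exclusions == 0 else beta.ppf(alpha / 2, false_exclusions, n - false_exclusions + 1)
    upper = 1.0 if false_exclusions == n else beta.ppf(1 - alpha / 2, false_exclusions + 1, n - false_exclusions)
    return lower, upper

# Hypothetical final test: 0 false exclusions among 70 sampled excluded articles.
lower, upper = clopper_pearson(0, 70)
print(round(lower, 3), round(upper, 3))  # approximately 0.0 0.051
print(upper <= 0.10)                     # True: below the 10% performance threshold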
Discussion/Conclusions
GPT-4 performed well at recommending articles to exclude from our literature review, resulting in substantial time and cost savings. A key limitation is that we did not use GPT-4 to determine final inclusions, and its inclusion recommendations were far less reliable: most articles it recommended for inclusion should have been excluded. However, GPT-4 dramatically reduced the number of articles requiring manual review. Systematic reviewers should conduct performance evaluations to ensure that an LLM meets a minimally acceptable quality standard before relying on its recommendations.