{"title":"Meta-Analysis Using Time-to-Event Data: A Tutorial","authors":"Ashma Krishan, Kerry Dwan","doi":"10.1002/cesm.70041","DOIUrl":"https://doi.org/10.1002/cesm.70041","url":null,"abstract":"<p>This tutorial focuses on trials that assess time-to-event outcomes. We explain what hazard ratios are, how to interpret them and demonstrate how to include time-to-event data in a meta-analysis. Examples are presented to help with understanding. Accompanying the tutorial is a micro learning module, where we demonstrate a few approaches and give you the chance to practice calculating the hazard ratio. Time-to-event micro learning module https://links.cochrane.org/cesm/tutorials/time-to-event-data.</p>","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70041","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144897288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lifecycles of Cochrane Systematic Reviews (2003–2024): A Bibliographic Study","authors":"Shiyin Li, Chong Wu, Zichen Zhang, Mengli Xiao, Mohammad Hassan Murad, Lifeng Lin","doi":"10.1002/cesm.70043","DOIUrl":"https://doi.org/10.1002/cesm.70043","url":null,"abstract":"<div>\u0000 \u0000 \u0000 <section>\u0000 \u0000 <h3> Background and Objectives</h3>\u0000 \u0000 <p>The relevance of Cochrane systematic reviews depends on timely completion and updates. This study aimed to empirically assess the lifecycles of Cochrane reviews published from 2003 to 2024, including transitions from protocol to review, update patterns, and withdrawals.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Methods</h3>\u0000 \u0000 <p>We extracted data from Cochrane Library publications between 2003 and 2024. Each review topic was identified using a unique six-digit DOI-based ID. We recorded protocol publication, review publication, updates, and withdrawals (i.e., removed from the Cochrane Library for editorial or procedural reasons), calculating time intervals between stages and conducting subgroup analyses by review type.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Results</h3>\u0000 \u0000 <p>Of 8137 protocols, 71.9% progressed to reviews (median 25.7 months), 2.4% were updated during the protocol stage, and 10.0% were withdrawn. Among 8477 reviews, 64.3% were never updated by the time of our analysis; for those updated at least once, the median interval between updates was 57.2 months. Withdrawal occurred in 2.5% of reviews (median 67.6 months post-publication). Subgroup analyses showed variation across review types; diagnostic and qualitative reviews tended to have longer protocol-to-review times than other types of reviews.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Conclusions</h3>\u0000 \u0000 <p>Cochrane reviews show long development and update intervals, with variation by review type. Greater use of automation and targeted support may improve review efficiency and timeliness.</p>\u0000 </section>\u0000 </div>","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70043","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144858600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Research Impact: A Toolkit for Stakeholder-Driven Prioritization of Systematic Review Topics","authors":"Dyon Hoekstra, Stefan K. Lhachimi","doi":"10.1002/cesm.70039","DOIUrl":"https://doi.org/10.1002/cesm.70039","url":null,"abstract":"<div>\u0000 \u0000 \u0000 <section>\u0000 \u0000 <h3> Intro</h3>\u0000 \u0000 <p>The prioritization of topics for evidence synthesis is crucial for maximizing the relevance and impact of systematic reviews. This article introduces a comprehensive toolkit designed to facilitate a structured, multi-step framework for engaging a broad spectrum of stakeholders in the prioritization process, ensuring the selection of topics that are both relevant and applicable.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Methods</h3>\u0000 \u0000 <p>We detail an open-source framework comprising 11 coherent steps, segmented into scoping and Delphi stages, to offer a flexible and resource-efficient approach for stakeholder involvement in research priority setting.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Results</h3>\u0000 \u0000 <p>The toolkit provides ready-to-use tools for the development, application, and analysis of the framework, including templates for online surveys developed with free open-source software, ensuring ease of replication and adaptation in various research fields. The framework supports the transparent and systematic development and assessment of systematic review topics, with a particular focus on stakeholder-refined assessment criteria.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Conclusion</h3>\u0000 \u0000 <p>Our toolkit enhances the transparency and ease of the priority-setting process. Targeted primarily at organizations and research groups seeking to allocate resources for future research based on stakeholder needs, this toolkit stands as a valuable resource for informed decision-making in research prioritization.</p>\u0000 </section>\u0000 </div>","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 5","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70039","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144832632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ChatGPT-4o Compared With Human Researchers in Writing Plain-Language Summaries for Cochrane Reviews: A Blinded, Randomized Non-Inferiority Controlled Trial","authors":"Dagný Halla Ágústsdóttir, Jacob Rosenberg, Jason Joe Baker","doi":"10.1002/cesm.70037","DOIUrl":"https://doi.org/10.1002/cesm.70037","url":null,"abstract":"<div>\u0000 \u0000 \u0000 <section>\u0000 \u0000 <h3> Introduction</h3>\u0000 \u0000 <p>Plain language summaries in Cochrane reviews are designed to present key information in a way that is understandable to individuals without a medical background. Despite Cochrane's author guidelines, these summaries often fail to achieve their intended purpose. Studies show that they are generally difficult to read and vary in their adherence to the guidelines. Artificial intelligence is increasingly used in medicine and academia, with its potential being tested in various roles. This study aimed to investigate whether ChatGPT-4o could produce plain language summaries that are as good as the already published plain language summaries in Cochrane reviews.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Methods</h3>\u0000 \u0000 <p>We conducted a randomized, single-blinded study with a total of 36 plain language summaries: 18 human written and 18 ChatGPT-4o generated summaries where both versions were for the same Cochrane reviews. The sample size was calculated to be 36 and each summary was evaluated four times. Each summary was reviewed twice by members of a Cochrane editorial group and twice by laypersons. The summaries were assessed in three different ways: First, all assessors evaluated the summaries for informativeness, readability, and level of detail using a Likert scale from 1 to 10. They were also asked whether they would submit the summary and whether they could identify who had written it. Second, members of a Cochrane editorial group assessed the summaries using a checklist based on Cochrane's guidelines for plain language summaries, with scores ranging from 0 to 10. Finally, the readability of the summaries was analyzed using objective tools such as Lix and Flesch-Kincaid scores. Randomization and allocation to either ChatGPT-4o or human written summaries were conducted using random.org's random sequence generator, and assessors were blinded to the authorship of the summaries.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Results</h3>\u0000 \u0000 <p>The plain language summaries generated by ChatGPT-4o scored 1 point higher on information (<i>p</i> < .001) and level of detail (<i>p</i> = .004), and 2 points higher on readability (<i>p</i> = .002) compared to human written summaries. Lix and Flesch-Kincaid scores were high for both groups of summaries, though ChatGPT was slightly easier to read (<i>p</i> < .001). Assessors found it difficult to distinguish between ChatGPT and human written summaries, with only 20% correctly identifying ChatGPT generated text. ChatGPT summaries were preferred for submission compared to the human written summaries (64% vs. 
36%, <i>p</i> < .001).</p>\u0000 </section>\u0000 \u0000 <section>\u0000 ","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70037","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144714684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
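Lix and Flesch-Kincaid grade scores are computed from sentence length, word length, and (for Flesch-Kincaid) syllable counts. The sketch below shows both formulas with a naive syllable counter; published readability tools use more careful tokenization and syllable dictionaries, so treat this purely as an illustration, not the study's procedure.

```python
# Illustrative Lix and Flesch-Kincaid grade calculations.
import re

def _sentences(text):
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]

def _words(text):
    return re.findall(r"[A-Za-z'-]+", text)

def lix(text):
    """Lix = words/sentences + 100 * (words longer than 6 letters) / words."""
    sentences, words = _sentences(text), _words(text)
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100 * len(long_words) / len(words)

def _syllables(word):
    """Very rough syllable count: number of vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    """FK grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences, words = _sentences(text), _words(text)
    syllables = sum(_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)

sample = ("We reviewed the evidence. The treatment probably reduces pain. "
          "More research is needed to confirm the size of the effect.")
print(round(lix(sample), 1), round(flesch_kincaid_grade(sample), 1))
```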
{"title":"Analyzing the Utility of Openalex to Identify Studies for Systematic Reviews: Methods and a Case Study","authors":"Claire Stansfield, Hossein Dehdarirad, James Thomas, Silvy Mathew, Alison O'Mara-Eves","doi":"10.1002/cesm.70038","DOIUrl":"https://doi.org/10.1002/cesm.70038","url":null,"abstract":"<p>Open access scholarly resources have potential to simplify the literature search process, support more equitable access to research knowledge, and reduce biases from lack of access to relevant literature. OpenAlex is the world's largest open access database of academic research. However, it is not known whether OpenAlex is suitable for comprehensively identifying research for systematic reviews. We present an approach to measure the utility of OpenAlex as part of undertaking a systematic review, and present findings in the context of undertaking a systematic map on the implementation of diabetic eye screening. Procedures were developed to investigate OpenAlex's content coverage and capture, focusing on: (1) availability of relevant research records; (2) retrieval of relevant records from a Boolean search of OpenAlex (3) retrieval of relevant records from combining a PubMed Boolean search with a citations and related-items search of OpenAlex, and (4) efficient estimation of relevant records not identified elsewhere. The searches were conducted in July 2024 and repeated in March 2025 following removal of certain closed access abstracts from the OpenAlex data set. The original systematic review searches yielded 131 relevant records and 128 (98%) of these are present in OpenAlex. OpenAlex Boolean searches retrieved 126 (96%) of the 131 records, and partial screening yielded two relevant records not previously known to the review team. Retrieval was reduced to 123 (94%) when the searches were repeated in March 2025. However, the volume of records from the OpenAlex Boolean search was considerably greater than assessed for the original systematic map. Combining a Boolean search from PubMed and OpenAlex network graph searches yielded 93% recall. It is feasible and useful to investigate the use of OpenAlex as a key information resource for health topics. This approach can be modified to investigate OpenAlex for other systematic reviews. However, the volume of records obtained from searches is larger than that obtained from conventional sources, something that could be reduced using machine learning. Further investigations are needed, and our approach replicated in other reviews.</p>","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70038","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144688187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing nursing and other healthcare professionals' knowledge of childhood sexual abuse through self-assessment: A realist review","authors":"Dr. Olumide Adisa, Ms. Katie Tyrrell, Dr. Katherine Allen","doi":"10.1002/cesm.70019","DOIUrl":"https://doi.org/10.1002/cesm.70019","url":null,"abstract":"<div>\u0000 \u0000 \u0000 <section>\u0000 \u0000 <h3> Aim</h3>\u0000 \u0000 <p>To explore how child sexual abuse/exploitation (CSA/E) self-assessment tools are being used to enhance healthcare professionals' knowledge and confidence.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Background</h3>\u0000 \u0000 <p>Child sexual abuse/exploitation is common and associated with lifelong health impacts. In particular, nurses are well-placed to facilitate disclosures by adult survivors of child sexual abuse/exploitation and promote timely access to support. However, research shows that many are reluctant to enquire about abuse and feel underprepared for disclosures. Self-assessment provides a participatory method for evaluating competencies and identifying areas that need improvement.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Evaluation</h3>\u0000 \u0000 <p>Researchers adopted a realist synthesis approach, searching relevant databases for healthcare professionals' self-assessment tools/protocols relevant to adult survivors. In total, researchers reviewed 247 full-text articles. Twenty-five items met the criteria for data extraction, and to assess relevant contexts (C), mechanisms (M) and outcomes (O) were identified and mapped. Eight of these were included in the final synthesis based on papers that identified two key ‘families’ of abuse-related self-assessment interventions for healthcare contexts: PREMIS, a validated survey instrument to assess HCP knowledge, confidence and practice about domestic violence and abuse (DVA); Trauma-informed practice/care (TIP/C) organisational self-assessment protocols. Two revised programme theories were formulated: (1). Individual self-assessment can promote organisational accountability; and (2). Organisational self-assessment can increase the coherence and sustainability of changes in practice.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Conclusions</h3>\u0000 \u0000 <p>There is a lack of self-assessment tools/protocols designed to improve healthcare professionals' knowledge and confidence. Our review contributes to the evidence base on improving healthcare responses to CSA/E survivors, illustrating that self-assessment tools or protocols designed to improve HCP responses to adult survivors of CSA/E remain underdeveloped and under-studied. Refined programme theories developed during synthesis regarding DVA and TIP/C-related tools or protocols suggest areas for CSA/E-specific future research with stakeholders and service users.</p>\u0000 </section>\u0000 </div>","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70019","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144681594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Artificial Intelligence Tools as Second Reviewers for Data Extraction in Systematic Reviews: A Performance Comparison of Two AI Tools Against Human Reviewers","authors":"T. Helms Andersen, T. M. Marcussen, A. D. Termannsen, T. W. H. Lawaetz, O. Nørgaard","doi":"10.1002/cesm.70036","DOIUrl":"https://doi.org/10.1002/cesm.70036","url":null,"abstract":"<div>\u0000 \u0000 \u0000 <section>\u0000 \u0000 <h3> Background</h3>\u0000 \u0000 <p>Systematic reviews are essential but time-consuming and expensive. Large language models (LLMs) and artificial intelligence (AI) tools could potentially automate data extraction, but no comprehensive workflow has been tested for different review types.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Objective</h3>\u0000 \u0000 <p>To evaluate Elicit's and ChatGPT's abilities to extract data from journal articles as a replacement for one of two human data extractors in systematic reviews.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Methods</h3>\u0000 \u0000 <p>Human-extracted data from three systematic reviews (30 articles in total) was compared to data extracted by Elicit and ChatGPT. The AI tools extracted population characteristics, study design, and review-specific variables. Performance metrics were calculated against human double-extracted data as the gold standard, followed by a detailed error analysis.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Results</h3>\u0000 \u0000 <p>Precision, recall and F1-score were all 92% for Elicit and 91%, 89% and 90% for ChatGPT. Recall was highest for study design (Elicit: 100%; ChatGPT: 90%) and population characteristics (Elicit: 100%; ChatGPT: 97%), while review-specific variables achieved 77% in Elicit and 80% in ChatGPT. Elicit had four instances of confabulation while ChatGPT had three. There was no significant difference between the two AI tools' performance (recall difference: 3.3% points, 95% CI: –5.2%–11.9%, <i>p</i> = 0.445).</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Conclusion</h3>\u0000 \u0000 <p>AI tools demonstrated high and similar performance in data extraction compared to human reviewers, particularly for standardized variables. Error analysis revealed confabulations in 4% of data points. We propose adopting AI-assisted extraction to replace the second human extractor, with the second human instead focusing on reconciling discrepancies between AI and the primary human extractor.</p>\u0000 </section>\u0000 </div>","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70036","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144615305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Creating Interactive Data Dashboards for Evidence Syntheses","authors":"Leslie A. Perdue, Shaina D. Trevino, Sean Grant, Jennifer S. Lin, Emily E. Tanner-Smith","doi":"10.1002/cesm.70035","DOIUrl":"https://doi.org/10.1002/cesm.70035","url":null,"abstract":"<p>Systematic review findings are typically disseminated via static outputs, such as scientific manuscripts, which can limit the accessibility and usability for diverse audiences. Interactive data dashboards transform systematic review data into dynamic, user-friendly visualizations, allowing deeper engagement with evidence synthesis findings. We propose a workflow for creating interactive dashboards to display evidence synthesis results, including three key phases: planning, development, and deployment. Planning involves defining the dashboard objectives and key audiences, selecting the appropriate software (e.g., Tableau or R Shiny) and preparing the data. Development includes designing a user-friendly interface and specifying interactive elements. Lastly, deployment focuses on making it available to users and utilizing user-testing. Throughout all phases, we emphasize seeking and incorporating interest-holder input and aligning dashboards with the intended audience's needs. To demonstrate this workflow, we provide two examples from previous systematic reviews. The first dashboard, created in Tableau, presents findings from a meta-analysis to support a U.S. Preventive Services Task Force recommendation on lipid disorder screening in children, while the second utilizes R Shiny to display data from a scoping review on the 4-day school week among K-12 students in the U.S. Both dashboards incorporate interactive elements to present complex evidence tailored to different interest-holders, including non-research audiences. Interactive dashboards can enhance the utility of evidence syntheses by providing a user-friendly tool for interest-holders to explore data relevant to their specific needs. This workflow can be adapted to create interactive dashboards in flexible formats to increase the use and accessibility of systematic review findings.</p>","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70035","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144472977","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Extractions Using a Large Language Model (Elicit) and Human Reviewers in Randomized Controlled Trials: A Systematic Comparison","authors":"Joleen Bianchi, Julian Hirt, Magdalena Vogt, Janine Vetsch","doi":"10.1002/cesm.70033","DOIUrl":"https://doi.org/10.1002/cesm.70033","url":null,"abstract":"<div>\u0000 \u0000 \u0000 <section>\u0000 \u0000 <h3> Aim</h3>\u0000 \u0000 <p>We aimed at comparing data extractions from randomized controlled trials by using Elicit and human reviewers.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Background</h3>\u0000 \u0000 <p>Elicit is an artificial intelligence tool which may automate specific steps in conducting systematic reviews. However, the tool's performance and accuracy have not been independently assessed.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Methods</h3>\u0000 \u0000 <p>For comparison, we sampled 20 randomized controlled trials of which data were extracted manually from a human reviewer. We assessed the variables study objectives, sample characteristics and size, study design, interventions, outcome measured, and intervention effects and classified the results into “more,” “equal to,” “partially equal,” and “deviating” extractions. STROBE checklist was used to report the study.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Results</h3>\u0000 \u0000 <p>We analysed 20 randomized controlled trials from 11 countries. The studies covered diverse healthcare topics. Across all seven variables, Elicit extracted “more” data in 29.3% of cases, “equal” in 20.7%, “partially equal” in 45.7%, and “deviating” in 4.3%. Elicit provided “more” information for the variable study design (100%) and sample characteristics (45%). In contrast, for more nuanced variables, such as “intervention effects,” Elicit's extractions were less detailed, with 95% rated as “partially equal.”</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Conclusions</h3>\u0000 \u0000 <p>Elicit was capable of extracting data partly correct for our predefined variables. Variables like “intervention effect” or “intervention” may require a human reviewer to complete the data extraction. Our results suggest that verification by human reviewers is necessary to ensure that all relevant information is captured completely and correctly by Elicit.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Implications</h3>\u0000 \u0000 <p>Systematic reviews are labor-intensive. Data extraction process may be facilitated by artificial intelligence tools. Use of Elicit may require a human reviewer to double-check the extracted data.</p>\u0000 </section>\u0000 </div>","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 4","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70033","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144244667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using GPT-4 for Title and Abstract Screening in a Literature Review of Public Policies: A Feasibility Study","authors":"Max Rubinstein, Sean Grant, Beth Ann Griffin, Seema Choksy Pessar, Bradley D. Stein","doi":"10.1002/cesm.70031","DOIUrl":"https://doi.org/10.1002/cesm.70031","url":null,"abstract":"<div>\u0000 \u0000 \u0000 <section>\u0000 \u0000 <h3> Introduction</h3>\u0000 \u0000 <p>We describe the first known use of large language models (LLMs) to screen titles and abstracts in a review of public policy literature. Our objective was to assess the percentage of articles GPT-4 recommended for exclusion that should have been included (“false exclusion rate”).</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Methods</h3>\u0000 \u0000 <p>We used GPT-4 to exclude articles from a database for a literature review of quantitative evaluations of federal and state policies addressing the opioid crisis. We exported our bibliographic database to a CSV file containing titles, abstracts, and keywords and asked GPT-4 to recommend whether to exclude each article. We conducted a preliminary testing of these recommendations using a subset of articles and a final test on a sample of the entire database. We designated a false exclusion rate of 10% as an adequate performance threshold.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Results</h3>\u0000 \u0000 <p>GPT-4 recommended excluding 41,742 of the 43,480 articles (96%) containing an abstract. Our preliminary test identified only one false exclusion; our final test identified no false exclusions, yielding an estimated false exclusion rate of 0.00 [0.00, 0.05]. Fewer than 1%—417 of the 41,742 articles—were incorrectly excluded. After manually assessing the eligibility of all remaining articles, we identified 608 of the 1738 articles that GPT-4 did not exclude: 65% of the articles recommended for inclusion should have been excluded.</p>\u0000 </section>\u0000 \u0000 <section>\u0000 \u0000 <h3> Discussion/Conclusions</h3>\u0000 \u0000 <p>GPT-4 performed well at recommending articles to exclude from our literature review, resulting in substantial time and cost savings. A key limitation is that we did not use GPT-4 to determine inclusions, nor did our model perform well on this task. However, GPT-4 dramatically reduced the number of articles requiring review. Systematic reviewers should conduct performance evaluations to ensure that an LLM meets a minimally acceptable quality standard before relying on its recommendations.</p>\u0000 </section>\u0000 </div>","PeriodicalId":100286,"journal":{"name":"Cochrane Evidence Synthesis and Methods","volume":"3 3","pages":""},"PeriodicalIF":0.0,"publicationDate":"2025-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1002/cesm.70031","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144108810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}