Björn Nykvist, Biljana Macura, Maria Xylia, Erik Olsson
{"title":"测试GPT在环境系统证据合成中标题和摘要筛选的效用。","authors":"Björn Nykvist, Biljana Macura, Maria Xylia, Erik Olsson","doi":"10.1186/s13750-025-00360-x","DOIUrl":null,"url":null,"abstract":"<p><p>In this paper we show that OpenAI's Large Language Model (LLM) GPT perform remarkably well when used for title and abstract eligibility screening of scientific articles and within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review study on electric vehicle charging infrastructure demand with almost 12,000 records using the same eligibility criteria as human screeners. We tested 3 different versions of this model that were tasked to distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) and probability cutoff 0.5 the recall rate is 100%, meaning no relevant papers were missed and using this mode for screening would have saved 50% of the time that would otherwise be spent on manual screening. Experimenting with a higher cut of threshold can save more time. With threshold chosen so that recall is still above 95% for GPT-4 (where up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, rather quickly available at the start of a research project, is hard to understate. However, as this study only evaluated the performance on one systematic review and one prompt, we caution that more test and methodological development is needed, and outline the next steps to properly evaluate rigor and effectiveness of LLMs for eligibility screening.</p>","PeriodicalId":48621,"journal":{"name":"Environmental Evidence","volume":"14 1","pages":"7"},"PeriodicalIF":3.4000,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12016299/pdf/","citationCount":"0","resultStr":"{\"title\":\"Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis.\",\"authors\":\"Björn Nykvist, Biljana Macura, Maria Xylia, Erik Olsson\",\"doi\":\"10.1186/s13750-025-00360-x\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>In this paper we show that OpenAI's Large Language Model (LLM) GPT perform remarkably well when used for title and abstract eligibility screening of scientific articles and within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review study on electric vehicle charging infrastructure demand with almost 12,000 records using the same eligibility criteria as human screeners. We tested 3 different versions of this model that were tasked to distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) and probability cutoff 0.5 the recall rate is 100%, meaning no relevant papers were missed and using this mode for screening would have saved 50% of the time that would otherwise be spent on manual screening. Experimenting with a higher cut of threshold can save more time. 
With threshold chosen so that recall is still above 95% for GPT-4 (where up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, rather quickly available at the start of a research project, is hard to understate. However, as this study only evaluated the performance on one systematic review and one prompt, we caution that more test and methodological development is needed, and outline the next steps to properly evaluate rigor and effectiveness of LLMs for eligibility screening.</p>\",\"PeriodicalId\":48621,\"journal\":{\"name\":\"Environmental Evidence\",\"volume\":\"14 1\",\"pages\":\"7\"},\"PeriodicalIF\":3.4000,\"publicationDate\":\"2025-04-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12016299/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Environmental Evidence\",\"FirstCategoryId\":\"93\",\"ListUrlMain\":\"https://doi.org/10.1186/s13750-025-00360-x\",\"RegionNum\":4,\"RegionCategory\":\"环境科学与生态学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ENVIRONMENTAL SCIENCES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Environmental Evidence","FirstCategoryId":"93","ListUrlMain":"https://doi.org/10.1186/s13750-025-00360-x","RegionNum":4,"RegionCategory":"环境科学与生态学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENVIRONMENTAL SCIENCES","Score":null,"Total":0}
Testing the utility of GPT for title and abstract screening in environmental systematic evidence synthesis.
In this paper we show that OpenAI's Large Language Model (LLM) GPT performs remarkably well when used for title and abstract eligibility screening of scientific articles within a (systematic) literature review workflow. We evaluated GPT on screening data from a systematic review study on electric vehicle charging infrastructure demand, comprising almost 12,000 records, using the same eligibility criteria as human screeners. We tested three different versions of the model, each tasked with distinguishing between relevant and irrelevant content by responding with a relevance probability between 0 and 1. For the latest GPT-4 model (tested in November 2023) and a probability cutoff of 0.5, the recall rate is 100%, meaning no relevant papers were missed, and using this model for screening would have saved 50% of the time that would otherwise be spent on manual screening. Experimenting with a higher cutoff threshold can save more time. With a threshold chosen so that recall remains above 95% for GPT-4 (where up to 5% of relevant papers might be missed), the model could save 75% of the time spent on manual screening. If automation technologies can replicate manual screening by human experts with comparable effectiveness, accuracy, and precision, the work and cost savings are significant. Furthermore, the value of a comprehensive list of relevant literature, available quickly at the start of a research project, is hard to overstate. However, as this study only evaluated performance on one systematic review and one prompt, we caution that more testing and methodological development are needed, and we outline the next steps to properly evaluate the rigor and effectiveness of LLMs for eligibility screening.
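As a rough illustration of the workflow the abstract describes, the sketch below asks a GPT model for a relevance probability per record, applies a probability cutoff, and measures recall and time saved against human screening decisions. The prompt wording, the criteria text, the model string, and the helper names (relevance_probability, screen, recall, time_saved) are assumptions for illustration only; the paper's actual prompt and pipeline are not reproduced here.

# A minimal sketch of probability-cutoff screening, assuming the OpenAI
# Python SDK (openai>=1.0). Prompt, criteria, and model string are
# illustrative assumptions, not the prompt evaluated in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical stand-in for the review's eligibility criteria.
CRITERIA = (
    "You screen titles and abstracts for a systematic review on "
    "electric vehicle charging infrastructure demand."
)

def relevance_probability(title: str, abstract: str, model: str = "gpt-4") -> float:
    """Ask the model for a relevance probability between 0 and 1."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce variability between repeated runs
        messages=[
            {"role": "system", "content": CRITERIA},
            {"role": "user", "content": (
                f"Title: {title}\nAbstract: {abstract}\n"
                "Reply with only a number between 0 and 1: the probability "
                "that this record meets the eligibility criteria."
            )},
        ],
    )
    # A real pipeline would validate the reply; this sketch assumes a bare number.
    return float(response.choices[0].message.content.strip())

def screen(records: list[dict], cutoff: float = 0.5) -> list[dict]:
    """Keep records whose predicted relevance meets the cutoff (0.5 in the base case above)."""
    return [r for r in records
            if relevance_probability(r["title"], r["abstract"]) >= cutoff]

def recall(kept_ids: set[str], relevant_ids: set[str]) -> float:
    """Fraction of truly relevant records retained after automated screening."""
    return len(kept_ids & relevant_ids) / len(relevant_ids)

def time_saved(n_excluded: int, n_total: int) -> float:
    """Fraction of manual screening effort avoided: excluded records need no human review."""
    return n_excluded / n_total

Raising the cutoff excludes more records and so saves more manual screening time, but, as the abstract notes, at the risk of missing relevant papers; checking recall against a human-screened reference set is what keeps that trade-off visible.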
About the journal:
Environmental Evidence is the journal of the Collaboration for Environmental Evidence (CEE). The Journal facilitates rapid publication of evidence syntheses, in the form of Systematic Reviews and Maps conducted to CEE Guidelines and Standards. We focus on the effectiveness of environmental management interventions and the impact of human activities on the environment. Our scope covers all forms of environmental management and human impacts and therefore spans the natural and social sciences. Subjects include water security, agriculture, food security, forestry, fisheries, natural resource management, biodiversity conservation, climate change, ecosystem services, pollution, invasive species, environment and human wellbeing, sustainable energy use, soil management, environmental legislation, and environmental education.