Amanda Konet, Ian Thomas, Gerald Gartlehner, Leila Kahwati, Rainer Hilscher, Shannon Kugley, Karen Crotty, Meera Viswanathan, Robert Chew
{"title":"两种大型语言模型在证据合成中提取数据的性能。","authors":"Amanda Konet, Ian Thomas, Gerald Gartlehner, Leila Kahwati, Rainer Hilscher, Shannon Kugley, Karen Crotty, Meera Viswanathan, Robert Chew","doi":"10.1002/jrsm.1732","DOIUrl":null,"url":null,"abstract":"<p>Accurate data extraction is a key component of evidence synthesis and critical to valid results. The advent of publicly available large language models (LLMs) has generated interest in these tools for evidence synthesis and created uncertainty about the choice of LLM. We compare the performance of two widely available LLMs (Claude 2 and GPT-4) for extracting pre-specified data elements from 10 published articles included in a previously completed systematic review. We use prompts and full study PDFs to compare the outputs from the browser versions of Claude 2 and GPT-4. GPT-4 required use of a third-party plugin to upload and parse PDFs. Accuracy was high for Claude 2 (96.3%). The accuracy of GPT-4 with the plug-in was lower (68.8%); however, most of the errors were due to the plug-in. Both LLMs correctly recognized when prespecified data elements were missing from the source PDF and generated correct information for data elements that were not reported explicitly in the articles. A secondary analysis demonstrated that, when provided selected text from the PDFs, Claude 2 and GPT-4 accurately extracted 98.7% and 100% of the data elements, respectively. Limitations include the narrow scope of the study PDFs used, that prompt development was completed using only Claude 2, and that we cannot guarantee the open-source articles were not used to train the LLMs. This study highlights the potential for LLMs to revolutionize data extraction but underscores the importance of accurate PDF parsing. For now, it remains essential for a human investigator to validate LLM extractions.</p>","PeriodicalId":226,"journal":{"name":"Research Synthesis Methods","volume":"15 5","pages":"818-824"},"PeriodicalIF":5.0000,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Performance of two large language models for data extraction in evidence synthesis\",\"authors\":\"Amanda Konet, Ian Thomas, Gerald Gartlehner, Leila Kahwati, Rainer Hilscher, Shannon Kugley, Karen Crotty, Meera Viswanathan, Robert Chew\",\"doi\":\"10.1002/jrsm.1732\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Accurate data extraction is a key component of evidence synthesis and critical to valid results. The advent of publicly available large language models (LLMs) has generated interest in these tools for evidence synthesis and created uncertainty about the choice of LLM. We compare the performance of two widely available LLMs (Claude 2 and GPT-4) for extracting pre-specified data elements from 10 published articles included in a previously completed systematic review. We use prompts and full study PDFs to compare the outputs from the browser versions of Claude 2 and GPT-4. GPT-4 required use of a third-party plugin to upload and parse PDFs. Accuracy was high for Claude 2 (96.3%). The accuracy of GPT-4 with the plug-in was lower (68.8%); however, most of the errors were due to the plug-in. Both LLMs correctly recognized when prespecified data elements were missing from the source PDF and generated correct information for data elements that were not reported explicitly in the articles. 
A secondary analysis demonstrated that, when provided selected text from the PDFs, Claude 2 and GPT-4 accurately extracted 98.7% and 100% of the data elements, respectively. Limitations include the narrow scope of the study PDFs used, that prompt development was completed using only Claude 2, and that we cannot guarantee the open-source articles were not used to train the LLMs. This study highlights the potential for LLMs to revolutionize data extraction but underscores the importance of accurate PDF parsing. For now, it remains essential for a human investigator to validate LLM extractions.</p>\",\"PeriodicalId\":226,\"journal\":{\"name\":\"Research Synthesis Methods\",\"volume\":\"15 5\",\"pages\":\"818-824\"},\"PeriodicalIF\":5.0000,\"publicationDate\":\"2024-06-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Research Synthesis Methods\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1002/jrsm.1732\",\"RegionNum\":2,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MATHEMATICAL & COMPUTATIONAL BIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Research Synthesis Methods","FirstCategoryId":"99","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1002/jrsm.1732","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MATHEMATICAL & COMPUTATIONAL BIOLOGY","Score":null,"Total":0}
引用次数: 0
Performance of two large language models for data extraction in evidence synthesis
Accurate data extraction is a key component of evidence synthesis and critical to valid results. The advent of publicly available large language models (LLMs) has generated interest in these tools for evidence synthesis and created uncertainty about the choice of LLM. We compare the performance of two widely available LLMs (Claude 2 and GPT-4) for extracting prespecified data elements from 10 published articles included in a previously completed systematic review. We use prompts and full study PDFs to compare the outputs from the browser versions of Claude 2 and GPT-4. GPT-4 required a third-party plug-in to upload and parse PDFs. Accuracy was high for Claude 2 (96.3%). The accuracy of GPT-4 with the plug-in was lower (68.8%); however, most of the errors were due to the plug-in. Both LLMs correctly recognized when prespecified data elements were missing from the source PDF and generated correct information for data elements that were not reported explicitly in the articles. A secondary analysis demonstrated that, when provided selected text from the PDFs, Claude 2 and GPT-4 accurately extracted 98.7% and 100% of the data elements, respectively. Limitations include the narrow scope of the study PDFs used, that prompt development was completed using only Claude 2, and that we cannot guarantee the open-source articles were not used to train the LLMs. This study highlights the potential for LLMs to revolutionize data extraction but underscores the importance of accurate PDF parsing. For now, it remains essential for a human investigator to validate LLM extractions.
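To make the extraction workflow concrete, the sketch below shows how a comparable prompt-and-extract step might be scripted against an LLM API. It is an illustration only: the study used the browser versions of Claude 2 and GPT-4 with manually uploaded PDFs, and the data-element list, prompt wording, file name, and model identifier here are hypothetical stand-ins rather than the authors' materials. The sketch assumes the `pypdf` and `openai` Python packages.

```python
# Illustrative sketch only: the study itself used the browser versions of
# Claude 2 and GPT-4 with manually uploaded PDFs, not an API pipeline.
# The data-element list, prompt wording, file name, and model identifier
# below are hypothetical stand-ins.
from pypdf import PdfReader
from openai import OpenAI

# Hypothetical prespecified data elements, analogous in spirit to the
# extraction form an evidence-synthesis team might define.
DATA_ELEMENTS = ["study design", "sample size", "intervention", "primary outcome"]


def pdf_to_text(pdf_path: str) -> str:
    """Concatenate the text of every page; parsing quality matters, as the study found."""
    reader = PdfReader(pdf_path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_elements(article_text: str, client: OpenAI) -> str:
    """Ask the model to extract each prespecified element, or report it as missing."""
    prompt = (
        "From the study text below, extract the following data elements: "
        + "; ".join(DATA_ELEMENTS)
        + ". If an element is not reported, answer 'not reported'.\n\n"
        + article_text
    )
    response = client.chat.completions.create(
        model="gpt-4",  # assumed model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    text = pdf_to_text("included_study.pdf")  # hypothetical file name
    print(extract_elements(text, client))
    # Per the article's conclusion, a human investigator should still
    # validate every extracted value against the source PDF.
```

Passing the model only selected, relevant text rather than a full parsed PDF mirrors the secondary analysis, in which accuracy rose to 98.7% (Claude 2) and 100% (GPT-4), reinforcing the authors' point that PDF parsing, not the models themselves, accounted for most of GPT-4's errors.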
About the journal:
Research Synthesis Methods is a reputable, peer-reviewed journal that focuses on the development and dissemination of methods for conducting systematic research synthesis. Our aim is to advance the knowledge and application of research synthesis methods across various disciplines.
Our journal provides a platform for the exchange of ideas and knowledge related to designing, conducting, analyzing, interpreting, reporting, and applying research synthesis. While research synthesis is commonly practiced in the health and social sciences, our journal also welcomes contributions from other fields to enrich the methodologies employed in research synthesis across scientific disciplines.
By bridging different disciplines, we aim to foster collaboration and cross-fertilization of ideas, ultimately enhancing the quality and effectiveness of research synthesis methods. Whether you are a researcher, practitioner, or stakeholder involved in research synthesis, our journal strives to offer valuable insights and practical guidance for your work.