Data extraction by generative artificial intelligence: Assessing determinants of accuracy using human-extracted data from systematic review databases.

Thorben Jansen, Lucas W. Liebenow, Ute Mertens, Fabian T. C. Schmidt, Julian F. Lohmann, Johanna Fleckenstein, Jennifer Meyer

Psychological Bulletin, pp. 1280-1306. Published 2025-12-15. DOI: 10.1037/bul0000501 (https://doi.org/10.1037/bul0000501)
Citations: 0
Abstract
Psychological science requires reliable measures. Within systematic literature reviews, reliability hinges on high interrater agreement during data extraction. Yet, the extraction process has been time-consuming. Efforts to accelerate the process using technology have shown limited success until generative artificial intelligence (genAI), particularly large language models (LLMs), accurately extracted variables from medical studies. Nonetheless, for psychological researchers, it remains unclear how to utilize genAI for data extraction, given the range of tested variables, the medical context, and the variability in accuracy. We systematically assessed extraction accuracy and error patterns across domains in psychology by comparing genAI-extracted and human-extracted data from 22 systematic review databases published in the Psychological Bulletin. Eight LLMs extracted 312,329 data points from 2,179 studies on 186 variables. LLM extractions achieved unacceptable accuracy on all metrics for 20% of variables. For 46% of variables, accuracy was acceptable for some metrics and unacceptable for others. LLMs reached acceptable but not high accuracy on all metrics in 15%, high but not excellent in 8%, and excellent accuracy in 12% of variables. Accuracy varied most between variables, less between systematic reviews, and least between LLMs. Moderator analyses using a hierarchical logistic regression, hierarchical linear model, and meta-analysis revealed that accuracy was higher for variables describing studies' context and moderator variables compared to variables for effect size calculation. Also, accuracy was higher in systematic reviews with more detailed variable descriptions and positively correlated with model sizes. We discuss directions for investigating ways to use genAI to accelerate data extractions while ensuring meaningful human control. (PsycInfo Database Record (c) 2025 APA, all rights reserved).
About the journal:
Psychological Bulletin publishes syntheses of research in scientific psychology. Research syntheses seek to summarize past research by drawing overall conclusions from many separate investigations that address related or identical hypotheses.
A research synthesis typically presents the authors' assessments:
- of the state of knowledge concerning the relations of interest;
- of critical assessments of the strengths and weaknesses in past research;
- of important issues that research has left unresolved, thereby directing future research so it can yield a maximum amount of new information.