Data extraction by generative artificial intelligence: Assessing determinants of accuracy using human-extracted data from systematic review databases.

Impact factor: 19.8 · CAS Zone 1 (Psychology) · JCR Q1 (Psychology)
Thorben Jansen, Lucas W. Liebenow, Ute Mertens, Fabian T. C. Schmidt, Julian F. Lohmann, Johanna Fleckenstein, Jennifer Meyer
DOI: 10.1037/bul0000501 (https://doi.org/10.1037/bul0000501)
Journal: Psychological Bulletin, pp. 1280–1306
Publication date: 2025-12-15
Citations: 0

Abstract

Psychological science requires reliable measures. Within systematic literature reviews, reliability hinges on high interrater agreement during data extraction. Yet, the extraction process has been time-consuming. Efforts to accelerate the process using technology have shown limited success until generative artificial intelligence (genAI), particularly large language models (LLMs), accurately extracted variables from medical studies. Nonetheless, for psychological researchers, it remains unclear how to utilize genAI for data extraction, given the range of tested variables, the medical context, and the variability in accuracy. We systematically assessed extraction accuracy and error patterns across domains in psychology by comparing genAI-extracted and human-extracted data from 22 systematic review databases published in the Psychological Bulletin. Eight LLMs extracted 312,329 data points from 2,179 studies on 186 variables. LLM extractions achieved unacceptable accuracy on all metrics for 20% of variables. For 46% of variables, accuracy was acceptable for some metrics and unacceptable for others. LLMs reached acceptable but not high accuracy on all metrics in 15%, high but not excellent in 8%, and excellent accuracy in 12% of variables. Accuracy varied most between variables, less between systematic reviews, and least between LLMs. Moderator analyses using a hierarchical logistic regression, hierarchical linear model, and meta-analysis revealed that accuracy was higher for variables describing studies' context and moderator variables compared to variables for effect size calculation. Also, accuracy was higher in systematic reviews with more detailed variable descriptions and positively correlated with model sizes. We discuss directions for investigating ways to use genAI to accelerate data extractions while ensuring meaningful human control. (PsycInfo Database Record (c) 2025 APA, all rights reserved).
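The core accuracy check described in the abstract — comparing each LLM-extracted value against the corresponding human-extracted value and scoring agreement per variable — can be sketched as follows. This is an illustrative reconstruction, not the authors' actual pipeline; the function name, matching rule (case-insensitive exact match), and sample data are assumptions.

```python
# Hypothetical sketch of per-variable agreement between human- and
# LLM-extracted data points, as described in the abstract. The exact
# matching rule used by the study may differ (e.g. numeric tolerance).
from collections import defaultdict

def agreement_by_variable(records):
    """records: iterable of (variable, human_value, llm_value) tuples.
    Returns {variable: fraction of exact matches}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for variable, human, llm in records:
        totals[variable] += 1
        # Simple normalization; a real pipeline would handle types/units.
        if str(human).strip().lower() == str(llm).strip().lower():
            hits[variable] += 1
    return {v: hits[v] / totals[v] for v in totals}

# Illustrative data, not from the study.
sample = [
    ("sample_size", "120", "120"),
    ("sample_size", "85", "85"),
    ("country", "Germany", "germany"),
    ("effect_size_d", "0.42", "0.24"),  # digit transposition, a plausible LLM error
]
print(agreement_by_variable(sample))
# {'sample_size': 1.0, 'country': 1.0, 'effect_size_d': 0.0}
```

Per-variable rates like these would then feed the moderator analyses the abstract mentions (e.g. whether context variables are extracted more accurately than effect-size variables).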
Source journal: Psychological Bulletin (Medicine – Psychology)
CiteScore: 33.60
Self-citation rate: 0.90%
Articles per year: 21
Journal description: Psychological Bulletin publishes syntheses of research in scientific psychology. Research syntheses seek to summarize past research by drawing overall conclusions from many separate investigations that address related or identical hypotheses. A research synthesis typically presents the authors' assessments of the state of knowledge concerning the relations of interest; of the strengths and weaknesses in past research; and of important issues that research has left unresolved, thereby directing future research so it can yield a maximum amount of new information.