ExtractPDF: A data extraction tool for scientific papers applied to a systematic scoping review in public health

Gunn E. Vist, Trine Husøy, Michael Guy Diemar, Hubert Dirven, Erwin L. Roggen, Maria E. Kalyva
Computer Methods and Programs in Biomedicine, volume 270, Article 108962. Published 2025-07-12.
DOI: 10.1016/j.cmpb.2025.108962
https://www.sciencedirect.com/science/article/pii/S0169260725003797
Impact factor: 4.9. CAS zone 2 (Medicine). JCR Q1 (Computer Science, Interdisciplinary Applications).
Citations: 0

Abstract

Background and Objectives

Systematic reviews are widely used to identify the evidence and get an overview of the available knowledge for various questions related to public health and medical topics. They can provide a summary of all the available data and can be used to make knowledge-based decisions about policy, practice, and academic research. The conduct of systematic reviews can often be time‐consuming and costly.

Methods

We developed ExtractPDF, a command-line tool written in R that automatically extracts data from full-text scientific papers. It provides a reliable computational workflow for extracting words or combinations of words from large numbers of portable document format (PDF) files.
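The core idea of extracting words or combinations of words can be illustrated with a minimal sketch. This is not the authors' R implementation: it is written in Python, it assumes the text has already been extracted from the PDF, and the function name `count_terms` and the sample terms are hypothetical.

```python
import re
from collections import Counter

def count_terms(text, terms):
    """Count case-insensitive whole-word occurrences of each term or phrase."""
    # Collapse runs of whitespace so phrases split across lines still match.
    normalized = re.sub(r"\s+", " ", text)
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[term] = len(re.findall(pattern, normalized, flags=re.IGNORECASE))
    return counts

# Hypothetical text standing in for the content of one extracted PDF.
sample = "Risk assessment of chemicals.\nThe risk\nassessment used in vitro data."
result = count_terms(sample, ["risk assessment", "in vitro", "in vivo"])
print(dict(result))
```

Whitespace normalization matters here because text extracted from PDFs commonly breaks phrases across lines, which would otherwise hide multi-word matches.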

Results

The software was applied to 299 papers that had been screened and included in a published systematic scoping review in the field of risk assessment in public health. For each PDF file, the software outputs tables of extracted information, one per type of information of interest. During the data extraction stage, these tables served as a second reviewer alongside a human reviewer, assisting with and/or validating the extracted data items.
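A table of this shape, with one entry per information type per file, might be assembled as in the sketch below. It is an illustration only, in Python rather than the authors' R code; the file names, the information types (`study_type`, `population`), and the function `build_extraction_table` are all hypothetical.

```python
import re

def build_extraction_table(documents, categories):
    """Return one row per (file, information type) listing the terms found."""
    rows = []
    for filename, text in documents.items():
        # Normalize whitespace and case once per document.
        normalized = re.sub(r"\s+", " ", text).lower()
        for info_type, terms in categories.items():
            found = [
                term for term in terms
                if re.search(r"\b" + re.escape(term.lower()) + r"\b", normalized)
            ]
            rows.append({"file": filename, "info_type": info_type, "found": found})
    return rows

# Hypothetical inputs: extracted text per file, and search terms per information type.
documents = {
    "paper_001.pdf": "An in vitro study of exposure in adults.",
    "paper_002.pdf": "We report in vivo results from a rat model.",
}
categories = {
    "study_type": ["in vitro", "in vivo"],
    "population": ["adults", "children"],
}
table = build_extraction_table(documents, categories)
for row in table:
    print(row["file"], row["info_type"], ", ".join(row["found"]) or "(none)")
```

Rows with an empty `found` list are kept deliberately, so that a human reviewer can see which items the automated pass did not locate and check them by hand.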

Conclusions

The ExtractPDF tool has a novel pipeline architecture that automates the extraction of information from unstructured formats such as PDF files. It helped expedite the data extraction stage and reduced both the human resources required and the number of errors. Its performance and reliability were found to be very good, with average scores of 0.89 for precision, 0.92 for recall, 0.86 for accuracy, and 0.91 for F1-score.
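The four reported metrics follow their standard definitions over true/false positives and negatives. The sketch below shows those definitions in Python on made-up labels; it assumes the human reviewer's decisions are the reference, and the label lists are hypothetical, not data from the study.

```python
def confusion_counts(predicted, reference):
    """Count TP, FP, FN, TN for paired boolean decisions."""
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)
    fp = sum(1 for p, r in pairs if p and not r)
    fn = sum(1 for p, r in pairs if not p and r)
    tn = sum(1 for p, r in pairs if not p and not r)
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Standard precision, recall, accuracy, and F1 (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Hypothetical example: tool output vs. human reviewer for eight items.
tool = [True, True, True, False, False, True, False, True]
human = [True, True, False, False, True, True, False, True]
print(classification_metrics(*confusion_counts(tool, human)))
```

Note that when metrics are averaged over several items, as the abstract's wording suggests, the average F1 need not equal the F1 computed from the average precision and recall.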
Source journal
Computer Methods and Programs in Biomedicine (Engineering and Technology: Biomedical Engineering)
CiteScore: 12.30
Self-citation rate: 6.60%
Articles published: 601
Review time: 135 days
Journal description: To encourage the development of formal computing methods, and their application in biomedical research and medical practice, by illustration of fundamental principles in biomedical informatics research; to stimulate basic research into application software design; to report the state of research of biomedical information processing projects; to report new computer methodologies applied in biomedical areas; the eventual distribution of demonstrable software to avoid duplication of effort; to provide a forum for discussion and improvement of existing software; to optimize contact between national organizations and regional user groups by promoting an international exchange of information on formal methods, standards and software in biomedicine. Computer Methods and Programs in Biomedicine covers computing methodology and software systems derived from computing science for implementation in all aspects of biomedical research and medical practice. It is designed to serve: biochemists; biologists; geneticists; immunologists; neuroscientists; pharmacologists; toxicologists; clinicians; epidemiologists; psychiatrists; psychologists; cardiologists; chemists; (radio)physicists; computer scientists; programmers and systems analysts; biomedical, clinical, electrical and other engineers; teachers of medical informatics and users of educational software.