Gunn E. Vist, Trine Husøy, Michael Guy Diemar, Hubert Dirven, Erwin L. Roggen, Maria E. Kalyva
Computer Methods and Programs in Biomedicine, volume 270, Article 108962. Published 2025-07-12. DOI: 10.1016/j.cmpb.2025.108962. Source: https://www.sciencedirect.com/science/article/pii/S0169260725003797
ExtractPDF: A data extraction tool for scientific papers applied to a systematic scoping review in public health
Background and Objectives
Systematic reviews are widely used to identify the evidence and obtain an overview of the available knowledge on questions related to public health and medical topics. They can summarize all the available data and can inform knowledge-based decisions about policy, practice, and academic research. Conducting systematic reviews is often time-consuming and costly.
Methods
We developed a command-line tool in R that automatically extracts data from full-text scientific papers. ExtractPDF is a data extraction tool that provides a reliable computational workflow for extracting words or combinations of words from large numbers of Portable Document Format (PDF) files.
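The word and phrase matching described above can be illustrated with a minimal sketch. This is not the authors' R code: it is written in Python, it assumes the text of each PDF has already been extracted to a plain string by some PDF-to-text step, and the names `find_terms`, `build_table`, `documents`, and `categories` are hypothetical.

```python
import re
from collections import Counter


def find_terms(text, terms):
    """Count case-insensitive whole-word matches of each term in text.

    A term may be a single word or a combination of words (a phrase).
    """
    # Collapse runs of whitespace so phrases broken across lines still match.
    normalized = re.sub(r"\s+", " ", text)
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        n = len(re.findall(pattern, normalized, flags=re.IGNORECASE))
        if n:
            counts[term] = n
    return counts


def build_table(documents, categories):
    """Return one row per (file, information type) with the matched terms.

    documents:  mapping of file name -> already-extracted text (assumption)
    categories: mapping of information type -> list of terms to look for
    """
    rows = []
    for filename, text in documents.items():
        for category, terms in categories.items():
            matches = dict(find_terms(text, terms))
            rows.append({"file": filename, "category": category, "matches": matches})
    return rows


if __name__ == "__main__":
    # Illustrative input only; not data from the study.
    documents = {"a.pdf": "Risk assessment of\nin vitro models. In vitro assays were used."}
    categories = {"method": ["in vitro", "in vivo"], "topic": ["risk assessment"]}
    for row in build_table(documents, categories):
        print(row)
```

Each row pairs one file with one type of information, which mirrors the per-type, per-file tables the abstract mentions; how ExtractPDF itself structures its output is not specified here.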
Results
The software was applied to extract information from 299 papers that had been screened and included in a published systematic scoping review in the field of risk assessment in public health. The software outputs tables of extracted information, organized by type of information of interest and by PDF file. During the data extraction stage, the tables served as a second reviewer alongside a human reviewer, assisting with and/or validating the extracted data items.
Conclusions
The ExtractPDF tool has a novel pipeline architecture that automates the extraction of information from unstructured formats such as PDF files. ExtractPDF helped expedite the data extraction stage and reduce both the human resources required and the errors made. The tool's performance and reliability were found to be very good, with average metrics of 0.89 for precision, 0.92 for recall, 0.86 for accuracy, and 0.91 for F1-score.
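For readers unfamiliar with the four reported metrics, the sketch below shows their standard definitions in terms of true/false positives and negatives. The abstract does not state how the authors computed or averaged them, so the function name `classification_metrics` and the example counts are assumptions for illustration only, not values from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard precision, recall, accuracy, and F1 from confusion counts.

    tp/fp/fn/tn are counts of true positives, false positives,
    false negatives, and true negatives. Ratios with a zero
    denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts, e.g. tool output compared against a human reviewer.
    print(classification_metrics(tp=8, fp=2, fn=2, tn=8))
```

When metrics are averaged over several items, the averaged F1-score generally differs from the F1 computed from the averaged precision and recall, so the four reported averages need not satisfy the formula exactly.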
About the journal:
To encourage the development of formal computing methods, and their application in biomedical research and medical practice, by illustration of fundamental principles in biomedical informatics research; to stimulate basic research into application software design; to report the state of research of biomedical information processing projects; to report new computer methodologies applied in biomedical areas; to promote the eventual distribution of demonstrable software to avoid duplication of effort; to provide a forum for discussion and improvement of existing software; to optimize contact between national organizations and regional user groups by promoting an international exchange of information on formal methods, standards and software in biomedicine.
Computer Methods and Programs in Biomedicine covers computing methodology and software systems derived from computing science for implementation in all aspects of biomedical research and medical practice. It is designed to serve: biochemists; biologists; geneticists; immunologists; neuroscientists; pharmacologists; toxicologists; clinicians; epidemiologists; psychiatrists; psychologists; cardiologists; chemists; (radio)physicists; computer scientists; programmers and systems analysts; biomedical, clinical, electrical and other engineers; teachers of medical informatics and users of educational software.