Automating Data Extraction From Scientific Literature and General PDF Files Using Large Language Models and KNIME: An Application in Toxicology

Impact factor: 27.0. Journal ranking: Zone 2, Chemistry. JCR: Q1, Chemistry, Multidisciplinary.

Wiley Interdisciplinary Reviews: Computational Molecular Science. Pub Date: 2025-09-18. DOI: 10.1002/wcms.70047

José Teófilo Moreira-Filho, Dhruv Ranganath, Ricardo S. Tieghi, Robert Patton, Vicki Sutherland, Charles Schmitt, Andrew A. Rooney, Jennifer Fostel, Vickie R. Walker, Trey Saddler, David Reif, Kamel Mansouri, Nicole Kleinstreuer

Volume 15, Issue 5. Journal article. Full text: https://wires.onlinelibrary.wiley.com/doi/10.1002/wcms.70047

Citations: 0

Abstract

The large and steadily increasing volume of scientific publications makes it difficult to access and use the data they contain, because that data is unstructured. Toxicology, in particular, depends on structured data from diverse study types for study evaluation, weight-of-evidence chemical assessments, and validation of new approach methodologies (NAMs). Manual data extraction is time- and labor-intensive. This work presents an automated data extraction workflow using large language models (LLMs) within the KNIME platform. The workflow integrates document parsing tools with LLMs to extract variables from scientific publications and general PDF files. Two execution modes are available: text mode and image mode. Text mode applies tools for extracting text and tables, while image mode uses multimodal LLMs to process non-linear layouts and graphical content. The workflow achieves 81.14% accuracy in text mode for scientific publications and up to 98.54% in image mode for general PDF files. The KNIME platform provides a user-friendly interface that allows non-experts to use advanced data extraction methods. This automated approach facilitates toxicological research by improving the retrieval of structured data. By democratizing access to LLM-powered workflows, it paves the way for significant advances in knowledge synthesis to support biomedical research.
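The two execution modes and the reported accuracy figures can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the abstract does not describe the workflow's internals or define how accuracy is scored, and the actual workflow runs in KNIME rather than as a Python script. The names `run_extraction`, `field_accuracy`, the stub extractors, and the sample variables are hypothetical. Accuracy is taken here to be the share of variables whose extracted value matches a manually curated reference after case and whitespace normalization.

```python
from typing import Callable, Dict, List

# Hypothetical type: an extractor takes a document path and the variable
# names to extract, and returns one value per variable.
Extractor = Callable[[str, List[str]], Dict[str, str]]


def run_extraction(pdf_path: str, variables: List[str], mode: str,
                   extractors: Dict[str, Extractor]) -> Dict[str, str]:
    """Dispatch to the extractor registered for the requested mode."""
    if mode not in extractors:
        raise ValueError(
            f"unknown mode {mode!r}; expected one of {sorted(extractors)}")
    return extractors[mode](pdf_path, variables)


def field_accuracy(extracted: Dict[str, str],
                   reference: Dict[str, str]) -> float:
    """Percentage of reference variables whose extracted value matches.

    Matching is case- and whitespace-insensitive string equality. This is
    an assumed scoring rule, not one taken from the paper.
    """
    if not reference:
        raise ValueError("reference must contain at least one variable")

    def norm(value: str) -> str:
        return " ".join(value.split()).lower()

    correct = sum(
        1 for name, expected in reference.items()
        if name in extracted and norm(extracted[name]) == norm(expected)
    )
    return 100.0 * correct / len(reference)


# Stub extractors standing in for the real steps (document parsing plus a
# text LLM, or a multimodal LLM applied to page images).
def text_mode_stub(pdf_path: str, variables: List[str]) -> Dict[str, str]:
    return {"species": "Rat", "dose": "10 mg/kg", "route": "oral"}


def image_mode_stub(pdf_path: str, variables: List[str]) -> Dict[str, str]:
    return {"species": "rat", "dose": "10 mg/kg", "route": "dermal"}


EXTRACTORS: Dict[str, Extractor] = {
    "text": text_mode_stub,
    "image": image_mode_stub,
}

# Manually curated reference values for one made-up study.
REFERENCE = {"species": "rat", "dose": "10  mg/kg", "route": "oral"}

if __name__ == "__main__":
    for mode in ("text", "image"):
        result = run_extraction("study.pdf", list(REFERENCE), mode, EXTRACTORS)
        print(mode, round(field_accuracy(result, REFERENCE), 2))
```

Passing the extractors in as a dictionary keeps the dispatch independent of any particular parser or model, so a real text-mode or image-mode implementation could replace a stub without changing the scoring code.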





Source journal

Wiley Interdisciplinary Reviews: Computational Molecular Science (Chemistry, Multidisciplinary; Mathematical & Computational Biology)

CiteScore: 28.90

Self-citation rate: 1.80%

Review time: 6-12 weeks

About the journal: The computational molecular sciences harness rigorous chemical and physical theories, employing computer-based modeling, specialized hardware, software development, algorithm design, and database management to explore and illuminate every facet of the molecular sciences. These interdisciplinary approaches form a bridge between chemistry, biology, and materials science, establishing connections with adjacent application-driven fields in both chemistry and biology. WIREs Computational Molecular Science is a platform that comprehensively reviews and spotlights research from these dynamic and interconnected fields.