José Teófilo Moreira-Filho, Dhruv Ranganath, Ricardo S. Tieghi, Robert Patton, Vicki Sutherland, Charles Schmitt, Andrew A. Rooney, Jennifer Fostel, Vickie R. Walker, Trey Saddler, David Reif, Kamel Mansouri, Nicole Kleinstreuer
Wiley Interdisciplinary Reviews: Computational Molecular Science, volume 15, issue 5. Published 2025-09-18. DOI: 10.1002/wcms.70047. https://wires.onlinelibrary.wiley.com/doi/10.1002/wcms.70047
Automating Data Extraction From Scientific Literature and General PDF Files Using Large Language Models and KNIME: An Application in Toxicology
The large and steadily increasing volume of scientific publications presents a challenge in accessing and utilizing data due to their unstructured nature. Toxicology, in particular, depends on structured data from diverse study types for study evaluation, weight-of-evidence chemical assessments, and validation of new approach methodologies (NAMs). Manual data extraction is time- and labor-intensive. This work presents an automated data extraction workflow using large language models (LLMs) within the KNIME platform. The workflow integrates document parsing tools with LLMs to extract variables from scientific publications and general PDF files. Two execution modes are available: text mode and image mode. Text mode applies tools for extracting text and tables, while image mode uses multimodal LLMs to process non-linear layouts and graphical content. The workflow achieves 81.14% accuracy in text mode for scientific publications and up to 98.54% in image mode for general PDF files. The KNIME platform ensures accessibility through a user-friendly interface, allowing non-experts to use advanced data extraction methods. This automated approach facilitates toxicological research by improving the retrieval of structured data. By democratizing access to LLM-powered workflows, this approach paves the way for significant advancements in knowledge synthesis to support biomedical research.
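As a rough illustration of what a field-level accuracy figure such as those quoted above could mean, the following Python sketch compares extracted variables against a manually curated reference. The scoring rule (exact match after case and whitespace normalization), the function names, and the example values are all assumptions made for illustration. The abstract does not state how accuracy was calculated, and the published workflow runs in KNIME rather than in this code.

```python
def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting
    differences are not counted as extraction errors."""
    return " ".join(value.lower().split())


def field_accuracy(extracted: dict[str, str], reference: dict[str, str]) -> float:
    """Return the percentage of reference fields whose extracted value
    matches the curated value after normalization.

    Assumed scoring rule: a missing field counts as incorrect.
    """
    if not reference:
        raise ValueError("reference must contain at least one field")
    correct = sum(
        1
        for name, expected in reference.items()
        if normalize(extracted.get(name, "")) == normalize(expected)
    )
    return 100.0 * correct / len(reference)


# Hypothetical example data, not taken from the paper.
reference = {
    "species": "Rat",
    "route": "oral gavage",
    "dose_unit": "mg/kg/day",
    "duration": "28 days",
}
extracted = {
    "species": "rat",            # matches after lowercasing
    "route": "Oral  gavage",     # matches after whitespace collapse
    "dose_unit": "mg/kg",        # mismatch: the "/day" part is missing
    "duration": "28 days",       # exact match
}

score = field_accuracy(extracted, reference)
print(f"{score:.2f}%")  # prints "75.00%"
```

Under this assumed rule, three of the four hypothetical fields match, so the score is 75%. A real evaluation would likely need looser matching for numeric values and units, which this sketch does not attempt.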
About the journal:
Computational molecular sciences harness the power of rigorous chemical and physical theories, employing computer-based modeling, specialized hardware, software development, algorithm design, and database management to explore and illuminate every facet of molecular sciences. These interdisciplinary approaches form a bridge between chemistry, biology, and materials sciences, establishing connections with adjacent application-driven fields in both chemistry and biology. WIREs Computational Molecular Science stands as a platform to comprehensively review and spotlight research from these dynamic and interconnected fields.