Use of Large Language Models for Extracting and Analyzing Data from Heterogeneous Catalysis Literature

IF 13.1 | Zone 1, Chemistry | Q1, CHEMISTRY, PHYSICAL

ACS Catalysis | Pub Date: 2025-08-10 | DOI: 10.1021/acscatal.5c03844

Benjamin W. Walls and Suljo Linic*

ACS Catalysis, vol. 15, no. 17, pp. 14751–14763. https://pubs.acs.org/doi/10.1021/acscatal.5c03844

Citations: 0

Abstract

Extracting experimentally measured heterogeneous catalysis data from the text of research articles into structured databases would facilitate the rapid screening of catalysts with target properties and the development of models capable of directly predicting experimental outcomes. This text mining task has been transformed by the release of large language models (LLMs) capable of following general natural language instructions, which have made it possible to mine text without the need to train task-specific models or define comprehensive expression-matching rules. Here, we develop and share a text mining tool called CatMiner that extracts arbitrary user-specified structure–environment–property data using LLMs. It is agnostic to LLM choice, with both OpenAI GPT models and open-source Llama and DeepSeek models supported without modification. We benchmark the ability of CatMiner to rapidly extract useful data from abundant published literature by focusing on a case study of the oxidative coupling of methane. We explore how model choice and prompting strategies affect extraction quality. Key capabilities, including the use of domain knowledge, iterative prompting, and document-wide context handling, are shown to be critical for effective performance. We identify situations where CatMiner struggles and suggest reporting standards for the community to make catalysis data easier to extract going forward. CatMiner enables the creation of machine-readable catalysis datasets, streamlining access to experimental insights buried in the literature.
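
The workflow the abstract describes (user-specified fields, a model-agnostic call, and iterative prompting) might look roughly like the Python sketch below. It is not CatMiner's actual code or API: every name here (`extract`, `build_prompt`, `stub_llm`, the field names) is hypothetical, and the stub model returns made-up placeholder values so the sketch runs offline.

```python
import json
from typing import Callable, Dict, List

# Hypothetical names throughout; this is NOT CatMiner's actual API.
# Any function mapping a prompt string to a reply string can serve as the model,
# which is one simple way to stay agnostic to the LLM behind it.
LLM = Callable[[str], str]


def build_prompt(text: str, fields: List[str]) -> str:
    """Ask for the user-specified fields as a JSON list of records."""
    return (
        "Extract every catalyst record from the text below. "
        "Reply with only a JSON list of objects with the keys: "
        + ", ".join(fields)
        + ". Use null for values that are not reported.\n\nTEXT:\n"
        + text
    )


def extract(text: str, fields: List[str], llm: LLM, max_rounds: int = 3) -> List[Dict]:
    """Prompt the model, validate its reply, and re-prompt if validation fails."""
    prompt = build_prompt(text, fields)
    for _ in range(max_rounds):
        reply = llm(prompt)
        try:
            records = json.loads(reply)
        except json.JSONDecodeError:
            records = None
        # Accept only a list of objects carrying exactly the requested keys.
        if isinstance(records, list) and all(
            isinstance(r, dict) and set(r) == set(fields) for r in records
        ):
            return records
        # Iterative prompting: show the model its failed reply and ask again.
        prompt = (
            build_prompt(text, fields)
            + "\n\nYour previous reply was not valid. Previous reply:\n"
            + reply
        )
    return []  # give up after max_rounds invalid replies


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call; values below are made-up placeholders."""
    if "previous reply" not in prompt:
        return "Sure! Here is the data."  # first round: malformed on purpose
    return json.dumps(
        [{"catalyst": "placeholder-catalyst", "temperature_C": 800, "C2_yield_pct": None}]
    )


fields = ["catalyst", "temperature_C", "C2_yield_pct"]
records = extract("(article text would go here)", fields, stub_llm)
print(records)
```

Swapping `stub_llm` for a wrapper around a hosted or local model would leave `extract` unchanged, which is the sense in which a design like this is independent of model choice. Domain knowledge and document-wide context, which the abstract also identifies as critical, are not modeled here.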

Source journal

ACS Catalysis (CHEMISTRY, PHYSICAL)

CiteScore: 20.80
Self-citation rate: 6.20%
Articles published: 1253
Review time: 1.5 months

About the journal: ACS Catalysis is an esteemed journal that publishes original research in the fields of heterogeneous catalysis, molecular catalysis, and biocatalysis. It offers broad coverage across diverse areas such as life sciences, organometallics and synthesis, photochemistry and electrochemistry, drug discovery and synthesis, materials science, environmental protection, polymer discovery and synthesis, and energy and fuels. The scope of the journal is to showcase innovative work in various aspects of catalysis. This includes new reactions and novel synthetic approaches utilizing known catalysts, the discovery or modification of new catalysts, elucidation of catalytic mechanisms through cutting-edge investigations, practical enhancements of existing processes, as well as conceptual advances in the field. Contributions to ACS Catalysis can encompass both experimental and theoretical research focused on catalytic molecules, macromolecules, and materials that exhibit catalytic turnover.