Use of Large Language Models for Extracting and Analyzing Data from Heterogeneous Catalysis Literature

Impact factor 13.1 | Zone 1, Chemistry | JCR Q1, Chemistry, Physical
Benjamin W. Walls and Suljo Linic*
Journal: ACS Catalysis, vol. 15, no. 17, pp. 14751–14763
Published: 2025-08-10
Publication type: Journal Article
DOI: 10.1021/acscatal.5c03844
URL: https://pubs.acs.org/doi/10.1021/acscatal.5c03844
Citations: 0

Abstract

Extracting experimentally measured heterogeneous catalysis data from the text of research articles into structured databases would facilitate the rapid screening of catalysts with target properties and the development of models capable of directly predicting experimental outcomes. This text mining task has been transformed by the release of large language models (LLMs) capable of following general natural language instructions, which have made it possible to mine text without training task-specific models or defining comprehensive expression-matching rules. Here, we develop and share a text mining tool called CatMiner that extracts arbitrary user-specified structure–environment–property data using LLMs. It is agnostic to LLM choice, with both OpenAI GPT models and open-source Llama and DeepSeek models supported without modification. We benchmark the ability of CatMiner to rapidly extract useful data from abundant published literature by focusing on a case study of the oxidative coupling of methane. We explore how model choice and prompting strategies affect extraction quality. Key capabilities, including the use of domain knowledge, iterative prompting, and document-wide context handling, are shown to be critical for effective performance. We identify situations where CatMiner struggles and suggest reporting standards for the community to make catalysis data easier to extract going forward. CatMiner enables the creation of machine-readable catalysis datasets, streamlining access to experimental insights buried in the literature.
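
The abstract describes extraction of user-specified fields with an interchangeable LLM. The following is a minimal sketch of that general pattern, not CatMiner's actual code or API: the field names, the prompt wording, the `extract` helper, and the `fake_llm` stand-in are all assumptions made for illustration, and the sample passage is invented rather than taken from any paper.

```python
import json
from typing import Callable, Dict, List

# Hypothetical field names; a real user would specify their own schema.
FIELDS = ["catalyst", "temperature_C", "c2_yield_percent"]


def build_prompt(text: str, fields: List[str]) -> str:
    """Ask the model for one JSON object with exactly the requested keys."""
    keys = ", ".join(fields)
    return (
        "Extract the following fields from the passage and answer with a "
        f"single JSON object using exactly these keys: {keys}. "
        "Use null for any value the passage does not report.\n\n"
        f"Passage:\n{text}"
    )


def extract(text: str, fields: List[str],
            llm: Callable[[str], str]) -> Dict[str, object]:
    """Run one extraction; `llm` is any prompt -> completion callable,
    which is what keeps the sketch independent of the model provider."""
    reply = llm(build_prompt(text, fields))
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = {}  # unparseable reply: treat every field as missing
    if not isinstance(data, dict):
        data = {}
    # Keep only the requested keys; anything absent becomes None.
    return {field: data.get(field) for field in fields}


def fake_llm(prompt: str) -> str:
    """Offline stand-in for a real model call (assumption for this sketch)."""
    return json.dumps({
        "catalyst": "Mn-Na2WO4/SiO2",
        "temperature_C": 800,
        "c2_yield_percent": None,
        "extra": "ignored",
    })


# Invented example passage, used only to exercise the code path.
passage = "Over Mn-Na2WO4/SiO2 at 800 °C, methane was converted to C2 products."
record = extract(passage, FIELDS, fake_llm)
print(record)
```

Swapping `fake_llm` for a function that calls a hosted or local model would leave `extract` unchanged. Iterative prompting and document-wide context, which the abstract reports as critical, are not covered by this single-pass sketch.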

Source journal: ACS Catalysis (Chemistry, Physical)
CiteScore: 20.80
Self-citation rate: 6.20%
Articles published: 1253
Review time: 1.5 months
Journal description: ACS Catalysis publishes original research in heterogeneous catalysis, molecular catalysis, and biocatalysis. It offers broad coverage across diverse areas such as life sciences, organometallics and synthesis, photochemistry and electrochemistry, drug discovery and synthesis, materials science, environmental protection, polymer discovery and synthesis, and energy and fuels. The journal's scope is to showcase innovative work in various aspects of catalysis. This includes new reactions and novel synthetic approaches utilizing known catalysts, the discovery or modification of new catalysts, elucidation of catalytic mechanisms through cutting-edge investigations, practical enhancements of existing processes, and conceptual advances in the field. Contributions to ACS Catalysis can encompass both experimental and theoretical research focused on catalytic molecules, macromolecules, and materials that exhibit catalytic turnover.