从专利文献中提取高质量化学反应数据集的大语言模型的适用性

IF 7.1 2区 化学 Q1 CHEMISTRY, MULTIDISCIPLINARY
Sarveswara Rao Vangala, Sowmya Ramaswamy Krishnan, Navneet Bung, Dhandapani Nandagopal, Gomathi Ramasamy, Satyam Kumar, Sridharan Sankaran, Rajgopal Srinivasan, Arijit Roy
{"title":"从专利文献中提取高质量化学反应数据集的大语言模型的适用性","authors":"Sarveswara Rao Vangala,&nbsp;Sowmya Ramaswamy Krishnan,&nbsp;Navneet Bung,&nbsp;Dhandapani Nandagopal,&nbsp;Gomathi Ramasamy,&nbsp;Satyam Kumar,&nbsp;Sridharan Sankaran,&nbsp;Rajgopal Srinivasan,&nbsp;Arijit Roy","doi":"10.1186/s13321-024-00928-8","DOIUrl":null,"url":null,"abstract":"<div><p>With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, a challenge for chemists is the synthesis of such molecules. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extraction of high-quality chemical reaction data from patent documents. A comparative study on the same set of patents from an earlier study showed that the proposed automated approach can enhance the current datasets by addition of 26% new reactions. Several challenges were identified during reaction mining, and for some of them alternative solutions were proposed. A detailed analysis was also performed wherein several wrong entries were identified in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in future.</p><p><b>Scientific contribution</b></p><p>In this work we evaluated the suitability of large language models for mining a high-quality chemical reaction dataset from patent literature. We showed that the proposed approach can significantly improve the quantity of the reaction database by identifying more chemical reactions and improve the quality of the reaction database by correcting previous errors/false positives.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1000,"publicationDate":"2024-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00928-8","citationCount":"0","resultStr":"{\"title\":\"Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature\",\"authors\":\"Sarveswara Rao Vangala,&nbsp;Sowmya Ramaswamy Krishnan,&nbsp;Navneet Bung,&nbsp;Dhandapani Nandagopal,&nbsp;Gomathi Ramasamy,&nbsp;Satyam Kumar,&nbsp;Sridharan Sankaran,&nbsp;Rajgopal Srinivasan,&nbsp;Arijit Roy\",\"doi\":\"10.1186/s13321-024-00928-8\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, a challenge for chemists is the synthesis of such molecules. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extraction of high-quality chemical reaction data from patent documents. A comparative study on the same set of patents from an earlier study showed that the proposed automated approach can enhance the current datasets by addition of 26% new reactions. Several challenges were identified during reaction mining, and for some of them alternative solutions were proposed. A detailed analysis was also performed wherein several wrong entries were identified in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in future.</p><p><b>Scientific contribution</b></p><p>In this work we evaluated the suitability of large language models for mining a high-quality chemical reaction dataset from patent literature. We showed that the proposed approach can significantly improve the quantity of the reaction database by identifying more chemical reactions and improve the quality of the reaction database by correcting previous errors/false positives.</p></div>\",\"PeriodicalId\":617,\"journal\":{\"name\":\"Journal of Cheminformatics\",\"volume\":\"16 1\",\"pages\":\"\"},\"PeriodicalIF\":7.1000,\"publicationDate\":\"2024-11-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00928-8\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Cheminformatics\",\"FirstCategoryId\":\"92\",\"ListUrlMain\":\"https://link.springer.com/article/10.1186/s13321-024-00928-8\",\"RegionNum\":2,\"RegionCategory\":\"化学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"CHEMISTRY, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Cheminformatics","FirstCategoryId":"92","ListUrlMain":"https://link.springer.com/article/10.1186/s13321-024-00928-8","RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
引用次数: 0

摘要

随着人工智能(AI)时代的到来,人们现在有可能从以前未曾探索过的化学空间中设计出多种多样的新型分子。然而,化学家面临的一个挑战是如何合成这些分子。最近,有人尝试开发用于逆合成预测的人工智能模型,但这有赖于高质量训练数据集的可用性。在这项工作中,我们探索了大语言模型(LLM)从专利文件中提取高质量化学反应数据的适用性。对早期研究中的同一组专利进行的比较研究表明,所提出的自动化方法可以增加 26% 的新反应,从而增强当前数据集的能力。在反应挖掘过程中发现了一些挑战,并针对其中一些挑战提出了替代解决方案。此外,还进行了详细的分析,发现了之前数据集中的几个错误条目。在更大的专利数据集上使用所提出的管道提取反应,可以提高未来合成预测模型的准确性和效率。科学贡献 在这项工作中,我们评估了大语言模型从专利文献中挖掘高质量化学反应数据集的适用性。结果表明,所提出的方法可以通过识别更多的化学反应来显著提高反应数据库的数量,并通过纠正以前的错误/假阳性反应来提高反应数据库的质量。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature

With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, a challenge for chemists is the synthesis of such molecules. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extraction of high-quality chemical reaction data from patent documents. A comparative study on the same set of patents from an earlier study showed that the proposed automated approach can enhance the current datasets by addition of 26% new reactions. Several challenges were identified during reaction mining, and for some of them alternative solutions were proposed. A detailed analysis was also performed wherein several wrong entries were identified in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in future.

Scientific contribution

In this work we evaluated the suitability of large language models for mining a high-quality chemical reaction dataset from patent literature. We showed that the proposed approach can significantly improve the quantity of the reaction database by identifying more chemical reactions and improve the quality of the reaction database by correcting previous errors/false positives.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Journal of Cheminformatics
Journal of Cheminformatics CHEMISTRY, MULTIDISCIPLINARY-COMPUTER SCIENCE, INFORMATION SYSTEMS
CiteScore
14.10
自引率
7.00%
发文量
82
审稿时长
3 months
期刊介绍: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling, chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases, computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信