AutoTemplate: enhancing chemical reaction datasets for machine learning applications in organic chemistry

IF 7.1; Zone 2 (Chemistry); Q1; CHEMISTRY, MULTIDISCIPLINARY

Journal of Cheminformatics, Pub Date: 2024-06-27, DOI: 10.1186/s13321-024-00869-2

Lung-Yi Chen, Yi-Pei Li


Citations: 0

Abstract

This paper presents AutoTemplate, a data preprocessing protocol that addresses the need for high-quality chemical reaction datasets in machine learning applications in organic chemistry. Recent advances in artificial intelligence have expanded the use of machine learning in chemistry, particularly in yield prediction, retrosynthesis, and reaction condition prediction. However, the effectiveness of these models hinges on the integrity of chemical reaction datasets, which are often plagued by problems such as missing reactants, incorrect atom mappings, and outright erroneous reactions. AutoTemplate refines these datasets in two stages. The first stage extracts meaningful reaction transformation rules and formulates generic reaction templates using a simplified SMARTS representation; this simplification broadens the applicability of the templates across various chemical reactions. The second stage is template-guided reaction curation, in which these templates are systematically applied to validate and correct the reaction data. This process fills in missing reactant information, rectifies atom-mapping errors, and eliminates incorrect data entries. A standout feature of AutoTemplate is its ability to identify and correct false chemical reactions concurrently. It operates on the premise that most reactions in a dataset are accurate and uses these as templates to guide the correction of flawed entries. The protocol is demonstrated across a range of chemical reactions, where it significantly improves dataset quality. This provides a more robust foundation for developing reliable machine learning models in chemistry, thereby improving the accuracy of forward and retrosynthetic predictions. AutoTemplate thus bridges a gap in the preprocessing of chemical reaction datasets and facilitates more precise and efficient machine learning applications in organic synthesis.
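The two-stage idea in the abstract can be illustrated with a deliberately simplified sketch. It is not AutoTemplate's implementation: the real protocol works on atom-mapped reactions and simplified SMARTS templates, whereas this toy replaces structures with hypothetical class labels (such as "amine") and treats a template as a recurring pair of reactant classes and product class. The function names `extract_templates` and `curate` and the `min_support` threshold are assumptions made for illustration, and atom-mapping correction is not modelled at all.

```python
from collections import Counter


def extract_templates(reactions, min_support=2):
    """Stage 1 (toy): keep transformations seen at least `min_support` times.

    This encodes the premise that most entries are correct, so frequently
    recurring transformations can be trusted as templates.
    """
    counts = Counter(reactions)
    return {rxn for rxn, n in counts.items() if n >= min_support}


def curate(reactions, templates):
    """Stage 2 (toy): validate, repair, or drop each reaction using templates."""
    curated = []
    for reactants, product in reactions:
        if (reactants, product) in templates:
            curated.append((reactants, product))  # matches a trusted template
            continue
        # Repair: a trusted template gives the same product and its reactant
        # set strictly contains the recorded one, so a reactant is missing.
        fixes = [t for t in templates if t[1] == product and reactants < t[0]]
        if len(fixes) == 1:
            curated.append(fixes[0])
        # Otherwise the entry is ambiguous or unsupported and is dropped.
    return curated


# Hypothetical toy dataset: (reactant classes, product class).
amide = (frozenset({"carboxylic acid", "amine"}), "amide")
ester = (frozenset({"carboxylic acid", "alcohol"}), "ester")
data = [
    amide, amide, amide, ester, ester,
    (frozenset({"amine"}), "amide"),   # missing reactant, repairable
    (frozenset({"alkane"}), "ester"),  # unsupported entry, dropped
]

templates = extract_templates(data)
cleaned = curate(data, templates)
```

In this toy, the incomplete amide entry is completed from the trusted amide template, while the alkane entry matches no template and is discarded. Requiring exactly one candidate fix is a conservative design choice for the sketch; it avoids guessing when several templates could explain the same product.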


The proposed automated preprocessing tool for chemical reaction data is designed to identify errors in chemical databases. Specifically, when an error involves atom mapping or missing reactant types, it can be corrected systematically using reaction templates, ultimately improving the overall quality of the database.



Source journal

Journal of Cheminformatics (CHEMISTRY, MULTIDISCIPLINARY; COMPUTER SCIENCE, INFORMATION SYSTEMS)

CiteScore: 14.10

Self-citation rate: 7.00%

Review time: 3 months

About the journal: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling, chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases, computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.