Developing deep learning-based large-scale organic reaction classification model via sigma-profiles

Impact factor: 9.1 · JCR Q1 (Engineering, Chemical)

Green Chemical Engineering · Pub Date: 2024-06-14 · DOI: 10.1016/j.gce.2024.06.003

Wenlong Wang, Chenyang Xu, Jian Du, Lei Zhang

Volume 6, Issue 2, Pages 181-192 · https://www.sciencedirect.com/science/article/pii/S2666952824000396

Citations: 0

Abstract

Advanced technologies such as deep learning have accelerated the discovery of novel chemical reactions, especially in organic synthesis. With hundreds of thousands of reactions available for reference, one effective way to leverage them is to classify reactions into clusters based on their specific characteristics, which makes target-guided navigation of the vast chemical space possible. Although previous applications of deep learning to reaction classification have made substantial progress, developing a model that combines good interpretability with high accuracy for large-scale reaction classification remains an open question. In this work, a deep learning-based model for large-scale reaction classification is first constructed using a pre-trained BERT model and an autoencoder. The model is then trained on the open-source USPTO_TPL dataset, which contains recorded reactions of up to 1000 different types. The model achieves a multi-class classification accuracy of 99.382% on the test set, showing its potential for practical use. In addition, a reaction similarity map is presented that correlates the reactions in the USPTO_TPL dataset based on statistical features derived from their sigma-profiles. Finally, representative reactions from the test set are provided to illustrate the model's effectiveness on the reaction classification task.
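
The abstract does not say which statistical features or which similarity measure underlie the reaction similarity map. As a rough illustration only, the sketch below assumes area-weighted moments (mean, variance, skewness) of a discretized sigma-profile as features and cosine similarity as the comparison, and it shows how a multi-class accuracy figure is typically computed. All function names and the toy data are hypothetical and are not taken from the paper.

```python
import math


def sigma_moments(sigma, area):
    """Area-weighted mean, variance, and skewness of a discretized sigma-profile.

    sigma: bin centers of the screening charge density (assumed input format).
    area:  surface area assigned to each bin (assumed input format).
    """
    total = sum(area)
    mean = sum(s * a for s, a in zip(sigma, area)) / total
    var = sum(a * (s - mean) ** 2 for s, a in zip(sigma, area)) / total
    std = math.sqrt(var)
    if std == 0:
        skew = 0.0  # a single-bin profile has no asymmetry
    else:
        third = sum(a * (s - mean) ** 3 for s, a in zip(sigma, area)) / total
        skew = third / std ** 3
    return [mean, var, skew]


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy example with made-up numbers, purely for illustration.
features_a = sigma_moments([-1.0, 0.0, 1.0], [1.0, 2.0, 1.0])
features_b = sigma_moments([-1.0, 0.0, 1.0], [2.0, 1.0, 1.0])
similarity = cosine_similarity(features_a, features_b)
toy_accuracy = accuracy([1, 2, 3, 4], [1, 2, 3, 0])
```

In a setup like this, each reaction would need its own feature vector (for example, built from the profiles of its reactants and products) before pairwise similarities could be arranged into a map; how the authors do this is not stated in the abstract.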



Source journal