Dissecting errors in machine learning for retrosynthesis: a granular metric framework and a transformer-based model for more informative predictions

Impact factor: 6.2 (JCR Q1, Chemistry, Multidisciplinary)
Arihanth Srikar Tadanki, H. Surya Prakash Rao and U. Deva Priyakumar
Journal: Digital Discovery, 3, 831-845
Published: 2025-02-18
DOI: 10.1039/D4DD00263F
Article page: https://pubs.rsc.org/en/content/articlelanding/2025/dd/d4dd00263f
PDF: https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00263f?page=search
Citations: 0

Abstract

Chemical reaction prediction, encompassing forward synthesis and retrosynthesis, is a fundamental challenge in organic synthesis. A widely adopted computational approach frames synthesis prediction as a sequence-to-sequence translation task over SMILES strings, a common text representation of molecules. Current evaluation of machine learning methods for retrosynthesis assumes perfect training data. It overlooks imperfections in the reaction equations of popular datasets, such as missing reactants or products and missing physical and practical constraints such as temperature and cost, which arise primarily from a focus on the target molecule. This limitation leads to an incomplete representation of viable synthetic routes, especially when multiple sets of reactants can yield the same desired product. In response, this study examines the prevailing evaluation methods and introduces comprehensive metrics designed to account for imperfections in the dataset. Beyond absolute accuracy, which compares predicted outputs with the ground truth, the metrics score partial correctness and compute an adjusted accuracy through graph matching, acknowledging the inherent complexity of retrosynthetic pathways. Additionally, we explore the impact of small molecular augmentations that preserve chemical properties and employ similarity matching to improve the assessment of prediction quality. We introduce SynFormer, a sequence-to-sequence model tailored to the SMILES representation. It incorporates architectural enhancements to the original transformer to address the challenges of chemical reaction prediction. SynFormer achieves a Top-1 accuracy of 53.2% on the USPTO-50k dataset, matching the performance of widely accepted models such as Chemformer while being more efficient because it requires no pre-training.
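
To make the difference between exact-match accuracy and a partial-correctness score concrete, here is a minimal sketch in Python. It is not the paper's metric definition: the function names are hypothetical, reactants are assumed to be dot-separated SMILES strings that have already been canonicalized (so plain string comparison is meaningful), and "partial correctness" is taken here to mean the fraction of ground-truth reactants recovered by a prediction.

```python
from typing import FrozenSet, List


def split_reactants(smiles: str) -> FrozenSet[str]:
    """Split a dot-separated reactant SMILES string into a set of molecules.

    Assumption: each molecule is already canonicalized, so identical
    molecules have identical strings. Duplicate molecules collapse.
    """
    return frozenset(part for part in smiles.split(".") if part)


def top_k_exact(predictions: List[str], truth: str, k: int = 1) -> bool:
    """True if any of the first k predictions equals the ground-truth reactant set."""
    target = split_reactants(truth)
    return any(split_reactants(p) == target for p in predictions[:k])


def partial_score(prediction: str, truth: str) -> float:
    """Fraction of ground-truth reactants that also appear in the prediction."""
    target = split_reactants(truth)
    if not target:
        return 0.0
    return len(target & split_reactants(prediction)) / len(target)


# Illustrative data only: acetic acid + ethanol as the recorded reactants.
truth = "CC(=O)O.OCC"
predictions = ["CC(=O)Cl.OCC", "OCC.CC(=O)O"]

print(top_k_exact(predictions, truth, k=1))   # False
print(top_k_exact(predictions, truth, k=2))   # True
print(partial_score(predictions[0], truth))   # 0.5
```

In this toy case the top-ranked prediction swaps acetic acid for acetyl chloride, which exact matching counts as a complete miss even though one of the two reactants is recovered. A graded score of this kind is one simple way to surface that information; the graph-matching and similarity-based metrics described in the abstract would need a cheminformatics toolkit and are not reproduced here.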
