{"title":"Rethinking Retrosynthesis: Curriculum Learning Reshapes Transformer-Based Small-Molecule Reaction Prediction.","authors":"Rahul Sheshanarayana,Fengqi You","doi":"10.1021/acs.jcim.5c01508","DOIUrl":null,"url":null,"abstract":"Retrosynthesis prediction remains a central challenge in computational chemistry, particularly when models must generalize to rare or structurally complex reactions. We present a curriculum learning (CL) framework that reshapes model training by systematically controlling reaction difficulty during learning, directly addressing the challenge of chemical generalization. In contrast to conventional generative approaches that treat all training reactions uniformly, our method introduces reactions in a chemically informed progression, gradually exposing the model to increasingly complex transformations based on synthetic accessibility, ring complexity, and molecular size. This difficulty-aware pacing allows the model to better capture reaction conditionality, preserve chemical plausibility, and avoid failure modes commonly observed in rare or underrepresented transformations. Applied across three transformer-based architectures─ChemBERTa + DistilGPT2, ReactionT5v2, and BART─the framework yields substantial performance gains. Notably, the largest improvements are observed in the BART model, which lacks any chemical domain pretraining: CL improves its top-1 accuracy from 27.0% to 75.9% (+48.9%). The remainder of our evaluations use ChemBERTa + DistilGPT2 as a representative pretrained model. In low-data regimes with only 50% of the training data, CL increases top-1 accuracy from 16.9% to 46.6% (+29.7%). Under scaffold-based splits, CL improves top-1 accuracy by up to 29%, and in structurally dissimilar settings (Tanimoto similarity <0.4), CL boosts top-1 accuracy from 18.2% to 69.4% (+51.2%), demonstrating strong robustness to distributional shifts. 
These improvements are achieved without auxiliary labels, templates, or reaction class supervision. Looking forward, this CL framework may aid retrosynthetic route planning for pharmaceutical intermediates, catalysts, polymers, and functional materials.","PeriodicalId":44,"journal":{"name":"Journal of Chemical Information and Modeling ","volume":"91 1","pages":""},"PeriodicalIF":5.3000,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Chemical Information and Modeling ","FirstCategoryId":"92","ListUrlMain":"https://doi.org/10.1021/acs.jcim.5c01508","RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MEDICINAL","Score":null,"Total":0}
Citations: 0
Abstract
Retrosynthesis prediction remains a central challenge in computational chemistry, particularly when models must generalize to rare or structurally complex reactions. We present a curriculum learning (CL) framework that reshapes model training by systematically controlling reaction difficulty during learning, directly addressing the challenge of chemical generalization. In contrast to conventional generative approaches that treat all training reactions uniformly, our method introduces reactions in a chemically informed progression, gradually exposing the model to increasingly complex transformations based on synthetic accessibility, ring complexity, and molecular size. This difficulty-aware pacing allows the model to better capture reaction conditionality, preserve chemical plausibility, and avoid failure modes commonly observed in rare or underrepresented transformations. Applied across three transformer-based architectures (ChemBERTa + DistilGPT2, ReactionT5v2, and BART), the framework yields substantial performance gains. Notably, the largest improvements are observed in the BART model, which lacks any chemical domain pretraining: CL improves its top-1 accuracy from 27.0% to 75.9% (+48.9%). The remainder of our evaluations use ChemBERTa + DistilGPT2 as a representative pretrained model. In low-data regimes with only 50% of the training data, CL increases top-1 accuracy from 16.9% to 46.6% (+29.7%). Under scaffold-based splits, CL improves top-1 accuracy by up to 29%, and in structurally dissimilar settings (Tanimoto similarity <0.4), CL boosts top-1 accuracy from 18.2% to 69.4% (+51.2%), demonstrating strong robustness to distributional shifts. These improvements are achieved without auxiliary labels, templates, or reaction class supervision. Looking forward, this CL framework may aid retrosynthetic route planning for pharmaceutical intermediates, catalysts, polymers, and functional materials.
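The difficulty-aware pacing described in the abstract can be illustrated with a minimal sketch: score each training reaction's product for difficulty, then order the training set easy-to-hard. This is only a hypothetical illustration, not the authors' implementation; the paper combines synthetic accessibility, ring complexity, and molecular size, which are approximated here with crude SMILES-string heuristics (atom-token count and ring-closure labels), and the function names are invented for this example.

```python
import re

def difficulty_score(product_smiles: str) -> float:
    """Crude proxy for reaction difficulty (hypothetical, not the paper's
    scoring): molecular size via heavy-atom token count, plus ring
    complexity via paired ring-closure labels in the SMILES string."""
    # Match two-letter organic-subset atoms and bracket atoms before
    # single-letter atom symbols, so "Cl" is not counted as carbon.
    atoms = re.findall(r"Cl|Br|\[[^\]]+\]|[BCNOSPFIbcnops]", product_smiles)
    # Ring-closure labels (digits, or %nn for two-digit labels) appear in
    # pairs, so half their count roughly tracks the number of rings.
    ring_labels = re.findall(r"%\d{2}|\d", product_smiles)
    return len(atoms) + 2.0 * (len(ring_labels) // 2)

def curriculum_order(reactions):
    """Order (product, reactants) pairs easy-to-hard by product difficulty,
    so early training batches contain simpler target molecules."""
    return sorted(reactions, key=lambda rx: difficulty_score(rx[0]))

# Toy usage: an aromatic, ring-containing product (aspirin) scores as
# harder than a small acyclic one (ethanol) and is scheduled later.
data = [("CC(=O)Oc1ccccc1C(=O)O", ""), ("CCO", "")]
print([product for product, _ in curriculum_order(data)])
```

In a real training loop this ordering would feed a pacing schedule (e.g., widening the admitted difficulty range over epochs) rather than a single sorted pass, and a chemistry toolkit such as RDKit would supply proper synthetic-accessibility and ring descriptors in place of these string heuristics.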
About the journal:
The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery.
Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field.
As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.