{"title":"Rethinking Retrosynthesis: Curriculum Learning Reshapes Transformer-Based Small-Molecule Reaction Prediction.","authors":"Rahul Sheshanarayana,Fengqi You","doi":"10.1021/acs.jcim.5c01508","DOIUrl":null,"url":null,"abstract":"Retrosynthesis prediction remains a central challenge in computational chemistry, particularly when models must generalize to rare or structurally complex reactions. We present a curriculum learning (CL) framework that reshapes model training by systematically controlling reaction difficulty during learning, directly addressing the challenge of chemical generalization. In contrast to conventional generative approaches that treat all training reactions uniformly, our method introduces reactions in a chemically informed progression, gradually exposing the model to increasingly complex transformations based on synthetic accessibility, ring complexity, and molecular size. This difficulty-aware pacing allows the model to better capture reaction conditionality, preserve chemical plausibility, and avoid failure modes commonly observed in rare or underrepresented transformations. Applied across three transformer-based architectures─ChemBERTa + DistilGPT2, ReactionT5v2, and BART─the framework yields substantial performance gains. Notably, the largest improvements are observed in the BART model, which lacks any chemical domain pretraining: CL improves its top-1 accuracy from 27.0% to 75.9% (+48.9%). The remainder of our evaluations use ChemBERTa + DistilGPT2 as a representative pretrained model. In low-data regimes with only 50% of the training data, CL increases top-1 accuracy from 16.9% to 46.6% (+29.7%). Under scaffold-based splits, CL improves top-1 accuracy by up to 29%, and in structurally dissimilar settings (Tanimoto similarity <0.4), CL boosts top-1 accuracy from 18.2% to 69.4% (+51.2%), demonstrating strong robustness to distributional shifts. 
These improvements are achieved without auxiliary labels, templates, or reaction class supervision. Looking forward, this CL framework may aid retrosynthetic route planning for pharmaceutical intermediates, catalysts, polymers, and functional materials.","PeriodicalId":44,"journal":{"name":"Journal of Chemical Information and Modeling ","volume":"91 1","pages":""},"PeriodicalIF":5.3000,"publicationDate":"2025-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Chemical Information and Modeling ","FirstCategoryId":"92","ListUrlMain":"https://doi.org/10.1021/acs.jcim.5c01508","RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MEDICINAL","Score":null,"Total":0}
Citations: 0
Abstract
Retrosynthesis prediction remains a central challenge in computational chemistry, particularly when models must generalize to rare or structurally complex reactions. We present a curriculum learning (CL) framework that reshapes model training by systematically controlling reaction difficulty during learning, directly addressing the challenge of chemical generalization. In contrast to conventional generative approaches that treat all training reactions uniformly, our method introduces reactions in a chemically informed progression, gradually exposing the model to increasingly complex transformations based on synthetic accessibility, ring complexity, and molecular size. This difficulty-aware pacing allows the model to better capture reaction conditionality, preserve chemical plausibility, and avoid failure modes commonly observed in rare or underrepresented transformations. Applied across three transformer-based architectures (ChemBERTa + DistilGPT2, ReactionT5v2, and BART), the framework yields substantial performance gains. Notably, the largest improvements are observed in the BART model, which lacks any chemical domain pretraining: CL improves its top-1 accuracy from 27.0% to 75.9% (+48.9%). The remainder of our evaluations use ChemBERTa + DistilGPT2 as a representative pretrained model. In low-data regimes with only 50% of the training data, CL increases top-1 accuracy from 16.9% to 46.6% (+29.7%). Under scaffold-based splits, CL improves top-1 accuracy by up to 29%, and in structurally dissimilar settings (Tanimoto similarity <0.4), CL boosts top-1 accuracy from 18.2% to 69.4% (+51.2%), demonstrating strong robustness to distributional shifts. These improvements are achieved without auxiliary labels, templates, or reaction class supervision. Looking forward, this CL framework may aid retrosynthetic route planning for pharmaceutical intermediates, catalysts, polymers, and functional materials.
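The difficulty-aware pacing described in the abstract can be illustrated with a minimal sketch: score each training reaction's product for difficulty, then order the training set easy-to-hard. This is only a hypothetical illustration, not the authors' implementation; the paper combines synthetic accessibility, ring complexity, and molecular size, which are approximated here with crude SMILES-string heuristics (atom-token count and ring-closure labels), and the function names are invented for this example.

```python
import re

def difficulty_score(product_smiles: str) -> float:
    """Crude proxy for reaction difficulty (hypothetical, not the paper's
    scoring): molecular size via heavy-atom token count, plus ring
    complexity via paired ring-closure labels in the SMILES string."""
    # Match two-letter organic-subset atoms and bracket atoms before
    # single-letter atom symbols, so "Cl" is not counted as carbon.
    atoms = re.findall(r"Cl|Br|\[[^\]]+\]|[BCNOSPFIbcnops]", product_smiles)
    # Ring-closure labels (digits, or %nn for two-digit labels) appear in
    # pairs, so half their count roughly tracks the number of rings.
    ring_labels = re.findall(r"%\d{2}|\d", product_smiles)
    return len(atoms) + 2.0 * (len(ring_labels) // 2)

def curriculum_order(reactions):
    """Order (product, reactants) pairs easy-to-hard by product difficulty,
    so early training batches contain simpler target molecules."""
    return sorted(reactions, key=lambda rx: difficulty_score(rx[0]))

# Toy usage: an aromatic, ring-containing product (aspirin) scores as
# harder than a small acyclic one (ethanol) and is scheduled later.
data = [("CC(=O)Oc1ccccc1C(=O)O", ""), ("CCO", "")]
print([product for product, _ in curriculum_order(data)])
```

In a real training loop this ordering would feed a pacing schedule (e.g., widening the admitted difficulty range over epochs) rather than a single sorted pass, and a chemistry toolkit such as RDKit would supply proper synthetic-accessibility and ring descriptors in place of these string heuristics.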
About the journal:
The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery.
Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field.
As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.