{"title":"A hitchhiker's guide to deep chemical language processing for bioactivity prediction.","authors":"Rıza Özçelik, Francesca Grisoni","doi":"10.1039/d4dd00311j","DOIUrl":"10.1039/d4dd00311j","url":null,"abstract":"<p><p>Deep learning has significantly accelerated drug discovery, with 'chemical language' processing (CLP) emerging as a prominent approach. CLP approaches learn from molecular string representations (<i>e.g.</i>, Simplified Molecular Input Line Entry Systems [SMILES] and Self-Referencing Embedded Strings [SELFIES]) with methods akin to natural language processing. Despite their growing importance, training predictive CLP models is far from trivial, as it involves many 'bells and whistles'. Here, we analyze the key elements of CLP and provide guidelines for newcomers and experts. Our study spans three neural network architectures, two string representations, three embedding strategies, across ten bioactivity datasets, for both classification and regression purposes. This 'hitchhiker's guide' not only underscores the importance of certain methodological decisions, but it also equips researchers with practical recommendations on ideal choices, <i>e.g.</i>, in terms of neural network architectures, molecular representations, and hyperparameter optimization.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" ","pages":""},"PeriodicalIF":6.2,"publicationDate":"2024-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11667676/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142900860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joseph D Clark, Xuenan Mi, Douglas A Mitchell, Diwakar Shukla
{"title":"Substrate prediction for RiPP biosynthetic enzymes <i>via</i> masked language modeling and transfer learning.","authors":"Joseph D Clark, Xuenan Mi, Douglas A Mitchell, Diwakar Shukla","doi":"10.1039/d4dd00170b","DOIUrl":"10.1039/d4dd00170b","url":null,"abstract":"<p><p>Ribosomally synthesized and post-translationally modified peptide (RiPP) biosynthetic enzymes often exhibit promiscuous substrate preferences that cannot be reduced to simple rules. Large language models are promising tools for predicting the specificity of RiPP biosynthetic enzymes. However, state-of-the-art protein language models are trained on relatively few peptide sequences. A previous study comprehensively profiled the peptide substrate preferences of LazBF (a two-component serine dehydratase) and LazDEF (a three-component azole synthetase) from the lactazole biosynthetic pathway. We demonstrated that masked language modeling of LazBF substrate preferences produced language model embeddings that improved downstream prediction of both LazBF and LazDEF substrates. Similarly, masked language modeling of LazDEF substrate preferences produced embeddings that improved prediction of both LazBF and LazDEF substrates. Our results suggest that the models learned functional forms that are transferable between distinct enzymatic transformations that act within the same biosynthetic pathway. We found that a single high-quality data set of substrates and non-substrates for a RiPP biosynthetic enzyme improved substrate prediction for distinct enzymes in data-scarce scenarios. 
We then fine-tuned models on each data set and showed that the fine-tuned models provided interpretable insight that we anticipate will facilitate the design of substrate libraries that are compatible with desired RiPP biosynthetic pathways.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" ","pages":""},"PeriodicalIF":6.2,"publicationDate":"2024-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11622008/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142803666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Brittany C Haas, Melissa A Hardy, Shree Sowndarya S V, Keir Adams, Connor W Coley, Robert S Paton, Matthew S Sigman
{"title":"Rapid prediction of conformationally-dependent DFT-level descriptors using graph neural networks for carboxylic acids and alkyl amines.","authors":"Brittany C Haas, Melissa A Hardy, Shree Sowndarya S V, Keir Adams, Connor W Coley, Robert S Paton, Matthew S Sigman","doi":"10.1039/d4dd00284a","DOIUrl":"10.1039/d4dd00284a","url":null,"abstract":"<p><p>Data-driven reaction discovery and development is a growing field that relies on the use of molecular descriptors to capture key information about substrates, ligands, and targets. Broad adoption of this strategy is hindered by the associated computational cost of descriptor calculation, especially when considering conformational flexibility. Descriptor libraries can be precomputed agnostic of application to reduce the computational burden of data-driven reaction development. However, as one often applies these models to evaluate novel hypothetical structures, it would be ideal to predict the descriptors of compounds on-the-fly. Herein, we report DFT-level descriptor libraries for conformational ensembles of 8528 carboxylic acids and 8172 alkyl amines towards this goal. Employing 2D and 3D graph neural network architectures trained on these libraries culminated in the development of predictive models for molecule-level descriptors, as well as the bond- and atom-level descriptors for the conserved reactive site (carboxylic acid or amine). The predictions were confirmed to be robust for an external validation set of medicinally-relevant carboxylic acids and alkyl amines. Additionally, a retrospective study correlating the rate of amide coupling reactions demonstrated the suitability of the predicted DFT-level descriptors for downstream applications. 
Ultimately, these models enable high-fidelity predictions for a vast number of potential substrates, greatly increasing accessibility to the field of data-driven reaction development.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" ","pages":""},"PeriodicalIF":6.2,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11626426/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142814928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiajun Zhou, Yijie Yang, Austin M Mroz, Kim E Jelfs
{"title":"PolyCL: contrastive learning for polymer representation learning <i>via</i> explicit and implicit augmentations.","authors":"Jiajun Zhou, Yijie Yang, Austin M Mroz, Kim E Jelfs","doi":"10.1039/d4dd00236a","DOIUrl":"10.1039/d4dd00236a","url":null,"abstract":"<p><p>Polymers play a crucial role in a wide array of applications due to their diverse and tunable properties. Establishing the relationship between polymer representations and their properties is crucial to the computational design and screening of potential polymers <i>via</i> machine learning. The quality of the representation significantly influences the effectiveness of these computational methods. Here, we present a self-supervised contrastive learning paradigm, PolyCL, for learning robust and high-quality polymer representation without the need for labels. Our model combines explicit and implicit augmentation strategies for improved learning performance. The results demonstrate that our model achieves either better, or highly competitive, performances on transfer learning tasks as a feature extractor without an overcomplicated training strategy or hyperparameter optimisation. Further enhancing the efficacy of our model, we conducted extensive analyses on various augmentation combinations used in contrastive learning. 
This led to identifying the most effective combination to maximise PolyCL's performance.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" ","pages":""},"PeriodicalIF":6.2,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11616009/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142803664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sherif Abdulkader Tawfik, Tri Minh Nguyen, Salvy P. Russo, Truyen Tran, Sunil Gupta and Svetha Venkatesh
{"title":"Embedding material graphs using the electron-ion potential: application to material fracture†","authors":"Sherif Abdulkader Tawfik, Tri Minh Nguyen, Salvy P. Russo, Truyen Tran, Sunil Gupta and Svetha Venkatesh","doi":"10.1039/D4DD00246F","DOIUrl":"https://doi.org/10.1039/D4DD00246F","url":null,"abstract":"<p >At the heart of the flourishing field of machine learning potentials are graph neural networks, where deep learning is interwoven with physics-informed machine learning (PIML) architectures. Various PIML models, upon training with density functional theory (DFT) material structure–property datasets, have achieved unprecedented prediction accuracy for a range of molecular and material properties. A critical component in the learned graph representation of crystal structures in PIMLs is how the various fragments of the structure's graph are embedded in a neural network. Several state-of-the-art PIML models apply spherical harmonic functions. Such functions are based on the assumption that DFT computes the Coulomb potential of atom–atom interactions. However, DFT does not directly compute such potentials, but integrates the electron–atom potentials. We introduce the direct integration of the external potential (DIEP) method, which more faithfully reflects the actual computational workflow in DFT. DIEP integrates the external (electron–atom) potential and uses these quantities to embed the structure graph into a deep learning model. We demonstrate the enhanced accuracy of the DIEP model in predicting the energies of pristine and defective materials. 
By training DIEP to predict the potential energy surface, we show the model's ability to predict the onset of fracture of pristine and defective carbon nanotubes.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 2618-2627"},"PeriodicalIF":6.2,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00246f?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142778040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jean-Charles Cousty, Tanguy Cavagna, Alec Schmidt, Edy Mariano, Keyan Villat, Florian de Nanteuil and Pascal Miéville
{"title":"GLAS: an open-source easily expandable Git-based scheduling architecture for integral lab automation†","authors":"Jean-Charles Cousty, Tanguy Cavagna, Alec Schmidt, Edy Mariano, Keyan Villat, Florian de Nanteuil and Pascal Miéville","doi":"10.1039/D4DD00253A","DOIUrl":"https://doi.org/10.1039/D4DD00253A","url":null,"abstract":"<p >This paper presents GLAS (Git-based Lab Automated Scheduler or Get Lab Automation Simplified), an open-source, robust, and highly expandable Git-based architecture designed for laboratory automation. GLAS can be deployed in both partially and fully automated experimental science laboratories, enabling the development of a multi-layer scheduling system while maintaining a systematic architecture grounded in a Git repository. We demonstrate the applicability of GLAS through case studies from the Swiss Cat+ automated chemistry laboratory, showcasing its versatility and potential for widespread applicability in various laboratory automation contexts. By offering an open-source scheduling environment, our aim is to foster the development of accessible and adaptable laboratory automation solutions within the scientific community.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 2434-2447"},"PeriodicalIF":6.2,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00253a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jin Da Tan, Andre K. Y. Low, Shannon Thoi Rui Ying, Sze Yu Tan, Wenguang Zhao, Yee-Fun Lim, Qianxiao Li, Saif A. Khan, Balamurugan Ramalingam and Kedar Hippalgaonkar
{"title":"Multi-objective synthesis optimization and kinetics of a sustainable terpolymer†","authors":"Jin Da Tan, Andre K. Y. Low, Shannon Thoi Rui Ying, Sze Yu Tan, Wenguang Zhao, Yee-Fun Lim, Qianxiao Li, Saif A. Khan, Balamurugan Ramalingam and Kedar Hippalgaonkar","doi":"10.1039/D4DD00233D","DOIUrl":"https://doi.org/10.1039/D4DD00233D","url":null,"abstract":"<p >The properties of polymers are primarily influenced by their monomer constituents, functional groups, and their mode of linkages. Copolymers, synthesized from multiple monomers, offer unique material properties compared to their homopolymers. Optimizing the synthesis of terpolymers is a complex and labor-intensive task due to variations in monomer reactivity and their compositional shifts throughout the polymerization process. The present work focuses on synthesizing a new terpolymer from styrene, myrcene, and dibutyl itaconate (DBI) monomers with the goal of achieving a high glass transition temperature (<em>T</em><small><sub>g</sub></small>) in the resulting terpolymer. While the copolymerization of pairwise combinations of styrene, myrcene, and DBI has been previously investigated, the terpolymerization of all three at once remains unexplored. Terpolymers with monomers like styrene would provide high glass transition temperatures as the resultant polymers exhibit a rigid glassy state at ambient temperatures. Conversely, minimizing styrene incorporation also reduces reliance on petrochemical-derived monomer sources for terpolymer synthesis, thus enhancing the sustainability of terpolymer usage. To balance the objectives of maximizing <em>T</em><small><sub>g</sub></small> while minimizing styrene incorporation, we employ multi-objective Bayesian optimization to efficiently sample in a design space comprising 5 experimental parameters. 
We perform two iterations of optimization for a total of 89 terpolymers, reporting terpolymers with a <em>T</em><small><sub>g</sub></small> above ambient temperature while retaining less than 50% styrene incorporation. This underscores the potential for exploring and utilizing renewable monomers such as myrcene and DBI, to foster sustainability in polymer synthesis. Additionally, the dataset enables the calculation of ternary reactivity ratios using a system of ordinary differential equations based on the terminal model, providing valuable insights into the reactivity of monomers in complex ternary systems compared to binary copolymer systems. This approach reveals the nuanced kinetics of terpolymerization, further informing the synthesis of polymers with desired properties.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 2628-2636"},"PeriodicalIF":6.2,"publicationDate":"2024-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00233d?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142778041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rolf David, Miguel de la Puente, Axel Gomez, Olaia Anton, Guillaume Stirnemann, Damien Laage
{"title":"ArcaNN: automated enhanced sampling generation of training sets for chemically reactive machine learning interatomic potentials.","authors":"Rolf David, Miguel de la Puente, Axel Gomez, Olaia Anton, Guillaume Stirnemann, Damien Laage","doi":"10.1039/d4dd00209a","DOIUrl":"10.1039/d4dd00209a","url":null,"abstract":"<p><p>The emergence of artificial intelligence is profoundly impacting computational chemistry, particularly through machine-learning interatomic potentials (MLIPs). Unlike traditional potential energy surface representations, MLIPs overcome the conventional computational scaling limitations by offering an effective combination of accuracy and efficiency for calculating atomic energies and forces to be used in molecular simulations. These MLIPs have significantly enhanced molecular simulations across various applications, including large-scale simulations of materials, interfaces, chemical reactions, and beyond. Despite these advances, the construction of training datasets, a critical component for the accuracy of MLIPs, has not received proportional attention, especially in the context of chemical reactivity, which depends on rare barrier-crossing events that are not easily included in the datasets. Here we address this gap by introducing ArcaNN, a comprehensive framework designed for generating training datasets for reactive MLIPs. ArcaNN employs a concurrent learning approach combined with advanced sampling techniques to ensure an accurate representation of high-energy geometries. The framework integrates automated processes for iterative training, exploration, new configuration selection, and energy and force labeling, all while ensuring reproducibility and documentation. We demonstrate ArcaNN's capabilities through two paradigm reactions: a nucleophilic substitution and a Diels-Alder reaction. 
These examples showcase its effectiveness, the uniformly low error of the resulting MLIP everywhere along the chemical reaction coordinate, and its potential for broad applications in reactive molecular dynamics. Finally, we provide guidelines for assessing the quality of MLIPs in reactive systems.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" ","pages":""},"PeriodicalIF":6.2,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11563209/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142649564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thomas Löhr, Michele Assante, Michael Dodds, Lili Cao, Mikhail Kabeshov, Jon-Paul Janet, Marco Klähn and Ola Engkvist
{"title":"Navigating the Maize: cyclic and conditional computational graphs for molecular simulation","authors":"Thomas Löhr, Michele Assante, Michael Dodds, Lili Cao, Mikhail Kabeshov, Jon-Paul Janet, Marco Klähn and Ola Engkvist","doi":"10.1039/D4DD00288A","DOIUrl":"https://doi.org/10.1039/D4DD00288A","url":null,"abstract":"<p >Many computational chemistry and molecular simulation workflows can be expressed as graphs. This abstraction is useful to modularize and potentially reuse existing components, as well as provide parallelization and ease reproducibility. Existing tools represent the computation as a directed acyclic graph (DAG), thus allowing efficient execution by parallelization of concurrent branches. These systems can, however, generally not express cyclic and conditional workflows. We therefore developed Maize, a workflow manager for cyclic and conditional graphs based on the principles of flow-based programming. By running each node of the graph concurrently in separate processes and allowing communication at any time through dedicated inter-node channels, arbitrary graph structures can be executed. 
We demonstrate the effectiveness of the tool on a dynamic active learning task in computational drug design, involving the use of a small molecule generative model and an associated scoring system, and on a reactivity prediction pipeline using quantum-chemistry and semiempirical approaches.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 2551-2559"},"PeriodicalIF":6.2,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00288a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142778013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mengjia Zhu, Austin Mroz, Lingfeng Gui, Kim E. Jelfs, Alberto Bemporad, Ehecatl Antonio del Río Chanona and Ye Seol Lee
{"title":"Discrete and mixed-variable experimental design with surrogate-based approach†","authors":"Mengjia Zhu, Austin Mroz, Lingfeng Gui, Kim E. Jelfs, Alberto Bemporad, Ehecatl Antonio del Río Chanona and Ye Seol Lee","doi":"10.1039/D4DD00113C","DOIUrl":"https://doi.org/10.1039/D4DD00113C","url":null,"abstract":"<p >Experimental design plays an important role in efficiently acquiring informative data for system characterization and deriving robust conclusions under resource limitations. Recent advancements in high-throughput experimentation coupled with machine learning have notably improved experimental procedures. While Bayesian optimization (BO) has undeniably revolutionized the landscape of optimization in experimental design, especially in the chemical domain, it is important to recognize the role of other surrogate-based approaches in conventional chemistry optimization problems. This is particularly relevant for chemical problems involving mixed-variable design space with mixed-variable physical constraints, where conventional BO approaches struggle to obtain feasible samples during the acquisition step while maintaining exploration capability. In this paper, we demonstrate that integrating mixed-integer optimization strategies is one way to address these challenges effectively. Specifically, we propose the utilization of mixed-integer surrogates and acquisition functions–methods that offer inherent compatibility with problems with discrete and mixed-variable design space. This work focuses on piecewise affine surrogate-based optimization (PWAS), a surrogate model capable of handling medium-sized mixed-variable problems (up to around 100 variables after encoding) subject to known linear constraints. We demonstrate the effectiveness of this approach in optimizing experimental planning through three case studies. 
By benchmarking PWAS against state-of-the-art optimization algorithms, including genetic algorithms and BO variants, we offer insights into the practical applicability of mixed-integer surrogates, with emphasis on problems subject to known discrete/mixed-variable linear constraints.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 12","pages":" 2589-2606"},"PeriodicalIF":6.2,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00113c?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142778016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}