{"title":"Every atom counts: predicting sites of reaction based on chemistry within two bonds†","authors":"Ching Ching Lam and Jonathan M. Goodman","doi":"10.1039/D4DD00092G","DOIUrl":"https://doi.org/10.1039/D4DD00092G","url":null,"abstract":"<p >How much chemistry can be described by looking only at each atom, its neighbours and its next-nearest neighbours? We present a method for predicting reaction sites based only on a simple, two-bond model. Machine learning classification models were trained and evaluated using atom-level labels and descriptors, including bond strength and connectivity. Despite limitations in covering only local chemical environments, the models achieved over 80% accuracy even with challenging datasets that cover a diverse chemical space. Whilst this simplistic model is necessarily incomplete, it describes a large amount of interesting chemistry.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1878-1888"},"PeriodicalIF":6.2,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00092g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Introduction to “Accelerate Conference 2022”","authors":"Keith A. Brown, Fedwa El Mellouhi and Claudiane Ouellet-Plamondon","doi":"10.1039/D4DD90036G","DOIUrl":"https://doi.org/10.1039/D4DD90036G","url":null,"abstract":"<p >A graphical abstract is available for this content</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1659-1661"},"PeriodicalIF":6.2,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd90036g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Insights into pharmacokinetic properties for exposure chemicals: predictive modelling of human plasma fraction unbound (fu) and hepatocyte intrinsic clearance (Clint) data using machine learning†","authors":"Souvik Pore and Kunal Roy","doi":"10.1039/D4DD00082J","DOIUrl":"https://doi.org/10.1039/D4DD00082J","url":null,"abstract":"<p >An external chemical substance (which may be a medicinal drug or an exposome), after ingestion, undergoes a series of dynamic movements and metabolic alterations known as pharmacokinetic events while exerting different physiological actions on the body (pharmacodynamic events). Plasma protein binding and hepatocyte intrinsic clearance are crucial pharmacokinetic events that influence the efficacy and safety of a chemical substance. Plasma protein binding determines the fraction of a chemical compound bound to plasma proteins, affecting the distribution and duration of action of the compound. Compounds with high protein binding may have a smaller free fraction available for pharmacological activity, potentially altering their therapeutic effects. On the other hand, hepatocyte intrinsic clearance represents the liver's capacity to eliminate a chemical compound through metabolism. It is a critical determinant of the elimination half-life of the chemical substance. Understanding hepatic clearance is essential for predicting chemical toxicity and designing safety guidelines. Recently, the huge expansion of computational resources has led to the development of various <em>in silico</em> predictive models as an alternative to animal experimentation. In this research work, we developed different types of machine learning (ML) based quantitative structure–activity relationship (QSAR) models for the prediction of a compound's plasma protein fraction unbound values and hepatocyte intrinsic clearance. 
Here, we have developed regression-based models with the protein fraction unbound (<em>f</em><small><sub>u</sub></small>) human data set (<em>n</em> = 1812) and a classification-based model with the hepatocyte intrinsic clearance (Cl<small><sub>int</sub></small>) human data set (<em>n</em> = 1241) collected from the recently published ICE (Integrated Chemical Environment) database. We have further analyzed the influence of the plasma protein binding on the hepatocyte intrinsic clearance, by considering the compounds having both types of target variable values. For the fraction unbound data set, the support vector machine (SVM) model shows superior results compared to other models, but for the hepatocyte intrinsic clearance data set, random forest (RF) shows the best results. We have further made predictions of these important pharmacokinetic parameters through the similarity-based read-across (RA) method. A Python-based tool for predicting the endpoints has been developed and made available from https://sites.google.com/jadavpuruniversity.in/dtc-lab-software/home/pkpy-tool.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1852-1877"},"PeriodicalIF":6.2,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00082j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dismai-Bench: benchmarking and designing generative models using disordered materials and interfaces†","authors":"Adrian Xiao Bin Yong, Tianyu Su and Elif Ertekin","doi":"10.1039/D4DD00100A","DOIUrl":"https://doi.org/10.1039/D4DD00100A","url":null,"abstract":"<p >Generative models have received significant attention in recent years for materials science applications, particularly in the area of inverse design for materials discovery. However, these models are usually assessed based on newly generated, unverified materials, using heuristic metrics such as charge neutrality, which provide a narrow evaluation of a model's performance. Also, current efforts for inorganic materials have predominantly focused on small, periodic crystals (≤20 atoms), even though the capability to generate large, more intricate and disordered structures would expand the applicability of generative modeling to a broader spectrum of materials. In this work, we present the Disordered Materials & Interfaces Benchmark (Dismai-Bench), a generative model benchmark that uses datasets of disordered alloys, interfaces, and amorphous silicon (256–264 atoms per structure). Models are trained on each dataset independently, and evaluated through direct structural comparisons between training and generated structures. Such comparisons are only possible because the material system of each training dataset is fixed. Benchmarking was performed on two graph diffusion models and two (coordinate-based) U-Net diffusion models. The graph models were found to significantly outperform the U-Net models due to the higher expressive power of graphs. While noise in the less expressive models can assist in discovering materials by facilitating exploration beyond the training distribution, these models face significant challenges when confronted with more complex structures. 
To further demonstrate the benefits of this benchmarking in the development process of a generative model, we considered the case of developing a point-cloud-based generative adversarial network (GAN) to generate low-energy disordered interfaces. We tested different GAN architectures and identified reasons for good/poor performance. We show that the best performing architecture, CryinGAN, outperforms the U-Net models, and is competitive against the graph models despite its lack of invariances and weaker expressive power. This work provides a new framework and insights to guide the development of future generative models, whether for ordered or disordered materials.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1889-1909"},"PeriodicalIF":6.2,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00100a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Active learning for regression of structure–property mapping: the importance of sampling and representation†","authors":"Hao Liu, Berkay Yucel, Baskar Ganapathysubramanian, Surya R. Kalidindi, Daniel Wheeler and Olga Wodo","doi":"10.1039/D4DD00073K","DOIUrl":"10.1039/D4DD00073K","url":null,"abstract":"<p >Data-driven approaches now allow for systematic mappings from materials microstructures to materials properties. In particular, diverse data-driven approaches are available to establish mappings using varied microstructure representations, each posing different demands on the resources required to calibrate machine learning models. In this work, using active learning regression and iteratively increasing the data pool, three questions are explored: (a) what is the minimal subset of data required to train a predictive structure–property model with sufficient accuracy? (b) Is this minimal subset highly dependent on the sampling strategy managing the datapool? And (c) what is the cost associated with the model calibration? Using case studies with different types of microstructure (composite <em>vs.</em> spinodal), dimensionality (two- and three-dimensional), and properties (elastic and electronic), we explore these questions using two separate microstructure representations: graph-based descriptors derived from a graph representation of the microstructure and two-point correlation functions. This work demonstrates that as few as 5% of evaluations are required to calibrate robust data-driven structure–property maps when selections are made from a library of diverse microstructures. The findings show that both representations (graph-based descriptors and two-point correlation functions) can be effective with only a small quantity of property evaluations when combined with different active learning strategies. 
However, the dimensionality of the latent space differs substantially depending on the microstructure representation and active learning strategy.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 10","pages":" 1997-2009"},"PeriodicalIF":6.2,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00073k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-optimizing Bayesian for continuous flow synthesis process†","authors":"Runzhe Liu, Zihao Wang, Wenbo Yang, Jinzhe Cao and Shengyang Tao","doi":"10.1039/D4DD00223G","DOIUrl":"10.1039/D4DD00223G","url":null,"abstract":"<p >The integration of artificial intelligence (AI) and chemistry has propelled the advancement of continuous flow synthesis, facilitating program-controlled automatic process optimization. Optimization algorithms play a pivotal role in the automated optimization process. The increased accuracy and predictive capability of the algorithms will further mitigate the costs associated with optimization processes. A self-optimizing Bayesian algorithm (SOBayesian), incorporating Gaussian process regression as a proxy model, has been devised. Adaptive strategies are implemented during the model training process, rather than on the acquisition function, to improve modeling efficacy. This algorithm was used to optimize the continuous flow synthesis process of pyridinylbenzamide, an important pharmaceutical intermediate, <em>via</em> the Buchwald–Hartwig reaction. It achieved a yield of 79.1% in under 30 rounds of iterative optimization; subsequent optimization with reduced prior data resulted in a 27.6% reduction in the number of experiments, significantly lowering experimental costs. Based on the experimental results, it can be concluded that the reaction is kinetically controlled. 
It provides ideas for optimizing similar reactions and new research ideas in continuous flow automated optimization.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 10","pages":" 1958-1966"},"PeriodicalIF":6.2,"publicationDate":"2024-08-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00223g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Connectivity stepwise derivation (CSD) method: a generic chemical structure information extraction method for the full step matrix†","authors":"Jialiang Xiong, Xiaojie Feng, Jingxuan Xue, Yueji Wang, Haoren Niu, Yu Gu, Qingzhu Jia, Qiang Wang and Fangyou Yan","doi":"10.1039/D4DD00125G","DOIUrl":"10.1039/D4DD00125G","url":null,"abstract":"<p >Emerging advanced exploration modalities such as property prediction, molecular recognition, and molecular design boost the fields of chemistry, drugs, and materials. Foremost in performing these advanced exploration tasks is how to describe/encode the molecular structure to the computer, <em>i.e.</em>, from what the human eye sees to what is machine-readable. In this effort, a chemical structure information extraction method termed connectivity stepwise derivation (CSD) for generating the full step matrix (MS<small><sub>F</sub></small>) is described in detail. The CSD method consists of structure information extraction, atomic connectivity relationship extraction, adjacency matrix generation, and MS<small><sub>F</sub></small> generation. For testing the run speed of the MS<small><sub>F</sub></small> generation, over 54 000 molecules have been collected covering organic molecules, polymers, and MOF structures. Test outcomes show that as the number of atoms in a molecule increases from 100 to 1000, the CSD method has an increasing advantage over the classical Floyd–Warshall algorithm, with the speed-up rising from 28.34 to 289.95 times in the Python environment and from 2.86 to 25.49 times in the C++ environment. 
The proposed CSD method, that is, the elaboration of chemical structure information extraction, promises to bring new inspiration to data scientists in chemistry, drugs, and materials as well as facilitating the development of property modeling and molecular generation methods.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1842-1851"},"PeriodicalIF":6.2,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00125g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PerQueue: managing complex and dynamic workflows†","authors":"Benjamin Heckscher Sjølin, William Sandholt Hansen, Armando Antonio Morin-Martinez, Martin Hoffmann Petersen, Laura Hannemose Rieger, Tejs Vegge, Juan Maria García-Lastra and Ivano E. Castelli","doi":"10.1039/D4DD00134F","DOIUrl":"10.1039/D4DD00134F","url":null,"abstract":"<p >Workflow managers play a critical role in the efficient planning and execution of complex workloads. A handful of these already exist within the world of computational materials discovery, but their dynamic capabilities are somewhat lacking. The PerQueue workflow manager is the answer to this need. By utilizing modular and dynamic building blocks to define a workflow explicitly before starting, PerQueue can give a better overview of the workflow while allowing full flexibility and high dynamism. To exemplify its usage, we present four use cases at different scales within computational materials discovery. These encapsulate high-throughput screening with Density Functional Theory, using active learning to train a Machine-Learning Interatomic Potential with Molecular Dynamics and reusing this potential for kinetic Monte Carlo simulations of extended systems. Lastly, it is used for an active-learning-accelerated image segmentation procedure with a human-in-the-loop.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1832-1841"},"PeriodicalIF":6.2,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00134f?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An automated electrochemistry platform for studying pH-dependent molecular electrocatalysis†","authors":"Michael A. Pence, Gavin Hazen and Joaquín Rodríguez-López","doi":"10.1039/D4DD00186A","DOIUrl":"10.1039/D4DD00186A","url":null,"abstract":"<p >Comprehensive studies of molecular electrocatalysis require tedious titration-type experiments that slow down manual experimentation. We present eLab as an automated electrochemical platform designed for molecular electrochemistry that uses open-source software to modularly interconnect various commercial instruments, enabling users to chain together multiple instruments for complex electrochemical operations. We benchmarked the solution handling performance of our platform through gravimetric calibration, acid–base titrations, and voltammetric diffusion coefficient measurements. We then used the platform to explore the TEMPO-catalyzed electrooxidation of alcohols, demonstrating our platform's capabilities for pH-dependent molecular electrocatalysis. We performed combined acid–base titrations and cyclic voltammetry on six different alcohol substrates, collecting 684 voltammograms with 171 different solution conditions over the course of 16 hours, demonstrating high throughput in an unsupervised experiment. 
The high versatility, transferability, and ease of implementation of eLab promise the rapid discovery and characterization of pH-dependent processes, including mediated electrocatalysis for energy conversion, fuel valorization, and bioelectrochemical sensing, among many applications.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1812-1821"},"PeriodicalIF":6.2,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00186a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141945908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extracting structured data from organic synthesis procedures using a fine-tuned large language model†","authors":"Qianxiang Ai, Fanwang Meng, Jiale Shi, Brenden Pelkie and Connor W. Coley","doi":"10.1039/D4DD00091A","DOIUrl":"10.1039/D4DD00091A","url":null,"abstract":"<p >The popularity of data-driven approaches and machine learning (ML) techniques in the field of organic chemistry and its various subfields has increased the value of structured reaction data. Most data in chemistry is represented by unstructured text, and despite the vastness of the organic chemistry literature (papers, patents), conversion from unstructured text to structured data remains a largely manual endeavor. Software tools for this task would facilitate downstream applications such as reaction prediction and condition recommendation. In this study, we fine-tune a large language model (LLM) to extract reaction information from organic synthesis procedure text into structured data following the Open Reaction Database (ORD) schema, a comprehensive data structure designed for organic reactions. The fine-tuned model produces syntactically correct ORD records with an average accuracy of 91.25% for ORD “messages” (<em>e.g.</em>, full compound, workups, or condition definitions) and 92.25% for individual data fields (<em>e.g.</em>, compound identifiers, mass quantities), with the ability to recognize compound-referencing tokens and to infer reaction roles. 
We investigate its failure modes and evaluate performance on specific subtasks such as reaction role classification.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 1822-1831"},"PeriodicalIF":6.2,"publicationDate":"2024-07-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00091a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141885688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}