Tianbai Huang, Robert Geitner, Alexander Croy and Stefanie Gräfe
{"title":"Tailoring phosphine ligands for improved C–H activation: insights from Δ-machine learning†","authors":"Tianbai Huang, Robert Geitner, Alexander Croy and Stefanie Gräfe","doi":"10.1039/D4DD00037D","DOIUrl":"10.1039/D4DD00037D","url":null,"abstract":"<p >Transition metal complexes have played crucial roles in various homogeneous catalytic processes due to their exceptional versatility. This adaptability stems not only from the central metal ions but also from the vast array of choices of the ligand spheres, which form an enormously large chemical space. For example, Rh complexes, with a well-designed ligand sphere, are known to be efficient in catalyzing the C–H activation process in alkanes. To investigate the structure–property relation of the Rh complex and identify the optimal ligand that minimizes the calculated reaction energy Δ<em>E</em> of an alkane C–H activation, we have applied a Δ-machine learning method trained on various features to study 1743 pairs of reactants (Rh(PLP)(Cl)(CO)) and intermediates (Rh(PLP)(Cl)(CO)(H)(propyl)). Our findings demonstrate that the models exhibit robust predictive performance when trained on features derived from electron density (<em>R</em><small><sup>2</sup></small> = 0.816), and SOAPs (<em>R</em><small><sup>2</sup></small> = 0.819), a set of position-based descriptors. Leveraging the model trained on xTB-SOAPs that only depend on the xTB-equilibrium structures, we propose an efficient and accurate screening procedure to explore the extensive chemical space of bisphosphine ligands. By applying this screening procedure, we identify ten newly selected reactant–intermediate pairs with an average Δ<em>E</em> of 33.2 kJ mol<small><sup>−1</sup></small>, remarkably lower than the average Δ<em>E</em> of the original data set of 68.0 kJ mol<small><sup>−1</sup></small>. 
This underscores the efficacy of our screening procedure in pinpointing structures with significantly lower energy levels.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 7","pages":" 1350-1364"},"PeriodicalIF":6.2,"publicationDate":"2024-05-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00037d?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141172220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stuart C. Smith, Christopher S. Horbaczewskyj, Theo F. N. Tanner, Jacob J. Walder and Ian J. S. Fairlamb
{"title":"Automated approaches, reaction parameterisation, and data science in organometallic chemistry and catalysis: towards improving synthetic chemistry and accelerating mechanistic understanding","authors":"Stuart C. Smith, Christopher S. Horbaczewskyj, Theo F. N. Tanner, Jacob J. Walder and Ian J. S. Fairlamb","doi":"10.1039/D3DD00249G","DOIUrl":"10.1039/D3DD00249G","url":null,"abstract":"<p >Automation technologies and data science techniques have been successfully applied to optimisation and discovery activities in the chemical sciences for decades. As the sophistication of these techniques and technologies has evolved, so too has the ambition to expand their scope of application to problems of significant synthetic difficulty. Of these applications, some of the most challenging involve investigation of chemical mechanism in organometallic processes (with particular emphasis on air- and moisture-sensitive processes), particularly with the reagent and/or catalyst used. We discuss herein the development of enabling methodologies to allow the study of these challenging systems and highlight some important applications of these technologies in problems of considerable interest to applied synthetic chemists.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 8","pages":" 1467-1495"},"PeriodicalIF":6.2,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d3dd00249g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maciej P. Polak, Shrey Modi, Anna Latosinska, Jinming Zhang, Ching-Wen Wang, Shaonan Wang, Ayan Deep Hazra and Dane Morgan
{"title":"Flexible, model-agnostic method for materials data extraction from text using general purpose language models","authors":"Maciej P. Polak, Shrey Modi, Anna Latosinska, Jinming Zhang, Ching-Wen Wang, Shaonan Wang, Ayan Deep Hazra and Dane Morgan","doi":"10.1039/D4DD00016A","DOIUrl":"10.1039/D4DD00016A","url":null,"abstract":"<p >Accurate and comprehensive material databases extracted from research papers are crucial for materials science and engineering, but their development requires significant human effort. With large language models (LLMs) transforming the way humans interact with text, LLMs provide an opportunity to revolutionize data extraction. In this study, we demonstrate a simple and efficient method for extracting materials data from full-text research papers leveraging the capabilities of LLMs combined with human supervision. This approach is particularly suitable for mid-sized databases and requires minimal to no coding or prior knowledge about the extracted property. It offers high recall and nearly perfect precision in the resulting database. The method is easily adaptable to new and superior language models, ensuring continued utility. We show this by evaluating and comparing its performance on GPT-3 and GPT-3.5/4 (which underlie ChatGPT), as well as free alternatives such as BART and DeBERTaV3. We provide a detailed analysis of the method's performance in extracting sentences containing bulk modulus data, achieving up to 90% precision at 96% recall, depending on the amount of human effort involved. 
We further demonstrate the method's broader effectiveness by developing a database of critical cooling rates for metallic glasses over twice the size of previous human curated databases.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 6","pages":" 1221-1235"},"PeriodicalIF":0.0,"publicationDate":"2024-05-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00016a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mario Negovetić, Erik Otović, Daniela Kalafatovic and Goran Mauša
{"title":"Efficiently solving the curse of feature-space dimensionality for improved peptide classification","authors":"Mario Negovetić, Erik Otović, Daniela Kalafatovic and Goran Mauša","doi":"10.1039/D4DD00079J","DOIUrl":"10.1039/D4DD00079J","url":null,"abstract":"<p >Machine learning is becoming an important tool for predicting peptide function that holds promise for accelerating their discovery. In this paper, we explore feature selection techniques to improve data mining of antimicrobial and catalytic peptides, boost predictive performance and model explainability. SMILES is a widely employed software-readable format for the chemical structures of peptides, and it allows for extraction of numerous molecular descriptors. To reduce the high number of features therein, we conduct a systematic data preprocessing procedure including the widespread wrapper techniques and a computationally better solution provided by the filter technique to build a classification model and make the search for relevant numerical descriptors more efficient without reducing its effectiveness. Comparison of the outcomes of four model implementations in terms of execution time and classification performance, together with a Shapley-based model explainability method, provides valuable insight into the impact of feature selection and the suitability of the models with SMILES-derived molecular descriptors. The best results were achieved using the filter method with a ROC-AUC score of 0.954 for catalytic and 0.977 for antimicrobial peptides, with the execution time of feature selection lower by 2 or 3 orders of magnitude. 
The proposed models were also validated by comparison with established models used for the prediction of antimicrobial and catalytic functions.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 6","pages":" 1182-1193"},"PeriodicalIF":0.0,"publicationDate":"2024-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00079j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"InterMat: accelerating band offset prediction in semiconductor interfaces with DFT and deep learning†","authors":"Kamal Choudhary and Kevin F. Garrity","doi":"10.1039/D4DD00031E","DOIUrl":"10.1039/D4DD00031E","url":null,"abstract":"<p >We introduce a computational framework (InterMat) to predict band offsets of semiconductor interfaces using density functional theory (DFT) and graph neural networks (GNN). As a first step, we benchmark OptB88vdW generalized gradient approximation (GGA) work functions and electron affinities for surfaces against experimental data with accuracies of 0.29 eV and 0.39 eV, respectively. Similarly, we evaluate band offset values using independent unit (IU) and alternate slab junction (ASJ) models leading to accuracies of 0.45 eV and 0.22 eV, respectively. We use bulk band structure calculations with the TBmBJ meta-GGA functional to correct for band gap underestimation when predicting conduction band properties. During ASJ structure generation, we use Zur's algorithm along with a unified GNN force-field to tackle the conformation challenges of interface design. At present, we have 607 surface work functions calculated with DFT, from which we can compute 183 921 IU band offsets as well as 593 directly calculated ASJ band offsets. Finally, as the space of all possible heterojunctions is too large to simulate with DFT, we develop generalized GNN models to quickly predict bulk band edges with an accuracy of 0.26 eV. We show how these models can be used to predict relevant quantities including ionization potentials, electron affinities, and IU-based band offsets. We establish simple rules using the above models to pre-screen potential semiconductor devices from a vast pool of nearly 1.4 trillion candidate interfaces. 
InterMat is available at https://github.com/usnistgov/intermat.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 7","pages":" 1365-1377"},"PeriodicalIF":6.2,"publicationDate":"2024-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00031e?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149364","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Friedrich Hastedt, Rowan M. Bailey, Klaus Hellgardt, Sophia N. Yaliraki, Ehecatl Antonio del Rio Chanona and Dongda Zhang
{"title":"Investigating the reliability and interpretability of machine learning frameworks for chemical retrosynthesis†","authors":"Friedrich Hastedt, Rowan M. Bailey, Klaus Hellgardt, Sophia N. Yaliraki, Ehecatl Antonio del Rio Chanona and Dongda Zhang","doi":"10.1039/D4DD00007B","DOIUrl":"10.1039/D4DD00007B","url":null,"abstract":"<p >Machine learning models for chemical retrosynthesis have attracted substantial interest in recent years. Unaddressed challenges, particularly the absence of robust evaluation metrics for performance comparison, and the lack of black-box interpretability, obscure model limitations and impede progress in the field. We present an automated benchmarking pipeline designed for effective model performance comparisons. With an emphasis on user-friendly design, we aim to streamline accessibility and facilitate utilisation within the research community. Additionally, we suggest and perform a new interpretability study to uncover the degree of chemical understanding acquired by retrosynthesis models. Our results reveal that frameworks based on chemical reaction rules yield the most diverse, chemically valid, and feasible reactions, whereas purely data-driven frameworks suffer from unfeasible and invalid predictions. The interpretability study emphasises that incorporating reaction rules not only enhances model performance but also improves interpretability. For simple molecules, we show that Graph Neural Networks identify relevant functional groups in the product molecule, offering model interpretability. Sequence-to-sequence Transformers are not found to provide such an explanation. As the molecule and reaction mechanism grow more complex, both data-driven models propose unfeasible disconnections without offering a chemical rationale. We stress the importance of incorporating chemically meaningful descriptors within deep-learning models. 
Our study provides valuable guidance for the future development of retrosynthesis frameworks.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 6","pages":" 1194-1212"},"PeriodicalIF":0.0,"publicationDate":"2024-05-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00007b?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Martin Seifrid, Felix Strieth-Kalthoff, Mohammad Haddadnia, Tony C. Wu, Emre Alca, Leticia Bodo, Sebastian Arellano-Rubach, Naruki Yoshikawa, Marta Skreta, Rachel Keunen and Alán Aspuru-Guzik
{"title":"Chemspyd: an open-source python interface for Chemspeed robotic chemistry and materials platforms†","authors":"Martin Seifrid, Felix Strieth-Kalthoff, Mohammad Haddadnia, Tony C. Wu, Emre Alca, Leticia Bodo, Sebastian Arellano-Rubach, Naruki Yoshikawa, Marta Skreta, Rachel Keunen and Alán Aspuru-Guzik","doi":"10.1039/D4DD00046C","DOIUrl":"10.1039/D4DD00046C","url":null,"abstract":"<p >We introduce <em>Chemspyd</em>, a lightweight, open-source Python package for operating the popular laboratory robotic platforms from Chemspeed Technologies. As an add-on to the existing proprietary software suite, <em>Chemspyd</em> enables dynamic communication with the automated platform, laying the foundation for its modular integration into customizable, higher-level laboratory workflows. We show the applicability of <em>Chemspyd</em> in a set of case studies from chemistry and materials science. We demonstrate how the package can be used with large language models to provide a natural language interface. By providing an open-source software interface for a commercial robotic platform, we hope to inspire the development of open interfaces that facilitate the flexible, adaptive integration of existing laboratory equipment into automated laboratories.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 7","pages":" 1319-1326"},"PeriodicalIF":6.2,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00046c?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deniz N. Cakan, Rishi E. Kumar, Eric Oberholtz, Moses Kodur, Jack R. Palmer, Apoorva Gupta, Ken Kaushal, Hendrik M. Vossler and David P. Fenning
{"title":"PASCAL: the perovskite automated spin coat assembly line accelerates composition screening in triple-halide perovskite alloys†","authors":"Deniz N. Cakan, Rishi E. Kumar, Eric Oberholtz, Moses Kodur, Jack R. Palmer, Apoorva Gupta, Ken Kaushal, Hendrik M. Vossler and David P. Fenning","doi":"10.1039/D4DD00075G","DOIUrl":"10.1039/D4DD00075G","url":null,"abstract":"<p >The Perovskite Automated Spin Coat Assembly Line – PASCAL – is introduced as a materials acceleration platform for the deposition and characterization of spin-coated thin films, with specific application to halide perovskites. We first demonstrate improved consistency of perovskite film fabrication by controlling process parameters, the influence of which is uniquely exposed under the automated experimental framework. Next, we report on an automated campaign of composition engineering to improve the durability of perovskite absorbers for tandem solar cell applications. We screen compositions spanning the triple-cation, triple-halide composition space, MA<small><sub><em>x</em></sub></small>FA<small><sub>0.78</sub></small>Cs<small><sub>0.22−<em>x</em></sub></small>Pb(I<small><sub>0.8−<em>y</em>−<em>z</em></sub></small>Br<small><sub><em>y</em></sub></small>Cl<small><sub><em>z</em></sub></small>)<small><sub>3</sub></small>. Data-driven clustering identifies four characteristic behaviors within this space regarding figures of merit for durability and open-circuit voltage, with data from each sample acquired automatically in the PASCAL characterization line. Finally, a film composition durable to light and elevated temperature exposure is identified <em>via</em> a regression model trained on the high-throughput dataset. 
The approach, hardware, and data detailed herein highlight automated platforms as an opportunity to accelerate the identification and discovery of novel thin film materials and demonstrate the efficacy of PASCAL specifically for automation of solution-processed optoelectronic thin film research.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 6","pages":" 1236-1246"},"PeriodicalIF":0.0,"publicationDate":"2024-05-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00075g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maxim A. Ziatdinov, Muammer Yusuf Yaman, Yongtao Liu, David Ginger and Sergei V. Kalinin
{"title":"Semi-supervised learning of images with strong rotational disorder: assembling nanoparticle libraries†","authors":"Maxim A. Ziatdinov, Muammer Yusuf Yaman, Yongtao Liu, David Ginger and Sergei V. Kalinin","doi":"10.1039/D3DD00196B","DOIUrl":"10.1039/D3DD00196B","url":null,"abstract":"<p >The proliferation of optical, electron, and scanning probe microscopies gives rise to large volumes of imaging data of objects ranging from cells, bacteria, and pollen to nanoparticles, atoms, and molecules. In most cases, the experimental data streams contain images having arbitrary rotations and translations within the image. At the same time, for many cases, small amounts of labeled data are available in the form of prior published results, image collections, and catalogs, or even theoretical models. Here we develop an approach that allows generalizing from a small subset of labeled data with a weak orientational disorder to a large unlabeled dataset with a much stronger orientational (and positional) disorder, <em>i.e.</em>, it performs a classification of image data given a small number of examples even in the presence of a distribution shift between the labeled and unlabeled parts. This approach is based on the semi-supervised rotationally invariant variational autoencoder (ss-rVAE) model consisting of the encoder–decoder “block” that learns a rotationally-invariant latent representation of data and a classifier for categorizing data into different discrete classes. The classifier part of the trained ss-rVAE inherits the rotational (and translational) invariances and can be deployed independently of the other parts of the model. The performance of the ss-rVAE is illustrated using synthetic data sets with known factors of variation. 
We further demonstrate its application for experimental data sets of nanoparticles, creating nanoparticle libraries and disentangling the representations defining the physical factors of variation in the data.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 6","pages":" 1213-1220"},"PeriodicalIF":0.0,"publicationDate":"2024-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Race to the bottom: Bayesian optimisation for chemical problems†","authors":"Yifan Wu, Aron Walsh and Alex M. Ganose","doi":"10.1039/D3DD00234A","DOIUrl":"10.1039/D3DD00234A","url":null,"abstract":"<p >What is the minimum number of experiments, or calculations, required to find an optimal solution? Relevant chemical problems range from identifying a compound with target functionality within a given phase space to controlling materials synthesis and device fabrication conditions. A common feature in this application domain is that both the dimensionality of the problems and the cost of evaluations are high. The selection of an appropriate optimisation technique is key, with standard choices including iterative (<em>e.g.</em> steepest descent) and heuristic (<em>e.g.</em> simulated annealing) approaches, which are complemented by a new generation of statistical machine learning methods. We introduce Bayesian optimisation and highlight recent success cases in materials research. The challenges of using machine learning with automated research workflows that produce small and noisy data sets are discussed. Finally, we outline opportunities for developments in multi-objective and parallel algorithms for robust and efficient search strategies.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 6","pages":" 1086-1100"},"PeriodicalIF":0.0,"publicationDate":"2024-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d3dd00234a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141149363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}