{"title":"ProtAgents: protein discovery via large language model multi-agent collaborations combining physics and machine learning†","authors":"Alireza Ghafarollahi and Markus J. Buehler","doi":"10.1039/D4DD00013G","DOIUrl":"10.1039/D4DD00013G","url":null,"abstract":"<p >Designing <em>de novo</em> proteins beyond those found in nature holds significant promise for advancements in both scientific and engineering applications. Current methodologies for protein design often rely on AI-based models, such as surrogate models that address end-to-end problems by linking protein structure to material properties or <em>vice versa</em>. However, these models frequently focus on specific material objectives or structural properties, limiting their flexibility when incorporating out-of-domain knowledge into the design process or comprehensive data analysis is required. In this study, we introduce ProtAgents, a platform for <em>de novo</em> protein design based on Large Language Models (LLMs), where multiple AI agents with distinct capabilities collaboratively address complex tasks within a dynamic environment. The versatility in agent development allows for expertise in diverse domains, including knowledge retrieval, protein structure analysis, physics-based simulations, and results analysis. The dynamic collaboration between agents, empowered by LLMs, provides a versatile approach to tackling protein design and analysis problems, as demonstrated through diverse examples in this study. The problems of interest encompass designing new proteins, analyzing protein structures and obtaining new first-principles data – natural vibrational frequencies – <em>via</em> physics simulations. The concerted effort of the system allows for powerful automated and synergistic design of <em>de novo</em> proteins with targeted mechanical properties. 
The flexibility in designing the agents, on the one hand, and their capacity for autonomous collaboration through the dynamic LLM-based multi-agent environment, on the other, unlock the potential of LLMs for addressing multi-objective materials problems and open up new avenues for autonomous materials discovery and design.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":null,"pages":null},"PeriodicalIF":6.2,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00013g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141059333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clayton W. Kosonocky, Claus O. Wilke, Edward M. Marcotte and Andrew D. Ellington
{"title":"Mining patents with large language models elucidates the chemical function landscape†","authors":"Clayton W. Kosonocky, Claus O. Wilke, Edward M. Marcotte and Andrew D. Ellington","doi":"10.1039/D4DD00011K","DOIUrl":"10.1039/D4DD00011K","url":null,"abstract":"<p >The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule–function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. 
We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00011k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140881812","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kenneth López-Pérez, Taewon D. Kim and Ramón Alain Miranda-Quintana
{"title":"iSIM: instant similarity†","authors":"Kenneth López-Pérez, Taewon D. Kim and Ramón Alain Miranda-Quintana","doi":"10.1039/D4DD00041B","DOIUrl":"10.1039/D4DD00041B","url":null,"abstract":"<p >The quantification of molecular similarity has been present since the beginning of cheminformatics. Although several similarity indices and molecular representations have been reported, all of them ultimately reduce to the calculation of molecular similarities of only two objects at a time. Hence, to obtain the average similarity of a set of molecules, all the pairwise comparisons need to be computed, which demands a quadratic scaling in the number of computational resources. Here we propose an exact alternative to this problem: iSIM (instant similarity). iSIM performs comparisons of multiple molecules at the same time and yields the same value as the average pairwise comparisons of molecules represented by binary fingerprints and real-value descriptors. In this work, we introduce the mathematical framework and several applications of iSIM in chemical sampling, visualization, diversity selection, and clustering.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00041b?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140881815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction: Predicting small molecules solubility on endpoint devices using deep ensemble neural networks","authors":"Mayk Caldas Ramos and Andrew D. White","doi":"10.1039/D4DD90020K","DOIUrl":"10.1039/D4DD90020K","url":null,"abstract":"<p >Correction for ‘Predicting small molecules solubility on endpoint devices using deep ensemble neural networks’ by Mayk Caldas Ramos and Andrew D. White, <em>Digital Discovery</em>, 2024, <strong>3</strong>, 786–795, https://doi.org/10.1039/D3DD00217A.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd90020k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140838554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kangming Li, Brian DeCost, Kamal Choudhary and Jason Hattrick-Simpers
{"title":"A reproducibility study of atomistic line graph neural networks for materials property prediction†","authors":"Kangming Li, Brian DeCost, Kamal Choudhary and Jason Hattrick-Simpers","doi":"10.1039/D4DD00064A","DOIUrl":"10.1039/D4DD00064A","url":null,"abstract":"<p >Use of machine learning has been increasingly popular in materials science as data-driven materials discovery is becoming the new paradigm. Reproducibility of findings is paramount for promoting transparency and accountability in research and building trust in the scientific community. Here we conduct a reproducibility analysis of the work by K. Choudhary and B. DeCost [<em>npj Comput. Mater.</em>, <strong>7</strong>, 2021, 185], in which a new graph neural network architecture was developed with improved performance on multiple atomistic prediction tasks. We examine the reproducibility for the model performance on 29 regression tasks and for an ablation analysis of the graph neural network layers. We find that the reproduced results generally exhibit a good quantitative agreement with the initial study, despite minor disparities in model performance and training efficiency that may result from factors such as hardware differences and stochasticity involved in model training and data splits. The ease of conducting these reproducibility experiments confirms the great benefits of open data and code practices to which the initial work adhered. 
We also discuss some further enhancements in reproducible practices such as code and data archiving and providing data identifiers used in dataset splits.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00064a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140838553","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ryan J. R. Jones, Yungchieh Lai, Dan Guevarra, Kevin Kan, Joel A. Haber and John M. Gregoire
{"title":"Accelerated screening of gas diffusion electrodes for carbon dioxide reduction†","authors":"Ryan J. R. Jones, Yungchieh Lai, Dan Guevarra, Kevin Kan, Joel A. Haber and John M. Gregoire","doi":"10.1039/D4DD00061G","DOIUrl":"10.1039/D4DD00061G","url":null,"abstract":"<p >The electrochemical conversion of carbon dioxide to chemicals and fuels is expected to be a key sustainability technology. Electrochemical carbon dioxide reduction technologies are challenged by several factors, including the limited solubility of carbon dioxide in aqueous electrolyte as well as the difficulty in utilizing polymer electrolytes. These considerations have driven system designs to incorporate gas diffusion electrodes (GDEs) to bring the electrocatalyst in contact with both a gaseous reactant/product stream as well as a liquid electrolyte. GDE optimization typically results from manual tuning by select experts. Automated preparation and operation of GDE cells could be a watershed for the systematic study of, and ultimately the development of a materials acceleration platform (MAP) for, catalyst discovery and system optimization. Toward this end, we present the automated GDE (AutoGDE) testing system. 
Given a catalyst-coated GDE, AutoGDE automates the insertion of the GDE into an electrochemical cell, the liquid and gas handling, the quantification of gaseous reaction products <em>via</em> online mass spectroscopy, and the archiving of the liquid electrolyte for subsequent analysis.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00061g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140838305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexandra Volokhova, Michał Koziarski, Alex Hernández-García, Cheng-Hao Liu, Santiago Miret, Pablo Lemos, Luca Thiede, Zichao Yan, Alán Aspuru-Guzik and Yoshua Bengio
{"title":"Towards equilibrium molecular conformation generation with GFlowNets†","authors":"Alexandra Volokhova, Michał Koziarski, Alex Hernández-García, Cheng-Hao Liu, Santiago Miret, Pablo Lemos, Luca Thiede, Zichao Yan, Alán Aspuru-Guzik and Yoshua Bengio","doi":"10.1039/D4DD00023D","DOIUrl":"10.1039/D4DD00023D","url":null,"abstract":"<p >Sampling diverse, thermodynamically feasible molecular conformations plays a crucial role in predicting properties of a molecule. In this paper we propose to use GFlowNets for sampling conformations of small molecules from the Boltzmann distribution, as determined by the molecule's energy. The proposed approach can be used in combination with energy estimation methods of different fidelity and discovers a diverse set of low-energy conformations for drug-like molecules. We demonstrate that GFlowNets can reproduce molecular potential energy surfaces by sampling proportionally to the Boltzmann distribution.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00023d?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140810659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Maxime van der Heijden, Gabor Szendrei, Victor de Haas and Antoni Forner-Cuenca
{"title":"A versatile optimization framework for porous electrode design†","authors":"Maxime van der Heijden, Gabor Szendrei, Victor de Haas and Antoni Forner-Cuenca","doi":"10.1039/D3DD00247K","DOIUrl":"10.1039/D3DD00247K","url":null,"abstract":"<p >Porous electrodes are performance-defining components in electrochemical devices, such as redox flow batteries, as they govern the electrochemical performance and pumping demands of the reactor. Yet, conventional porous electrodes used in redox flow batteries are not tailored to sustain convection-enhanced electrochemical reactions. Thus, there is a need for electrode optimization to enhance the system performance. In this work, we present an optimization framework to carry out the bottom-up design of porous electrodes by coupling a genetic algorithm with a pore network modeling framework. We introduce geometrical versatility by adding a pore merging and splitting function, study the impact of various optimization parameters, geometrical definitions, and objective functions, and incorporate conventional electrode and flow field designs. Moreover, we show the need for optimizing geometries for specific reactor architectures and operating conditions to design next-generation electrodes, by analyzing the genetic algorithm optimization for initial starting geometries with diverse morphologies (cubic and a tomography-extracted commercial electrode), flow field designs (flow-through and interdigitated), and redox chemistries (VO<small><sup>2+</sup></small>/VO<small><sub>2</sub></small><small><sup>+</sup></small> and TEMPO/TEMPO<small><sup>+</sup></small>). We found that for kinetically sluggish electrolytes with high ionic conductivity, electrodes with numerous small pores and high internal surface area provide enhanced performance, whereas for kinetically facile electrolytes with low ionic conductivity, low through-plane tortuosity and high hydraulic conductance are desired. 
The computational tool developed in this work can be further expanded to the design of high-performance electrode materials for a broad range of operating conditions, electrolyte chemistries, reactor designs, and electrochemical technologies.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":null,"pages":null},"PeriodicalIF":6.2,"publicationDate":"2024-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d3dd00247k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140803034","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marco Anselmi, Greg Slabaugh, Rachel Crespo-Otero and Devis Di Tommaso
{"title":"Molecular graph transformer: stepping beyond ALIGNN into long-range interactions†","authors":"Marco Anselmi, Greg Slabaugh, Rachel Crespo-Otero and Devis Di Tommaso","doi":"10.1039/D4DD00014E","DOIUrl":"10.1039/D4DD00014E","url":null,"abstract":"<p >Graph Neural Networks (GNNs) have revolutionized material property prediction by learning directly from the structural information of molecules and materials. However, conventional GNN models rely solely on local atomic interactions, such as bond lengths and angles, neglecting crucial long-range electrostatic forces that affect certain properties. To address this, we introduce the Molecular Graph Transformer (MGT), a novel GNN architecture that combines local attention mechanisms with message passing on both bond graphs and their line graphs, explicitly capturing long-range interactions. Benchmarking on MatBench and Quantum MOF (QMOF) datasets demonstrates that MGT's improved understanding of electrostatic interactions significantly enhances the prediction accuracy of properties like exfoliation energy and refractive index, while maintaining state-of-the-art performance on all other properties. This breakthrough paves the way for the development of highly accurate and efficient materials design tools across diverse applications.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d4dd00014e?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140803032","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning peptide properties with positive examples only","authors":"Mehrad Ansari and Andrew D. White","doi":"10.1039/D3DD00218G","DOIUrl":"10.1039/D3DD00218G","url":null,"abstract":"<p >Deep learning can create accurate predictive models by exploiting existing large-scale experimental data, and guide the design of molecules. However, a major barrier is the requirement of both positive and negative examples in the classical supervised learning frameworks. Notably, most peptide databases come with missing information and a low number of observations on negative examples, as such sequences are hard to obtain using high-throughput screening methods. To address this challenge, we solely exploit the limited known positive examples in a semi-supervised setting, and discover peptide sequences that are likely to map to certain antimicrobial properties <em>via</em> positive-unlabeled learning (PU). In particular, we use the two learning strategies of adapting base classifier and reliable negative identification to build deep learning models for inferring solubility, hemolysis, binding against SHP-2, and non-fouling activity of peptides, given their sequence. 
We evaluate the predictive performance of our PU learning method and show that by only using the positive data, it can achieve competitive performance when compared with the classical positive–negative (PN) classification approach, where there is access to both positive and negative examples.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2024/dd/d3dd00218g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140629401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}