{"title":"Investigation of arene and heteroarene nitration supported by high-throughput experimentation and machine learning†","authors":"Taline Kerackian, Clément Wespiser, Matthieu Daniel, Eric Pasquinet and Eugénie Romero","doi":"10.1039/D5DD00086F","DOIUrl":"https://doi.org/10.1039/D5DD00086F","url":null,"abstract":"<p >Access to the nitro functional group is a widespread and longstanding transformation of interest in many fields of chemistry. However, the robustness and specificity of this transformation can remain challenging, particularly in the case of heteroarene nitration. Based on this observation, a comprehensive investigation was initiated to screen nitration conditions on various arenes and heteroarenes. A systematic and diverse study of both nitrating agents and activating reagents was conducted using high-throughput experimentation to afford high-quantity and high-quality data generation. General trends were identified and correlated with the electronic properties of the heteroarenes; notably, the difficult nitration of electron-poor heteroarenes was highlighted. Original combinations of reagents were found to perform well in nitration reactions. The obtained data were also used to design a predictive tool relying on machine learning in order to provide the best nitration reaction conditions depending on the targeted substrate. 
The limited predictive efficiency obtained pointed out the importance of diversification and chemically relevant encoding of the data set.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 7","pages":" 1662-1671"},"PeriodicalIF":6.2,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00086f?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evolutionary machine learning of physics-based force fields in high-dimensional parameter-space†","authors":"David van der Spoel, Julián Marrades, Kristian Kříž, A. Najla Hosseini, Alfred T. Nordman, João Paulo, Marie-Madeleine Walz, Paul J. van Maaren and Mohammad M. Ghahremanpour","doi":"10.1039/D5DD00178A","DOIUrl":"https://doi.org/10.1039/D5DD00178A","url":null,"abstract":"<p >This work presents the Alexandria Chemistry Toolkit (ACT), an open-source software for machine learning of physics-based force fields (FFs) from scratch, based on user-specified potential functions. In this approach, a set of FF parameters for molecular simulation is described as a chromosome consisting of atom and bond genes. The accuracy of a FF, that is how well quantum chemical training data are reproduced, determines the fitness of the chromosome. The ACT implements a hierarchical parallel scheme that iterates between a genetic algorithm and Monte-Carlo steps for global and local search, to find “genomes” with high fitness. As a sample application, genome evolution is performed to create physical models that allow the prediction of properties of organic molecules in the gas and liquid phases. 
Evaluation of the prediction accuracy of different models showcases how force field science can contribute to systematically improve prediction accuracy of physicochemical observables.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 7","pages":" 1925-1935"},"PeriodicalIF":6.2,"publicationDate":"2025-06-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00178a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PAL – parallel active learning for machine-learned potentials†","authors":"Chen Zhou, Marlen Neubert, Yuri Koide, Yumeng Zhang, Van-Quan Vuong, Tobias Schlöder, Stefanie Dehnen and Pascal Friederich","doi":"10.1039/D5DD00073D","DOIUrl":"https://doi.org/10.1039/D5DD00073D","url":null,"abstract":"<p >Constructing datasets representative of the target domain is essential for training effective machine learning models. Active learning (AL) is a promising method that iteratively extends training data to enhance model performance while minimizing data acquisition costs. However, current AL workflows often require human intervention and lack parallelism, leading to inefficiencies and underutilization of modern computational resources. In this work, we introduce PAL, an automated, modular, and parallel active learning library that integrates AL tasks and manages their execution and communication on shared- and distributed-memory systems using the Message Passing Interface (MPI). PAL provides users with the flexibility to design and customize all components of their active learning scenarios, including machine learning models with uncertainty estimation, oracles for ground truth labeling, and strategies for exploring the target space. We demonstrate that PAL significantly reduces computational overhead and improves scalability, achieving substantial speed-ups through asynchronous parallelization on CPU and GPU hardware. Applications of PAL to several real-world scenarios – including ground-state reactions in biomolecular systems, excited-state dynamics of molecules, simulations of inorganic clusters, and thermo-fluid dynamics – illustrate its effectiveness in accelerating the development of machine learning models. 
Our results show that PAL enables efficient utilization of high-performance computing resources in active learning workflows, fostering advancements in scientific research and engineering applications.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 7","pages":" 1901-1911"},"PeriodicalIF":6.2,"publicationDate":"2025-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12188519/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144531387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Physics-informed Gaussian process classification for constraint-aware alloy design†","authors":"Christofer Hardcastle, Ryan O'Mullan, Raymundo Arróyave and Brent Vela","doi":"10.1039/D5DD00084J","DOIUrl":"https://doi.org/10.1039/D5DD00084J","url":null,"abstract":"<p >Alloy design can be framed as a constraint-satisfaction problem. Building on previous methodologies, we propose equipping Gaussian Process Classifiers (GPCs) with physics-informed prior mean functions to model the centers of feasible design spaces. Through three case studies, we highlight the utility of informative priors for handling constraints on continuous and categorical properties. (1) <em>Phase stability</em>: by incorporating CALPHAD predictions as priors for solid-solution phase stability, we enhance model validation using a publicly available XRD dataset. (2) <em>Phase stability prediction refinement</em>: we demonstrate an <em>in silico</em> active learning approach to efficiently correct phase diagrams. (3) <em>Continuous property thresholds</em>: by embedding priors into continuous property models, we accelerate the discovery of alloys meeting specific property thresholds <em>via</em> active learning. 
In each case, integrating physics-based insights into the classification framework substantially improved model performance, demonstrating an efficient strategy for constraint-aware alloy design.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 7","pages":" 1884-1900"},"PeriodicalIF":6.2,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00084j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Solving an inverse problem with generative models","authors":"John R. Kitchin","doi":"10.1039/D5DD00137D","DOIUrl":"https://doi.org/10.1039/D5DD00137D","url":null,"abstract":"<p >Inverse problems, where we seek the values of inputs to a model that lead to a desired set of outputs, are considered a more challenging problem in science and engineering than forward problems where we compute or measure outputs from known inputs. In this work we demonstrate the use of two generative machine learning methods to solve inverse problems. We compare this approach to two more conventional approaches that use a forward model with nonlinear programming, and the use of a backward model. We illustrate each method on a dataset obtained from a simple remote instrument that has three inputs: the setting of the red, green and blue channels of an RGB LED. We focus on several outputs from a light sensor that measures intensity at 445 nm, 515 nm, 590 nm, and 630 nm. The specific problem we solve is identifying inputs that lead to a specific intensity in three of those channels. We show that generative models can be used to solve this kind of inverse problem, and they have some advantages over the conventional approaches.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 7","pages":" 1856-1869"},"PeriodicalIF":6.2,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00137d?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predefined attention-focused mechanism using center-environment features: a machine learning study of alloying effects on the stability of Nb5Si3 alloys†","authors":"Yuchao Tang, Bin Xiao, Shuizhou Chen, Quan Qian and Yi Liu","doi":"10.1039/D5DD00079C","DOIUrl":"https://doi.org/10.1039/D5DD00079C","url":null,"abstract":"<p >Digital encoding of material structures using graph-based features combined with deep neural networks often lacks local specificity. Additionally, incorporating a self-attention mechanism increases architectural complexity and demands extensive data. To overcome these challenges, we developed a Center-Environment (CE) feature representation—a less data-intensive, physics-informed predefined attention mechanism. The pre-attention mechanism underlying the CE model shifts attention from complex black-box machine learning (ML) algorithms to explicit feature models with physical meaning, reducing data requirements while enhancing the transparency and interpretability of ML models. This CE-based ML approach was employed to investigate the alloying effects on the structural stability of Nb<small><sub>5</sub></small>Si<small><sub>3</sub></small>, guiding data-driven compositional design for ultra-high-temperature NbSi superalloys. The CE features leveraged the Atomic Environment Type (AET) method to characterize the local low-symmetry physical environments of atoms. The optimized CE<small><sub>AET</sub></small> models reasonably predicted double-site substitution energies in α-Nb<small><sub>5</sub></small>Si<small><sub>3</sub></small>, achieving a mean absolute error (MAE) of 329.43 meV per cell. The robust transferability of the CE<small><sub>AET</sub></small> models was demonstrated by their successful prediction of untrained β-Nb<small><sub>5</sub></small>Si<small><sub>3</sub></small> structures. 
Site occupancy preferences were identified for B, Si, and Al at Si sites and for Ti, Hf, and Zr at Nb sites within β-Nb<small><sub>5</sub></small>Si<small><sub>3</sub></small>. This CE-based ML approach represents a broadly applicable and intelligent computational design method capable of handling complex crystal structures with strong transferability, even when working with small datasets.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 7","pages":" 1870-1883"},"PeriodicalIF":6.2,"publicationDate":"2025-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00079c?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RedCat, an automated discovery workflow for aqueous organic electrolytes†","authors":"Murat Cihan Sorkun, Xuan Zhou, Joannes Murigneux, Nicola Menegazzo, Ayush Kumar Narsaria, David Thanoon, Peter A. A. Klusener, Kaustubh Kaluskar, Sharan Shetty, Efstathios Barmpoutsis and Süleyman Er","doi":"10.1039/D5DD00111K","DOIUrl":"https://doi.org/10.1039/D5DD00111K","url":null,"abstract":"<p >Developing cost-effective organic molecules with robust redox activity and high solubility is crucial for widespread acceptance and deployment of aqueous organic redox flow batteries (AORFBs). We present RedCat, an automated workflow designed to accelerate the discovery of redox-active organic molecules from extensive molecular databases. This workflow employs structure-based selection, machine learning models for predicting redox reaction energy and aqueous solubility, and dynamically integrates up-to-date pricing data to prioritize candidates. Applying this workflow to 112 million molecules from the PubChem database, we identified 261 promising anolyte candidates. We validated their battery-related properties through first-principles and molecular dynamics calculations and experimentally tested two electrochemically active molecules. These molecules demonstrated higher energy densities than previously reported compounds, confirming the robustness of our workflow in discovering electrolytes. 
With its open-access code repository and modular design, RedCat is well-suited for integration into self-driving labs, offering a scalable framework for autonomous, data-driven electrolyte discovery.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 7","pages":" 1844-1855"},"PeriodicalIF":6.2,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00111k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autonomous laboratories in China: an embodied intelligence-driven platform to accelerate chemical discovery","authors":"Jinpeng Li, Chuxuan Ding, Daobin Liu, Linjiang Chen and Jun Jiang","doi":"10.1039/D5DD00072F","DOIUrl":"https://doi.org/10.1039/D5DD00072F","url":null,"abstract":"<p >The emergence of autonomous laboratories—automated robotic platforms integrated with rapidly advancing artificial intelligence (AI)—is poised to transform research by shifting traditional trial-and-error approaches toward accelerated chemical discovery. These platforms combine AI models, hardware, and software to execute experiments, interact with robotic systems, and manage data, thereby closing the predict-make-measure discovery loop. However, key challenges remain, including how to efficiently achieve autonomous high-throughput experimentation and integrate diverse technologies into cohesive systems. In this perspective, we identify the fundamental elements required for closed-loop autonomous experimentation: chemical science databases, large-scale intelligent models, automated experimental platforms, and integrated management/decision-making systems. Furthermore, with the advancement of AI models, we emphasize the progress from simple iterative-algorithm-driven systems to comprehensive intelligent autonomous systems powered by large-scale models in China, which enable self-driving chemical discovery within individual laboratories. 
Looking ahead, the development of intelligent autonomous laboratories into a distributed network holds great promise for further accelerating chemical discoveries and fostering innovation on a broader scale.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 7","pages":" 1672-1684"},"PeriodicalIF":6.2,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00072f?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A property graph schema for automated metadata capture, reproducibility and knowledge discovery in high-throughput bioprocess development†","authors":"Federico M. Mione, Martin F. Luna, Lucas Kaspersetz, Peter Neubauer, Ernesto C. Martinez and M. Nicolas Cruz Bournazou","doi":"10.1039/D5DD00070J","DOIUrl":"https://doi.org/10.1039/D5DD00070J","url":null,"abstract":"<p >Recent advances in autonomous experimentation and self-driving laboratories have drastically increased the complexity of orchestrating robotic experiments and of recording the different computational processes involved including all related metadata. Addressing this challenge requires a flexible and scalable information storage system that prioritizes the relationships between data and metadata, surpassing the limitations of traditional relational databases. To foster knowledge discovery in high-throughput bioprocess development, the computational control of the experimentation must be fully automated, with the capability to efficiently collect and manage experimental data and their integration into a knowledge base. This work proposes the adoption of graph databases integrated with a semantic structure to enable knowledge transfer between humans and machines. To this end, a property graph schema (PG-schema) has been specifically designed for high-throughput experiments in robotic platforms, focused mainly on the automation of the computational workflow used to ensure the reproducibility, reusability, and credibility of learned bioprocess models. 
A prototype implementation of the PG-schema and its integration with the workflow management system using simulated experiments is presented to highlight the advantages of the proposed approach in the generation of FAIR data.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 9","pages":" 2401-2422"},"PeriodicalIF":6.2,"publicationDate":"2025-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00070j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145028075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mutual information informed novelty estimation of materials along chemical and structural axes†","authors":"Andrew R. Falkowski and Taylor D. Sparks","doi":"10.1039/D5DD00167F","DOIUrl":"https://doi.org/10.1039/D5DD00167F","url":null,"abstract":"<p >Assessing the novelty of computationally or experimentally discovered materials against vast databases is crucial for efficient materials exploration, yet robust, objective methods are lacking. This paper introduces a parameter-free approach to quantify material novelty along chemical and structural axes. Our method leverages mutual information (MI), analyzing how it changes with calculated inter-material distances (<em>e.g.</em>, using EIMD for chemistry, LoStOP for structure) to derive data-driven weight functions. These functions define meaningful similarity neighborhoods without preset cutoffs, yielding quantitative novelty scores based on local density. We validate the approach using synthetic data and demonstrate its effectiveness across diverse materials datasets, including perovskites with controlled subgroups, a collection with varied structure types, and predicted lithium compounds from the GNOME database compared against materials in the Materials Project. 
The MI-informed framework successfully identifies and differentiates chemical and structural novelty, offering an interpretable tool to guide materials discovery and assess new candidates within the context of existing knowledge.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 7","pages":" 1833-1843"},"PeriodicalIF":6.2,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00167f?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144589485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}