{"title":"SynCoTrain: a dual classifier PU-learning framework for synthesizability prediction†","authors":"Sasan Amariamir, Janine George and Philipp Benner","doi":"10.1039/D4DD00394B","DOIUrl":"https://doi.org/10.1039/D4DD00394B","url":null,"abstract":"<p >Material discovery is a cornerstone of modern science, driving advancements in diverse disciplines from biomedical technology to climate solutions. Predicting synthesizability, a critical factor in realizing novel materials, remains a complex challenge due to the limitations of traditional heuristics and thermodynamic proxies. While stability metrics such as formation energy offer partial insights, they fail to account for kinetic factors and technological constraints that influence synthesis outcomes. These challenges are further compounded by the scarcity of negative data, as failed synthesis attempts are often unpublished or context-specific. We present SynCoTrain, a semi-supervised machine learning model designed to predict the synthesizability of materials. SynCoTrain employs a co-training framework leveraging two complementary graph convolutional neural networks: SchNet and ALIGNN. By iteratively exchanging predictions between classifiers, SynCoTrain mitigates model bias and enhances generalizability. Our approach uses Positive and Unlabeled (PU) learning to address the absence of explicit negative data, iteratively refining predictions through collaborative learning. The model demonstrates robust performance, achieving high recall on internal and leave-out test sets. By focusing on oxide crystals, a well-characterized material family with extensive experimental data, we establish SynCoTrain as a reliable tool for predicting synthesizability while balancing dataset variability and computational efficiency. 
This work highlights the potential of co-training to advance high-throughput materials discovery and generative research, offering a scalable solution to the challenge of synthesizability prediction.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 6","pages":" 1437-1448"},"PeriodicalIF":6.2,"publicationDate":"2025-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00394b?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Paddy: an evolutionary optimization algorithm for chemical systems and spaces†","authors":"Armen G. Beck, Sanjay Iyer, Jonathan Fine and Gaurav Chopra","doi":"10.1039/D4DD00226A","DOIUrl":"https://doi.org/10.1039/D4DD00226A","url":null,"abstract":"<p >Optimization of chemical systems and processes has been enhanced and enabled by the development of new algorithms and analytical approaches. While several methods systematically investigate how underlying variables correlate with a given outcome, there is often a substantial number of experiments needed to accurately model such relationships. As chemical systems increase in complexity, algorithms are needed to propose experiments that efficiently optimize the underlying objective, while effectively sampling parameter space to avoid convergence on local minima. We have developed the Paddy software package based on the Paddy field algorithm, a biologically inspired evolutionary optimization algorithm that propagates parameters without direct inference of the underlying objective function. We benchmarked Paddy against several optimization approaches: the Tree of Parzen Estimator through the Hyperopt software library, Bayesian optimization with a Gaussian process <em>via</em> Meta's Ax framework, and two population-based methods from EvoTorch (an evolutionary algorithm with Gaussian mutation, and a genetic algorithm using both a Gaussian mutation and single-point crossover), all representing diverse approaches to optimization. Paddy's performance is benchmarked for mathematical and chemical optimization tasks including global optimization of a two-dimensional bimodal distribution, interpolation of an irregular sinusoidal function, hyperparameter optimization of an artificial neural network tasked with classification of solvent for reaction components, targeted molecule generation by optimizing input vectors for a decoder network, and sampling discrete experimental space for optimal experimental planning. 
Paddy demonstrates robust versatility by maintaining strong performance across all optimization benchmarks, compared to other algorithms with varying performance. Additionally, Paddy avoids early convergence with its ability to bypass local optima in search of global solutions. We anticipate that the facile, versatile, robust and open-source nature of Paddy will serve as a toolkit in chemical problem-solving tasks towards automated experimentation with high priority for exploratory sampling and innate resistance to early convergence to identify optimal solutions.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 5","pages":" 1352-1371"},"PeriodicalIF":6.2,"publicationDate":"2025-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00226a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143944076","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving structural plausibility in diffusion-based 3D molecule generation via property-conditioned training with distorted molecules†","authors":"Lucy Vost, Vijil Chenthamarakshan, Payel Das and Charlotte M. Deane","doi":"10.1039/D4DD00331D","DOIUrl":"https://doi.org/10.1039/D4DD00331D","url":null,"abstract":"<p >Traditional drug design methods are costly and time-consuming due to their reliance on trial-and-error processes. As a result, computational methods, including diffusion models, designed for molecule generation tasks have gained significant traction. Despite their potential, they have faced criticism for producing physically implausible outputs. As a solution to this problem, we propose a conditional training framework resulting in a model capable of generating molecules of varying and controllable levels of structural plausibility. This framework consists of adding distorted molecules to training datasets, and then annotating each molecule with a label representing the extent of its distortion, and hence its quality. By training the model to distinguish between favourable and unfavourable molecular conformations alongside the standard molecule generation training process, we can selectively sample molecules from the high-quality region of learned space, resulting in improvements in the validity of generated molecules. In addition to the standard two datasets used by molecule generation methods (QM9 and GEOM), we also test our method on a druglike dataset derived from ZINC. We use our conditional method with EDM, the first E(3) equivariant diffusion model for molecule generation, as well as two further models—a more recent diffusion model and a flow matching model—which were built off EDM. 
We demonstrate improvements in validity as assessed by RDKit parsability and the PoseBusters test suite; more broadly, though, our findings highlight the effectiveness of conditioning methods on low-quality data to improve the sampling of high-quality data.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 4","pages":" 1092-1099"},"PeriodicalIF":6.2,"publicationDate":"2025-03-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00331d?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143809074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient strategies for reducing sampling error in quantum Krylov subspace diagonalization","authors":"Gwonhak Lee, Seonghoon Choi, Joonsuk Huh and Artur F. Izmaylov","doi":"10.1039/D4DD00321G","DOIUrl":"https://doi.org/10.1039/D4DD00321G","url":null,"abstract":"<p >Within the realm of early fault-tolerant quantum computing (EFTQC), quantum Krylov subspace diagonalization (QKSD) has emerged as a promising quantum algorithm for the approximate Hamiltonian diagonalization <em>via</em> projection onto the quantum Krylov subspace. However, the algorithm often requires solving an ill-conditioned generalized eigenvalue problem (GEVP) involving erroneous matrix pairs, which can significantly distort the solution. Since EFTQC assumes limited-scale error correction, finite sampling error becomes a dominant source of error in these matrices. This work focuses on quantifying sampling errors during the measurement of matrix elements in the projected Hamiltonian, examining two measurement approaches based on the Hamiltonian decompositions: the linear combination of unitaries and diagonalizable fragments. To reduce sampling error within a fixed budget of quantum circuit repetitions, we propose two measurement strategies: the shifting technique and coefficient splitting. The shifting technique eliminates redundant Hamiltonian components that annihilate either the bra or ket states, while coefficient splitting optimizes the measurement of common terms across different circuits. 
Numerical experiments with electronic structures of small molecules demonstrate the effectiveness of these strategies, reducing sampling costs by a factor of 20–500.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 4","pages":" 954-969"},"PeriodicalIF":6.2,"publicationDate":"2025-03-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00321g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143809089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine learning-driven optimization of the output force in photo-actuated organic crystals†","authors":"Kazuki Ishizaki, Toru Asahi and Takuya Taniguchi","doi":"10.1039/D4DD00380B","DOIUrl":"https://doi.org/10.1039/D4DD00380B","url":null,"abstract":"<p >Photo-actuated organic crystals that can be remotely controlled by light are gaining attention as next-generation actuator materials. In the practical application of actuator materials, the mode of deformation and the output force are important properties. Since the output force depends on the crystal properties and experimental conditions, it is necessary to explore the optimal conditions from a vast parameter space. In this study, we employed two types of machine learning for molecular design and experimental optimization to maximize the blocking force. Machine learning in molecular design led to the creation of a material pool of salicylideneamine derivatives. Bayesian optimization was used for efficient sampling from the material pool for force measurements in the real world, achieving a maximum blocking force of 37.0 mN. This method was at least 73 times more efficient than the grid search approach.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 5","pages":" 1199-1208"},"PeriodicalIF":6.2,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00380b?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143944047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"pyRheo: an open-source Python package for complex rheology†","authors":"Isaac Y. Miranda-Valdez, Aaro Niinistö, Tero Mäkinen, Juha Lejon, Juha Koivisto and Mikko J. Alava","doi":"10.1039/D5DD00021A","DOIUrl":"https://doi.org/10.1039/D5DD00021A","url":null,"abstract":"<p >Mathematical modeling is a powerful tool in rheology, and we present pyRheo, an open-source package for Python designed to streamline the analysis of creep, stress relaxation, small amplitude oscillatory shear, and steady shear flow tests. pyRheo contains a comprehensive selection of viscoelastic models, including fractional order approaches. It integrates model selection and fitting features and employs machine intelligence to suggest a model to describe a given dataset. The package fits the suggested model or one chosen by the user. An advantage of using pyRheo is that it addresses challenges associated with sensitivity to initial guesses in parameter optimization. It allows the user to iteratively search for the best initial guesses, avoiding convergence to local minima. We discuss the capabilities of pyRheo and compare them to other tools for rheological modeling of soft matter. We demonstrate that pyRheo significantly reduces the computation time required to fit high-performance viscoelastic models.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 4","pages":" 1075-1082"},"PeriodicalIF":6.2,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00021a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143809072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robotic integration for end-stations at scientific user facilities†","authors":"Chandima Fernando, Hailey Marcello, Jakub Wlodek, John Sinsheimer, Daniel Olds, Stuart I. Campbell and Phillip M. Maffettone","doi":"10.1039/D5DD00036J","DOIUrl":"https://doi.org/10.1039/D5DD00036J","url":null,"abstract":"<p >The integration of robotics and artificial intelligence (AI) into scientific workflows is transforming experimental research, particularly at large-scale user facilities such as the National Synchrotron Light Source II (NSLS-II). We present an extensible architecture for robotic sample management that combines the Robot Operating System 2 (ROS2) with the <em>Bluesky</em> experiment orchestration ecosystem. This approach enabled seamless integration of robotic systems into high-throughput experiments and adaptive workflows. Key innovations included a client-server model for managing robotic actions, real-time pose estimation using fiducial markers and computer vision, and closed-loop adaptive experimentation with agent-driven decision-making. Deployed using widely available hardware and open-source software, this architecture successfully automated a full shift (8 hours) of sample manipulation without errors. The system's flexibility and extensibility allow rapid re-deployment across different experimental environments, enabling scalable self-driving experiments for end stations at scientific user facilities. 
This work highlights the potential of robotics to enhance experimental throughput and reproducibility, providing a roadmap for future developments in automated scientific discovery where flexibility, extensibility, and adaptability are core requirements.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 4","pages":" 1083-1091"},"PeriodicalIF":6.2,"publicationDate":"2025-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d5dd00036j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143809073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SurfPro – a curated database and predictive model of experimental properties of surfactants†","authors":"Stefan L. Hödl, Luc Hermans, Pim F. J. Dankloff, Aigars Piruska, Wilhelm T. S. Huck and William E. Robinson","doi":"10.1039/D4DD00393D","DOIUrl":"https://doi.org/10.1039/D4DD00393D","url":null,"abstract":"<p >Despite great industrial interest, modeling the physical properties of surfactants in water based on their molecular structure remains a challenge. A significant part of this challenge is in obtaining sufficient amounts of high-quality data. Experimentally determined properties such as the critical micelle concentration (CMC) and surface tension at CMC (<em>γ</em><small><sub>CMC</sub></small>) have been reported for many surfactants. However, surfactant data are scattered across many literature sources, and reported in a manner which is often unsuitable as input for predictive models. In this work, we address this limitation by compiling the SurfPro database of surfactant properties. SurfPro consists of 1624 surfactant entries curated from 223 literature sources, containing 1395 CMC values, 972 <em>γ</em><small><sub>CMC</sub></small> values and more than 657 values for <em>Γ</em><small><sub>max</sub></small>, <em>C</em><small><sub>20</sub></small>, π<small><sub>CMC</sub></small> and <em>A</em><small><sub>min</sub></small>. However, only 647 structures have all reported properties, and for most surfactants multiple properties are missing. We trained a previously reported graph neural network architecture for single- and multi-property prediction on these incomplete data of all surfactant types in the database to accurately predict pCMC (−log<small><sub>10</sub></small>(CMC)), <em>γ</em><small><sub>CMC</sub></small>, <em>Γ</em><small><sub>max</sub></small> and p<em>C</em><small><sub>20</sub></small>. 
We achieved state-of-the-art performance of these four properties using an ensemble of AttentiveFP models trained on ten different folds of the training data in the multi-property setting. Finally, we leveraged the predictions and uncertainties of the ensemble model to impute all missing properties for all 977 surfactants with an incomplete set of properties. We make our curated SurfPro database, proposed test split and training datasets, the imputed database, as well as our code publicly available.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 5","pages":" 1176-1187"},"PeriodicalIF":6.2,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00393d?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143943996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DOPtools: a Python platform for descriptor calculation and model optimization","authors":"Said Byadi, Philippe Gantzer, Timur Gimadiev and Pavel Sidorov","doi":"10.1039/D4DD00399C","DOIUrl":"https://doi.org/10.1039/D4DD00399C","url":null,"abstract":"<p >The DOPtools (Descriptors and Optimization tools) platform is a Python library for the calculation of chemical descriptors, hyperparameter optimization, and building and validation of QSPR models. In addition to the Python code that can be integrated in custom scripts, it provides a command line interface for the automatic calculation of various descriptors and for eventual hyperparameter optimization of statistical models, enabling its use in server applications for QSPR modeling. It is especially suited for modeling reaction properties <em>via</em> functions that calculate descriptors for all reaction components. While a variety of existing tools and libraries can calculate various molecular descriptors, their output format is often unique, which complicates their integration with standard machine learning libraries. DOPtools provides a unified API for the calculated descriptors as input for the scikit-learn library. The modular nature of the code allows easy addition of algorithms if required by the end user. 
The code for the platform is freely available at GitHub and can be installed through PyPI.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 5","pages":" 1188-1198"},"PeriodicalIF":6.2,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00399c?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143944046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measurements with noise: Bayesian optimization for co-optimizing noise and property discovery in automated experiments†","authors":"Boris N. Slautin, Yu Liu, Jan Dec, Vladimir V. Shvartsman, Doru C. Lupascu, Maxim A. Ziatdinov and Sergei V. Kalinin","doi":"10.1039/D4DD00391H","DOIUrl":"https://doi.org/10.1039/D4DD00391H","url":null,"abstract":"<p >We have developed a Bayesian optimization (BO) workflow that integrates intra-step noise optimization into automated experimental cycles. Traditional BO approaches in automated experiments focus on optimizing experimental trajectories but often overlook the impact of measurement noise on data quality and cost. Our proposed framework simultaneously optimizes both the target property and the associated measurement noise by introducing time as an additional input parameter, thereby balancing the signal-to-noise ratio and experimental duration. Two approaches are explored: a reward-driven noise optimization and a double-optimization acquisition function, both enhancing the efficiency of automated workflows by considering noise and cost within the optimization process. We validate our method through simulations and real-world experiments using Piezoresponse Force Microscopy (PFM), demonstrating the successful optimization of measurement duration and property exploration. 
Our approach offers a scalable solution for optimizing multiple variables in automated experimental workflows, improving data quality, and reducing resource expenditure in materials science and beyond.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 4","pages":" 1066-1074"},"PeriodicalIF":6.2,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00391h?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143809047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}