{"title":"Developing large language models for quantum chemistry simulation input generation†","authors":"Pieter Floris Jacobs and Robert Pollice","doi":"10.1039/D4DD00366G","DOIUrl":"https://doi.org/10.1039/D4DD00366G","url":null,"abstract":"<p >Scientists across domains are often challenged to master domain-specific languages (DSLs) for their research, which are merely a means to an end but are pervasive in fields like computational chemistry. Automated code generation promises to overcome this barrier, allowing researchers to focus on their core expertise. While large language models (LLMs) have shown impressive capabilities in synthesizing code from natural language prompts, they often struggle with DSLs, likely due to their limited exposure during training. In this work, we investigate the potential of foundational LLMs for generating input files for the quantum chemistry package ORCA by establishing a general framework that can be adapted to other DSLs. To improve upon <img> as our base model, we explore the impact of prompt engineering, retrieval-augmented generation, and finetuning <em>via</em> synthetically generated datasets. We find that finetuning, even with synthetic datasets as small as 500 samples, significantly improves performance. Additionally, we observe that finetuning shows synergism with advanced prompt engineering such as chain-of-thought prompting. Consequently, our best finetuned models outperform the formally much more powerful <img> model. In turn, finetuning GPT-4o with the same small synthetic dataset leads to a further substantial performance improvement, suggesting our approach to be more general rather than limited to LLMs with poor base proficiency. All tools and datasets are made openly available for future research. 
We believe that this research lays the groundwork for a wider adoption of LLMs for DSLs in chemistry and beyond.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 3","pages":" 762-775"},"PeriodicalIF":6.2,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00366g?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Opentrons for automated and high-throughput viscometry†","authors":"Beatrice W. Soh, Aniket Chitre, Shu Zheng Tan, Yuhan Wang, Yinqi Yi, Wendy Soh, Kedar Hippalgaonkar and D. Ian Wilson","doi":"10.1039/D4DD00368C","DOIUrl":"https://doi.org/10.1039/D4DD00368C","url":null,"abstract":"<p >We present an improved high-throughput proxy viscometer based on the Opentrons (OT-2) automated liquid handler. The working principle of the viscometer lies in the differing rates at which air-displacement pipettes dispense liquids of different viscosities. The operating protocol involves measuring the amount of liquid dispensed over a set time for given dispense conditions. Data collected at different set dispense flow rates was used to train an ensemble machine learning regressor to predict Newtonian liquid viscosity in the range of 20–20 000 cP, with ∼450 cP error (∼8% relative to sample mean). A phenomenological model predicting the observed trends is presented and used to extend the applicability of the proxy viscometer to simple non-Newtonian liquids. As proof-of-concept, we demonstrate the ability of the proxy viscometer to characterize the rheological behavior of two types of power-law fluids.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 3","pages":" 711-722"},"PeriodicalIF":6.2,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00368c?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Embedded machine-readable molecular representation for resource-efficient deep learning applications†","authors":"Emilio Nuñez-Andrade, Isaac Vidal-Daza, James W. Ryan, Rafael Gómez-Bombarelli and Francisco J. Martin-Martinez","doi":"10.1039/D4DD00230J","DOIUrl":"https://doi.org/10.1039/D4DD00230J","url":null,"abstract":"<p >The practical implementation of deep learning methods for chemistry applications relies on encoding chemical structures into machine-readable formats that can be efficiently processed by computational tools. To this end, One Hot Encoding (OHE) is an established representation of alphanumeric categorical data in expanded numerical matrices. We have developed an embedded alternative to OHE that encodes discrete alphanumeric tokens of an <em>N</em>-sized alphabet into a few real numbers that constitute a simpler matrix representation of chemical structures. The implementation of this embedded One Hot Encoding (eOHE) in training machine learning models achieves comparable results to OHE in model accuracy and robustness while significantly reducing the use of computational resources. Our benchmarks across three molecular representations (SMILES, DeepSMILES, and SELFIES) and three different molecular databases (ZINC, QM9, and GDB-13) for Variational Autoencoders (VAEs) and Recurrent Neural Networks (RNNs) show that using eOHE reduces vRAM memory usage by up to 50% while increasing disk Memory Reduction Efficiency (MRE) to 80% on average. This encoding method opens up new avenues for data representation in embedded formats that promote energy efficiency and scalable computing in resource-constrained devices or in scenarios with limited computing resources. 
The application of eOHE impacts not only the chemistry field but also other disciplines that rely on the use of OHE.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 3","pages":" 776-789"},"PeriodicalIF":6.2,"publicationDate":"2025-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00230j?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Archerfish: a retrofitted 3D printer for high-throughput combinatorial experimentation via continuous printing†","authors":"Alexander E. Siemenn, Basita Das, Eunice Aissi, Fang Sheng, Lleyton Elliott, Blake Hudspeth, Marilyn Meyers, James Serdy and Tonio Buonassisi","doi":"10.1039/D4DD00249K","DOIUrl":"https://doi.org/10.1039/D4DD00249K","url":null,"abstract":"<p >The maturation of 3D printing technology has enabled low-cost, rapid prototyping capabilities for mainstreaming accelerated product design. The materials research community has recognized this need, but no universally accepted rapid prototyping technique currently exists for material design. Toward this end, we develop Archerfish, a 3D printer retrofitted to dispense liquid with <em>in situ</em> mixing capabilities for performing high-throughput combinatorial printing (HTCP) of material compositions. Using this HTCP design, we demonstrate continuous printing throughputs of up to 250 unique compositions per minute, 100× faster than similar tools such as Opentrons that utilize stepwise printing with <em>ex situ</em> mixing. We validate the formation of these combinatorial “prototype” material gradients using hyperspectral image analysis and energy-dispersive X-ray spectroscopy. Furthermore, we describe hardware challenges to realizing reproducible, accurate, and precise composition gradients with continuous printing, including those related to precursor dispensing, mixing, and deposition. 
Despite these limitations, the continuous printing and low-cost design of Archerfish demonstrate promising accelerated materials screening results across a range of materials systems from nanoparticles to perovskites.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 4","pages":" 896-909"},"PeriodicalIF":6.2,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00249k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143809077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding the language of molecules: predicting pure component parameters for the PC-SAFT equation of state from SMILES†","authors":"Benedikt Winter, Philipp Rehner, Timm Esper, Johannes Schilling and André Bardow","doi":"10.1039/D4DD00077C","DOIUrl":"https://doi.org/10.1039/D4DD00077C","url":null,"abstract":"<p >A major bottleneck in developing sustainable processes and materials is a lack of property data. Recently, machine learning approaches have vastly improved previous methods for predicting molecular properties. However, these machine learning models are often not able to handle thermodynamic constraints adequately. In this work, we present a machine learning model based on natural language processing to predict pure-component parameters for the perturbed-chain statistical associating fluid theory (PC-SAFT) equation of state. The model is based on our previously proposed SMILES-to-Properties-Transformer (SPT). By incorporating PC-SAFT into the neural network architecture, the machine learning model is trained directly on experimental vapor pressure and liquid density data. Combining established physical modeling approaches with state-of-the-art machine learning methods enables high-accuracy predictions across a wide range of pressures and temperatures, while keeping the thermodynamic consistency of an equation of state like PC-SAFT. SPT<small><sub>PC-SAFT</sub></small> demonstrates exceptional prediction accuracy even for complex molecules with various functional groups, outperforming traditional group contribution methods by a factor of four in the mean average percentage deviation. Moreover, SPT<small><sub>PC-SAFT</sub></small> captures the behavior of stereoisomers without any special consideration. 
To facilitate the application of our model, we provide predicted PC-SAFT parameters of 13 279 components, making PC-SAFT accessible to all researchers.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 5","pages":" 1142-1157"},"PeriodicalIF":6.2,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00077c?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143943994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Establishing Deep InfoMax as an effective self-supervised learning methodology in materials informatics†","authors":"Michael Moran, Michael W. Gaultois, Vladimir V. Gusev, Dmytro Antypov and Matthew J. Rosseinsky","doi":"10.1039/D4DD00202D","DOIUrl":"https://doi.org/10.1039/D4DD00202D","url":null,"abstract":"<p >The scarcity of property labels remains a key challenge in materials informatics, whereas materials data without property labels are abundant in comparison. By pre-training supervised property prediction models on self-supervised tasks that depend only on the “intrinsic information” available in any Crystallographic Information File (CIF), there is potential to leverage the large amount of crystal data without property labels to improve property prediction results on small datasets. We apply Deep InfoMax as a self-supervised machine learning framework for materials informatics that explicitly maximises the mutual information between a point set (or graph) representation of a crystal and a vector representation suitable for downstream learning. This allows the pre-training of supervised models on large materials datasets without the need for property labels and without requiring the model to reconstruct the crystal from a representation vector. We investigate the benefits of Deep InfoMax pre-training implemented on the Site-Net architecture to improve the performance of downstream property prediction models with small amounts (<10<small><sup>3</sup></small>) of data, a situation relevant to experimentally measured materials property databases. Using a property label masking methodology, where we perform self-supervised learning on larger supervised datasets and then train supervised models on a small subset of the labels, we isolate Deep InfoMax pre-training from the effects of distributional shift. 
We demonstrate performance improvements in the contexts of representation learning and transfer learning on the tasks of band gap and formation energy prediction. Having established the effectiveness of Deep InfoMax pre-training in a controlled environment, our findings provide a foundation for extending the approach to address practical challenges in materials informatics.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 3","pages":" 790-811"},"PeriodicalIF":6.2,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00202d?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602003","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lowering the exponential wall: accelerating high-entropy alloy catalysts screening using local surface energy descriptors from neural network potentials","authors":"Tomoya Shiota, Kenji Ishihara and Wataru Mizukami","doi":"10.1039/D4DD00303A","DOIUrl":"https://doi.org/10.1039/D4DD00303A","url":null,"abstract":"<p >Computational screening is indispensable for the efficient design of high-entropy alloys (HEAs), which hold considerable potential for catalytic applications. However, the chemical space of HEAs is exponentially vast with respect to the number of constituent elements, making even machine learning-based screening calculations time-intensive. To address this challenge, we propose a rapid method for predicting HEA properties using data from monometallic systems (or few-component alloys). Central to our approach is the newly introduced local surface energy (LSE) descriptor, which captures local surface reactivity at atomic resolution. We established a correlation between LSE and adsorption energies using monometallic systems. Using this correlation in a linear regression model, we successfully estimated molecular adsorption energies on HEAs with significantly higher accuracy than a conventional descriptor (<em>i.e.</em>, generalized coordination numbers). Furthermore, we developed high-precision models by employing both classical and quantum machine learning. Our method enabled CO adsorption-energy calculations for 1000 quinary nanoparticles, comprising 201 atoms each, within a few days, considerably faster than density functional theory, which would require hundreds of years, or neural network potentials, which would take hundreds of days. 
The proposed approach accelerates the exploration of the vast HEA chemical space, facilitating the design of novel catalysts.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 3","pages":" 738-751"},"PeriodicalIF":6.2,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00303a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143601999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Preferential Bayesian optimization improves the efficiency of printing objects with subjective qualities†‡","authors":"James R. Deneault, Woojae Kim, Jiseob Kim, Yuzhe Gu, Jorge Chang, Benji Maruyama, Jay I. Myung and Mark A. Pitt","doi":"10.1039/D4DD00320A","DOIUrl":"https://doi.org/10.1039/D4DD00320A","url":null,"abstract":"<p >Despite recent advances in closed-loop 3D printing, optimizing subjective and difficult-to-quantify qualities—such as surface finish and clarity of fine detail—remains a significant challenge, often relying on the traditional time-consuming and inefficient trial-and-error process. Preferential Bayesian optimization (PBO) is a machine learning technique that uses human preference judgements to efficiently guide the search for such abstract optimums in a high-dimensional space. We evaluated PBO's ability to identify optimal parameter values in printing profiles of vases and pairs of 3D cones. In semi-autonomous printing campaigns, a human observer ranked triplets of images of these objects with a target object in mind, preferring slender/bulbous vases and cone pairs that were smooth and well-formed. Results show that PBO consistently and quickly identified an optimal parameter combination across repeated testing. Modeling was then used to identify object dimensions responsible for preference judgements and to mimic preference behavior. 
Findings suggest that PBO is a promising tool for expanding the range of 3D objects that can be printed efficiently.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 3","pages":" 723-737"},"PeriodicalIF":6.2,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00320a?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking study of deep generative models for inverse polymer design†","authors":"Tianle Yue, Lei Tao, Vikas Varshney and Ying Li","doi":"10.1039/D4DD00395K","DOIUrl":"https://doi.org/10.1039/D4DD00395K","url":null,"abstract":"<p >Molecular generative models based on deep learning have increasingly gained attention for their ability in <em>de novo</em> polymer design. However, there remains a knowledge gap in the thorough evaluation of these models. This benchmark study explores <em>de novo</em> polymer design using six popular deep generative models: Variational Autoencoder (VAE), Adversarial Autoencoder (AAE), Objective-Reinforced Generative Adversarial Networks (ORGAN), Character-level Recurrent Neural Network (CharRNN), REINVENT, and GraphINVENT. Various metrics highlighted the excellent performance of CharRNN, REINVENT, and GraphINVENT, particularly when applied to the real polymer dataset, while VAE and AAE show more advantages in generating hypothetical polymers. The CharRNN, REINVENT, and GraphINVENT models were successfully further trained on real polymers using reinforcement learning methods, targeting the generation of hypothetical high-temperature polymers for extreme environments. 
The findings of this study provide critical insights into the capabilities and limitations of each generative model, offering valuable guidance for future endeavors in polymer design and discovery.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 4","pages":" 910-926"},"PeriodicalIF":6.2,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00395k?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143809078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An automated electrolyte-gate field-effect transistor test system for rapid screening of multiple sensors†","authors":"Zhengru Liu, Long Bian, Wenting Shao, Sean I. Hwang and Alexander Star","doi":"10.1039/D4DD00301B","DOIUrl":"https://doi.org/10.1039/D4DD00301B","url":null,"abstract":"<p >Automation of laboratory processes is crucial in analytical chemistry, as it enhances experimental reproducibility by eliminating repetitive tasks and reducing human errors. In this context, the integration of laboratory automation techniques into chemical analysis, particularly utilizing electrochemical field-effect transistor (FET)-based sensors, is highly desirable for high-throughput testing. In this study, we developed an automated electrolyte-gate FET test system designed for rapid screening of multiple sensors. Comprising five key components – printed circuit board, pipetting robot, source meter unit, system switch, and computer – the automated system achieves precision control through individual programming of each instrument, followed by the synergistic integration of the instruments using Python scripts. The automated system could perform FET measurements of 96 sensors in a single run, and different operations such as liquid transfer and waste removal were optimized. 
The automated system was evaluated by running a pH sensing test successfully and finally applied for opioid drug testing with high working efficiency and good accuracy, demonstrating that it could be an excellent tool for different sensing applications based on electrolyte-gate FET sensors.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 3","pages":" 752-761"},"PeriodicalIF":6.2,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00301b?page=search","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143602000","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}