{"title":"Milestones in cheminformatics","authors":"Karina Martinez-Mayorga, José L. Medina-Franco","doi":"10.1186/s13321-025-01054-9","DOIUrl":"10.1186/s13321-025-01054-9","url":null,"abstract":"","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01054-9","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144629919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application of the digital annealer unit in optimizing chemical reaction conditions for enhanced production yields","authors":"Shih-Cheng Li, Pei-Hua Wang, Jheng-Wei Su, Wei-Yin Chiang, Tzu-Lan Yeh, Alex Zhavoronkov, Shih-Hsien Huang, Yen-Chu Lin, Chia-Ho Ou, Chih-Yu Chen","doi":"10.1186/s13321-025-01043-y","DOIUrl":"10.1186/s13321-025-01043-y","url":null,"abstract":"Finding optimal reaction conditions is crucial for chemical synthesis in the pharmaceutical and chemical industries. However, due to the vast chemical space, conducting experiments for all possible combinations is impractical. Thus, quantitative structure–activity relationship (QSAR) models have been widely used to predict product yields, but evaluating all combinations is still computationally intensive. In this work, we demonstrate that a Digital Annealer Unit (DAU) can tackle these large-scale optimization problems more efficiently. Two types of models are developed and tested on high-throughput experimentation (HTE) and Reaxys datasets. Our results suggest that the performance of the models is comparable to classical machine learning (ML) methods (i.e., Random Forest and Multilayer Perceptron (MLP)), while inference with our models requires only seconds on a DAU. In active learning and autonomous reaction condition design, our model shows improved reaction yield prediction when new data are incorporated, meaning that it can potentially be used in iterative processes. Our method can also accelerate the screening of billions of reaction conditions, achieving speeds millions of times faster than traditional computing units in identifying superior conditions.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01043-y","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144629918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A transformer based generative chemical language AI model for structural elucidation of organic compounds","authors":"Xiaofeng Tan","doi":"10.1186/s13321-025-01016-1","DOIUrl":"10.1186/s13321-025-01016-1","url":null,"abstract":"For over half a century, computer-aided structural elucidation (CASE) systems for organic compounds have relied on complex expert systems with explicitly programmed algorithms. These systems are often computationally inefficient for complex compounds due to the vast chemical structural space that must be explored and filtered. In this study, we present a proof-of-concept transformer-based generative chemical language artificial intelligence (AI) model, an innovative end-to-end architecture designed to replace the logic and workflow of the classic CASE framework for ultra-fast and accurate spectroscopy-based structural elucidation. Our model employs an encoder-decoder architecture and self-attention mechanisms, similar to those in large language models, to directly generate the most probable chemical structures that match the input spectroscopic data. Trained on ~102k IR, UV, and ¹H NMR spectra, it performs structural elucidation of molecules with up to 29 atoms in just a few seconds on a modern CPU, achieving a top-15 accuracy of 83%. This approach demonstrates the potential of transformer-based generative AI to accelerate traditional scientific problem-solving processes. The model's ability to iterate quickly based on new data highlights its potential for rapid advancements in structural elucidation.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01016-1","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144611450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPCR-A17 MAAP: mapping modulators, agonists, and antagonists to predict the next bioactive target","authors":"Ana B. Caniceiro, Ana M. B. Amorim, Nícia Rosário-Ferreira, Irina S. Moreira","doi":"10.1186/s13321-025-01050-z","DOIUrl":"10.1186/s13321-025-01050-z","url":null,"abstract":"G Protein-Coupled Receptors (GPCRs) are vital players in cellular signalling and key targets for drug discovery, especially within the GPCR-A17 subfamily, which is linked to various diseases. To address the growing need for effective treatments, the GPCR-A17 Modulator, Agonist, Antagonist Predictor (MAAP) was introduced as an advanced ensemble machine learning model that combines XGBoost, Random Forest, and LightGBM to predict the functional roles of agonists, antagonists, and modulators in GPCR-A17 interactions. The model was trained on a dataset of over 3,000 ligands and 6,900 protein–ligand interactions, covering all three ligand types (agonists, antagonists, and modulators), sourced from the Guide to Pharmacology, Therapeutic Target Database, and ChEMBL. It demonstrated strong predictive performance across all classes, achieving F1 scores of 0.9179 and 0.7151, AUCs of 0.9766 and 0.8591, and specificities of 0.9703 and 0.8789 on the testing and independent ligand validation datasets, respectively. A Ki-filtered subset of 4,274 interactions (where Ki is the inhibition constant that quantifies the ligand-binding affinity) improved the F1 scores to 0.9330 and 0.8267 for the testing and independent ligand datasets, respectively. By guiding experimental validation, GPCR-A17 MAAP accelerates drug discovery for various therapeutic targets. The code and data are available on GitHub (https://github.com/MoreiraLAB/GPCR-A17-MAAP).","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01050-z","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144603430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The development of the generative adversarial supporting vector machine for molecular property generation","authors":"Qing Lu","doi":"10.1186/s13321-025-01052-x","DOIUrl":"10.1186/s13321-025-01052-x","url":null,"abstract":"The generative adversarial network (GAN) is a milestone technique in artificial intelligence, and it is widely used in image generation. However, it has a large hyper-parameter space, which makes it difficult to train. In this work, we propose a new generative model by introducing the supporting vector machine into the GAN architecture. This modification reduces the hyper-parameter space by half, thus making the training more accessible. The formic acid dimer (FAD) system is studied to examine the generation capacity of the proposed model. The molecular structures, molecular energies and molecular dipole moments are combined as the feature vector to train the model. It is found that the proposed model can generate new feature vectors from scratch, and the generated data agree well with the ab initio values. In addition, each generated feature vector is unique, so the mode collapse problem often encountered in GAN models is avoided. The proposed model is extensible to incorporate any molecular properties, as the feature vector is established as the direct sum of the corresponding component vectors; thus, it is expected that the proposed method will have a wide range of application scenarios. Scientific contribution statement: A generative adversarial algorithm combining the supporting vector machine is proposed for the first time to predict molecular properties from scratch, and the predictions agree well with ab initio values. The new model is more efficient than generative adversarial networks, and it is convenient to extend for application in different scenarios.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01052-x","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144578313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementation of an open chemistry knowledge base with a Semantic Wiki","authors":"Charlotte Neidiger, Tarek Saier, Kai Kühn, Victor Larignon, Michael Färber, Claudia Bizzarri, Helena Šimek Tosino, Laura Holzhauer, Michael Erdmann, An Nguyen, Dean Harvey, Pierre Tremouilhac, Claudia Kramer, Daniel Hansch, Fabian Schönle, Jana Alpin, Maximilian Hartmann, Jérome Wagner, Nicole Jung, Stefan Bräse","doi":"10.1186/s13321-025-01037-w","DOIUrl":"10.1186/s13321-025-01037-w","url":null,"abstract":"In this work, a concept for an open chemistry knowledge base was developed to integrate chemical research results into a collaboratively usable platform. To achieve this, we enhanced Semantic MediaWiki (SMW) to support the collection and structured summary of chemical data contained in publications. We implemented tools for capturing chemical structures in machine-readable formats and designed data forms along with a data model to ensure standardized input and organization of research results. These enhancements allow for effective data comparison and contextual analysis within an expandable Wiki environment. The use of the platform was specifically demonstrated by organizing and comparing research in the area of “CO₂ reduction in homogeneous photocatalytic systems,” showcasing its potential to significantly enhance the collaborative collection of research outcomes. Scientific contribution: This work shows ways to collaboratively collect and manage subject-specific knowledge in the domain of chemistry via an open database. By integrating cheminformatic tools into Semantic MediaWiki, an established technology for building knowledge databases is made systematically usable for the chemical community. The integration of chemistry-specific workflows and forms allows the mapping of data from current research with links to the original sources. This work is intended to show how gaps in the information system of scientists can be closed without having to use commercial systems.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01037-w","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144568753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shinyscreen: mass spectrometry data inspection and quality checking utility","authors":"Todor Kondić, Anjana Elapavalore, Jessy Krier, Adelene Lai, Hiba Mohammed Taha, Mira Narayanan, Emma L. Schymanski","doi":"10.1186/s13321-025-01044-x","DOIUrl":"10.1186/s13321-025-01044-x","url":null,"abstract":"Shinyscreen is an R package and Shiny-based web application designed for the exploration, visualization, and quality assessment of raw data from high resolution mass spectrometry instruments. Its versatile list-based approach supports the curation of data starting from either known or “suspected” compounds (compound list-based screening) or detected masses (mass list-based screening), making it adaptable to diverse analytical needs (target, suspect or non-target screening). Shinyscreen can be operated in multiple modes, including as an R package, an interactive command-line tool, a self-documented web GUI, or a network-deployable service. Shinyscreen has been applied in environmental research, database enrichment, and educational initiatives, showcasing its broad utility. Shinyscreen is available on GitLab (https://gitlab.com/uniluxembourg/lcsb/eci/shinyscreen) under the Apache License 2.0. The repository contains detailed instructions for deployment and use. Additionally, a pre-configured Docker image, designed for seamless installation and operation, is available, with instructions also provided in the main repository. Scientific Contribution: Shinyscreen is a fully open source prescreening application to assist analysts in the high throughput quality control of the thousands of peaks detected in high resolution mass spectrometry experiments. As a vendor-independent, cross operating system application, it covers an important niche in open mass spectrometry workflows. Shinyscreen supports quality control of data for further identification or upload of spectra to public data resources, as well as teaching efforts to educate students on the importance of data quality control and rigorous identification methods.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01044-x","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144328890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crossover operators for molecular graphs with an application to virtual drug screening","authors":"Nico Domschke, Bruno J. Schmidt, Thomas Gatter, Richard Golnik, Paul Eisenhuth, Fabian Liessmann, Jens Meiler, Peter F. Stadler","doi":"10.1186/s13321-025-00958-w","DOIUrl":"https://doi.org/10.1186/s13321-025-00958-w","url":null,"abstract":"Genetic algorithms are a powerful method to solve optimization problems with complex cost functions over vast search spaces; they rely in particular on recombining parts of previous solutions. Crossover operators play a crucial role in this context. Here, we describe a large class of these operators designed for searching over spaces of graphs. These operators are based on introducing small cuts into graphs and rejoining the resulting induced subgraphs of two parents. This form of cut-and-join crossover can be restricted in a consistent way to preserve local properties such as vertex degrees (valency) or bond orders, as well as global properties such as graph-theoretic planarity. In contrast to crossover on strings, cut-and-join crossover on graphs is powerful enough to ergodically explore chemical space even in the absence of mutation operators. Extensive benchmarking shows that the offspring of molecular graphs are again plausible molecules with high probability, while at the same time crossover drastically increases the diversity compared to initial molecule libraries. Moreover, desirable properties such as favorable indices of synthesizability are preserved with sufficient frequency that candidate offspring can be filtered efficiently for such properties. As an application, we utilized the cut-and-join crossover in REvoLd, a GA-based system for computer-aided drug design. In optimization runs searching for ligands binding to four different target proteins, we consistently found candidate molecules with binding constants exceeding the best known binders as well as candidates found in make-on-demand libraries. Scientific contribution: We define cut-and-join crossover operators on a variety of graph classes including molecular graphs. This constitutes a mathematically simple and well-characterized approach to recombination of molecules that performed very well in real-life CADD tasks.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"44 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-06-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144311943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advancements in thermochemical predictions: a multi-output thermodynamics-informed neural network approach","authors":"Raheel Hammad, Sownyak Mondal","doi":"10.1186/s13321-025-01033-0","DOIUrl":"https://doi.org/10.1186/s13321-025-01033-0","url":null,"abstract":"The Gibbs free energy of an inorganic material represents its maximum reversible work potential under constant temperature and pressure. Its calculation is crucial for understanding material stability, phase transitions, and chemical reactions, thus guiding optimization for diverse applications like catalysis and energy storage. In this study, we have developed a Physics-Informed Neural Network model that leverages the Gibbs free energy equation. The overall loss function is adjusted to allow the model to simultaneously predict all three thermodynamic quantities, namely Gibbs free energy, total energy, and entropy, thus transforming it into a multi-output model. In recent literature, there is a growing emphasis on evaluating machine learning models under challenging conditions, such as small datasets and out-of-distribution predictions. Reflecting this trend, we have rigorously benchmarked our model across these scenarios, demonstrating its robustness and adaptability. Our model demonstrates a 43% improvement over the next-best model in the normal scenario and an even larger improvement in the out-of-distribution regime. Scientific Contribution: This study introduces the application of a Physics-Informed Neural Network to simultaneously compute multiple thermodynamic properties, including Gibbs free energy, total energy, and entropy. By integrating the Gibbs free energy equation into the loss function, the model achieves superior accuracy in low data regimes and enhances robustness in out-of-distribution scenarios.","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"11 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144296245","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}