Kenji Hori, Yujiro Matsuo, Toru Yamaguchi, Kimito Funatsu
{"title":"An Attempt to Classify Elementary Reactions on the Basis of TS Motifs.","authors":"Kenji Hori, Yujiro Matsuo, Toru Yamaguchi, Kimito Funatsu","doi":"10.1002/minf.202400040","DOIUrl":"10.1002/minf.202400040","url":null,"abstract":"<p><p>Reactions commonly used in synthetic organic chemistry are named after their discoverers or developers. They are called name reactions and generally consist of several elementary reactions. Quantum chemical calculations can optimize the transition state (TS) structures of the elementary reactions. The geometrical feature of a TS is called a TS motif. We have constructed a database (QMRDB) containing TS motif information and continue to add to it. In the present study, we extracted 102 elementary reactions from the QMRDB and attempted to classify them using the Kohonen self-organizing map. As a result, all the TS motifs were clustered. By firing a target compound on the generated Kohonen map, we expect to be able to easily find the TS motifs most similar to the target.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 2","pages":"e202400040"},"PeriodicalIF":2.8,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11833755/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143440926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploration of the Global Minimum and Conical Intersection with Bayesian Optimization.","authors":"Riho Somaki, Taichi Inagaki, Miho Hatanaka","doi":"10.1002/minf.202400041","DOIUrl":"10.1002/minf.202400041","url":null,"abstract":"<p><p>Conventional molecular geometry searches on a potential energy surface (PES) utilize energy gradients from quantum chemical calculations. However, replacing energy calculations with noisy quantum computer measurements generates errors in the energies, which makes geometry optimization using the energy gradient difficult. One gradient-free optimization method that can potentially solve this problem is Bayesian optimization (BO). To use BO in geometry search, an acquisition function (AF), which involves an objective variable, must be defined suitably. In this study, we propose a strategy for geometry searches using BO and examine the appropriate AFs to explore two critical structures: the global minimum (GM) on the singlet ground state (S<sub>0</sub>) and the most stable conical intersection (CI) point between S<sub>0</sub> and the singlet excited state. We applied our strategy to two molecules and located the GM and the most stable CI geometries with high accuracy for both molecules. 
We also succeeded in the geometry searches even when artificial random noise was added to the energies to simulate geometry optimization using noisy quantum computer measurements.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 2","pages":"e202400041"},"PeriodicalIF":2.8,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11781018/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143066818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KNIME Workflows for Chemoinformatic Characterization of Chemical Databases.","authors":"Carlos D Ramírez-Márquez, José L Medina-Franco","doi":"10.1002/minf.202400337","DOIUrl":"https://doi.org/10.1002/minf.202400337","url":null,"abstract":"<p><p>In chemoinformatics, chemical databases are of great importance because their main objective is to store and organize the chemical structures of molecules and their properties, from basic information such as chemical structure to more complex data such as molecular fingerprints, other calculated or experimental descriptors, and biological activity. However, these data can be used in projects to identify novel therapeutic molecules, or in other fields, only if they are correctly characterized and analyzed. In this Application Note, we compiled five workflows within the open-source data analytics and visualization platform KNIME that can be implemented for the chemoinformatic characterization of databases. To illustrate the application of the workflows, we used BIOFACQUIM, a compound database of natural products isolated and characterized in Mexico [1].</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 2","pages":"e202400337"},"PeriodicalIF":2.8,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143365158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dorsa Dadashi, Marjan Kaedi, Parsa Dadashi, Suprakas Sinha Ray
{"title":"Prediction of the Appropriate Temperature and Pressure for Polymer Dissolution Using Machine Learning Models.","authors":"Dorsa Dadashi, Marjan Kaedi, Parsa Dadashi, Suprakas Sinha Ray","doi":"10.1002/minf.202400193","DOIUrl":"https://doi.org/10.1002/minf.202400193","url":null,"abstract":"<p><p>The widespread use of polymer solutions in the chemical industry poses a significant challenge in determining optimal dissolution conditions. Traditionally, researchers have relied on experimental methods to estimate the processing parameters needed to dissolve polymers, often requiring numerous iterations of testing different temperatures and pressures. This approach is both costly and time-consuming. In this study, for the first time, we present a machine learning-based approach to predict the minimum temperature and pressure required for polymer dissolution, correlating molecular weight and chemical structure of both the polymer and solvent and its weight percent. Using a dataset compiled from existing literature, which includes key factors influencing polymer dissolution, we also extracted chemical bond information from the molecular structures of polymer-solvent systems. Six different machine learning algorithms, including linear regression, k-nearest neighbors, regression trees, random forests, multilayer perceptron neural networks, and support vector regression, were employed to develop predictive models. Among these, the Random Forest model achieved the highest accuracy, with R<sup>2</sup> values of 0.931 and 0.942 for temperature and pressure predictions, respectively. 
This novel approach eliminates the need for repetitive experimental testing, offering a more efficient pathway to determining dissolution conditions.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 2","pages":"e202400193"},"PeriodicalIF":2.8,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143391324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Molecular Informatics. Pub Date: 2025-01-01. Epub Date: 2024-12-05. DOI: 10.1002/minf.202400305
Gabriel Corrêa Veríssimo, Rafaela Salgado Ferreira, Vinícius Gonçalves Maltarollo
{"title":"Ultra-Large Virtual Screening: Definition, Recent Advances, and Challenges in Drug Design.","authors":"Gabriel Corrêa Veríssimo, Rafaela Salgado Ferreira, Vinícius Gonçalves Maltarollo","doi":"10.1002/minf.202400305","DOIUrl":"10.1002/minf.202400305","url":null,"abstract":"<p><p>Virtual screening (VS) in drug design employs computational methodologies to systematically rank molecules from a virtual compound library based on predicted features related to their biological activities or chemical properties. The recent expansion in commercially accessible compound libraries and the advancements in artificial intelligence (AI) and computational power - including enhanced central processing units (CPUs), graphics processing units (GPUs), high-performance computing (HPC), and cloud computing - have significantly expanded our capacity to screen libraries containing over 10<sup>9</sup> molecules. Herein, we review the concept of ultra-large virtual screening (ULVS), focusing on the various algorithms and methodologies employed for virtual screening at this scale. In this context, we present the software utilized, applications, and results of different approaches, such as brute force docking, reaction-based docking approaches, machine learning (ML) strategies applied to docking or other VS methods, and similarity/pharmacophore search-based techniques. 
These examples represent a paradigm shift in the drug discovery process, demonstrating not only the feasibility of billion-scale compound screening but also its potential to identify hit candidates and increase the structural diversity of novel compounds with biological activities.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":" ","pages":"e202400305"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142780630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
David F Nippa, Alex T Müller, Kenneth Atz, David B Konrad, Uwe Grether, Rainer E Martin, Gisbert Schneider
{"title":"Simple User-Friendly Reaction Format.","authors":"David F Nippa, Alex T Müller, Kenneth Atz, David B Konrad, Uwe Grether, Rainer E Martin, Gisbert Schneider","doi":"10.1002/minf.202400361","DOIUrl":"10.1002/minf.202400361","url":null,"abstract":"<p><p>Utilizing the growing wealth of chemical reaction data can boost synthesis planning and increase success rates. Yet, the effectiveness of machine learning tools for retrosynthesis planning and forward reaction prediction relies on accessible, well-curated data presented in a structured format. Although some public and licensed reaction databases exist, they often lack essential information about reaction conditions. To address this issue and promote the principles of findable, accessible, interoperable, and reusable (FAIR) data reporting and sharing, we introduce the Simple User-Friendly Reaction Format (SURF). SURF standardizes the documentation of reaction data through a structured tabular format, requiring only a basic understanding of spreadsheets. This format enables chemists to record the synthesis of molecules in a format that is understandable by both humans and machines, which facilitates seamless sharing and integration directly into machine learning pipelines. SURF files are designed to be interoperable, easily imported into relational databases, and convertible into other formats. This complements existing initiatives like the Open Reaction Database (ORD) and Unified Data Model (UDM). 
At Roche, SURF plays a crucial role in democratizing FAIR reaction data sharing and expediting the chemical synthesis process.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 1","pages":"e202400361"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11755691/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143024131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Molecular Informatics. Pub Date: 2025-01-01. Epub Date: 2024-10-10. DOI: 10.1002/minf.202400186
Markus Orsi, Jean-Louis Reymond
{"title":"Navigating a 1E+60 Chemical Space of Peptide/Peptoid Oligomers.","authors":"Markus Orsi, Jean-Louis Reymond","doi":"10.1002/minf.202400186","DOIUrl":"10.1002/minf.202400186","url":null,"abstract":"<p><p>Herein we report a virtual library of 1E+60 members, a common estimate for the total size of the drug-like chemical space. The library is obtained from 100 commercially available peptide and peptoid building blocks assembled into linear or cyclic oligomers of up to 30 units, forming molecules within the size range of peptide drugs and potentially accessible by solid-phase synthesis. We demonstrate ligand-based virtual screening (LBVS) using the peptide design genetic algorithm (PDGA), which evolves a population of 50 members to resemble a given target molecule using molecular fingerprint similarity as the fitness function. Target molecules are reached in fewer than 10,000 generations. As in many journeys, the value of the chemical space journey using PDGA lies not in reaching the target but in the journey itself, here by encountering non-obvious analogs. We also show that PDGA can be used to generate median molecules and analogs of non-peptide target molecules.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":" ","pages":"e202400186"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11733718/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142400782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Topology-Enhanced Multi-Viewed Contrastive Approach for Molecular Graph Representation Learning and Classification.","authors":"Phu Pham","doi":"10.1002/minf.202400252","DOIUrl":"https://doi.org/10.1002/minf.202400252","url":null,"abstract":"<p><p>Graph representation learning has recently become an active research topic that has attracted considerable attention. Graph embeddings have diverse applications across fields such as information and social network analysis, bioinformatics and cheminformatics, natural language processing (NLP), and recommendation systems. Among the advanced deep learning (DL) based architectures used in graph representation learning, graph neural networks (GNNs) have emerged as the dominant and highly effective framework. Recent GNN-based methods have demonstrated state-of-the-art performance on complex supervised and unsupervised tasks at both the node and graph levels. In recent years, to enhance multi-view and structured graph representations, contrastive learning-based techniques have been developed, introducing models known as graph contrastive learning (GCL) models. These GCL approaches leverage unsupervised contrastive methods to capture multi-view graph representations by comparing node and graph embeddings, yielding significant improvements in both graph-level representations and task-specific applications, such as molecular embedding and classification. However, as most GCL techniques are primarily designed to focus on the explicit graph structure through GNN-based encoders, they often overlook critical topological insights that could be provided through topological data analysis (TDA). Given the promising research indicating that topological features can greatly benefit various graph learning tasks, we propose a novel topology-enhanced, multi-view graph contrastive learning model called TMGCL. 
Our TMGCL model is designed to capture and utilize both comprehensive multi-scale topological and global structural information from graphs. This enhanced representation capability positions TMGCL to directly support a range of applications, such as molecular classification, with improved accuracy and robustness. Extensive experiments on two real-world datasets demonstrated the effectiveness of the proposed TMGCL and its superior performance compared with state-of-the-art GNN/GCL-based baselines.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 1","pages":"e202400252"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142951853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Molecular Informatics. Pub Date: 2025-01-01. Epub Date: 2024-08-06. DOI: 10.1002/minf.202400154
I M Kashafutdinova, A Poyezzhayeva, T Gimadiev, T Madzhidov
{"title":"Active learning approaches in molecule pKi prediction.","authors":"I M Kashafutdinova, A Poyezzhayeva, T Gimadiev, T Madzhidov","doi":"10.1002/minf.202400154","DOIUrl":"10.1002/minf.202400154","url":null,"abstract":"<p><p>During the early stages of drug design, identifying compounds with suitable bioactivities is crucial. Given the vast array of potential drug databases, it's feasible to assay only a limited subset of candidates. The optimal method for selecting the candidates, aiming to minimize the overall number of assays, involves an active learning (AL) approach. In this work, we benchmarked a range of AL strategies with two main objectives: (1) to identify a strategy that ensures high model performance and (2) to select molecules with desired properties using minimal assays. To evaluate the different AL strategies, we employed the simulated AL workflow based on \"virtual\" experiments. These experiments leveraged ChEMBL datasets, which come with known biological activity values for the molecules. Furthermore, for classification tasks, we proposed the hybrid selection strategy that unified both exploration and exploitation AL strategies into a single acquisition function, defined by parameters n and c. We have also shown that popular minimal margin and maximal variance selection approaches for exploration selection correspond to minimization of the hybrid acquisition function with n=1 and 2 respectively. The balance between the exploration and exploitation strategies can be adjusted using a coefficient (c), making the optimal strategy selection straightforward. The primary strength of the hybrid selection method lies in its adaptability; it offers the flexibility to adjust the criteria for molecule selection based on the specific task by modifying the value of the contribution coefficient. 
Our analysis revealed that, in regression tasks, AL strategies did not succeed at ensuring high model performance; however, they were successful in selecting molecules with desired properties using a minimal number of tests. In analogous experiments on classification tasks, the exploration strategy and the hybrid selection function with a constant c<1 (for n=1) and c≤0.2 (for n=2) were effective in achieving the goal of constructing a high-performance predictive model using minimal data. When searching for molecules with desired properties, exploitation and the hybrid function with c≥1 (n=1) and c≥0.7 (n=2) identified molecules in fewer iterations than the random selection method. Notably, when the hybrid function was set to an intermediate coefficient value (c=0.7), it successfully addressed both tasks simultaneously.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":" ","pages":"e202400154"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141893849","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Molecular Informatics. Pub Date: 2025-01-01. Epub Date: 2024-10-18. DOI: 10.1002/minf.202400169
Xinxin Yu, Yuanting Chen, Long Chen, Weihua Li, Yuhao Wang, Yun Tang, Guixia Liu
{"title":"GCLmf: A Novel Molecular Graph Contrastive Learning Framework Based on Hard Negatives and Application in Toxicity Prediction.","authors":"Xinxin Yu, Yuanting Chen, Long Chen, Weihua Li, Yuhao Wang, Yun Tang, Guixia Liu","doi":"10.1002/minf.202400169","DOIUrl":"10.1002/minf.202400169","url":null,"abstract":"<p><p>In silico methods for prediction of chemical toxicity can decrease the cost and increase the efficiency in the early stage of drug discovery. However, owing to the limited availability of sufficient and reliable toxicity data, constructing robust and accurate prediction models is challenging. Contrastive learning, a type of self-supervised learning, leverages large amounts of unlabeled data to obtain more expressive molecular representations, which can boost the prediction performance on downstream tasks. While molecular graph contrastive learning has attracted growing attention, current models neglect the quality of the negative data set. Here, we propose a self-supervised pretraining deep learning framework named GCLmf. We first utilized molecular fragments that meet specific conditions as hard negative samples to boost the quality of the negative set and thus increase the difficulty of the proxy tasks during pre-training to learn informative representations. GCLmf has shown excellent predictive power on various molecular property benchmarks and demonstrates high performance in 33 toxicity tasks in comparison with multiple baselines. 
In addition, we further investigated the necessity of introducing hard negatives in model building and the impact of the proportion of hard negatives on the model.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":" ","pages":"e202400169"},"PeriodicalIF":2.8,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142470301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}