{"title":"AutoTemplate: enhancing chemical reaction datasets for machine learning applications in organic chemistry","authors":"Lung-Yi Chen, Yi-Pei Li","doi":"10.1186/s13321-024-00869-2","DOIUrl":"10.1186/s13321-024-00869-2","url":null,"abstract":"<p>This paper presents AutoTemplate, an innovative data preprocessing protocol, addressing the crucial need for high-quality chemical reaction datasets in the realm of machine learning applications in organic chemistry. Recent advances in artificial intelligence have expanded the application of machine learning in chemistry, particularly in yield prediction, retrosynthesis, and reaction condition prediction. However, the effectiveness of these models hinges on the integrity of chemical reaction datasets, which are often plagued by inconsistencies like missing reactants, incorrect atom mappings, and outright erroneous reactions. AutoTemplate introduces a two-stage approach to refine these datasets. The first stage involves extracting meaningful reaction transformation rules and formulating generic reaction templates using a simplified SMARTS representation. This simplification broadens the applicability of templates across various chemical reactions. The second stage is template-guided reaction curation, where these templates are systematically applied to validate and correct the reaction data. This process effectively amends missing reactant information, rectifies atom-mapping errors, and eliminates incorrect data entries. A standout feature of AutoTemplate is its capability to concurrently identify and correct false chemical reactions. It operates on the premise that most reactions in datasets are accurate, using these as templates to guide the correction of flawed entries. The protocol demonstrates its efficacy across a range of chemical reactions, significantly enhancing dataset quality. This advancement provides a more robust foundation for developing reliable machine learning models in chemistry, thereby improving the accuracy of forward and retrosynthetic predictions. AutoTemplate marks a significant progression in the preprocessing of chemical reaction datasets, bridging a vital gap and facilitating more precise and efficient machine learning applications in organic synthesis.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00869-2","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141462625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Physicochemical modelling of the retention mechanism of temperature-responsive polymeric columns for HPLC through machine learning algorithms","authors":"Elena Bandini, Rodrigo Castellano Ontiveros, Ardiana Kajtazi, Hamed Eghbali, Frédéric Lynen","doi":"10.1186/s13321-024-00873-6","DOIUrl":"10.1186/s13321-024-00873-6","url":null,"abstract":"<div><p>Temperature-responsive liquid chromatography (TRLC) offers a promising alternative to reversed-phase liquid chromatography (RPLC) for environmentally friendly analytical techniques by utilizing pure water as a mobile phase, eliminating the need for harmful organic solvents. TRLC columns, packed with temperature-responsive polymers coupled to silica particles, exhibit a unique retention mechanism influenced by temperature-induced polymer hydration. An investigation of the physicochemical parameters driving separation at high and low temperatures is crucial for better column manufacturing and selectivity control. Assessment of predictability using a dataset of 139 molecules analyzed at different temperatures elucidated the molecular descriptors (MDs) relevant to retention mechanisms. Linear regression, support vector regression (SVR), and tree-based ensemble models were evaluated, with no standout performer. The precision, accuracy, and robustness of models were validated through metrics, such as <i>r</i> and mean absolute error (MAE), and statistical analysis. At <span>(45,^{circ }hbox {C})</span>, logP predominantly influenced retention, akin to reversed-phase columns, while at <span>(5^{circ }hbox {C})</span>, complex interactions with lipophilic and negative MDs, along with specific functional groups, dictated retention. These findings provide deeper insights into TRLC mechanisms, facilitating method development and maximizing column potential.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00873-6","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141436001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Llamol: a dynamic multi-conditional generative transformer for de novo molecular design","authors":"Niklas Dobberstein, Astrid Maass, Jan Hamaekers","doi":"10.1186/s13321-024-00863-8","DOIUrl":"10.1186/s13321-024-00863-8","url":null,"abstract":"<p>Generative models have demonstrated substantial promise in Natural Language Processing (NLP) and have found application in designing molecules, as seen in General Pretrained Transformer (GPT) models. In our efforts to develop such a tool for exploring the organic chemical space in search of potentially electro-active compounds, we present <i>Llamol</i>, a single novel generative transformer model based on the Llama 2 architecture, which was trained on a 12.5M superset of organic compounds drawn from diverse public sources. To allow for a maximum flexibility in usage and robustness in view of potentially incomplete data, we introduce <i>Stochastic Context Learning</i> (SCL) as a new training procedure. We demonstrate that the resulting model adeptly handles single- and multi-conditional organic molecule generation with up to four conditions, yet more are possible. The model generates valid molecular structures in SMILES notation while flexibly incorporating three numerical and/or one token sequence into the generative process, just as requested. The generated compounds are very satisfactory in all scenarios tested. In detail, we showcase the model’s capability to utilize token sequences for conditioning, either individually or in combination with numerical properties, making <i>Llamol</i> a potent tool for de novo molecule design, easily expandable with new properties.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2024-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00863-8","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141436540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A BERT-based pretraining model for extracting molecular structural information from a SMILES sequence","authors":"Xiaofan Zheng, Yoichi Tomiura","doi":"10.1186/s13321-024-00848-7","DOIUrl":"10.1186/s13321-024-00848-7","url":null,"abstract":"<p>Among the various molecular properties and their combinations, it is a costly process to obtain the desired molecular properties through theory or experiment. Using machine learning to analyze molecular structure features and to predict molecular properties is a potentially efficient alternative for accelerating the prediction of molecular properties. In this study, we analyze molecular properties through the molecular structure from the perspective of machine learning. We use SMILES sequences as inputs to an artificial neural network in extracting molecular structural features and predicting molecular properties. A SMILES sequence comprises symbols representing molecular structures. To address the problem that a SMILES sequence is different from actual molecular structural data, we propose a pretraining model for a SMILES sequence based on the BERT model, which is widely used in natural language processing, such that the model learns to extract the molecular structural information contained in the SMILES sequence. In an experiment, we first pretrain the proposed model with 100,000 SMILES sequences and then use the pretrained model to predict molecular properties on 22 data sets and the odor characteristics of molecules (98 types of odor descriptor). The experimental results show that our proposed pretraining model effectively improves the performance of molecular property prediction</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00848-7","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141425535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Arnau Comajuncosa-Creus, Aksel Lenes, Miguel Sánchez-Palomino, Dylan Dalton, Patrick Aloy
{"title":"Stereochemically-aware bioactivity descriptors for uncharacterized chemical compounds","authors":"Arnau Comajuncosa-Creus, Aksel Lenes, Miguel Sánchez-Palomino, Dylan Dalton, Patrick Aloy","doi":"10.1186/s13321-024-00867-4","DOIUrl":"10.1186/s13321-024-00867-4","url":null,"abstract":"<div><p>Stereochemistry plays a fundamental role in pharmacology. Here, we systematically investigate the relationship between stereoisomerism and bioactivity on over 1 M compounds, finding that a very significant fraction (~ 40%) of spatial isomer pairs show, to some extent, distinct bioactivities. We then use the 3D representation of these molecules to train a collection of deep neural networks (<i>Signaturizers3D</i>) to generate bioactivity descriptors associated to small molecules, that capture their effects at increasing levels of biological complexity (i.e. from protein targets to clinical outcomes). Further, we assess the ability of the descriptors to distinguish between stereoisomers and to recapitulate their different target binding profiles. Overall, we show how these new stereochemically-aware descriptors provide an even more faithful description of complex small molecule bioactivity properties, capturing key differences in the activity of stereoisomers.</p><p><b>Scientific contribution</b></p><p>We systematically assess the relationship between stereoisomerism and bioactivity on a large scale, focusing on compound-target binding events, and use our findings to train novel deep learning models to generate stereochemically-aware bioactivity signatures for any compound of interest.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2024-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00867-4","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141417136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PubChem synonym filtering process using crowdsourcing","authors":"Sunghwan Kim, Bo Yu, Qingliang Li, Evan E. Bolton","doi":"10.1186/s13321-024-00868-3","DOIUrl":"10.1186/s13321-024-00868-3","url":null,"abstract":"<div><p>PubChem (https://pubchem.ncbi.nlm.nih.gov) is a public chemical information resource containing more than 100 million unique chemical structures. One of the most requested tasks in PubChem and other chemical databases is to search chemicals by name (also commonly called a “chemical synonym”). PubChem performs this task by looking up chemical synonym-structure associations provided by individual depositors to PubChem. In addition, these synonyms are used for many purposes, including creating links between chemicals and PubMed articles (using Medical Subject Headings (MeSH) terms). However, these depositor-provided name-structure associations are subject to substantial discrepancies within and between depositors, making it difficult to unambiguously map a chemical name to a specific chemical structure. The present paper describes PubChem’s crowdsourcing-based synonym filtering strategy, which resolves inter- and intra-depositor discrepancies in synonym-structure associations as well as in the chemical-MeSH associations. The PubChem synonym filtering process was developed based on the analysis of four crowd-voting strategies, which differ in the consistency threshold value employed (60% vs 70%) and how to resolve intra-depositor discrepancies (a single vote vs. multiple votes per depositor) prior to inter-depositor crowd-voting. The agreement of voting was determined at six levels of chemical equivalency, which considers varying isotopic composition, stereochemistry, and connectivity of chemical structures and their primary components. While all four strategies showed comparable results, Strategy I (one vote per depositor with a 60% consistency threshold) resulted in the most synonyms assigned to a single chemical structure as well as the most synonym-structure associations disambiguated at the six chemical equivalency contexts. Based on the results of this study, Strategy I was implemented in PubChem’s filtering process that cleans up synonym-structure associations as well as chemical-MeSH associations. This consistency-based filtering process is designed to look for a consensus in name-structure associations but cannot attest to their correctness. As a result, it can fail to recognize correct name-structure associations (or incorrect ones), for example, when a synonym is provided by only one depositor or when many contributors are incorrect. However, this filtering process is an important starting point for quality control in name-structure associations in large chemical databases like PubChem.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2024-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00868-3","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141330048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Correction: QuanDB: a quantum chemical property database towards enhancing 3D molecular representation learning","authors":"Zhijiang Yang, Tengxin Huang, Li Pan, Jingjing Wang, Liangliang Wang, Junjie Ding, Junhua Xiao","doi":"10.1186/s13321-024-00864-7","DOIUrl":"10.1186/s13321-024-00864-7","url":null,"abstract":"","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2024-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00864-7","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141304470","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An end-to-end method for predicting compound-protein interactions based on simplified homogeneous graph convolutional network and pre-trained language model","authors":"Yufang Zhang, Jiayi Li, Shenggeng Lin, Jianwei Zhao, Yi Xiong, Dong-Qing Wei","doi":"10.1186/s13321-024-00862-9","DOIUrl":"10.1186/s13321-024-00862-9","url":null,"abstract":"<div><p>Identification of interactions between chemical compounds and proteins is crucial for various applications, including drug discovery, target identification, network pharmacology, and elucidation of protein functions. Deep neural network-based approaches are becoming increasingly popular in efficiently identifying compound-protein interactions with high-throughput capabilities, narrowing down the scope of candidates for traditional labor-intensive, time-consuming and expensive experimental techniques. In this study, we proposed an end-to-end approach termed SPVec-SGCN-CPI, which utilized simplified graph convolutional network (SGCN) model with low-dimensional and continuous features generated from our previously developed model SPVec and graph topology information to predict compound-protein interactions. The SGCN technique, dividing the local neighborhood aggregation and nonlinearity layer-wise propagation steps, effectively aggregates K-order neighbor information while avoiding neighbor explosion and expediting training. The performance of the SPVec-SGCN-CPI method was assessed across three datasets and compared against four machine learning- and deep learning-based methods, as well as six state-of-the-art methods. Experimental results revealed that SPVec-SGCN-CPI outperformed all these competing methods, particularly excelling in unbalanced data scenarios. By propagating node features and topological information to the feature space, SPVec-SGCN-CPI effectively incorporates interactions between compounds and proteins, enabling the fusion of heterogeneity. Furthermore, our method scored all unlabeled data in ChEMBL, confirming the top five ranked compound-protein interactions through molecular docking and existing evidence. These findings suggest that our model can reliably uncover compound-protein interactions within unlabeled compound-protein pairs, carrying substantial implications for drug re-profiling and discovery. In summary, SPVec-SGCN demonstrates its efficacy in accurately predicting compound-protein interactions, showcasing potential to enhance target identification and streamline drug discovery processes.</p><p><b>Scientific contributions</b></p><p>The methodology presented in this work not only enables the comparatively accurate prediction of compound-protein interactions but also, for the first time, take sample imbalance which is very common in real world and computation efficiency into consideration simultaneously, accelerating the target identification and drug discovery process.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00862-9","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141287531","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kandel Jeevan, Shrestha Palistha, Hilal Tayara, Kil T. Chong
{"title":"PUResNetV2.0: a deep learning model leveraging sparse representation for improved ligand binding site prediction","authors":"Kandel Jeevan, Shrestha Palistha, Hilal Tayara, Kil T. Chong","doi":"10.1186/s13321-024-00865-6","DOIUrl":"10.1186/s13321-024-00865-6","url":null,"abstract":"<div><p>Accurate ligand binding site prediction (LBSP) within proteins is essential for drug discovery. We developed ProteinUNetResNetV2.0 (PUResNetV2.0), leveraging sparse representation of protein structures to improve LBSP accuracy. Our training dataset included protein complexes from 4729 protein families. Evaluations on benchmark datasets showed that PUResNetV2.0 achieved an 85.4% Distance Center Atom (DCA) success rate and a 74.7% F1 Score on the Holo801 dataset, outperforming existing methods. However, its performance in specific cases, such as RNA, DNA, peptide-like ligand, and ion binding site prediction, was limited due to constraints in our training data. Our findings underscore the potential of sparse representation in LBSP, especially for oligomeric structures, suggesting PUResNetV2.0 as a promising tool for computational drug discovery.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00865-6","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141286747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reagan M. Mogire, Silviane A. Miruka, Dennis W. Juma, Case W. McNamara, Ben Andagalu, Jeremy N. Burrows, Elodie Chenu, James Duffy, Bernhards R. Ogutu, Hoseah M. Akala
{"title":"Protein target similarity is positive predictor of in vitro antipathogenic activity: a drug repurposing strategy for Plasmodium falciparum","authors":"Reagan M. Mogire, Silviane A. Miruka, Dennis W. Juma, Case W. McNamara, Ben Andagalu, Jeremy N. Burrows, Elodie Chenu, James Duffy, Bernhards R. Ogutu, Hoseah M. Akala","doi":"10.1186/s13321-024-00856-7","DOIUrl":"10.1186/s13321-024-00856-7","url":null,"abstract":"<div><p>Drug discovery is an intricate and costly process. Repurposing existing drugs and active compounds offers a viable pathway to develop new therapies for various diseases. By leveraging publicly available biomedical information, it is possible to predict compounds’ activity and identify their potential targets across diverse organisms. In this study, we aimed to assess the antiplasmodial activity of compounds from the Repurposing, Focused Rescue, and Accelerated Medchem (ReFRAME) library using in vitro and bioinformatics approaches. We assessed the in vitro antiplasmodial activity of the compounds using blood-stage and liver-stage drug susceptibility assays. We used protein sequences of known targets of the ReFRAME compounds with high antiplasmodial activity (EC<sub>50</sub> < 10 uM) to conduct a protein-pairwise search to identify similar <i>Plasmodium falciparum</i> 3D7 proteins (from PlasmoDB) using NCBI protein BLAST. We further assessed the association between the compounds' in vitro antiplasmodial activity and level of similarity between their known and predicted <i>P. falciparum</i> target proteins using simple linear regression analyses. BLAST analyses revealed 735 <i>P. falciparum</i> proteins that were similar to the 226 known protein targets associated with the ReFRAME compounds. Antiplasmodial activity of the compounds was positively associated with the degree of similarity between the compounds’ known targets and predicted <i>P. falciparum</i> protein targets (percentage identity, E value, and bit score), the number of the predicted <i>P. falciparum</i> targets, and their respective mutagenesis index and fitness scores (R<sup>2</sup> between 0.066 and 0.92, <i>P</i> < 0.05). Compounds predicted to target essential <i>P. falciparum</i> proteins or those with a druggability index of 1 showed the highest antiplasmodial activity.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"16 1","pages":""},"PeriodicalIF":8.6,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00856-7","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141236004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}