Journal of Cheminformatics: Latest Articles

Be aware of overfitting by hyperparameter optimization!
IF 7.1 · Zone 2 · Chemistry
Journal of Cheminformatics Pub Date : 2024-12-09 DOI: 10.1186/s13321-024-00934-w
Igor V. Tetko, Ruud van Deursen, Guillaume Godin
Abstract: Hyperparameter optimization is very frequently employed in machine learning. However, optimizing over a large parameter space can result in overfitted models. In recent studies on solubility prediction the authors collected seven thermodynamic and kinetic solubility datasets from different data sources. They used state-of-the-art graph-based methods and compared models developed for each dataset using different data cleaning protocols and hyperparameter optimization. In our study we showed that hyperparameter optimization did not always result in better models, possibly due to overfitting when using the same statistical measures. Similar results could be calculated using pre-set hyperparameters, reducing the computational effort by around 10,000 times. We also extended the previous analysis by adding a representation learning method based on Natural Language Processing of SMILES called Transformer CNN. We show that across all analyzed sets using exactly the same protocol, Transformer CNN provided better results than graph-based methods for 26 out of 28 pairwise comparisons while using only a tiny fraction of the time required by the other methods. Last but not least we stressed the importance of comparing calculation results using exactly the same statistical measures.
Scientific contribution: We showed that models with pre-optimized hyperparameters can suffer from overfitting and that using pre-set hyperparameters yields similar performance but four orders of magnitude faster. Transformer CNN provided significantly higher accuracy compared to other investigated methods.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00934-w
Citations: 0
CSearch: chemical space search via virtual synthesis and global optimization
IF 7.1 · Zone 2 · Chemistry
Journal of Cheminformatics Pub Date : 2024-12-05 DOI: 10.1186/s13321-024-00936-8
Hakjean Kim, Seongok Ryu, Nuri Jung, Jinsol Yang, Chaok Seok
Abstract: The two key components of computational molecular design are virtually generating molecules and predicting the properties of these generated molecules. This study focuses on an effective method for molecular generation through virtual synthesis and global optimization of a given objective function. Using a pre-trained graph neural network (GNN) objective function to approximate the docking energies of compounds for four target receptors, we generated highly optimized compounds with 300–400 times less computational effort compared to virtual compound library screening. These optimized compounds exhibit similar synthesizability and diversity to known binders with high potency and are notably novel compared to library chemicals or known ligands. This method, called CSearch, can be effectively utilized to generate chemicals optimized for a given objective function. With the GNN function approximating docking energies, CSearch generated molecules with predicted binding poses to the target receptors similar to known inhibitors, demonstrating its effectiveness in producing drug-like binders.
Scientific contribution: We have developed a method for effectively exploring the chemical space of drug-like molecules using a global optimization algorithm with fragment-based virtual synthesis. The compounds generated using this method optimize the given objective function efficiently and are synthesizable like commercial library compounds. Furthermore, they are diverse, novel drug-like molecules with properties similar to known inhibitors for target receptors.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00936-8
Citations: 0
Deepmol: an automated machine and deep learning framework for computational chemistry
IF 7.1 · Zone 2 · Chemistry
Journal of Cheminformatics Pub Date : 2024-12-05 DOI: 10.1186/s13321-024-00937-7
João Correia, João Capela, Miguel Rocha
Abstract: The domain of computational chemistry has experienced a significant evolution due to the introduction of Machine Learning (ML) technologies. Despite its potential to revolutionize the field, researchers are often encumbered by obstacles, such as the complexity of selecting optimal algorithms, the automation of data pre-processing steps, the necessity for adaptive feature engineering, and the assurance of model performance consistency across different datasets. Addressing these issues head-on, DeepMol stands out as an Automated ML (AutoML) tool by automating critical steps of the ML pipeline. DeepMol rapidly and automatically identifies the most effective data representation, pre-processing methods and model configurations for a specific molecular property/activity prediction problem. On 22 benchmark datasets, DeepMol obtained competitive pipelines compared with those requiring time-consuming feature engineering, model design and selection processes. As one of the first AutoML tools specifically developed for the computational chemistry domain, DeepMol stands out with its open-source code, in-depth tutorials, detailed documentation, and examples of real-world applications, all available at https://github.com/BioSystemsUM/DeepMol and https://deepmol.readthedocs.io/en/latest/. By introducing AutoML as a groundbreaking feature in computational chemistry, DeepMol establishes itself as the pioneering state-of-the-art tool in the field.
Scientific contribution: DeepMol aims to provide an integrated framework of AutoML for computational chemistry. DeepMol provides a more robust alternative to other tools with its integrated pipeline serialization, enabling seamless deployment using the fit, transform, and predict paradigms. It uniquely supports both conventional and deep learning models for regression, classification and multi-task learning, offering unmatched flexibility compared to other AutoML tools. DeepMol's predefined configurations and customizable objective functions make it accessible to users at all skill levels while enabling efficient and reproducible workflows. Benchmarking on diverse datasets demonstrated its ability to deliver optimized pipelines and superior performance across various molecular machine-learning tasks.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00937-7
Citations: 0
Sort & Slice: a simple and superior alternative to hash-based folding for extended-connectivity fingerprints
IF 7.1 · Zone 2 · Chemistry
Journal of Cheminformatics Pub Date : 2024-12-03 DOI: 10.1186/s13321-024-00932-y
Markus Dablander, Thierry Hanser, Renaud Lambiotte, Garrett M. Morris
Abstract: Extended-connectivity fingerprints (ECFPs) are a ubiquitous tool in current cheminformatics and molecular machine learning, and one of the most prevalent molecular feature extraction techniques used for chemical prediction. Atom features learned by graph neural networks can be aggregated to compound-level representations using a large spectrum of graph pooling methods. In contrast, sets of detected ECFP substructures are by default transformed into bit vectors using only a simple hash-based folding procedure. We introduce a general mathematical framework for the vectorisation of structural fingerprints via a formal operation called substructure pooling that encompasses hash-based folding, algorithmic substructure selection, and a wide variety of other potential techniques. We go on to describe Sort & Slice, an easy-to-implement and bit-collision-free alternative to hash-based folding for the pooling of ECFP substructures. Sort & Slice first sorts ECFP substructures according to their relative prevalence in a given set of training compounds and then slices away all but the L most frequent substructures, which are subsequently used to generate a binary fingerprint of desired length L. We computationally compare the performance of hash-based folding, Sort & Slice, and two advanced supervised substructure-selection schemes (filtering and mutual-information maximisation) for ECFP-based molecular property prediction. Our results indicate that, despite its technical simplicity, Sort & Slice robustly (and at times substantially) outperforms traditional hash-based folding as well as the other investigated substructure-pooling methods across distinct prediction tasks, data splitting techniques, machine-learning models and ECFP hyperparameters. We thus recommend that Sort & Slice canonically replace hash-based folding as the default substructure-pooling technique to vectorise ECFPs for supervised molecular machine learning.
Scientific contribution: A general mathematical framework for the vectorisation of structural fingerprints called substructure pooling; and the technical description and computational evaluation of Sort & Slice, a conceptually simple and bit-collision-free method for the pooling of ECFP substructures that robustly and markedly outperforms classical hash-based folding at molecular property prediction.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00932-y
Citations: 0
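The Sort & Slice procedure summarised in the abstract above (rank substructures by prevalence in the training set, keep the L most frequent, emit one bit per kept substructure) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: substructures are assumed to be plain integer identifiers, the function names are hypothetical, ties are broken by identifier, and the modulo-based `hash_fold_fingerprint` is only a simplified stand-in for hash-based folding, included to show how bit collisions arise.

```python
from collections import Counter


def fit_sort_and_slice(training_sets, length):
    """Map the `length` most prevalent substructure IDs to bit positions."""
    counts = Counter()
    for substructures in training_sets:
        # Count each substructure once per training compound.
        counts.update(set(substructures))
    # Sort by descending prevalence; ties broken by ID (an assumption of this sketch).
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    return {s: position for position, s in enumerate(ranked[:length])}


def sort_and_slice_fingerprint(substructures, index, length):
    """Binary fingerprint with one dedicated bit per kept substructure."""
    bits = [0] * length
    for s in set(substructures):
        position = index.get(s)
        if position is not None:  # substructures outside the top-L are dropped
            bits[position] = 1
    return bits


def hash_fold_fingerprint(substructures, length):
    """Simplified folding stand-in: distinct IDs may share a bit (collision)."""
    bits = [0] * length
    for s in set(substructures):
        bits[s % length] = 1
    return bits


# Toy data: three training compounds described by integer substructure IDs.
train = [{11, 22, 33}, {11, 22}, {11, 44}]
index = fit_sort_and_slice(train, length=2)          # {11: 0, 22: 1}
print(sort_and_slice_fingerprint({22, 44}, index, 2))  # [0, 1]
print(hash_fold_fingerprint({11, 33}, 2))              # [0, 1]: 11 and 33 collide
```

In this toy case the folded fingerprint cannot distinguish substructure 11 from 33, whereas the Sort & Slice index assigns every retained substructure its own position, which is the collision-free property the abstract refers to.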
cidalsDB: an AI-empowered platform for anti-pathogen therapeutics research
IF 7.1 · Zone 2 · Chemistry
Journal of Cheminformatics Pub Date : 2024-11-28 DOI: 10.1186/s13321-024-00929-7
Emna Harigua-Souiai, Ons Masmoudi, Samer Makni, Rafeh Oualha, Yosser Z. Abdelkrim, Sara Hamdi, Oussama Souiai, Ikram Guizani
Abstract: Computer-aided drug discovery (CADD) is nurtured by recent advances in big data analytics and Artificial Intelligence (AI) towards enhanced drug discovery (DD) outcomes. In this context, reliable datasets are of utmost importance. We herein present CidalsDB, a novel web server for AI-assisted DD against infectious pathogens, namely Leishmania parasites and Coronaviruses. We performed a literature search on molecules with validated anti-pathogen effects. Then, we consolidated these data with bioassays from PubChem. Finally, we constructed a database to store these datasets and make them accessible and ready-to-use for the scientific community through CidalsDB, a web-based interface. In a second step, we implemented and optimized four machine learning (ML) and three deep learning (DL) algorithms that optimally predicted the biological activity of molecules. Random Forests (RF), Multi-Layer Perceptron (MLP) and ChemBERTa were the best classifiers of anti-Leishmania molecules, while Gradient Boosting (GB), Graph-Convolutional Network (GCN) and ChemBERTa achieved the best performances on the Coronaviruses dataset. All six models were optimized and deployed through CidalsDB as anti-pathogen activity prediction models.
Scientific contribution: CidalsDB is an open access web-based tool that allows browsing and access to ready-to-use datasets of anti-pathogen molecules, alongside best performing AI models for biological activity prediction. It offers a democratized no-code platform for AI-based CADD, which shall foster innovation and collaboration within the DD community. CidalsDB is accessible through https://cidalsdb.streamlit.app/.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00929-7
Citations: 0
Group graph: a molecular graph representation with enhanced performance, efficiency and interpretability
IF 7.1 · Zone 2 · Chemistry
Journal of Cheminformatics Pub Date : 2024-11-28 DOI: 10.1186/s13321-024-00933-x
Piao-Yang Cao, Yang He, Ming-Yang Cui, Xiao-Min Zhang, Qingye Zhang, Hong-Yu Zhang
Abstract: The exploration of chemical space holds promise for developing influential chemical entities. Molecular representations, which reflect features of molecular structure in silico, assist in navigating chemical space appropriately. Unlike atom-level molecular representations, such as SMILES and atom graph, which can sometimes lead to confusing interpretations about chemical substructures, substructure-level molecular representations encode important substructures into molecular features; they not only provide more information for predicting molecular properties and drug-drug interactions but also help to interpret the correlations between molecular properties and substructures. However, it remains challenging to represent the entire molecular structure both intactly and simply with substructure-level molecular representations. In this study, we developed a novel substructure-level molecular representation and named it a group graph. The group graph offers three advantages: (a) the substructure of the group graph reflects the diversity and consistency of different molecular datasets; (b) the group graph retains molecular structural features with minimal information loss because the graph isomorphism network (GIN) of the group graph performs well in molecular properties and drug-drug interactions prediction, showing higher accuracy and efficiency than the model of other molecular graphs, even without any pretraining; and (c) the molecular property may change when the substructure is substituted with another of differing importance in the group graph, facilitating the detection of activity cliffs. In addition, we successfully predicted structural modifications to improve blood-brain barrier permeability (BBBP) via the GIN of the group graph. Therefore, the group graph has the advantage of simultaneously representing molecular local characteristics and global features.
Scientific contribution: The group graph, as a substructure-level molecular representation, has the ability to retain molecular structural features with minimal information loss. As a result, it shows superior performance in predicting molecular properties and drug-drug interactions with enhanced efficiency and interpretability.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00933-x
Citations: 0
Suitability of large language models for extraction of high-quality chemical reaction dataset from patent literature
IF 7.1 · Zone 2 · Chemistry
Journal of Cheminformatics Pub Date : 2024-11-26 DOI: 10.1186/s13321-024-00928-8
Sarveswara Rao Vangala, Sowmya Ramaswamy Krishnan, Navneet Bung, Dhandapani Nandagopal, Gomathi Ramasamy, Satyam Kumar, Sridharan Sankaran, Rajgopal Srinivasan, Arijit Roy
Abstract: With the advent of artificial intelligence (AI), it is now possible to design diverse and novel molecules from previously unexplored chemical space. However, a challenge for chemists is the synthesis of such molecules. Recently, there have been attempts to develop AI models for retrosynthesis prediction, which rely on the availability of a high-quality training dataset. In this work, we explore the suitability of large language models (LLMs) for extraction of high-quality chemical reaction data from patent documents. A comparative study on the same set of patents from an earlier study showed that the proposed automated approach can enhance the current datasets by addition of 26% new reactions. Several challenges were identified during reaction mining, and for some of them alternative solutions were proposed. A detailed analysis was also performed wherein several wrong entries were identified in the previously curated dataset. Reactions extracted using the proposed pipeline over a larger patent dataset can improve the accuracy and efficiency of synthesis prediction models in the future.
Scientific contribution: In this work we evaluated the suitability of large language models for mining a high-quality chemical reaction dataset from patent literature. We showed that the proposed approach can significantly improve the quantity of the reaction database by identifying more chemical reactions and improve the quality of the reaction database by correcting previous errors/false positives.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00928-8
Citations: 0
GT-NMR: a novel graph transformer-based approach for accurate prediction of NMR chemical shifts
IF 7.1 · Zone 2 · Chemistry
Journal of Cheminformatics Pub Date : 2024-11-26 DOI: 10.1186/s13321-024-00927-9
Haochen Chen, Tao Liang, Kai Tan, Anan Wu, Xin Lu
Abstract: In this work, inspired by the graph transformer, we presented an improved protocol, termed GT-NMR, which integrates 2D molecular graph representation with Transformer architecture, for accurate yet efficient prediction of NMR chemical shifts. The effectiveness of GT-NMR was thoroughly examined with the standard nmrshiftdb2 dataset, 37 natural products and structural elucidation of 11 pairs of natural products. Systematic analysis affirms that GT-NMR outperforms traditional graph-based methods in all aspects, achieving state-of-the-art performance, with mean absolute errors of 0.158 and 1.189 ppm in predicting ¹H and ¹³C NMR chemical shifts, respectively, for the standard nmrshiftdb2 dataset. Further scrutiny of its practical applications indicates that GT-NMR's efficacy is closely tied to molecular complexity, as quantified by the size-normalized spatial score (nSPS). For relatively simple molecules (nSPS ≤ 27.71), GT-NMR performs comparably to the best density functional, while its effectiveness significantly diminishes with complex molecules characterized by higher nSPS values (nSPS ≥ 38.42). This trend is consistent across other graph-based NMR chemical shift prediction methods as well. Therefore, while employing GT-NMR or other graph-based methods for the rapid and routine prediction of NMR chemical shifts, it is advisable to utilize nSPS to assess their suitability. The source codes and trained model of GT-NMR are publicly available on GitHub.
Scientific contribution: GT-NMR, which combines the 2D molecular graph representation with the Transformer architecture, was implemented for the first time to predict atom-level NMR chemical shifts, achieving state-of-the-art performance. More importantly, the reliability of GT-NMR and graph-based methods was assessed for the first time in terms of molecular complexity, as quantified by the size-normalized spatial score (nSPS). Systematic scrutiny demonstrated that GT-NMR offers a valuable way for routine application in structural screening and elucidation of relatively simple molecules.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00927-9
Citations: 0
Molecular identification via molecular fingerprint extraction from atomic force microscopy images
IF 7.1 · Zone 2 · Chemistry
Journal of Cheminformatics Pub Date : 2024-11-25 DOI: 10.1186/s13321-024-00921-1
Manuel González Lastre, Pablo Pou, Miguel Wiche, Daniel Ebeling, Andre Schirmeisen, Rubén Pérez
Abstract: Non-Contact Atomic Force Microscopy with CO-functionalized metal tips (referred to as HR-AFM) provides access to the internal structure of individual molecules adsorbed on a surface with unprecedented resolution. Previous works have shown that deep learning (DL) models can retrieve the chemical and structural information encoded in a 3D stack of constant-height HR-AFM images, leading to molecular identification. In this work, we overcome their limitations by using a well-established description of the molecular structure in terms of topological fingerprints, the 1024-bit Extended Connectivity Chemical Fingerprints of radius 2 (ECFP4), that were developed for substructure and similarity searching. ECFPs provide local structural information of the molecule, each bit correlating with a particular substructure within the molecule. Our DL model is able to extract this optimized structural descriptor from the 3D HR-AFM stacks and use it, through virtual screening, to identify molecules from their predicted ECFP4 with a retrieval accuracy on theoretical images of 95.4%. Furthermore, this approach, unlike previous DL models, assigns a confidence score, the Tanimoto similarity, to each of the candidate molecules, thus providing information on the reliability of the identification. By construction, the number of times a certain substructure is present in the molecule is lost during the hashing process, which is necessary to make the fingerprints useful for machine learning applications. We show that it is possible to complement the fingerprint-based virtual screening with global information provided by another DL model that predicts the chemical formula from the same HR-AFM stacks, boosting the identification accuracy up to 97.6%. Finally, we perform a limited test with experimental images, obtaining promising results towards the application of this pipeline under real conditions.
Scientific contribution: Previous works on molecular identification from AFM images used chemical descriptors that were intuitive for humans but sub-optimal for neural networks. We propose a novel method to extract the ECFP4 from AFM images and identify the molecule via a virtual screening, beating previous state-of-the-art models.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00921-1
Citations: 0
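The screening step described in the abstract above, ranking candidate molecules by the Tanimoto similarity between a predicted fingerprint and each library fingerprint, can be illustrated with a short sketch. It assumes fingerprints are stored as sets of on-bit indices; the function names, the toy library and its molecule labels are hypothetical, and returning 0.0 for two empty fingerprints is a convention chosen here, not something stated in the paper.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity |A ∩ B| / |A ∪ B| between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    # Two empty fingerprints: return 0.0 by convention (an assumption of this sketch).
    return len(a & b) / union if union else 0.0


def screen(predicted_bits, library, top_k=3):
    """Rank library molecules by similarity to the predicted fingerprint."""
    scored = [(tanimoto(predicted_bits, bits), name) for name, bits in library.items()]
    # Highest similarity first; ties broken alphabetically by name.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(name, score) for score, name in scored[:top_k]]


# Hypothetical library: molecule label -> on-bit indices of its fingerprint.
library = {
    "mol_a": {1, 5, 9},
    "mol_b": {1, 5, 7, 9},
    "mol_c": {2, 3},
}
print(screen({1, 5, 9}, library, top_k=2))  # [('mol_a', 1.0), ('mol_b', 0.75)]
```

The similarity attached to each candidate plays the role of the confidence score mentioned in the abstract: a top hit near 1.0 is a much stronger identification than a top hit that barely outscores the runner-up.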
A systematic review of deep learning chemical language models in recent era
IF 7.1 · Zone 2 · Chemistry
Journal of Cheminformatics Pub Date : 2024-11-18 DOI: 10.1186/s13321-024-00916-y
Hector Flores-Hernandez, Emmanuel Martinez-Ledesma
Abstract: Discovering new chemical compounds with specific properties can provide advantages for fields that rely on materials for their development, although this task comes at a high cost in terms of complexity and resources. Since the beginning of the data age, deep learning techniques have revolutionized the process of designing molecules by analyzing and learning from representations of molecular data, greatly reducing the resources and time involved. Various deep learning approaches have been developed to date, using a variety of architectures and strategies, in order to explore the extensive and discontinuous chemical space, providing benefits for generating compounds with specific properties. In this study, we present a systematic review that offers a statistical description and comparison of the strategies utilized to generate molecules through deep learning techniques, utilizing the metrics proposed in Molecular Sets (MOSES) or GuacaMol. The study included 48 articles retrieved from a query-based search of Scopus and Web of Science and 25 articles retrieved from citation search, yielding a total of 72 retrieved articles, of which 62 correspond to chemical language model approaches to molecule generation and the other 10 correspond to molecular graph representations. Transformers, recurrent neural networks (RNNs), generative adversarial networks (GANs), Structured State Space Sequence (S4) models, and variational autoencoders (VAEs) are considered the main deep learning architectures used for molecule generation in the set of retrieved articles. In addition, transfer learning, reinforcement learning, and conditional learning are the most employed techniques for biased model generation and exploration of specific chemical space regions. Finally, this analysis focuses on the central themes of molecular representation, databases, training dataset size, validity-novelty trade-off, and performance of unbiased and biased chemical language models. These themes were selected to conduct a statistical analysis utilizing graphical representation and statistical tests. The resulting analysis reveals the main challenges, advantages, and opportunities in the field of chemical language models over the past four years.
PDF: https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-024-00916-y
Citations: 0