Logistic regression and random forest unveil key molecular descriptors of druglikeness
Authors: L. T. Billones, Nadia B. Morales, J. Billones
Journal: Chem-Bio Informatics Journal
Publication date: 2021-09-08
DOI: 10.1273/CBIJ.21.39 (https://doi.org/10.1273/CBIJ.21.39)
Citation count: 1
Abstract
Identifying the molecular descriptors that encode the chemical information underlying druglikeness would be a step forward for data-driven drug discovery and development. In this study, over 4000 Dragon-type molecular properties were generated for approximately 2000 known drugs and 2000 surrogate nondrugs. Logistic Regression (LogR) and Random Forest (RF) techniques were used to identify the crucial molecular descriptors that can adequately classify a compound as drug or nondrug. Ten one-variable LogR models each demonstrated at least 70% prediction accuracy. A two-variable model consisting of HVcpx and MDDD correctly classified 85% of the test compounds. The best LogR model, with 89.0% prediction accuracy, identified the five most influential descriptors for druglikeness: an information index, HVcpx; a topological index, MDDD; a ring descriptor, NNRS; the average connectivity index of order 2, X2A; and a walk and path count, SRW05. The best RF model, involving 10 descriptors that are only weakly correlated, was 92.5% accurate and on par with the RF and LogR models that consisted of over 200 variables. This model featured: molecular weight, MW; average molecular weight, AMW; rotatable bond fraction, RBF; percentage of carbon, C%; maximal electrotopological negative variation, MAXDN; all-path Wiener index, Wap; structural information content index (neighborhood symmetry of order 1), SIC1; number of nitrogen atoms, nN; 2D Petitjean shape index, PJI2; and self-returning walk count of order 5, SRW05. Many of these descriptors have straightforward chemical interpretations and could serve as druglikeness filters in virtual high-throughput drug discovery.
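To make the idea of a one-variable LogR model concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state which software or fitting procedure was used, so plain batch gradient descent on the log loss is assumed here. The data are synthetic: a single hypothetical descriptor is drawn from two overlapping normal distributions standing in for drugs and nondrugs, and the names `fit_one_variable_logr`, `accuracy`, and `make_synthetic` are illustrative only.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_one_variable_logr(xs, ys, lr=0.1, epochs=500):
    """Fit p(drug | x) = sigmoid(w * x + b) by batch gradient descent.

    Assumption: gradient descent on the mean log loss; the paper's
    actual fitting procedure is not given in the abstract.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, xs, ys):
    """Fraction of compounds whose predicted class (threshold 0.5) matches the label."""
    hits = sum((sigmoid(w * x + b) >= 0.5) == (y == 1) for x, y in zip(xs, ys))
    return hits / len(xs)


def make_synthetic(n_per_class, rng):
    """Synthetic stand-in data: one hypothetical descriptor value per compound.

    Label 1 = drug, label 0 = nondrug. These are NOT values from the paper.
    """
    data = [(rng.gauss(1.0, 1.0), 1) for _ in range(n_per_class)]
    data += [(rng.gauss(-1.0, 1.0), 0) for _ in range(n_per_class)]
    rng.shuffle(data)
    return data


rng = random.Random(0)
data = make_synthetic(500, rng)

# Hold out 20% of the compounds as a test set (the split ratio is an assumption).
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

w, b = fit_one_variable_logr([x for x, _ in train], [y for _, y in train])
test_acc = accuracy(w, b, [x for x, _ in test], [y for _, y in test])
print(f"w={w:.3f} b={b:.3f} test accuracy={test_acc:.3f}")
```

Repeating this fit once per descriptor and ranking the descriptors by held-out accuracy is one plausible way to arrive at a shortlist of single-variable models like the ten mentioned above. A multi-variable model would extend the same update to a weight vector.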