Comparative analysis of machine learning techniques for the prediction of logP
Edward W. Lowe, Mariusz Butkiewicz, Matthew Spellings, A. Omlor, J. Meiler
2011 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), 2011-04-11
DOI: 10.1109/CIBCB.2011.5948478 (https://doi.org/10.1109/CIBCB.2011.5948478)
Citations: 7
Abstract
Several machine learning techniques were evaluated for the prediction of logP. The algorithms include artificial neural networks (ANN), support vector machines (SVM) with the extension for regression, and k-nearest neighbor (k-NN). Molecules were described using optimized feature sets derived from a series of scalar, two-dimensional, and three-dimensional descriptors, including 2-D and 3-D autocorrelation and radial distribution functions. Feature optimization was performed by sequential forward feature selection. The data set contained over 25,000 molecules with experimentally determined logP values collected from the Reaxys and MDDR databases, as well as through data mining with SciFinder. LogP, the logarithm of the equilibrium octanol-water partition coefficient of a given substance, is a measure of hydrophobicity. This property is an important factor in drug absorption, distribution, metabolism, and excretion (ADME). In this work, models built by systematically optimizing feature sets and algorithmic parameters predict logP with a root mean square deviation (rmsd) of 0.86 for compounds in an independent test set. This result is a substantial improvement over XlogP, an incremental system that achieves an rmsd of 1.41 on the same data set. The final models were 5-fold cross-validated. These fully in silico models can be useful in guiding early stages of drug discovery, such as virtual library screening and analogue prioritization prior to synthesis and biological testing. These models are freely available for academic use.
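
The following is a minimal sketch of how sequential forward feature selection, k-NN regression, 5-fold cross-validation, and the rmsd metric named in the abstract can fit together. It is not the authors' implementation: the function names (`rmsd`, `knn_predict`, `cv_rmsd`, `forward_select`), the choice of k = 3, the interleaved fold assignment, and the synthetic four-feature data set are all assumptions made for illustration, and the toy targets are not logP measurements.

```python
import math
import random


def rmsd(predicted, observed):
    """Root mean square deviation between two equal-length sequences."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


def knn_predict(train_x, train_y, query, features, k=3):
    """Average the targets of the k nearest training rows.

    Distances are squared Euclidean, computed only over `features`.
    """
    dists = sorted(
        (sum((row[f] - query[f]) ** 2 for f in features), y)
        for row, y in zip(train_x, train_y)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cv_rmsd(x, y, features, folds=5, k=3):
    """Cross-validated rmsd of k-NN for one feature subset.

    Row i is held out in fold i % folds (an assumed, simple split).
    """
    n = len(x)
    preds = [0.0] * n
    for fold in range(folds):
        train_idx = [i for i in range(n) if i % folds != fold]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in range(n):
            if i % folds == fold:
                preds[i] = knn_predict(train_x, train_y, x[i], features, k)
    return rmsd(preds, y)


def forward_select(x, y, n_features, folds=5, k=3):
    """Greedy forward selection: repeatedly add the single feature that
    lowers the cross-validated rmsd the most; stop when none improves it."""
    selected, best_score = [], float("inf")
    remaining = list(range(n_features))
    while remaining:
        score, feature = min(
            (cv_rmsd(x, y, selected + [f], folds, k), f) for f in remaining
        )
        if score >= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score


# Synthetic toy data (assumption): only features 0 and 2 carry signal.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(120)]
Y = [2.0 * row[0] - 1.0 * row[2] + rng.gauss(0, 0.05) for row in X]

selected, score = forward_select(X, Y, n_features=4)
print("selected features:", selected, "cv rmsd:", round(score, 3))
```

On this toy set the informative features should be picked first, because each one sharply lowers the cross-validated rmsd. A real descriptor set with hundreds of columns would need a faster neighbor search and scaled features, which this sketch omits.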