Comparative analysis of machine learning techniques for the prediction of logP

2011 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). Pub Date: 2011-04-11. DOI: 10.1109/CIBCB.2011.5948478

Edward W. Lowe, Mariusz Butkiewicz, Matthew Spellings, A. Omlor, J. Meiler


Citations: 7

Abstract

Several machine learning techniques were evaluated for the prediction of logP. LogP, the logarithm of the equilibrium octanol-water partition coefficient of a given substance, is a measure of hydrophobicity and an important property for drug absorption, distribution, metabolism, and excretion (ADME). The algorithms used include artificial neural networks (ANN), support vector machines (SVM) with the extension for regression, and kappa nearest neighbor (k-NN). Molecules were described using optimized feature sets derived from a series of scalar, two-dimensional, and three-dimensional descriptors, including 2-D and 3-D autocorrelation and radial distribution functions. Feature optimization was performed by sequential forward feature selection. The data set contained over 25,000 molecules with experimentally determined logP values collected from the Reaxys and MDDR databases, as well as through data mining with SciFinder. In this work, feature sets and algorithmic parameters were systematically optimized to build models that predict logP with a root mean square deviation (rmsd) of 0.86 for compounds in an independent test set. This result is a substantial improvement over XlogP, an incremental system that achieves an rmsd of 1.41 on the same data set. The final models were 5-fold cross-validated. These fully in silico models can be useful in guiding early stages of drug discovery, such as virtual library screening and analogue prioritization prior to synthesis and biological testing. These models are freely available for academic use.
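To make the workflow in the abstract concrete, the following is a minimal sketch of sequential forward feature selection scored by 5-fold cross-validated rmsd, with a nearest-neighbor regressor standing in for the predictor. It is not the authors' implementation. The toy data, the choice of k = 3, the interleaved fold assignment, the stopping rule, and every function name here are assumptions made for illustration, and the paper's actual descriptors and parameter settings are not reproduced.

```python
import math
import random


def rmsd(predicted, observed):
    """Root mean square deviation between two equal-length sequences."""
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def knn_predict(train_x, train_y, query, features, k):
    """Average the targets of the k training rows closest to `query`,
    using squared Euclidean distance over the chosen `features` only."""
    ranked = sorted(
        (sum((row[f] - query[f]) ** 2 for f in features), target)
        for row, target in zip(train_x, train_y)
    )
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


def cv_rmsd(x, y, features, k=3, folds=5):
    """Cross-validated rmsd: every row is predicted from the other folds."""
    predictions = [0.0] * len(x)
    for fold in range(folds):
        # Assumption: rows are assigned to folds by index modulo `folds`.
        train = [i for i in range(len(x)) if i % folds != fold]
        held_out = [i for i in range(len(x)) if i % folds == fold]
        train_x = [x[i] for i in train]
        train_y = [y[i] for i in train]
        for i in held_out:
            predictions[i] = knn_predict(train_x, train_y, x[i], features, k)
    return rmsd(predictions, y)


def forward_select(x, y, max_features, k=3, folds=5):
    """Greedily add the feature that lowers the cross-validated rmsd most;
    stop when no candidate improves the score (an assumed stopping rule)."""
    selected, best = [], float("inf")
    remaining = list(range(len(x[0])))
    while remaining and len(selected) < max_features:
        score, feature = min(
            (cv_rmsd(x, y, selected + [f], k, folds), f) for f in remaining
        )
        if score >= best:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = score
    return selected, best


def make_toy_data(n=80, seed=0):
    """Hypothetical data: four random descriptors, of which only columns
    0 and 2 influence the target. These are not real logP values."""
    rng = random.Random(seed)
    x = [[rng.random() for _ in range(4)] for _ in range(n)]
    y = [2.0 * row[0] - row[2] for row in x]
    return x, y


if __name__ == "__main__":
    toy_x, toy_y = make_toy_data()
    chosen, score = forward_select(toy_x, toy_y, max_features=3)
    print("selected feature indices:", chosen)
    print("cross-validated rmsd:", round(score, 3))
```

In this sketch the cross-validated score drives the selection itself. The abstract additionally reports rmsd on an independent test set, which would need to be held out before selection begins so that it plays no part in choosing features.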


