A Combination Method of the Tanimoto Coefficient and Proximity Measure of Random Forest for Compound Activity Prediction

IPSJ Digital Courier. Pub Date: 2008-03-15. DOI: 10.2197/IPSJDC.4.238

G. Kawamura, S. Seno, Y. Takenaka, H. Matsuda

Citations: 0

Abstract

Chemical and biological activities of compounds provide valuable information for discovering new drugs. Compound fingerprints, which represent structural information, are used to search for candidate compounds by similarity. However, predicting activity from structural similarity alone raises several accuracy problems. Although the amount of compound data is growing rapidly, the number of well-annotated compounds, e.g., those in the MDL Drug Data Report (MDDR) database, has not increased quickly. Because compounds known to have activity against a given biological target class are rare in the drug discovery process, prediction accuracy should be increased as the activity decreases, or the false positive rate should be maintained, in databases that contain a large number of un-annotated compounds and only a small number of compounds annotated with the biological activity. In this paper, we propose a new similarity scoring method that combines the Tanimoto coefficient with the proximity measure of random forest. The score contains two properties, derived from unsupervised and supervised methods of partial dependence for compounds. Thus, the proposed method is expected to indicate active compounds accurately. Using the Linear Discriminant Analysis (LDA) method, we compare the prediction performance of the proposed score with that of the Tanimoto coefficient and of the proximity measure, and demonstrate that the proposed scoring method predicts better than either of the two. We estimate the prediction accuracy on compound datasets extracted from MDDR using the proposed method. We also show that the proposed method can identify active compounds in datasets that include several un-annotated compounds.
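
The two ingredients named in the abstract can be illustrated with a minimal sketch. It assumes fingerprints are given as sets of "on" bit indices and that each compound's leaf index in every tree of an already trained random forest is available as a list. The abstract does not state how the two scores are combined, so the function `combined_score` and its weights `w_t` and `w_p` are hypothetical placeholders (a plain linear combination), not the authors' formula.

```python
from typing import Sequence, Set


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bit indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return len(a & b) / union


def rf_proximity(leaves_a: Sequence[int], leaves_b: Sequence[int]) -> float:
    """Random forest proximity: fraction of trees in which both compounds
    end up in the same leaf. Each sequence holds one leaf index per tree."""
    if len(leaves_a) != len(leaves_b) or len(leaves_a) == 0:
        raise ValueError("leaf lists must be non-empty and of equal length")
    same = sum(1 for x, y in zip(leaves_a, leaves_b) if x == y)
    return same / len(leaves_a)


def combined_score(t: float, p: float, w_t: float = 0.5, w_p: float = 0.5) -> float:
    """Hypothetical linear combination of the two scores.
    The weights are illustrative assumptions, not values from the paper."""
    return w_t * t + w_p * p


# Toy data (invented for illustration only).
fp_query = {1, 4, 7, 9}
fp_known = {1, 4, 8, 9}
leaves_query = [3, 5, 2, 8]   # leaf index of the query compound in 4 trees
leaves_known = [3, 1, 2, 8]   # leaf index of the known active in the same trees

t = tanimoto(fp_query, fp_known)              # 3 shared bits / 5 total bits
p = rf_proximity(leaves_query, leaves_known)  # same leaf in 3 of 4 trees
score = combined_score(t, p)
print(t, p, score)
```

The Tanimoto coefficient needs no activity labels (the unsupervised part), whereas the proximity depends on a forest trained on annotated compounds (the supervised part). In practice the weights would have to be fitted, for instance with a linear discriminant, rather than fixed by hand as here.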
