Compounds Activity Prediction in Large Imbalanced Datasets with Substructural Relations Fingerprint and EEM

Wojciech M. Czarnecki, Krzysztof Rataj
Published in: 2015 IEEE Trustcom/BigDataSE/ISPA
Publication date: 2015-08-20
DOI: 10.1109/Trustcom.2015.581 (https://doi.org/10.1109/Trustcom.2015.581)
Citations: 9

Abstract

Modern drug design relies on virtual screening, an efficient filtering step that preselects compounds likely to be valuable drug candidates. Recent advances in applying machine learning models to this step can significantly improve the overall quality of the drug design pipeline. Unfortunately, for many proteins it remains extremely hard to build a valid statistical model. This is a consequence of extreme class imbalance (up to 1000:1), large datasets (over 100,000 samples), and a restricted data representation (mostly high-dimensional, sparse, binary vectors). In this paper, we tackle this problem through three innovations. First, we represent compounds with a 2-dimensional graph representation. Second, we show how to provide an extremely fast method for measuring the similarity of such data. Finally, we use the Extreme Entropy Machine, which shows an increase in balanced accuracy over Extreme Learning Machines, Support Vector Machines, one-class Support Vector Machines, and Random Forest. The proposed pipeline yields significantly better results than all considered alternative, state-of-the-art approaches. We introduce several novel elements and show why they lead to a better model. Despite this, the pipeline should still be considered a proof of concept, and further investigation in the field is needed.
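
To illustrate why balanced accuracy is the reported metric under the class imbalance described above, the following minimal sketch compares it with plain accuracy. It is not the authors' code: the function names, the label convention (1 = active, 0 = inactive), and the toy dataset sizes are assumptions chosen only to mirror the 1000:1 ratio mentioned in the abstract.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on actives (1) and the recall on inactives (0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Hypothetical toy data at a 1000:1 ratio: 100 actives, 100,000 inactives.
y_true = [1] * 100 + [0] * 100_000

# A trivial "model" that labels every compound as inactive.
always_inactive = [0] * len(y_true)

acc = accuracy(y_true, always_inactive)
bacc = balanced_accuracy(y_true, always_inactive)
print(f"accuracy          = {acc:.4f}")
print(f"balanced accuracy = {bacc:.4f}")
```

In this toy setting the trivial classifier scores above 99.9% on plain accuracy while finding no actives at all, whereas its balanced accuracy is exactly 0.5, the same as random guessing. That gap is the reason a per-class metric is the more informative choice for heavily imbalanced screening data.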