Compounds Activity Prediction in Large Imbalanced Datasets with Substructural Relations Fingerprint and EEM
Wojciech M. Czarnecki, Krzysztof Rataj
2015 IEEE Trustcom/BigDataSE/ISPA, published 2015-08-20
DOI: https://doi.org/10.1109/Trustcom.2015.581
Modern drug design procedures involve virtual screening, a highly efficient filtering step that preselects compounds which are valuable drug candidates. Recent advances in applying machine learning models to this process can significantly increase the overall quality of the drug design pipeline. Unfortunately, for many proteins it is still extremely hard to build a valid statistical model. This is a consequence of extreme class imbalance (as high as 1000:1), large datasets (over 100,000 samples) and a restricted data representation (mostly high-dimensional, sparse, binary vectors). In this paper, we try to tackle this problem through three innovations. First, we represent compounds with a 2-dimensional graph representation. Second, we show how to provide an extremely fast method for measuring the similarity of such data. Finally, we use the Extreme Entropy Machine, which shows an increase in balanced accuracy over Extreme Learning Machines, Support Vector Machines, one-class Support Vector Machines and Random Forest. The proposed pipeline yields significantly better results than all of the state-of-the-art alternatives considered. We introduce several novel elements and show why they lead to a better model. Nevertheless, the pipeline should still be regarded as a proof of concept, and further investigation in the field is needed.
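To illustrate why balanced accuracy, rather than plain accuracy, is the relevant metric at a 1000:1 class ratio, here is a minimal sketch. It is not the authors' evaluation code; it assumes binary labels (1 = active, 0 = inactive), and the toy data and function names are hypothetical, chosen only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (TPR) and specificity (TNR) for labels in {0, 1}."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    return 0.5 * (tp / pos + tn / neg)


# Hypothetical toy set at a 1000:1 ratio: 1000 inactives and 1 active.
y_true = [0] * 1000 + [1]

# A trivial classifier that labels every compound as inactive.
always_inactive = [0] * len(y_true)

print(accuracy(y_true, always_inactive))           # close to 1, yet useless
print(balanced_accuracy(y_true, always_inactive))  # 0.5, i.e. chance level
```

The trivial classifier finds no active compound, yet its plain accuracy is nearly perfect. Balanced accuracy weights both classes equally and so exposes the failure.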