B. C. Santos, Cora Silberschneider, M. W. Rodrigues, C. L. N. Pinto, C. Nobre, Luis E. Zárate
Journal of Information and Data Management. Published 2019-12-30. DOI: 10.5753/jidm.2019.2034 (https://doi.org/10.5753/jidm.2019.2034)
Feature selection and comparison of classifiers for predicting protein class
Knowing the function of proteins is essential for understanding several biological systems. Laboratory experiments to determine a protein's class are costly and time-consuming, so efficient computational models are needed to identify the class to which a protein belongs. A significant volume of information about proteins and their structure is continually being made available in public data repositories. For example, the STING_DB database holds a large amount of information extracted from all protein structural levels (primary, secondary, tertiary, and quaternary), and this information is frequently used in classification models for this type of problem. However, it is not known which physical-chemical properties are most relevant to predicting the class, so the most suitable subset of properties needs to be identified. In this work, we propose an approach that combines a multi-objective genetic algorithm with the k-NN classifier to select the best physical-chemical properties. The genetic algorithm yields a smaller subset of features that contribute significantly to the prediction problem. To improve prediction performance, we apply a post-enrichment process and then compare the performance of our methodology across several classifiers: ANN, SVM, Random Forest, and k-NN. Our method achieved an average F-measure of 70.22% with the Random Forest classifier. Finally, a comparative analysis, with statistical significance, shows the relevance of our approach in relation to other methodologies.
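The feature-selection idea described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state the objectives, genetic operators, or parameters, so everything here is an assumption: the two objectives (maximize leave-one-out k-NN accuracy, minimize the number of selected features), one-point crossover, single-bit mutation, the toy data, and all function names. Accuracy stands in for the F-measure reported in the paper to keep the example short.

```python
import random

def knn_accuracy(X, y, mask, k=3):
    """Leave-one-out accuracy of k-NN restricted to the features selected by mask."""
    cols = [j for j, keep in enumerate(mask) if keep]
    correct = 0
    for i, xi in enumerate(X):
        # Squared Euclidean distance from sample i to every other sample.
        neighbours = sorted(
            (sum((xi[j] - xo[j]) ** 2 for j in cols), y[o])
            for o, xo in enumerate(X) if o != i
        )
        votes = [label for _, label in neighbours[:k]]
        predicted = max(sorted(set(votes)), key=votes.count)  # majority vote
        correct += predicted == y[i]
    return correct / len(X)

def dominates(a, b):
    """True if score a = (accuracy, n_features) is no worse than b on both objectives and differs."""
    return a[0] >= b[0] and a[1] <= b[1] and a != b

def evaluate(masks, X, y):
    """Score each distinct, non-empty mask on both (assumed) objectives."""
    return [(m, (knn_accuracy(X, y, m), sum(m))) for m in sorted(set(masks)) if any(m)]

def pareto_front(scored):
    """Keep the masks whose scores are not dominated by any other score."""
    return [(m, s) for m, s in scored if not any(dominates(t, s) for _, t in scored)]

def select_features(X, y, generations=20, pop_size=8, seed=0):
    """Toy multi-objective GA over binary feature masks (hypothetical parameters)."""
    rng = random.Random(seed)
    n = len(X[0])  # assumes at least two features
    pop = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            p1, p2 = rng.choice(pop), rng.choice(pop)
            cut = rng.randrange(1, n)            # one-point crossover
            child = list(p1[:cut] + p2[cut:])
            child[rng.randrange(n)] ^= 1         # flip one random bit (mutation)
            children.append(tuple(child))
        scored = evaluate(pop + children, X, y)
        front = pareto_front(scored)
        # Non-dominated masks survive first; the rest are ranked by accuracy, then size.
        rest = sorted((ms for ms in scored if ms not in front),
                      key=lambda ms: (-ms[1][0], ms[1][1]))
        pop = [m for m, _ in (front + rest)[:pop_size]]
    return pareto_front(evaluate(pop, X, y))

# Toy data (made up): feature 0 separates the classes; features 1 and 2 do not.
X = [[0.0, 5, 1], [0.1, 2, 7], [0.2, 9, 3], [0.3, 4, 8],
     [1.0, 3, 2], [1.1, 8, 6], [1.2, 1, 9], [1.3, 7, 4]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

for mask, (acc, n_feats) in select_features(X, y):
    print(mask, f"accuracy={acc:.2f}", f"features={n_feats}")
```

Each printed mask is a trade-off that no other surviving mask beats on both objectives at once. A real pipeline would score candidates on held-out data with the F-measure and then pass the selected features to the classifiers being compared.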