Alireza Fazelpour, T. Khoshgoftaar, D. Dittman, Amri Napolitano
{"title":"数据采样的加入是否提高了非平衡生物信息学数据的增强算法的性能?","authors":"Alireza Fazelpour, T. Khoshgoftaar, D. Dittman, Amri Napolitano","doi":"10.1109/ICMLA.2015.23","DOIUrl":null,"url":null,"abstract":"Bioinformatics datasets contain many challenging characteristics, such as class imbalance, which adversely impacts the performance of supervised classification models built on these datasets. Techniques such as ensemble learning and data sampling from the domain of data mining can be deployed to alleviate the problem and to improve the classification performance. In this study, we sought to seek whether inclusion of data sampling within the ensemble framework can further improve the performance of classification models. To this end, we performed an experimental study using two newly hybrid ensemble techniques, one integrates feature selection within the boosting process and the other incorporates random under-sampling followed by feature selection within the boosting framework, two learners, three forms of feature rankers, and four feature subset sizes on 15 highly imbalanced bioinformatics datasets. Our results and statistical analysis demonstrate that the difference between the two boosting methods is statistically insignificant. Therefore, as the inclusion of data sampling has no significant positive effect on the performance of ensemble classifiers, it is not required to achieve maximum classification performance. To our knowledge, this is the first empirical study that examined the effects of data sampling, random under-sampling, to enhance classification performance of boosting algorithm for highly imbalanced bioinformatics data.","PeriodicalId":288427,"journal":{"name":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Does the Inclusion of Data Sampling Improve the Performance of Boosting Algorithms on Imbalanced Bioinformatics Data?\",\"authors\":\"Alireza Fazelpour, T. Khoshgoftaar, D. Dittman, Amri Napolitano\",\"doi\":\"10.1109/ICMLA.2015.23\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Bioinformatics datasets contain many challenging characteristics, such as class imbalance, which adversely impacts the performance of supervised classification models built on these datasets. Techniques such as ensemble learning and data sampling from the domain of data mining can be deployed to alleviate the problem and to improve the classification performance. In this study, we sought to seek whether inclusion of data sampling within the ensemble framework can further improve the performance of classification models. To this end, we performed an experimental study using two newly hybrid ensemble techniques, one integrates feature selection within the boosting process and the other incorporates random under-sampling followed by feature selection within the boosting framework, two learners, three forms of feature rankers, and four feature subset sizes on 15 highly imbalanced bioinformatics datasets. Our results and statistical analysis demonstrate that the difference between the two boosting methods is statistically insignificant. Therefore, as the inclusion of data sampling has no significant positive effect on the performance of ensemble classifiers, it is not required to achieve maximum classification performance. To our knowledge, this is the first empirical study that examined the effects of data sampling, random under-sampling, to enhance classification performance of boosting algorithm for highly imbalanced bioinformatics data.\",\"PeriodicalId\":288427,\"journal\":{\"name\":\"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)\",\"volume\":\"28 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICMLA.2015.23\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE 14th International Conference on Machine Learning and Applications (ICMLA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICMLA.2015.23","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Does the Inclusion of Data Sampling Improve the Performance of Boosting Algorithms on Imbalanced Bioinformatics Data?
Bioinformatics datasets contain many challenging characteristics, such as class imbalance, which adversely impacts the performance of supervised classification models built on these datasets. Techniques such as ensemble learning and data sampling from the domain of data mining can be deployed to alleviate the problem and to improve the classification performance. In this study, we sought to seek whether inclusion of data sampling within the ensemble framework can further improve the performance of classification models. To this end, we performed an experimental study using two newly hybrid ensemble techniques, one integrates feature selection within the boosting process and the other incorporates random under-sampling followed by feature selection within the boosting framework, two learners, three forms of feature rankers, and four feature subset sizes on 15 highly imbalanced bioinformatics datasets. Our results and statistical analysis demonstrate that the difference between the two boosting methods is statistically insignificant. Therefore, as the inclusion of data sampling has no significant positive effect on the performance of ensemble classifiers, it is not required to achieve maximum classification performance. To our knowledge, this is the first empirical study that examined the effects of data sampling, random under-sampling, to enhance classification performance of boosting algorithm for highly imbalanced bioinformatics data.