Randall Wald, T. Khoshgoftaar, Alireza Fazelpour, D. Dittman
Hidden dependencies between class imbalance and difficulty of learning for bioinformatics datasets
2013 IEEE 14th International Conference on Information Reuse & Integration (IRI), published 2013-10-24
DOI: https://doi.org/10.1109/IRI.2013.6642477
Citations: 14
Abstract
Many bioinformatics datasets share certain problems: they exhibit class imbalance (one class has many more instances than the remaining class or classes), or they are difficult to learn from (it is hard to build accurate models with them). Much research has investigated these two problems, or even considered both at once. However, hidden dependencies can exist between them: in a given collection of datasets, the highly imbalanced datasets may also be particularly difficult or particularly easy to learn from, so conclusions attributed to the level of class imbalance may actually reflect the difficulty of learning. We present a case study with twenty-six bioinformatics datasets that exhibits this dependency and highlights how it can lead to misleading conclusions about the absolute and relative performance of learners and feature rankers across balance levels.
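
One way to probe a collection of datasets for the kind of dependency described above is to rank-correlate each dataset's balance level with a difficulty score. The sketch below is only an illustration under stated assumptions, not the procedure used in the paper: the helper names (`minority_fraction`, `ranks`, `spearman`), the choice of Spearman correlation, the use of "1 - AUC of a baseline learner" as a difficulty score, and all numeric values are hypothetical.

```python
from collections import Counter
from math import sqrt


def minority_fraction(labels):
    """Fraction of instances belonging to the least frequent class."""
    counts = Counter(labels)
    return min(counts.values()) / len(labels)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical values, NOT taken from the paper: one entry per dataset.
minority = [0.05, 0.10, 0.20, 0.35, 0.50]    # minority-class fraction
difficulty = [0.40, 0.35, 0.30, 0.20, 0.15]  # assumed score, e.g. 1 - AUC
rho = spearman(minority, difficulty)
print(f"Spearman correlation between balance and difficulty: {rho:.3f}")
```

A correlation far from zero in such a check would suggest that, within that particular collection, comparisons across balance levels are confounded with difficulty and should be interpreted with care.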