{"title":"Tackling class imbalance and data scarcity in literature-based gene function annotation","authors":"Mathieu Blondel, Kazuhiro Seki, K. Uehara","doi":"10.1145/2009916.2010080","DOIUrl":null,"url":null,"abstract":"In recent years, a number of machine learning approaches to literature-based gene function annotation have been proposed. However, due to issues such as lack of labeled data, class imbalance and computational cost, they have usually been unable to surpass simpler approaches based on string-matching. In this paper, we propose a principled machine learning approach based on kernel classifiers. We show that kernels can address the task's inherent data scarcity by embedding additional knowledge and we propose a simple yet effective solution to deal with class imbalance. From experiments on the TREC Genomics Track data, our approach achieves better F1-score than two state-of-the-art approaches based on string-matching and cross-species information.","PeriodicalId":356580,"journal":{"name":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","volume":"64 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-07-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2009916.2010080","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5
Abstract
In recent years, a number of machine learning approaches to literature-based gene function annotation have been proposed. However, due to issues such as lack of labeled data, class imbalance and computational cost, they have usually been unable to surpass simpler approaches based on string-matching. In this paper, we propose a principled machine learning approach based on kernel classifiers. We show that kernels can address the task's inherent data scarcity by embedding additional knowledge and we propose a simple yet effective solution to deal with class imbalance. From experiments on the TREC Genomics Track data, our approach achieves better F1-score than two state-of-the-art approaches based on string-matching and cross-species information.