J. Wozniak, H. Yoo, J. Mohd-Yusof, Bogdan Nicolae, Nicholson T. Collier, J. Ozik, T. Brettin, Rick L. Stevens
Title: High-bypass Learning: Automated Detection of Tumor Cells That Significantly Impact Drug Response
DOI: https://doi.org/10.1109/MLHPCAI4S51975.2020.00012
Published: 2020-11-01
Journal: Foundations and Trends in Machine Learning, vol. 82 (1), pp. 1-10
Citations: 6
Abstract
Machine learning in biomedicine relies on the availability of large, high-quality data sets. These corpora are used to train statistical or deep learning-based models that can be validated against other data sets and ultimately used to guide decisions. The quality of these data sets is an essential component of the quality of the models and their decisions. Thus, identifying and inspecting outlier data is critical for evaluating, curating, and using biomedical data sets. Many techniques are available for finding outlier data, but it is not clear how to evaluate the impact of such data on highly complex deep learning methods. In this paper, we use deep learning ensembles and workflows to construct a system that automatically identifies data subsets that have a large impact on the trained models. These effects can be quantified and presented to the user for further inspection, which could improve overall data quality. We then present results from running this method on the near-exascale Summit supercomputer.
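The abstract does not spell out how a subset's impact is quantified, so the following is only a toy sketch of one plausible reading: retrain with each subset withheld and compare validation error against a baseline trained on everything. A least-squares linear model stands in for the deep learning ensemble, and the function names (`fit_linear`, `subset_impact`), the synthetic data, and the corrupted group are all hypothetical illustrations, not the authors' system.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares fit with an intercept; a stand-in for a trained model."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def mse(coef, X, y):
    """Mean squared error of the fitted model on (X, y)."""
    A = np.column_stack([X, np.ones(len(X))])
    return float(np.mean((A @ coef - y) ** 2))


def subset_impact(X, y, groups, X_val, y_val):
    """Map each group id to the change in validation MSE when it is withheld.

    A negative value means the model validates better without that group.
    """
    baseline = mse(fit_linear(X, y), X_val, y_val)
    impact = {}
    for g in np.unique(groups):
        keep = groups != g
        withheld = mse(fit_linear(X[keep], y[keep]), X_val, y_val)
        impact[int(g)] = withheld - baseline
    return impact


# Synthetic data (assumption): 5 groups of 40 samples, 3 features each.
rng = np.random.default_rng(0)
n_groups, per_group = 5, 40
groups = np.repeat(np.arange(n_groups), per_group)
true_w = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n_groups * per_group, 3))
y = X @ true_w + rng.normal(scale=0.1, size=len(X))
y[groups == 3] += 5.0  # deliberately corrupt one group's responses

# Clean validation set drawn from the same underlying relationship.
X_val = rng.normal(size=(100, 3))
y_val = X_val @ true_w + rng.normal(scale=0.1, size=100)

impact = subset_impact(X, y, groups, X_val, y_val)
for g, delta in sorted(impact.items(), key=lambda kv: kv[1]):
    print(f"group {g}: change in validation MSE when withheld = {delta:+.3f}")
```

In this toy setting the corrupted group is the one whose removal lowers validation error the most, which is the kind of signal that could be surfaced to a user for inspection. Each withheld-subset fit is independent of the others, so a sweep of this shape parallelizes naturally across many workers.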
About the journal:
Each issue of Foundations and Trends® in Machine Learning comprises a monograph of at least 50 pages written by research leaders in the field. We aim to publish monographs that provide an in-depth, self-contained treatment of topics where there have been significant new developments. Typically, this means that the monographs we publish will contain a significant level of mathematical detail (to describe the central methods and/or theory for the topic at hand), and will not eschew these details by simply pointing to existing references. Literature surveys and original research papers do not fall within these aims.