High-bypass Learning: Automated Detection of Tumor Cells That Significantly Impact Drug Response

IF 65.3 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Foundations and Trends in Machine Learning, Pub Date: 2020-11-01, DOI: 10.1109/MLHPCAI4S51975.2020.00012

J. Wozniak, H. Yoo, J. Mohd-Yusof, Bogdan Nicolae, Nicholson T. Collier, J. Ozik, T. Brettin, Rick L. Stevens


Citations: 6


High-bypass Learning: Automated Detection of Tumor Cells That Significantly Impact Drug Response

Machine learning in biomedicine is reliant on the availability of large, high-quality data sets. These corpora are used for training statistical or deep learning-based models that can be validated against other data sets and ultimately used to guide decisions. The quality of these data sets is an essential component of the quality of the models and their decisions. Thus, identifying and inspecting outlier data is critical for evaluating, curating, and using biomedical data sets. Many techniques are available to look for outlier data, but it is not clear how to evaluate the impact on highly complex deep learning methods. In this paper, we use deep learning ensembles and workflows to construct a system for automatically identifying data subsets that have a large impact on the trained models. These effects can be quantified and presented to the user for further inspection, which could improve data quality overall. We then present results from running this method on the near-exascale Summit supercomputer.
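The abstract does not spell out the procedure, so the following is only a minimal sketch of one way to quantify how much a data subset affects a trained model: hold each subset out, retrain, and compare the validation error against a baseline trained on all of the data. Everything here is an assumption made for illustration, not the authors' system: the closed-form linear fit stands in for a deep learning model, and the names `subset_impact`, `cell_A`, `cell_B`, and `cell_C` are hypothetical.

```python
from statistics import mean


def fit_line(points):
    """Ordinary least-squares fit y = a*x + b for a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx


def mse(model, points):
    """Mean squared error of a fitted (a, b) line on a list of (x, y) pairs."""
    a, b = model
    return mean((a * x + b - y) ** 2 for x, y in points)


def subset_impact(groups, validation):
    """Map each group to (baseline error) - (error with that group held out).

    A large positive value means the model validates better without the
    group, which flags the group for inspection.
    """
    all_points = [p for pts in groups.values() for p in pts]
    baseline = mse(fit_line(all_points), validation)
    impact = {}
    for name in groups:
        kept = [p for g, pts in groups.items() if g != name for p in pts]
        impact[name] = baseline - mse(fit_line(kept), validation)
    return impact


# Hypothetical toy data: three "cell" groups, one of which is inconsistent.
groups = {
    "cell_A": [(0, 0), (1, 1), (2, 2)],
    "cell_B": [(3, 3), (4, 4), (5, 5)],
    "cell_C": [(1, 9), (2, 10)],  # deliberately off-trend subset
}
validation = [(0.5, 0.5), (2.5, 2.5), (4.5, 4.5)]

impact = subset_impact(groups, validation)
most_suspect = max(impact, key=impact.get)
print(most_suspect, impact)
```

In this toy setting the off-trend group is the only one whose removal lowers the validation error. With deep learning models each hold-out run is a full training job, which is presumably why the abstract pairs ensembles with workflows on a large machine.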


Source journal

Foundations and Trends in Machine Learning (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

CiteScore

108.50

Self-citation rate

0.00%

Journal description: Each issue of Foundations and Trends® in Machine Learning comprises a monograph of at least 50 pages written by research leaders in the field. We aim to publish monographs that provide an in-depth, self-contained treatment of topics where there have been significant new developments. Typically, this means that the monographs we publish will contain a significant level of mathematical detail (to describe the central methods and/or theory for the topic at hand), and will not eschew these details by simply pointing to existing references. Literature surveys and original research papers do not fall within these aims.