Trustworthy machine learning for health care: scalable data valuation with the Shapley value

Proceedings of the ACM Conference on Health, Inference, and Learning. Pub Date: 2021-04-08. DOI: 10.1145/3450439.3451861

Konstantin D. Pandl, Fabian Feiland, Scott Thiebes, A. Sunyaev

Citations: 10

Abstract

Collecting data from many sources is an essential approach to generate large data sets required for the training of machine learning models. Trustworthy machine learning requires incentives, guarantees of data quality, and information privacy. Applying recent advancements in data valuation methods for machine learning can help to enable these. In this work, we analyze the suitability of three different data valuation methods for medical image classification tasks, specifically pleural effusion, on an extensive data set of chest X-ray scans. Our results reveal that a heuristic for calculating the Shapley valuation scheme based on a k-nearest neighbor classifier can successfully value large quantities of data instances. We also demonstrate possible applications for incentivizing data sharing, the efficient detection of mislabeled data, and summarizing data sets to exclude private information. Thereby, this work contributes to developing modern data infrastructures for trustworthy machine learning in health care.
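The abstract does not spell out the k-nearest neighbor heuristic, so the following is only a minimal sketch of one way such a valuation can be computed. It assumes the closed-form recursion for the Shapley value of an unweighted KNN classifier that is known from the data valuation literature (Jia et al., 2019); whether the paper uses exactly this variant is an assumption. The function name `knn_shapley`, the Euclidean distance, and the toy data are hypothetical and chosen for illustration only.

```python
import numpy as np


def knn_shapley(X_train, y_train, X_test, y_test, K=5):
    """Shapley value of each training point under an unweighted KNN utility.

    Assumption: utility of a training subset for one test point is the
    fraction of its K nearest neighbors whose label matches the test label.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    y_test = np.asarray(y_test)
    n = len(X_train)
    values = np.zeros(n)

    for x, y in zip(X_test, y_test):
        # Rank training points from nearest to farthest (Euclidean distance).
        dist = np.linalg.norm(X_train - x, axis=1)
        order = np.argsort(dist, kind="stable")
        match = (y_train[order] == y).astype(float)

        # Recursion runs from the farthest point back to the nearest one.
        s = np.zeros(n)
        s[n - 1] = match[n - 1] / n
        for i in range(n - 2, -1, -1):
            rank = i + 1  # 1-based rank of the point at sorted position i
            s[i] = s[i + 1] + (match[i] - match[i + 1]) / K * min(K, rank) / rank

        # Map the per-rank values back to the original training indices.
        values[order] += s

    # Average the per-test-point values over the test set.
    return values / len(X_test)


if __name__ == "__main__":
    # Hypothetical 1-D toy data: the point at 1.0 carries the "wrong" label.
    X_train = [[0.0], [1.0], [10.0]]
    y_train = [1, 0, 1]
    X_test = [[0.2]]
    y_test = [1]
    print(knn_shapley(X_train, y_train, X_test, y_test, K=1))
```

In this sketch, training points whose labels disagree with nearby test points receive low or negative values, which is the property that makes such a valuation usable for flagging possibly mislabeled instances. Each test point costs one sort over the training set, so the computation scales to large numbers of data instances.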



