A Tool for Optimizing De-identified Health Data for Use in Statistical Classification

2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS). Pub Date: 2017-06-22. DOI: 10.1109/CBMS.2017.105

F. Prasser, J. Eicher, Raffael Bild, Helmut Spengler, K. Kuhn


Citations: 20

Abstract

When individual-level health data is shared in biomedical research, the privacy of patients and probands must be protected. This is typically achieved with methods of data de-identification, which transform data in such a way that formal guarantees about the degree of protection from re-identification can be provided. In the process, it is important to minimize loss of information to ensure that the resulting data is useful. A typical use case is the creation of predictive models for knowledge discovery and decision support, e.g., to infer diagnoses or to predict outcomes of therapies. A variety of methods have been developed which can be used to build robust statistical classifiers from de-identified data. However, they have not been tuned for practical use and they have not been implemented in mature software tools. To bridge this gap, we have extended ARX, an open-source anonymization tool for health data, with several new features. We have implemented a method for optimizing the suitability of de-identified data for building statistical classifiers and a method for assessing the performance of classifiers built from de-identified data. All methods are accessible via a comprehensive graphical user interface. We have used our implementation to create logistic regression models from a patient discharge dataset for predicting the costs of hospital stays. The results show that our approach enables the creation of privacy-preserving classifiers with optimal prediction accuracy.
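The workflow the abstract describes (de-identify the data, build a classifier from it, then assess that classifier against one built from the original data) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the records are synthetic, the field names (`age`, `zip`, `high_cost`) and helper functions are hypothetical, the privacy check is plain k-anonymity over generalized quasi-identifiers, and a toy majority-vote classifier stands in for the logistic regression models used in the paper. It does not use ARX or its API and does not reproduce the paper's results.

```python
from collections import Counter, defaultdict

# Synthetic, made-up records (not from the paper's discharge dataset).
train = [
    {"age": 23, "zip": "81675", "high_cost": False},
    {"age": 35, "zip": "81677", "high_cost": False},
    {"age": 38, "zip": "81679", "high_cost": False},
    {"age": 64, "zip": "80331", "high_cost": True},
    {"age": 71, "zip": "80333", "high_cost": True},
    {"age": 78, "zip": "80335", "high_cost": False},
]
test = [
    {"age": 29, "zip": "81671", "high_cost": False},
    {"age": 67, "zip": "80337", "high_cost": True},
]

def raw(record):
    """Feature key on the original data: exact age and full ZIP code."""
    return (record["age"], record["zip"])

def generalize(record, age_width=20):
    """Feature key on de-identified data: age interval and 3-digit ZIP prefix."""
    lo = (record["age"] // age_width) * age_width
    return (f"{lo}-{lo + age_width - 1}", record["zip"][:3] + "**")

def smallest_class_size(records, key):
    """Size of the smallest equivalence class; the data is k-anonymous for any k up to this."""
    return min(Counter(key(r) for r in records).values())

def train_majority(records, key, target="high_cost"):
    """Toy classifier (assumption): predict the most common target value per feature key."""
    votes = defaultdict(Counter)
    for r in records:
        votes[key(r)][r[target]] += 1
    # Fall back to the overall majority class for feature keys never seen in training.
    overall = Counter(r[target] for r in records).most_common(1)[0][0]
    table = {k: c.most_common(1)[0][0] for k, c in votes.items()}
    return lambda r: table.get(key(r), overall)

def accuracy(model, records, target="high_cost"):
    """Fraction of records whose predicted target matches the true value."""
    return sum(model(r) == r[target] for r in records) / len(records)

if __name__ == "__main__":
    for name, key in (("original", raw), ("de-identified", generalize)):
        model = train_majority(train, key)
        print(name, "k =", smallest_class_size(train, key),
              "accuracy =", accuracy(model, test))
```

Comparing the two accuracy figures mirrors the kind of assessment the abstract mentions: the classifier trained on the original data serves as a baseline for the one trained on de-identified data. With only a handful of synthetic records, the numbers themselves carry no meaning.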
