A Tool for Optimizing De-identified Health Data for Use in Statistical Classification
F. Prasser, J. Eicher, Raffael Bild, Helmut Spengler, K. Kuhn
2017 IEEE 30th International Symposium on Computer-Based Medical Systems (CBMS), published 2017-06-22
DOI: 10.1109/CBMS.2017.105 (https://doi.org/10.1109/CBMS.2017.105)
Citations: 20
Abstract
When individual-level health data is shared in biomedical research, the privacy of patients and probands must be protected. This is typically achieved with methods of data de-identification, which transform data in such a way that formal guarantees about the degree of protection from re-identification can be provided. In the process, it is important to minimize the loss of information to ensure that the resulting data remains useful. A typical use case is the creation of predictive models for knowledge discovery and decision support, e.g. to infer diagnoses or to predict outcomes of therapies. A variety of methods have been developed that can be used to build robust statistical classifiers from de-identified data. However, they have not been tuned for practical use, and they have not been implemented in mature software tools. To bridge this gap, we have extended ARX, an open-source anonymization tool for health data, with several new features. We have implemented a method for optimizing the suitability of de-identified data for building statistical classifiers and a method for assessing the performance of classifiers built from de-identified data. All methods are accessible via a comprehensive graphical user interface. We have used our implementation to create logistic regression models from a patient discharge dataset for predicting the costs of hospital stays. The results show that our approach enables the creation of privacy-preserving classifiers with optimal prediction accuracy.
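To illustrate the kind of assessment described in the abstract, the following minimal sketch compares a logistic regression model trained on exact values with one trained on generalized values. It is not the ARX implementation and does not use the ARX API. The synthetic data, the single `age` feature, the interval width of 10, and the helper names `generalize_age`, `scale`, `train_logistic` and `accuracy` are all assumptions made for illustration. For brevity, accuracy is measured on the training records; a realistic assessment would use held-out data or cross-validation.

```python
import math


def generalize_age(age, width):
    """Replace an exact age with the lower bound of its interval."""
    return (age // width) * width


def scale(age):
    """Center and scale ages so that plain gradient descent behaves well."""
    return (age - 54.5) / 35.0


def train_logistic(xs, ys, epochs=200, lr=0.1):
    """Fit a one-feature logistic regression with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def accuracy(model, xs, ys):
    """Fraction of records whose predicted class matches the label."""
    w, b = model
    hits = sum(((w * x + b) > 0) == bool(y) for x, y in zip(xs, ys))
    return hits / len(xs)


# Synthetic stand-in for a discharge dataset (assumption): one record per
# age from 20 to 89, labeled "high cost" (1) when the age is at least 55.
ages = list(range(20, 90))
labels = [1 if age >= 55 else 0 for age in ages]

# Features before and after generalizing age into 10-year intervals.
original = [scale(age) for age in ages]
generalized = [scale(generalize_age(age, 10)) for age in ages]

acc_original = accuracy(train_logistic(original, labels), original, labels)
acc_generalized = accuracy(train_logistic(generalized, labels), generalized, labels)

print("accuracy on original data:   ", acc_original)
print("accuracy on generalized data:", acc_generalized)
```

In this toy setting, the label threshold falls inside the 50-59 interval, so records on both sides of the threshold share one generalized value and cannot all be classified correctly. Choosing a transformation that keeps such class-relevant distinctions intact is the kind of optimization the abstract refers to.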