Title: To Link or Synthesize? An Approach to Data Quality Comparison
Authors: Duncan Smith, M. Elliot, J. Sakshaug
Journal: ACM Journal of Data and Information Quality
Volume: 58 1; Pages: 1 - 20
Publication date: 2023-02-21
Publication type: Journal Article
DOI: https://doi.org/10.1145/3580487
Impact factor: 1.5 (JCR Q3, Computer Science, Information Systems)
Citation count: 1
Abstract
Linking administrative data to produce more informative data for subsequent analysis has become increasingly common. However, there might be concomitant risks of disclosing sensitive information about individuals. One practice that reduces these risks is data synthesis, in which the original data are used to fit a model from which synthetic data are generated and then released to end users. In some scenarios, an end user might have the option of using linked data or accepting synthesized data. However, both linkage and synthesis are susceptible to errors that could limit their usefulness. Here, we investigate the problem of comparing the quality of linked data with that of synthesized data and demonstrate through simulations how the problem might be approached. Such comparisons are important when considering how an end user can be supplied with the highest-quality data and in situations where one must consider risk/utility tradeoffs.
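The kind of simulation the abstract describes can be illustrated with a small sketch. It is not the authors' method: the linear model, the way linkage error is mimicked (permuting the outcome among a random fraction of records), the slope as the quality measure, and all function names (fit_line, synthesize, link_with_errors, compare) are assumptions chosen only for illustration.

```python
import random
import statistics


def fit_line(pairs):
    """Least-squares intercept, slope, and residual standard deviation."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in pairs]
    sd = (sum(r * r for r in resid) / max(len(pairs) - 2, 1)) ** 0.5
    return intercept, slope, sd


def synthesize(pairs, n, rng):
    """Fit a model to the data, then draw n synthetic records from it."""
    intercept, slope, sd = fit_line(pairs)
    xs = [x for x, _ in pairs]
    mx, sx = statistics.fmean(xs), statistics.pstdev(xs)
    out = []
    for _ in range(n):
        x = rng.gauss(mx, sx)  # assumed Gaussian model for x
        out.append((x, intercept + slope * x + rng.gauss(0.0, sd)))
    return out


def link_with_errors(pairs, error_rate, rng):
    """Mimic false matches by permuting y among a random fraction of records."""
    idx = [i for i in range(len(pairs)) if rng.random() < error_rate]
    shuffled = idx[:]
    rng.shuffle(shuffled)
    linked = list(pairs)
    for i, j in zip(idx, shuffled):
        linked[i] = (pairs[i][0], pairs[j][1])
    return linked


def compare(n=2000, error_rate=0.3, seed=0):
    """Absolute error of the slope estimate under linkage error vs. synthesis."""
    rng = random.Random(seed)
    truth = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        truth.append((x, 1.0 + 2.0 * x + rng.gauss(0.0, 1.0)))
    true_slope = fit_line(truth)[1]
    linked_slope = fit_line(link_with_errors(truth, error_rate, rng))[1]
    synthetic_slope = fit_line(synthesize(truth, n, rng))[1]
    return {
        "linked_abs_error": abs(linked_slope - true_slope),
        "synthetic_abs_error": abs(synthetic_slope - true_slope),
    }
```

Calling compare() with different values of error_rate gives a rough sense of how the comparison shifts as linkage quality changes. In this sketch the synthesis model is correctly specified, which favors the synthetic data; a misspecified model would introduce synthesis error of its own.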