Fully Synthetic Data for Complex Surveys.

IF 1.2; Zone 4, Mathematics; JCR Q3, SOCIAL SCIENCES, MATHEMATICAL METHODS

Survey Methodology Pub Date : 2024-01-01 Epub Date: 2024-12-20

Shirley Mathur, Yajuan Si, Jerome P Reiter

Survey Methodology, 50(2), pages 347-373. Journal Article. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11759325/pdf/


Abstract

When seeking to release public use files for confidential data, statistical agencies can generate fully synthetic data. We propose an approach for making fully synthetic data from surveys collected with complex sampling designs. Our approach adheres to the general strategy proposed by Rubin (1993). Specifically, we generate pseudo-populations by applying the weighted finite population Bayesian bootstrap to account for survey weights, take simple random samples from those pseudo-populations, estimate synthesis models using these simple random samples, and release simulated data drawn from the models as public use files. To facilitate variance estimation, we use the framework of multiple imputation with two data generation strategies. In the first, we generate multiple data sets from each simple random sample. In the second, we generate a single synthetic data set from each simple random sample. We present multiple imputation combining rules for each setting. We illustrate the repeated sampling properties of the combining rules via simulation studies, including comparisons with synthetic data generation based on pseudo-likelihood methods. We apply the proposed methods to a subset of data from the American Community Survey.
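The pipeline described in the abstract (pseudo-population, simple random sample, synthesis model, simulated release file) can be sketched in a few lines. What follows is a minimal toy illustration, not the authors' implementation. The function names `weighted_polya_pseudo_population` and `synthesize` are hypothetical, and the urn selection probabilities follow a commonly cited weighted Pólya urn form that should be checked against the paper. The synthesis model is a plug-in normal fit, chosen only for brevity, and the sketch assumes every weight is at least 1. The paper's combining rules are not reproduced here.

```python
import numpy as np


def weighted_polya_pseudo_population(weights, rng):
    """Return indices into the observed sample forming one pseudo-population.

    Assumption: a weighted Polya urn in the spirit of the weighted finite
    population Bayesian bootstrap; every weight is assumed to be >= 1.
    """
    w = np.asarray(weights, dtype=float)
    n = w.size
    N = int(round(w.sum()))          # pseudo-population size
    w = w * N / w.sum()              # rescale so the weights sum exactly to N
    n_extra = N - n                  # units to add beyond the observed sample
    ratio = n_extra / n
    counts = np.zeros(n)             # times each unit has been re-drawn so far
    for _ in range(n_extra):
        p = np.clip((w - 1.0) + counts * ratio, 0.0, None)
        i = rng.choice(n, p=p / p.sum())
        counts[i] += 1
    # every observed unit appears once, plus its urn draws
    return np.repeat(np.arange(n), 1 + counts.astype(int))


def synthesize(y, weights, n_pop_draws=3, n_syn=1, srs_size=None, seed=0):
    """Return n_pop_draws lists, each holding n_syn synthetic data sets.

    n_syn > 1 mimics the first strategy in the abstract (several synthetic
    data sets per simple random sample); n_syn = 1 mimics the second.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    srs_size = srs_size or y.size
    released = []
    for _ in range(n_pop_draws):
        idx = weighted_polya_pseudo_population(weights, rng)
        # simple random sample (without replacement) from the pseudo-population
        srs = rng.choice(y[idx], size=srs_size, replace=False)
        # toy synthesis model: plug-in normal fit (an assumption, not the paper's model)
        mu, sigma = srs.mean(), srs.std(ddof=1)
        released.append([rng.normal(mu, sigma, size=srs_size) for _ in range(n_syn)])
    return released


if __name__ == "__main__":
    y = np.arange(1, 21)
    weights = np.full(20, 5.0)       # each respondent represents 5 population units
    files = synthesize(y, weights, n_pop_draws=2, n_syn=3)
    print(len(files), len(files[0]), files[0][0].shape)
```

A fuller treatment would draw the synthesis model parameters from a posterior distribution rather than plugging in point estimates, and would apply the paper's combining rules across the released data sets to obtain variance estimates.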


Source journal

Survey Methodology (Mathematics: Statistics & Probability)

CiteScore

0.80

Self-citation rate

22.20%

Articles published

Review time

>12 weeks

About the journal: The journal publishes articles dealing with various aspects of statistical development relevant to a statistical agency, such as design issues in the context of practical constraints, use of different data sources and collection techniques, total survey error, survey evaluation, research in survey methodology, time series analysis, seasonal adjustment, demographic studies, data integration, estimation and data analysis methods, and general survey systems development. The emphasis is placed on the development and evaluation of specific methodologies as applied to data collection or the data themselves.