Creating Synthetic Data for Complex Surveys Using the Research and Development Survey: A Comparison Study.

Q1 Medicine

NCHS data brief. Pub Date: 2025-04-01. DOI: 10.15620/cdc/174586

Guangyu Zhang, Yulei He, Anna Oganian, Bill Cai

NCHS data brief, volume 212, pages 1-10. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12336966/pdf/

Citations: 0

Abstract

Creating Synthetic Data for Complex Surveys Using the Research and Development Survey: A Comparison Study.

Background: Synthetic data has been gaining popularity in many fields as an approach to retain data utility (the validity of inference using synthetic data) and protect confidentiality. However, creating synthetic data for complex surveys remains a challenge.

Methods: This research compared three approaches to incorporate survey design information (stratification, clustering, and sampling weights) during the synthetic data-generating process using the Research and Development Survey (RANDS), a series of primarily web surveys conducted by the National Center for Health Statistics, Centers for Disease Control and Prevention. Both parametric (logistic and linear regression models) and nonparametric (classification and regression trees [CART]) methods were used to create synthetic data. Data utility and disclosure risk were evaluated via confidence interval overlap, propensity score measurement, and average matching probability for re-identification.
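The abstract names confidence interval overlap and a propensity score measure as utility metrics but does not give their formulas. The sketch below uses formulations that are common in the synthetic data literature and is only an illustration under that assumption; the function names `ci_overlap` and `pmse` are hypothetical, and the study's exact definitions (including any survey-weighted variants) may differ.

```python
def ci_overlap(orig_ci, syn_ci):
    """Average relative overlap of two confidence intervals.

    Each argument is a (lower, upper) pair. A value of 1.0 means the
    intervals coincide; values at or below 0 mean they do not overlap.
    Assumed formulation, not taken from the source.
    """
    lo_o, hi_o = orig_ci
    lo_s, hi_s = syn_ci
    if hi_o <= lo_o or hi_s <= lo_s:
        raise ValueError("each interval must have positive width")
    # Length of the intersection (negative when the intervals are disjoint).
    inter = min(hi_o, hi_s) - max(lo_o, lo_s)
    # Average the overlap relative to each interval's own width.
    return 0.5 * (inter / (hi_o - lo_o) + inter / (hi_s - lo_s))


def pmse(scores, n_synthetic):
    """Propensity score mean squared error.

    `scores` holds, for every record in the stacked original + synthetic
    file, the estimated probability that the record is synthetic.
    `n_synthetic` is the number of synthetic records in that file.
    Values near 0 suggest the two files are hard to tell apart.
    """
    n = len(scores)
    if n == 0:
        raise ValueError("scores must not be empty")
    expected = n_synthetic / n  # share of synthetic records in the stack
    return sum((p - expected) ** 2 for p in scores) / n


if __name__ == "__main__":
    print(ci_overlap((0.0, 2.0), (1.0, 3.0)))
    print(pmse([0.5, 0.5, 0.5, 0.5], n_synthetic=2))
```

Here the propensity scores are taken as given; in practice they would come from a classifier (for example a logistic regression or CART model) fitted to the stacked file.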

Results: Using the original survey design information as predictors during the synthesis process improved data utility for the parametric method. However, the nonparametric method yielded results with better data utility but slightly higher disclosure risk.

Source journal

NCHS data brief (Medicine: Medicine (all))

CiteScore

33.50

Self-citation rate

0.00%

Articles published