Creating Synthetic Data for Complex Surveys Using the Research and Development Survey: A Comparison Study

JCR category: Q1, Medicine
Guangyu Zhang, Yulei He, Anna Oganian, Bill Cai
NCHS data brief, vol. 212, pp. 1-10. Published 2025-04-01. Publication type: Journal Article. DOI: 10.15620/cdc/174586 (https://doi.org/10.15620/cdc/174586)
Citations: 0

Abstract

Background: Synthetic data has been gaining popularity in many fields as an approach to retain data utility (the validity of inference using synthetic data) and protect confidentiality. However, creating synthetic data for complex surveys remains a challenge.

Methods: This research compared three approaches to incorporate survey design information (stratification, clustering, and sampling weights) during the synthetic data-generating process using the Research and Development Survey (RANDS), a series of primarily web surveys conducted by the National Center for Health Statistics, Centers for Disease Control and Prevention. Both parametric (logistic and linear regression models) and nonparametric (classification and regression trees [CART]) methods were used to create synthetic data. Data utility and disclosure risk were evaluated via confidence interval overlap, propensity score measurement, and average matching probability for re-identification.

Results: Using the original survey design information as predictors during the synthesis process improved data utility for the parametric method. However, the nonparametric method yielded results with better data utility but slightly higher disclosure risk.
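
The abstract names confidence interval overlap as one of its utility measures but does not give a formula. The sketch below uses one commonly used definition as an assumption: the average, over the two intervals, of the fraction of each interval covered by their intersection. The function name `ci_overlap` and the example intervals are hypothetical and are not taken from the study.

```python
def ci_overlap(orig, synth):
    """Confidence interval overlap between an original-data interval and a
    synthetic-data interval, each given as a (lower, upper) tuple.

    Assumed definition (not stated in the abstract): the mean of the
    intersection length divided by each interval's own length.
    1.0 means identical intervals; values <= 0 mean the intervals do not overlap.
    """
    lo_o, hi_o = orig
    lo_s, hi_s = synth
    if hi_o <= lo_o or hi_s <= lo_s:
        raise ValueError("each interval must have positive width")
    # Length of the intersection; negative when the intervals are disjoint.
    inter = min(hi_o, hi_s) - max(lo_o, lo_s)
    return 0.5 * (inter / (hi_o - lo_o) + inter / (hi_s - lo_s))


if __name__ == "__main__":
    # Hypothetical 95% intervals for a prevalence estimate, in percent.
    print(ci_overlap((20.0, 30.0), (20.0, 30.0)))  # identical intervals
    print(ci_overlap((20.0, 30.0), (25.0, 35.0)))  # partial overlap
    print(ci_overlap((0.0, 1.0), (2.0, 3.0)))      # disjoint intervals
```

Under this definition, values close to 1 suggest that an analyst would draw similar conclusions from the synthetic and original data for that estimate.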

Source journal: NCHS data brief
Subject area: Medicine, Medicine (all)
CiteScore: 33.50
Self-citation rate: 0.00%
Articles published: 23