A National Synthetic Populations Dataset for the United States.

IF 6.9 | Zone 2, multidisciplinary journals | Q1 MULTIDISCIPLINARY SCIENCES

Scientific Data Pub Date : 2025-01-25 DOI:10.1038/s41597-025-04380-7

James Rineer, Nicholas Kruskamp, Caroline Kery, Kasey Jones, Rainer Hilscher, Georgiy Bobashev

Scientific Data, vol. 12, no. 1, p. 144. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11762717/pdf/

Citations: 0

Abstract

Geospatially explicit and statistically accurate person and household data allow researchers to study community- and neighborhood-level effects and to design and test hypotheses that would not be possible without synthetic data. In this article, we demonstrate the workflow for generating spatially explicit household- and individual-level synthetic populations for the United States representing the year 2019. We use publicly available U.S. Census American Community Survey (ACS) 5-year estimates from the 2015-2019 ACS. We use Iterative Proportional Fitting (IPF) to create our synthetic population and use the resulting joint counts to sample representative households and people directly from microdata. Our dataset contains records for 120,754,708 households and 303,128,287 individuals across the United States. We spatially allocate households using the Environmental Protection Agency (EPA) Integrated Climate and Land Use Scenarios (ICLUS) project household distribution estimates to create a spatially explicit dataset. Our validation shows strong correlation with original census variables, with many categories reporting a Pearson's r correlation coefficient greater than 0.99.
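As a rough illustration of the Iterative Proportional Fitting step named in the abstract, the sketch below fits a small two-way table to fixed row and column totals. It is a minimal example under stated assumptions, not the authors' pipeline: the function name `ipf`, the seed table, the two dimensions (household size by income band), and all numbers are hypothetical, and the later steps (sampling households from microdata and spatial allocation) are not shown.

```python
import numpy as np

def ipf(seed, row_targets, col_targets, tol=1e-9, max_iter=1000):
    """Scale a 2-D seed table until its margins match the target totals."""
    table = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        # Rescale each row so it sums to its target.
        table *= (row_targets / table.sum(axis=1))[:, None]
        # Rescale each column so it sums to its target.
        table *= col_targets / table.sum(axis=0)
        # Columns now match exactly; stop once the rows match as well.
        if np.abs(table.sum(axis=1) - row_targets).max() < tol:
            break
    return table

# Hypothetical seed counts (e.g. from a microdata sample):
# rows = household size class, columns = income band.
seed = np.array([[10.0, 20.0, 5.0],
                 [15.0, 10.0, 10.0]])
# Hypothetical marginal totals for one small area; both sum to 1000.
row_targets = np.array([400.0, 600.0])
col_targets = np.array([300.0, 450.0, 250.0])

fitted = ipf(seed, row_targets, col_targets)
print(fitted.round(2))
```

The fitted cells keep the association pattern of the seed while reproducing the published margins; in a workflow like the one described, such joint counts would then determine how many records of each type to draw from the microdata.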




Source journal

Scientific Data Social Sciences-Education

CiteScore: 11.20
Self-citation rate: 4.10%
Articles published: 689
Review time: 16 weeks

About the journal: Scientific Data is an open-access journal focused on data, publishing descriptions of research datasets and articles on data sharing across natural sciences, medicine, engineering, and social sciences. Its goal is to enhance the sharing and reuse of scientific data, encourage broader data sharing, and acknowledge those who share their data. The journal primarily publishes Data Descriptors, which offer detailed descriptions of research datasets, including data collection methods and technical analyses validating data quality. These descriptors aim to facilitate data reuse rather than to test hypotheses or present new interpretations, methods, or in-depth analyses.