Simulated data for census-scale entity resolution research without privacy restrictions: a large-scale dataset generated by individual-based modeling.

Gates Open Research, vol. 8, p. 36. Pub Date: 2024-10-18. eCollection Date: 2024-01-01. DOI: 10.12688/gatesopenres.15418.2

Beatrix Haddock, Alix Pletcher, Nathaniel Blair-Stahn, Os Keyes, Matt Kappel, Steve Bachmeier, Syl Lutze, James Albright, Alison Bowman, Caroline Kinuthia, Zeb Burke-Conte, Rajan Mudambi, Abraham Flaxman

Abstract

Background: Entity resolution (ER) is the process of identifying and linking records that refer to the same real-world entity. ER is a fundamental challenge in data science. A common barrier to ER research and development is that the data fields used for this fuzzy matching are personally identifiable information, such as name, address, and date of birth. The necessary restrictions on accessing and sharing these authentic data have slowed work on developing, testing, and adopting new methods and software for ER. We recently released pseudopeople, a Python package that allows users to generate simulated datasets with configurable noise, approaching the scale and complexity of the data on which large organizations and federal agencies, such as the US Census Bureau, regularly perform ER. With pseudopeople, researchers can develop new algorithms and software for ER of US population data without needing access to personal and confidential information.

Methods: We created the simulated population data available for noising with pseudopeople using our Vivarium simulation platform. Our model simulates individuals and their families, households, and employment dynamics over time, which we observe through simulated censuses, surveys, and administrative data collection systems.

Results: Our simulation process produced over 900 gigabytes of simulated censuses, surveys, and administrative data for pseudopeople, representing hundreds of millions of simulants. A sample simulated population of thousands of simulants is now openly available to all users of the pseudopeople package, and large-scale simulated populations of millions and hundreds of millions of simulants are also available by online request through GitHub. These simulated population data are structured for use by the pseudopeople package, which includes additional affordances to add various kinds of noise to the data to provide realistic, sharable challenges for ER researchers.
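
To make the idea of noised records and fuzzy matching concrete, the following is a minimal, self-contained sketch using only the Python standard library. It does not use or reproduce the pseudopeople API. The function names (`add_typo_noise`, `similarity`, `link_records`), the record layout, the scoring rule, and the 0.8 threshold are all hypothetical choices made for illustration.

```python
import random
from difflib import SequenceMatcher

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def add_typo_noise(value, rate, rng):
    """Replace each letter with a random lowercase letter with probability `rate`.

    Hypothetical stand-in for configurable noise; not the pseudopeople implementation.
    """
    return "".join(
        rng.choice(LETTERS) if ch.isalpha() and rng.random() < rate else ch
        for ch in value
    )


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(left, right, threshold=0.8):
    """Map each left record id to the best-scoring right record id, or None.

    The score averages name similarity with an exact date-of-birth match
    (an illustrative rule, assumed here rather than taken from the paper).
    """
    links = {}
    for left_id, left_rec in left.items():
        best_id, best_score = None, 0.0
        for right_id, right_rec in right.items():
            dob_match = 1.0 if left_rec["dob"] == right_rec["dob"] else 0.0
            score = (similarity(left_rec["name"], right_rec["name"]) + dob_match) / 2
            if score >= threshold and score > best_score:
                best_id, best_score = right_id, score
        links[left_id] = best_id
    return links


# Two toy "data collection systems" observing overlapping people (made-up records).
census = {
    1: {"name": "Maria Gonzalez", "dob": "1984-03-12"},
    2: {"name": "John Smith", "dob": "1990-07-01"},
}
survey = {
    "a": {"name": "Maria Gonzales", "dob": "1984-03-12"},  # spelling variant
    "b": {"name": "Jon Smith", "dob": "1990-07-01"},       # dropped letter
    "c": {"name": "Priya Patel", "dob": "1975-11-30"},     # not in the census
}

links = link_records(survey, census)
print(links)

# Noise can be dialed up or down; a rate of 0.0 leaves the value unchanged.
rng = random.Random(0)
print(add_typo_noise("Maria Gonzalez", 0.2, rng))
```

In this toy setup, survey records "a" and "b" link to census records 1 and 2, and "c" stays unlinked. Because simulated data carry the true identities, links like these can be scored against ground truth without exposing real personal information.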
