Bayesian network Motifs for reasoning over heterogeneous unlinked datasets

IF 4.3 | Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Data Mining and Knowledge Discovery Pub Date : 2024-08-17 DOI:10.1007/s10618-024-01054-7

Yi Sui, Alex Kwan, Alexander W. Olson, Scott Sanner, Daniel A. Silver

Citations: 0

Abstract

Modern data-oriented applications often require integrating data from multiple heterogeneous sources. When these datasets share attributes, but are otherwise unlinked, there is no way to join them and reason at the individual level explicitly. However, as we show in this work, this does not prevent probabilistic reasoning over these heterogeneous datasets even when the data and shared attributes exhibit significant mismatches that are common in real-world data. Different datasets have different sample biases, disagree on category definitions and spatial representations, collect data at different temporal intervals, and mix aggregate-level with individual data. In this work, we demonstrate how a set of Bayesian network motifs allows all of these mismatches to be resolved in a composable framework that permits joint probabilistic reasoning over all datasets without manipulating, modifying, or imputing the original data, thus avoiding potentially harmful assumptions. We provide an open source Python tool that encapsulates our methodology and demonstrate this tool on a number of real-world use cases.
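As a rough illustration of why unlinked datasets with a shared attribute can still support joint probabilistic reasoning, the sketch below chains two conditional distributions through that shared attribute. It is not the authors' tool or one of their motifs: the datasets, the attribute names (`region`, `mode`, `asthma`), the helper `conditional`, and the conditional-independence assumption (`mode <- region -> asthma`) are all hypothetical and chosen only to keep the example small.

```python
from collections import Counter, defaultdict


def conditional(rows, given, target):
    """Estimate P(target | given) from a list of dict records by counting."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[given]][row[target]] += 1
    return {
        g: {t: n / sum(c.values()) for t, n in c.items()}
        for g, c in counts.items()
    }


# Two hypothetical datasets about different individuals.
# They cannot be joined row by row; they share only the attribute "region".
survey = [
    {"region": "north", "mode": "car"},
    {"region": "north", "mode": "car"},
    {"region": "north", "mode": "bus"},
    {"region": "south", "mode": "bus"},
]
clinic = [
    {"region": "north", "asthma": True},
    {"region": "north", "asthma": False},
    {"region": "south", "asthma": False},
    {"region": "south", "asthma": False},
]

# Each conditional table is estimated from one dataset only.
p_region_given_mode = conditional(survey, "mode", "region")
p_asthma_given_region = conditional(clinic, "region", "asthma")


def p_asthma_given_mode(mode):
    """P(asthma=True | mode), marginalising over the shared attribute.

    Illustrative assumption: asthma is conditionally independent of
    mode given region, so
    P(asthma | mode) = sum_r P(region=r | mode) * P(asthma | region=r).
    """
    return sum(
        p_r * p_asthma_given_region[r].get(True, 0.0)
        for r, p_r in p_region_given_mode[mode].items()
    )


if __name__ == "__main__":
    for m in ("car", "bus"):
        print(m, p_asthma_given_mode(m))
```

Neither dataset is modified or imputed here; only the estimated conditional tables are combined. Real data would also need the mismatches listed in the abstract (sample bias, differing category definitions, aggregate versus individual records) to be handled, which this sketch does not attempt.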


Source journal

Data Mining and Knowledge Discovery (Engineering and Technology - Computer Science: Artificial Intelligence)

CiteScore

10.40

Self-citation rate

4.20%

Articles published

Review time

10 months

About the journal: Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high-performance computing.