Discovering causal structures in corrupted data: frugality in anchored Gaussian DAG models

IF 1.6; Zone 3 (Mathematics); Q3, COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Computational Statistics & Data Analysis Pub Date : 2025-08-18 DOI:10.1016/j.csda.2025.108267

Joonho Shin, Junhyoung Chung, Seyong Hwang, Gunwoong Park

Volume 213, Article 108267. https://www.sciencedirect.com/science/article/pii/S0167947325001434

Citations: 0

Abstract

This study addresses the recovery of anchored Gaussian directed acyclic graphical (DAG) models, with the goal of discovering causal or directed relationships among variables in datasets that are either intentionally masked or contaminated by measurement error. One main contribution is to relax the existing restrictive identifiability conditions for anchored Gaussian DAG models by introducing the anchored-frugality assumption. This assumption posits that the true graph is the most frugal among those consistent with the possible distributions of the latent and observed variables, which makes the true Markov equivalence class (MEC) identifiable. The validity of the anchored-frugality assumption is justified using both graph theory and probability theory. Another main contribution is the development of the anchored-SP and frugal-PC algorithms. Specifically, the anchored-SP algorithm finds the most frugal graph among all graphs satisfying the Markov condition, whereas the frugal-PC algorithm finds the most frugal graph among only a subset of graphs. Hence, the frugal-PC algorithm is more computationally feasible, but it requires an additional frugality-faithfulness assumption for soundness. Various simulations support the theoretical findings of this study and demonstrate the practical effectiveness of the proposed algorithm against state-of-the-art algorithms such as ACI, PC, and MMHC. Furthermore, applications of the proposed algorithm to protein signaling data and breast cancer data illustrate its effectiveness in uncovering relationships among proteins and among cancer-related cell nuclei characteristics.
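
The abstract does not spell out the steps of anchored-SP or frugal-PC, so the following is only a rough illustration of the frugality principle they rely on. It sketches the classical brute-force sparsest-permutation search on a known, uncorrupted Gaussian covariance matrix: for every variable ordering it builds the minimal DAG that is Markov to the distribution and keeps the one with the fewest edges. It is not the paper's anchored-SP algorithm and does not handle masked or contaminated data. The function names, the tolerance `tol`, and the three-variable chain example are all assumptions made for illustration.

```python
import itertools
import numpy as np


def minimal_imap_edges(cov, order, tol=1e-8):
    """Edges (parent, child) of the minimal DAG consistent with `order`.

    For a Gaussian distribution, a predecessor is a parent of `child`
    exactly when its population regression coefficient is nonzero when
    `child` is regressed on all of its predecessors in the ordering.
    """
    edges = set()
    for pos in range(1, len(order)):
        child = order[pos]
        preds = list(order[:pos])
        # Population regression coefficients of child on its predecessors.
        coef = np.linalg.solve(cov[np.ix_(preds, preds)], cov[preds, child])
        for parent, b in zip(preds, coef):
            if abs(b) > tol:  # hypothetical threshold for "nonzero"
                edges.add((parent, child))
    return edges


def sparsest_permutation(cov, tol=1e-8):
    """Exhaustive search over orderings; returns the most frugal edge set."""
    p = cov.shape[0]
    best = None
    for order in itertools.permutations(range(p)):
        edges = minimal_imap_edges(cov, order, tol)
        if best is None or len(edges) < len(best):
            best = edges
    return best


# Illustrative chain X0 -> X1 -> X2 with unit-variance noise.
B = np.zeros((3, 3))
B[1, 0] = 0.8  # X1 = 0.8 * X0 + e1
B[2, 1] = 0.8  # X2 = 0.8 * X1 + e2
A = np.linalg.inv(np.eye(3) - B)
cov = A @ A.T  # implied population covariance

best = sparsest_permutation(cov)
skeleton = {frozenset(edge) for edge in best}
print(sorted(tuple(sorted(edge)) for edge in skeleton))
```

The search visits all p! orderings, so it is practical only for a handful of variables. This cost is the usual motivation for restricted searches of the kind the abstract attributes to frugal-PC. The returned graph is one member of the Markov equivalence class, so only its skeleton and v-structures should be interpreted.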

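Source journal

Computational Statistics & Data Analysis (Mathematics; Computer Science, Interdisciplinary Applications)

CiteScore: 3.70

Self-citation rate: 5.60%

Articles published: 167

Review time: 60 days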

About the journal: Computational Statistics and Data Analysis (CSDA), an Official Publication of the network Computational and Methodological Statistics (CMStatistics) and of the International Association for Statistical Computing (IASC), is an international journal dedicated to the dissemination of methodological research and applications in the areas of computational statistics and data analysis. The journal consists of four refereed sections which are divided into the following subject areas: I) Computational Statistics - Manuscripts dealing with: 1) the explicit impact of computers on statistical methodology (e.g., Bayesian computing, bioinformatics, computer graphics, computer intensive inferential methods, data exploration, data mining, expert systems, heuristics, knowledge based systems, machine learning, neural networks, numerical and optimization methods, parallel computing, statistical databases, statistical systems), and 2) the development, evaluation and validation of statistical software and algorithms. Software and algorithms can be submitted with manuscripts and will be stored together with the online article. II) Statistical Methodology for Data Analysis - Manuscripts dealing with novel and original data analytical strategies and methodologies applied in biostatistics (design and analytic methods for clinical trials, epidemiological studies, statistical genetics, or genetic/environmental interactions), chemometrics, classification, data exploration, density estimation, design of experiments, environmetrics, education, image analysis, marketing, model free data exploration, pattern recognition, psychometrics, statistical physics, image processing, robust procedures. [...] III) Special Applications - [...] IV) Annals of Statistical Data Science [...]