Discovering causal structures in corrupted data: frugality in anchored Gaussian DAG models
Authors: Joonho Shin, Junhyoung Chung, Seyong Hwang, Gunwoong Park
Journal: Computational Statistics & Data Analysis, volume 213, Article 108267
Published: 2025-08-18
DOI: 10.1016/j.csda.2025.108267
URL: https://www.sciencedirect.com/science/article/pii/S0167947325001434
Journal impact factor: 1.6 (JCR Q3, Computer Science, Interdisciplinary Applications)
Citations: 0
Abstract
This study focuses on recovering anchored Gaussian directed acyclic graphical (DAG) models, addressing the challenge of discovering causal or directed relationships among variables in datasets that are either intentionally masked or contaminated by measurement errors. One main contribution is to relax the existing restrictive identifiability conditions for anchored Gaussian DAG models by introducing the anchored-frugality assumption. This assumption posits that the true graph is the most frugal among those satisfying the possible distributions of the latent and observed variables, thereby making the true Markov equivalence class (MEC) identifiable. The validity of the anchored-frugality assumption is justified using both graph theory and probability theory. Another main contribution is the development of the anchored-SP and frugal-PC algorithms. Specifically, the anchored-SP algorithm finds the most frugal graph among all possible graphs satisfying the Markov condition, whereas the frugal-PC algorithm finds the most frugal graph among a restricted set of candidate graphs. Hence, the frugal-PC algorithm is more computationally feasible, but it requires an additional frugality-faithfulness assumption for soundness. Various simulations support the theoretical findings of this study and demonstrate the practical effectiveness of the proposed algorithm against state-of-the-art algorithms such as ACI, PC, and MMHC. Furthermore, applications of the proposed algorithm to protein signaling data and breast cancer data illustrate its effectiveness in uncovering relationships among proteins and among cancer-related cell nuclei characteristics.
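The abstract does not spell out how the anchored-SP algorithm works, so the following is only a minimal sketch of the general frugality idea it builds on: among all DAGs that are Markov to a distribution, prefer the one with the fewest edges. It assumes the simpler uncorrupted setting (no latent variables or measurement error) and a known population covariance matrix, and it uses a brute-force search over variable orderings in the style of the classical sparsest-permutation approach. The function names `dag_for_order` and `sparsest_permutation`, the tolerance `tol`, and the three-variable chain are illustrative assumptions, not taken from the paper.

```python
import itertools
import numpy as np


def dag_for_order(cov, order, tol=1e-8):
    """Edges of the DAG obtained by regressing each variable on its
    predecessors in `order` (population regression from `cov`)."""
    edges = set()
    for pos, child in enumerate(order):
        preds = list(order[:pos])
        if not preds:
            continue
        # Regression coefficients of `child` on all earlier variables.
        coef = np.linalg.solve(cov[np.ix_(preds, preds)], cov[preds, child])
        for parent, b in zip(preds, coef):
            if abs(b) > tol:  # a nonzero coefficient means an edge
                edges.add((parent, child))
    return edges


def sparsest_permutation(cov, tol=1e-8):
    """Brute-force search for an ordering whose DAG has the fewest edges.
    Only feasible for a handful of variables (p! orderings)."""
    best_order, best_edges = None, None
    for order in itertools.permutations(range(cov.shape[0])):
        edges = dag_for_order(cov, order, tol)
        if best_edges is None or len(edges) < len(best_edges):
            best_order, best_edges = order, edges
    return best_order, best_edges


# Hypothetical example: chain X0 -> X1 -> X2 with unit noise variances.
B = np.zeros((3, 3))
B[1, 0] = 0.8  # X1 = 0.8 * X0 + e1
B[2, 1] = 0.5  # X2 = 0.5 * X1 + e2
A = np.linalg.inv(np.eye(3) - B)
cov = A @ A.T  # covariance implied by the linear Gaussian model

order, edges = sparsest_permutation(cov)
skeleton = {frozenset(e) for e in edges}
print(order, sorted(edges))
```

Several orderings attain the minimum edge count in this example, which reflects the abstract's point that frugality identifies a Markov equivalence class and not a single DAG. The exhaustive search is exponential in the number of variables, which is consistent with the abstract's remark that searching a restricted set of graphs is more computationally feasible.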
About the journal:
Computational Statistics and Data Analysis (CSDA), an Official Publication of the network Computational and Methodological Statistics (CMStatistics) and of the International Association for Statistical Computing (IASC), is an international journal dedicated to the dissemination of methodological research and applications in the areas of computational statistics and data analysis. The journal consists of four refereed sections which are divided into the following subject areas:
I) Computational Statistics - Manuscripts dealing with: 1) the explicit impact of computers on statistical methodology (e.g., Bayesian computing, bioinformatics, computer graphics, computer-intensive inferential methods, data exploration, data mining, expert systems, heuristics, knowledge-based systems, machine learning, neural networks, numerical and optimization methods, parallel computing, statistical databases, statistical systems), and 2) the development, evaluation and validation of statistical software and algorithms. Software and algorithms can be submitted with manuscripts and will be stored together with the online article.
II) Statistical Methodology for Data Analysis - Manuscripts dealing with novel and original data analytical strategies and methodologies applied in biostatistics (design and analytic methods for clinical trials, epidemiological studies, statistical genetics, or genetic/environmental interactions), chemometrics, classification, data exploration, density estimation, design of experiments, environmetrics, education, image analysis, marketing, model free data exploration, pattern recognition, psychometrics, statistical physics, image processing, robust procedures.
[...]
III) Special Applications - [...]
IV) Annals of Statistical Data Science [...]