Large-Scale Hierarchical Causal Discovery via Weak Prior Knowledge

Impact factor 8.9; Zone 2 (Computer Science); JCR Q1 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

IEEE Transactions on Knowledge and Data Engineering Pub Date : 2025-02-04 DOI:10.1109/TKDE.2025.3537832

Xiangyu Wang;Taiyu Ban;Lyuzhou Chen;Derui Lyu;Qinrui Zhu;Huanhuan Chen

Volume 37, Issue 5, pages 2695-2711. Full text: https://ieeexplore.ieee.org/document/10872923/


Abstract

Causal discovery becomes difficult as the number of candidate hypotheses grows exponentially with the number of variables, and this complexity is particularly daunting for large variable sets. We introduce a novel divide-and-conquer method to address this challenge. Existing division strategies often rely on conditional independence (CI) tests or data-driven clustering to split variables; both can suffer from the data scarcity typical of large-scale settings, leading to inaccurate divisions. The proposed method overcomes this with a data-independent division strategy: it constructs a prior structure, informed by potential causal relationships identified using a Large Language Model (LLM), and uses it to guide the recursive division of variables into subsets. This approach avoids the impact of data insufficiency and is robust to potential incompleteness in the prior structure. In the merging phase, we adopt a score-based refinement strategy to address spurious causal links caused by hidden variables in subsets; it eliminates edges in the intersecting parts of subsets to optimize the score of local structures. While maintaining both correctness and completeness under the faithfulness assumption, this merging approach outperforms the conventional CI-test-based merging strategy in practical scenarios. Empirical evaluations on various large-scale datasets demonstrate the proposed approach's superior accuracy and efficiency compared to existing causal discovery methods.
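To make the growth of the hypothesis space concrete, the following sketch counts the labeled directed acyclic graphs (DAGs) on n variables using Robinson's recurrence. It is an illustration added here, not part of the paper: it assumes that "hypotheses" means candidate DAG structures, and the function name count_dags is hypothetical. Under that assumption the count grows even faster than a plain exponential.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes, via Robinson's recurrence.

    a(0) = 1
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k*(n-k)) * a(n-k)

    Assumption for illustration: each labeled DAG is one candidate
    causal structure (one "hypothesis").
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    # Print the size of the search space for a few small variable counts.
    for n in range(1, 6):
        print(n, count_dags(n))
```

For five variables there are already 29281 candidate structures, which is why an exhaustive search becomes infeasible long before the variable counts the abstract calls large-scale.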


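Source journal

IEEE Transactions on Knowledge and Data Engineering (Engineering & Technology - Engineering: Electrical and Electronic)

CiteScore: 11.70
Self-citation rate: 3.40%
Articles published: 515
Review time: 6 months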

About the journal: The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.