Bounding the family-wise error rate in local causal discovery using Rademacher averages
Authors: Dario Simionato, Fabio Vandin
Journal: Data Mining and Knowledge Discovery (Impact Factor 2.8; JCR Q2, Computer Science, Artificial Intelligence)
Published: 2024-09-09
DOI: https://doi.org/10.1007/s10618-024-01069-0
Abstract
Many algorithms have been proposed to learn local graphical structures around target variables of interest from observational data, focusing on two sets of variables. The first, called the Parent–Children (PC) set, contains all the variables that are direct causes or consequences of the target, while the second, known as the Markov boundary (MB), is the minimal set of variables with optimal prediction performance for the target. In this paper we introduce two novel algorithms for the PC and MB discovery tasks with rigorous guarantees on the Family-Wise Error Rate (FWER), that is, the probability of reporting any false positive in the output. Our algorithms use Rademacher averages, a key concept from statistical learning theory, to properly account for the multiple-hypothesis testing problem arising in such tasks. Our evaluation on simulated data shows that our algorithms properly control the FWER, while widely used algorithms do not provide guarantees on false discoveries even when correcting for multiple-hypothesis testing. Our experiments also show that our algorithms identify meaningful relations in real-world data.
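As a rough illustration of the quantity the abstract refers to, the empirical Rademacher average of a finite family of functions can be estimated by Monte Carlo sampling of random signs. The sketch below is not the authors' algorithm; the function name `empirical_rademacher_average`, its parameters, and the input layout are assumptions made here for illustration only.

```python
import random


def empirical_rademacher_average(values, n_draws=1000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher average.

    Hypothetical helper for illustration, not taken from the paper.
    values[j][i] is the value of the j-th function on the i-th sample point.
    Estimates E_sigma[ max_j (1/n) * sum_i sigma_i * values[j][i] ],
    where each sigma_i is +1 or -1 with equal probability.
    """
    rng = random.Random(seed)
    n = len(values[0])
    total = 0.0
    for _ in range(n_draws):
        # One vector of independent random signs per Monte Carlo draw.
        sigma = [rng.choice((-1, 1)) for _ in range(n)]
        # Supremum over the (finite) family of the sign-weighted sample mean.
        total += max(
            sum(s * v for s, v in zip(sigma, row)) / n for row in values
        )
    return total / n_draws
```

A larger family can only increase the estimate for a fixed set of sign draws, which is the sense in which this quantity reflects how many hypotheses are being examined at once.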
About the journal:
Advances in data gathering, storage, and distribution have created a need for computational tools and techniques to aid in data analysis. Data Mining and Knowledge Discovery in Databases (KDD) is a rapidly growing area of research and application that builds on techniques and theories from many fields, including statistics, databases, pattern recognition and learning, data visualization, uncertainty modelling, data warehousing and OLAP, optimization, and high performance computing.