From Association Analysis to Causal Discovery
Jiuyong Li
MLSDA '13, published 2013-12-02
DOI: https://doi.org/10.1145/2542652.2542659
Citations: 0
Abstract
Association analysis is an important technique in data mining and has been widely used in many application areas [6]. However, associations found in data can be spurious and may not reflect the ‘true’ relationships between the variables under consideration. For example, it is easy for hundreds or thousands of association rules to be generated even from a small data set, but most of them could be spurious and have no practical meaning [11, 21, 22]. This has hindered the application of association analysis to real-world problems. While the development of efficient techniques for finding association patterns in data, especially in large data sets, is well underway, the problem of identifying non-spurious associations has become prominent. Causal relationships reflect the real data-generating mechanisms and indicate how the outcome would change when the cause is changed, so finding them has been the ultimate goal of many scientific explorations and social studies [18]. The gold standard for causal discovery is the randomised controlled trial (RCT) [4, 16]. However, an RCT is infeasible in many real-world applications, particularly in high-dimensional problems with a large number of potential causes. As part of the effort on causal discovery, statisticians have studied various methods for testing a hypothesised causal relationship based on observational data [16]. However, these methods are designed for validating a known candidate causal relationship, and they too are incapable of dealing with a large number of potential causes. Although an association between two variables does not always imply causation, it is well known that associations are indicators of causal relationships [7]. Therefore, a practical approach to causal discovery in large data sets could start with association analysis of the data. The question is then whether we can filter out associations that do not have causal indications. 
Note that this objective is different from that of mining interesting associations [9, 20] or discovering statistically sound associations [5, 21] because interestingness criteria do not measure causality and a test of statistical significance only determines if an association is due to random chance. We have integrated two statis-
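
The abstract's distinction between an association and a causal indication can be illustrated with a small worked example. The sketch below is not the paper's method. It uses hypothetical counts, chosen only for illustration, in which a third variable Z drives both X and Y. The pooled table shows a strong X-Y association, while within each stratum of Z the odds ratio is exactly 1. The names `odds_ratio`, `pool`, `stratum_z0`, and `stratum_z1` are assumptions introduced here.

```python
from typing import Dict, Tuple

# A 2x2 contingency table: keys are (x, y) values, values are counts.
Table = Dict[Tuple[int, int], int]


def odds_ratio(t: Table) -> float:
    """Odds ratio of X and Y: (n11 * n00) / (n10 * n01)."""
    return (t[(1, 1)] * t[(0, 0)]) / (t[(1, 0)] * t[(0, 1)])


def pool(*tables: Table) -> Table:
    """Sum several strata cell by cell to get the marginal table."""
    return {cell: sum(t[cell] for t in tables) for cell in tables[0]}


# Hypothetical counts (not taken from the paper).
# Within Z = 0, X and Y are both rare and independent of each other.
stratum_z0: Table = {(1, 1): 4, (1, 0): 16, (0, 1): 16, (0, 0): 64}
# Within Z = 1, X and Y are both common and again independent.
stratum_z1: Table = {(1, 1): 64, (1, 0): 16, (0, 1): 16, (0, 0): 4}

# Ignoring Z mixes the two strata and creates an X-Y association.
marginal: Table = pool(stratum_z0, stratum_z1)

if __name__ == "__main__":
    print("odds ratio, Z = 0:", odds_ratio(stratum_z0))
    print("odds ratio, Z = 1:", odds_ratio(stratum_z1))
    print("odds ratio, pooled:", odds_ratio(marginal))
```

In this constructed data the pooled association would pass a significance test on a large enough sample, yet it disappears once Z is held fixed, which is the kind of association one would want to filter out before drawing causal conclusions.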