实值数据结果评估的最大熵模型

2011 IEEE 11th International Conference on Data Mining Pub Date : 2011-12-11 DOI:10.1109/ICDM.2011.98

Kleanthis-Nikolaos Kontonasios, Jilles Vreeken, T. D. Bie

{"title":"实值数据结果评估的最大熵模型","authors":"Kleanthis-Nikolaos Kontonasios, Jilles Vreeken, T. D. Bie","doi":"10.1109/ICDM.2011.98","DOIUrl":null,"url":null,"abstract":"Statistical assessment of the results of data mining is increasingly recognised as a core task in the knowledge discovery process. It is of key importance in practice, as results that might seem interesting at first glance can often be explained by well-known basic properties of the data. In pattern mining, for instance, such trivial results can be so overwhelming in number that filtering them out is a necessity in order to identify the truly interesting patterns. In this paper, we propose an approach for assessing results on real-valued rectangular databases. More specifically, using our analytical model we are able to statistically assess whether or not a discovered structure may be the trivial result of the row and column marginal distributions in the database. Our main approach is to use the Maximum Entropy principle to fit a background model to the data while respecting its marginal distributions. To find these distributions, we employ an MDL based histogram estimator, and we fit these in our model using efficient convex optimization techniques. Subsequently, our model can be used to calculate probabilities directly, as well as to efficiently sample data with the purpose of assessing results by means of empirical hypothesis testing. Notably, our approach is efficient, parameter-free, and naturally deals with missing values. As such, it represents a well-founded alternative to swap randomisation","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":"{\"title\":\"Maximum Entropy Modelling for Assessing Results on Real-Valued Data\",\"authors\":\"Kleanthis-Nikolaos Kontonasios, Jilles Vreeken, T. D. Bie\",\"doi\":\"10.1109/ICDM.2011.98\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Statistical assessment of the results of data mining is increasingly recognised as a core task in the knowledge discovery process. It is of key importance in practice, as results that might seem interesting at first glance can often be explained by well-known basic properties of the data. In pattern mining, for instance, such trivial results can be so overwhelming in number that filtering them out is a necessity in order to identify the truly interesting patterns. In this paper, we propose an approach for assessing results on real-valued rectangular databases. More specifically, using our analytical model we are able to statistically assess whether or not a discovered structure may be the trivial result of the row and column marginal distributions in the database. Our main approach is to use the Maximum Entropy principle to fit a background model to the data while respecting its marginal distributions. To find these distributions, we employ an MDL based histogram estimator, and we fit these in our model using efficient convex optimization techniques. Subsequently, our model can be used to calculate probabilities directly, as well as to efficiently sample data with the purpose of assessing results by means of empirical hypothesis testing. Notably, our approach is efficient, parameter-free, and naturally deals with missing values. As such, it represents a well-founded alternative to swap randomisation\",\"PeriodicalId\":106216,\"journal\":{\"name\":\"2011 IEEE 11th International Conference on Data Mining\",\"volume\":\"3 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-12-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"20\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 IEEE 11th International Conference on Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICDM.2011.98\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 IEEE 11th International Conference on Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDM.2011.98","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 20

摘要

数据挖掘结果的统计评估越来越被认为是知识发现过程中的一项核心任务。这在实践中非常重要，因为乍一看可能很有趣的结果通常可以用众所周知的数据基本属性来解释。例如，在模式挖掘中，这些微不足道的结果可能会大量出现，以至于为了识别真正有趣的模式，必须将它们过滤掉。在本文中，我们提出了一种评估实值矩形数据库结果的方法。更具体地说，使用我们的分析模型，我们能够统计地评估发现的结构是否可能是数据库中行和列边缘分布的平凡结果。我们的主要方法是使用最大熵原理来拟合数据的背景模型，同时尊重其边际分布。为了找到这些分布，我们采用了基于MDL的直方图估计器，并使用高效的凸优化技术将这些分布拟合到我们的模型中。随后，我们的模型可以直接用于计算概率，也可以通过实证假设检验有效地对数据进行抽样，以评估结果。值得注意的是，我们的方法是有效的，无参数的，并且自然地处理缺失值。因此，它代表了交换随机化的一种有充分根据的替代方案

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Maximum Entropy Modelling for Assessing Results on Real-Valued Data

Statistical assessment of the results of data mining is increasingly recognised as a core task in the knowledge discovery process. It is of key importance in practice, as results that might seem interesting at first glance can often be explained by well-known basic properties of the data. In pattern mining, for instance, such trivial results can be so overwhelming in number that filtering them out is a necessity in order to identify the truly interesting patterns. In this paper, we propose an approach for assessing results on real-valued rectangular databases. More specifically, using our analytical model we are able to statistically assess whether or not a discovered structure may be the trivial result of the row and column marginal distributions in the database. Our main approach is to use the Maximum Entropy principle to fit a background model to the data while respecting its marginal distributions. To find these distributions, we employ an MDL based histogram estimator, and we fit these in our model using efficient convex optimization techniques. Subsequently, our model can be used to calculate probabilities directly, as well as to efficiently sample data with the purpose of assessing results by means of empirical hypothesis testing. Notably, our approach is efficient, parameter-free, and naturally deals with missing values. As such, it represents a well-founded alternative to swap randomisation

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2011 IEEE 11th International Conference on Data Mining

自引率

0.00%

发文量