Fast mining of massive tabular data via approximate distance computations

Proceedings 18th International Conference on Data Engineering | Pub Date: 2002-08-07 | DOI: 10.1109/ICDE.2002.994778

Graham Cormode, P. Indyk, Nick Koudas, S. Muthukrishnan


Citations: 33

Abstract

Tabular data abound in many data stores: traditional relational databases store tables, and new applications also generate massive tabular datasets. We present methods for determining similar regions in massive tabular data. Our methods are for computing the "distance" between any two subregions of tabular data: they are approximate, but highly accurate as we prove mathematically, and they are fast, running in time nearly linear in the table size. Our methods are general since these distance computations can be applied to any mining or similarity algorithms that use L_p norms. A novelty of our distance computation procedures is that they work for any L_p norms, not only the traditional p = 2 or p = 1, but for all p ≤ 2; the choice of p, say fractional p, provides an interesting alternative similarity behavior! We use our algorithms in a detailed experimental study of the clustering patterns in real tabular data obtained from one of AT&T's data stores and show that our methods are substantially faster than straightforward methods while remaining highly accurate, and able to detect interesting patterns by varying the value of p.
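The abstract does not spell out the construction, so the following is only a minimal sketch of one standard way to approximate L_p distances for 0 < p ≤ 2: random projections with entries drawn from a symmetric p-stable distribution, followed by a median estimator. Whether this matches the paper's procedure in detail is an assumption. All function names (`stable_sample`, `make_projection`, `sketch`, `median_abs_stable`, `approx_lp_distance`) are hypothetical, and a table subregion is assumed to be flattened into a plain list of numbers.

```python
import math
import random
from statistics import median


def stable_sample(p, rng):
    """Draw one value from a symmetric p-stable distribution, 0 < p <= 2,
    using the Chambers-Mallows-Stuck formula."""
    theta = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    if p == 1.0:
        return math.tan(theta)  # p = 1 is the Cauchy distribution
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))


def make_projection(n, k, p, seed=0):
    """k x n matrix of p-stable entries, shared by every vector sketched."""
    rng = random.Random(seed)
    return [[stable_sample(p, rng) for _ in range(n)] for _ in range(k)]


def sketch(vec, projection):
    """Compress a length-n vector into k dot products."""
    return [sum(r * v for r, v in zip(row, vec)) for row in projection]


def median_abs_stable(p, samples=20001, seed=1):
    """Monte Carlo estimate of median(|X|) for X p-stable; used as the
    normalising constant of the median estimator."""
    rng = random.Random(seed)
    return median(abs(stable_sample(p, rng)) for _ in range(samples))


def approx_lp_distance(sk_x, sk_y, scale):
    """By stability, each sketch difference is distributed as
    ||x - y||_p * X, so its median absolute value, divided by
    median(|X|), estimates the L_p distance."""
    return median(abs(a - b) for a, b in zip(sk_x, sk_y)) / scale


def exact_lp_distance(x, y, p):
    """Straightforward L_p distance, for comparison."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


if __name__ == "__main__":
    data_rng = random.Random(123)
    x = [data_rng.uniform(0, 10) for _ in range(20)]
    y = [data_rng.uniform(0, 10) for _ in range(20)]
    for p in (0.5, 1.0, 2.0):
        proj = make_projection(len(x), 1000, p)
        scale = median_abs_stable(p)
        est = approx_lp_distance(sketch(x, proj), sketch(y, proj), scale)
        print(p, exact_lp_distance(x, y, p), est)
```

In this toy run the sketches are longer than the vectors, so nothing is saved; the approach pays off when the vectors are far longer than the sketch length k and each sketch is computed once and reused across many pairwise comparisons. Accuracy improves as k grows, at the cost of more work per sketch.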
