Fast mining of massive tabular data via approximate distance computations

Graham Cormode, P. Indyk, Nick Koudas, S. Muthukrishnan
Proceedings 18th International Conference on Data Engineering
Published: 2002-08-07
DOI: 10.1109/ICDE.2002.994778 (https://doi.org/10.1109/ICDE.2002.994778)
Citations: 33

Abstract

Tabular data abound in many data stores: traditional relational databases store tables, and new applications also generate massive tabular datasets. We present methods for determining similar regions in massive tabular data. Our methods are for computing the "distance" between any two subregions of tabular data: they are approximate, but highly accurate as we prove mathematically, and they are fast, running in time nearly linear in the table size. Our methods are general since these distance computations can be applied to any mining or similarity algorithms that use L_p norms. A novelty of our distance computation procedures is that they work for any L_p norm, not only the traditional p = 2 or p = 1, but for all p ≤ 2; the choice of p, say fractional p, provides an interesting alternative similarity behavior! We use our algorithms in a detailed experimental study of the clustering patterns in real tabular data obtained from one of AT&T's data stores and show that our methods are substantially faster than straightforward methods while remaining highly accurate, and able to detect interesting patterns by varying the value of p.
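The abstract does not spell out how the approximate distances are obtained, so the following is only an illustrative sketch under stated assumptions, not the authors' procedure. It shows an exact L_p distance between two rectangular subregions of a table, and, for the p = 1 case only, a commonly used approximation: projecting each subregion onto random vectors with Cauchy (1-stable) entries and taking the median absolute difference of the projections. The function names (`lp_distance`, `subregion`, `cauchy_matrix`, `sketch`, `estimate_l1`), the toy table, and the sketch size `k` are all hypothetical choices made for this example.

```python
import math
import random
from statistics import median


def lp_distance(x, y, p):
    """Exact L_p distance (sum_i |x_i - y_i|^p)^(1/p), for any p > 0."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def subregion(table, top, left, height, width):
    """Flatten a rectangular subregion of a list-of-lists table into a vector."""
    return [table[r][c]
            for r in range(top, top + height)
            for c in range(left, left + width)]


def cauchy_matrix(k, n, seed=0):
    """k x n matrix of standard Cauchy variates (assumed projection choice)."""
    rng = random.Random(seed)
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]
            for _ in range(k)]


def sketch(vec, proj):
    """Project a length-n vector down to k numbers (one dot product per row)."""
    return [sum(r * v for r, v in zip(row, vec)) for row in proj]


def estimate_l1(sketch_x, sketch_y):
    """Estimate the L_1 distance as the median absolute sketch difference.

    Each sketch difference is distributed as ||x - y||_1 times a standard
    Cauchy variate, and the median of |Cauchy| is 1.
    """
    return median(abs(a - b) for a, b in zip(sketch_x, sketch_y))


# Toy 16 x 16 table with two 8 x 8 subregions (hypothetical data).
rng = random.Random(42)
table = [[rng.randint(0, 100) for _ in range(16)] for _ in range(16)]
a = subregion(table, 0, 0, 8, 8)
b = subregion(table, 8, 8, 8, 8)

proj = cauchy_matrix(k=2001, n=len(a), seed=1)
exact = lp_distance(a, b, 1)
approx = estimate_l1(sketch(a, proj), sketch(b, proj))
print(f"exact L1 = {exact:.1f}, sketched estimate = {approx:.1f}")

# Varying p changes which differences dominate the distance.
for p in (0.5, 1, 2):
    print(f"p = {p}: exact distance = {lp_distance(a, b, p):.1f}")
```

In this sketch the projection matrix is generated once and reused, so every subregion of the same shape can be compared through its k-number sketch rather than its full contents. Extending the idea to other p ≤ 2 would require variates from the corresponding p-stable distribution and a matching scale factor, which this example does not attempt.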