Evaluation of sampling for data mining of association rules

Mohammed J. Zaki, S. Parthasarathy, Wei Li, M. Ogihara
Venue: Proceedings Seventh International Workshop on Research Issues in Data Engineering. High Performance Database Management for Large-Scale Applications
Published: 1997-04-07
DOI: 10.1109/RIDE.1997.583696
URL: https://doi.org/10.1109/RIDE.1997.583696
Citations: 228

Abstract

The discovery of association rules is a prototypical problem in data mining. Current algorithms for mining association rules make repeated passes over the database to determine the frequently occurring itemsets (sets of items). For large databases, the I/O overhead of scanning the database can be extremely high. The authors show that random sampling of transactions in the database is an effective method for finding association rules. Sampling can speed up the mining process by more than an order of magnitude by reducing I/O costs and drastically shrinking the number of transactions to be considered. The sampled database may also be small enough to reside in main memory. Furthermore, the authors show that sampling can accurately represent the data patterns in the database with high confidence. They experimentally evaluate the effectiveness of sampling on different databases, and study the relationship between the performance, accuracy, and confidence of the chosen sample.
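
The idea in the abstract can be illustrated with a minimal Python sketch: draw a random sample of transactions, count itemset support on the sample only, and compare against the full database. This is not the authors' implementation. The synthetic data generator, the function names (`itemset_support`, `frequent_itemsets`, `make_toy_db`), the 10% sample size, and the 0.25 support threshold are all assumptions chosen for illustration, and the counting is brute force rather than an Apriori-style multi-pass algorithm.

```python
import random
from collections import Counter
from itertools import combinations


def itemset_support(transactions, max_size=2):
    """Return {itemset: fraction of transactions containing it}.

    Brute-force counting of all itemsets up to max_size items
    (illustrative only; not the algorithm used in the paper).
    """
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    n = len(transactions)
    return {itemset: c / n for itemset, c in counts.items()}


def frequent_itemsets(transactions, min_support, max_size=2):
    """Keep only itemsets whose support reaches min_support."""
    support = itemset_support(transactions, max_size)
    return {s: v for s, v in support.items() if v >= min_support}


def make_toy_db(n, rng):
    """Hypothetical synthetic database: bread and butter co-occur often."""
    db = []
    for _ in range(n):
        t = set()
        if rng.random() < 0.6:
            t.update(["bread", "butter"])
        if rng.random() < 0.3:
            t.add("milk")
        if rng.random() < 0.1:
            t.add("jam")
        db.append(frozenset(t))
    return db


rng = random.Random(0)                 # fixed seed for repeatability
db = make_toy_db(20000, rng)           # stands in for the full database
sample = rng.sample(db, 2000)          # 10% random sample, without replacement

full = frequent_itemsets(db, min_support=0.25)
approx = frequent_itemsets(sample, min_support=0.25)

# Largest support error over itemsets that are frequent in the full database.
max_error = max(abs(full[s] - approx.get(s, 0.0)) for s in full)
print("frequent in full database:", sorted(full))
print("frequent in sample:       ", sorted(approx))
print("max support error:", round(max_error, 4))
```

On this toy data the sample is scanned at one tenth of the cost of the full database. Itemsets whose true support lies close to the threshold are the ones most likely to be misclassified by a sample, which is the trade-off between sample size, accuracy, and confidence that the abstract refers to.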