Evaluation of sampling for data mining of association rules

Proceedings Seventh International Workshop on Research Issues in Data Engineering. High Performance Database Management for Large-Scale Applications. Pub Date: 1997-04-07. DOI: 10.1109/RIDE.1997.583696

Mohammed J. Zaki, S. Parthasarathy, Wei Li, M. Ogihara


Citations: 228

Abstract

The discovery of association rules is a prototypical problem in data mining. Current algorithms for mining association rules make repeated passes over the database to determine the frequently occurring itemsets (sets of items). For large databases, the I/O overhead of scanning the database can be extremely high. The authors show that random sampling of transactions in the database is an effective method for finding association rules. Sampling can speed up the mining process by more than an order of magnitude by reducing I/O costs and drastically shrinking the number of transactions to be considered. The sampled database may also be small enough to reside in main memory. Furthermore, they show that sampling can accurately represent the data patterns in the database with high confidence. They experimentally evaluate the effectiveness of sampling on different databases, and study the relationship between the performance, accuracy, and confidence of the chosen sample.
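The idea of mining a random sample instead of the full database can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, itemsets are counted by brute force rather than with the algorithms the authors evaluate, and the sample size comes from Hoeffding's inequality, a standard bound that may differ from the one used in the paper.

```python
import math
import random
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: relative support} for itemsets of up to max_size items.

    Brute-force counting, for illustration only (hypothetical helper).
    """
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_size + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    n = len(transactions)
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


def hoeffding_sample_size(epsilon, delta):
    """Sample size n with 2 * exp(-2 * n * epsilon**2) <= delta.

    Assumption: transactions are drawn independently with replacement, so the
    sampled support of ONE fixed itemset is within epsilon of its true support
    with probability at least 1 - delta.
    """
    return math.ceil(math.log(2 / delta) / (2 * epsilon ** 2))


def sample_transactions(transactions, n, seed=0):
    """Draw n transactions uniformly at random, with replacement."""
    rng = random.Random(seed)
    return [rng.choice(transactions) for _ in range(n)]


if __name__ == "__main__":
    # Synthetic database: "a" and "b" co-occur in about 30% of transactions.
    rng = random.Random(1)
    database = []
    for _ in range(200_000):
        t = ["a", "b"] if rng.random() < 0.3 else [rng.choice("cde")]
        database.append(t)

    n = hoeffding_sample_size(epsilon=0.01, delta=0.05)
    sample = sample_transactions(database, n)

    full = frequent_itemsets(database, min_support=0.25)
    approx = frequent_itemsets(sample, min_support=0.25)
    print("sample size:", n)
    print("support of (a, b), full vs. sample:",
          full.get(("a", "b")), approx.get(("a", "b")))
```

The sample size depends only on the error tolerance and the confidence level, not on the size of the database, which is why a sample can be far smaller than the original data and may fit in main memory. The bound covers a single itemset; guaranteeing accuracy for many itemsets at once would need a smaller `delta` per itemset.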
