SMM: A data stream management system for knowledge discovery

2011 IEEE 27th International Conference on Data Engineering. Pub Date: 2011-04-11. DOI: 10.1109/ICDE.2011.5767879

Hetal Thakkar, N. Laptev, Hamid Mousavi, Barzan Mozafari, Vincenzo Russo, C. Zaniolo


Citations: 25

Abstract

The problem of supporting data mining applications proved to be difficult for database management systems, and it is now proving to be very challenging for data stream management systems (DSMSs), where the limitations of SQL are made even more severe by the requirements of continuous queries. The major technical advances that were achieved separately on DSMSs and on data stream mining algorithms have failed to converge and produce powerful data stream mining systems. Such systems, however, are essential since the traditional pull-based approach of cache mining is no longer applicable, and the push-based computing mode of data streams and their bursty traffic complicate application development. For instance, to write mining applications with quality of service (QoS) levels approaching those of DSMSs, a mining analyst would have to contend with many arduous tasks, such as support for data buffering, complex storage and retrieval methods, scheduling, fault-tolerance, synopsis-management, load shedding, and query optimization. Our Stream Mill Miner (SMM) system solves these problems by providing a data stream mining workbench that combines the ease of specifying high-level mining tasks, as in Weka, with the performance and QoS guarantees of a DSMS. This is accomplished in three main steps. The first is an open and extensible DSMS architecture where KDD queries can be easily expressed as user-defined aggregates (UDAs); our system combines that with the efficiency of synoptic data structures and mining-aware load shedding and optimizations. The second key component of SMM is its integrated library of fast mining algorithms that are light enough to be effective on data streams. The third advanced feature of SMM is a Mining Model Definition Language (MMDL) that allows users to define the flow of mining tasks, integrated with a simple box-and-arrow GUI, to shield the mining analyst from the complexities of lower-level queries. SMM is the first DSMS capable of online mining, and this paper describes its architecture, design, and performance on mining queries.
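
To make the idea of expressing a mining task as a user-defined aggregate more concrete, the following is a minimal Python sketch, not SMM's own query syntax. It assumes a UDA can be modeled as three callbacks (`initialize`, `iterate`, `terminate`) that keep only a small synopsis of counts while tuples are pushed one at a time. The class name `NaiveBayesUDA`, the helper `run_stream`, and the toy weather tuples are all hypothetical, and naive Bayes is chosen only as an example of an algorithm light enough for one-pass training.

```python
import math
from collections import defaultdict


class NaiveBayesUDA:
    """Hypothetical UDA-style one-pass trainer for a categorical naive Bayes model."""

    def initialize(self):
        # The synopsis: counts only, never the raw tuples.
        self.total = 0
        self.class_counts = defaultdict(int)
        self.feature_counts = defaultdict(int)  # (column, value, label) -> count
        self.values_seen = defaultdict(set)     # column -> distinct values

    def iterate(self, features, label):
        # Called once per arriving tuple (push-based processing).
        self.total += 1
        self.class_counts[label] += 1
        for col, value in enumerate(features):
            self.feature_counts[(col, value, label)] += 1
            self.values_seen[col].add(value)

    def terminate(self):
        # Emit a summary of the model: the class priors learned so far.
        return {label: n / self.total for label, n in self.class_counts.items()}

    def classify(self, features):
        # Score each class with log-probabilities and Laplace smoothing.
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / self.total)
            for col, value in enumerate(features):
                count = self.feature_counts.get((col, value, label), 0)
                k = max(len(self.values_seen[col]), 1)
                score += math.log((count + 1) / (n + k))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def run_stream(stream):
    """Feed (features, label) tuples to the aggregate one at a time."""
    uda = NaiveBayesUDA()
    uda.initialize()
    for features, label in stream:
        uda.iterate(features, label)
    return uda


# Toy training stream (made-up data, for illustration only).
training = [
    (("sunny", "hot"), "no"),
    (("rainy", "mild"), "yes"),
    (("sunny", "mild"), "yes"),
    (("rainy", "hot"), "no"),
]
model = run_stream(training)
print(model.terminate())                  # class priors
print(model.classify(("rainy", "mild")))  # predicted label for a new tuple
```

The point of the sketch is the shape of the computation: state is updated incrementally per tuple and stays small, which is what lets an aggregate of this kind run inside a continuous query. Buffering, scheduling, and load shedding, which the abstract assigns to the DSMS, are deliberately left out.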

