Mining deviants in time series data streams

Proceedings. 16th International Conference on Scientific and Statistical Database Management, 2004. Pub Date: 2004-06-21. DOI: 10.1109/SSDBM.2004.51

S. Muthukrishnan, R. Shah, J. Vitter


Citations: 56

Abstract

One of the central tasks in managing, monitoring and mining data streams is identifying outliers. There is a long history of study of various outliers in statistics and databases, and a recent focus on mining outliers in data streams. Here, we adopt the notion of "deviants" from Jagadish et al. (1999) as outliers. Deviants are based on one of the most fundamental statistical concepts, the standard deviation (or variance). Formally, deviants are defined based on a representation sparsity metric, i.e., deviants are values whose removal from the dataset leads to an improved compressed representation of the remaining items. Thus, deviants are not global maxima/minima, but rather appropriate local aberrations. Deviants are known to be of great mining value in time series databases. We present the first known algorithms for identifying deviants on massive data streams. Our algorithms monitor streams using very small space (polylogarithmic in data size) and are able to quickly find deviants at any instant, as the data stream evolves over time. For all versions of this problem - uni- vs multivariate time series, optimal vs near-optimal vs heuristic solutions, offline vs streaming - our algorithms have the same framework of maintaining a hierarchical set of candidate deviants that is updated as the time series data is progressively revealed. We show experimentally, using real network traffic data (SNMP aggregate time series) as well as synthetic data, that our algorithm is remarkably accurate in determining the deviants.
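To make the definition concrete, the following is a minimal offline sketch, not the streaming algorithm of the paper. It assumes a simplified compressed representation in which the series is cut into fixed-width buckets and each bucket is summarized by the mean of its retained points; the error is the total sum of squared deviations from those means. The function names, the `bucket_width` parameter, and the sample series are hypothetical and chosen only for illustration. The search is brute force over all index subsets of size `k`, so it is practical only for tiny inputs.

```python
from itertools import combinations


def sse(values):
    """Sum of squared errors of values around their mean (0.0 for empty input)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def representation_error(series, removed, bucket_width):
    """Total SSE when each fixed-width bucket is summarized by the mean
    of the points that were not removed (assumed representation)."""
    total = 0.0
    for start in range(0, len(series), bucket_width):
        end = min(start + bucket_width, len(series))
        kept = [series[i] for i in range(start, end) if i not in removed]
        total += sse(kept)
    return total


def brute_force_deviants(series, k, bucket_width):
    """Return the k indices whose removal minimizes the representation error.
    Exhaustive search: illustrative only, exponential in k."""
    best = min(
        combinations(range(len(series)), k),
        key=lambda idx: representation_error(series, set(idx), bucket_width),
    )
    return list(best)


# Hypothetical series: a local spike (9) in a low region, then a high plateau.
series = [1, 1, 9, 1, 20, 20, 20, 20]
deviants = brute_force_deviants(series, k=1, bucket_width=4)
print(deviants, [series[i] for i in deviants])
```

In this toy series the selected point is the 9 at index 2, even though the global maximum is 20: removing the 9 lets both buckets be summarized exactly, which matches the abstract's remark that deviants are local aberrations rather than global extremes.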



