Detection of Highly Correlated Live Data Streams

Proceedings of the International Workshop on Real-Time Business Intelligence and Analytics Pub Date : 2017-08-28 DOI:10.1145/3129292.3129298

Rakan Alseghayer, Daniel Petrov, Panos K. Chrysanthis, M. Sharaf, Alexandros Labrinidis


Citations: 6

Abstract


Detection of Highly Correlated Live Data Streams

More and more organizations (commercial, health, government, and security) currently base their decisions on real-time analysis of fast-arriving, large volumes of data streams. For such analysis to lead to actionable information in real time and at the right time, the most recent data needs to be processed within a specified delay target. Effective solutions for analyzing such data streams rely on two techniques: (1) incremental sliding-window computation of aggregates, to avoid unnecessary recomputation, and (2) intelligent scheduling of computational steps and operations. In this paper, we propose a solution that combines both techniques to find highly correlated data streams in real time, using the Pearson Correlation Coefficient as the correlation metric for two windows of data streams. Specifically, we propose to partition a set of data streams into micro-batches that capture the delay target, use sliding windows within a range as the subsequences of values exhibiting a certain level of correlation, utilize the idea of sufficient statistics to incrementally compute the Pearson Correlation Coefficient of pairs of sliding windows, and adopt deadline-aware priority scheduling to detect the highly correlated pairs of data streams. Our experimental results show that our scheme, and in particular our Price-DCS scheduling algorithm with warm start, outperforms existing ones and enables a high degree of interactivity in correlating micro-batches of live data streams.
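To make the sufficient-statistics idea concrete, the sketch below keeps five running sums over a sliding window of paired values (Σx, Σy, Σx², Σy², Σxy) and derives the Pearson Correlation Coefficient from them as r = (nΣxy − ΣxΣy) / sqrt((nΣx² − (Σx)²)(nΣy² − (Σy)²)). Each new pair updates the sums in constant time instead of rescanning the window. This is only an illustration under stated assumptions: the class name `SlidingPearson` and its methods are hypothetical, the two streams are assumed to arrive as aligned pairs, and nothing here reproduces the paper's micro-batching or its Price-DCS scheduler.

```python
import math
from collections import deque


class SlidingPearson:
    """Hypothetical helper: Pearson correlation over the last `window` pairs.

    Only five running sums (the sufficient statistics) are updated per
    arriving pair, so the window is never rescanned.
    """

    def __init__(self, window):
        if window < 2:
            raise ValueError("window must hold at least two pairs")
        self.window = window
        self.pairs = deque()  # retained only so expired values can be subtracted
        self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def push(self, x, y):
        """Add the newest pair and evict the oldest one if the window is full."""
        self.pairs.append((x, y))
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y
        if len(self.pairs) > self.window:
            ox, oy = self.pairs.popleft()
            self.sx -= ox
            self.sy -= oy
            self.sxx -= ox * ox
            self.syy -= oy * oy
            self.sxy -= ox * oy

    def correlation(self):
        """Return r for the current window, or None if it is undefined."""
        n = len(self.pairs)
        if n < 2:
            return None
        cov = n * self.sxy - self.sx * self.sy
        var_x = n * self.sxx - self.sx * self.sx
        var_y = n * self.syy - self.sy * self.sy
        if var_x <= 0 or var_y <= 0:
            return None  # a constant stream has no defined correlation
        return cov / math.sqrt(var_x * var_y)


# Usage: two perfectly linearly related streams, window of four pairs.
sp = SlidingPearson(window=4)
for x, y in zip([1, 2, 3, 4, 5, 6], [2, 4, 6, 8, 10, 12]):
    sp.push(x, y)
r = sp.correlation()
```

Subtracting expired values from running sums can accumulate floating-point error on long streams with large magnitudes, so a practical system would likely recompute the sums periodically or use a numerically stabler update.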

