Trading Timeliness and Accuracy in Geo-Distributed Streaming Analytics

Proceedings of the Seventh ACM Symposium on Cloud Computing. Pub Date: 2016-10-05. DOI: 10.1145/2987550.2987580

Benjamin Heintz, A. Chandra, R. Sitaraman


Citations: 52

Abstract

Many applications must ingest rapid data streams and produce analytics results in near-real-time. It is increasingly common for inputs to such applications to originate from geographically distributed sources. The typical infrastructure for processing such geo-distributed streams follows a hub-and-spoke model, where several edge servers perform partial computation before forwarding results over a wide-area network (WAN) to a central location for final processing. Due to limited WAN bandwidth, it is not always possible to produce exact results. In such cases, applications must either sacrifice timeliness by allowing delayed (i.e., stale) results, or sacrifice accuracy by allowing some error in final results. In this paper, we focus on windowed grouped aggregation, an important and widely used primitive in streaming analytics, and we study the tradeoff between staleness and error. We present optimal offline algorithms for minimizing staleness under an error constraint and for minimizing error under a staleness constraint. Using these offline algorithms as references, we present practical online algorithms for effectively trading off timeliness and accuracy under bandwidth limitations. Using a workload derived from an analytics service offered by a large commercial CDN, we demonstrate the effectiveness of our techniques through both trace-driven simulation and experiments on an Apache Storm-based implementation deployed on PlanetLab. Our experiments show that our proposed algorithms reduce staleness by 81.8% to 96.6%, and error by 83.4% to 99.1%, compared to a practical random sampling/batching-based aggregation algorithm across a diverse set of aggregation functions.
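To make the setting concrete, the following is a minimal illustrative sketch of windowed grouped aggregation at a single edge server, followed by a naive bandwidth-limited flush. It is not the paper's offline or online algorithm. The function names, the `(timestamp, key, value)` record layout, the use of sum as the aggregation function, and the "largest partial sums first" policy are all assumptions chosen for illustration.

```python
from collections import defaultdict


def tumbling_window_sums(records, window_len):
    """Partially aggregate (timestamp, key, value) records at an edge.

    Records are bucketed into tumbling windows of `window_len` time units
    and summed per key within each window (hypothetical record layout).
    """
    windows = defaultdict(lambda: defaultdict(float))
    for ts, key, value in records:
        windows[ts // window_len][key] += value
    return {w: dict(groups) for w, groups in windows.items()}


def flush_under_budget(partials, budget):
    """Send at most `budget` per-key partial sums, largest magnitude first.

    Returns (sent, withheld). If the center finalizes the window now, the
    withheld sums become error; if it waits for them, the result is stale.
    This ranking policy is an illustrative assumption only.
    """
    ranked = sorted(partials.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return dict(ranked[:budget]), dict(ranked[budget:])


records = [(0, "a", 1.0), (1, "b", 5.0), (2, "a", 2.0), (10, "a", 4.0)]
per_window = tumbling_window_sums(records, window_len=10)
sent, withheld = flush_under_budget(per_window[0], budget=1)
```

With a budget of one update for the first window, the edge must choose between sending an incomplete result on time and delaying the window until the withheld update fits, which is the staleness versus error tradeoff the abstract describes.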
