Cost-Aware Streaming Data Analysis: Distributed vs Single-Thread

Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems. Pub Date: 2018-06-25. DOI: 10.1145/3210284.3210294

Marco Balduini, Sivam Pasupathipillai, Emanuele Della Valle

Citations: 1

Abstract

Distributed systems have become the preferred solution for Big Data analysis tasks. They achieve superior performance by managing a large pool of resources as a single entity. However, in many contexts, performance is not the only metric to consider. When comparing two performance-equivalent solutions, their cost becomes an important factor. Distributed systems are usually more expensive to deploy than traditional single-threaded applications. In this work, we build on these considerations by presenting an empirical study that compares the cost of two performance-equivalent solutions for a real streaming data analysis task in the telecommunications industry. The first solution is built on a popular distributed processing engine (Apache Spark), while the second is a single-threaded application built on a home-brew stream processing framework (Natron). We show that, in the case of continuous analysis, the benefits of distributed processing are outweighed by the distributed data ingestion costs. This is also the case for periodic analysis. However, if data ingestion costs are fixed and small, we show that the most cost-effective solution depends on the dataset size.
