成本意识流式数据分析:分布式vs单线程

Marco Balduini, Sivam Pasupathipillai, Emanuele Della Valle
{"title":"成本意识流式数据分析:分布式vs单线程","authors":"Marco Balduini, Sivam Pasupathipillai, Emanuele Della Valle","doi":"10.1145/3210284.3210294","DOIUrl":null,"url":null,"abstract":"Distributed systems have become the preferred solution for dealing with Big Data analysis tasks. These systems are able to achieve superior performance by managing a large pool of resources as a single entity. However, in many contexts, performance is not the only metric to consider. When comparing two performance equivalent solutions, their cost becomes an important factor. Distributed systems are usually more expensive to deploy than traditional single-threaded applications. In this work, we build on these considerations by presenting an empirical study that compares the cost of two performance equivalent solutions for a real streaming data analysis task for the Telecommunication industry. The first solution is built on popular distributed processing engines (Apache Spark), while the second solution is a single-threaded application built on an home-brew stream processing framework (Natron). We show that, in the case of continuous analysis, the benefits of distributed processing are outvalued by the distributed data ingestion costs. This is also the case for periodic analysis. However, if data ingestion costs are fixed and small, we show that the most cost-effective solution depends on the dataset size.","PeriodicalId":412438,"journal":{"name":"Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Cost-Aware Streaming Data Analysis: Distributed vs Single-Thread\",\"authors\":\"Marco Balduini, Sivam Pasupathipillai, Emanuele Della Valle\",\"doi\":\"10.1145/3210284.3210294\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Distributed systems have become the preferred solution for dealing with Big Data analysis tasks. These systems are able to achieve superior performance by managing a large pool of resources as a single entity. However, in many contexts, performance is not the only metric to consider. When comparing two performance equivalent solutions, their cost becomes an important factor. Distributed systems are usually more expensive to deploy than traditional single-threaded applications. In this work, we build on these considerations by presenting an empirical study that compares the cost of two performance equivalent solutions for a real streaming data analysis task for the Telecommunication industry. The first solution is built on popular distributed processing engines (Apache Spark), while the second solution is a single-threaded application built on an home-brew stream processing framework (Natron). We show that, in the case of continuous analysis, the benefits of distributed processing are outvalued by the distributed data ingestion costs. This is also the case for periodic analysis. However, if data ingestion costs are fixed and small, we show that the most cost-effective solution depends on the dataset size.\",\"PeriodicalId\":412438,\"journal\":{\"name\":\"Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems\",\"volume\":\"48 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-06-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3210284.3210294\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 12th ACM International Conference on Distributed and Event-based Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3210284.3210294","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

摘要

分布式系统已经成为处理大数据分析任务的首选解决方案。这些系统能够通过将大量资源池作为单个实体进行管理来实现卓越的性能。然而,在许多情况下,性能并不是要考虑的唯一指标。在比较两个性能等效的解决方案时,它们的成本成为一个重要的因素。分布式系统的部署成本通常比传统的单线程应用程序要高。在这项工作中,我们通过提出一项实证研究来建立这些考虑,该研究比较了电信行业实际流数据分析任务的两种性能等效解决方案的成本。第一种解决方案是基于流行的分布式处理引擎(Apache Spark)构建的,而第二种解决方案是基于自制流处理框架(Natron)构建的单线程应用程序。我们表明,在连续分析的情况下,分布式处理的好处被分布式数据摄取成本所低估。这也是周期分析的情况。然而,如果数据摄取成本固定且很小,我们表明最具成本效益的解决方案取决于数据集大小。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Cost-Aware Streaming Data Analysis: Distributed vs Single-Thread
Distributed systems have become the preferred solution for dealing with Big Data analysis tasks. These systems are able to achieve superior performance by managing a large pool of resources as a single entity. However, in many contexts, performance is not the only metric to consider. When comparing two performance equivalent solutions, their cost becomes an important factor. Distributed systems are usually more expensive to deploy than traditional single-threaded applications. In this work, we build on these considerations by presenting an empirical study that compares the cost of two performance equivalent solutions for a real streaming data analysis task for the Telecommunication industry. The first solution is built on popular distributed processing engines (Apache Spark), while the second solution is a single-threaded application built on an home-brew stream processing framework (Natron). We show that, in the case of continuous analysis, the benefits of distributed processing are outvalued by the distributed data ingestion costs. This is also the case for periodic analysis. However, if data ingestion costs are fixed and small, we show that the most cost-effective solution depends on the dataset size.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信