云环境下大数据存储工作负载的表征

Sacheendra Talluri, Alicja Luszczak, Cristina L. Abad, A. Iosup
{"title":"云环境下大数据存储工作负载的表征","authors":"Sacheendra Talluri, Alicja Luszczak, Cristina L. Abad, A. Iosup","doi":"10.1145/3297663.3310302","DOIUrl":null,"url":null,"abstract":"The proliferation of big data processing platforms has led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems facilitates tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. To address this problem, in this work we collect and analyze a 6-month Spark workload from a major provider of big data processing services, Databricks. Our analysis focuses on a number of key features, such as the long-term trends of reads and modifications, the statistical properties of reads, and the popularity of clusters and of file formats. Overall, we present numerous findings that could form the basis of new systems studies and designs. Our quantitative evidence and its analysis suggest the existence of daily and weekly load imbalances, of heavy-tailed and bursty behaviour, of the relative rarity of modifications, and of proliferation of big data specific formats.","PeriodicalId":273447,"journal":{"name":"Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering","volume":"60 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":"{\"title\":\"Characterization of a Big Data Storage Workload in the Cloud\",\"authors\":\"Sacheendra Talluri, Alicja Luszczak, Cristina L. Abad, A. Iosup\",\"doi\":\"10.1145/3297663.3310302\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The proliferation of big data processing platforms has led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems facilitates tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. To address this problem, in this work we collect and analyze a 6-month Spark workload from a major provider of big data processing services, Databricks. Our analysis focuses on a number of key features, such as the long-term trends of reads and modifications, the statistical properties of reads, and the popularity of clusters and of file formats. Overall, we present numerous findings that could form the basis of new systems studies and designs. Our quantitative evidence and its analysis suggest the existence of daily and weekly load imbalances, of heavy-tailed and bursty behaviour, of the relative rarity of modifications, and of proliferation of big data specific formats.\",\"PeriodicalId\":273447,\"journal\":{\"name\":\"Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering\",\"volume\":\"60 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-04-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"11\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3297663.3310302\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3297663.3310302","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

摘要

大数据处理平台的激增导致了截然不同的系统设计,例如MapReduce和较新的Spark。了解此类系统的工作负载有助于调优,并可以促进新的设计。然而,尽管MapReduce工作负载已经被广泛地描述过,但在代表性环境中,关于Spark工作负载特征的公共知识相对较少。为了解决这个问题,在这项工作中,我们从大数据处理服务的主要提供商Databricks那里收集并分析了6个月的Spark工作负载。我们的分析集中在一些关键特性上,比如读取和修改的长期趋势、读取的统计属性以及集群和文件格式的流行程度。总的来说,我们提出了许多可以形成新系统研究和设计基础的发现。我们的定量证据及其分析表明,存在每日和每周的负载失衡、重尾和突发行为、相对罕见的修改以及大数据特定格式的激增。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Characterization of a Big Data Storage Workload in the Cloud
The proliferation of big data processing platforms has led to radically different system designs, such as MapReduce and the newer Spark. Understanding the workloads of such systems facilitates tuning and could foster new designs. However, whereas MapReduce workloads have been characterized extensively, relatively little public knowledge exists about the characteristics of Spark workloads in representative environments. To address this problem, in this work we collect and analyze a 6-month Spark workload from a major provider of big data processing services, Databricks. Our analysis focuses on a number of key features, such as the long-term trends of reads and modifications, the statistical properties of reads, and the popularity of clusters and of file formats. Overall, we present numerous findings that could form the basis of new systems studies and designs. Our quantitative evidence and its analysis suggest the existence of daily and weekly load imbalances, of heavy-tailed and bursty behaviour, of the relative rarity of modifications, and of proliferation of big data specific formats.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信