面向集群内内存大数据分析的自动内存调优

2016 IEEE International Congress on Big Data (BigData Congress) Pub Date : 2016-06-27 DOI:10.1109/BigDataCongress.2016.56

Aris-Kyriakos Koliopoulos, Paraskevas Yiapanis, F. Tekiner, G. Nenadic, J. Keane

{"title":"面向集群内内存大数据分析的自动内存调优","authors":"Aris-Kyriakos Koliopoulos, Paraskevas Yiapanis, F. Tekiner, G. Nenadic, J. Keane","doi":"10.1109/BigDataCongress.2016.56","DOIUrl":null,"url":null,"abstract":"Hadoop provides a scalable solution on traditional cluster-based Big Data platforms but imposes performance overheads due to only supporting on-disk data. Data Analytic algorithms usually require multiple iterations over a dataset and thus, multiple, slow, disk accesses. In contrast, modern clusters possess increasing amounts of main memory that can provide performance benefits by efficiently using main memory caching mechanisms. Apache Spark is an innovative distributed computing framework that supports in-memory computations. Even though this type of computations is very fast, memory is a scarce resource and this can cause bottlenecks to execution or, even worse, lead to failures. Spark offers various choices for memory tuning but this requires in-depth systems-level knowledge and the choices will be different across various workloads and cluster settings. Generally, the optimal choice is achieved by adopting a trial and error approach. This work describes a first step towards an automated selection mechanism for memory optimization that assesses workload and cluster characteristics and selects an appropriate caching scheme. The proposed caching mechanism decreases execution times by up to 25% compared to the default strategy and reduces the risk of main memory exceptions.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":"{\"title\":\"Towards Automatic Memory Tuning for In-Memory Big Data Analytics in Clusters\",\"authors\":\"Aris-Kyriakos Koliopoulos, Paraskevas Yiapanis, F. Tekiner, G. Nenadic, J. Keane\",\"doi\":\"10.1109/BigDataCongress.2016.56\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hadoop provides a scalable solution on traditional cluster-based Big Data platforms but imposes performance overheads due to only supporting on-disk data. Data Analytic algorithms usually require multiple iterations over a dataset and thus, multiple, slow, disk accesses. In contrast, modern clusters possess increasing amounts of main memory that can provide performance benefits by efficiently using main memory caching mechanisms. Apache Spark is an innovative distributed computing framework that supports in-memory computations. Even though this type of computations is very fast, memory is a scarce resource and this can cause bottlenecks to execution or, even worse, lead to failures. Spark offers various choices for memory tuning but this requires in-depth systems-level knowledge and the choices will be different across various workloads and cluster settings. Generally, the optimal choice is achieved by adopting a trial and error approach. This work describes a first step towards an automated selection mechanism for memory optimization that assesses workload and cluster characteristics and selects an appropriate caching scheme. The proposed caching mechanism decreases execution times by up to 25% compared to the default strategy and reduces the risk of main memory exceptions.\",\"PeriodicalId\":407471,\"journal\":{\"name\":\"2016 IEEE International Congress on Big Data (BigData Congress)\",\"volume\":\"21 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-06-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"10\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE International Congress on Big Data (BigData Congress)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/BigDataCongress.2016.56\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Congress on Big Data (BigData Congress)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BigDataCongress.2016.56","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 10

摘要

Hadoop在传统的基于集群的大数据平台上提供了一个可扩展的解决方案，但由于只支持磁盘上的数据而增加了性能开销。数据分析算法通常需要对数据集进行多次迭代，因此需要多次缓慢的磁盘访问。相比之下，现代集群拥有越来越多的主存，可以通过有效地使用主存缓存机制来提供性能优势。Apache Spark是一个创新的分布式计算框架，支持内存计算。尽管这种类型的计算非常快，但内存是一种稀缺资源，这可能会导致执行瓶颈，甚至更糟，导致失败。Spark为内存调优提供了多种选择，但这需要深入的系统级知识，而且在不同的工作负载和集群设置中，选择也会有所不同。通常，最优选择是通过采用试错法来实现的。这项工作描述了迈向内存优化自动选择机制的第一步，该机制可评估工作负载和集群特征，并选择适当的缓存方案。与默认策略相比，提议的缓存机制最多可减少25%的执行时间，并降低主存异常的风险。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Towards Automatic Memory Tuning for In-Memory Big Data Analytics in Clusters

Hadoop provides a scalable solution on traditional cluster-based Big Data platforms but imposes performance overheads due to only supporting on-disk data. Data Analytic algorithms usually require multiple iterations over a dataset and thus, multiple, slow, disk accesses. In contrast, modern clusters possess increasing amounts of main memory that can provide performance benefits by efficiently using main memory caching mechanisms. Apache Spark is an innovative distributed computing framework that supports in-memory computations. Even though this type of computations is very fast, memory is a scarce resource and this can cause bottlenecks to execution or, even worse, lead to failures. Spark offers various choices for memory tuning but this requires in-depth systems-level knowledge and the choices will be different across various workloads and cluster settings. Generally, the optimal choice is achieved by adopting a trial and error approach. This work describes a first step towards an automated selection mechanism for memory optimization that assesses workload and cluster characteristics and selects an appropriate caching scheme. The proposed caching mechanism decreases execution times by up to 25% compared to the default strategy and reduces the risk of main memory exceptions.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 IEEE International Congress on Big Data (BigData Congress)

自引率

0.00%

发文量