ALOJA-ML: Hadoop部署中自动化特性描述和知识发现的框架

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining Pub Date : 2015-08-10 DOI:10.1145/2783258.2788600

J. L. Berral, Nicolás Poggi, David Carrera, A. Call, Rob Reinauer, Daron Green

{"title":"ALOJA-ML: Hadoop部署中自动化特性描述和知识发现的框架","authors":"J. L. Berral, Nicolás Poggi, David Carrera, A. Call, Rob Reinauer, Daron Green","doi":"10.1145/2783258.2788600","DOIUrl":null,"url":null,"abstract":"This article presents ALOJA-Machine Learning (ALOJA-ML) an extension to the ALOJA project that uses machine learning techniques to interpret Hadoop benchmark performance data and performance tuning; here we detail the approach, efficacy of the model and initial results. The ALOJA-ML project is the latest phase of a long-term collaboration between BSC and Microsoft, to automate the characterization of cost-effectiveness on Big Data deployments, focusing on Hadoop. Hadoop presents a complex execution environment, where costs and performance depends on a large number of software (SW) configurations and on multiple hardware (HW) deployment choices. Recently the ALOJA project presented an open, vendor-neutral repository, featuring over 16.000 Hadoop executions. These results are accompanied by a test bed and tools to deploy and evaluate the cost-effectiveness of the different hardware configurations, parameter tunings, and Cloud services. Despite early success within ALOJA from expert-guided benchmarking, it became clear that a genuinely comprehensive study requires automation of modeling procedures to allow a systematic analysis of large and resource-constrained search spaces. ALOJA-ML provides such an automated system allowing knowledge discovery by modeling Hadoop executions from observed benchmarks across a broad set of configuration parameters. The resulting empirically-derived performance models can be used to forecast execution behavior of various workloads; they allow a-priori prediction of the execution times for new configurations and HW choices and they offer a route to model-based anomaly detection. In addition, these models can guide the benchmarking exploration efficiently, by automatically prioritizing candidate future benchmark tests. Insights from ALOJA-ML's models can be used to reduce the operational time on clusters, speed-up the data acquisition and knowledge discovery process, and importantly, reduce running costs. In addition to learning from the methodology presented in this work, the community can benefit in general from ALOJA data-sets, framework, and derived insights to improve the design and deployment of Big Data applications.","PeriodicalId":243428,"journal":{"name":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-08-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"19","resultStr":"{\"title\":\"ALOJA-ML: A Framework for Automating Characterization and Knowledge Discovery in Hadoop Deployments\",\"authors\":\"J. L. Berral, Nicolás Poggi, David Carrera, A. Call, Rob Reinauer, Daron Green\",\"doi\":\"10.1145/2783258.2788600\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This article presents ALOJA-Machine Learning (ALOJA-ML) an extension to the ALOJA project that uses machine learning techniques to interpret Hadoop benchmark performance data and performance tuning; here we detail the approach, efficacy of the model and initial results. The ALOJA-ML project is the latest phase of a long-term collaboration between BSC and Microsoft, to automate the characterization of cost-effectiveness on Big Data deployments, focusing on Hadoop. Hadoop presents a complex execution environment, where costs and performance depends on a large number of software (SW) configurations and on multiple hardware (HW) deployment choices. Recently the ALOJA project presented an open, vendor-neutral repository, featuring over 16.000 Hadoop executions. These results are accompanied by a test bed and tools to deploy and evaluate the cost-effectiveness of the different hardware configurations, parameter tunings, and Cloud services. Despite early success within ALOJA from expert-guided benchmarking, it became clear that a genuinely comprehensive study requires automation of modeling procedures to allow a systematic analysis of large and resource-constrained search spaces. ALOJA-ML provides such an automated system allowing knowledge discovery by modeling Hadoop executions from observed benchmarks across a broad set of configuration parameters. The resulting empirically-derived performance models can be used to forecast execution behavior of various workloads; they allow a-priori prediction of the execution times for new configurations and HW choices and they offer a route to model-based anomaly detection. In addition, these models can guide the benchmarking exploration efficiently, by automatically prioritizing candidate future benchmark tests. Insights from ALOJA-ML's models can be used to reduce the operational time on clusters, speed-up the data acquisition and knowledge discovery process, and importantly, reduce running costs. In addition to learning from the methodology presented in this work, the community can benefit in general from ALOJA data-sets, framework, and derived insights to improve the design and deployment of Big Data applications.\",\"PeriodicalId\":243428,\"journal\":{\"name\":\"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-08-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"19\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2783258.2788600\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2783258.2788600","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 19

摘要

本文介绍了ALOJA-机器学习(ALOJA- ml)，这是ALOJA项目的扩展，它使用机器学习技术来解释Hadoop基准性能数据和性能调优;本文详细介绍了该模型的方法、有效性和初步结果。ALOJA-ML项目是BSC和微软长期合作的最新阶段，旨在自动化大数据部署的成本效益表征，重点是Hadoop。Hadoop呈现了一个复杂的执行环境，其成本和性能取决于大量的软件(SW)配置和多个硬件(HW)部署选择。最近，ALOJA项目提出了一个开放的、厂商中立的存储库，包含了超过16000个Hadoop执行。这些结果伴随着测试平台和工具，用于部署和评估不同硬件配置、参数调优和云服务的成本效益。尽管专家指导的基准测试在ALOJA中取得了早期的成功，但很明显，真正全面的研究需要建模过程的自动化，以便对大型和资源受限的搜索空间进行系统分析。ALOJA-ML提供了这样一个自动化的系统，通过在广泛的配置参数集上对观察到的基准的Hadoop执行进行建模，从而允许知识发现。由此得出的经验推导的性能模型可用于预测各种工作负载的执行行为;它们允许对新配置和硬件选择的执行时间进行先验预测，并提供基于模型的异常检测途径。此外，这些模型可以通过自动确定候选未来基准测试的优先级，有效地指导基准测试探索。来自ALOJA-ML模型的见解可用于减少集群上的操作时间，加快数据获取和知识发现过程，重要的是，降低运行成本。除了从本工作中提出的方法中学习之外，社区还可以从ALOJA数据集、框架和派生的见解中获益，以改进大数据应用程序的设计和部署。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

ALOJA-ML: A Framework for Automating Characterization and Knowledge Discovery in Hadoop Deployments

This article presents ALOJA-Machine Learning (ALOJA-ML) an extension to the ALOJA project that uses machine learning techniques to interpret Hadoop benchmark performance data and performance tuning; here we detail the approach, efficacy of the model and initial results. The ALOJA-ML project is the latest phase of a long-term collaboration between BSC and Microsoft, to automate the characterization of cost-effectiveness on Big Data deployments, focusing on Hadoop. Hadoop presents a complex execution environment, where costs and performance depends on a large number of software (SW) configurations and on multiple hardware (HW) deployment choices. Recently the ALOJA project presented an open, vendor-neutral repository, featuring over 16.000 Hadoop executions. These results are accompanied by a test bed and tools to deploy and evaluate the cost-effectiveness of the different hardware configurations, parameter tunings, and Cloud services. Despite early success within ALOJA from expert-guided benchmarking, it became clear that a genuinely comprehensive study requires automation of modeling procedures to allow a systematic analysis of large and resource-constrained search spaces. ALOJA-ML provides such an automated system allowing knowledge discovery by modeling Hadoop executions from observed benchmarks across a broad set of configuration parameters. The resulting empirically-derived performance models can be used to forecast execution behavior of various workloads; they allow a-priori prediction of the execution times for new configurations and HW choices and they offer a route to model-based anomaly detection. In addition, these models can guide the benchmarking exploration efficiently, by automatically prioritizing candidate future benchmark tests. Insights from ALOJA-ML's models can be used to reduce the operational time on clusters, speed-up the data acquisition and knowledge discovery process, and importantly, reduce running costs. In addition to learning from the methodology presented in this work, the community can benefit in general from ALOJA data-sets, framework, and derived insights to improve the design and deployment of Big Data applications.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

自引率

0.00%

发文量