Modelling and Prediction of Resource Utilization of Hadoop Clusters: A Machine Learning Approach
H. Tariq, Harith Al-Sahaf, I. Welch
Proceedings of the 12th IEEE/ACM International Conference on Utility and Cloud Computing, 2019-12-02
https://doi.org/10.1145/3344341.3368821
Hadoop is a distributed computing framework with a large number of configurable parameters. These parameters affect both system resource usage and execution time. Optimizing the performance of a Hadoop cluster by tuning so many parameters is a tedious task. Most current big data modeling approaches do not capture the complex interactions between configuration parameters and changes in the cluster environment, such as different datasets or queries. This makes it difficult to predict the performance or resource utilization of a cluster on real-world datasets because of their size and content. This paper models the resource utilization of a Hadoop cluster on the basis of Hadoop configuration parameters and dataset structure. Our approach builds a machine-learning-based model from Hive-based Hadoop queries and then predicts the outcome for a particular parameter setting and query type. We used decision trees to build a model for each of our performance metrics. Decision rules were extracted from these tree-based models and evaluated for their ability to generalize to unseen data. Our experiments indicate that the percentage of columns selected, the number of mappers, and the number of replicas have a statistically significant impact on the utilization of different resources in a Hadoop cluster.
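To make the workflow in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It fits a small regression tree to one resource metric, flattens the tree into decision rules, and checks the tree on held-out rows. Everything specific is an assumption for illustration: the feature names, the synthetic data and its invented relationship to CPU utilization, the tree depth, and the use of mean absolute error. The tree is hand-rolled in plain Python so the example has no dependencies.

```python
import random

# Hypothetical feature names; the paper's actual feature set is not listed here.
FEATURES = ["pct_columns_selected", "num_mappers", "num_replicas"]

def make_synthetic_rows(n, seed):
    """Generate toy (features, cpu_utilization) rows. Purely synthetic data."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = [rng.uniform(0, 100), rng.randint(1, 16), rng.randint(1, 3)]
        # Invented relationship, for illustration only (not a measured result).
        y = 20 + 0.4 * x[0] + (15 if x[1] > 8 else 0) + rng.gauss(0, 1)
        rows.append((x, y))
    return rows

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows):
    """Return the (feature_index, threshold) that most reduces squared error, or None."""
    best, best_score = None, sse([y for _, y in rows])
    for f in range(len(FEATURES)):
        values = sorted({x[f] for x, _ in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # midpoint between adjacent distinct values
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            score = sse(left) + sse(right)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(rows, depth=0, max_depth=3, min_rows=10):
    """Recursively grow a regression tree stored as nested dicts."""
    split = best_split(rows) if depth < max_depth and len(rows) >= min_rows else None
    if split is None:
        return {"leaf": mean([y for _, y in rows])}
    f, t = split
    left = [r for r in rows if r[0][f] <= t]
    right = [r for r in rows if r[0][f] > t]
    return {"feature": f, "threshold": t,
            "left": build_tree(left, depth + 1, max_depth, min_rows),
            "right": build_tree(right, depth + 1, max_depth, min_rows)}

def predict(tree, x):
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

def extract_rules(tree, conditions=()):
    """Flatten the tree into one 'IF ... THEN ...' rule per leaf."""
    if "leaf" in tree:
        lhs = " AND ".join(conditions) or "TRUE"
        return [f"IF {lhs} THEN cpu_utilization ~= {tree['leaf']:.1f}"]
    name, t = FEATURES[tree["feature"]], tree["threshold"]
    return (extract_rules(tree["left"], conditions + (f"{name} <= {t:.1f}",))
            + extract_rules(tree["right"], conditions + (f"{name} > {t:.1f}",)))

def mean_abs_error(tree, rows):
    return mean([abs(predict(tree, x) - y) for x, y in rows])

train = make_synthetic_rows(200, seed=0)
held_out = make_synthetic_rows(100, seed=1)  # stands in for "unseen data"
tree = build_tree(train)
rules = extract_rules(tree)
baseline = {"leaf": mean([y for _, y in train])}  # always predicts the training mean

for rule in rules:
    print(rule)
print("held-out MAE, tree:", round(mean_abs_error(tree, held_out), 2))
print("held-out MAE, mean baseline:", round(mean_abs_error(baseline, held_out), 2))
```

In this sketch one tree corresponds to one metric; modelling several metrics, as the abstract describes, would mean repeating the fit once per target column. Comparing against a mean-only baseline is one simple way to judge whether the extracted rules carry information beyond the average.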