A. Karthik, Harsh Mishra, S. Jayanth, G. Shobha, Jyoti Shetty
Performance Skew Prediction in HPCC Systems
2022 12th International Conference on Cloud Computing, Data Science & Engineering (Confluence), published 2022-01-27
DOI: 10.1109/Confluence52989.2022.9734182 (https://doi.org/10.1109/Confluence52989.2022.9734182)
Over the last decade, the volume of data has grown faster than the available processing power. Distributed computing has been essential to handling data at this scale. However, data may not be distributed uniformly across the systems in a cluster, which gives rise to the problems of data skew and performance skew. A key challenge is to estimate the effective performance skew of a set of queries from the data skew of the dataset on a multi-node cluster. We use HPCC Systems, a modern big data management and analysis tool. Existing methods for measuring the impact of performance skew on query performance on an HPCC cluster depend heavily on human interpretation. This project aims to automate skew prediction by analyzing the execution graphs of a job on an HPCC Systems cluster and predicting the probable performance skew for a given set of queries with a Random Forest Regressor model.
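To make the notion of data skew concrete, the following is a minimal sketch that expresses skew as the percentage deviation of the most and least loaded node from the mean load. The abstract does not state the exact metric used in the paper, so this definition, the function name `skew_percentages`, and the sample row counts are all illustrative assumptions rather than the authors' method.

```python
from statistics import mean


def skew_percentages(rows_per_node):
    """Return (max_skew, min_skew) in percent, relative to the mean load.

    Assumed definition (not taken from the paper):
        max_skew = (max - mean) / mean * 100
        min_skew = (min - mean) / mean * 100
    A perfectly balanced distribution yields (0.0, 0.0).
    """
    if not rows_per_node:
        raise ValueError("need at least one node")
    avg = mean(rows_per_node)
    if avg == 0:
        # No data on any node: treat as balanced.
        return 0.0, 0.0
    max_skew = (max(rows_per_node) - avg) / avg * 100.0
    min_skew = (min(rows_per_node) - avg) / avg * 100.0
    return max_skew, min_skew


# Hypothetical row counts on a four-node cluster.
balanced = [250, 250, 250, 250]
skewed = [700, 100, 100, 100]

balanced_skew = skew_percentages(balanced)
skewed_skew = skew_percentages(skewed)

print("balanced:", balanced_skew)
print("skewed:  ", skewed_skew)
```

With the hypothetical skewed distribution, the busiest node holds 180% more rows than the mean and the least loaded nodes hold 60% fewer. Per-node measurements of this kind are one plausible input feature for a regression model that predicts performance skew.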