SMiPE: Estimating the Progress of Recurring Iterative Distributed Dataflows
Authors: Jannis Koch, L. Thamsen, Florian Schmidt, O. Kao
Published in: 2017 18th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT), 2017-12-01
DOI: 10.1109/PDCAT.2017.00034 (https://doi.org/10.1109/PDCAT.2017.00034)
Distributed dataflow systems such as Apache Spark allow the execution of iterative programs at large scale on clusters. In production use, programs are often recurring and have strict latency requirements. Yet choosing appropriate resource allocations is difficult because runtimes depend on hard-to-predict factors, including failures, cluster utilization, and dataset characteristics. Offline runtime prediction helps to estimate resource requirements but cannot account for inherent variance due to, for example, changing cluster states. We present SMiPE, a system that estimates the progress of iterative dataflows by matching a running job to previous executions based on similarity, capturing properties such as convergence, hardware utilization, and runtime. Because of its black-box approach, SMiPE is not limited to a specific framework, and it can adapt to changing cluster states as reflected in the current job’s statistics. SMiPE automatically adapts its similarity matching to algorithm-specific profiles by training parameters on the job history. We evaluated SMiPE with three iterative Spark jobs and nine datasets. The results show that SMiPE is effective in choosing useful historic runs and predicts runtimes with a mean relative error of 9.1% to 13.1%.
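The general idea of similarity-based progress estimation can be illustrated with a small sketch. This is not SMiPE's actual algorithm, which the abstract does not specify; it is a minimal nearest-neighbor illustration under stated assumptions. The `Run` type, the per-iteration feature vectors, the weighted Euclidean distance, and the parameter `k` are all hypothetical. The `weights` argument stands in for the trained, algorithm-specific parameters the abstract mentions, and `mean_relative_error` shows the standard definition of the reported metric.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Run:
    """One job execution (hypothetical structure, not taken from the paper)."""
    features: list[list[float]]  # per-iteration statistics, e.g. [convergence, cpu_util]
    elapsed: list[float]         # cumulative runtime in seconds after each iteration


def distance(a, b, weights):
    """Weighted Euclidean distance between two per-iteration feature vectors."""
    return sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def similarity(current, past, weights):
    """Map the mean distance over the shared iterations into the range (0, 1]."""
    n = min(len(current.features), len(past.features))
    mean_dist = sum(
        distance(current.features[i], past.features[i], weights) for i in range(n)
    ) / n
    return 1.0 / (1.0 + mean_dist)


def estimate_remaining(current, history, weights, k=2):
    """Similarity-weighted mean of the remaining runtime of the k most similar runs."""
    done = len(current.features)
    # Only past runs that got at least as far as the current job are comparable.
    candidates = [r for r in history if len(r.features) >= done]
    if not candidates:
        raise ValueError("no comparable historic runs")
    scored = sorted(
        ((similarity(current, r, weights), r) for r in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(s for s, _ in scored)
    return sum(s * (r.elapsed[-1] - r.elapsed[done - 1]) for s, r in scored) / total


def mean_relative_error(predicted, actual):
    """Average of |predicted - actual| / actual over all predictions."""
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


# Made-up example data: two finished runs and one job that is two iterations in.
history = [
    Run(features=[[1.0, 0.5], [0.5, 0.5], [0.25, 0.5], [0.1, 0.5]],
        elapsed=[10.0, 20.0, 30.0, 40.0]),
    Run(features=[[1.0, 0.9], [0.9, 0.9], [0.8, 0.9], [0.7, 0.9], [0.6, 0.9]],
        elapsed=[15.0, 30.0, 45.0, 60.0, 75.0]),
]
current = Run(features=[[1.0, 0.5], [0.5, 0.5]], elapsed=[11.0, 21.0])

remaining = estimate_remaining(current, history, weights=[1.0, 1.0], k=1)
print(f"estimated remaining runtime: {remaining:.1f} s")
```

In this sketch only the running job's observed statistics are compared against the history, which mirrors the black-box framing in the abstract: nothing depends on the internals of a particular dataflow framework.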