Speculative Slot Reservation: Enforcing Service Isolation for Dependent Data-Parallel Computations
Chen Chen, Wei Wang, Bo Li
2017 IEEE 37th International Conference on Distributed Computing Systems (ICDCS), published 2017-06-01
DOI: 10.1109/ICDCS.2017.174 (https://doi.org/10.1109/ICDCS.2017.174)
Priority scheduling is a fundamental tool for providing service isolation among jobs in shared clusters. Ideally, the performance of a high-priority job should not be dragged down by a job with a lower priority. However, we show in this paper that simply assigning a high priority provides no isolation for jobs with dependent computations. The cause is the barrier between stages: the downstream computation cannot start until all the upstream tasks have completed, so a job, even one with the highest priority, may give up compute slots to another job before it proceeds to the downstream computation. Such an interruption of execution inevitably results in a significant delay. In this paper, we propose speculative slot reservation, which judiciously reserves slots for downstream computations so as to retain service isolation for high-priority jobs. To mitigate the utilization loss due to slot reservation, we analyze the trade-off between utilization and isolation and expose a tunable knob to navigate it. We also propose a complementary straggler mitigation strategy that uses the reserved slots to run extra copies of slow tasks. We have implemented speculative slot reservation in Spark. Evaluations based on both cluster deployment and trace-driven simulations show that our approach enforces strict service isolation for high-priority jobs without slowing down the jobs with a lower priority.
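The barrier effect and the utilization/isolation trade-off described in the abstract can be illustrated with a toy model. The sketch below is not the paper's algorithm or its Spark implementation; it is a minimal illustration under stated assumptions: one slot per upstream task, one downstream task per slot, a non-preemptive low-priority backlog of equal-length tasks, and oracle knowledge of when the barrier clears (the real system has to speculate). The function name `simulate` and the parameter `max_idle`, standing in for the tunable knob, are hypothetical.

```python
import math


def simulate(upstream, downstream_len, low_len, max_idle):
    """Toy model of a two-stage high-priority job sharing slots.

    upstream       -- completion times of the upstream tasks, one per slot
    downstream_len -- duration of each downstream task (one per slot)
    low_len        -- duration of each non-preemptive low-priority task
    max_idle       -- hypothetical knob: a slot is reserved only if it would
                      sit idle for at most this long before the barrier clears
    Returns (job finish time, total idle slot-time spent on reservations).
    """
    barrier = max(upstream)  # downstream cannot start before this time
    idle = 0
    slot_ready = []
    for done in upstream:
        wait = barrier - done
        if wait <= max_idle:
            # Reserve: keep the slot idle until the barrier clears.
            idle += wait
            slot_ready.append(barrier)
        else:
            # Release: low-priority tasks run back to back on the slot, and
            # the slot returns only when the task running at the barrier ends.
            n_low = math.ceil(wait / low_len)
            slot_ready.append(done + n_low * low_len)
    return max(slot_ready) + downstream_len, idle


# No reservation, full reservation, and an intermediate knob setting.
print(simulate([2, 3, 10], 4, 6, 0))         # (19, 0)
print(simulate([2, 3, 10], 4, 6, math.inf))  # (14, 15)
print(simulate([2, 3, 10], 4, 6, 7))         # (18, 7)
```

In this example, releasing every slot delays the high-priority job from 14 to 19 time units, while reserving every slot removes the delay at the cost of 15 units of idle slot-time. Intermediate values of `max_idle` trade one for the other.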