Improving Scheduling for Irregular Applications with Logarithmic Radix Binning
James Fox, Alok Tripathy, Oded Green
2019 IEEE High Performance Extreme Computing Conference (HPEC), published 2019-09-01
DOI: 10.1109/HPEC.2019.8916333
Citations: 4
Abstract
Effective scheduling and load balancing of applications on massively multithreaded systems remains challenging despite decades of research, especially for irregular and data-dependent problems where the execution control path is unknown until run time. One of the most widely used load-balancing schemes for data-dependent problems computes a parallel prefix sum (PPS) array over the expected amount of work per task and then partitions the tasks among threads. While sufficient for many systems, this scheme is not ideal for massively multithreaded systems with SIMD/SIMT execution, such as GPUs, where finer-grained load balancing is needed to use the SIMD/SIMT units effectively. In this paper we introduce Logarithmic Radix Binning (LRB) as a more suitable alternative to parallel prefix summation for load balancing on such systems. We show that LRB has better scalability than PPS for high thread counts on Intel's Knights Landing processor and comparable scalability on NVIDIA Volta GPUs. On the application side, we show how LRB improves the performance of PageRank by up to 1.75x using the branch-avoiding model. We also show how to better load-balance segmented sort and improve its performance on the GPU.
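The two load-balancing schemes named in the abstract can be contrasted with a small sequential sketch. The abstract does not give implementation details, so everything below is an illustrative assumption rather than the authors' code: the function names `pps_partition`, `lrb_bins`, and `lrb_order` are hypothetical, the PPS variant cuts the prefix array at equal-work targets with a binary search, and the LRB variant assumes (from the name) that tasks are grouped by floor(log2(work)) so that tasks in one bin differ in work by less than a factor of two.

```python
from bisect import bisect_right
from itertools import accumulate


def pps_partition(work, num_threads):
    """Prefix-sum partitioning (assumed variant): give each thread a
    contiguous range of tasks holding roughly total/num_threads work.

    Returns one (start, end) half-open task range per thread.
    """
    prefix = [0] + list(accumulate(work))  # prefix[i] = work of tasks [0, i)
    total = prefix[-1]
    bounds = [0]
    for t in range(1, num_threads):
        target = total * t / num_threads
        # Last task boundary whose accumulated work does not exceed the target.
        cut = bisect_right(prefix, target) - 1
        bounds.append(max(cut, bounds[-1]))
    bounds.append(len(work))
    return list(zip(bounds[:-1], bounds[1:]))


def lrb_bins(work):
    """Logarithmic binning (assumed variant): bin b holds the indices of
    tasks whose work lies in [2**b, 2**(b+1)); zero-work tasks go to bin 0.
    """
    bins = {}
    for task, w in enumerate(work):
        b = max(w.bit_length() - 1, 0)  # floor(log2(w)) for w >= 1
        bins.setdefault(b, []).append(task)
    return bins


def lrb_order(work):
    """Flatten the bins from heaviest to lightest, so neighbouring tasks
    in the resulting order have similar amounts of work."""
    bins = lrb_bins(work)
    return [task for b in sorted(bins, reverse=True) for task in bins[b]]


if __name__ == "__main__":
    work = [1, 1, 64, 2, 3, 1, 8, 16]  # made-up per-task work estimates
    print(pps_partition(work, 4))
    print(lrb_bins(work))
    print(lrb_order(work))
```

With the made-up work list above and four threads, the prefix-sum cut hands the threads 2, 0, 70, and 24 units of work, because a single heavy task cannot be split across ranges. The binned order instead places tasks of similar size next to each other, which is the property that matters when adjacent tasks are mapped to lanes of one SIMD/SIMT unit.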