Seonmyeong Bak, Yanfei Guo, P. Balaji, Vivek Sarkar
{"title":"通过自定义调度策略优化并行循环的执行","authors":"Seonmyeong Bak, Yanfei Guo, P. Balaji, Vivek Sarkar","doi":"10.1145/3337821.3337913","DOIUrl":null,"url":null,"abstract":"On-node parallelism continues to increase in importance for high-performance computing and most newly deployed supercomputers have tens of processor cores per node. These higher levels of on-node parallelism exacerbate the impact of load imbalance and locality in parallel computations, and current programming systems notably lack features to enable efficient use of these large numbers of cores or require users to modify codes significantly. Our work is motivated by the need to address application-specific load balance and locality requirements with minimal changes to application codes. In this paper, we propose a new approach to extend the specification of parallel loops via user functions that specify iteration chunks. We also extend the runtime system to invoke these user functions when determining how to create chunks and schedule them on worker threads. Our runtime system starts with subspaces specified in the user functions, performs load balancing of chunks concurrently, and stores the balanced groups of chunks to reduce load imbalance in future invocations. Our approach can be used to improve load balance and locality in many dynamic iterative applications, including graph and sparse matrix applications. We demonstrate the benefits of this work using MiniMD, a miniapp derived from LAMMPS, and three kernels from the GAP Benchmark Suite: Breadth-First Search, Connected Components, and PageRank, each evaluated with six different graph data sets. Our approach achieves geometric mean speedups of 1.16× to 1.54× over four standard OpenMP schedules and 1.07× over the static_steal schedule from recent research.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Optimized Execution of Parallel Loops via User-Defined Scheduling Policies\",\"authors\":\"Seonmyeong Bak, Yanfei Guo, P. Balaji, Vivek Sarkar\",\"doi\":\"10.1145/3337821.3337913\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"On-node parallelism continues to increase in importance for high-performance computing and most newly deployed supercomputers have tens of processor cores per node. These higher levels of on-node parallelism exacerbate the impact of load imbalance and locality in parallel computations, and current programming systems notably lack features to enable efficient use of these large numbers of cores or require users to modify codes significantly. Our work is motivated by the need to address application-specific load balance and locality requirements with minimal changes to application codes. In this paper, we propose a new approach to extend the specification of parallel loops via user functions that specify iteration chunks. We also extend the runtime system to invoke these user functions when determining how to create chunks and schedule them on worker threads. Our runtime system starts with subspaces specified in the user functions, performs load balancing of chunks concurrently, and stores the balanced groups of chunks to reduce load imbalance in future invocations. Our approach can be used to improve load balance and locality in many dynamic iterative applications, including graph and sparse matrix applications. We demonstrate the benefits of this work using MiniMD, a miniapp derived from LAMMPS, and three kernels from the GAP Benchmark Suite: Breadth-First Search, Connected Components, and PageRank, each evaluated with six different graph data sets. Our approach achieves geometric mean speedups of 1.16× to 1.54× over four standard OpenMP schedules and 1.07× over the static_steal schedule from recent research.\",\"PeriodicalId\":405273,\"journal\":{\"name\":\"Proceedings of the 48th International Conference on Parallel Processing\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-08-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 48th International Conference on Parallel Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3337821.3337913\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 48th International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3337821.3337913","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Optimized Execution of Parallel Loops via User-Defined Scheduling Policies
On-node parallelism continues to increase in importance for high-performance computing and most newly deployed supercomputers have tens of processor cores per node. These higher levels of on-node parallelism exacerbate the impact of load imbalance and locality in parallel computations, and current programming systems notably lack features to enable efficient use of these large numbers of cores or require users to modify codes significantly. Our work is motivated by the need to address application-specific load balance and locality requirements with minimal changes to application codes. In this paper, we propose a new approach to extend the specification of parallel loops via user functions that specify iteration chunks. We also extend the runtime system to invoke these user functions when determining how to create chunks and schedule them on worker threads. Our runtime system starts with subspaces specified in the user functions, performs load balancing of chunks concurrently, and stores the balanced groups of chunks to reduce load imbalance in future invocations. Our approach can be used to improve load balance and locality in many dynamic iterative applications, including graph and sparse matrix applications. We demonstrate the benefits of this work using MiniMD, a miniapp derived from LAMMPS, and three kernels from the GAP Benchmark Suite: Breadth-First Search, Connected Components, and PageRank, each evaluated with six different graph data sets. Our approach achieves geometric mean speedups of 1.16× to 1.54× over four standard OpenMP schedules and 1.07× over the static_steal schedule from recent research.