改进通用并行代码的运行时调度

2011 International Conference on Parallel Architectures and Compilation Techniques Pub Date : 2011-10-10 DOI:10.1109/PACT.2011.49

Alexandros Tzannes, R. Barua, U. Vishkin

{"title":"改进通用并行代码的运行时调度","authors":"Alexandros Tzannes, R. Barua, U. Vishkin","doi":"10.1109/PACT.2011.49","DOIUrl":null,"url":null,"abstract":"Today, almost all desktop and laptop computers are shared-memory multicores, but the code they run is overwhelmingly serial. High level language extensions and libraries (e.g., Open MP, Cilk++, TBB) make it much easier for programmers to write parallel code than previous approaches (e.g., MPI), in large part thanks to the efficient {\\em work-stealing} scheduler that allows the programmer to expose more parallelism than the actual hardware parallelism. But when the parallel tasks are too short or too many, the scheduling overheads become significant and hurt performance. Because this happens frequently (e.g, data-parallelism, PRAM algorithms), programmers need to manually coarsen tasks for performance by combining many of them into longer tasks. But manual coarsening typically causes over fitting of the code to the input data, platform and context used to do the coarsening, and harms performance-portability. We propose distinguishing between two types of coarsening and using different techniques for them. Then improve on our previous work on Lazy Binary Splitting (LBS), a scheduler that performs the second type of coarsening dynamically, but fails to scale on large commercial multicores. Our improved scheduler, Breadth-First Lazy Scheduling (BF-LS) overcomes the scalability issue of LBS and performs much better on large machines.","PeriodicalId":106423,"journal":{"name":"2011 International Conference on Parallel Architectures and Compilation Techniques","volume":"28 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Improving Run-Time Scheduling for General-Purpose Parallel Code\",\"authors\":\"Alexandros Tzannes, R. Barua, U. Vishkin\",\"doi\":\"10.1109/PACT.2011.49\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Today, almost all desktop and laptop computers are shared-memory multicores, but the code they run is overwhelmingly serial. High level language extensions and libraries (e.g., Open MP, Cilk++, TBB) make it much easier for programmers to write parallel code than previous approaches (e.g., MPI), in large part thanks to the efficient {\\\\em work-stealing} scheduler that allows the programmer to expose more parallelism than the actual hardware parallelism. But when the parallel tasks are too short or too many, the scheduling overheads become significant and hurt performance. Because this happens frequently (e.g, data-parallelism, PRAM algorithms), programmers need to manually coarsen tasks for performance by combining many of them into longer tasks. But manual coarsening typically causes over fitting of the code to the input data, platform and context used to do the coarsening, and harms performance-portability. We propose distinguishing between two types of coarsening and using different techniques for them. Then improve on our previous work on Lazy Binary Splitting (LBS), a scheduler that performs the second type of coarsening dynamically, but fails to scale on large commercial multicores. Our improved scheduler, Breadth-First Lazy Scheduling (BF-LS) overcomes the scalability issue of LBS and performs much better on large machines.\",\"PeriodicalId\":106423,\"journal\":{\"name\":\"2011 International Conference on Parallel Architectures and Compilation Techniques\",\"volume\":\"28 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2011-10-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2011 International Conference on Parallel Architectures and Compilation Techniques\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/PACT.2011.49\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2011 International Conference on Parallel Architectures and Compilation Techniques","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2011.49","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

今天，几乎所有的台式机和笔记本电脑都是多核共享内存，但它们运行的代码绝大多数是串行的。高级语言扩展和库(例如，Open MP, cilk++， TBB)使程序员比以前的方法(例如，MPI)更容易编写并行代码，这在很大程度上要归功于高效的{\em work-stealing}调度器，它允许程序员暴露比实际硬件并行性更多的并行性。但是，当并行任务太短或太多时，调度开销就会变得很大，并影响性能。因为这种情况经常发生(例如，数据并行，PRAM算法)，程序员需要通过将许多任务组合成更长的任务来手动粗化任务以提高性能。但是手动粗化通常会导致代码过度拟合用于粗化的输入数据、平台和上下文，并损害性能可移植性。我们建议区分两种类型的粗化，并使用不同的技术。然后改进我们之前关于Lazy Binary Splitting (LBS)的工作，这是一个调度程序，可以动态执行第二种类型的粗化，但无法在大型商业多核上进行扩展。我们改进的调度器，宽度优先延迟调度(BF-LS)克服了LBS的可伸缩性问题，在大型机器上的性能要好得多。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Improving Run-Time Scheduling for General-Purpose Parallel Code

Today, almost all desktop and laptop computers are shared-memory multicores, but the code they run is overwhelmingly serial. High level language extensions and libraries (e.g., Open MP, Cilk++, TBB) make it much easier for programmers to write parallel code than previous approaches (e.g., MPI), in large part thanks to the efficient {\em work-stealing} scheduler that allows the programmer to expose more parallelism than the actual hardware parallelism. But when the parallel tasks are too short or too many, the scheduling overheads become significant and hurt performance. Because this happens frequently (e.g, data-parallelism, PRAM algorithms), programmers need to manually coarsen tasks for performance by combining many of them into longer tasks. But manual coarsening typically causes over fitting of the code to the input data, platform and context used to do the coarsening, and harms performance-portability. We propose distinguishing between two types of coarsening and using different techniques for them. Then improve on our previous work on Lazy Binary Splitting (LBS), a scheduler that performs the second type of coarsening dynamically, but fails to scale on large commercial multicores. Our improved scheduler, Breadth-First Lazy Scheduling (BF-LS) overcomes the scalability issue of LBS and performs much better on large machines.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2011 International Conference on Parallel Architectures and Compilation Techniques

自引率

0.00%

发文量